Senior ML Engineer
About This Role
We're building ML systems that serve millions of users with sub-100ms latency requirements. As a Senior ML Engineer, you'll own the full lifecycle from research prototype to production deployment: architecting scalable model serving infrastructure, instrumenting rigorous A/B testing frameworks, and driving technical decisions that directly impact product outcomes. You'll work in Austin's growing ML/AI community, collaborating with data engineers and product teams to turn ambitious ideas into reliable, high-performance systems that solve real business problems.

Our Stack
- Modern Python ML stack: PyTorch · TensorFlow · scikit-learn · pandas · NumPy
- Cloud-native ML infrastructure on AWS, GCP, or Azure with Kubernetes orchestration
- MLOps tooling: MLflow · Kubeflow · feature stores · experiment tracking · model registries
- Observability and monitoring: Datadog · Grafana · custom metrics dashboards for model performance

What You'll Do
- Design and implement end-to-end ML pipelines, from data ingestion through model training, evaluation, and production deployment, using PyTorch/TensorFlow and cloud platforms
- Own the architecture and performance optimization of model serving infrastructure, ensuring sub-100ms latency and 99.9% uptime for user-facing ML features
- Establish rigorous experimentation frameworks, including A/B testing, statistical analysis, and continuous monitoring, to validate model improvements with measurable business impact
- Drive technical roadmap decisions for ML infrastructure, evaluating emerging tools and adopting those that pragmatically advance system capabilities
- Analyze production model behavior through deep metric analysis, debug performance degradation, and iterate rapidly to maintain model quality as data distributions shift
- Collaborate with backend engineers to integrate ML predictions into product APIs, and partner with data engineers to build robust data pipelines that feed model training at scale
- Mentor team members through technical design reviews, code reviews, and knowledge sharing on ML systems best practices

What We're Looking For
- 5+ years building and deploying machine learning systems in production: you've debugged model serving latency at 3 a.m., not just trained models in notebooks
- Deep expertise in Python ML frameworks (PyTorch or TensorFlow), with a focus on model optimization, efficient training pipelines, and production-grade code quality
- Hands-on experience with cloud ML infrastructure (AWS SageMaker, GCP Vertex AI, or Azure ML): you've architected scalable model deployment pipelines, not just followed tutorials
- Strong fundamentals in system design and distributed computing, with the ability to reason about tradeoffs in model serving architecture, caching strategies, and fault tolerance
- Production experience with containerization and orchestration (Docker, Kubernetes) for ML workloads: you understand resource allocation, autoscaling, and cost optimization
- Proficiency in MLOps practices and tools (MLflow, Kubeflow, or similar), with a track record of building reproducible training pipelines and automated model evaluation frameworks
- Solid SQL skills and data pipeline experience, working closely with data engineers to ensure high-quality training data and efficient feature engineering
- Demonstrated ability to translate business requirements into technical ML solutions, then drive them from experimentation through production deployment with measurable impact

Nice to Have
- Experience building and maintaining real-time model serving infrastructure with sub-100ms latency requirements and high-availability guarantees
- A track record of A/B testing ML models in production, using rigorous statistical analysis to validate improvements before full rollout
- Familiarity with infrastructure-as-code (Terraform, CloudFormation) and GitOps workflows for managing ML infrastructure reproducibly
