Senior ML Engineer
About This Role
We're building ML systems that serve millions of users with sub-100ms latency requirements. As a Senior ML Engineer, you'll own the full lifecycle from research prototype to production deployment: architecting scalable model serving infrastructure, instrumenting rigorous A/B testing frameworks, and driving technical decisions that directly impact product outcomes. You'll work in Austin's growing ML/AI community, collaborating with data engineers and product teams to turn ambitious ideas into reliable, high-performance systems that solve real business problems.

Our Stack
- Modern Python ML stack: PyTorch · TensorFlow · scikit-learn · pandas · NumPy
- Cloud-native ML infrastructure on AWS, GCP, or Azure with Kubernetes orchestration
- MLOps tooling: MLflow · Kubeflow · feature stores · experiment tracking · model registries
- Observability and monitoring: Datadog · Grafana · custom metrics dashboards for model performance

What You'll Do
- Design and implement end-to-end ML pipelines, from data ingestion through model training, evaluation, and production deployment, using PyTorch/TensorFlow and cloud platforms
- Own the architecture and performance optimization of model serving infrastructure, ensuring sub-100ms latency and 99.9% uptime for user-facing ML features
- Establish rigorous experimentation frameworks, including A/B testing, statistical analysis, and continuous monitoring, to validate model improvements with measurable business impact
- Drive technical roadmap decisions for ML infrastructure, evaluating emerging tools and adopting those that pragmatically advance system capabilities
- Analyze production model behavior through deep metric analysis, debug performance degradation, and iterate rapidly to maintain model quality as data distributions shift
- Collaborate with backend engineers to integrate ML predictions into product APIs, and partner with data engineers to build robust data pipelines that feed model training at scale
- Mentor team members through technical design reviews, code reviews, and knowledge sharing on ML systems best practices

What We're Looking For
- 5+ years building and deploying machine learning systems in production: you've debugged model serving latency at 3 a.m., not just trained models in notebooks
- Deep expertise in Python ML frameworks (PyTorch or TensorFlow), with a focus on model optimization, efficient training pipelines, and production-grade code quality
- Hands-on experience with cloud ML infrastructure (AWS SageMaker, GCP Vertex AI, or Azure ML): you've architected scalable model deployment pipelines, not just followed tutorials
- Strong fundamentals in system design and distributed computing, with the ability to reason about tradeoffs in model serving architecture, caching strategies, and fault tolerance
- Production experience with containerization and orchestration (Docker, Kubernetes) for ML workloads: you understand resource allocation, autoscaling, and cost optimization
- Proficiency in MLOps practices and tools (MLflow, Kubeflow, or similar), with a track record of building reproducible training pipelines and automated model evaluation frameworks
- Solid SQL skills and data pipeline experience, working closely with data engineers to ensure high-quality training data and efficient feature engineering
- Demonstrated ability to translate business requirements into technical ML solutions, then drive them from experimentation through production deployment with measurable impact

Nice to Have
- Experience building and maintaining real-time model serving infrastructure with sub-100ms latency requirements and high-availability guarantees
- A track record of A/B testing ML models in production, using rigorous statistical analysis to validate improvements before full rollout
- Familiarity with infrastructure-as-code (Terraform, CloudFormation) and GitOps workflows for managing ML infrastructure reproducibly
