Chicago, IL

ML Infrastructure Engineer

Machine Learning · Full-time · Senior level
AI Screened · Remote B2B · EU Talent Pool · 246 applicants

About This Role

You'll architect and scale the ML platform infrastructure that powers model training, deployment, and monitoring across our organization. This role owns the distributed systems challenges that make production ML possible: optimizing GPU utilization for large-scale training jobs, building scalable model serving pipelines, and designing observability tooling that ensures model accuracy in production. You'll work at the intersection of systems engineering and machine learning, enabling data scientists and ML engineers to move faster while you solve the hard infrastructure problems that make their work reliable at scale.

Our Stack

  • Core technologies: Python · Kubernetes · Docker · PyTorch/TensorFlow · MLflow · Terraform
  • Cloud & infrastructure: multi-cloud environment (AWS/GCP/Azure) with heavy Kubernetes usage for training and serving workloads
  • Observability: Prometheus · Grafana · Datadog for monitoring distributed training jobs and model serving latency
  • The frontier: room to evaluate and adopt emerging MLOps frameworks as the space evolves

What You'll Do

  • Architect and implement distributed training infrastructure for large-scale model training workloads, optimizing GPU scheduling and resource allocation across Kubernetes clusters (a minimal GPU job sketch follows this list)
  • Design and build scalable model serving pipelines that handle real-time inference requests with sub-100ms latency, including autoscaling and load balancing strategies (see the serving sketch after the lists below)
  • Own end-to-end platform reliability: establish monitoring, alerting, and debugging tooling to ensure ML infrastructure maintains 99.9% uptime
  • Evaluate tradeoffs between emerging MLOps patterns and technologies, proposing and prototyping new approaches to model deployment, experiment tracking, and feature serving (an MLflow tracking sketch closes this section)
  • Build greenfield ML tooling that automates repetitive workflows for data scientists, from hyperparameter tuning orchestration to model versioning and A/B testing infrastructure
  • Partner with ML researchers and data engineers to understand pain points in their workflows, translating ambiguous platform needs into precise technical specifications
  • Drive technical decisions on platform architecture with minimal oversight, balancing immediate ML team needs against long-term scalability and maintainability
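For flavor, here is a minimal sketch of the GPU-scheduled training work, using the official kubernetes Python client to submit a single-pod batch Job that requests GPUs. The image, namespace, job name, and GPU count are hypothetical placeholders, and a real multi-tenant setup would layer on node selectors, tolerations, and a batch scheduler such as Kueue or Volcano.

    # Hedged sketch: submit a training Job with a GPU resource request.
    # Image, namespace, and job name are hypothetical placeholders.
    from kubernetes import client, config

    def make_training_job(name: str, image: str, gpus: int) -> client.V1Job:
        container = client.V1Container(
            name="trainer",
            image=image,
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                # GPUs are requested via the extended resource name and
                # must be specified as limits.
                limits={"nvidia.com/gpu": str(gpus)},
            ),
        )
        template = client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        )
        return client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1JobSpec(template=template, backoff_limit=2),
        )

    if __name__ == "__main__":
        config.load_kube_config()  # load_incluster_config() when run in-cluster
        job = make_training_job("resnet-train-0", "registry.example.com/trainer:latest", gpus=4)
        client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)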
What We're Looking For

  • 5–8 years of professional software engineering experience with a focus on building and operating production infrastructure at scale
  • Expert-level Python for building robust, production-grade systems that other engineers can maintain and extend
  • Deep hands-on experience with Kubernetes in production environments: you've debugged pods at 3am, optimized resource allocation, and designed multi-tenant cluster architectures
  • Strong distributed systems fundamentals with proven ability to design for fault tolerance, scalability, and observability: you think in terms of CAP theorem tradeoffs and failure modes
  • Proficiency in at least one major ML framework (PyTorch or TensorFlow) with understanding of training workflows, model serialization, and inference optimization
  • Production experience with containerization (Docker) and infrastructure-as-code (Terraform or similar): you treat infrastructure changes with the same rigor as application code
  • Track record of architecting and implementing platform components with minimal oversight: you've owned technical roadmaps and made build-vs-buy decisions that stuck
  • Exceptional analytical and debugging skills with a systematic approach to diagnosing performance bottlenecks, resource leaks, and distributed system failures

Nice to Have

  • Experience building or contributing to MLOps platforms (MLflow, Kubeflow, Metaflow) with deep understanding of the model development lifecycle
  • Hands-on experience optimizing GPU utilization and distributed training infrastructure: you understand the economics of ML compute and how to maximize throughput
  • Background in highly regulated domains where model reliability and auditability are critical business requirements
  • Open-source contributions to ML infrastructure projects or widely adopted internal tools you've open-sourced
  • Experience with LLM infrastructure patterns (vLLM, TGI, model quantization, serving optimization)
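To ground the serving and observability bullets, here is a minimal sketch of an instrumented inference endpoint: a Prometheus latency histogram with buckets bracketing the sub-100ms target, exposed on a /metrics scrape endpoint. Prometheus is part of the stated stack; FastAPI and the placeholder model call are illustrative assumptions, not the company's actual serving framework.

    # Hedged sketch: inference endpoint with an SLO-bucketed latency histogram.
    import time

    from fastapi import FastAPI
    from prometheus_client import Histogram, make_asgi_app
    from pydantic import BaseModel

    app = FastAPI()
    app.mount("/metrics", make_asgi_app())  # Prometheus scrape endpoint

    INFERENCE_LATENCY = Histogram(
        "inference_latency_seconds",
        "End-to-end inference latency per request",
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25),  # brackets a 100 ms SLO
    )

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest) -> dict:
        start = time.perf_counter()
        score = sum(req.features)  # placeholder for a real model forward pass
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
        return {"score": score}

Run under an ASGI server (for example, uvicorn) and point Prometheus at /metrics; the histogram can then drive latency alerting and inform autoscaling decisions.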

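Finally, the experiment-tracking and model-versioning tooling mentioned under What You'll Do typically builds on an MLflow tracking server (MLflow is named in the stack). A minimal sketch, with a hypothetical tracking URI, experiment name, and stand-in metric values:

    # Hedged sketch: logging a run's params and metrics to MLflow.
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical server
    mlflow.set_experiment("ranker-baseline")

    with mlflow.start_run(run_name="lr-sweep-042"):
        mlflow.log_params({"lr": 3e-4, "batch_size": 256, "epochs": 3})
        for epoch, loss in enumerate([0.91, 0.64, 0.52]):  # stand-in training loop
            mlflow.log_metric("val_loss", loss, step=epoch)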

Required Skills

Kubernetes · Python · PyTorch · TensorFlow · Docker · AWS · MLflow