Chicago, IL

ML Infrastructure Engineer

Machine Learning · Full-time · Senior level
AI Screened · Remote B2B · EU Talent Pool · 246 applicants

About This Role

You'll architect and scale the ML platform infrastructure that powers model training, deployment, and monitoring across our organization. This role owns the distributed systems challenges that make production ML possible: optimizing GPU utilization for large-scale training jobs, building scalable model serving pipelines, and designing observability tooling that ensures model accuracy in production. You'll work at the intersection of systems engineering and machine learning, enabling data scientists and ML engineers to move faster while you solve the hard infrastructure problems that make their work reliable at scale.

Our Stack

  • Core technologies: Python · Kubernetes · Docker · PyTorch/TensorFlow · MLflow · Terraform
  • Cloud & infrastructure: multi-cloud environment (AWS/GCP/Azure) with heavy Kubernetes usage for training and serving workloads
  • Observability: Prometheus · Grafana · Datadog for monitoring distributed training jobs and model serving latency
  • The frontier: room to evaluate and adopt emerging MLOps frameworks as the space evolves

What You'll Do

  • Architect and implement distributed training infrastructure for large-scale model training workloads, optimizing GPU scheduling and resource allocation across Kubernetes clusters (a minimal GPU job sketch follows this list)
  • Design and build scalable model serving pipelines that handle real-time inference requests with sub-100ms latency, including autoscaling and load balancing strategies (see the serving sketch after the lists below)
  • Own end-to-end platform reliability: establish monitoring, alerting, and debugging tooling to ensure ML infrastructure maintains 99.9% uptime
  • Evaluate tradeoffs between emerging MLOps patterns and technologies, proposing and prototyping new approaches to model deployment, experiment tracking, and feature serving (an MLflow tracking sketch closes this section)
  • Build greenfield ML tooling that automates repetitive workflows for data scientists, from hyperparameter tuning orchestration to model versioning and A/B testing infrastructure
  • Partner with ML researchers and data engineers to understand pain points in their workflows, translating ambiguous platform needs into precise technical specifications
  • Drive technical decisions on platform architecture with minimal oversight, balancing immediate ML team needs against long-term scalability and maintainability
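For flavor, here is a minimal sketch of the GPU-scheduled training work, using the official kubernetes Python client to submit a single-pod batch Job that requests GPUs. The image, namespace, job name, and GPU count are hypothetical placeholders, and a real multi-tenant setup would layer on node selectors, tolerations, and a batch scheduler such as Kueue or Volcano.

    # Hedged sketch: submit a training Job with a GPU resource request.
    # Image, namespace, and job name are hypothetical placeholders.
    from kubernetes import client, config

    def make_training_job(name: str, image: str, gpus: int) -> client.V1Job:
        container = client.V1Container(
            name="trainer",
            image=image,
            command=["python", "train.py"],
            resources=client.V1ResourceRequirements(
                # GPUs are requested via the extended resource name and
                # must be specified as limits.
                limits={"nvidia.com/gpu": str(gpus)},
            ),
        )
        template = client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container]),
        )
        return client.V1Job(
            api_version="batch/v1",
            kind="Job",
            metadata=client.V1ObjectMeta(name=name),
            spec=client.V1JobSpec(template=template, backoff_limit=2),
        )

    if __name__ == "__main__":
        config.load_kube_config()  # load_incluster_config() when run in-cluster
        job = make_training_job("resnet-train-0", "registry.example.com/trainer:latest", gpus=4)
        client.BatchV1Api().create_namespaced_job(namespace="ml-training", body=job)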
What We're Looking For

  • 5–8 years of professional software engineering experience with a focus on building and operating production infrastructure at scale
  • Expert-level Python for building robust, production-grade systems that other engineers can maintain and extend
  • Deep hands-on experience with Kubernetes in production environments: you've debugged pods at 3am, optimized resource allocation, and designed multi-tenant cluster architectures
  • Strong distributed systems fundamentals with proven ability to design for fault tolerance, scalability, and observability: you think in terms of CAP theorem tradeoffs and failure modes
  • Proficiency in at least one major ML framework (PyTorch or TensorFlow) with understanding of training workflows, model serialization, and inference optimization
  • Production experience with containerization (Docker) and infrastructure-as-code (Terraform or similar): you treat infrastructure changes with the same rigor as application code
  • Track record of architecting and implementing platform components with minimal oversight: you've owned technical roadmaps and made build-vs-buy decisions that stuck
  • Exceptional analytical and debugging skills with a systematic approach to diagnosing performance bottlenecks, resource leaks, and distributed system failures

Nice to Have

  • Experience building or contributing to MLOps platforms (MLflow, Kubeflow, Metaflow) with deep understanding of the model development lifecycle
  • Hands-on experience optimizing GPU utilization and distributed training infrastructure: you understand the economics of ML compute and how to maximize throughput
  • Background in highly regulated domains where model reliability and auditability are critical business requirements
  • Open-source contributions to ML infrastructure projects or widely adopted internal tools you've open-sourced
  • Experience with LLM infrastructure patterns (vLLM, TGI, model quantization, serving optimization)
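To ground the serving and observability bullets, here is a minimal sketch of an instrumented inference endpoint: a Prometheus latency histogram with buckets bracketing the sub-100ms target, exposed on a /metrics scrape endpoint. Prometheus is part of the stated stack; FastAPI and the placeholder model call are illustrative assumptions, not the company's actual serving framework.

    # Hedged sketch: inference endpoint with an SLO-bucketed latency histogram.
    import time

    from fastapi import FastAPI
    from prometheus_client import Histogram, make_asgi_app
    from pydantic import BaseModel

    app = FastAPI()
    app.mount("/metrics", make_asgi_app())  # Prometheus scrape endpoint

    INFERENCE_LATENCY = Histogram(
        "inference_latency_seconds",
        "End-to-end inference latency per request",
        buckets=(0.005, 0.01, 0.025, 0.05, 0.1, 0.25),  # brackets a 100 ms SLO
    )

    class PredictRequest(BaseModel):
        features: list[float]

    @app.post("/predict")
    def predict(req: PredictRequest) -> dict:
        start = time.perf_counter()
        score = sum(req.features)  # placeholder for a real model forward pass
        INFERENCE_LATENCY.observe(time.perf_counter() - start)
        return {"score": score}

Run under an ASGI server (for example, uvicorn) and point Prometheus at /metrics; the histogram can then drive latency alerting and inform autoscaling decisions.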

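Finally, the experiment-tracking and model-versioning tooling mentioned under What You'll Do typically builds on an MLflow tracking server (MLflow is named in the stack). A minimal sketch, with a hypothetical tracking URI, experiment name, and stand-in metric values:

    # Hedged sketch: logging a run's params and metrics to MLflow.
    import mlflow

    mlflow.set_tracking_uri("http://mlflow.internal:5000")  # hypothetical server
    mlflow.set_experiment("ranker-baseline")

    with mlflow.start_run(run_name="lr-sweep-042"):
        mlflow.log_params({"lr": 3e-4, "batch_size": 256, "epochs": 3})
        for epoch, loss in enumerate([0.91, 0.64, 0.52]):  # stand-in training loop
            mlflow.log_metric("val_loss", loss, step=epoch)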

Required Skills

Kubernetes · Python · PyTorch · TensorFlow · Docker · AWS · MLflow