
ML Production Engineer — Origination Decisions

Indeed
Full-time
Onsite
No experience requirement
No degree requirement
Mexico

Description

Summary: Own the production lifecycle of ML-based decision services, focusing on reliable deployment, continuous monitoring, and easy evolution, informed by an understanding of ML system failure modes.

Highlights:

1. Own an end-to-end vertical slice of an ML product for lending
2. Focus on ML system reliability and graceful degradation
3. Key role bridging the Data Science and platform engineering teams

**About the team**
------------------

The Origination Decisions team builds and operates the machine-learning-powered system that decides whether to approve loan applications and under which conditions. The team is small (4 people) and every member owns a vertical slice of the product end-to-end — from data pipelines through model training to production deployment — for a subset of lending products. You will therefore not only lead improvements in your area of expertise, but also regularly use the full stack as an end user, giving you first-hand insight into what works and what doesn't.

**The role**
------------

You will own the production lifecycle of our ML-based decision services: deploying them reliably, monitoring them continuously, and making them easy to evolve.

This is not a traditional DevOps or SRE role. You need to understand how machine-learning systems fail — silently degrading predictions, distribution shifts, broken upstream schemas that subtly bias features — and design safeguards that catch these issues before they reach customers.

**Key responsibilities**
------------------------

### **Deployment & release management**

* Design and maintain the promotion pipeline from pull request to dev, staging, and production, including the criteria and automated checks at each gate.
* Manage containerized services on Kubernetes: image optimization, resource scaling, granular per-decider deployments.
* Coordinate schema and API changes with the teams that maintain the upstream and downstream .NET / TypeScript services.

### **Testing & quality gates**

* Strengthen automated PR checks: decision-impact visualizations, anomaly detection on training data and backpopulated predictions, and integration of upstream/downstream service code into automated LLM-assisted reviews.
* Improve the Bruno API test suites that run against the dev environment after every merge, balancing coverage with cost.
* Extend the staging validation system that replays production traffic: detect divergences in computed features, approval statistics, and schema conformance between staging and production models.

### **Monitoring & observability**

* Design and maintain production monitoring: dashboards, alerts, and cross-service distributed tracing of the full onboarding flow.
* Define and track ML-specific health metrics (approval rates, score distributions, feature drift) alongside standard service metrics (latency, error rates, resource usage); a drift-metric sketch follows this section.
* Build tooling that transforms the internal decision trace into human-readable explanations for operations and compliance stakeholders.

### **Reliability & graceful degradation**

* Coordinate with upstream data providers to define fallback strategies when external data is unavailable (secondary providers, default values, deferred decisions).
* Extend the input-validation framework so that non-critical schema violations fall back to safe defaults (with alerts) while critical violations block the decision, and simulate the impact of those fallbacks on decision quality (see the validation sketch below).
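To make the two-tier validation idea above concrete, here is a minimal sketch assuming a Pydantic-based service (the posting names FastAPI and Pydantic); the field names, defaults, and logger channel are hypothetical, not the team's actual framework:

```python
import logging
from pydantic import BaseModel, ValidationError

logger = logging.getLogger("decider.validation")  # hypothetical alert channel

class CriticalInputs(BaseModel):
    # A violation here must block the decision entirely.
    applicant_id: str
    requested_amount: float

class SoftInputs(BaseModel):
    # A violation here falls back to a safe default and emits an alert.
    months_at_employer: int = 0
    device_risk_score: float = 0.5

def validate_request(payload: dict) -> tuple[CriticalInputs, SoftInputs]:
    critical = CriticalInputs(**payload)   # let this raise: critical violations block
    try:
        soft = SoftInputs(**payload)
    except ValidationError as exc:
        logger.warning("non-critical schema violation, using defaults: %s", exc)
        soft = SoftInputs()                # safe defaults keep the decision path alive
    return critical, soft
```

Splitting the schema in two makes the block-versus-fallback policy explicit per field, and the logged ValidationError gives the alerting pipeline something concrete to aggregate on.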
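On the drift metrics mentioned under monitoring, one common choice (purely illustrative; the posting does not specify the team's metrics) is the Population Stability Index over score or feature distributions:

```python
import numpy as np

def psi(reference: np.ndarray, live: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference sample and live traffic."""
    # Bin edges come from the reference distribution (e.g. training-time scores).
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch values outside the reference range
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # Floor the fractions so empty bins don't blow up the log term.
    ref_frac = np.clip(ref_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - ref_frac) * np.log(live_frac / ref_frac)))

# A rule of thumb often used in credit scoring: PSI > 0.2 suggests the
# distribution has shifted enough to warrant an alert and investigation.
```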
### **API design & integration**

* Design and implement new endpoints as the product evolves (e.g., counter-offers, intermediary onboarding steps, modified loan conditions).
* Integrate new data sources into the online decision path — including features from video-call analysis and a low-latency feature store for returning customers — in coordination with the pipeline engineer.

### **Performance optimization**

* Profile and optimize inference time: replace heavy dependencies (e.g., LightGBM → ONNX), evaluate faster data-processing libraries (e.g., Polars over pandas), and offload hot paths to compiled code where justified (see the inference sketch after the nice-to-haves).
* Keep base Docker images lean and startup times low.

### **Cross-team code review**

* Review pull requests in adjacent repositories (primarily C# / .NET and TypeScript / React) that affect the services immediately upstream or downstream of the decision system, to catch integration issues early.

**Benefits**
------------

* Attractive compensation package, including stock options.
* Fast-paced environment with significant growth opportunities.
* 15 annual vacation days + 7 annual personal days.
* Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX ~twice a year).
* Flexible work schedule.

**Required skills**
-------------------

* Production ML experience — You have deployed ML models to production and dealt with the failure modes specific to learned systems: silent degradation, training/serving skew, selection bias, data-pipeline breakages, and schema drift.
* Software engineering — Strong Python skills (you will work daily with FastAPI, Pydantic, and pytest). Comfortable reading and reviewing C# and TypeScript code.
* Containerisation & orchestration — Hands-on experience with Docker and Kubernetes in a production setting (resource management, rolling deployments, health probes).
* Testing philosophy — You think in terms of layered validation (unit, integration, contract, shadow-traffic comparison) and know how to balance coverage against cost and speed.
* Monitoring & observability — Experience designing dashboards, alerts, and distributed traces for services where "the service returned 200 but the answer was wrong" is a real failure mode.
* API design — Ability to design clear, evolvable REST APIs and negotiate schema changes across teams.
* Communication — You will be the main point of contact between Data Science and the platform engineering teams. Clear, precise written and verbal communication is essential.
* Fluency in both Spanish and English. Most of our meetings are in Spanish, but the code and most documentation are written in English.

**Nice-to-haves**
-----------------

* Experience with model-serving runtimes (ONNX Runtime, TensorFlow Serving, Triton) or model compilation/optimisation techniques.
* Familiarity with Dagster, DVC, or similar ML pipeline / data-orchestration tools.
* Familiarity with the Prometheus / Grafana observability stack.
* Experience with performance profiling and optimisation in Python (Polars, NumPy, Numba, Cython, or Rust extensions).
* Exposure to financial services, credit decisioning, or regulated environments where auditability and explainability matter.
* Experience building or maintaining CI/CD pipelines with automated ML-specific validations (data quality checks, model performance gates, decision-impact analysis).
* Knowledge of the Azure ecosystem (AKS, ACR, Azure DevOps).
* Familiarity with API-testing tools such as Bruno or Postman for contract and integration testing.
* Familiarity with Pants or similar build systems.
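Illustrating the inference-optimization responsibility above, a sketch of serving a tree model through ONNX Runtime (one of the runtimes listed in the nice-to-haves). The model path, input handling, and output layout are assumptions; they depend entirely on how the model was exported:

```python
import numpy as np
import onnxruntime as ort

# "decider.onnx" is a placeholder; in practice this would be a LightGBM model
# exported to ONNX (e.g. via onnxmltools) to drop the heavier native dependency.
session = ort.InferenceSession("decider.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

def score(batch: np.ndarray) -> np.ndarray:
    """Score a (n_rows, n_features) float32 batch on the ONNX model."""
    outputs = session.run(None, {input_name: batch.astype(np.float32)})
    # Converted classifiers typically return [labels, probabilities]; the exact
    # layout depends on the conversion options, so this index is an assumption.
    return outputs[-1]
```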

Source: Indeed
Juan García
Indeed · HR
