




**Summary:** Seeking a Data Pipeline Engineer to build and maintain data pipelines, integrate new data sources, design feature computations, ensure data quality, and own a subset of loan products end-to-end within a small team.

**Highlights:**

1. Owns a slice of the stack end-to-end and a subset of loan products
2. Focus on data quality and pipeline improvements
3. Opportunity to design advanced feature computations and ML integration

DESCRIPTION

**About the role**
------------------

Our Origination Decisions team builds the systems that decide, in real time, whether to grant a loan to an applicant and under which conditions (amount, term, interest rate). The team is small (4 people) and each member owns a slice of the stack end-to-end:

* a Decider architect, who designs how decisions are composed and how we model profit;
* a Data Scientist, who trains the most accurate ML models feeding those decisions;
* a Deployment engineer, who ships the deciders to production and owns the quality gates around them;
* **and the Data Pipeline engineer we are hiring with this role.**

On top of your specialty, you will own a subset of our loan products end-to-end: you will build their datasets, train their models, configure their decider and follow them to production. This "dogfooding" keeps you close to the pain points your pipeline creates and is the main feedback loop that drives the roadmap of your area.

**What you will own**
---------------------

As the Data Pipeline engineer you are the main guarantor that the rest of the team always has fresh, trustworthy, and easy-to-use datasets to train models, analyze behavior and make decisions on. Concretely, you will:

### **Build and maintain the data pipeline**

* Own the team's Dagster pipeline.
* Keep assets fresh, observable, and cheap to recompute. Make it obvious to other team members which datasets exist, what they contain, and how to consume them.

### **Bring in new data sources**

* Partner with external data providers on proofs-of-concept: organize backpopulation runs, send them the required samples, store and version the returned data, and evaluate whether the signal is worth productizing.
* Similarly, explore the internal data that is under-used for origination decisions. For instance, transcripts of collections calls could become features (for loan renewals) or ground truth (to get a better picture of the customer than just the payments they made).
* When a source is promising, integrate it end-to-end: reconcile backpopulation dumps with the live API feed, extract features consistently from both, and expose them to downstream consumers.

### **Design feature computations, not just move data around**

Some of the pipeline work is pure data engineering (joins, aggregations, cleaning), but a lot of it is closer to applied math and ML:

* Designing Long Term Value formulas that chain per-loan profit estimates with time-discounting and population averages for unseen future states (e.g. "average profit of a 3rd loan for customers similar to this one"), so the team can compare counterfactual policies such as "what would the LTV be if we only used base decider X?" (see the sketch after this list).
* Building offline feature stores for known customers and serving them through a low-latency store so the online decider can use information that doesn't fit in the request payload.
* Running reject inference as a recurring process: periodically sampling past rejected applications, pulling fresh credit reports, turning them into pseudo-ground-truths, and merging them into training datasets.
* Using LLMs and other models inside the pipeline when it is the right tool (e.g. extracting features from video-call transcripts, nowcasting loan profits and defaults).
* Implementing feature pre-selection inside the pipeline (ranking by predictive power, de-correlating, keeping the top ~N) so that the datasets we ship are an order of magnitude smaller than today without losing signal.
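To make the LTV bullet concrete, here is a minimal sketch of a discounted sum of expected future loan profits that falls back to a population average when a future state has no per-customer estimate. Every name here (`per_loan_profit`, `population_avg_profit`, the discount rate, discounting by loan rank rather than calendar time) is a hypothetical illustration of the idea, not our actual formula:

```python
from typing import Mapping


def ltv(
    per_loan_profit: Mapping[int, float],        # loan rank -> per-customer profit estimate
    population_avg_profit: Mapping[int, float],  # loan rank -> population average (fallback)
    horizon: int,
    annual_discount: float = 0.2,
) -> float:
    """Discounted sum of expected future loan profits.

    Falls back to the population average when no per-customer estimate
    exists for that loan rank (the "unseen future state" case). Assumes
    `population_avg_profit` covers every rank up to `horizon`.
    """
    total = 0.0
    for rank in range(1, horizon + 1):
        profit = per_loan_profit.get(rank, population_avg_profit[rank])
        # Simplification: discount by loan rank as a stand-in for calendar time.
        total += profit / (1.0 + annual_discount) ** (rank - 1)
    return total
```

Comparing counterfactual policies ("what if we only used base decider X?") then amounts to recomputing `per_loan_profit` under each policy and comparing the resulting LTVs.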
### **Own data quality**

* Add tests to most Dagster assets and make a deliberate choice for each one: does a failure block downstream assets, trigger an alert, or simply get logged? (See the sketch after this list.)
* Guarantee that refactors and migrations do not silently change the value of existing features.
* When something breaks, investigate quickly, fix at the root, and leave behind a new test closer to the source of the problem so the class of bug cannot come back unnoticed.
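As a rough illustration of that first bullet, here is a minimal sketch using Dagster asset checks (assuming a recent Dagster version; the asset, columns, and thresholds are made up). A `blocking` check stops downstream assets from materializing when it fails, while a non-blocking check with `WARN` severity only surfaces in the UI and alerting:

```python
import pandas as pd
from dagster import AssetCheckResult, AssetCheckSeverity, asset, asset_check


@asset
def loan_applications() -> pd.DataFrame:
    # Placeholder load; in practice this would come from the warehouse.
    return pd.read_parquet("loan_applications.parquet")


# Blocking check: a failure here halts downstream materializations.
@asset_check(asset=loan_applications, blocking=True)
def no_null_application_ids(loan_applications: pd.DataFrame) -> AssetCheckResult:
    nulls = int(loan_applications["application_id"].isna().sum())
    return AssetCheckResult(passed=nulls == 0, metadata={"null_ids": nulls})


# Non-blocking check: a failure is logged as a warning, downstream still runs.
@asset_check(asset=loan_applications)
def row_count_looks_sane(loan_applications: pd.DataFrame) -> AssetCheckResult:
    n = len(loan_applications)
    return AssetCheckResult(
        passed=n > 1_000,  # hypothetical volume threshold
        severity=AssetCheckSeverity.WARN,
        metadata={"row_count": n},
    )
```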
### **Be a user of your own platform**

You will also be responsible for a subset of our products: building their datasets, training ML models on them, configuring a decider and following it into production. The feedback you get as a user directly feeds the priorities of your pipeline work, and gives you concrete grounds to coordinate with the data scientist, decider and deployment engineers on cross-team improvements.

**What success looks like after 12 months**
-------------------------------------------

* The team trusts the datasets by default: if a model behaves oddly, the first hypothesis is no longer "maybe the pipeline is wrong".
* At least one new external data source has been integrated end-to-end, from POC to being used in a production decider.
* The most important internal data sources are transformed into features available to the data scientist.
* Training datasets are noticeably smaller and faster to load, thanks to in-pipeline feature pre-selection, without measurable loss in model performance.
* For the products you own, you have shipped at least one improved decider to production, and the lessons from that experience have shaped concrete improvements in the shared pipeline.

**Benefits**
------------

* Attractive compensation package, including stock options.
* Fast-paced environment with significant growth opportunities.
* 15 annual vacation days + 7 annual personal days.
* Option to work remotely 3-4 days per week, or fully remote (as long as you can come to CDMX ~twice a year).
* Flexible work schedule.

REQUIREMENTS

**Required skills**
-------------------

* Strong Python for production data pipelines: clean typed code, tests, refactoring, code review.
* Hands-on experience with a modern orchestrator (Dagster ideally; Airflow, Prefect, Flyte or equivalent is fine) and with data warehouses (BigQuery, or Snowflake / Redshift). Comfort writing non-trivial SQL.
* Solid mathematical foundations. You can read a spec like "discounted sum of future loan profits, falling back to population averages when the state is unseen" and turn it into a correct, well-tested implementation, without hand-waving over edge cases.
* Working knowledge of ML. You don't need to be a researcher, but you should be comfortable with the basics: train/test splits, feature engineering, feature importance and correlation, how models are trained and served. You should be able to read the data scientist's code, understand what the features are used for, and design pipelines that make their life easier.
* Testing and observability mindset. You naturally think about what can go wrong with a dataset, where tests should live, and when a failure deserves to wake someone up.
* Collaboration skills. A significant part of the job is coordinating with external data providers and with the other three team members; being able to explain trade-offs clearly (in English, and ideally in written documentation) is essential.
* Fluency in both Spanish and English. Most of our meetings are in Spanish, but the code and most documentation are written in English.
* Knowledge of git.

**Nice to have**
----------------

* Experience with Dagster specifically (assets, partitions, IO managers, sensors) or with a comparable asset-based orchestrator.
* Experience building or operating a low-latency feature store (Feast, Tecton, Vertex AI Feature Store, or a home-grown one).
* Familiarity with LLMs / NLP feature extraction in batch pipelines (prompting, caching, cost control, evaluation).
* Experience with credit, lending or other risk-modeling domains, and with concepts like reject inference, LTV, vintage analysis.
* Exposure to geospatial data (zip-code / manzana-level features).
* Experience designing feature-selection pipelines at scale (mutual information, SHAP-based ranking, correlation pruning, etc.); see the sketch below.
* Experience with the rest of our tech stack and important libraries: TrueFoundry, DVC, Pants, Docker, FastAPI, pydantic.
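As a rough illustration of the feature pre-selection mentioned above (and in the "Design feature computations" section), here is a minimal rank-then-prune sketch: rank features by mutual information with the target, then drop any feature highly correlated with a better-ranked one already kept. Thresholds and the assumption of a numeric, NaN-free feature matrix are illustrative only:

```python
import pandas as pd
from sklearn.feature_selection import mutual_info_classif


def preselect_features(
    X: pd.DataFrame,          # numeric, NaN-free feature matrix
    y: pd.Series,             # binary target (e.g. default / no default)
    top_n: int = 100,
    max_corr: float = 0.9,
) -> list[str]:
    """Keep at most `top_n` features, ranked by predictive power and
    pruned so that no two kept features are highly correlated."""
    # Rank by mutual information with the target.
    mi = pd.Series(mutual_info_classif(X, y, random_state=0), index=X.columns)
    ranked = mi.sort_values(ascending=False).index

    # Greedy de-correlation: keep a feature only if it is not too
    # correlated with any already-kept, better-ranked feature.
    corr = X[ranked].corr().abs()
    kept: list[str] = []
    for col in ranked:
        if all(corr.loc[col, k] < max_corr for k in kept):
            kept.append(col)
        if len(kept) == top_n:
            break
    return kept
```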


