Senior ML Data Processing Developer

LawZero 28 days ago

Montreal, Quebec, Canada

Senior Level

Full-Time

Top Benefits

Comprehensive Health Benefits

Mental Health And Wellness Management Account

20 Days Of Vacation Per Year

About the role

We are seeking a Senior ML Data Processing Developer to participate in the development, curation, and scaling of our core data asset pipeline. Sitting at the intersection of data engineering, data curation, and machine learning, you will own the end-to-end pipeline that transforms raw web-scale data into high-signal datasets used to train the Scientist AI. In this role, you will not just manage data; you will engineer its quality. You will design algorithmic filtering, build model-based scoring mechanics, and ensure rigorous benchmark integrity to power the next generation of AI. And as our models push beyond established paradigms, you will design and implement novel data transformations that don't yet have playbooks, working at the frontier of what training data can be. We are hiring multiple people for this role, and responsibilities may be distributed across the team based on individual experience, skills, and interests.

Key Responsibilities

Partner with the Research team to define, build, automate, scale, and manage data pipelines that transform raw web-scale data into training datasets for the Scientist AI.

Build and maintain data processing pipelines, including deduplication, model-based quality scoring, heuristic filtering, toxicity removal, PII scrubbing, metadata extraction, and proprietary data transformations, with full dataset versioning and provenance tracking, optimizing for throughput and cost at scale.

Ensure all ingested data meets compliance requirements, internal Data Governance policies, and legal obligations.

Develop and refine the scoring and filtering toolchain: heuristics, LLM-as-a-judge evaluators, ML classifiers, metadata extraction modules, and human-in-the-loop review workflows required for data processing and quality assurance.

Instrument data processing pipelines with data-quality monitoring, guardrails, and alerting to catch regressions before they propagate downstream.

Collaborate with the Research team and other teams to understand evolving data requirements, then identify and acquire large-scale text corpora that meet those requirements. This includes conducting systematic coverage analyses to identify gaps in the corpus, developing targeted acquisition strategies to address them, and working with the Legal & Governance team to license new data sources.

Design and maintain strict leakage detection mechanisms to guard against evaluation contamination across all stages of the data processing pipeline.

Build internal tooling and interfaces that let researchers explore, query, and understand available datasets with minimal friction.

Skills and Qualifications

Degree in computer science, software engineering, or a related field.

Proven track record of handling massive unstructured text datasets (trillion-token scale), with 5+ years of experience in data processing, machine learning engineering, or Natural Language Processing (NLP).

Hands-on experience with distributed processing frameworks (e.g., Spark, Ray, Flink), designing and optimizing high-throughput pipelines.

Experience with data privacy implementation (PII scrubbing), content-safety filtering (toxicity, bias), and evaluation-contamination prevention.

Demonstrated ability to work across Research, Engineering, and/or Legal/Governance teams, translating varied requirements into concrete pipeline work.

Strong Python proficiency, including experience writing production-grade data-processing code.

Experience with pipeline orchestration frameworks (e.g., Airflow, Prefect, Dagster).

Nice to have

Experience training, fine-tuning, or deploying ML models for data-quality tasks (classifiers, LLM-based evaluators) and familiarity with LLM inference optimization (e.g., vLLM, SGLang).

Familiarity with containerized deployment (Docker, Kubernetes) and infrastructure-as-code practices.

Familiarity with ML experiment tracking tools (e.g., Weights and Biases).

Experience with data licensing workflows or web-scale data acquisition.

Contributions to open-source data processing or NLP tooling.

What we offer

The chance to contribute meaningfully to a globally critical initiative

Comprehensive health benefits (including mental health and wellness management account)

20 days of vacation per year upon start

Employer contribution of 4% to your retirement savings, with no required employee match

Additional compensation totaling 8% of your salary to apply towards additional retirement savings or bonuses (independent of group and individual performance)

A team of passionate world-class experts in their field

A collaborative and inclusive work environment in our vibrant office space in the heart of Little Italy, in the trendy Mile-Ex district, close to public transportation

About LawZero

LawZero is a non-profit organization committed to advancing research and creating technical solutions that enable safe-by-design AI systems. Its scientific direction is based on new research and methods proposed by Professor Yoshua Bengio, the most cited AI researcher in the world. Based in Montreal, LawZero’s research aims to build non-agentic AI that learns primarily to understand the world rather than to act in it, giving truthful answers to questions based on transparent and externalized probabilistic reasoning. Such AI systems could be used to accelerate scientific discovery, to provide oversight for agentic AI systems, and to advance the understanding of AI risks and how to avoid them. LawZero believes that AI should be cultivated as a global public good, developed and used safely towards human flourishing.

For more information, visit www.lawzero.org

You belong here

At LawZero, diversity is important to us. We value a work environment that is fair, open and respectful of differences. We welcome applications from highly qualified individuals interested in working towards our mission in a respectful, inclusive and collaborative setting.

Your personal information will be collected and processed by LawZero to evaluate your application for employment in compliance with our Privacy Policy.
Under privacy laws in force in your country of residence, you may have several privacy rights, such as to request access to your personal information or to request that your personal information be rectified or erased. Details on how you can exercise your rights can be found in our Privacy Policy.
