TixelJobs
Omnilex via Indeed

AI Engineer - Synthetic Data Generation

Zürich, ZH, CH · Posted 1mo ago
ML Engineer · Mid Level · Full-time · #llm #typescript


About the Role

Why Omnilex?

At Omnilex, we’re on a mission to transform the way lawyers work. Our AI-native platform helps legal professionals boost their productivity in legal research and automate their workflows. We collaborate closely with our clients and iterate at a market-leading pace. Within a year, we have gone from an early MVP to a product used daily by thousands of legal professionals at our clients in Switzerland, Germany, and Liechtenstein - and we are now scaling rapidly across Europe.

We already stand out by handling unique challenges, including our combination of external data, customer-internal data, and our own innovative AI-first legal commentaries.

You’ll be joining a young, passionate, and dynamic team of 14, with roots at ETH Zurich.

Your role

Do you get joy from turning messy legal texts into clean, structured, high-quality datasets that actually improve model behavior? Do you like building pipelines where every step is measurable: extraction quality, citation correctness, dedup rate, cost per item, throughput, and regression stability? Are you comfortable shipping pragmatic tooling (CLIs, validators, tests) around LLMs without hand-waving away edge cases? If so, we’d love to hear from you.

What you'll do

As an AI Engineer – Synthetic Data Generation, you will build and own pipelines that generate retrieval-ready and evaluation-grade synthetic datasets from real legal sources (court decisions, statutes, commentaries) across languages and jurisdictions, while keeping quality high and costs controlled.

  • Build multi-step generation pipelines (10+ steps): from DB selection → pseudonymization → extraction → translation → normalization → deduplication → validation → classification → rating → export.

  • LLM integration, production-grade: Design robust prompt suites for extraction, translation, classification, and rating; enforce structured JSON outputs; handle retries, partial failures, and weird model behavior.

  • Quality assurance & filtering: Implement scoring systems (multi-criteria, consistent rubrics), dedup/near-dup suppression, and deterministic validators (especially for citations).

  • Citation processing at legal-grade precision: Extract, normalize, and validate citations across languages and formats (e.g., Art. 336c Abs. 1 OR, BGE 137 III 266 E. 3.2), including abbreviation mapping and normalization rules.

  • Cost & throughput optimization: Use batch APIs where appropriate, tune reasoning effort, control concurrency, count tokens, and keep runs cost-efficient (without sacrificing quality).

  • Developer tooling & CLI workflows: Build CLIs with progress tracking, configurable concurrency, and solid ergonomics for long-running jobs.

  • Testing across levels: Write unit/smoke/integration tests for pipelines and validators (including mocked LLMs where sensible and real API runs where needed).

  • Cross-team collaboration: Work closely with legal experts to define what “good” looks like for exam questions/commentaries, and translate that into measurable QA checks.
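To give a flavor of the "deterministic validators (especially for citations)" mentioned above, here is a minimal sketch of a citation validator in TypeScript. This is an illustration only, not Omnilex's actual code: the patterns cover just the two example formats from this posting, and the statute-code list is a placeholder assumption.

```typescript
// Minimal sketch of a deterministic validator for the two Swiss citation
// formats named in this posting. A real pipeline would need a far larger
// abbreviation map plus normalization rules; this only shows the shape.

// e.g. "Art. 336c Abs. 1 OR" — article number, optional paragraph (Abs.),
// statute abbreviation (list here is a small illustrative subset).
const STATUTE_CITATION = /^Art\.\s\d+[a-z]?(?:\sAbs\.\s\d+)?\s(?:OR|ZGB|StGB)$/;

// e.g. "BGE 137 III 266 E. 3.2" — volume, chamber, page, optional consideration.
const BGE_CITATION = /^BGE\s\d+\s(?:I|Ia|Ib|II|III|IV|V)\s\d+(?:\sE\.\s\d+(?:\.\d+)*)?$/;

export function isValidCitation(citation: string): boolean {
  // Normalize whitespace deterministically before matching.
  const normalized = citation.trim().replace(/\s+/g, " ");
  return STATUTE_CITATION.test(normalized) || BGE_CITATION.test(normalized);
}
```

Because the check is pure regex matching on normalized text, it is fully deterministic and cheap to run as a gate after every LLM extraction step.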

What you bring

Minimum qualifications

  • Experience building backend/data tooling with TypeScript/Node.js (strict typing, generics, async patterns).

  • Hands-on experience integrating LLM APIs (OpenAI/Anthropic or similar), including structured outputs (JSON), prompt iteration, and failure handling.

  • Strong data pipeline mindset: ETL workflows, transformation steps, validation, and reproducibility.

  • Solid SQL/PostgreSQL skills and experience with an ORM (Drizzle is a bonus).

  • Experience writing reliable tests (e.g., Jest) and maintaining CI-friendly pipelines.

  • Fluent English; willing to work hybrid in Zurich (on-site at least two days/week), full-time.
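The "structured outputs (JSON) … and failure handling" asked for above boils down to a validate-then-retry loop around the model call. A minimal sketch, where `callModel` is a hypothetical placeholder for any LLM client (not a real API):

```typescript
// Sketch: retry-with-validation around an LLM call that should return
// structured JSON. `callModel` stands in for any provider SDK call; the
// point is that output is parsed and schema-checked before being accepted.

type Validator<T> = (value: unknown) => value is T;

async function generateStructured<T>(
  callModel: () => Promise<string>, // hypothetical: returns raw model text
  isValid: Validator<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const raw = await callModel();
      const parsed: unknown = JSON.parse(raw); // throws on malformed JSON
      if (isValid(parsed)) return parsed;      // accept only schema-valid output
      throw new Error("schema validation failed");
    } catch (err) {
      lastError = err; // parse or validation failure: retry
    }
  }
  throw new Error(`giving up after ${maxAttempts} attempts: ${lastError}`);
}
```

In practice one would typically swap the hand-written type guard for a schema library (e.g. Zod) and add backoff between attempts, but the control flow stays the same.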

Preferred qualifications

  • Familiarity with the Swiss legal system (court structure, citation norms, multilingual legal terminology).

  • Working proficiency in German; French and/or Italian is a strong advantage.

  • Experience with batch processing and cost-aware LLM operations (token budgeting, batching strategy, caching, early-exit).

  • Practical text processing skills: regex-heavy parsing, dedup/near-dup detection, similarity search (e.g., BM25 / MiniSearch).

  • Familiarity with our environment: Yarn workspaces/monorepos, NestJS, and pragmatic CLI tooling.
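For a rough illustration of the near-dup detection mentioned above, here is a toy Jaccard-similarity check over word shingles. This is a sketch of the idea only; a production system would use something like MinHash/LSH or a BM25 index (as the posting suggests) to scale.

```typescript
// Toy near-duplicate check: Jaccard similarity over word 3-gram shingles.

function shingles(text: string, n = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // both empty: identical
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter); // |A∩B| / |A∪B|
}

function isNearDuplicate(x: string, y: string, threshold = 0.8): boolean {
  return jaccard(shingles(x), shingles(y)) >= threshold;
}
```

The threshold and shingle size here are arbitrary illustrative defaults; tuning them against labeled pairs is exactly the kind of measurable QA work the role involves.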

Benefits

  • Direct impact: Your datasets will directly shape model quality and evaluation reliability in legal research and reasoning.

  • Autonomy & ownership: Own the synthetic data pipeline end-to-end; prompts, validators, QA, exports, and cost controls.

  • Team: Work with a sharp interdisciplinary group at the intersection of AI, engineering, and law.

  • Compensation: CHF 8’000–13’000 per month + ESOP, depending on experience and skills.

We’re excited to hear from candidates who love building robust, cost-aware LLM pipelines and care about precision (especially when citations and multilingual legal nuance matter).
