TixelJobs
Omnilex via Indeed

AI Engineer - Synthetic Data Generation

Zürich, ZH, CH · Posted 1mo ago
ML Engineer · Mid Level · Full-time · #llm #typescript


About the Role

Why Omnilex?

At Omnilex, we’re on a mission to transform the way lawyers work. Our AI-native platform helps legal professionals boost their productivity in legal research and automate their workflows. We collaborate closely with our clients and iterate at a market-leading pace. Within a year, we have gone from an early MVP to a product used daily by thousands of legal professionals at our clients in Switzerland, Germany, and Liechtenstein - and we are now scaling rapidly across Europe.

We already stand out by handling unique challenges, including our combination of external data, customer-internal data, and our own innovative AI-first legal commentaries.

You’ll be joining a young, passionate, and dynamic team of 14, with roots at ETH Zurich.

Your role

Do you get joy from turning messy legal texts into clean, structured, high-quality datasets that actually improve model behavior? Do you like building pipelines where every step is measurable: extraction quality, citation correctness, dedup rate, cost per item, throughput, and regression stability? Are you comfortable shipping pragmatic tooling (CLIs, validators, tests) around LLMs without hand-waving away edge cases? If so, we’d love to hear from you.

What you'll do

As an AI Engineer – Synthetic Data Generation, you will build and own pipelines that generate retrieval-ready and evaluation-grade synthetic datasets from real legal sources (court decisions, statutes, commentaries) across languages and jurisdictions, while keeping quality high and costs controlled.

  • Build multi-step generation pipelines (10+ steps): from DB selection → pseudonymization → extraction → translation → normalization → deduplication → validation → classification → rating → export.

  • LLM integration, production-grade: Design robust prompt suites for extraction, translation, classification, and rating; enforce structured JSON outputs; handle retries, partial failures, and weird model behavior.

  • Quality assurance & filtering: Implement scoring systems (multi-criteria, consistent rubrics), dedup/near-dup suppression, and deterministic validators (especially for citations).

  • Citation processing at legal-grade precision: Extract, normalize, and validate citations across languages and formats (e.g., Art. 336c Abs. 1 OR, BGE 137 III 266 E. 3.2), including abbreviation mapping and normalization rules.

  • Cost & throughput optimization: Use batch APIs where appropriate, tune reasoning effort, control concurrency, count tokens, and keep runs cost-efficient (without sacrificing quality).

  • Developer tooling & CLI workflows: Build CLIs with progress tracking, configurable concurrency, and solid ergonomics for long-running jobs.

  • Testing across levels: Write unit/smoke/integration tests for pipelines and validators (including mocked LLMs where sensible and real API runs where needed).

  • Cross-team collaboration: Work closely with legal experts to define what “good” looks like for exam questions/commentaries, and translate that into measurable QA checks.
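To give a flavor of the "deterministic validators (especially for citations)" mentioned above, here is a minimal sketch of a citation validator in TypeScript. This is an illustration only, not Omnilex's actual code: the patterns cover just the two example formats from this posting, and the statute-code list is a placeholder assumption.

```typescript
// Minimal sketch of a deterministic validator for the two Swiss citation
// formats named in this posting. A real pipeline would need a far larger
// abbreviation map plus normalization rules; this only shows the shape.

// e.g. "Art. 336c Abs. 1 OR" — article number, optional paragraph (Abs.),
// statute abbreviation (list here is a small illustrative subset).
const STATUTE_CITATION = /^Art\.\s\d+[a-z]?(?:\sAbs\.\s\d+)?\s(?:OR|ZGB|StGB)$/;

// e.g. "BGE 137 III 266 E. 3.2" — volume, chamber, page, optional consideration.
const BGE_CITATION = /^BGE\s\d+\s(?:I|Ia|Ib|II|III|IV|V)\s\d+(?:\sE\.\s\d+(?:\.\d+)*)?$/;

export function isValidCitation(citation: string): boolean {
  // Normalize whitespace deterministically before matching.
  const normalized = citation.trim().replace(/\s+/g, " ");
  return STATUTE_CITATION.test(normalized) || BGE_CITATION.test(normalized);
}
```

Because the check is pure regex matching on normalized text, it is fully deterministic and cheap to run as a gate after every LLM extraction step.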

What you bring

Minimum qualifications

  • Experience building backend/data tooling with TypeScript/Node.js (strict typing, generics, async patterns).

  • Hands-on experience integrating LLM APIs (OpenAI/Anthropic or similar), including structured outputs (JSON), prompt iteration, and failure handling.

  • Strong data pipeline mindset: ETL workflows, transformation steps, validation, and reproducibility.

  • Solid SQL/PostgreSQL skills and experience with an ORM (Drizzle is a bonus).

  • Experience writing reliable tests (e.g., Jest) and maintaining CI-friendly pipelines.

  • Fluent English; willing to work hybrid in Zurich (on-site at least two days/week), full-time.
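The "structured outputs (JSON) … and failure handling" asked for above boils down to a validate-then-retry loop around the model call. A minimal sketch, where `callModel` is a hypothetical placeholder for any LLM client (not a real API):

```typescript
// Sketch: retry-with-validation around an LLM call that should return
// structured JSON. `callModel` stands in for any provider SDK call; the
// point is that output is parsed and schema-checked before being accepted.

type Validator<T> = (value: unknown) => value is T;

async function generateStructured<T>(
  callModel: () => Promise<string>, // hypothetical: returns raw model text
  isValid: Validator<T>,
  maxAttempts = 3,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const raw = await callModel();
      const parsed: unknown = JSON.parse(raw); // throws on malformed JSON
      if (isValid(parsed)) return parsed;      // accept only schema-valid output
      throw new Error("schema validation failed");
    } catch (err) {
      lastError = err; // parse or validation failure: retry
    }
  }
  throw new Error(`giving up after ${maxAttempts} attempts: ${lastError}`);
}
```

In practice one would typically swap the hand-written type guard for a schema library (e.g. Zod) and add backoff between attempts, but the control flow stays the same.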

Preferred qualifications

  • Familiarity with the Swiss legal system (court structure, citation norms, multilingual legal terminology).

  • Working proficiency in German; French and/or Italian is a strong advantage.

  • Experience with batch processing and cost-aware LLM operations (token budgeting, batching strategy, caching, early-exit).

  • Practical text processing skills: regex-heavy parsing, dedup/near-dup detection, similarity search (e.g., BM25 / MiniSearch).

  • Familiarity with our environment: Yarn workspaces/monorepos, NestJS, and pragmatic CLI tooling.
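For a rough illustration of the near-dup detection mentioned above, here is a toy Jaccard-similarity check over word shingles. This is a sketch of the idea only; a production system would use something like MinHash/LSH or a BM25 index (as the posting suggests) to scale.

```typescript
// Toy near-duplicate check: Jaccard similarity over word 3-gram shingles.

function shingles(text: string, n = 3): Set<string> {
  const words = text.toLowerCase().split(/\s+/).filter(Boolean);
  const out = new Set<string>();
  for (let i = 0; i + n <= words.length; i++) {
    out.add(words.slice(i, i + n).join(" "));
  }
  return out;
}

function jaccard(a: Set<string>, b: Set<string>): number {
  if (a.size === 0 && b.size === 0) return 1; // both empty: identical
  let inter = 0;
  for (const s of a) if (b.has(s)) inter++;
  return inter / (a.size + b.size - inter); // |A∩B| / |A∪B|
}

function isNearDuplicate(x: string, y: string, threshold = 0.8): boolean {
  return jaccard(shingles(x), shingles(y)) >= threshold;
}
```

The threshold and shingle size here are arbitrary illustrative defaults; tuning them against labeled pairs is exactly the kind of measurable QA work the role involves.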

Benefits

  • Direct impact: Your datasets will directly shape model quality and evaluation reliability in legal research and reasoning.

  • Autonomy & ownership: Own the synthetic data pipeline end-to-end; prompts, validators, QA, exports, and cost controls.

  • Team: Work with a sharp interdisciplinary group at the intersection of AI, engineering, and law.

  • Compensation: CHF 8’000–13’000 per month + ESOP, depending on experience and skills.

We’re excited to hear from candidates who love building robust, cost-aware LLM pipelines and care about precision (especially when citations and multilingual legal nuance matter).
