Founding ML Platform Engineer
About the Role
Company
Deep Reasoning Labs is building a deep reasoning layer for LLMs focused on long-horizon coding. Our iteration loop depends on a fast, reproducible training + evaluation platform: SFT, verifier/PRM training, and RLVR-style post-training, with strong lineage and cost-efficient GPU execution.
Role
You will own the internal training + research MLOps platform: scalable PEFT post-training (LoRA/QLoRA), dataset/label pipelines, data acquisition + ingestion, evaluation automation, experiment tracking, and cost-efficient GPU orchestration (including spot/preemptible strategies). You will also own the research inference layer (model serving, batching/caching, version routing) that closes the loop between training, rollouts, and evaluation. This is a “systems + research acceleration” role: your job is to make research iterations reliable, fast, cost-efficient, and auditable.
What you’ll work on
Training + post-training pipelines (PEFT-first):
- Build reproducible LoRA/QLoRA fine-tuning pipelines covering SFT, verifier/PRM training, and RLVR-style post-training (where used), optimized for cost and iteration speed
- Robust checkpointing/resume and failure handling for long-running jobs
- Artifact management: dataset versions, configs, checkpoints, eval results, and model registry with lineage
Inference serving + rollout collection (research-grade):
- Operate an LLM serving stack (e.g., vLLM/SGLang) for policy + verifier/PRM models
- Optimize throughput/cost via batching, caching, scheduling, and profiling
- Build reliable rollout collection and replay tooling (configs, model versions, artifacts, traces)
GPU orchestration + cost efficiency:
- Multi-GPU training reliability (single-node initially; scale up over time)
- Spot/preemptible strategy: interruption-tolerant training, autoscaling, queueing, capacity-aware scheduling
- Performance tuning: profiling, dataloading, communication overhead reduction, utilization improvements
Data acquisition + ingestion (training/eval):
- Build ingestion pipelines for code/text/trace datasets, including programmatic collection from select web sources where appropriate
- Implement deduping, normalization, provenance tracking, and dataset versioning
- Ensure operational robustness (rate limiting, retries, incremental crawls, change detection) and practical compliance hygiene (respect access policies/ToS where required)
What success looks like (first ~90 days)
- One-command reproducible pipeline for baseline SFT + verifier/PRM training + evaluation
- Spot/preemptible training that is interruption-tolerant (checkpoint/resume) and requires no babysitting
- Clear dataset + model lineage (you can answer: “what data created this model and what changed?”)
- Automated eval + regression detection integrated into the iteration loop
Requirements (must-have)
- Strong systems + ML infra experience: training pipelines, data systems, reliability engineering
- Strong data engineering fundamentals: building ingestion pipelines, handling messy sources, deduping, and dataset versioning/provenance
- Experience running LLM inference serving (vLLM/SGLang/TGI), including batching/caching and performance tuning
- Hands-on experience running multi-GPU training (PyTorch distributed: DDP/FSDP/DeepSpeed/etc.)
- Strong cloud + IaC skills (AWS/GCP; Terraform/CloudFormation/Pulumi)
- Track record building reproducible pipelines (artifact/version management, experiment tracking)
- Performance mindset: profiling, bottleneck identification, cost/perf tradeoffs
Nice-to-have
- Spot/preemptible fleet orchestration at scale (autoscaling, capacity strategy)
- RLHF/RLAIF infrastructure (reward models, preference pipelines, rollout collection)
- LLM serving/inference performance experience (to close the train→serve loop)
- Experience building reliable crawlers/scrapers and incremental ingestion systems (queueing, rate limits, backoff, change detection)
- Familiarity with code datasets, build/test tooling, or program analysis signals
Tech stack (likely)
Linux, Python, PyTorch distributed (DDP/FSDP/DeepSpeed), job orchestration (Kubernetes/ECS/queues), object storage, experiment tracking, IaC, internal eval infrastructure.
Location / work model
Remote-first (US/Canada). Strong preference for overlap with Pacific Time. Periodic in-person sprints in SF are a plus.
Compensation
Market-competitive base (location-based) + meaningful founding-level equity.
Pay: $90,000 - $200,000 per year
Work Location: Remote