Founding ML Platform Engineer
About the Role
Company
Deep Reasoning Labs is building a deep reasoning layer for LLMs focused on long-horizon coding. Our iteration loop depends on a fast, reproducible training + evaluation platform: SFT, verifier/PRM training, and RLVR-style post-training, with strong lineage and cost-efficient GPU execution.
Role
You will own the internal training + research MLOps platform: scalable PEFT post-training (LoRA/QLoRA), dataset/label pipelines, data acquisition + ingestion, evaluation automation, experiment tracking, and cost-efficient GPU orchestration (including spot/preemptible strategies). You will also own the research inference layer (model serving, batching/caching, version routing) that closes the loop between training, rollouts, and evaluation. This is a “systems + research acceleration” role: your job is to make research iterations reliable, fast, cost-efficient, and auditable.
What you’ll work on
Training + post-training pipelines (PEFT-first):
- Build reproducible LoRA/QLoRA fine-tuning pipelines covering SFT, verifier/PRM training, and RLVR-style post-training (where used), optimized for cost and iteration speed
- Robust checkpointing/resume and failure handling for long-running jobs
- Artifact management: dataset versions, configs, checkpoints, eval results, and model registry with lineage
Inference serving + rollout collection (research-grade):
- Operate an LLM serving stack (e.g., vLLM/SGLang) for policy + verifier/PRM models
- Optimize throughput/cost via batching, caching, scheduling, and profiling
- Build reliable rollout collection and replay tooling (configs, model versions, artifacts, traces)
GPU orchestration + cost efficiency:
- Multi-GPU training reliability (single-node initially; scale up over time)
- Spot/preemptible strategy: interruption-tolerant training, autoscaling, queueing, capacity-aware scheduling
- Performance tuning: profiling, dataloading, communication overhead reduction, utilization improvements
Data acquisition + ingestion (training/eval):
- Build ingestion pipelines for code/text/trace datasets, including programmatic collection from select web sources where appropriate
- Implement deduping, normalization, provenance tracking, and dataset versioning
- Ensure operational robustness (rate limiting, retries, incremental crawls, change detection) and practical compliance hygiene (respect access policies/ToS where required)
What success looks like (first ~90 days)
- One-command reproducible pipeline for baseline SFT + verifier/PRM training + evaluation
- Spot/preemptible training that is interruption-tolerant (checkpoint/resume) and requires no babysitting
- Clear dataset + model lineage (you can answer: “what data created this model and what changed?”)
- Automated eval + regression detection integrated into the iteration loop
Requirements (must-have)
- Strong systems + ML infra experience: training pipelines, data systems, reliability engineering
- Strong data engineering fundamentals: building ingestion pipelines, handling messy sources, deduping, and dataset versioning/provenance
- Experience running LLM inference serving (vLLM/SGLang/TGI), including batching/caching and performance tuning
- Hands-on experience running multi-GPU training (PyTorch distributed: DDP/FSDP/DeepSpeed/etc.)
- Strong cloud + IaC skills (AWS/GCP; Terraform/CloudFormation/Pulumi)
- Track record building reproducible pipelines (artifact/version management, experiment tracking)
- Performance mindset: profiling, bottleneck identification, cost/perf tradeoffs
Nice-to-have
- Spot/preemptible fleet orchestration at scale (autoscaling, capacity strategy)
- RLHF/RLAIF infrastructure (reward models, preference pipelines, rollout collection)
- LLM serving/inference performance experience (to close the train→serve loop)
- Experience building reliable crawlers/scrapers and incremental ingestion systems (queueing, rate limits, backoff, change detection)
- Familiarity with code datasets, build/test tooling, or program analysis signals
Tech stack (likely)
Linux, Python, PyTorch distributed (DDP/FSDP/DeepSpeed), job orchestration (Kubernetes/ECS/queues), object storage, experiment tracking, IaC, internal eval infrastructure.
Location / work model
Remote-first (US/Canada). Strong preference for overlap with Pacific Time. Periodic in-person sprints in SF are a plus.
Compensation
Market-competitive base (location-based) + meaningful founding-level equity.
Pay: $90,000 - $200,000 per year
Work Location: Remote