Bespoke Labs via Ashby
DevOps / Site Reliability Engineer
Remote · Posted 2w ago
DevOps · Mid Level · Full-time · Remote
About the Role
About Bespoke Labs
Bespoke Labs is an AI research and data company building the datasets, benchmarks, and evaluation infrastructure that power frontier AI models. We're backed by leading investors, trusted by top AI labs, and have research accepted at venues like ICLR 2026. Our team is small, moves fast, and has an outsized impact on how the next generation of AI is built.
The Role
We're looking for a mid-level DevOps / Site Reliability Engineer to own and scale our cloud infrastructure. You'll work closely with engineering and ML teams to keep our systems reliable, observable, and fast — directly supporting the infrastructure that powers AI data pipelines at scale.
What You'll Do
- Own cloud infrastructure on AWS — EC2, EKS, RDS, S3, IAM, VPC
- Manage Kubernetes clusters and container orchestration end-to-end
- Build and maintain CI/CD pipelines using GitHub Actions or similar
- Implement monitoring, alerting, and observability stacks (Prometheus, Grafana, or Datadog)
- Improve reliability, performance, and security of production systems
- Automate infrastructure with Terraform or similar IaC tools
- Debug and resolve issues across complex, distributed systems
- Participate in design reviews and help raise the infrastructure bar
What We're Looking For
- 3–5 years in DevOps, SRE, or infrastructure engineering
- Strong AWS experience — EKS, EC2, RDS, S3, IAM
- Kubernetes — deployment, scaling, troubleshooting in production
- CI/CD pipelines — GitHub Actions, ArgoCD, or similar
- Infrastructure as Code — Terraform, Pulumi, or CDK
- Python or Go scripting
- Experience working in production environments with real users
- Comfort with ambiguity and ability to operate autonomously
Nice to Have
- Experience supporting ML training workloads or GPU clusters
- Familiarity with distributed computing or large-scale data pipelines
- Prior work at an AI, ML, or data company
- Open-source contributions or published technical writing
What We Offer
- Competitive compensation and meaningful equity
- Direct impact on frontier AI model training and evaluation infrastructure
- Flexible, remote-friendly environment with low bureaucracy
- A small, high-caliber team with deep AI research expertise
- Health, wellness, and learning & development benefits