Cerebras via Greenhouse

Site Reliability Engineer - Ops & Automation

Sunnyvale, CA or Toronto, Canada · Posted 1mo ago
Other · Mid Level · Full-time · #ai-lab

About the Role

Cerebras Systems builds the world's largest AI chip, 56 times larger than GPUs. Our novel wafer-scale architecture provides the AI compute power of dozens of GPUs on a single chip, with the programming simplicity of a single device. This approach allows Cerebras to deliver industry-leading training and inference speeds and empowers machine learning users to effortlessly run large-scale ML applications, without the hassle of managing hundreds of GPUs or TPUs.  

Cerebras' current customers include top model labs, global enterprises, and cutting-edge AI-native startups. OpenAI recently announced a multi-year partnership with Cerebras to deploy 750 megawatts of scale, transforming key workloads with ultra-high-speed inference.

Thanks to the groundbreaking wafer-scale architecture, Cerebras Inference offers the fastest Generative AI inference solution in the world, over 10 times faster than GPU-based hyperscale cloud inference services. This order of magnitude increase in speed is transforming the user experience of AI applications, unlocking real-time iteration and increasing intelligence via additional agentic computation.

About the Role 

We are building a high-performance SRE function to support one of the world’s fastest-growing AI inference services, powered by the Wafer-Scale Engine (WSE), helping deliver infrastructure for frontier-class models from leading model builders such as OpenAI. 

This role offers immediate ownership of real production systems at a growing scale, direct mentorship from seasoned engineers, and close collaboration with incoming Staff SREs who will focus on long-term automation. After ~1 month of shared hands-on operations with the Staff engineers, you’ll primarily operate the current setup, bring up new capacity in high-stakes environments, and help bring new continuous delivery pipelines into production use.

If you thrive in high-ownership SRE roles at scale and want to help shape a team from the ground up in cutting-edge AI Inference infrastructure, this is your chance. 

This role does not require 24/7 on-call rotations. 

Key Responsibilities 

  • Remain hands-on with operational execution (releases, capacity changes, cluster upgrades) over the next year as we build robust continuous delivery pipelines and self-service capabilities.
  • Contribute to the development of self-service CD pipelines for key workflows using our stack: Kubernetes, Bazel, Prometheus/Grafana/InfluxDB, Python, and Go.
  • Build reusable automation and internal developer tools that minimize operational toil and cross-team friction.