Reliability Engineer
About the Role
About Graphcore
At Graphcore, we’re building the future of AI compute. We’re a team of semiconductor, software and AI experts, with deep experience in creating the complete AI compute stack - from silicon and software to infrastructure at datacenter scale. As part of the SoftBank Group, backed by significant long-term investment, we are delivering key technology into the fast-growing SoftBank AI ecosystem. To meet the vast and exciting AI opportunity, Graphcore is expanding its teams around the world. We are bringing together the brightest minds to solve the toughest problems, in a place where everyone has the opportunity to make an impact on the company, our products and the future of artificial intelligence.
Job Summary
Responsible for system-level reliability of AI servers with liquid cooling and HVDC architectures, owning reliability validation, shock & vibration robustness, and failure analysis from board to rack level to ensure safe transport, deployment, and long-term datacenter operation.
Key Responsibilities and Skills
- Plan and execute reliability validation across board, server, and rack levels.
- Define and run environmental, accelerated, and mechanical tests, including thermal/power cycling, humidity, corrosion, shock & vibration, and HALT/HASS.
- Lead shock & vibration validation for transportation, handling, seismic, and operational conditions.
- Assess reliability risks for liquid cooling systems (leakage, fatigue, pump life, corrosion, coolant stability).
- Evaluate HVDC mechanical and electrical robustness (busbars, connectors, power interfaces).
- Perform reliability prediction and life data analysis (Weibull, MTBF).
- Lead cross-functional design reviews and drive risk mitigation.
- Conduct failure analysis and RCA using standard FA methodologies.
- Define and maintain reliability and S&V test specifications (JEDEC, Telcordia GR-63, JESD22, MIL-STD-810, ISTA, ASHRAE, UL, IEC).
- Implement Ongoing Reliability Testing (ORT) to monitor production quality.