Lead Software Engineer, AI Infrastructure
About the Role
Persons in these roles are expected to work from our offices in Seattle. On-site requirements vary based on position and team. If you have questions about on-site work arrangements for this role, please ask your recruiter.
Our base salary range is $146,880 - $220,320, and in addition we have generous bonus plans to provide a competitive compensation package.
Who You Are
You are a visionary leader who occupies the space between high-level software orchestration and low-level system performance. You are motivated by the idea that world-class infrastructure should be a catalyst for public good, not a proprietary secret. You understand that in the world of frontier AI, the "software" and the "hardware" are a single, inseparable organism. You are as comfortable designing a distributed scheduling algorithm in Go as you are debugging an NCCL timeout or optimizing an InfiniBand fabric.
You lead by example, blending the rigor of a Lead Software Engineer with the pragmatic, hands-on urgency of an HPC operator. Not only do you build systems, but you also ensure they thrive under the immense pressure of training world-class AI models.
Who We Are
While much of the AI industry has moved behind closed APIs, proprietary datasets, and "black box" infrastructure, Ai2 remains a lighthouse for Open Science. Founded by the late Paul Allen, we are a non-profit research institute dedicated to building AI for the common good.
We don't have a stock price to defend or a walled garden to protect. Instead, we have a mission: to provide the global research community with the transparent, high-performance foundations they need to achieve humanity-enriching breakthroughs.
What makes us different:
- Radical Transparency: We don't just release model weights; we release the data, the training code, and the infrastructure insights. We believe the "how" is just as important as the "what."
- Mission over Margin: Our "bottom line" is scientific impact. This gives us the unique freedom to prioritize technical elegance, long-term stability, and open-source contributions over quarterly profit targets.
- The Best of Both Worlds: We operate at the pace and scale of a world-class tech startup but with the intellectual soul of a research lab.
- The Beaker Ecosystem: We build and operate systems like Beaker to coordinate the simultaneous training of frontier models (like OLMo) across massive GPU clusters. Our job is to ensure that the next great AI breakthrough isn't stalled by a resource bottleneck or a proprietary gatekeeper.
Your Next Challenge
At Ai2, we believe that the most important AI breakthroughs should be transparent and accessible. Your challenge is to build the infrastructure that makes this possible. You will connect our researchers, our orchestration platform (Beaker), and our GPU clusters.
You will be a technical lead responsible for ensuring that when a researcher submits a job, the software schedules it intelligently and the hardware executes it flawlessly. This involves:
- Designing for Scale: Architecting the next generation of our orchestration layer to ensure that the highest value workloads receive GPU time.
- Operational Excellence: Moving our HPC operations from manual intervention to high-level automation.
- Performance Engineering: Working directly with researchers to squeeze every bit of performance out of our GPU-accelerated computing environment.
Your Responsibilities
- Strategic Leadership: Develop the roadmap for managing large-scale HPC systems, including the deployment of compute, networking, and storage in partnership with leadership.
- Full-Stack Ownership: Lead the design and delivery of critical systems that span the entire stack—from the Beaker job scheduler to the execution runtime.
- System Automation: Build innovative tooling and software-defined infrastructure to accelerate researcher velocity and automate cluster health management.
- Performance Optimization: Conduct root-cause analysis on complex distributed system failures and implement optimizations for distributed workloads.
- Mentorship & Culture: Foster a high-performance culture by reviewing code/design docs, mentoring team members, and driving process improvements across the organization.
- Evangelism: Represent Ai2’s infrastructure work across internal research teams.
What You’ll Need
- 10+ years of professional experience developing business-critical software and operating large-scale compute infrastructure. Proficiency in Go and/or Python preferred.
- Bachelor’s degree in a related field; a relevant advanced degree may substitute for equivalent years of technical work experience.
- Deep Linux Expertise: Expert-level knowledge of Linux internals and container runtimes like Docker.
- Distributed Systems Mastery: A proven track record of designing, debugging, and optimizing high-scale distributed systems and databases.
- HPC Foundations: Applied experience with workload schedulers (like Kubernetes or Slurm) and high-performance networking (NCCL and InfiniBand).
- Cloud & Hardware Hybridity: Familiarity with the nuances of on-prem GPU cluster management and cloud infrastructure (GCP, AWS).
- Communication: Exceptional writing skills and the ability to drive consensus across diverse groups of researchers and engineers.
- A principled approach to engineering: you care about how systems are built and are excited by the unique constraints and freedoms of a non-profit research environment.
Bonus Qualifications
- Prior experience training or fine-tuning frontier AI models.
- Deep systems administration expertise or "Site Reliability Engineering" (SRE) background in an HPC context.
- Experience contributing to open-source infrastructure or orchestration projects.
- Familiarity with on-prem storage systems like WEKA and Ceph.
Physical Demands and Work Environment:
The physical demands described here are representative of those that must be met by a team member to successfully perform the essential functions of this position. Reasonable accommodations may be made to enable individuals with disabilities to perform the functions.
- Must be able to remain in a stationary position for long periods of time.
- The ability to communicate information and ideas so others will understand. Must be able to exchange accurate information in these situations.
- The ability to observe details at close range.
- Can work under deadlines.
A Little More About Ai2:
Ai2 is a Seattle-based non-profit AI research institute founded in 2014 by the late Paul Allen. Our mission is building breakthrough AI to solve the world’s biggest problems. We develop foundational AI research and innovation to deliver real-world impact through large-scale open models, data, robotics, conservation, and beyond.
In addition to Ai2’s core mission, we also aim to contribute to humanity through our treatment of each member of the Ai2 Team. Some highlights are:
- We are a learning organization – because everything Ai2 does is ground-breaking, we are learning every day. Similarly, through weekly Ai2 Academy lectures, a wide variety of world-class AI experts as guest speakers, and our commitment to your personal ongoing education, Ai2 is a place where you will have opportunities to continue learning alongside your coworkers.
- We value diversity – we seek to hire, support, and promote people from all genders, ethnicities, and all levels of experience regardless of age. We particularly encourage applications from women, non-binary individuals, people of color, members of the LGBTQA+ community, and people with disabilities of any kind.