Senior Cloud Support Engineer II - AI/ML
About the Role
Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here. We value winning together—while learning, having fun, and making a profound difference for the dreamers and builders in the world.
We are seeking an exceptional Senior Cloud Support Engineer II to join our AI/ML Support team at DigitalOcean. This is our highest individual contributor level within the Support organization, representing the pinnacle of technical expertise, customer advocacy, and strategic impact.
As a Senior Cloud Support Engineer II, you will serve as the ultimate technical authority for our most complex customer challenges, particularly around Kubernetes (K8s) and GPU/GradientAI workloads. You'll bridge the gap between deep support expertise and solutions architecture, designing sophisticated cloud infrastructure solutions while maintaining the customer-first mentality that defines our Support organization. This role combines the architectural thinking of a Solutions Architect with the hands-on troubleshooting excellence and customer empathy expected from our Support team. You will also participate in an operational on-call rotation to support critical incidents and escalations.
What You'll Do
Technical Leadership & Expertise
- Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure, coordinating cross-functional responses that span Engineering, Product, and Operations
- Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean, including multi-cluster Kubernetes deployments, distributed GPU training infrastructure, and hybrid/multi-cloud architectures
- Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews, performance optimization workshops, and proof-of-concept implementations
- Drive resolution of systemic technical challenges by identifying patterns across customer issues, partnering with Engineering to implement platform-level improvements, and advocating for product enhancements that eliminate entire classes of problems
- Research and evaluate emerging technologies in the AI/ML and cloud infrastructure space, identifying opportunities for DigitalOcean to differentiate and expand our capabilities
Customer Impact & Strategic Partnerships
- Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams and understanding their business objectives
- Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations, managing complex project timelines, stakeholder expectations, and technical deliverables
- Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities, architectural best practices, and roadmap vision to C-level and VP-level stakeholders
- Partner strategically with Customer Success to drive expansion opportunities, prevent churn through proactive technical guidance, and transform technical challenges into growth opportunities
- Influence product strategy by synthesizing customer insights, competitive intelligence, and technical trends into actionable recommendations for Product and Engineering leadership
Organizational Leadership
- Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, pair troubleshooting sessions, and career development guidance
- Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices that elevate team capabilities
- Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, customer-facing solution guides, and internal training curricula
- Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms and ensuring timely, effective resolution
- Represent the Support organization in cross-functional initiatives, product design reviews, and strategic planning sessions, ensuring the voice of the customer influences critical decisions
Domain Specialization
Primary Focus Areas:
- Kubernetes (K8s): Expert-level architecture, troubleshooting, and optimization for production workloads
- GPU/GradientAI: Deep expertise in GPU infrastructure, distributed training, inference optimization, and Generative AI for our GradientAI platform
Valuable Additional Expertise:
- Bare Metal Infrastructure: Hardware provisioning, server configuration, performance tuning
- Advanced Networking: BGP, VPNs, load balancing, network security, and complex multi-region architectures
What You'll Add to DigitalOcean
Required Experience & Expertise
Technical Background
- 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles with consistent demonstration of technical leadership
- 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements
- Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking (CNI, service meshes, ingress controllers)
- Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns
AI/ML Technical Depth
- Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale
- Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns
- Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization (INT4, INT8, FP8), and inference performance tuning
- Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring
- Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training
Cloud Infrastructure & Architecture
- Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence
- Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation
- Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues
- Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar)
- Exte