DigitalOcean via Greenhouse

Senior Cloud Support Engineer II - AI/ML

Hyderabad · Posted 1w ago

ML Engineer · Senior · Full-time

About the Role

Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here. We value winning together—while learning, having fun, and making a profound difference for the dreamers and builders in the world.

We are seeking an exceptional Senior Cloud Support Engineer II to join our AI/ML Support team at DigitalOcean. This is our highest individual contributor level within the Support organization, representing the pinnacle of technical expertise, customer advocacy, and strategic impact.

As a Senior Cloud Support Engineer II, you will serve as the ultimate technical authority for our most complex customer challenges, particularly around Kubernetes (K8S) and GPU/GradientAI workloads. You'll bridge the gap between deep support expertise and solutions architecture, designing sophisticated cloud infrastructure solutions while maintaining the customer-first mentality that defines our Support organization. This role combines the architectural thinking of a Solutions Architect with the hands-on troubleshooting excellence and customer empathy expected from our Support team. You will also participate in an operational on-call rotation to support critical incidents and escalations.

What You'll Do

Technical Leadership & Expertise

Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure, coordinating cross-functional responses that span Engineering, Product, and Operations
Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean, including multi-cluster Kubernetes deployments, distributed GPU training infrastructure, and hybrid/multi-cloud architectures
Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews, performance optimization workshops, and proof-of-concept implementations
Drive resolution of systemic technical challenges by identifying patterns across customer issues, partnering with Engineering to implement platform-level improvements, and advocating for product enhancements that eliminate entire classes of problems
Research and evaluate emerging technologies in the AI/ML and cloud infrastructure space, identifying opportunities for DigitalOcean to differentiate and expand our capabilities

Customer Impact & Strategic Partnerships

Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams and understanding their business objectives
Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations, managing complex project timelines, stakeholder expectations, and technical deliverables
Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities, architectural best practices, and roadmap vision to C-level and VP-level stakeholders
Partner strategically with Customer Success to drive expansion opportunities, prevent churn through proactive technical guidance, and transform technical challenges into growth opportunities
Influence product strategy by synthesizing customer insights, competitive intelligence, and technical trends into actionable recommendations for Product and Engineering leadership

Organizational Leadership

Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, pair troubleshooting sessions, and career development guidance
Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices that elevate team capabilities
Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, customer-facing solution guides, and internal training curricula
Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms and ensuring timely, effective resolution
Represent the Support organization in cross-functional initiatives, product design reviews, and strategic planning sessions, ensuring the voice of the customer influences critical decisions

Domain Specialization

Primary Focus Areas:

Kubernetes (K8S): Expert-level architecture, troubleshooting, and optimization for production workloads
GPU/GradientAI: Deep expertise in GPU infrastructure, distributed training, inference optimization, and Generative AI for our GradientAI platform

Valuable Additional Expertise:

Bare Metal Infrastructure: Hardware provisioning, server configuration, performance tuning
Advanced Networking: BGP, VPNs, load balancing, network security, and complex multi-region architectures

What You'll Add to DigitalOcean

Required Experience & Expertise

Technical Background

7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles with consistent demonstration of technical leadership
5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements
Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking (CNI, service meshes, ingress controllers)
Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns

AI/ML Technical Depth

Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale
Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns
Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization (INT4, INT8, FP8), and inference performance tuning
Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring
Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training

Cloud Infrastructure & Architecture

Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence
Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation
Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues
Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar)
Exte
