DigitalOcean · via Greenhouse

Senior Cloud Support Engineer II - AI/ML

Hyderabad · Posted 1w ago
ML Engineer · Senior · Full-time

About the Role

Dive in and do the best work of your career at DigitalOcean. Journey alongside a strong community of top talent who are relentless in their drive to build the simplest scalable cloud. If you have a growth mindset, naturally like to think big and bold, and are energized by the fast-paced environment of a true industry disruptor, you’ll find your place here. We value winning together while learning, having fun, and making a profound difference for the dreamers and builders in the world.

We are seeking an exceptional Senior Cloud Support Engineer II to join our AI/ML Support team at DigitalOcean. This is our highest individual contributor level within the Support organization, representing the pinnacle of technical expertise, customer advocacy, and strategic impact.

As a Senior Cloud Support Engineer II, you will serve as the ultimate technical authority for our most complex customer challenges, particularly around Kubernetes (K8S) and GPU/GradientAI workloads. You'll bridge the gap between deep support expertise and solutions architecture, designing sophisticated cloud infrastructure solutions while maintaining the customer-first mentality that defines our Support organization. This role combines the architectural thinking of a Solutions Architect with the hands-on troubleshooting excellence and customer empathy expected from our Support team. You will also participate in an operational on-call rotation to support critical incidents and escalations.

What You'll Do

Technical Leadership & Expertise

  • Serve as the ultimate escalation point for the most complex, business-critical customer issues across Kubernetes, GPU/GradientAI, and AI/ML infrastructure, coordinating cross-functional responses that span Engineering, Product, and Operations
  • Architect enterprise-grade solutions for customers building large-scale AI/ML workloads on DigitalOcean, including multi-cluster Kubernetes deployments, distributed GPU training infrastructure, and hybrid/multi-cloud architectures
  • Lead technical discovery and solution design for strategic accounts, conducting deep-dive architectural reviews, performance optimization workshops, and proof-of-concept implementations
  • Drive resolution of systemic technical challenges by identifying patterns across customer issues, partnering with Engineering to implement platform-level improvements, and advocating for product enhancements that eliminate entire classes of problems
  • Research and evaluate emerging technologies in the AI/ML and cloud infrastructure space, identifying opportunities for DigitalOcean to differentiate and expand our capabilities

Customer Impact & Strategic Partnerships

  • Act as a trusted technical advisor to our highest-value customers and strategic partners, building deep relationships with their technical teams and understanding their business objectives
  • Design and deliver Professional Services engagements for enterprise customers requiring sophisticated AI/ML infrastructure implementations, managing complex project timelines, stakeholder expectations, and technical deliverables
  • Conduct executive technical briefings and workshops that articulate DigitalOcean's platform capabilities, architectural best practices, and roadmap vision to C-level and VP-level stakeholders
  • Partner strategically with Customer Success to drive expansion opportunities, prevent churn through proactive technical guidance, and transform technical challenges into growth opportunities
  • Influence product strategy by synthesizing customer insights, competitive intelligence, and technical trends into actionable recommendations for Product and Engineering leadership

Organizational Leadership

  • Mentor and develop IC1-IC3 engineers through structured coaching, technical reviews, pair troubleshooting sessions, and career development guidance
  • Design and implement support frameworks including escalation workflows, troubleshooting methodologies, automation tools, and operational best practices that elevate team capabilities
  • Create authoritative technical documentation including architectural reference guides, troubleshooting runbooks, customer-facing solution guides, and internal training curricula
  • Lead critical incident response for platform-wide or high-impact customer issues, coordinating cross-functional war rooms and ensuring timely, effective resolution
  • Represent the Support organization in cross-functional initiatives, product design reviews, and strategic planning sessions, ensuring the voice of the customer influences critical decisions

Domain Specialization

Primary Focus Areas:

  • Kubernetes (K8S): Expert-level architecture, troubleshooting, and optimization for production workloads
  • GPU/GradientAI: Deep expertise in GPU infrastructure, distributed training, inference optimization, and Generative AI for our GradientAI platform

Valuable Additional Expertise:

  • Bare Metal Infrastructure: Hardware provisioning, server configuration, performance tuning
  • Advanced Networking: BGP, VPNs, load balancing, network security, and complex multi-region architectures

What You'll Add to DigitalOcean

Required Experience & Expertise

Technical Background

  • 7+ years of progressive experience in technical support, solutions engineering, DevOps, or site reliability engineering roles with consistent demonstration of technical leadership
  • 5+ years in senior technical customer-facing roles with proven ability to manage enterprise customer relationships and complex technical engagements
  • Expert-level Kubernetes knowledge: Production-scale architecture design, cluster operations, advanced troubleshooting, performance optimization, security hardening, and networking (CNI, service meshes, ingress controllers)
  • Deep GPU/AI/ML infrastructure expertise: Multi-GPU and multi-node training, distributed computing frameworks, GPU resource management, inference optimization, and production ML deployment patterns

AI/ML Technical Depth

  • Advanced understanding of production AI/ML pipelines including model training, optimization, deployment, and monitoring at scale
  • Extensive experience with major ML frameworks (PyTorch, TensorFlow, Hugging Face) including distributed training strategies and production deployment patterns
  • Expertise in GPU optimization techniques: CUDA programming concepts, TensorRT, vLLM, model quantization (INT4, INT8, FP8), and inference performance tuning
  • Deep knowledge of MLOps practices: CI/CD for ML, model versioning, experiment tracking, feature stores, and production monitoring
  • Experience with large-scale distributed AI/ML workloads including data parallelism, model parallelism, and mixed-precision training

Cloud Infrastructure & Architecture

  • Proven experience designing fault-tolerant, scalable cloud architectures with deep consideration for cost optimization, security, compliance, and operational excellence
  • Expert-level Linux system administration: Kernel tuning, performance profiling, security hardening, advanced troubleshooting, and automation
  • Advanced networking expertise: Deep understanding of TCP/IP, routing protocols, load balancing, CDNs, VPNs, network security, and troubleshooting complex network issues
  • Strong programming skills in Python with experience in at least one additional systems language (Go, Rust, C++, or similar)
  • Exte