Genesis AI (via Ashby)
Member of Technical Staff, Training (Bay Area, Remote)
Bay Area · Posted 1w ago
Other · Staff+ · Full-time
About the Role
WHAT YOU’LL DO
- Drive down wall-clock time to convergence by profiling and eliminating bottlenecks across the foundation model training stack, from data pipelines to GPU kernels
- Design, build, and optimize distributed training systems (PyTorch) for multi-node GPU clusters, ensuring scalability, robustness, and high utilization
- Implement efficient low-level code (CUDA, cuDNN, Triton, custom kernels) and integrate it seamlessly into high-level training frameworks
- Optimize workloads for hardware efficiency: CPU/GPU compute balance, memory management, data throughput, and networking
- Develop monitoring and debugging tools for large-scale runs, enabling rapid diagnosis of performance regressions and failures
WHAT YOU’LL BRING
- Deep experience in distributed systems, ML infrastructure, or high-performance computing (8+ years)
- Production-grade expertise in Python
- Low-level performance mastery: CUDA/cuDNN/Triton, CPU–GPU interactions, data movement, and kernel optimization
- Scaling at the frontier: experience with PyTorch and training jobs using data, context, pipeline, and model parallelism
- System-level mindset with a track record of tuning hardware–software interactions for maximum utilization