About OpsHeaven Technologies (OHT):
OpsHeaven Technologies (OHT) is a forward-thinking company that delivers custom software solutions, provides team augmentation services, and fosters innovation through its startup studio. At OHT, we strive to create impactful, scalable technology solutions that bring value to our clients across diverse industries.

Responsibilities:

Manage and maintain high-performance machine learning (ML) training clusters, ensuring they run smoothly and efficiently.
Monitor hardware health, troubleshoot issues, and implement fixes to maintain system stability.
Automate cluster management processes, ensuring that code is the primary tool for handling hardware at scale.
Lead and mentor a team of engineers, providing technical guidance and ensuring alignment with project goals.
Collaborate with cross-functional teams to ensure customer requirements are met and to provide technical expertise.
Predict and prevent hardware failures through data-driven analysis as our infrastructure scales.
Develop and maintain detailed documentation to support the ongoing management and expansion of compute clusters.
Engage with customers to understand their needs and provide support related to cluster performance and reliability.

Qualifications:

3-5 years of experience managing GPU training clusters, with a deep understanding of Linux, CUDA, NCCL, and InfiniBand.
Proven track record in building large-scale, self-correcting systems that ensure hardware operates optimally.
Experience managing and mentoring engineering teams.
Strong commitment to creating and maintaining comprehensive documentation.
Ability to work independently and as part of a team, with a proactive approach to problem-solving.
Excellent written and spoken communication skills; fluency in English is required.

Nice to Haves:

Experience with Rust, particularly in the context of virtual machine (VM) orchestration.
Knowledge of distributed storage systems such as Weka, VAST, or Ceph.
Familiarity with high-performance computing (HPC) network topologies and technologies such as fat-tree, eBGP, VXLAN, and MC-LAG.
Experience with Linux virtualization technologies including KVM, QEMU, and libvirt.
Expertise in performance optimization of machine learning kernels.

Compute Cluster Engineer
