About the Role
This is an Expression of Interest, not an active role.
We run GPU clusters on AMD Instinct and Nvidia HGX-class hardware. The systems engineering job covers the whole stack: firmware, ROCm or CUDA, fabric, optics, RDMA and storage, all the way through to tenant-ready clusters.
If you have built or operated production GPU systems at meaningful scale, we want to know who you are.
Responsibilities
- Bring up new GPU clusters: firmware, BIOS, driver stack, fabric configuration, validation.
- Tune and troubleshoot RDMA, RoCE and NCCL or RCCL behaviour at the cluster level.
- Operate ROCm, CUDA and the supporting library stack across tenants.
- Coordinate with platform, network and DC teams on capacity, reliability and hardware swaps.
- Write the runbooks the next operator will rely on.
Required Skills and Experience
- Hands-on experience with production GPU clusters, AMD Instinct or Nvidia HGX-class.
- Strong Linux fundamentals, including kernel- and driver-level troubleshooting.
- Understanding of RDMA fabric design, NCCL or RCCL tuning, and multi-node training performance.
- Comfort with firmware updates, hardware diagnostics and vendor escalations.
- Methodical. You isolate the variable rather than swap the part.
About OneQode
OneQode is a global provider of performance digital infrastructure. With a vertically integrated platform that spans cloud compute, low-latency networking and sovereign technology across more than 30 datacentres on 5 continents, we enable enterprises, governments and performance-hungry businesses to run AI and mission-critical workloads at scale, across the globe.
How to Apply
If this sounds like you, we'd love to hear from you.
Click the button below to apply.