Senior Hardware Engineer, GPU & AI Infrastructure at Roblox
Roblox
San Mateo, CA
Engineering
Posted 0 days ago
Job Description
As a member of the Infrastructure Foundation Hardware Engineering team, you will play a key role in enabling our mission to deliver a reliable, high-performing, and cost-efficient infrastructure that powers the worlds … In this specialized role, you will be the technical lead for our GPU and AI accelerator ecosystem. You will be responsible for the full lifecycle of GPU hardware, from initial architectural evaluation and firmware qualification to large-scale fleet integration and performance tuning. You will ensure that Roblox's massive-scale rendering and ML workloads run on the most optimized and stable hardware possible.
You Will:
Architect & Prototype: Prototype next-generation GPU-accelerated hardware platforms, ensuring seamless integration between high-density compute nodes, high-speed interconnects (NVLink/PCIe Gen5/6), and system firmware.
GPU Optimization: Drive the integration, performance testing, and debugging of GPUs in our fleet, focusing specifically on hardware-level optimizations, driver tuning, and thermal/power management.
Validation & Certification: Develop and execute rigorous evaluation and stress-testing strategies for GPU-heavy server platforms to ensure they meet Roblox's unique demands for real-time rendering and low-latency AI inference.
Firmware & Systems: Lead firmware qualification (BIOS/BMC) and troubleshooting, implementing automation systems to manage GPU health and firmware updates.
Vendor Collaboration: Provide technical guidance and deep-dive feedback to hardware vendors. Lead critical investigations into component-level failures, triaging issues across the hardware, driver, and kernel layers.
Observability: Build and maintain advanced monitoring stacks (Grafana/Prometheus) to track GPU metrics such as HBM utilization, thermal throttling events, and PCIe bandwidth saturation.
You Have:
Education: BA/BS degree in Electrical Engineering, Computer Engineering, or a related field with equivalent practical experience.
GPU Expertise: 5 years of hardware engineering experience with a specific focus on GPU architecture (NVIDIA HGX/MGX platforms preferred), AI accelerators, or high-performance compute (HPC) systems.
Deep Technical Knowledge: In-depth understanding of modern data center technologies, including PCIe fabric, NVLink, InfiniBand, and liquid cooling systems for high-TDP hardware.
Testing Skills: Hands-on experience testing and validating CPU, memory (HBM/DDR5), storage (NVMe), and high-speed networking subsystems in a Linux environment.
Programming: Proficiency in Python, Go, or C for developing hardware validation tools and automation scripts.
Systemic Debugging: Expert-level skills in debugging complex server issues remotely, with the ability to analyze kernel logs, hardware registers, and bus-level captures.
You Are:
A Problem Solver: Decisive and effective at tracking hardware issues from identification through to fleet-wide resolution.
A Communicator: Excellent oral and written communication skills; able to translate complex hardware constraints into actionable insights for software teams.
Collaborative: Strong interpersonal skills, with the ability to lead cross-functional projects with Data Center Ops, SRE, and external vendors.
Adaptable: Willing to travel occasionally to data centers or vendor sites to oversee hardware deployments or first-of-a-kind builds.
Required Experience: Senior IC
Key Skills: Active Directory Administration, Animal, Apparel, Entry Level, JBoss, Inventory Management
Employment Type: Full Time
Experience: years
Vacancy: 1
Resume Suggestions
Highlight relevant experience and skills that match the job requirements to demonstrate your qualifications.
Quantify your achievements with specific metrics and results whenever possible to show impact.
Emphasize your proficiency in relevant technologies and tools mentioned in the job description.
Showcase your communication and collaboration skills through examples of successful projects and teamwork.