DSM-H LLC

Cloud Engineer (NVIDIA GPU-based systems) at DSM-H LLC

DSM-H LLC Chicago, Illinois

Job Description

Locations: Also open to Dallas, TX; Peoria, IL; Phoenix, AZ; Cary, NC

Typical task breakdown:
- Administer and maintain GPU-accelerated servers and clusters, including NVIDIA A100, H100, and other high-end GPU sets.
- Manage and optimize NVIDIA software stack components such as CUDA, cuDNN, TensorRT, NCCL, and NGC containers.
- Monitor system performance, troubleshoot hardware/software issues, and ensure high availability of AI infrastructure.
- Collaborate with DevOps and AI teams to support containerized workflows (Docker, Kubernetes) and distributed training environments.
- Implement security best practices and ensure compliance with internal and external standards.
- Lead upgrades, patching, and lifecycle management of GPU servers and related infrastructure.
- Provide documentation, automation scripts, and training for internal teams.

Work environment:
Candidates must be able to work in the office 1 day a week and, eventually, 5 days a week when notified.

Education & Experience Required:
- Bachelor’s degree with a minimum of 8 years of work experience; 5+ years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.

Technical Skills:
- 5+ years of experience in server administration, with at least 3 years focused on NVIDIA GPU-based systems.
- Deep understanding of Linux system administration, especially in HPC or AI environments.
- Hands-on experience with NVIDIA GPU drivers, CUDA toolkit, and performance tuning.
- Familiarity with Slurm, Kubernetes, or other job scheduling and orchestration tools.
- Experience with monitoring tools (e.g., Prometheus, Grafana) and infrastructure automation (e.g., Ansible, Terraform).
- Strong scripting skills (Bash, Python, etc.).
- Excellent problem-solving and communication skills.

Desired:
- NVIDIA Certified Professional or similar credentials.
- Experience with multi-GPU and multi-node training setups.
- Familiarity with AI/ML frameworks (e.g., PyTorch, TensorFlow) and their GPU dependencies.
- Exposure to cloud-based GPU infrastructure (AWS, Azure, GCP).
