Linux Server GPU Engineer
Location: Bethesda, MD
Security Clearance: Must hold an active TS/SCI clearance and be able to obtain a TS/SCI with CI polygraph.
Certification: At a minimum, must meet DoD 8570.11 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification). An IAT Level III certification is also acceptable (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, CCSP).
Our client is seeking a talented Linux Server GPU Engineer to fill a crucial position supporting the National Media Exploitation Center (NMEC). This role demands technical expertise in administering NVIDIA DGX-1 and A100 servers in both physical and virtual environments. The ideal candidate will have the attention to detail needed to address customer inquiries and concerns effectively. Responsibilities include working with administrators to address service inquiries and resolve issues promptly, diagnosing customer problems, and implementing corrective measures to restore service. The role also involves analyzing recurring issues and devising preventive solutions, as well as assessing existing infrastructure for performance enhancements. The engineer will play a key role in providing operational support for systems and software in a large, multi-enclave enterprise environment. Collaboration within a team is essential to meet mission requirements and keep customer capabilities operating smoothly. The engineer may also be required to perform technical software configuration, reboots, and other remedial actions on customer servers. The customer uses an Agile framework to plan and execute all initiatives. The work location for this role is the Intelligence Community Campus in Bethesda.
Responsibilities:
- GPU Architecture and Design: Collaborate with a multidisciplinary team to define, develop, and optimize GPU architectures, ensuring they meet stringent performance, power efficiency, and feature requirements. Leverage industry insights to drive design decisions. Ensure that GPU designs and integrations are not only optimized for Linux but are also adaptable to other operating systems.
- Operating System Integration: Work closely with operating system developers to ensure smooth GPU integration with Linux-based systems. Optimize GPU drivers for compatibility, performance, and reliability in a Linux environment. Provide regular maintenance and updates to ensure continued compatibility.
- Hardware Expertise: Contribute to the design and development of GPU hardware, providing insights into hardware architecture to ensure efficient interaction with software components. Maintain and update hardware designs as needed.
- CUDA (Compute Unified Device Architecture) / OpenCL (Open Computing Language) Programming: Develop and optimize applications using CUDA or OpenCL, harnessing the full potential of GPU hardware for parallel processing, high-performance computing, and machine learning on Linux platforms. Maintain and update software for optimal performance.
- Performance Analysis: Analyze GPU performance, identify bottlenecks, and develop strategies to enhance performance across various applications in Linux, addressing both hardware and software considerations. Regularly monitor and improve performance.
- GPU Tooling: Create and maintain debugging tools, profiling utilities, and performance analysis software tailored for Linux systems to facilitate efficient GPU development and troubleshooting. Keep tools up to date and functional.
- Power Efficiency: Work on power management techniques to optimize GPU power consumption, ensuring efficient operation on both mobile and desktop Linux platforms. Continuously assess and enhance power efficiency strategies.
- Testing and Validation: Design and execute tests to validate GPU performance and functionality on Linux, including stress testing, benchmarking, and debugging to ensure robust operation. Maintain and expand the testing suite.
- Documentation: Maintain comprehensive technical documentation, including architectural specifications, code documentation, and Linux-specific best practices for GPU development. Keep documentation up to date with changes and improvements.
- Industry Insight: Stay updated on the latest trends, innovations, and competitive landscapes within the GPU industry, contributing to research efforts and proposing Linux-specific approaches to GPU design and optimization. Share regular updates and insights with the team.
Minimum Requirements:
- Bachelor's or higher degree in Computer Science, Electrical Engineering, or a related field. Additional years of experience may be considered in lieu of a degree.
- 10+ years of relevant systems engineering experience.
- Proven experience in GPU architecture design and GPU performance optimization.
- Expertise in operating system integration for Linux.
- Strong understanding of computer hardware architecture, particularly as it relates to Linux systems.
- Knowledge of parallel computing, graphics algorithms, and real-time rendering in Linux environments.
- Familiarity with GPU debugging tools and profiling software for Linux.
- Excellent problem-solving skills and the ability to collaborate within a team.
- Strong communication skills for conveying technical information in a Linux context.
- Proficiency with scripting languages such as Python or Bash.
- Proficiency with automation tools such as Ansible, Puppet, Salt, or Terraform.
- Candidate must, at a minimum, meet DoD 8570.11 IAT Level II certification requirements (currently Security+ CE, CCNA-Security, GICSP, GSEC, or SSCP, along with an appropriate computing environment (CE) certification). An IAT Level III certification is also acceptable (CASP+, CCNP Security, CISA, CISSP, GCED, GCIH, CCSP).
Preferred Qualifications:
- Published research or contributions in the GPU industry, especially related to Linux.
- Experience with machine learning and neural network frameworks on GPUs in Linux.
- Knowledge of GPU virtualization, cloud computing, and emerging Linux-based technologies in the field.
- Proficiency in GPU-specific programming languages (such as CUDA or OpenCL).
- Experience with container technologies (Docker, Kubernetes).
- Experience with Prometheus/Grafana for monitoring.
- Knowledge of distributed resource scheduling systems such as Slurm (preferred) or LSF.
- Familiarity with CUDA and managing GPU-accelerated computing systems.
- Basic knowledge of deep learning frameworks and algorithms.