Job Description
Design and implement high-performance Python and C++ code for vLLM-based inference systems, GPU kernels, and numerical methods.
Telecommuting permitted: work may be performed within normal commuting distance from the Red Hat, Inc. office in Boston, MA.
What You Will Do:
Develop, test, and optimize LLM inference algorithms, including quantization and sparsification techniques, to improve latency, throughput, and memory use.
Conduct performance profiling and modeling on NVIDIA GPUs using tools such as Nsight, and tune CUDA, Triton, or CUTLASS kernels for deep neural networks.
Participate in technical design reviews and propose innovative HPC solutions for large-scale model serving.
Review peer code promptly and leverage AI-assisted development tools to uphold code quality standards.
Collaborate with cross-functional AI, product, and research teams to deliver features for the Red Hat AI Inference Platform.