Job Description

Design and implement high-performance Python and C++ code for vLLM-based inference systems, GPU kernels, and numerical methods.

*Telecommuting permitted: work may be performed within normal commuting distance from the Red Hat, Inc. office in Boston, MA.

What You Will Do:

  • Develop, test, and optimize LLM inference algorithms, including quantization and sparsification techniques, to improve latency, throughput, and memory efficiency.

  • Conduct performance profiling and modeling on NVIDIA GPUs using tools such as Nsight, and tune CUDA, Triton, or CUTLASS kernels for deep neural networks.

  • Participate in technical design reviews and propose innovative HPC solutions for large-scale model serving.

  • Review peer code promptly and leverage AI-assisted development tools to uphold code quality standards.

  • Collaborate with cross-functional AI, product, and research teams to deliver features to the Red Hat AI Inference Platform.
