Member of Technical Staff - Inference
IT
San Francisco, CA, USA
Posted on Jul 1, 2026
Optimize token processing down to the lowest layers of the stack. You'll optimize kernel performance, develop new scheduling and parallelism strategies, and help us squeeze every FLOP out of our hardware.

What you’ll do:
Modify and extend state-of-the-art inference engines like vLLM and SGLang.
Understand every microsecond of GPU time during a forward pass; be able to explain every kernel launch on an NSys profile.
Design and implement exotic parallelism schemes to work with 'interesting' hardware topologies.
Write custom GPU kernels to excel in specific regimes, such as cascade attention.

What we’re looking for:
Strong understanding of LLM mechanics (KV cache, mixture-of-experts, prefill vs. decode phases).
Interest in MLSys research (speculative decoding, sparse attention).
Familiarity with modern, tile-based GPU programming (Triton, CUTLASS, ThunderKittens), or interest in learning these.