Department of Computational and Data Sciences
Department Seminar
Speaker: Dr. Ashish Panwar, Microsoft Research India
Title: Enabling Determinism in LLM Inference
Date & Time: January 21st, 2026 (Wednesday), 11:00 AM
Venue: # 102, CDS Seminar Hall
ABSTRACT
In LLM inference, the same prompt may yield different outputs across different runs even when sampling hyper-parameters are fixed. At the system level, this non-determinism stems from the non-associativity of floating-point arithmetic combined with dynamic batching, as GPU kernels adapt their reduction strategies based on the batch size. A straightforward way to enforce determinism is to disable dynamic batching, but this severely degrades throughput. Another approach is to make kernels batch-invariant; however, this tightly couples determinism to kernel design, requiring specialized kernels that apply a universal reduction strategy to all tokens regardless of batch size. This coupling also imposes a fixed runtime overhead on the entire workload, no matter how small the fraction that actually requires determinism.
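As a minimal illustration of the underlying arithmetic (a generic Python sketch, not code from the talk), summing the same values in two different reduction orders, as a batched kernel might, typically produces slightly different results:

    import random

    random.seed(0)
    vals = [random.uniform(-1.0, 1.0) for _ in range(10_000)]

    # One reduction order: plain left-to-right summation.
    seq_sum = sum(vals)

    # A different order: chunked partial sums, as a batched kernel might reduce.
    chunked_sum = sum(sum(vals[i:i + 128]) for i in range(0, len(vals), 128))

    print(seq_sum == chunked_sum)      # typically False
    print(abs(seq_sum - chunked_sum))  # tiny but nonzero difference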
I will present LLM-42, an alternative approach to enable determinism in LLM inference. LLM-42 is inspired by speculative execution and some interesting properties of GPU kernel implementations. Our key observation is that determinism does not require a universal reduction strategy: it suffices that each token position is decoded using a consistent reduction schedule. Moreover, most GPU kernels already use shape-consistent reductions. Leveraging these observations, LLM-42 decodes tokens along a non-deterministic fast path and enforces determinism via a lightweight verify–rollback loop. The verifier replays candidate tokens under a fixed-shape reduction schedule, commits those that are guaranteed to be consistent across runs, and rolls back those violating determinism. By decoupling determinism from kernel design, LLM-42 achieves deterministic inference with unmodified kernels and incurs overhead only in proportion to the traffic that requires determinism.
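The verify–rollback structure can be sketched in a few lines of Python (a toy model, not LLM-42's actual interface; decode_fast and replay_fixed_shape are hypothetical stand-ins for real inference calls). The loop commits the longest candidate prefix that the fixed-schedule verifier reproduces and rolls back the rest, so the committed sequence is identical across runs even though the fast path is not:

    import random

    VOCAB = list(range(100))

    def replay_fixed_shape(context, k):
        # Verifier stand-in: each token depends only on the tokens before
        # it, under one fixed schedule, so results match across runs.
        ctx, out = list(context), []
        for _ in range(k):
            t = random.Random(hash(tuple(ctx))).choice(VOCAB)
            out.append(t)
            ctx.append(t)
        return out

    def decode_fast(context, k):
        # Fast-path stand-in: usually agrees with the fixed schedule, but
        # occasionally flips a token (mimicking batch-dependent rounding).
        tokens = replay_fixed_shape(context, k)
        if random.random() < 0.3:
            tokens[random.randrange(k)] = random.choice(VOCAB)
        return tokens

    def deterministic_decode(max_tokens, window=8):
        committed = []
        while len(committed) < max_tokens:
            candidates = decode_fast(committed, window)
            verified = replay_fixed_shape(committed, window)
            # Commit the longest prefix consistent across both schedules;
            # roll back everything after the first mismatch.
            n = 0
            while n < window and candidates[n] == verified[n]:
                n += 1
            committed.extend(verified[:max(n, 1)])  # verifier output: safe to commit
        return committed[:max_tokens]

    print(deterministic_decode(32))  # identical across runs despite the fast path

Because only verifier output is ever committed, extra work is incurred only where the fast path diverges, mirroring the proportional-overhead property described above.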
BIO: Ashish Panwar is a Principal Researcher at Microsoft Research India, where he explores methods for improving large language model inference. His broader research interests span operating systems, memory systems, and GPUs. Prior to joining Microsoft Research in 2022, he obtained his MSc (Engg) and PhD from the CSA department at IISc, where he was advised by Prof. K. Gopinath and Prof. Arkaprava Basu.
Host Faculty: Jayant Haritsa, CDS
ALL ARE WELCOME