LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query…
LLM context lengths are exploding, and architectures are moving toward complex attention schemes like Multi-Head Latent Attention (MLA) and Grouped Query Attention (GQA). As a result, AI ”speed of thought” is increasingly governed not by the massive throughput of matrix multiplications, but by the transcendental math of the softmax function. Transcendentals refer to functions that cannot be…
