Per-Tensor and Per-Block Scaling Strategies for Effective FP8 Training


In this blog post, we’ll break down the main FP8 scaling strategies: per-tensor scaling with its delayed and current variants, and per-block scaling (including the Blackwell-backed MXFP8 format). We’ll also explain why each is essential for maintaining numerical stability and accuracy during low-precision training. Understanding these approaches will help you choose the right recipe for your own FP8 workflows.
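
To make the distinction concrete, here is a minimal NumPy sketch of how the scale factors differ between the strategies. It is an illustration under stated assumptions, not the recipe any particular library uses: the function names are hypothetical, only the scale factors are computed (no real FP8 cast happens), and the per-block variant simply rounds each scale down to a power of two over 32-element blocks, which is in the spirit of MXFP8 but simplifies how the actual format derives its shared exponent.

```python
import numpy as np

# Largest finite magnitude representable in FP8 E4M3.
E4M3_MAX = 448.0


def per_tensor_scale(x, fp8_max=E4M3_MAX):
    """Current scaling: one scale from the amax of the tensor being cast."""
    amax = float(np.max(np.abs(x)))
    return fp8_max / amax if amax > 0 else 1.0


def delayed_scale(amax_history, fp8_max=E4M3_MAX):
    """Delayed scaling: one scale from amax values seen in earlier steps.

    Using the maximum of the history is an assumption made for this sketch.
    """
    amax = float(max(amax_history))
    return fp8_max / amax if amax > 0 else 1.0


def per_block_scales(x, block=32, fp8_max=E4M3_MAX):
    """Per-block scaling: one power-of-two scale per block of `block` values.

    Assumes x.size is a multiple of `block`. The exponent is rounded down so
    that scaled values never exceed fp8_max.
    """
    blocks = np.asarray(x, dtype=np.float64).reshape(-1, block)
    amax = np.max(np.abs(blocks), axis=1)
    safe = np.where(amax > 0, amax, 1.0)  # avoid dividing by zero
    return np.exp2(np.floor(np.log2(fp8_max / safe)))


# A tensor whose two halves differ in magnitude by 100x.
x = np.concatenate([np.full(32, 1.0), np.full(32, 100.0)])
print(per_tensor_scale(x))   # a single scale, dictated by the large half
print(per_block_scales(x))   # one scale per 32-element block
```

With a single per-tensor scale, the small-magnitude half of the tensor is scaled by the same factor as the large half and so uses only a narrow slice of the FP8 range. The per-block scales let each block fill the range independently, which is the motivation for finer-grained scaling.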

