AMD’s tweaked RDNA 3.5 GPU is solely focused on improving mobile gaming performance

At an event in Los Angeles last week, AMD went through all the changes it’s introducing with the Zen 5 CPU architecture in more detail. For a brief while, the chip giant also explained what’s new in RDNA 3.5, a “fractional improvement” that’s been “bolted onto” the current graphics processor design. In short, it’s all about optimising rendering performance in mobile applications.

The updated design was introduced by Mark Papermaster, AMD’s chief technology officer, who began by pointing out that the changes stemmed from its collaboration with Samsung, which licenses AMD’s graphics tech for the Exynos range of smartphone and tablet processors.

“A lot of the techniques are ideal for notebooks,” he said. “They’re ideal to give you that same great Radeon graphics experience, but at a much-reduced power and a much higher efficiency.”

There are no sweeping changes, but that’s to be expected from the architecture’s name. RDNA 3.5 exists to ease some of the performance bottlenecks that AMD’s GPUs run into when used in low-power, low shader count configurations, namely the integrated Radeon GPUs in its mobile APUs, used in laptops and most handheld gaming PCs.

In the case of the latter, these typically run with power budgets of 15 W or so, and while they can be given more, it’s still significantly less power than even the lowest-end discrete GPUs get to enjoy. For example, a Radeon RX 6400 can use up to 54 W, which is 80% more power than the GPU in the Asus ROG Ally can demand.
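
To put those power figures side by side, here is a quick back-of-the-envelope calculation. The 54 W, 80%, and 15 W numbers come from the text above; the Ally's peak figure is only what the 80% claim implies, not a separately quoted spec, and the variable names are purely illustrative.

```python
RX_6400_MAX_W = 54.0        # from the article
EXTRA_OVER_ALLY = 0.80      # "80% more power", from the article
HANDHELD_TYPICAL_W = 15.0   # "15 W or so", from the article

# Peak power implied for the ROG Ally by the 80% figure (derived, not quoted).
implied_ally_peak_w = RX_6400_MAX_W / (1 + EXTRA_OVER_ALLY)

# How the RX 6400's ceiling compares with a typical 15 W handheld budget.
ratio_vs_typical = RX_6400_MAX_W / HANDHELD_TYPICAL_W

print(f"Implied Ally peak: {implied_ally_peak_w:.0f} W")
print(f"RX 6400 vs 15 W budget: {ratio_vs_typical:.1f}x")
```

In other words, the comparison implies roughly 30 W for the Ally at full tilt, and the entry-level desktop card can draw about 3.6 times what a handheld typically gets.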

Coupled with having a small number of Compute Units (CUs), it means that certain rendering operations that would normally be of no concern to a desktop GPU become more of a limiting factor in the overall performance. The first that Papermaster identified was the texture sampling rate.

In RDNA 3, each CU houses four texture units, each of which can sample and return one bilinearly-filtered texel per clock cycle. Papermaster said that AMD has doubled this figure to eight in RDNA 3.5, though you might be wondering why. Low-power integrated GPUs don’t run as fast as discrete desktop chips and, coupled with the fact that they use system memory for VRAM, this makes texturing quite a slow process for mobile GPUs.

By doubling the sampler count, the chip can fetch twice as many texels per clock cycle, making up for the lower core clocks. The lack of VRAM bandwidth isn’t necessarily an issue because texture sampling incurs huge latencies, anyway.
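
As a rough illustration of that trade-off, peak bilinear texel throughput can be modelled as CUs × samplers per CU × clock. The sketch below takes the doubling at face value; the 12 CU count and 2.0 GHz clock are hypothetical values chosen for the example, not figures from AMD, and `texel_rate_gtexels` is just a helper name.

```python
def texel_rate_gtexels(cus: int, samplers_per_cu: int, clock_ghz: float) -> float:
    """Peak bilinear texels per second, in gigatexels/s (one texel per sampler per clock)."""
    return cus * samplers_per_cu * clock_ghz

# Hypothetical iGPU: 12 CUs at 2.0 GHz (illustrative numbers only).
CUS = 12
CLOCK_GHZ = 2.0

rdna3_rate = texel_rate_gtexels(CUS, samplers_per_cu=4, clock_ghz=CLOCK_GHZ)
rdna35_rate = texel_rate_gtexels(CUS, samplers_per_cu=8, clock_ghz=CLOCK_GHZ)

# Clock a four-sampler design would need to match the eight-sampler one.
clock_to_match_ghz = rdna35_rate / (CUS * 4)

print(f"4 samplers/CU: {rdna3_rate:.0f} GTexels/s")
print(f"8 samplers/CU: {rdna35_rate:.0f} GTexels/s")
print(f"Clock needed to match with 4 samplers: {clock_to_match_ghz:.1f} GHz")
```

Under these assumptions, a four-sampler design would need twice the clock speed to keep up, which is exactly what a power-limited chip can't afford.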

However, AMD’s presentation slides state that this doubling is only for “a subset of the most common texture sampling operations”, so it’s perhaps not as clear-cut as just doubling the texture unit count. I wonder if this is actually more a case of improvements in how some of the vector memory image instructions are processed. When I eventually get my hands on an RDNA 3.5 GPU, I’ll hopefully be able to delve into what has actually been doubled.

Vector-heavy operations that involve the interpolation or comparison of values have also been given a twofold performance boost. These involve multiple data reads from the vector register files, and since their performance is affected by clock speeds, it makes some sense to improve matters here. I’m not entirely convinced that such routines are a significant bottleneck for integrated GPUs, but obviously AMD feels that they are.

Or it could be to do with all the other changes that have been implemented in RDNA 3.5 that are about improving memory management. Since iGPUs have neither the power budget to enjoy super-high clock speeds nor the room for copious amounts of cache, every cycle saved in any kind of memory operation is a good thing in the mobile world.

“Overall, RDNA 3.5 is simply laser-focused on both the memory and the shader execution to significantly improve our efficiency of graphics and yet deliver that same Radeon experience that our customers expect.”

Mark Papermaster, AMD

These include a new instruction that detects if a single-use write operation has been issued and allows that to be skipped, letting the GPU move on to the next instruction. Data writes, especially to RAM, can be very slow, whereas a bunch of vector multiplications can be done in a few cycles.

The way that primitives (groups of vertices that make up a shape) are processed in batches has been fine-tuned to take more advantage of spatial locality. The data for primitives is naturally grouped in cache or RAM, so if you run through an operation to generate a memory address to fetch data, there’s a very high chance that the very next address will belong to the same primitive (aka spatial locality). Improving how all this is managed means fewer system memory accesses and address operations are required.
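
The effect of spatial locality is easy to see in a toy model. The sketch below is not how AMD's hardware batches primitives; it is a minimal LRU cache simulation in which the line size, vertex size, and cache capacity are all assumed values. It counts how many cache-line fetches are needed when triangles are processed in storage order versus a shuffled order.

```python
from collections import OrderedDict
import random

LINE_BYTES = 64     # assumed cache-line size
VERTEX_BYTES = 16   # assumed size of one vertex record
CACHE_LINES = 8     # deliberately tiny cache, to mimic a cache-starved iGPU

def count_misses(vertex_indices, cache_lines=CACHE_LINES):
    """Count cache-line fetches for a stream of vertex reads, using LRU eviction."""
    cache = OrderedDict()
    misses = 0
    for idx in vertex_indices:
        line = (idx * VERTEX_BYTES) // LINE_BYTES
        if line in cache:
            cache.move_to_end(line)        # mark as most recently used
        else:
            misses += 1                    # line must be fetched from memory
            cache[line] = True
            if len(cache) > cache_lines:
                cache.popitem(last=False)  # evict least recently used line
    return misses

# 1,024 triangles stored back to back: triangle t uses vertices 3t, 3t+1, 3t+2.
triangles = [(3 * t, 3 * t + 1, 3 * t + 2) for t in range(1024)]

in_order = [v for tri in triangles for v in tri]

shuffled = triangles[:]
random.Random(0).shuffle(shuffled)         # fixed seed for repeatability
scattered = [v for tri in shuffled for v in tri]

ordered_misses = count_misses(in_order)
scattered_misses = count_misses(scattered)

print(f"Storage order: {ordered_misses} line fetches")
print(f"Shuffled order: {scattered_misses} line fetches")
```

Walking the data in storage order touches each 64-byte line exactly once, whereas the shuffled order keeps re-fetching lines that have already been evicted. Every avoided fetch is a trip to system memory that the GPU doesn't have to make.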

RDNA 3.5 also sports better memory compression algorithms and the iGPU’s memory controller has been properly optimised for LPDDR5, the RAM of choice for handheld gaming PCs and an increasing number of laptops. Accessing system memory for graphics routines isn’t just slow, it’s also very energy-inefficient compared to cache.

To summarise it all, it’s about doing more for the same or a smaller amount of energy, and to that end, AMD provided a performance comparison between an RDNA 3.5-powered Strix Point APU and an RDNA 3 Hawk Point. Specifically, it was a Ryzen AI 9 HX 370 versus a Ryzen 7 8840U, both capped to 15 W.

The new GPU is around 32% faster than the previous generation in the old 3DMark Time Spy benchmark and 19% faster in the lightweight Night Raid test. At face value, those figures look impressive, but the HX 370 sports 33% more CUs than the 8840U. While we don’t know what clock speeds the GPUs in those chips were running at, the fact that one of them has substantially more shader units than the other isn’t something one can dismiss.
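
One crude way to factor out the CU advantage is to divide each benchmark uplift by the increase in CU count. This assumes performance scales linearly with CUs, which is unlikely to hold at a fixed 15 W, so treat the result as a rough sanity check rather than a measurement. `per_cu_ratio` is just an illustrative helper.

```python
def per_cu_ratio(speedup_pct: float, extra_cu_pct: float) -> float:
    """Relative performance per CU, new vs old, assuming linear scaling with CU count."""
    return (1 + speedup_pct / 100) / (1 + extra_cu_pct / 100)

EXTRA_CUS_PCT = 33  # HX 370 vs 8840U, from the article

time_spy = per_cu_ratio(32, EXTRA_CUS_PCT)    # 32% faster in Time Spy
night_raid = per_cu_ratio(19, EXTRA_CUS_PCT)  # 19% faster in Night Raid

print(f"Time Spy per-CU ratio:   {time_spy:.2f}")
print(f"Night Raid per-CU ratio: {night_raid:.2f}")
```

On that simplistic basis, the per-CU figures come out at roughly 0.99 for Time Spy and 0.89 for Night Raid, meaning the headline gains are no larger than the extra shader hardware alone would suggest.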

But taking the Strix Point chip’s scores as an overall view of its capabilities is worth doing. I ran Time Spy and Night Raid on my ROG Ally, set to 15 W, and got results of 2,915 and 19,994 respectively, which are 16% and 52% slower than the Ryzen AI 9 HX 370. How much of that is down to CPU cores, shader counts, and clock speeds is anyone’s guess right now, but it bodes well for gaming at least.

For now, however, I’m not convinced that the RDNA 3.5 updates will have all that much of a say in gaming compared to the increase in CPU cores and shader count.
