Archive | April, 2012

Mobile GPUs : Architectures

5 Apr

Mobile GPUs series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook


I am happy to present to you the third post of this Mobile-GPU blog series.

Today we will dig a bit more into technical issues, at least in comparison to the previous posts. In general my focus for this part is on the leading Mobile GPU architectures such as Adreno (Qualcomm), GeForce ULP (nVidia), Mali (ARM) and PowerVR (Imagination). For several reasons I had to drop Vivante in this blog post, even though (according to Jon Peddie Research [1]) Vivante is #2 in GPU-IP business.

There are many ways how to characterise GPU architectures, too many. So lets keep it simple and brief by focusing just on shaders and the way how the rasterisation/rendering is organised. I also like to mention here that in comparison to desktop GPUs from nVidia/AMD only very little is revealed about the actual makeup of mobile GPUs.

Anyway lets get started – to be OpenGL|ES 2.0 compliant it is required to support Vertex and Pixel/Fragment shaders, those shaders can be dedicated or unified. Actually, even to determine the number of shaders is not an easy task, some GPUs have a fine granulated alu/shader grid, some only a few more complex shaders. You see shader is not like shader.

Mobile SoC GPU Comparison – Source AnandTech [3]
Adreno 225 PowerVR SGX 540 PowerVR SGX 543 PowerVR SGX 543MP2 Mali-400 MP4 GeForce ULP Kal-El GeForce
SIMD Name USSE USSE2 USSE2 Core Core Core
# of SIMDs 8 4 4 8 4 + 1 8 12
MADs per SIMD 4 2 4 4 4 / 2 1 ?
Total MADs 32 8 16 32 18 8 ?

In terms of the rendering there are two extremes Immediate-Mode-Rendering (IMR) and Tile-Based-Deferred-Rendering (TBDR). I like to explain it in a bit more detail starting in the past with simple GPU architectures and ending today.

Early GPUs (especially in embedded space) were IMR based, here the CPU did set required parameters directly on the GPU and a write access to a certain register triggered the start of the rendering process. The triangle rasterisation was done in one pass and most often span-wise (top to bottom, left to right). CPU had to wait until GPU is ready again to issue next triangle, obviously this approach causes a lot synchronisation overhead on both CPU and GPU. But GPU evolved, dedicated Caches, Shadow-Registers and Display-List (Command-List) got introduced. In early times, the Display-List was just a small (a few kb) ring-buffer, mainly to decouple CPU and GPU in order to save wait-cycles. Nowadays where most SoCs have Unified Memory Access (UMA), the display list can be stored anywhere in memory.

In all modern GPUs the rasterisation is done with a tile based approach. The SW-Driver buffers as much as possible of the scene and then renders all triangles tile by tile into the framebuffer. The tile-buffer is on-chip-memory, which leads to significantly lower framebuffer related external memory bandwidth consumption. However, the rendering of the tiles itself is often done in more IMR kind of way. So there is still a significant performance difference between rendering a scene from back to front or front to back. Hidden Surface Removal algorithms like Early-Z are not working efficiently in the back to front case, because tons of texels will get fetched even there are not visible. Only PowerVR is offering pixel based deferred rendering.

Adreno GeForce ULP Mali-400 Mali T604- PowerVR
Unified Shader yes no no yes yes

*TBIMR = Tile Based Immediate Mode Rendering – driver buffers as much as possible of the scene and then renders the buffer tile by tile in a normal IMR manner

The Adreno architecture is based on AMD Z430 of the Imageon Family which was introduced by ATI in 2002. Based on what is stated in [2] and [3] it seems that Adreno is rendering tile-based but each tile in a IMR kind of way. The HW seems to be optimised for triangle stripes, which very likely causes a big load on the CPU. Because the CPU then has to perform the task to transform meshes into triangle stripes. Another interesting aspect, clipping seems to be a big performance issue on Adreno architecture. My educated guess, the clipping algorithm does not just clip the triangle geometry alone, but also calculates for each new vertex the clipped attributes like texture coordinates etc. Maybe a quick example will explain it better, a triangle needs to get clipped, lets say this leads to 2 new vertices (one get dropped). Per vertex attributes are Texture Coordinates (2D), Vertex Colour (4D) and Vertex Normal (4D). The clipper then calculates the two new vertices but also the related attributes. In our case this would roughly lead to extra (2D+4D+4D=10) 10*2 multiply add operations. On top the attribute rasterisation setup, which now needs to be done 2 times, instead of 1 time (in our case 1 triangle became 2 triangles). I also would guess, that the vertex shading and primitive assembly (including clipping) is done per tile. This approach is in terms of HW complexity a lot simpler but very costly in terms of cycle count. Lets say you have a 800×480 Framebuffer and using 32×32 Tiles, then you need to touch all scene data 375 times to render the whole Framebuffer. Which does not sound very effective to me.

GeForce ULP
In general nVidia benefits heavily from its Desktop GPU expertise, so it is not a surprise that the GeForce ULP deployed in Tegra 2 is based on NV40 architecture. The NV40 architecture got introduced 2004 and supports OpenGL2 and DirectX 9c. Because it is a common architecture I will not dig into details, however, more details can be found here [6].

Has a modern straightforward architecture. The rendering is done tile based, vertex shading one time per frame, rasterisation in homogeneous space [4] no explicit clipping required. Early-Z and 4 levels of depth/stencil hierarchy supported, which enables 4MSAA nearly for free. The shaders are SIMD/VLIW based, but more details about the shaders itself are not revealed so far. The Mali T604/658 (Midgard architecture) is supporting Framebuffer compression and a new texture compression format called ASTC [5]. Midgard is also going big steps towards general GPGPU applications, by supporting 64bit data types natively and more importantly CCI-400. The tight connection between CPU and GPU is a very powerful solutions which provides high flexibility and excellent performance on low-power footprint.

PowerVR is on the market for quite sometime now, because of that I will be brief on the architecture itself. Biggest difference to Andreno, GeFore ULP and Mali is the fact that PowerVR is deferring the rendering on per-pixel-base, the others are only deferring on per-tile-base. It means that PowerVR, for instance, tries to fetch texel only for pixels which contribute to final framebuffer pixel. I have to admit, this architecture works perfectly in fixed-function (OpenGL|ES 1.1) and low-poly-count Use-Cases. But in Use-Cases where you have tons of triangles and complex shaders I suppose the performance will drop significantly, maybe even under what you can achieve on other GPUs.

And once again, thanks for your time.



[2] AdrenoTM 200 Performance Optimization –OpenGL ES Tips and Tricks [March 2010]