Mobile GPUs : Architectures

5 Apr

The Mobile GPUs series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook


I am happy to present to you the third post of this Mobile-GPU blog series.

Today we will dig a bit deeper into technical issues, at least in comparison to the previous posts. In general, my focus for this part is on the leading mobile GPU architectures: Adreno (Qualcomm), GeForce ULP (nVidia), Mali (ARM) and PowerVR (Imagination). For several reasons I had to drop Vivante from this blog post, even though (according to Jon Peddie Research [1]) Vivante is #2 in the GPU-IP business.

There are many ways to characterise GPU architectures – too many. So let's keep it simple and brief by focusing just on the shaders and on how the rasterisation/rendering is organised. I would also like to mention that, in comparison to desktop GPUs from nVidia/AMD, only very little is revealed about the actual makeup of mobile GPUs.

Anyway, let's get started. To be OpenGL|ES 2.0 compliant, a GPU is required to support vertex and pixel/fragment shaders; these shaders can be either dedicated or unified. Actually, even determining the number of shaders is not an easy task: some GPUs have a finely granulated ALU/shader grid, others only a few more complex shaders. You see, one shader is not like another.

Mobile SoC GPU Comparison – Source: AnandTech [3]

|               | Adreno 225 | PowerVR SGX 540 | PowerVR SGX 543 | PowerVR SGX 543MP2 | Mali-400 MP4 | GeForce ULP | Kal-El GeForce |
|---------------|------------|-----------------|-----------------|--------------------|--------------|-------------|----------------|
| SIMD Name     | –          | USSE            | USSE2           | USSE2              | Core         | Core        | Core           |
| # of SIMDs    | 8          | 4               | 4               | 8                  | 4 + 1        | 8           | 12             |
| MADs per SIMD | 4          | 2               | 4               | 4                  | 4 / 2        | 1           | ?              |
| Total MADs    | 32         | 8               | 16              | 32                 | 18           | 8           | ?              |

In terms of rendering there are two extremes: Immediate-Mode Rendering (IMR) and Tile-Based Deferred Rendering (TBDR). I would like to explain this in a bit more detail, starting in the past with simple GPU architectures and ending with today's designs.

Early GPUs (especially in the embedded space) were IMR based: the CPU set the required parameters directly on the GPU, and a write access to a certain register triggered the start of the rendering process. The triangle rasterisation was done in one pass and most often span-wise (top to bottom, left to right). The CPU had to wait until the GPU was ready again before issuing the next triangle; obviously this approach causes a lot of synchronisation overhead on both CPU and GPU. But GPUs evolved: dedicated caches, shadow registers and display lists (command lists) were introduced. In the early days, the display list was just a small (a few KB) ring buffer, mainly there to decouple CPU and GPU in order to save wait cycles. Nowadays, where most SoCs have Unified Memory Access (UMA), the display list can be stored anywhere in memory.
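To illustrate the decoupling idea, here is a toy Python model of such a command ring buffer. All names, the command format and the capacity are made up purely for illustration; a real display list holds packed register writes and draw commands, and the producer/consumer pointers live in hardware registers.

```python
from collections import deque

class CommandRingBuffer:
    """Toy model of an early display list: a small ring buffer that
    decouples the CPU (producer) from the GPU (consumer)."""

    def __init__(self, capacity_cmds=16):
        self.capacity = capacity_cmds
        self.buffer = deque()

    def cpu_submit(self, cmd):
        """CPU writes a command; returns False if it would have to stall."""
        if len(self.buffer) >= self.capacity:
            return False               # ring full -> CPU burns wait cycles
        self.buffer.append(cmd)
        return True

    def gpu_consume(self):
        """GPU pops the next command when it is ready."""
        return self.buffer.popleft() if self.buffer else None

ring = CommandRingBuffer(capacity_cmds=2)
assert ring.cpu_submit(("SET_REG", 0x10, 0xFF))
assert ring.cpu_submit(("DRAW_TRI", 0))
assert not ring.cpu_submit(("DRAW_TRI", 1))    # full: CPU must wait
ring.gpu_consume()                             # GPU frees a slot
assert ring.cpu_submit(("DRAW_TRI", 1))        # CPU can proceed again
```

The point of the model: the larger the buffer, the less often the producer stalls, which is exactly why moving the display list from a few-KB on-chip ring into UMA main memory removed most of the CPU/GPU synchronisation overhead.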

In all modern mobile GPUs the rasterisation is done with a tile-based approach. The SW driver buffers as much of the scene as possible and then renders all triangles tile by tile into the framebuffer. The tile buffer is on-chip memory, which leads to significantly lower framebuffer-related external memory bandwidth consumption. However, the rendering of the tiles themselves is often done in a more IMR-like way, so there is still a significant performance difference between rendering a scene back to front or front to back. Hidden Surface Removal algorithms like Early-Z do not work efficiently in the back-to-front case, because tons of texels get fetched even though they are not visible. Only PowerVR offers pixel-based deferred rendering.
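A tiny Python model makes the ordering dependence of Early-Z visible. It is my own illustration, reduced to a single pixel with opaque geometry and a "less" depth test; real hardware tests whole quads or tiles at once, but the asymmetry is the same.

```python
def shaded_fragments(draw_order_depths):
    """Count fragments that reach texturing at one pixel under Early-Z.
    Depths: smaller = closer; opaque geometry; depth test 'less'."""
    z_buffer = float("inf")
    shaded = 0
    for z in draw_order_depths:
        if z < z_buffer:       # Early-Z passes -> texels get fetched
            shaded += 1
            z_buffer = z
    return shaded

layers = [1, 2, 3, 4, 5]                # five overlapping opaque layers
front_to_back = shaded_fragments(layers)         # only the nearest is shaded
back_to_front = shaded_fragments(layers[::-1])   # every layer is shaded
```

With front-to-back submission only one fragment per pixel is textured; back to front, all five are. This is the overdraw cost the paragraph above refers to.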

|                | Adreno | GeForce ULP | Mali-400 | Mali T604+ | PowerVR |
|----------------|--------|-------------|----------|------------|---------|
| Unified Shader | yes    | no          | no       | yes        | yes     |

*TBIMR = Tile-Based Immediate-Mode Rendering – the driver buffers as much of the scene as possible and then renders the buffer tile by tile in a normal IMR manner.

The Adreno architecture is based on the AMD Z430 of the Imageon family, which was introduced by ATI in 2002. Based on what is stated in [2] and [3], it seems that Adreno renders tile-based, but each tile in an IMR kind of way. The HW seems to be optimised for triangle strips, which very likely puts a big load on the CPU, because the CPU then has to transform meshes into triangle strips. Another interesting aspect: clipping seems to be a big performance issue on the Adreno architecture. My educated guess is that the clipping algorithm does not just clip the triangle geometry alone, but also calculates, for each new vertex, the clipped attributes such as texture coordinates. A quick example may explain it better: a triangle needs to be clipped, and let's say this leads to 2 new vertices (one gets dropped). The per-vertex attributes are texture coordinates (2D), vertex colour (4D) and vertex normal (4D). The clipper then calculates not only the two new vertices but also the related attributes; in our case this leads to roughly (2D+4D+4D=10) 10*2 extra multiply-add operations. On top of that comes the attribute rasterisation setup, which now needs to be done 2 times instead of 1 (in our case, 1 triangle became 2 triangles). I would also guess that the vertex shading and primitive assembly (including clipping) are done per tile. This approach is a lot simpler in terms of HW complexity, but very costly in terms of cycle count: say you have an 800×480 framebuffer and use 32×32 tiles, then you need to touch all scene data 375 times to render the whole framebuffer. That does not sound very effective to me.
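The numbers above can be reproduced with a few lines of Python. This is pure back-of-the-envelope arithmetic; the tile size and the attribute list are simply the ones assumed in the text, not figures from Qualcomm.

```python
import math

def tile_count(fb_w, fb_h, tile_w=32, tile_h=32):
    """Number of tiles (and hence scene-data passes, if geometry is
    re-processed per tile) needed to cover a framebuffer."""
    return math.ceil(fb_w / tile_w) * math.ceil(fb_h / tile_h)

assert tile_count(800, 480) == 375   # scene data touched 375 times

# Clipping-cost example from the text: one clipped triangle produces
# 2 new vertices; per-vertex attributes are a 2D texcoord, a 4D colour
# and a 4D normal = 10 components, each interpolated with roughly one
# multiply-add.
attr_components = 2 + 4 + 4          # 10 components per vertex
new_vertices = 2
extra_madds = attr_components * new_vertices   # 20 extra MADD operations
```

At 720p or 1080p resolutions the tile count (and with it the per-tile geometry overhead) grows accordingly, which is why the approach scales poorly.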

GeForce ULP
In general, nVidia benefits heavily from its desktop GPU expertise, so it is no surprise that the GeForce ULP deployed in Tegra 2 is based on the NV40 architecture. The NV40 architecture was introduced in 2004 and supports OpenGL 2.0 and DirectX 9.0c. Because it is a well-known architecture I will not dig into details here; more details can be found in [6].

Mali
Mali has a modern, straightforward architecture. The rendering is done tile-based, vertex shading is performed once per frame, and rasterisation happens in homogeneous space [4], so no explicit clipping is required. Early-Z and 4 levels of depth/stencil hierarchy are supported, which enables 4x MSAA nearly for free. The shaders are SIMD/VLIW based, but further details about the shaders themselves have not been revealed so far. The Mali T604/658 (Midgard architecture) supports framebuffer compression and a new texture compression format called ASTC [5]. Midgard is also taking big steps towards general GPGPU applications by supporting 64-bit data types natively and, more importantly, CCI-400. The tight connection between CPU and GPU is a very powerful solution which provides high flexibility and excellent performance with a low power footprint.
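To give an idea of how a depth hierarchy saves work, here is a simplified two-level sketch in Python. This is my own illustrative model, not ARM's implementation; real hardware keeps several hierarchy levels and per-tile storage, but the principle of rejecting whole blocks with one coarse compare is the same.

```python
class HiZBlock:
    """Toy 2-level depth hierarchy: a per-pixel depth buffer for a 4x4
    block, plus one coarse z-max value covering the whole block."""

    def __init__(self, size=4):
        self.size = size
        self.z = [[1.0] * size for _ in range(size)]   # far plane = 1.0
        self.z_max = 1.0                               # coarse level

    def test_and_update(self, prim_z_min, fragment_depths):
        """Returns the number of depth writes for one primitive.
        fragment_depths: list of (x, y, z) covered samples."""
        # Coarse test: if the primitive's nearest depth is behind the
        # farthest stored depth, every fragment would fail the depth
        # test -> reject the whole block with zero per-pixel work.
        if prim_z_min >= self.z_max:
            return 0
        written = 0
        for (x, y, z) in fragment_depths:
            if z < self.z[y][x]:        # fine-grained depth test 'less'
                self.z[y][x] = z
                written += 1
        self.z_max = max(max(row) for row in self.z)
        return written

block = HiZBlock()
near = [(x, y, 0.2) for y in range(4) for x in range(4)]
assert block.test_and_update(0.2, near) == 16   # near quad fills block
assert block.test_and_update(0.6, []) == 0      # farther prim: coarse reject
```

With four hierarchy levels instead of two, large occluded triangles are rejected after a handful of compares, which is also what makes the extra depth samples needed for MSAA cheap.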

PowerVR has been on the market for quite some time now, so I will be brief on the architecture itself. The biggest difference to Adreno, GeForce ULP and Mali is the fact that PowerVR defers the rendering on a per-pixel basis, while the others only defer on a per-tile basis. This means that PowerVR, for instance, fetches texels only for pixels which contribute to the final framebuffer. I have to admit, this architecture works perfectly in fixed-function (OpenGL|ES 1.1) and low-poly-count use cases. But in use cases with tons of triangles and complex shaders, I suppose the performance will drop significantly, maybe even below what you can achieve on other GPUs.
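The per-pixel deferral can be sketched in a few lines of Python. Again a toy model of the idea, not Imagination's pipeline: inside one tile, visibility is resolved for every pixel first, and texels are fetched only for the single winning (nearest, opaque) fragment, regardless of submission order.

```python
def tbdr_texel_fetches(fragments):
    """Toy model of per-pixel deferred rendering inside one tile.
    fragments: list of (pixel, depth) in submission order; smaller
    depth = closer. Returns the number of texturing passes."""
    nearest = {}                           # pixel -> nearest depth so far
    for pixel, z in fragments:
        if z < nearest.get(pixel, float("inf")):
            nearest[pixel] = z
    # Shading/texturing happens only now, once per covered pixel.
    return len(nearest)

# Five opaque layers covering the same pixel, submitted back to front:
layers = [(0, z) for z in (5, 4, 3, 2, 1)]
assert tbdr_texel_fetches(layers) == 1     # order-independent: one fetch
```

Compare this with the Early-Z model earlier in the post, where the same back-to-front submission costs five texturing passes. The flip side is that the visibility pass has to buffer and sort all scene geometry per tile, which is exactly where high poly counts hurt.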

And once again, thanks for your time.



[2] Adreno™ 200 Performance Optimization – OpenGL ES Tips and Tricks (March 2010)






5 Responses to “Mobile GPUs : Architectures”

  1. dave May 10, 2012 at 11:59 pm #

    Modern GPUs are tile based??? What have you been smoking? Tile methods are dreadful for high-complexity rendering. PowerVR failed completely in the PC GPU market for this reason. As mobile devices need more powerful graphics, tile-render architectures will die out completely.

    PowerVR only gained (temporary) success in the mobile space because the early stages needed power efficiency on low-triangle scenes with quite high texture rates. Tegra2 is NOT tile-based (as its poor performance against Mali and PowerVR in tile-friendly benchmarks proves). Tegra2 uses large numbers of internal caches to reduce access to external memory, which is not the same thing as tiles.

    Your table also quotes completely wrong specs and performance figures. Tegra2 has a shader clock 1/2 of its quoted GPU clock. A 250 MHz Tegra2 GPU is thus only 1/2 as fast as a 250 MHz Mali-400MP1 (single pixel pipeline version) in pixel operations.

    Never ever trust figures you see on sites like ‘Anandtech’. Never trust implied performance figures from manufacturers either. Real-world tests are the only judge. From such tests, ARM is honest in its clock rates, whereas Nvidia quotes dishonest clock rates.

    Adreno is out of the game. PowerVR quoted figures are also exaggerated (as benchmarks prove). Nvidia’s Tegra3 fails to catch up to ARM’s Mali-400MP4 (the 4 pixel pipe part). Tegra3 actually has 1/2 the pixel rate of this part.

    Nvidia knows its current Tegra range is terrible, which is why Tegra4 and on are different designs, with far higher performance. ARM knows that the tile modes of its GPU will be less and less useful, as image scene complexity grows. PowerVR looks like Apple will be its last customer, before it either goes bust, or is bought by Apple. AMD is due to enter the ARM market with its GCN graphic core (again, nothing to do with ’tiling’).

    • bastianzuehlke May 13, 2012 at 10:37 am #

      I agree that Adreno will vanish over the next years, and I do not believe that Eric Demers [1] can stop the downturn. I already speculated about this (“Eventually, Qualcomm will become a licensee”) in my first blog post (October 2011) of this series.

      Actually, I was also quite disappointed by the performance of the Nvidia Tegra 2, and I cannot understand why Tegra 3 comes with the same architecture, even after first announcements had indicated differently. However, Nvidia has massive know-how, so I do expect better from Tegra 4. But I still like the performance I get on my iPad2 and Galaxy S2.

      In regards to the AnandTech table, I am showing it basically to give some details about the shader structures. I did not discuss the depicted performance numbers themselves; as you have quite rightly stated, most of the time they are bogus. The only way to compare GPU cores is with appropriate use cases under real-life conditions. When I was working at NXP, I was the technical lead of a team which evaluated several GPUs for an STB SoC [2]. Our approach was: we developed a few STB-related use-case demos with OpenGL|ES, and we also provided the SoC memory type, data width, frequency and expected latency figures to the GPU-IP vendors. As a result we got from them the FPS (frames per second) and a detailed list of required read/write memory burst accesses. We also implemented our own GPU simulator to verify the given results. This was actually necessary, because some GPU-IP vendors cheated like hell. Finally, after taking everything (technical, business and political issues) into account, management decided for the PowerVR SGX 531. To finish this topic: obviously, I cannot reveal any details here which are not already available in the public domain, so I have to stick to articles from AnandTech, whether I like it or not.

      Dave, I would be grateful if you could point me to some more reliable sources. Besides, my next blog entry is about benchmarking, to do so I bought a couple of devices:
      HTC Wildfire S (Adreno 200), Samsung Galaxy S2 (Mali400MP4), Apple iPad2 (PowerVR SGX 543MP2), Acer A100 (GeForce ULP), ..

      Tile-based or not tile-based, that's the question. It seems we both have slightly different definitions of tile-based. In your definition, the framebuffer is divided into equally sized, non-overlapping rectangles – tiles. The complete scene is buffered, and an eglSwapBuffers-like command initiates/triggers the rendering. The rendering itself is done tile by tile. Besides that, the GPU has (at least one) dedicated tile buffer memory which stores colour, depth, stencil and optionally other GPU-specific values. My definition is more general and less strict: all modern GPUs divide the framebuffer into rectangular parts, with one or more hierarchies. Those rectangular parts/areas/segments/sections – tiles – might overlap (overlapping makes pixel-order synchronisation more painful) and could have variable sizes. Often there is no dedicated tile cache, but several other caches. All modern GPUs buffer triangle batches (not always the complete scene) and then render these batches in a kind of tile. No current render unit rasterises triangles separately and completely span by span; that would be cache inefficient.

      However, I have to admit that maybe I should use another term; “tile-based” is too strongly associated with the PowerVR architecture.

      To finish this reply: today, May 2012, I am pleased with the performance of Mali and SGX, and disappointed by Adreno and Tegra. I can't say anything about Vivante, DMP or Ziilabs due to a lack of hands-on experience. I am looking forward to the first AMD-based SoC; if you know more about availability, let me know.


  2. Larry Watson June 29, 2012 at 4:20 am #

    Hi, what do you think of Broadcom’s GPU solution, part of the VideoCore IP? It’s shipped with various products as a coprocessor or SoC.

    • bastianzuehlke June 29, 2012 at 7:49 am #

      Obviously Broadcom is a major player in the industry, and the BCM2835 is the heart of the Raspberry Pi, so shame on me for not highlighting the VideoCore solution. However, unfortunately, I do not have any details about the architecture itself. Actually, I would be happy if someone could provide or point me to some details. In general, as I stated in my first post in October:

      In my view, IP vendors especially have to offer an overall comprehensive solution including video, graphics, image and general GPGPU support. At the moment, video and graphics are often two cores (both Ziilabs and Broadcom claim to have a unified GPU+video architecture). I do think that two separate cores are not necessary. In my opinion, a cleverly merged video + GPGPU architecture would reduce Si area and power consumption significantly. It would also reduce the effort/cost in terms of SW development. I personally see big potential here.

  3. Sean Lumly September 27, 2013 at 8:22 pm #

    Just a slight correction (re: TBIMR): Mali T6xx GPUs have recently been revealed to perform an operation called “Forward Pixel Kill”, which allows a fragment shader to stop working if there is a surface in the render queue that has a pixel closer to the display. Thus, without sorting, it can either eliminate drawing a surface or immediately start drawing a surface if it takes priority. This is a very elegant solution to overdraw, and really isn’t IMR in that devs can supply polygons in any order and still reap the benefits of something like PowerVR-style DR.
