
Mobile GPUs : Architectures

5 Apr

Mobile GPUs series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook


I am happy to present to you the third post of this Mobile-GPU blog series.

Today we will dig a bit more into technical issues, at least in comparison to the previous posts. In general my focus for this part is on the leading mobile GPU architectures: Adreno (Qualcomm), GeForce ULP (nVidia), Mali (ARM) and PowerVR (Imagination). For several reasons I had to drop Vivante from this blog post, even though (according to Jon Peddie Research [1]) Vivante is the #2 in the GPU-IP business.

There are many ways to characterise GPU architectures – too many. So let's keep it simple and brief by focusing just on the shaders and on the way the rasterisation/rendering is organised. I would also like to mention that, in comparison to desktop GPUs from nVidia/AMD, only very little is revealed about the actual makeup of mobile GPUs.

Anyway, let's get started. To be OpenGL|ES 2.0 compliant, a GPU is required to support vertex and pixel/fragment shaders; those shaders can be dedicated or unified. Actually, even determining the number of shaders is not an easy task: some GPUs have a finely granular ALU/shader grid, some only a few more complex shaders. You see, not all shaders are alike.

Mobile SoC GPU Comparison – Source AnandTech [3]

|               | Adreno 225 | PowerVR SGX 540 | PowerVR SGX 543 | PowerVR SGX 543MP2 | Mali-400 MP4 | GeForce ULP | Kal-El GeForce |
|---------------|------------|-----------------|-----------------|--------------------|--------------|-------------|----------------|
| SIMD Name     | –          | USSE            | USSE2           | USSE2              | Core         | Core        | Core           |
| # of SIMDs    | 8          | 4               | 4               | 8                  | 4 + 1        | 8           | 12             |
| MADs per SIMD | 4          | 2               | 4               | 4                  | 4 / 2        | 1           | ?              |
| Total MADs    | 32         | 8               | 16              | 32                 | 18           | 8           | ?              |
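As a sanity check on the table, the "Total MADs" row is just the SIMD count times the MADs per SIMD, with the Mali-400 MP4 split into 4 fragment cores (4 MADs each) plus 1 vertex core (2 MADs) – the "4 + 1" and "4 / 2" entries. A quick sketch (my own naming, nothing vendor-specific):

```python
# Cross-check of the "Total MADs" row: total = SIMDs x MADs-per-SIMD.
gpus = {
    "Adreno 225":         (8, 4),
    "PowerVR SGX 540":    (4, 2),
    "PowerVR SGX 543":    (4, 4),
    "PowerVR SGX 543MP2": (8, 4),
    "GeForce ULP":        (8, 1),
}

total_mads = {name: simds * mads for name, (simds, mads) in gpus.items()}
# Mali-400 MP4 is special: 4 fragment cores + 1 vertex core
total_mads["Mali-400 MP4"] = 4 * 4 + 1 * 2

for name, total in total_mads.items():
    print(f"{name}: {total} MADs")
```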

In terms of rendering there are two extremes: Immediate-Mode Rendering (IMR) and Tile-Based Deferred Rendering (TBDR). I would like to explain this in a bit more detail, starting with the simple GPU architectures of the past and ending with today's designs.

Early GPUs (especially in the embedded space) were IMR based: the CPU set the required parameters directly on the GPU, and a write access to a certain register triggered the start of the rendering process. The triangle rasterisation was done in one pass and most often span-wise (top to bottom, left to right). The CPU had to wait until the GPU was ready again to issue the next triangle; obviously this approach caused a lot of synchronisation overhead on both CPU and GPU. But GPUs evolved: dedicated caches, shadow registers and display lists (command lists) were introduced. In the early days, the display list was just a small (a few KB) ring buffer, mainly to decouple CPU and GPU in order to save wait cycles. Nowadays, where most SoCs have Unified Memory Access (UMA), the display list can be stored anywhere in memory.
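The decoupling effect of such a ring-buffer display list can be sketched in a few lines of Python (the names are my own, not any vendor's API): the CPU appends commands as long as there is space, the GPU consumes them independently, and the two only stall when the buffer is full or empty.

```python
from collections import deque

class CommandRing:
    """Toy display list: a bounded ring buffer between CPU and GPU."""
    def __init__(self, capacity_kb=4, cmd_size=16):
        self.capacity = (capacity_kb * 1024) // cmd_size  # commands that fit
        self.buf = deque()

    def cpu_submit(self, cmd):
        """CPU side: returns False (caller must wait) when the ring is full."""
        if len(self.buf) >= self.capacity:
            return False
        self.buf.append(cmd)
        return True

    def gpu_consume(self):
        """GPU side: returns None when there is nothing left to render."""
        return self.buf.popleft() if self.buf else None

ring = CommandRing()
ring.cpu_submit(("DRAW_TRIANGLE", 0))
ring.cpu_submit(("SET_TEXTURE", 7))
print(ring.gpu_consume())  # -> ('DRAW_TRIANGLE', 0)
```

The point of the sketch: neither side blocks the other unless the buffer runs full or dry, which is exactly the wait-cycle saving described above.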

In all modern mobile GPUs the rasterisation is done with a tile-based approach. The SW driver buffers as much of the scene as possible and then renders all triangles tile by tile into the framebuffer. The tile buffer is on-chip memory, which leads to significantly lower framebuffer-related external memory bandwidth consumption. However, the rendering of the tiles themselves is often done in a more IMR-like way, so there is still a significant performance difference between rendering a scene back to front or front to back. Hidden-surface-removal algorithms like Early-Z do not work efficiently in the back-to-front case, because tons of texels get fetched even though they are not visible. Only PowerVR offers pixel-based deferred rendering.

|                | Adreno | GeForce ULP | Mali-400 | Mali T604+ | PowerVR |
|----------------|--------|-------------|----------|------------|---------|
| Unified Shader | yes    | no          | no       | yes        | yes     |

*TBIMR = Tile-Based Immediate-Mode Rendering – the driver buffers as much of the scene as possible and then renders the buffer tile by tile in a normal IMR manner

Adreno

The Adreno architecture is based on the AMD Z430 of the Imageon family, which was introduced by ATI in 2002. Based on what is stated in [2] and [3], it seems that Adreno renders tile based, but each tile in an IMR kind of way. The HW seems to be optimised for triangle strips, which very likely puts a big load on the CPU, because the CPU then has to transform meshes into triangle strips. Another interesting aspect: clipping seems to be a big performance issue on the Adreno architecture. My educated guess is that the clipping algorithm does not just clip the triangle geometry alone, but also calculates the clipped attributes (texture coordinates etc.) for each new vertex. Maybe a quick example explains it better: a triangle needs to get clipped, and let's say this leads to 2 new vertices (one gets dropped). The per-vertex attributes are texture coordinates (2D), vertex colour (4D) and vertex normal (4D). The clipper then calculates the two new vertices, but also the related attributes. In our case this leads to roughly (2D+4D+4D=10) 10*2 = 20 extra multiply-add operations. On top of that comes the attribute rasterisation setup, which now needs to be done 2 times instead of 1 (in our case 1 triangle became 2 triangles). I would also guess that the vertex shading and primitive assembly (including clipping) are done per tile. This approach is a lot simpler in terms of HW complexity, but very costly in terms of cycle count. Let's say you have an 800×480 framebuffer and use 32×32 tiles – then you need to touch all scene data 375 times to render the whole framebuffer. Which does not sound very effective to me.
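The arithmetic in this paragraph can be sketched in a few lines (the multiply-add count is a rough lower bound – a real clipper also has to compute the interpolation factors themselves):

```python
import math

def clip_cost(attr_components, new_vertices):
    """Rough count of extra multiply-adds for interpolating per-vertex
    attributes during clipping: one lerp per component per new vertex."""
    return attr_components * new_vertices

# Texture coords (2) + vertex colour (4) + vertex normal (4) = 10 components
print(clip_cost(2 + 4 + 4, 2))   # -> 20 extra multiply-adds

def scene_passes(fb_w, fb_h, tile=32):
    """If vertex shading and primitive assembly are redone for every tile,
    the whole scene is touched once per tile."""
    return math.ceil(fb_w / tile) * math.ceil(fb_h / tile)

print(scene_passes(800, 480))    # -> 375
```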

GeForce ULP
In general nVidia benefits heavily from its desktop GPU expertise, so it is no surprise that the GeForce ULP deployed in Tegra 2 is based on the NV40 architecture. The NV40 architecture was introduced in 2004 and supports OpenGL 2 and DirectX 9c. Because it is a well-known architecture I will not dig into details; more details can be found here [6].

Mali

Mali has a modern, straightforward architecture. The rendering is done tile based, vertex shading happens once per frame, and rasterisation in homogeneous space [4] means no explicit clipping is required. Early-Z and a 4-level depth/stencil hierarchy are supported, which enables 4xMSAA nearly for free. The shaders are SIMD/VLIW based, but more details about the shaders themselves have not been revealed so far. The Mali T604/658 (Midgard architecture) supports framebuffer compression and a new texture compression format called ASTC [5]. Midgard is also taking big steps towards general GPGPU applications by natively supporting 64-bit data types and, more importantly, CCI-400. This tight connection between CPU and GPU is a very powerful solution which provides high flexibility and excellent performance with a low-power footprint.
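For readers unfamiliar with rasterisation in homogeneous space: the core idea of [4] is that edge functions can be evaluated directly on clip-space (x, y, w) vertices as 3×3 determinants, so triangles never have to be clipped against the view volume first. A minimal sketch (my own simplification; a real implementation also handles shared-edge fill rules and hierarchical traversal):

```python
def edge(v0, v1, px, py):
    """Homogeneous edge function: det of the 3x3 matrix [v0; v1; (px, py, 1)],
    where v0, v1 are clip-space (x, y, w) vertices (no divide by w needed)."""
    x0, y0, w0 = v0
    x1, y1, w1 = v1
    return (x0 * (y1 - py * w1)
            - y0 * (x1 - px * w1)
            + w0 * (x1 * py - y1 * px))

def inside(tri, px, py):
    """Pixel (px, py) is covered when all three edge functions carry the
    same sign -- valid even for vertices behind the eye, which is why no
    explicit clipping stage is required."""
    e = [edge(tri[i], tri[(i + 1) % 3], px, py) for i in range(3)]
    return all(v > 0 for v in e) or all(v < 0 for v in e)

# With w = 1 this reduces to ordinary 2D rasterisation:
tri = [(0.0, 0.0, 1.0), (4.0, 0.0, 1.0), (0.0, 4.0, 1.0)]
print(inside(tri, 1.0, 1.0))   # -> True
print(inside(tri, 5.0, 5.0))   # -> False
```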

PowerVR

PowerVR has been on the market for quite some time now, so I will be brief on the architecture itself. The biggest difference to Adreno, GeForce ULP and Mali is the fact that PowerVR defers the rendering on a per-pixel basis, while the others only defer on a per-tile basis. It means that PowerVR, for instance, tries to fetch texels only for pixels which contribute to the final framebuffer. I have to admit, this architecture works perfectly in fixed-function (OpenGL|ES 1.1) and low-poly-count use cases. But in use cases with tons of triangles and complex shaders I suppose the performance will drop significantly, maybe even below what you can achieve on other GPUs.
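The per-pixel deferral can be sketched for a single tile as a two-phase process – resolve visibility first, shade exactly once per pixel afterwards (a toy model with opaque, full-pixel fragments only):

```python
def tbdr_tile(tile_fragments):
    """One tile of per-pixel deferred rendering: hidden-surface removal
    runs before any shading, so each covered pixel's texels are fetched
    exactly once, no matter how many opaque layers overlap it or in
    which order they were submitted.
    tile_fragments: {pixel: [fragment depth, ...]}, lower = closer."""
    shaded = {}
    fetches = 0
    for pixel, depths in tile_fragments.items():
        shaded[pixel] = min(depths)  # phase 1: front-most fragment wins
        fetches += 1                 # phase 2: a single texel fetch per pixel
    return shaded, fetches

tile = {(0, 0): [0.4, 0.1, 0.3], (1, 0): [0.2, 0.5]}
shaded, fetches = tbdr_tile(tile)
print(fetches)  # -> 2 fetches for 2 pixels, despite 5 fragments
```

This is also why the approach shines with heavy opaque overdraw but buys little when visibility is cheap to resolve anyway (few triangles) or the cost sits in the shaders.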

And once again, thanks for your time.


[1] http://www.eetimes.com/electronics-news/4304118/Imagination-outstrips-all-other-GPU-IP-suppliers

[2] AdrenoTM 200 Performance Optimization –OpenGL ES Tips and Tricks [March 2010]

[3] http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/3

[4] http://www.ece.unm.edu/course/ece595/docs/olano.pdf

[5] http://blogs.arm.com/multimedia/643-astc-texture-compression-arm-pushes-the-envelope-in-graphics-technology/

[6] http://ixbtlabs.com/articles2/gffx/nv40-part1-a.html


Mobile GPUs : Vendors

6 Mar

Mobile GPUs series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook

Dear Reader,

I am welcoming you to the second post of the Mobile-GPU blog series.

Today we are going to deal with GPU vendors in a brief, alphabetically ordered overview.


ARM

ARM entered the mobile GPU market in 2006 by acquiring a Norwegian company called Falanx [1] – the acquisition happened shortly after the Scandinavians found a licensee for their first Mali GPU.

The Brits took the decision to start a GPU division quite seriously – the Mali team was quickly integrated into the Media Processing Division. The original Trondheim team was heavily extended, and in the US ARM set up an R&D division led by graphics veteran Tom Olson.

The first drop was the Mali55, an OpenGL|ES 1.1 core – next came an OpenGL|ES 2.0 core named Mali200, followed by the Mali400 and Mali400MP. The latest cores are the Mali T-604/658, based on the Midgard architecture; the feature set is what you would expect from desktop solutions, including DirectX 11 and OpenCL (full profile). The highlight for me is the support of ARM's CoreLink Cache Coherent Interconnect technology, which enables sophisticated mobile GPGPU applications.

In general, most licensees of Mali technology are located in Asia (Samsung – Galaxy S2 [5]), but you can also find the IP in the Thor chipset of STE, and even Intel is now a licensee. In terms of sales & marketing, I am sure that ARM benefits from the fact that the Cortex family dominates the embedded CPU market.


DMP

I only have few details about this Japanese company. All I know is that Nintendo is using their GPU IP core for the Nintendo 3DS [3], maybe for the Nintendo DS as well. In terms of feature set, the GPU within the 3DS sits between OpenGL|ES 1.1 and OpenGL|ES 2.0: there is a kind of vertex shader, but no real pixel shader.


A couple of years ago I evaluated their Flash Lite solution on an Altera FPGA, which seemed to be a solid piece of engineering work. In general this French company is focused on FPGA-based graphics solutions for the industrial space. There are no signs of a move towards more advanced 3D cores.


Imagination Technologies

The PowerVR technology from Imagination is today's most successful GPU IP. You can find PowerVR in all iOS devices from Apple, but also in chips from Intel (Atom), TI (OMAP) and many, many more.

How did they start? Some of us might remember that the Sega Dreamcast (1998) was equipped with a PowerVR MBX-like GPU. Even though the Dreamcast was not a big commercial hit, it is surely an important milestone of the PowerVR success story.

Based on a robust financial outlook due to the massive incoming royalty stream [2] and the most comprehensive IP portfolio, it is fair to say that Imagination will be the GPU-IP leader for the next 2-3 years.


LogicBricks

LogicBricks is a rather small company based in Croatia with offices in Germany and Japan. It offers a simple 2D blit engine but also an OpenGL|ES 1.1 solution, mainly for Xilinx FPGAs. The OpenGL|ES 1.1 solution is basically a HW raster and pixel pipeline unit; all the rest runs as SW on the CPU.

Nexus Chips

Based in Seoul and focused on selling GPU chips and GPU IP. The portfolio includes OpenGL|ES 1.1/2.0, OpenVG 1.1 and – quite interestingly – a dedicated Skia core.


Takumi

Another small GPU-IP vendor from Japan, with a focus on OpenVG and OpenGL|ES 1.1. It seems that Takumi was not able to extend its portfolio towards programmable solutions, so I doubt that they will stay in business for long.


TES-DST

TES-DST is more a general engineering-service company than a focused IP vendor. The IP portfolio got stuck at OpenGL|ES 1.1; there is no evidence of a coming programmable solution, even though D/AVE 3D includes programmable shaders. Also a bit odd: the company is no longer a member of the Khronos Group. However, they have just released an advanced 2D core which fits perfectly into niche markets like industrial or medical devices.

Think Silicon

This Greek GPU-IP provider has a focus on very low-end GPUs – low end in terms of Si-area and power consumption, but also feature set. The current flagship IP is an OpenVG-compliant core with fewer than 150K gates. Besides that, Think Silicon is part of LPGPU, an EU-funded research project [4].


Vivante

Most likely the number-three GPU-IP vendor, after Imagination and ARM. Originally founded by nVidia engineers with initial funding from Marvell. A very focused portfolio and strategy, with a strong customer base in Asia. Very cost-effective development: the core team is in Silicon Valley and the major workforce in China.

For me one of the most impressive success stories in the embedded GPU-IP business.


Movidius

The business focus of this Irish company is on premium video-chip solutions for mobile devices. However, the list of company advisors names Ville Miettinen (former CTO of Hybrid Graphics). Due to the fact that their Myriad architecture could handle GPU-like requirements, I would not be surprised if an OpenGL|ES SW stack already exists.


nVidia

Well, I suppose I do not need to say much about this leading GPU company. In the past years, nVidia has invested quite a lot into scalable graphics technology to fit the requirements of high-end PCs and premium mobile devices such as tablets or smartphones. The company is no longer just a GPU-chip vendor; with the latest Tegra chips and by acquiring Icera [6], they are becoming more and more a complete solution provider like Qualcomm.


Qualcomm

The shooting star among the semiconductor companies and one of the winners of the smartphone hype. It was a brilliant move to acquire the mobile graphics section of AMD instead of licensing from a 3rd party. The AMD IP was ready to integrate and surely gave Qualcomm a big cost benefit. However, in the long run I suppose that Qualcomm will start to license 3rd-party GPU IP, because the internal IP development cannot keep up with the competition.


ZiiLABS

Formerly 3DLabs, ZiiLABS are veterans of the GPU business. Today, ZiiLABS is focusing on chips specifically for the Android tablet market. ZiiLABS has an in-house combined video and graphics IP architecture called the StemCell media processing array. To me such a combined solution seems very cost effective, in terms of Si-area and total cost of ownership. According to ZiiLABS the solution is also OpenCL compliant.

And again, thanks for your time.


[1] http://www.arm.com/about/newsroom/13706.php
[2] http://www.eetimes.com/electronics-news/4236158/Imagination-technologies-to-see-big-royalty-uptick-
[3] http://wn.com/nintendo_3ds_n3ds_gpu_specs_features_especificaiones_departure_date_fecha_de_salida
[4] http://lpgpu.org/wp/
[5] http://en.wikipedia.org/wiki/Samsung_Galaxy_S_II
[6] http://pressroom.nvidia.com/easyir/customrel.do?easyirid=A0D622CE9F579F09&version=live&prid=753498&releasejsp=release_157&xhtml=true

Mobile GPUs : Introduction & Challenges

18 Oct

Dear Fellows,

a warm welcome to this blog series about mobile GPUs.

This little series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook

In this first post I would like to give a brief overview of mobile GPUs and a bit about OpenGL|ES 2.0 (OGLES2) – besides a few technical facts, also some personally coloured side comments. I would also like to dig into the main challenges/requirements a mobile GPU faces in today's mobile devices. Actually, they are a bit different from what you have in the desktop space. The primary concerns in the mobile embedded space are silicon area, power and memory consumption, and cost; secondary are performance and feature richness.

The second post gives an (incomplete, as ever) overview of the vendors of mobile GPUs; here I differentiate between those who sell chips with their own GPUs (such as nVidia, ZiiLABS and Qualcomm) and pure IP vendors (such as Imagination Technologies (Img), ARM and Vivante). I will not talk about Intel and AMD here, because my focus is more on embedded – and honestly, Intel and AMD are not playing a big role here at the moment. Intel has no mobile GPU technology of its own, so they have licensed PowerVR [1] from Imagination Technologies and maybe others as well. AMD has sold its mobile technology [2] (formerly ATI and partly Bitboys technology) to Qualcomm, which could have been one of the reasons why former CEO Dirk Meyer had to resign. However, AMD recently teamed up with ARM [3], so let's see what this will bring in terms of new mobile chips (x86+Mali or ARM+Radeon?).

The 3rd post will be a bit more technical. We will dig into the architectures of some mobile GPUs – not in great detail, and only the (in my view) important ones like ARM/Mali, Qualcomm/Snapdragon, nVidia/Tegra and Img/PowerVR.

After we have got a glimpse of the different architectures, let's do some practice. In part 4 we will start with an overview and discussion of existing benchmarks. Besides that, I would also like to present my own benchmark. My idea is to create a set of benchmarks which fits each type of architecture perfectly well, and a set which absolutely does not. For instance, PowerVR uses a technique called tile-based deferred rendering (TBDR) [4]. So the use case could be heavy overdraw with opaque textures; in this case PowerVR should beat the other architectures significantly. All benchmarks will be executed on either Android or iOS devices.

Post number 5 is about pitfalls you have to face in your daily programming life. In the iOS world every GPU is based on either Img MBX or SGX [5], but in the Android world many more different GPUs are deployed in many different devices, and not all GPU solutions are robust and spec-compliant. Issues everywhere. Let's take a closer look and see how bad it really is.

In the final post we will ask the crystal ball about future trends and what will become hot.


What DirectX-11 (DX11) is for desktop GPUs, OGLES2 is for mobile GPUs: the “standard graphics API”. Well, some Microsoft slides talked about DirectX-9 on Windows Phones [6], but if you do a reality check, the DirectX API is not exposed to the developer. All that XNA on Phone 7 offers is a kind of OpenGL|ES 1.0 (OGLES1) like feature set – in my view quite unimpressive.

As most of you know, OGLES2 is basically a subset of desktop OpenGL 2.x plus some additional features. It is also the base for WebGL, and from OpenGL 4.2 (OGL4) onwards you are able to create an OGLES2-compatible context. Mesa also offers OGLES2 libraries, and there are a couple more options to start development without the need for a real mobile device.

OGLES2 has been on the market for quite some time now; the first provisional specification was released in 2005 and finalised in 2007. It is widely supported by Android, iOS and other operating systems. In the case of Android, over 90% of all devices support OGLES2, and Apple has supported it since the iPhone 3GS and iPod Touch (3rd generation). According to Google, about 190 million devices had been activated by October 2011; there is no official number from Apple, but according to many articles 160-190 million iOS devices in total have been sold until today (the PC market is about 380 million units per year).

However, the feature set of OGLES2 is quite limited in comparison to what DX11 or OGL4 offers. Especially GPGPU kinds of tasks cannot be performed well. The Android development team has already reacted to this by introducing RenderScript, and I am quite confident that OpenCL will be found in mobile devices quite soon [7]. Actually, everyone was expecting OpenCL in today's iOS, but it seems we still have to wait, unfortunately.

Anyway, even with “just” OGLES2 you can do a lot of great stuff, and I think it is fair to say that to this day only a few apps (mainly games) utilise all the existing rendering features. An interesting point: most of the commercially successful games are simple 2D games with very little usage of sophisticated shaders. Maybe it is because smartphone displays are and always will be limited (until we have glasses like in the Dennō Coil anime), so it is not much fun to play Crysis-like games on mobile. Or the reason is that the market was/is not interesting enough.

In my opinion, mobile devices are meant to ease daily life and keep you connected to your family/friends/job/interests. So everything is about user experience and new ways of dealing with information and content – quite a different use case than the good old desktop PC.

Let's get back to the GPGPU topic. Actually, OpenCL is for me the key enabler for augmented-reality kinds of applications. In augmented reality (AR) everything comes together: video, 3D graphics, UI, image processing etc. AR offers a new way to deal with the world around us. OK, at the moment most apps are not much more than a gimmick, and honestly I am not very excited about location-based AR, but in a few years AR will become a serious market. So mark my words.


Obviously a mobile device is not permanently plugged in, so power efficiency is key for a mobile GPU. Today, performance in terms of fillrate is mainly limited by the available memory bandwidth; this is true for both PC and mobile. The iPhone 3GS comes with a 480×320 pixel display and the iPhone 4 with 960×640. 4 times more pixels requires nearly 4 times higher memory bandwidth. That could lead to more signals from the multimedia SoC to the external memory chips, but also to a higher frequency. Especially chip-to-chip communication requires a lot of power.
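To put rough numbers on this (my own back-of-the-envelope assumptions: 32-bit colour, 60 fps, exactly one write per pixel, no overdraw and no compression):

```python
def fb_write_bandwidth(width, height, bpp=4, fps=60):
    """Very rough framebuffer write bandwidth in MB/s, assuming 32-bit
    colour, one write per pixel (no overdraw) and no compression."""
    return width * height * bpp * fps / 1e6

bw_3gs = fb_write_bandwidth(480, 320)   # iPhone 3GS display
bw_4   = fb_write_bandwidth(960, 640)   # iPhone 4 (retina) display
print(round(bw_3gs, 1), "MB/s vs", round(bw_4, 1), "MB/s")
print("factor:", bw_4 / bw_3gs)  # -> 4.0
```

Real numbers are higher still once overdraw, texture reads and display refresh are added; the point is only the quadratic growth with resolution.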

To increase computational performance, we could either increase the frequency or add more logic, which would lead to more silicon area. More Si-area often means more power consumption and higher chip cost. Uhhmm, the world wants mobile devices with less power consumption and at the lowest price possible.

We talked a bit about power consumption; now let's deal with cost. Let's do a rough example calculation of Si-area cost, taking the Vivante GC1000 core (I picked it because all data are publicly available). According to the documentation, the Si-area (40nm LP) is ~3.5mm^2. Let's assume ~$0.07 per 1mm^2 in 40nm (I hope these numbers are not far from reality). I am also quite confident that such a GPU needs a 2nd-level cache of about 64KB-256KB, so my guess would finally be a ~$0.18-$0.30 cost range per chip just for the GPU subsystem. In case your core comes from an IP vendor, you might have to add royalties, a kind of one-time licensing fee and a maintenance fee. Depending on the vendor, market and core, I would expect that the GPU alone takes a serious part of the total SoC cost. Besides that, today's GPU vendors must offer a comprehensive SW stack along with their HW, such as OGLES2, OpenCL, EGL, DirectX-9c (getting more important) and OpenVG (getting less important). Good support of RenderScript and Flash 10+ is also important. In future, OGLES3 and DX11 will become mandatory.
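For the curious, here is the same back-of-the-envelope calculation as code. The SRAM density used for the L2 cache is my own rough guess (~0.002 mm^2 per KB in 40nm), not a figure from any datasheet:

```python
def gpu_subsystem_cost(core_mm2, cost_per_mm2=0.07,
                       l2_kb=128, mm2_per_kb_sram=0.002):
    """Back-of-the-envelope GPU-subsystem silicon cost in USD.
    All parameters are rough guesses: ~$0.07/mm^2 in 40nm and an
    assumed SRAM density for the 2nd-level cache."""
    l2_mm2 = l2_kb * mm2_per_kb_sram
    return (core_mm2 + l2_mm2) * cost_per_mm2

# Vivante GC1000 core: ~3.5mm^2, with the 64KB-256KB L2 range from the text
for l2 in (64, 256):
    print(f"GC1000 + {l2}KB L2: ${gpu_subsystem_cost(3.5, l2_kb=l2):.2f}")
```

With these assumptions the result lands at roughly $0.25-$0.28, i.e. near the upper end of the $0.18-$0.30 range quoted above.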


No doubt, mobile and embedded GPUs are getting more and more sophisticated. That will have an impact on the number of vendors. At the moment there is a long list of vendors, but this business is no longer an emerging market; I suppose consolidation is about to start. Besides, in my opinion offering a separate GPU is not enough to be competitive on a bigger scale. Especially IP vendors have to offer a comprehensive overall solution including video, graphics, image and general GPGPU support. At the moment video and graphics are often two cores (both ZiiLABS and Broadcom claim to have a unified GPU+video architecture). I do think that two separate cores are not necessary. In my opinion a cleverly merged video + GPGPU architecture would reduce Si-area and power consumption significantly. It would also reduce the effort/cost of SW development. I personally see big potential here.

End Remark

Next time we will deal with the companies listed below and some of their architectures. I have tried to cover as many vendors as I could, but especially the list of GPU-chip vendors is incomplete. In fact, just a few years ago many semiconductor companies had their own development teams and technology. But step by step, chip by chip, they have started to license external IP. And I would not be surprised if even a company like Qualcomm becomes a licensee too.

List of GPU-IP-Vendors:

List of GPU-Chip-Vendors (with In-House mobile GPU technology):

Thanks for your time,