CV Algorithms on mobile GPUs

16 Apr

Computer Vision is becoming quite a hype these days. It is an interesting topic, though. (Personally, I had my first experience with it in 1999 on an SGI Onyx with 4 MIPS CPUs and a big (in terms of physical size) GPU, when I developed VFX plugins for Discreet Inferno.)

Today, I want to know how well those algorithms can be processed on embedded GPUs. To get the best possible performance, I implemented some algorithms using OpenGL|ES 2.x, OpenCL 1.1 and Renderscript.


In this blog I only show a reduced list of algorithms. I suppose the depicted OpenCL code is quite self-explanatory. Please note that the code is not necessarily the actual code used to determine the performance results.

__kernel void filter_memfill(__write_only image2d_t texOut, const float invwidth, const float invheight)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    int2 coord = (int2)(x, y);
    float4 pixel = (float4)0;
    pixel.x = ((float)x) * invwidth;
    pixel.y = ((float)y) * invheight;
    write_imagef(texOut, coord, pixel);
}

__kernel void filter_memcpy(__write_only image2d_t texOut, __read_only image2d_t texIn)
{
    const int x = get_global_id(0);
    const int y = get_global_id(1);
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 coord = (int2)(x, y);
    float4 pixel = read_imagef(texIn, smp, coord);
    write_imagef(texOut, coord, pixel);
}

Convolve 3×3/5×5
__kernel void filter_filter3x3(__write_only image2d_t texOut, __read_only image2d_t texIn, __global float* mat)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 coord = (int2)(get_global_id(0), get_global_id(1));

    int x, y;
    int i = 0;
    float4 col = (float4)0;

    for (y = -1; y <= 1; y++) {
        int2 uv;
        uv.y = coord.y + y;
        for (x = -1; x <= 1; x++) {
            uv.x = coord.x + x;
            col = col + read_imagef(texIn, smp, uv) * mat[i];
            i++; /* advance to next filter coefficient */
        }
    }

    write_imagef(texOut, coord, col);
}
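As a cross-check on the CPU, the same 3×3 convolution can be written in plain C. The sketch below is my own illustration (single channel for brevity, while the kernel above works on float4 pixels), assuming clamp-to-edge addressing at the borders:

```c
/* Scalar reference for the 3x3 convolution above. Edge pixels are
 * clamped, mirroring clamp-to-edge sampling on the GPU. */
static float clamp_read(const float *img, int w, int h, int x, int y)
{
    if (x < 0) x = 0;
    if (x >= w) x = w - 1;
    if (y < 0) y = 0;
    if (y >= h) y = h - 1;
    return img[y * w + x];
}

void convolve3x3_ref(const float *in, float *out, int w, int h,
                     const float mat[9])
{
    for (int cy = 0; cy < h; cy++)
        for (int cx = 0; cx < w; cx++) {
            float acc = 0.0f;
            int i = 0;
            for (int y = -1; y <= 1; y++)
                for (int x = -1; x <= 1; x++)
                    acc += clamp_read(in, w, h, cx + x, cy + y) * mat[i++];
            out[cy * w + cx] = acc;
        }
}
```

Comparing this against the GPU output is a cheap way to catch coefficient-ordering mistakes before benchmarking.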

__kernel void filter_rgb2hsv(__write_only image2d_t texOut, __read_only image2d_t texIn)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    float3 c = read_imagef(texIn, smp, coord).xyz;

    float4 col = RGBtoHSV(c);

    write_imagef(texOut, coord, col);
}
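The RGBtoHSV() helper called above is not shown in this post. For reference, a common scalar RGB-to-HSV formulation looks roughly like the following C sketch (my own illustration, not the benchmark code; all channels in [0,1], hue normalised to [0,1) instead of degrees, as is usual in shader code):

```c
/* Common RGB -> HSV conversion; an OpenCL version would operate on
 * float3/float4 vectors instead of separate floats. */
typedef struct { float h, s, v; } hsv_t;

hsv_t rgb_to_hsv(float r, float g, float b)
{
    float mx = r > g ? (r > b ? r : b) : (g > b ? g : b);
    float mn = r < g ? (r < b ? r : b) : (g < b ? g : b);
    float d = mx - mn;
    hsv_t out = { 0.0f, mx > 0.0f ? d / mx : 0.0f, mx };
    if (d > 0.0f) {
        if (mx == r)      out.h = (g - b) / d + (g < b ? 6.0f : 0.0f);
        else if (mx == g) out.h = (b - r) / d + 2.0f;
        else              out.h = (r - g) / d + 4.0f;
        out.h /= 6.0f;  /* normalise hue to [0,1) */
    }
    return out;
}
```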

Histogram (RGB separated)
__kernel void filter_histogram(__read_only image2d_t texIn, __global unsigned int *buffer)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | CLK_ADDRESS_CLAMP_TO_EDGE | CLK_FILTER_NEAREST;
    int2 coord = (int2)(get_global_id(0), get_global_id(1));
    uint3 i = read_imageui(texIn, smp, coord).xyz;
    /* kernel is truncated in the original post; a separated histogram
       presumably ends with three atomic bin increments like these: */
    atomic_inc(&buffer[i.x]);        /* R bins [0..255]   */
    atomic_inc(&buffer[256 + i.y]);  /* G bins [256..511] */
    atomic_inc(&buffer[512 + i.z]);  /* B bins [512..767] */
}
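A CPU reference is handy for validating a GPU histogram result. The following C sketch is my own (assuming interleaved 8-bit RGBA input and 3×256 bins, matching the separated-RGB layout described above):

```c
/* CPU reference for a separated RGB histogram: 3 x 256 bins, one count
 * per channel per pixel; input assumed to be interleaved 8-bit RGBA. */
void histogram_rgb_ref(const unsigned char *rgba, int npixels,
                       unsigned int bins[3][256])
{
    for (int c = 0; c < 3; c++)
        for (int b = 0; b < 256; b++)
            bins[c][b] = 0;
    for (int i = 0; i < npixels; i++) {
        bins[0][rgba[4 * i + 0]]++;  /* R */
        bins[1][rgba[4 * i + 1]]++;  /* G */
        bins[2][rgba[4 * i + 2]]++;  /* B */
    }
}
```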

List of used devices:

Nexus 10:
Android: V4.2.1 (JOP40D)
GPU: Quad-core Mali-T604 (350-500 MHz, assumed)
Memory: 32-bit Dual-channel 800 MHz LPDDR3/DDR3 (12.8 GB/sec)

Nexus 7:
Android: V4.2.1 (JOP40D)
GPU: ULP GeForce (416 MHz)
Memory: 32-bit single-channel 667 MHz DDR3-1333 (5.34 GB/sec)

Nexus 4:
Android: V4.2.2 (JDQ39)
GPU: Adreno 320
Memory: 32-bit Dual-channel 500 MHz LPDDR2 (8.5 GB/sec)


Performance values are in milliseconds.

RI = Renderscript intrinsic usage
FS = Filterscript usage
RS = Renderscript usage
GL = OpenGL|ES 2.0 usage
CL = OpenCL usage (OpenCL 1.1 on the Mali-T604 GPU)

Table (Rev 0.9, 16th April 2013)

Please note, this blog only contains a subset of the benchmark.

Nexus 4 (Adreno 320)
128×128 512×512 1024×1024
Fill(FS) 1.78 8.51
Fill(GL) 0.64 2.47
Blit(RI) 1.13 4.79
Blit(GL) 0.79 2.96
Convolve3x3(RI) 1.74 6.52
Convolve3x3(GL) 2.07
Convolve5x5(RI) 2.85 13
RGB2HSV(FS) 7.74 31.2
RGB2HSV(GL) 0.94 3.44
Histogram(RS) 2.64 11.93
Nexus 7 (ULP GeForce, Tegra 3)
128×128 512×512 1024×1024
Fill(FS) 0.14 2.31 9.18
Fill(GL) 0.06 0.58 2.3
Blit(RI) 0.04 1.17
Blit(GL) 0.06 0.63 2.28
Convolve3x3(RI) 0.12 2.75 13.59
Convolve3x3(GL) 0.31 5.11 20.45
Convolve5x5(RI) 0.43 4.28 16.23
Convolve5x5(GL) 0.77 12.75 51.09
RGB2HSV(FS) 0.55 8.84 34.5
RGB2HSV(GL) 0.18 3.16 12.68
Histogram(RS) 0.17 3.07 12.12
Nexus 10 (Mali-T604)
128×128 512×512 1024×1024
Fill(FS) 0.11 0.65 2.62
Fill(GL) 0.47 1.82
Fill(CL) 0.03 0.47 1.86
Blit(RI) 0.09 1.38 3.4
Blit(GL) 0.05 0.48 2.17
Blit(CL) 0.036 0.51 2.03
Convolve3x3(RI) 0.18 1.85 5.82
Convolve3x3(GL) 0.09 1.75 7.38
Convolve5x5(RI) 1.14 6.78 19.78
Convolve5x5(GL) 0.46 7.63 31.29
RGB2HSV(FS) 0.09 2.53 8.93
RGB2HSV(GL) 0.08 1.13
RGB2HSV(CL) 0.076 1.14 4.57
Histogram(RS) 0.3 4.79 18.46
Histogram(CL) 0.20 3.82 14.68

RED = most likely not executed on the GPU.


Mobile GPUs : Architectures

5 Apr

Mobile GPUs series is divided into 6 parts:

  1. Mobile GPUs : Introduction & Challenges
  2. Mobile GPUs : Vendors
  3. Mobile GPUs : Architectures
  4. Mobile GPUs : Benchmarks
  5. Mobile GPUs : Pitfalls
  6. Mobile GPUs : Outlook


I am happy to present to you the third post of this Mobile-GPU blog series.

Today we will dig a bit deeper into technical details, at least compared to the previous posts. My focus in this part is on the leading mobile GPU architectures: Adreno (Qualcomm), GeForce ULP (nVidia), Mali (ARM) and PowerVR (Imagination). For several reasons I had to drop Vivante from this post, even though, according to Jon Peddie Research [1], Vivante is #2 in the GPU-IP business.

There are many ways to characterise GPU architectures, too many. So let's keep it simple and brief by focusing just on the shaders and on how rasterisation/rendering is organised. I would also like to mention that, compared to desktop GPUs from nVidia/AMD, only very little is revealed about the actual makeup of mobile GPUs.

Anyway, let's get started. To be OpenGL|ES 2.0 compliant, a GPU must support vertex and pixel/fragment shaders; these shaders can be dedicated or unified. Even determining the number of shaders is not an easy task: some GPUs have a finely granular ALU/shader grid, others only a few more complex shaders. In short, not every shader is alike.

Mobile SoC GPU Comparison – Source: AnandTech [3]

                Adreno 225  PowerVR SGX 540  PowerVR SGX 543  PowerVR SGX 543MP2  Mali-400 MP4  GeForce ULP  Kal-El GeForce
SIMD Name       -           USSE             USSE2            USSE2               Core          Core         Core
# of SIMDs      8           4                4                8                   4 + 1         8            12
MADs per SIMD   4           2                4                4                   4 / 2         1            ?
Total MADs      32          8                16               32                  18            8            ?

In terms of rendering there are two extremes: Immediate-Mode Rendering (IMR) and Tile-Based Deferred Rendering (TBDR). I would like to explain this in a bit more detail, starting with simple GPU architectures of the past and working up to today.

Early GPUs (especially in the embedded space) were IMR based: the CPU set the required parameters directly on the GPU, and a write access to a certain register triggered the start of the rendering process. The triangle rasterisation was done in one pass, most often span-wise (top to bottom, left to right). The CPU had to wait until the GPU was ready again before issuing the next triangle; obviously this approach causes a lot of synchronisation overhead on both CPU and GPU. But GPUs evolved: dedicated caches, shadow registers and display lists (command lists) were introduced. In the early days, the display list was just a small (a few kB) ring buffer, mainly meant to decouple CPU and GPU and save wait cycles. Nowadays, where most SoCs have Unified Memory Access (UMA), the display list can be stored anywhere in memory.
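The small ring-buffer display list described above can be sketched in a few lines of C. All names here are illustrative, not any real driver's API: the CPU pushes commands, the GPU pops them, and the CPU only stalls when the ring is full:

```c
/* Minimal sketch of an early display-list ring buffer. The CPU appends
 * commands at the write position, the GPU consumes them at the read
 * position; the CPU only waits when the ring is full, the GPU idles
 * when it is empty. */
#define RING_SIZE 256  /* a few kB in practice; must be a power of two */

typedef struct {
    unsigned int cmd[RING_SIZE];
    unsigned int wr, rd;  /* monotonically increasing positions */
} cmd_ring_t;

int ring_push(cmd_ring_t *r, unsigned int c)  /* CPU (producer) side */
{
    if (r->wr - r->rd == RING_SIZE)
        return 0;  /* full: CPU would have to wait here */
    r->cmd[r->wr++ & (RING_SIZE - 1)] = c;
    return 1;
}

int ring_pop(cmd_ring_t *r, unsigned int *c)  /* GPU (consumer) side */
{
    if (r->rd == r->wr)
        return 0;  /* empty: GPU idles */
    *c = r->cmd[r->rd++ & (RING_SIZE - 1)];
    return 1;
}
```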

In all modern mobile GPUs, rasterisation is done with a tile-based approach. The SW driver buffers as much of the scene as possible and then renders all triangles tile by tile into the framebuffer. The tile buffer is on-chip memory, which leads to significantly lower framebuffer-related external memory bandwidth consumption. However, the rendering of each tile itself is often done in a more IMR-like way, so there is still a significant performance difference between rendering a scene back to front and front to back. Hidden-surface-removal algorithms like Early-Z do not work efficiently in the back-to-front case, because tons of texels get fetched even though they are not visible. Only PowerVR offers pixel-based deferred rendering.
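Before the per-tile rendering pass, the driver has to "bin" each buffered triangle into the tiles it may touch. A conservative bounding-box binner, sketched here purely for illustration (32×32 tiles assumed), gives the idea:

```c
/* Conservative bounding-box binning as used conceptually by tile-based
 * renderers: a triangle is assigned to every tile its screen-space
 * bounding box overlaps. Here we only count the touched tiles; a real
 * binner would append the triangle to each tile's list. */
#define TILE 32

typedef struct { float x[3], y[3]; } tri_t;

int bin_triangle(const tri_t *t, int w, int h)
{
    float minx = t->x[0], maxx = t->x[0];
    float miny = t->y[0], maxy = t->y[0];
    for (int i = 1; i < 3; i++) {
        if (t->x[i] < minx) minx = t->x[i];
        if (t->x[i] > maxx) maxx = t->x[i];
        if (t->y[i] < miny) miny = t->y[i];
        if (t->y[i] > maxy) maxy = t->y[i];
    }
    /* clamp the tile range to the w x h framebuffer */
    int tx0 = minx < 0 ? 0 : (int)minx / TILE;
    int ty0 = miny < 0 ? 0 : (int)miny / TILE;
    int tx1 = (int)maxx / TILE, ty1 = (int)maxy / TILE;
    if (tx1 > (w - 1) / TILE) tx1 = (w - 1) / TILE;
    if (ty1 > (h - 1) / TILE) ty1 = (h - 1) / TILE;
    return (tx1 - tx0 + 1) * (ty1 - ty0 + 1);
}
```

Bounding-box binning over-approximates thin diagonal triangles, which is one reason real drivers use more precise (and more expensive) binning schemes.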

                Adreno  GeForce ULP  Mali-400  Mali-T604  PowerVR
Unified Shader  yes     no           no        yes        yes

*TBIMR = Tile-Based Immediate-Mode Rendering – the driver buffers as much of the scene as possible and then renders the buffer tile by tile in a normal IMR manner

Adreno
The Adreno architecture is based on the AMD Z430 of the Imageon family, which ATI introduced in 2002. Based on what is stated in [2] and [3], it seems that Adreno renders tile-based, but each tile in an IMR kind of way. The HW seems to be optimised for triangle strips, which very likely puts a big load on the CPU, because the CPU then has to transform meshes into triangle strips.

Another interesting aspect: clipping seems to be a big performance issue on the Adreno architecture. My educated guess is that the clipping algorithm does not just clip the triangle geometry alone, but also calculates the clipped attributes (texture coordinates etc.) for each new vertex. A quick example may explain it better: a triangle needs to be clipped, and let's say this produces 2 new vertices (one gets dropped). The per-vertex attributes are texture coordinates (2D), vertex colour (4D) and vertex normal (4D). The clipper then calculates the two new vertices, but also the related attributes; in our case this roughly costs an extra (2D+4D+4D=10) 10*2 multiply-add operations. On top comes the attribute rasterisation setup, which now needs to be done 2 times instead of 1 (in our case 1 triangle became 2 triangles).

I would also guess that vertex shading and primitive assembly (including clipping) are done per tile. This approach is a lot simpler in terms of HW complexity but very costly in terms of cycle count: say you have an 800×480 framebuffer and use 32×32 tiles, then you need to touch all scene data 375 times to render the whole framebuffer. That does not sound very effective to me.
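The 375-passes figure follows directly from the tile grid: 800/32 = 25 tile columns times 480/32 = 15 tile rows. As a quick sanity check in C:

```c
/* Number of per-tile passes = number of tiles covering the framebuffer
 * (rounded up when the framebuffer size is not a multiple of the tile
 * size). With 800x480 and 32x32 tiles this yields 25 * 15 = 375. */
int tile_passes(int fb_w, int fb_h, int tile_w, int tile_h)
{
    int tiles_x = (fb_w + tile_w - 1) / tile_w;
    int tiles_y = (fb_h + tile_h - 1) / tile_h;
    return tiles_x * tiles_y;
}
```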

GeForce ULP
In general nVidia benefits heavily from its desktop GPU expertise, so it is no surprise that the GeForce ULP deployed in Tegra 2 is based on the NV40 architecture. NV40 was introduced in 2004 and supports OpenGL 2.0 and DirectX 9c. Because it is a common architecture I will not dig into details here; more can be found in [6].

Mali
Mali has a modern, straightforward architecture. The rendering is done tile-based, vertex shading runs once per frame, and rasterisation happens in homogeneous space [4], so no explicit clipping is required. Early-Z and 4 levels of depth/stencil hierarchy are supported, which enables 4x MSAA nearly for free. The shaders are SIMD/VLIW based, but more details about the shaders themselves have not been revealed so far. The Mali-T604/658 (Midgard architecture) supports framebuffer compression and a new texture compression format called ASTC [5]. Midgard also takes big steps towards general GPGPU applications by natively supporting 64-bit data types and, more importantly, CCI-400. The tight connection between CPU and GPU is a very powerful solution which provides high flexibility and excellent performance at a low power footprint.

PowerVR has been on the market for quite some time now, so I will be brief on the architecture itself. The biggest difference to Adreno, GeForce ULP and Mali is that PowerVR defers rendering on a per-pixel basis, while the others only defer per tile. This means that PowerVR, for instance, tries to fetch texels only for pixels which contribute to the final framebuffer pixel. I have to admit this architecture works perfectly in fixed-function (OpenGL|ES 1.1) and low-poly-count use cases. But in use cases with tons of triangles and complex shaders, I suppose the performance will drop significantly, maybe even below what you can achieve on other GPUs.

And once again, thanks for your time.


[1] http://www.eetimes.com/electronics-news/4304118/Imagination-outstrips-all-other-GPU-IP-suppliers

[2] AdrenoTM 200 Performance Optimization –OpenGL ES Tips and Tricks [March 2010]

[3] http://www.anandtech.com/show/4940/qualcomm-new-snapdragon-s4-msm8960-krait-architecture/3

[4] http://www.ece.unm.edu/course/ece595/docs/olano.pdf

[5] http://blogs.arm.com/multimedia/643-astc-texture-compression-arm-pushes-the-envelope-in-graphics-technology/

[6] http://ixbtlabs.com/articles2/gffx/nv40-part1-a.html