Motivation for hand-optimized Assembly code

There’s a popular saying that “in 90% of cases, a modern compiler writes faster code than a typical Assembly programmer would”. But anyone who has actually tested this theory knows how wrong this statement is! Hand-written Assembly code is ALWAYS faster and/or smaller than the equivalent compiled code, as long as the programmer understands the many intricate details of the CPU they are developing for. For example, I wrote both an optimized C function and an optimized Assembly function (using NEON SIMD instructions) to shrink an image by a factor of 4, and my Assembly code was 30x faster than my best C code! Even the best C compilers are still terrible at using SIMD acceleration, a feature that is available on most modern CPUs and can allow code to run between 4 and 50 times faster, and yet is rarely used properly!
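To make the image-shrinking example concrete, here is a minimal sketch of what a plain C baseline for that kind of task could look like. It is not the code that was benchmarked above: the function name `shrink_by_4`, the 8-bit grayscale format, the 4x4 box averaging, and the requirement that the width and height are multiples of 4 are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline: shrink an 8-bit grayscale image by 4 in
 * each dimension by averaging every 4x4 block of source pixels into one
 * destination pixel. Assumes width and height are multiples of 4 and that
 * dst has room for (width / 4) * (height / 4) bytes. */
void shrink_by_4(const uint8_t *src, size_t width, size_t height, uint8_t *dst)
{
    assert(width % 4 == 0 && height % 4 == 0);

    size_t out_w = width / 4;
    size_t out_h = height / 4;

    for (size_t oy = 0; oy < out_h; oy++) {
        for (size_t ox = 0; ox < out_w; ox++) {
            unsigned sum = 0;

            /* Add up the 16 source pixels that map to this output pixel. */
            for (size_t dy = 0; dy < 4; dy++) {
                const uint8_t *row = src + (oy * 4 + dy) * width + ox * 4;
                sum += row[0] + row[1] + row[2] + row[3];
            }

            /* Rounded average of the 16 pixels. */
            dst[oy * out_w + ox] = (uint8_t)((sum + 8) / 16);
        }
    }
}
```

A SIMD version of the same loop would load 16 pixels per row at once and do the additions in parallel, which is where the large speedups come from.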

ARM’s RVDS compiler typically generates code that is up to 2x faster than any other C compiler for ARM, but on most ARM devices, hand-written Assembly code can often be 10x faster (assuming you use SIMD vectorization such as ARM’s NEON Media Processing Engine or Intel’s MMX/SSE/AVX)! This is similar to the speedups you can expect from GPGPU acceleration (using NVIDIA’s CUDA, or OpenCL), but on a small mobile device rather than an expensive desktop video card! And luckily the iPhone, iPad, iPod, Raspberry Pi, ODROID and Android phones & tablets nearly all use ARM CPUs with NEON vector processing, so you can use the same Assembly code in apps for the official iPhone App Store, the Android Market (with the NDK) and the Raspberry Pi. And with the recent popularity of ARM CPUs in portable devices, this is likely to continue for several generations of smartphones, tablets, and ultra-portables (e.g. in the NVIDIA Tegra3 “Kal-El”, TI OMAP4, Qualcomm Snapdragon S4 “Krait”, Apple iPad2 & iPhone5, etc). Obviously you shouldn’t write a whole app using Assembly language, but if you need certain loops to run as fast as possible, then a few sections of Assembly language might be exactly what you need!

Modern processor architectures are much more complicated now than they were at the start of the PC era, which definitely makes efficient Assembly code hard to write by hand, but it also makes efficient code hard for a compiler to generate, and so there is significant room for improvement in efficient code design.

UPDATE: Note that Cortex-A9 and Cortex-A15 CPUs are much more advanced than Cortex-A5, Cortex-A7 & Cortex-A8, so the advantages of Assembly code & NEON SIMD will be less important on Cortex-A9 and Cortex-A15 than on simpler devices such as Cortex-A8.

Free libraries with hand-optimized Assembly code

There are already some free libraries of hand-optimized code for Intel x86 and ARM CPUs, so for some tasks you can simply use one of these existing libraries from your C/C++ code without writing any Assembly code yourself.

For ARM CPUs (including nearly all smartphones, tablets & Linux embedded systems):

  • ARM’s OpenMAX DL implementation with hundreds of fast functions for ARMv7-A Cortex-A8 and ARM11 (devices such as iPhone, iPad, Android, Raspberry Pi, BlackBerry PlayBook, Palm Pre, etc). OpenMAX functions are arranged into Audio Codecs, Image Codecs, Image Processing, Signal Processing, and Video Codecs.
  • Eigen high-level C++ math library has SIMD vectorization for both Intel SSE and ARM NEON.
  • If you use ARM’s DS-5 or RVDS 4 compiler, you can enable auto vectorization so it will try to optimize your C code using NEON, perhaps generating code that runs twice as fast as normal.
  • Or if you use GCC, LLVM or CodeSourcery, you can also enable auto vectorization, but it rarely makes any improvement (in Xcode 3, it would be “GCC 4.2 – Language” -> “Other C flags”): “-O3 -mcpu=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorize -ffast-math -funsafe-math-optimizations -fsingle-precision-constant”
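Auto vectorization has the best chance of working on simple loops where the compiler can prove that the arrays do not overlap. The following is a minimal sketch of such a loop, not code from any of the libraries above; the function name `scale_add` is hypothetical, and whether a given compiler actually vectorizes it depends on the compiler version and the flags you pass.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for n floats.
 * The C99 "restrict" qualifiers promise the compiler that x and y do not
 * overlap, which removes one common obstacle to auto vectorization. */
void scale_add(float *restrict y, const float *restrict x, float a, size_t n)
{
    assert(x != NULL && y != NULL);

    /* A plain counted loop with unit stride and no function calls or
     * early exits inside the body. */
    for (size_t i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

Compiling this with the flags listed above and inspecting the generated Assembly is the reliable way to see whether NEON instructions were emitted. Newer GCC versions can also report vectorized loops with -fopt-info-vec, and Clang with -Rpass=loop-vectorize.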

For Intel x86 (desktop) CPUs:

  • Eigen high-level C++ math library with SIMD vectorization for both Intel SSE and ARM NEON.
  • SIMDx86 low-level SIMD functions for Intel MMX, SSE, SSE2, SSE3, SSSE3 and AMD 3DNow!+.
  • SSEPlus low-level Intrinsic SIMD functions for Intel SSE, SSE2, SSE3, SSSE3, SSE4, SSE5.
  • libSIMD low-level SIMD functions for Intel SSE, SSE2.
  • Perhaps other SIMD libraries, which you can find by searching SourceForge for SIMD.
  • Commercial or free math libraries such as ATLAS, PLASMA, libFlame, and Intel’s MKL & IPP.
