Floating-Point, Vectors, and ABI Problems
The original ARM design didn't come with a floating-point unit. This lack wasn't unusual at the time; the 486 was the first x86 chip to have an integrated FPU, and it even offered a cheaper flavor that omitted the FPU. Because ARM cores typically ship inside SoCs, it took much longer for most ARM users to start complaining about the omission. Most workloads that need more than the occasional floating-point instruction will execute much faster (and with a fraction of the power consumption) on the DSP core than on the ARM core in a typical SoC, so there's little reason to try running them on the ARM core.
The missing FPU was a problem for Linux. The original application binary interface (ABI) passed floating-point values in floating-point registers. Sounds sensible, but what happens when there's no floating-point unit, and thus no floating-point registers? Each instruction in the caller that stores a value in a floating-point register raises an illegal-instruction trap. The kernel then loads the faulting instruction and emulates it, copying the value into a shadow area. The same thing happens for every other floating-point instruction. Code compiled for an FPU and emulated this way runs at about 1% of full speed on a machine without one. In contrast, code compiled for a machine without an FPU, which calls software floating-point routines directly instead of trapping, runs at around 10% of full speed.
That's fine if you can compile two versions of your code, one for ARM-with-FPU and one for ARM-without-FPU, but it's not ideal if you want to distribute a single ARM binary. The Embedded ABI (EABI) solves this problem by passing floating-point values in integer registers. You can just swap out the implementations of functions that take floating-point arguments, without having to modify the caller. Even if your binary assumes hardware floating point, and so falls back to trap-and-emulate on a machine without an FPU, its calls to library functions that take floats remain fast, and so do the library implementations.
The downside: If you're doing floating-point operations on both the caller and called sides of the function, you need to copy the values out of the FPU registers into integer registers for the call and back again, which is slower on a machine that really has an FPU. Interestingly, it's often faster to store floating-point values to memory and load them back in than to copy them to integer registers, because the store can happen in the background, whereas a register-to-register copy stalls the integer pipeline (for around 20 cycles) until it completes. You often get better performance if you write the floating-point values to memory, do something else, and then call a function that takes pointers to the floats as arguments, rather than the floats themselves.
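As a rough sketch of the pointer-passing idea in C, the two call styles look like this. The function names dot3_by_value and dot3_by_pointer are hypothetical, chosen only for illustration. Whether the pointer form is actually faster depends on the core, the compiler, and the float ABI in use (with GCC, the case described here roughly corresponds to building with -mfloat-abi=softfp), so treat it as something to measure rather than a rule.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: floats passed by value. Under an ABI that passes
 * floating-point arguments in integer registers, a caller that computed
 * these values in the FPU has to move each one out of the FPU before
 * the call, and the callee has to move them back in. */
float dot3_by_value(float ax, float ay, float az,
                    float bx, float by, float bz)
{
    return ax * bx + ay * by + az * bz;
}

/* Hypothetical example: the same computation, but the caller stores the
 * values to memory and passes two pointers. Only the pointers travel in
 * integer registers; the callee loads the floats from memory itself. */
float dot3_by_pointer(const float *a, const float *b)
{
    assert(a != NULL && b != NULL);
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}
```

A caller would typically write its intermediate results into a small array, carry on with unrelated work, and only then call dot3_by_pointer with the array addresses.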
Unfortunately, most ARM chips that people use to run something as heavy as Linux now actually do have an FPU. In fact, they often have two: vector floating-point (VFP) and NEON. The name vector floating point is quite misleading; the VFP unit is a full IEEE floating-point unit, supporting both single- and double-precision floating-point values. It's faster than a soft-float implementation, but still pretty slow on something like the Cortex-A8, which has a non-pipelined implementation.
The NEON unit is a vector coprocessor with a lot of features, unfortunately not including double-precision floating point. On a typical modern ARM chip, 32-bit floating-point operations are significantly faster than 64-bit ones. Using only half of the vector unit gives good performance, but using all of it is even better; to get there, make sure that your data is laid out for easy vectorization.
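To make "laid out for easy vectorization" concrete, here is a minimal sketch in C that contrasts two layouts for the same data. The type and function names (point_aos, points_soa, scale_x_aos, scale_x_soa) are hypothetical, and the code uses plain loops rather than NEON intrinsics; whether a given compiler actually vectorizes either loop depends on its version and flags, so this only illustrates the layout difference.

```c
#include <assert.h>
#include <stddef.h>

/* Array-of-structures: the x, y and z of one point sit next to each
 * other, so consecutive x values are 12 bytes apart in memory. */
struct point_aos {
    float x, y, z;
};

/* Structure-of-arrays: every component has its own contiguous array,
 * so four consecutive x values are adjacent and can fill one 128-bit
 * vector register with a single load. */
struct points_soa {
    float *x;
    float *y;
    float *z;
};

/* Scales only the x component; the memory accesses are strided. */
void scale_x_aos(struct point_aos *p, size_t n, float k)
{
    assert(p != NULL);
    for (size_t i = 0; i < n; i++)
        p[i].x *= k;
}

/* Scales only the x component; the memory accesses are contiguous. */
void scale_x_soa(struct points_soa *p, size_t n, float k)
{
    assert(p != NULL && p->x != NULL);
    float *x = p->x;
    for (size_t i = 0; i < n; i++)
        x[i] *= k;
}
```

Both functions compute the same result. The second simply gives the compiler, or hand-written vector code, a contiguous run of 32-bit floats to work on.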
One common "gotcha" with C code is that floating-point constants are double-precision unless explicitly suffixed with an f (2.0f rather than 2.0). The type-promotion rules mean that if you multiply a float by an unsuffixed constant, you get a 64-bit operation, which can be more than an order of magnitude slower on an ARM chip than the 32-bit operation that you actually wanted.
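The promotion is easy to see in a small C sketch. The helper names below are hypothetical; the sizeof expressions are not evaluated at run time and simply report the type that the compiler assigns to each product.

```c
#include <assert.h>
#include <stddef.h>

/* 0.5 is a double constant: x is converted to double, the multiply is
 * done in double precision, and the result is converted back to float. */
float halve_with_double_constant(float x)
{
    return x * 0.5;
}

/* 0.5f is a float constant: the multiply stays in single precision. */
float halve_with_float_constant(float x)
{
    return x * 0.5f;
}

/* Size of the type of (float * double constant), which is double. */
size_t product_size_double_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 0.5);
}

/* Size of the type of (float * float constant), which is float. */
size_t product_size_float_constant(void)
{
    float x = 1.0f;
    return sizeof(x * 0.5f);
}
```

Both halve functions return the same value for ordinary inputs; the difference lies only in which precision the multiplication uses. GCC and Clang can flag the implicit promotion with the -Wdouble-promotion warning option.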