Use Custom Hardware
If your target is a modern handheld, it probably contains an ARM system-on-chip (SoC): an ARM CPU core plus a load of other stuff. Exactly what this "other stuff" is depends on the chip, but these days it tends to include a digital signal processor (DSP) and a graphics processing unit (GPU), among other things.
Like the CPU, these are general-purpose processing units, but they're optimized for different types of code. Some algorithms run far more efficiently on one of these units than on the CPU core. A good example is audio decoding. Decoding something like an MP3 on the ARM core of an OMAP3 series SoC uses 5–10 times as much power as doing the same task with the C64x DSP in the same package.
The GPU can also make a huge difference. Something like the iPhone or a modern Android device uses a lot of flashy animation effects, implemented by rendering interface components to GPU textures and then compositing them on the GPU. The GPU is typically designed to render several million textured triangles per second. Rendering the few hundred in a typical 2D user interface, using simple orthographic projection and a bit of blending, uses a tiny fraction of the GPU's capacity. Doing something similar on the CPU, in contrast, would cause a noticeable spike in the load, and therefore in the power consumption.
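To put rough numbers on "a tiny fraction," here is a back-of-the-envelope sketch of the arithmetic in the paragraph above. Every figure in it is an illustrative assumption rather than a measurement: the 5 million triangles per second stands in for "several million," the 150 layers at two triangles each stand in for "a few hundred," and the 60 frames per second refresh rate is a guess. The function name `ui_triangle_load` is made up for this example.

```python
# Back-of-the-envelope estimate; every number below is an illustrative assumption.
GPU_TRIANGLES_PER_SECOND = 5_000_000  # assumed "several million" peak throughput
UI_LAYERS = 150                       # assumed number of textured rectangles on screen
TRIANGLES_PER_LAYER = 2               # a rectangle is drawn as two triangles
FRAMES_PER_SECOND = 60                # assumed refresh rate


def ui_triangle_load(layers, triangles_per_layer, fps, gpu_rate):
    """Return (triangles drawn per second, fraction of the GPU's peak rate)."""
    per_second = layers * triangles_per_layer * fps
    return per_second, per_second / gpu_rate


per_second, fraction = ui_triangle_load(
    UI_LAYERS, TRIANGLES_PER_LAYER, FRAMES_PER_SECOND, GPU_TRIANGLES_PER_SECOND
)
print(f"{per_second} triangles/s, {fraction:.2%} of peak")
# prints "18000 triangles/s, 0.36% of peak"
```

Triangle count is only one part of the cost (filling and blending pixels takes GPU time too), so treat this as a rough sense of scale and not a prediction for any particular chip.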