It's actually slightly misleading to talk about the ARM instruction set, because a modern ARM chip supports several. One of the advantages that x86 has over most RISC chips is instruction density. x86 instructions are variable-length, which means that common instructions typically have a shorter encoding and so take up less space in the instruction cache. Therefore, x86 chips need smaller instruction caches for the same performance. This is very important. An instruction cache miss can cause the processor to stall for 150 or so cycles; if that happens often, your processor throughput drops dramatically.
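To put rough numbers on that argument, here is a minimal sketch of the arithmetic. It assumes a deliberately simple model that is not from any vendor documentation: every instruction costs one cycle unless its fetch misses the cache, and every miss costs the full 150-cycle stall. The function names, the 32KB cache size, and the 1% miss rate are illustrative assumptions only.

```python
def instructions_in_cache(cache_bytes, avg_instruction_bytes):
    """How many instructions fit in a cache of the given size."""
    return cache_bytes // avg_instruction_bytes


def effective_cpi(miss_rate, miss_penalty=150, base_cpi=1.0):
    """Average cycles per instruction under a simple stall model.

    miss_rate is the fraction of instruction fetches that miss the cache
    (an assumed input, not a measured figure).
    """
    return base_cpi + miss_rate * miss_penalty


# A hypothetical 32KB instruction cache holds twice as many 2-byte
# instructions as 4-byte ones.
print(instructions_in_cache(32 * 1024, 4))
print(instructions_in_cache(32 * 1024, 2))

# Under this model, even a 1% miss rate more than doubles the average
# cost of an instruction.
print(effective_cpi(0.01))
```

Real processors overlap some of the stall with other work, so treat the result as an upper bound on the damage rather than a prediction.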
The cost of variable-length encoding is increased complexity of the instruction decoder. The ARM instruction decoder takes a 32-bit word and just needs to test a few bits to know where to dispatch the instruction. The x86 decoder needs to read the bits in sequence, find breaks between instructions, and so on. On something like the Atom, the decoder can account for around 20% of the total power consumption.
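The difference between the two decoding styles can be sketched in a few lines. The classification below tests bits 27 to 25 of a 32-bit ARM word and is deliberately coarse; it is not the full ARM decode table. The variable-length encoding is a made-up toy (the low two bits of the first byte give the length), not real x86, and all the function names are hypothetical.

```python
def arm_class(word):
    """Coarsely classify a 32-bit ARM instruction by testing bits 27..25."""
    top = (word >> 25) & 0b111
    if top in (0b000, 0b001):
        return "data-processing"      # bits 27:26 == 00
    if top in (0b010, 0b011):
        return "load/store"           # bits 27:26 == 01
    if top == 0b100:
        return "load/store-multiple"
    if top == 0b101:
        return "branch"
    return "coprocessor/swi"


def fixed_boundaries(code, width=4):
    """Fixed-width code: every instruction start is known up front."""
    return list(range(0, len(code), width))


def variable_boundaries(code):
    """Toy variable-length code: each start depends on decoding
    every instruction before it."""
    offsets = []
    pos = 0
    while pos < len(code):
        offsets.append(pos)
        pos += (code[pos] & 0b11) + 1  # toy rule: length = low 2 bits + 1
    return offsets


print(arm_class(0xE1A00001))  # MOV r0, r1
print(arm_class(0xEAFFFFFE))  # branch to self
print(fixed_boundaries(bytes(12)))
print(variable_boundaries(bytes([0x00, 0x02, 0xAA, 0xBB, 0x01, 0xCC])))
```

The fixed-width boundaries can all be computed independently, which is what lets hardware decode several instructions in parallel cheaply. The loop in the variable-length case is inherently sequential, and that is the work the x86 decoder has to do in hardware.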
Worse, you can't turn off the instruction decoder very often, if at all. Something like the FPU or SSE unit can be powered down while it's not executing floating-point or vector instructions. The same is true of any of the other execution units. But it's not true of the decoder, which must remain powered on as long as you're fetching instructions. Intel's latest server chips turn it off periodically by caching decoded micro-ops and powering the decoder only when fetching instructions into the micro-op cache. Unfortunately, the micro-ops are about as complex to decode as ARM instructions, so this technique doesn't save anything relative to ARM.
Of course, the ARM instruction set has the opposite set of trade-offs. It needs more instruction cache than a variable-length encoding does. The ARM solution is to add a second instruction decoder. Thumb code was introduced as a subset of ARM operations. In modern ARM chips, Thumb code is extended with the Thumb-2 instruction set, which contains a more powerful subset. Each Thumb instruction corresponds to an ARM instruction, but is only 16 bits long. Some of the savings come from reducing the number of registers that can be accessed; most Thumb instructions can access only the bottom half of the register set. They also support only a subset of the operations of the full instruction set: the most commonly used subset.
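A concrete case shows where the bits go. The sketch below encodes the same operation, ADDS rd, rn, rm, in both forms. The ARM form spends four bits on each register and four more on a condition code; the 16-bit Thumb form has only three bits per register, which is why it can reach only r0 through r7. The helper names are hypothetical, and the bit layouts are written from the standard published encodings, so check them against the architecture manual before relying on them.

```python
def encode_arm_adds(rd, rn, rm):
    """32-bit ARM encoding of ADDS rd, rn, rm (always-execute condition, no shift)."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 15:
            raise ValueError("ARM registers are r0-r15")
    # cond=1110 | 00 | I=0 | opcode=0100 (ADD) | S=1 | Rn | Rd | shift=0 | Rm
    return (0xE << 28) | (0b0100 << 21) | (1 << 20) | (rn << 16) | (rd << 12) | rm


def encode_thumb_adds(rd, rn, rm):
    """16-bit Thumb encoding of ADDS rd, rn, rm."""
    for reg in (rd, rn, rm):
        if not 0 <= reg <= 7:
            raise ValueError("this 16-bit form only reaches r0-r7")
    # 0001100 | Rm (3 bits) | Rn (3 bits) | Rd (3 bits)
    return (0b0001100 << 9) | (rm << 6) | (rn << 3) | rd


print(hex(encode_arm_adds(0, 1, 2)))    # 0xe0910002
print(hex(encode_thumb_adds(0, 1, 2)))  # 0x1888
```

Asking for `encode_thumb_adds(8, 1, 2)` raises an error, because there is simply no room in the three-bit field to name r8.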
The CPU is always in ARM, Thumb, or Thumb-2 mode (or, occasionally, in one of a few less-common modes). Switching between modes requires an explicit instruction. Supporting all of these modes means that an ARM chip needs three or more instruction decoders. However, each decoder is quite simple, and because the mode changes only at an explicit switch, the chip needs to power only one at a time, combining the power efficiency of simple decoders with the instruction-cache usage of variable-length encodings.
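One such explicit instruction on real ARM cores is the branch-and-exchange instruction, BX, which picks the instruction set from the lowest bit of the branch target. That detail is not spelled out above, so take the following as an illustrative sketch: it models only the ARM and Thumb states, and the function names and memory contents are hypothetical.

```python
# Bytes fetched per instruction in each state (only two states modeled).
FETCH_WIDTH = {"arm": 4, "thumb": 2}


def bx(target):
    """Model a branch-and-exchange: bit 0 of the target selects the state."""
    state = "thumb" if target & 1 else "arm"
    pc = target & ~1  # bit 0 is a state flag, not part of the address
    return state, pc


def fetch(memory, state, pc):
    """Fetch one little-endian instruction using the active state's width."""
    width = FETCH_WIDTH[state]
    return int.from_bytes(memory[pc:pc + width], "little")


# One 32-bit ARM instruction followed by one 16-bit Thumb instruction.
memory = bytes([0x02, 0x00, 0x91, 0xE0, 0x88, 0x18])

state, pc = bx(0x0)      # even target: ARM state
print(state, hex(fetch(memory, state, pc)))

state, pc = bx(0x4 | 1)  # odd target: Thumb state at address 4
print(state, hex(fetch(memory, state, pc)))
```

Because the state changes only at a call like `bx`, the hardware always knows which decoder the next fetch needs and can leave the others unpowered.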
Thumb modes can be enabled at any granularity. For example, a compiler can compile some functions in Thumb-2 mode and some in ARM mode, using Thumb-2 if the space savings outweigh the potential need for more instructions. In some cases, a loop might be compiled to Thumb-2 code, while the rest of the function is in ARM mode.
Slightly older ARM chips also included a mode called Jazelle, which enabled a decoder for Java bytecode instructions. Most of these instructions were executed directly, whereas the more complex instructions raised an interrupt. The Java virtual machine would catch the interrupt and interpret the complex instructions. This setup achieved similar performance to a JIT compiler, but with a lower memory footprint.
Modern ARM chips no longer include Jazelle mode. A modern handheld has 128 or 256MB of RAM, which is more than enough for a full just-in-time (JIT) compiler. Instead of Jazelle mode, they provide Thumb-2EE mode, which is a slightly modified version of Thumb-2, designed to be used as a target for JIT-compiling languages that run in virtual machines. This design includes instructions for things like bounds-checked array access.