
ZFS Uncovered

When Sun designed ZFS, they threw away the rule book and created something that has no direct analogue in any other UNIX-like system. David Chisnall looks at what changes they've made to conventional storage models, what assumptions are built into the system, and how it all fits together.

Every few years, someone makes a prediction about how much of a particular computing resource is likely to be needed in the future. Later, everyone laughs at how naïve they were. In designing ZFS, Sun has tried to avoid this mistake.

While the rest of the world is switching to 64-bit filesystems, Sun is rolling out a 128-bit filesystem. Are we ever going to need that much space? Not for a while. The mass of the Earth is something like 6 × 10^24 kilograms. If we were to take that much mass of hydrogen, we would have roughly 3.6 × 10^51 atoms. A 128-bit filesystem can index 2^128, or roughly 3.4 × 10^38, allocation units. If you built a storage system in which each hydrogen atom stored a single bit (discounting the space you would need for control logic), something the mass of the Earth would yield about 300 million such drives, each holding a 128-bit filesystem with 4KB allocation units. We’ll be building hard drives the size of continents before we hit disk space limits with ZFS.
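The back-of-envelope arithmetic above can be checked in a few lines of Python. The constants are rounded (Earth mass ~6 × 10^24 kg, hydrogen atom mass ~1.67 × 10^-27 kg), so treat the results as order-of-magnitude figures:

```python
# Order-of-magnitude check: how many bit-per-atom "drives" would an
# Earth's mass of hydrogen yield, if each drive backed a full 128-bit
# filesystem with 4KB allocation units?
EARTH_MASS_KG = 6e24
H_ATOM_MASS_KG = 1.67e-27
ALLOC_UNIT_BITS = 4096 * 8          # 4KB allocation unit, in bits

atoms = EARTH_MASS_KG / H_ATOM_MASS_KG      # ~3.6e51 hydrogen atoms
units = 2 ** 128                            # addressable allocation units
bits_per_fs = units * ALLOC_UNIT_BITS       # one atom stores one bit
drives = atoms / bits_per_fs

print(f"atoms:  {atoms:.1e}")   # ~3.6e51
print(f"drives: {drives:.1e}")  # ~3.2e8, a few hundred million
```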

So is there any point to a 128-bit filesystem? Not exactly. However, if current trends continue, we’ll start to hit the limits of a 64-bit filesystem in the next 5–10 years. An 80-bit filesystem would probably last long enough for other unforeseen limitations to cause it to be replaced before it ran out of space, but for most computers, dealing with 80-bit numbers is more difficult than dealing with 128-bit numbers. Thus, Sun went with a 128-bit system.

Endian Independence

When you write data to a disk (or a network), you have to be careful with the byte order. If you’re loading and storing data on just one machine, you can write out the contents of the machine’s registers however they happen to be represented. The problem comes when you start sharing data. Anything a byte or smaller is fine (unless you happen to be using a VAX), but larger quantities need to have a well-defined byte order.

The two most common byte orderings are named after the egg-eating philosophies in Jonathan Swift’s Gulliver’s Travels. Big-endian notation numbers the bytes as 1234, while little-endian machines store them in a 4321 order. Some machines use something like 1324, but generally people try to avoid mentioning those.
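The 1234/4321 numbering is easy to see in practice. Python’s `struct` module can serialize the same integer in either order, making the byte layouts visible:

```python
import struct

value = 0x01020304  # each byte is numbered by its significance

big = struct.pack(">I", value)     # big-endian: most significant byte first
little = struct.pack("<I", value)  # little-endian: least significant first

print(big.hex())     # "01020304" -- bytes appear in 1234 order
print(little.hex())  # "04030201" -- bytes appear in 4321 order
```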

Most filesystems are designed to run on a particular architecture. Although a filesystem may be ported to other architectures later, it tends to store metadata in the byte ordering native to its original architecture. Apple’s HFS+ is a good example of this practice. Since HFS+ originated on the PowerPC, its data structures are stored in big-endian format. On Intel Macs, the byte order has to be reversed on every load from or store to disk. The BSWAP instruction on x86 chips allows for quick reversal, however, so it’s not a huge performance hit.
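What BSWAP does in a single instruction can be expressed in Python by re-reading a value’s bytes in the opposite order (a demonstration of the operation, not of how a filesystem driver would actually do it):

```python
value = 0x01020304

# Reverse the byte order of a 32-bit value, the way x86's BSWAP
# instruction does in one operation: serialize big-endian, then
# reinterpret the same four bytes as little-endian.
swapped = int.from_bytes(value.to_bytes(4, "big"), "little")

print(hex(swapped))  # 0x4030201
```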

Sun is in an interesting position when it comes to byte order, selling and supporting Solaris on both SPARC64 and x86-64 machines. SPARC64 is big-endian and x86-64 is little-endian; whichever byte order Sun chose would leave ZFS running slower on one of its supported architectures.

Sun’s solution? Don’t choose. Every data structure in ZFS is written in the byte order of the machine writing it, along with a flag to indicate which byte order was used. A ZFS volume written by an Opteron machine will be little-endian; one controlled by an UltraSPARC will be big-endian. If you swap the disk between the two machines, it will still work; and since new writes always use the native order, the more you write to it, the more of it becomes optimized for native reading.
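A minimal sketch of the idea in Python (the one-byte flag and record layout here are invented for illustration; the real ZFS on-disk format differs):

```python
import struct
import sys

# Sketch of ZFS-style adaptive endianness. Each record is tagged with
# the byte order it was written in (flag byte: 0 = little, 1 = big);
# the reader decodes using the writer's order, so reading data written
# on the same architecture involves no byte swapping at all.
NATIVE_FLAG = 0 if sys.byteorder == "little" else 1

def write_record(value):
    """Serialize a 64-bit value in this machine's native byte order."""
    fmt = "<Q" if NATIVE_FLAG == 0 else ">Q"
    return bytes([NATIVE_FLAG]) + struct.pack(fmt, value)

def read_record(blob):
    """Decode a record written on any machine, swapping only if needed."""
    fmt = "<Q" if blob[0] == 0 else ">Q"
    return struct.unpack(fmt, blob[1:])[0]

# A record survives a round trip regardless of which machine wrote it.
print(read_record(write_record(0xDEADBEEF)))  # 3735928559
```

Note the asymmetry this buys: a volume moved between architectures always reads correctly, while the common case (reading data written locally) pays no swapping cost.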
