UNIX Disk Usage

Identifying the Biggest Files

We've explored the du command, sprinkled in a wee bit of sort for zest, and now it's time to accomplish a typical sysadmin task: Find the biggest files and directories in a given area of the system.

Task 3.4: Finding Big Files

The du command offers the capability to either find the largest directories, or the combination of the largest files and directories, but it doesn't offer a way to examine just files. Let's see what we can do to solve this problem.

  1. First off, it should be clear that the following command will produce a list of the five largest directories in my home directory:

    # du | sort -rn | head -5
    28484  .
    13984  ./Lynx
    10464  ./IBM
    6848   ./Lynx/src
    3092   ./Gator

    In a similar manner, here are the five largest directories in /usr/share and in the overall file system (ignoring the likely /proc errors):

    # du /usr/share | sort -rn | head -5
    543584 /usr/share
    200812 /usr/share/doc
    53024  /usr/share/gnome
    48028  /usr/share/gnome/help
    31024  /usr/share/apps
    # du / | sort -rn | head -5
    1471213 /
    1257652 /usr
    543584  /usr/share
    436648  /usr/lib
    200812  /usr/share/doc

    All well and good, but how do you find and test just the files?
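
The du pipeline above can be wrapped in a small shell function. This is only a sketch: the name largest_dirs and its two arguments are made up for illustration, and -k is added so that du reports 1KB units no matter what block size the local system defaults to.

```shell
# Hypothetical helper: list the N largest directories under a starting point.
# Usage: largest_dirs [directory] [count]
largest_dirs() {
    dir=${1:-.}
    count=${2:-5}
    # -k forces 1KB units; errors such as unreadable directories are discarded.
    du -k "$dir" 2>/dev/null | sort -rn | head -n "$count"
}
```

Calling largest_dirs /usr/share 5 should give the same kind of listing as the second command above, with the starting directory itself on the first line.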

  2. The easiest solution is to use the find command. find will be covered in greater detail later in the book, but for now, just remember that find lets you quickly search through the entire file system, and performs the action you specify on all files that match your selection criteria.

    For this task, we want to isolate our choices to all regular files, which will omit directories, device files, and other unusual file system entries. That's done with -type f.

    In addition, we're going to use the -printf option of find to produce exactly the output that we want from the matched files. In this instance, we'd like the disk space each file uses in 1KB blocks, and the filename with its path from the starting directory. That's surprisingly easy to accomplish with a printf format string of %k %p.

    Put all these together and you end up with the command

    find . -type f -printf "%k %p\n"

    The two additions here are the ., which tells find to start its search in the current directory, and the \n sequence in the format string, which is translated into a newline after each entry.
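
To get a feel for the format string, it can help to print more than one size directive side by side. The sketch below assumes GNU find; the function name list_sizes is hypothetical, and %s (size in bytes) is an extra directive that the text above doesn't use.

```shell
# Sketch for GNU find only, because -printf is a GNU extension.
#   %k  disk space used, in 1KB blocks
#   %s  file size in bytes
#   %p  filename, including the path from the starting point
#   \t  tab, \n newline
list_sizes() {
    find "${1:-.}" -type f -printf "%k\t%s\t%p\n"
}
```

The two numbers usually differ, because %k counts whole allocated blocks while %s counts the bytes actually stored in the file.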

TIP

Don't worry too much if this all seems like Greek to you right now. Hour 12, "Managing Disk Quotas," will talk about the many wonderful features of find. For now, just type in what you see here in the book.

  3. Let's see it in action:

    # find . -type f -printf "%k %p\n" | head
    4 ./.kde/Autostart/Autorun.desktop
    4 ./.kde/Autostart/.directory
    4 ./.emacs
    4 ./.bash_logout
    4 ./.bash_profile
    4 ./.bashrc
    4 ./.gtkrc
    4 ./.screenrc
    4 ./.bash_history
    4 ./badjoke

    You can see where the sort command is going to prove helpful! In fact, let's preface head with a sort -rn to identify the ten largest files in and below the current directory:

    # find . -type f -printf "%k %p\n" | sort -rn | head
    8488 ./IBM/j2sdk-1_3_0_02-solx86.tar
    1812 ./Gator/Snapshots/MAILOUT.tar.Z
    1208 ./IBM/fop.jar
    1076 ./Lynx/src/lynx
    1076 ./Lynx/lynx
    628  ./Gator/Lists/Inactive-NonAOL-list.txt
    496  ./Lynx/WWW/Library/Implementation/libwww.a
    480  ./Gator/Lists/Active-NonAOL-list.txt
    380  ./Lynx/src/GridText.c
    372  ./Lynx/configure

    Very interesting information to be able to ascertain, and it'll even work across the entire file system (though it might take a few minutes, and, as usual, you might see some /proc hiccups):

    # find / -type f -printf "%k %p\n" | sort -rn | head
    26700 /usr/lib/libc.a
    19240 /var/log/cron
    14233 /var/lib/rpm/Packages
    13496 /usr/lib/netscape/netscape-communicator
    12611 /tmp/partypages.tar
    9124  /usr/lib/librpmdb.a
    8488  /home/taylor/IBM/j2sdk-1_3_0_02-solx86.tar
    5660  /lib/i686/libc-2.2.4.so
    5608  /usr/lib/qt-2.3.1/lib/libqt-mt.so.2.3.1
    5588  /usr/lib/qt-2.3.1/lib/libqt.so.2.3.1

    Recall that the output is in 1KB blocks, so libc.a is pretty huge at more than 26MB!
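
One way to sidestep the /proc hiccups is to keep find on a single file system. The following is a sketch rather than a drop-in tool: it assumes GNU find, the name biggest_files is invented here, and -xdev also skips any other mounted file systems (a separate /home partition, for example), which may or may not be what you want.

```shell
# Hypothetical helper, GNU find assumed (for -printf).
# Usage: biggest_files [start] [count]
biggest_files() {
    start=${1:-.}
    count=${2:-10}
    # -xdev stops find from descending into other mounted file systems,
    # so /proc is never entered when start is /.
    # 2>/dev/null hides permission-denied messages.
    find "$start" -xdev -type f -printf "%k %p\n" 2>/dev/null |
        sort -rn | head -n "$count"
}
```

With this in place, biggest_files / 10 plays the role of the last command above.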

  4. You might find that your version of find doesn't include the snazzy new GNU find -printf flag (neither Solaris nor Darwin does, for example). If that's the case, you can at least fake it in Darwin, with the somewhat more convoluted

    # find . -type f -print0 | xargs -0 ls -s | sort -rn | head
    781112 ./Documents/Microsoft User Data/Office X Identities/Main Identity/Database
     27712 ./Library/Preferences/Explorer/Download Cache
     20824 ./.Trash/palmdesktop40maceng.sit
     20568 ./Library/Preferences/America Online/Browser Cache/IE Cache.waf
     20504 ./Library/Caches/MS Internet Cache/IE Cache.waf
     20496 ./Library/Preferences/America Online/Browser Cache/IE Control Cache.waf
     20496 ./Library/Caches/MS Internet Cache/IE Control Cache.waf
     20488 ./Library/Preferences/America Online/Browser Cache/cache.waf
     20488 ./Library/Caches/MS Internet Cache/cache.waf
     18952 ./.Trash/Palm Desktop Installer/Contents/MacOSClassic/Installer

    Here we not only have to print the filenames and feed them to the xargs command, we also have to compensate for the fact that many of the filenames have spaces in them. By default, xargs splits its input on whitespace, so a name like Download Cache would be handed to ls as two separate arguments. To avoid that, find has a -print0 option that terminates each filename with a null character, and the -0 flag tells xargs that it's getting null-terminated filenames.
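
The null-terminated pipeline can be packaged as a function for reuse. This sketch assumes a find that supports -print0 and an xargs that supports -0; the name sizes_null_safe is hypothetical.

```shell
# Hypothetical helper: block counts for every regular file, safe for
# filenames that contain spaces.
# Usage: sizes_null_safe [start]
sizes_null_safe() {
    # find ends each name with a null byte and xargs -0 splits on that byte,
    # so embedded spaces stay inside a single argument to ls.
    # Note: if find matches nothing, GNU xargs still runs ls once.
    find "${1:-.}" -type f -print0 | xargs -0 ls -s | sort -rn
}
```

Piping the result through head gives the same top-ten view as before.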

CAUTION

Actually, Darwin doesn't really like this kind of command at all. If you want to ascertain the largest files, you'd be better served to explore the -ls option to find and then an awk to chop out the file size:

find /home -type f -ls | awk '{ print $7" "$11 }'

Of course, this is a slower alternative that'll work on any Unix system, if you really want. Note that the size field printed by -ls is in bytes rather than blocks.
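
To turn the -ls approach into a "largest files" report, the same sort and head stages can be added on the end. This is a sketch with a made-up function name, biggest_by_ls; it assumes the usual -ls layout in which field 7 is the size in bytes and field 11 is the filename, so any name containing whitespace will not come through intact.

```shell
# Hypothetical helper built on find -ls and awk.
# Usage: biggest_by_ls [start] [count]
biggest_by_ls() {
    # awk keeps the size (field 7, bytes) and the name (field 11),
    # then sort -rn puts the largest files first.
    find "${1:-.}" -type f -ls |
        awk '{ print $7" "$11 }' |
        sort -rn | head -n "${2:-10}"
}
```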

  5. To just calculate the sizes of all files in a Solaris system, you can't use -printf or -print0, but if you omit the concern for filenames with spaces in them (considerably less likely on a more traditional Unix environment like Solaris anyway), you'll find that the following works fine:

    # find / -type f -print | xargs ls -s | sort -rn | head
    55528 /proc/929/as
    26896 /proc/809/as
    26832 /usr/j2se/jre/lib/rt.jar
    21888 /usr/dt/appconfig/netscape/.netscape.bin
    21488 /usr/java1.2/jre/lib/rt.jar
    20736 /usr/openwin/lib/locale/zh_TW.BIG5/X11/fonts/TT/ming.ttf
    18064 /usr/java1.1/lib/classes.zip
    16880 /usr/sadm/lib/wbem/store
    16112 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/index/index.dat
    15832 /proc/256/as

    Actually, you can see that the address-space files for a couple of running processes have snuck into the listing (the /proc directory). We'll need to screen those out with a simple grep -v:

    # find / -type f -print | xargs ls -s | sort -rn | grep -v '/proc' | head
    26832 /usr/j2se/jre/lib/rt.jar
    21888 /usr/dt/appconfig/netscape/.netscape.bin
    21488 /usr/java1.2/jre/lib/rt.jar
    20736 /usr/openwin/lib/locale/zh_TW.BIG5/X11/fonts/TT/ming.ttf
    18064 /usr/java1.1/lib/classes.zip
    16880 /usr/sadm/lib/wbem/store
    16112 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/index/index.dat
    12496 /usr/openwin/lib/llib-lX11.ln
    12160 /opt/answerbooks/english/solaris_8/SUNWaman/books/REFMAN3B/ebt/REFMAN3B.edr
    9888  /usr/j2se/src.jar
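
One thing to watch with grep -v '/proc' is that it drops any line containing /proc anywhere in the path, not just entries at the top of the file system. A slightly stricter filter is sketched below; the function name drop_proc is hypothetical, and the pattern assumes the "block count, space, path" layout that ls -s produces above.

```shell
# Hypothetical filter: drop only entries whose path starts with /proc/.
# Input lines look like "55528 /proc/929/as" (block count, then path).
drop_proc() {
    grep -v '^ *[0-9][0-9]* /proc/'
}
```

It slots into the pipeline in place of the grep stage: find / -type f -print | xargs ls -s | sort -rn | drop_proc | head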

The find command is somewhat like a Swiss army knife. It can do hundreds of different tasks in the world of Unix. For our use here, however, it's perfect for analyzing disk usage on a per-file basis.
