Item 57. Leave most of the data on disk to save memory.

Data sets today can be huge. Whether you are sequencing DNA or parsing weblogs, the amount of data you collect can easily exceed what your program can hold in memory. Perl programmers who work with large data sets often run into the dreaded "Out of memory!" error.

When this happens, there are a few things you can do. One idea is to check how much memory your process can use. The fix might be as simple as having your operating system allocate more memory to the program.

Increasing memory limits is really only a bandage for larger algorithmic problems. If the data you are working with can grow, you're bound to hit memory limits again.

There are a few strategies that you can use to reduce the memory footprint of your program.

Read files line-by-line

The first and most obvious strategy is to read the data you are processing line-by-line instead of loading entire data sets into memory. You could read an entire file into an array:

open my ($fh), '<', $file or die;
my @lines = <$fh>;

However, if you don't need all of the data at once, read only as much as you need for the next operation:

open my ($fh), '<', $file or die;

while (<$fh>) {
  #... do something with the line
}

Store large hashes in DBM files

A common problem is having to cycle through one huge data set while looking up values keyed in another, potentially large, data set. For instance, you might have a lookup file of names by ID and a log file of IDs and the times when each ID logged in to your system. If the set of lookup data is sufficiently large, it might be wise to load it into a hash that is backed by a DBM file. This keeps the lookup data on the filesystem, freeing up memory. In the build_lookup subroutine in the example below, it looks like you have all of the data in memory, but you've actually stored it in a file connected to a tied hash:

use Fcntl;  # For O_RDWR, O_CREAT, etc.

my ( $lookup_file, $data_file ) = @ARGV;

my $lookup = build_lookup($lookup_file);

open my ($data_fh), '<', $data_file or die;

while (<$data_fh>) {
  chomp;
  my @row = split;
  if ( exists $lookup->{ $row[0] } ) {
    print "@row\n";
  }
}

sub build_lookup {
  my ($file) = @_;
  open my ($lookup_fh), '<', $file
    or die "Couldn't open '$file': $!";

  # Tie the hash to an SDBM file named after the process ID, so
  # the entries are stored on disk instead of in memory
  require SDBM_File;
  my $dbm_file = "lookup.$$";
  tie( my %lookup, 'SDBM_File', $dbm_file,
    O_RDWR | O_CREAT, 0666 )
    or die
    "Couldn't tie SDBM file '$dbm_file': $!; aborting";

  while (<$lookup_fh>) {
    chomp;
    my ( $key, $value ) = split;
    $lookup{$key} = $value;
  }

  return \%lookup;
}

Building the lookup can be costly, so you want to minimize the number of times that you have to do it. If possible, prebuild the lookup DBM file and just load it at run time. Once you have it, you shouldn't have to rebuild it. You can even share it between programs.
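As a rough sketch of that split between building and loading, the two steps could live in separate subroutines. The names prebuild_lookup and load_lookup are hypothetical, and the code assumes the source file holds whitespace-separated "key value" lines as in the example above:

```perl
use strict;
use warnings;
use Fcntl;        # O_RDWR, O_CREAT, O_RDONLY
use SDBM_File;

# Build the DBM file once, ahead of time (hypothetical helper).
# Assumes each line of $source is "key value".
sub prebuild_lookup {
  my ( $source, $dbm_name ) = @_;

  tie( my %lookup, 'SDBM_File', $dbm_name, O_RDWR | O_CREAT, 0666 )
    or die "Couldn't create SDBM file '$dbm_name': $!";

  open my ($fh), '<', $source or die "Couldn't open '$source': $!";
  while (<$fh>) {
    chomp;
    my ( $key, $value ) = split;
    next unless defined $value;    # skip blank or incomplete lines
    $lookup{$key} = $value;
  }
  close $fh;

  untie %lookup;    # release the DBM file
  return;
}

# At run time, attach to the existing DBM file read-only
# (hypothetical helper). Nothing is rebuilt.
sub load_lookup {
  my ($dbm_name) = @_;

  tie( my %lookup, 'SDBM_File', $dbm_name, O_RDONLY, 0 )
    or die "Couldn't open SDBM file '$dbm_name': $!";

  return \%lookup;
}
```

A separate setup script would call prebuild_lookup once. Every program that needs the data would then call load_lookup with the same DBM name.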

SDBM_File is the DBM implementation that comes with Perl. It doesn't scale very well, and it limits the combined size of each key and its value to about 1 KB. If you have NDBM_File or GDBM_File available on your system, opt for those instead.

Read files as if they were arrays

If key-based lookup by way of a hash isn't flexible enough, you can use Tie::File to treat a file's lines as an array, even though you don't have them in memory. You can navigate the file as if it were a normal array. You can access any line in the file at any time, like in this random fortune printing program:

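use Tie::File;

my ($fortune_file) = @ARGV;

tie my @fortunes, 'Tie::File', $fortune_file
  or die "Unable to tie $fortune_file: $!";

# Tie::File strips the line ending from each record by default,
# so add the newline back when printing
foreach ( 1 .. 10 ) {
  print $fortunes[ rand @fortunes ], "\n";
}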
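Because the tied array is connected to the file, assigning to its elements changes the file. If you only want to read, one option is to open the file read-only and cap Tie::File's cache. The following is a minimal sketch; pick_lines is a hypothetical helper, and the 1 MB cache size is an arbitrary choice:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDONLY);
use Tie::File;

# Return the lines at the given 0-based indexes (hypothetical helper)
sub pick_lines {
  my ( $file, @indexes ) = @_;

  tie my @lines, 'Tie::File', $file,
    mode   => O_RDONLY,      # never modify the file through the array
    memory => 1_000_000      # cap the read cache at about 1 MB
    or die "Unable to tie '$file': $!";

  # Each element is fetched from the file on demand
  my @picked = @lines[@indexes];

  untie @lines;
  return @picked;
}
```

The mode and memory options are part of Tie::File's documented interface. The returned lines have their line endings removed, since Tie::File chomps records by default.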

Use temporary files and directories

If these prebuilt solutions don't work for you, you can always write temporary files yourself. The File::Temp module helps by automatically creating a unique temporary file name and by cleaning up the file after you are done with it. This can be especially handy if you need to completely create a new version of a file, but replace it only once you're done creating it:

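use File::Temp qw(tempfile);

# The first argument names the file to replace; the remaining
# arguments (or standard input) supply the data
my $final_name = shift @ARGV;

my ( $fh, $file_name ) = tempfile();

while (<>) {
  print {$fh} uc $_;
}

$fh->close or die "Couldn't close $file_name: $!";

# Replace the old version only now that the new one is complete
rename $file_name => $final_name
  or die "Couldn't rename $file_name to $final_name: $!";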
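One caveat with this pattern is that rename generally cannot move a file across filesystem boundaries, and tempfile() with no arguments creates its file in the system temporary directory. A possible variation, sketched here with a hypothetical replace_with_uppercase helper, uses File::Temp's DIR option to create the temporary file next to the target:

```perl
use strict;
use warnings;
use File::Basename qw(dirname);
use File::Temp qw(tempfile);

# Write an upper-cased copy of $source to $final_name, replacing
# $final_name only after the new contents are completely written
# (hypothetical helper)
sub replace_with_uppercase {
  my ( $source, $final_name ) = @_;

  # DIR puts the temporary file in the target's directory, so the
  # rename stays within one filesystem
  my ( $tmp_fh, $tmp_name ) = tempfile( DIR => dirname($final_name) );

  open my ($in), '<', $source or die "Couldn't open '$source': $!";
  while (<$in>) {
    print {$tmp_fh} uc $_;
  }
  close $in;
  close $tmp_fh or die "Couldn't close '$tmp_name': $!";

  rename $tmp_name => $final_name
    or die "Couldn't rename '$tmp_name' to '$final_name': $!";
  return;
}
```

File::Temp creates its files with restrictive permissions, so you may need to chmod the result if other users have to read it.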

File::Temp can even create a temporary directory that you can use to store multiple files in. You can fetch several Web pages and store them for later processing:

use File::Temp qw(tempdir);
use File::Spec::Functions;
use LWP::Simple qw(getstore);

my ($temp_dir) = tempdir( CLEANUP => 1 );

my %searches = (
  google    => 'http://www.google.com/#hl=en&q=perl',
  yahoo     => 'http://search.yahoo.com/search?p=perl',
  microsoft => 'http://www.bing.com/search?q=perl',
);

foreach my $search ( keys %searches ) {
  getstore( $searches{$search},
    catfile( $temp_dir, $search ) );
}

There's one caution with File::Temp: it opens its files in binary mode. If you need line-ending translations or a different encoding (Item 73), you have to use binmode on the filehandle yourself.
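For example, to write text as UTF-8 you might add an encoding layer right after creating the file. This is only a sketch; write_utf8_temp is a hypothetical helper, and UTF-8 is an assumed choice of encoding:

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write $text to a new temporary file as UTF-8 and return the
# file name (hypothetical helper)
sub write_utf8_temp {
  my ($text) = @_;

  # UNLINK => 1 removes the file when the program exits
  my ( $fh, $file_name ) = tempfile( UNLINK => 1 );

  # Add the encoding layer yourself; File::Temp won't do it for you
  binmode $fh, ':encoding(UTF-8)'
    or die "Couldn't set encoding on '$file_name': $!";

  print {$fh} $text;
  close $fh or die "Couldn't close '$file_name': $!";

  return $file_name;
}

my $file_name = write_utf8_temp("caf\x{e9} \x{263A}\n");
print "Wrote ", -s $file_name, " bytes to $file_name\n";
```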

Things to remember

  • Store large hashes on disk in DBM files to save memory.
  • Treat files as arrays with Tie::File.
  • Use File::Temp to create temporary files and directories.