
Working with very large hashes

Category: Linux,Perl   — Published by tengo on June 17, 2008 at 10:59 am

Recently, I had to wrangle a large dataset with over 3 million key-value pairs. I needed to iterate over them in sorted order, and I needed the hash structure to weed out already-seen keys.

My first approach was to build a hash in memory with the usual my %hash, then add keys and values in a giant loop. It was fast, but also very memory intensive. The problems came when I needed to go over the hash again to do some tidying up, and when merging it with another, even larger hash. The result was a huge resource hog. Time to redesign the code.

So the second iteration was to keep the hash as small as possible. I thought it would be a good idea to use delete() to throw away unneeded overhead keys... Wrong! My machine essentially locked up and went into serious disk thrashing. Time to redesign my code again.

Finally, the solution was to operate in chunks and borrow a few bits from the MapReduce box of tricks. If you face similar problems, try to keep the actual data the script holds in memory small and the design lean.
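To give you an idea, here is a minimal sketch of that chunked approach (the file names and the chunk size are made up, adjust them to your own setup): read the input in manageable portions, weed out duplicates within the current chunk, write every chunk out sorted, and finally merge the pre-sorted chunk files.

use strict;
use warnings;

my $chunk_size = 500_000;     # tune this to what comfortably fits in memory
my (%seen, $chunk_no);

# input.tsv is a made-up name: tab-separated key/value lines
open my $in, '<', 'input.tsv' or die "Cannot open input: $!";
while (my $line = <$in>) {
    chomp $line;
    my ($key, $value) = split /\t/, $line, 2;
    $seen{$key} = $value unless exists $seen{$key};    # weed out already-seen keys
    if (keys(%seen) >= $chunk_size) {
        flush_chunk(\%seen, ++$chunk_no);
        %seen = ();                                    # keep the in-memory hash small
    }
}
flush_chunk(\%seen, ++$chunk_no) if %seen;
close $in;

sub flush_chunk {
    my ($hash, $n) = @_;
    open my $out, '>', "chunk.$n.tsv" or die "Cannot write chunk: $!";
    print {$out} "$_\t$hash->{$_}\n" for sort keys %$hash;
    close $out;
}

# Afterwards merge the pre-sorted chunks, e.g. with: sort -m chunk.*.tsv
# Duplicates that ended up in different chunks will sit next to each other
# after the merge and can be dropped in that pass (the "reduce" step).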

Lesson 1: Don't try delete(), as it is very resource intensive. Create a new hash instead.

my %newhash;
for my $key (keys %oldhash){
    if($some_case_is_true){              # i.e. the key is worth keeping
        $newhash{$key} = $oldhash{$key};
    }
}

Iterate over the old one, skip what you would normally delete, and write to a new hash. This is especially true for large tied hashes (where every delete seems to trigger a complete rebuild of the hash...)!

Lesson 2: Don't try a sort() while iterating over a large hash. Something like:

for my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash){
}

is deadly (especially sorting by value). Think about your design: why do you need sorted keys, and if you do, why didn't you sort them in the first place, while building the hash? In my case the hash was so large that the only way to handle it gracefully in my environment was a tied hash. That gave me the option of using the BTREE type via DB_File, which keeps the keys sorted very efficiently as the hash is built. With this construct I could then access the keys in sorted order.

use DB_File;
use Fcntl;    # supplies the O_RDWR / O_CREAT flags

tie my %hash, 'DB_File', "hash.dbfile", O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie hash.dbfile: $!";
# ... fill the hash ...
for my $key (keys %hash){    # the BTREE hands the keys back in sorted order
}
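The sort order at build time doesn't have to be the default lexical cmp, by the way. DB_File lets you hand the BTREE your own compare routine before you tie the hash, so the keys come back in whatever order you need. A small sketch, here with a made-up file of numeric keys:

use DB_File;
use Fcntl;

# must be set before the tie; the routine works like sort's, returning -1, 0 or 1
$DB_BTREE->{'compare'} = sub {
    my ($key1, $key2) = @_;
    return $key1 <=> $key2;      # numeric instead of lexical order
};

tie my %hash, 'DB_File', "numeric.dbfile", O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie: $!";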

Lesson 2a: Don't force your hash into a list in any way (by calling keys)

For example, when you want to flatten out your hash, never "prepare a list of all hash keys", which is essentially what happens when you call keys. If you can't avoid walking the whole hash, use each instead, which fetches one key-value pair at a time rather than building the full key list up front.

while(my ($key,$value) = each %{$hash}){
 # do something with $key and/or $value;
}

Lesson 3: Your hash will outgrow memory

Depending on your key/value size, you will sooner or later hit a wall in terms of system memory. Period.

And when that happens, you can decide between some sort of bucketing, effectively splitting your hash into manageable chunks, or trading away a little precision, which can be a solution if you use your hash as a lookup database, for example by putting Bloom::Faster, a Bloom filter implementation, into the mix.
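A minimal sketch of the "trade a little precision" route, assuming Bloom::Faster's documented new/add/check interface (n is the expected number of keys, e the acceptable false-positive rate; get_next_key() and process() are just placeholders for your own code):

use Bloom::Faster;

my $seen = Bloom::Faster->new({ n => 3_000_000, e => 0.001 });

while (defined(my $key = get_next_key())) {    # get_next_key() is a placeholder for your data source
    next if $seen->check($key);                # probably seen before (small false-positive rate applies)
    $seen->add($key);
    process($key);                             # process() is a placeholder, too
}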

Lesson 4: Are you misusing your hash as a database?

The previous lesson revealed it: sometimes hashes grow large because you are using them as some sort of lookup database, and a real database might be the way to go. Yes, a hash is a kind of in-memory database, but database engineers have been tackling lookup challenges like yours for ages, they have found suitable solutions, and the performance hit is often not as big as you might think. Not every database is disk-based and therefore slow, either: many use memory caching, sharding across machines with in-memory lookups, and so on.
Benchmark your lookups against a tuned (indexed) database setup and see whether the performance hit, compared to a vanilla in-memory hash, is really that big, given that the database approach won't fall apart the way your vanilla hash currently does... Don't reinvent the wheel.
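As a starting point, here is a rough benchmark sketch along those lines, comparing a plain in-memory hash against an indexed SQLite table via DBI (needs DBD::SQLite; the file name and the key set are made up):

use strict;
use warnings;
use Benchmark qw(cmpthese);
use DBI;

# hypothetical key set, 100k keys in memory
my %hash = map { ("key$_" => $_) } 1 .. 100_000;

my $dbh = DBI->connect('dbi:SQLite:dbname=lookup.db', '', '', { RaiseError => 1 });
$dbh->do('CREATE TABLE IF NOT EXISTS kv (k TEXT PRIMARY KEY, v INTEGER)');
my $ins = $dbh->prepare('INSERT OR REPLACE INTO kv (k, v) VALUES (?, ?)');
$dbh->begin_work;
$ins->execute("key$_", $_) for 1 .. 100_000;
$dbh->commit;

my $sel = $dbh->prepare('SELECT v FROM kv WHERE k = ?');

# run each variant for at least 3 CPU seconds and compare
cmpthese(-3, {
    hash   => sub { my $v = $hash{'key50000'} },
    sqlite => sub {
        $sel->execute('key50000');
        my ($v) = $sel->fetchrow_array;
    },
});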