Jun 17

Working with very large hashes

Tag: Uncategorizedtengo @ 10:59 am

Recently, I had to wrangle a large dataset, with over 3 million key-value pairs. I need to iterate over them in a sorted way and I needed the hash-structure to weed out “already-seen-keys”.

My first approach was to build a hash in memory, with the usual my %hash, then adding keys and values in a giant loop. It was fast, but also very memory intensive. The problems came when I needed to have a look at the hash again to do some tidy up and when merging it with another even larger hash. The result was a large resources hog. Time to redesign the code.

So the second iteration was to keep the hash as small as possible. I though a good idea was to use delete() to throw away unneeded overhead keys… Wrong! My machine essentially locked up and got into serious disk thrashing. Time to redesign my code again.

Finally the solution was to operate in chunks and use bits out of the MapReduce box of tricks.If you face similar problems, try to keep the actual data the script has in memory small and the design lean.

Lesson 1: Don’t try delete(), as it is very resource intensive. Create a new hash instead.

for my $key (keys %oldhash){

if($some_case_is_true){

$newhash{$key} = $oldhash{$key};

}

}

Iterate over the old one, skip
what you would normally delete, and write to a new hash. This is especially true for large tied hashes (where every delete seems to start a complete new build of the hash…)!

Lesson 2: Don’t try a sort() while iterating over a large hash. Something like:

for my $key (sort { $hash{$a} <=> $hash{$b} } keys %hash){

}

is deadly (especially sorting by value). Think about your design. Why do you need sorted keys and if so, why didn’t you sort them in the first place, while building the hash? In my case the hash was so large that the only way to handle it gracefully in my environment was to use a tied hash. This gave me the option to use the BTREE type of hash, which does a very efficient sort on the hash’s build time. By using this construct, I could then access the keys in a nice sorted way.

tie %hash, ‘DB_File’, “hash.dbfile”, $flags, $mode, $DB_BTREE;

for my $key (keys %hash){

}


Random Posts

Leave a Reply