<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/rss2full.xsl" type="text/xsl" media="screen"?><?xml-stylesheet href="http://feeds.feedburner.com/~d/styles/itemcontent.css" type="text/css" media="screen"?><!-- generator="wordpress/2.3.3" --><rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0" version="2.0">

<channel>
	<title>Zen of Perl</title>
	<link>http://perl.goeszen.com</link>
	<description>rant, meditate and then truly go zen about perl</description>
	<pubDate>Mon, 21 Jul 2008 20:17:25 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.3.3</generator>
	<language>en</language>
			<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/GoesZenPerl" type="application/rss+xml" /><feedburner:browserFriendly></feedburner:browserFriendly><item>
		<title>Compile a perl script that uses Wx with pp and PAR</title>
		<link>http://perl.goeszen.com/compile-a-perl-script-that-uses-wx-with-pp-and-par.html</link>
		<comments>http://perl.goeszen.com/compile-a-perl-script-that-uses-wx-with-pp-and-par.html#comments</comments>
		<pubDate>Sun, 20 Jul 2008 16:30:27 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[compile]]></category>

		<category><![CDATA[par]]></category>

		<category><![CDATA[pp]]></category>

		<category><![CDATA[wx]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/compile-a-perl-script-that-uses-wx-with-pp-and-par.html</guid>
		<description><![CDATA[Compiling scripts with pp, PAR&#8217;s helper script, to executable binaries (.exe files on Win32) should be a pretty straightforward process. Anyway, if you are developing GUI applications, probably with Wx, you will surely run into some problems.
Upon execution, the generated execs will complain about missing libraries, DLLs or similar. This is because pp does a [...]]]></description>
			<content:encoded><![CDATA[<p>Compiling scripts with pp, PAR&#8217;s helper script, to executable binaries (.exe files on Win32) should be a pretty straightforward process. However, if you are developing GUI applications, for example with Wx, you will most likely run into some problems.</p>
<p>Upon execution, the generated executables will complain about missing libraries, DLLs or similar. This is because pp does a lot, but it does not see the dependencies of the Wx bindings. So you need to tell it that there are additional libraries to include in the build. You can do so by using the -M switch, or by using <a href="http://search.cpan.org/%7Emdootson/">Mark Dootson</a>&#8217;s excellent <a href="http://search.cpan.org/perldoc?Wx::Perl::Packager">Wx::Perl::Packager</a>! On Win32 ActivePerl setups, you might want to do &#8220;ppm install http://www.wxperl.co.uk/repository/Wx-Perl-Packager.ppd&#8221;, as not all repositories carry a ppm for this module.</p>
<p>Then, simply add this line at the very top of your Wx-using script:</p>
<blockquote><p>use Wx::Perl::Packager;</p></blockquote>
<p>and then use Wx::Perl::Packager&#8217;s &#8220;wxpar&#8221; drop-in replacement for pp. So, to compile, do:</p>
<blockquote><p>wxpar -o myapp.exe myapp.pl</p></blockquote>
<p>optionally adding a bit of salt like in the docs</p>
<blockquote><p>wxpar --gui --icon=myicon.ico -o myprog.exe myscript.pl</p></blockquote>
<p>and there you go.</p>
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/compile-a-perl-script-that-uses-wx-with-pp-and-par.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Shorthand if-clause</title>
		<link>http://perl.goeszen.com/shorthand-if-clause.html</link>
		<comments>http://perl.goeszen.com/shorthand-if-clause.html#comments</comments>
		<pubDate>Sat, 21 Jun 2008 14:10:09 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[if]]></category>

		<category><![CDATA[shorthand]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/shorthand-if-clause.html</guid>
		<description><![CDATA[There is a handy short version for a classic if-else-statement that is very useful, but everytime I&#8217;d like to use it, I just can&#8217;t fully remember what its syntax was. And looking it up on google is hard, because &#8220;if&#8221; is a very common word&#8230; It is especially useful on initializing variables in a cgi [...]]]></description>
			<content:encoded><![CDATA[<p>There is a handy short version of the classic if-else statement that is very useful, but every time I&#8217;d like to use it, I just can&#8217;t fully remember what its syntax was. And looking it up on Google is hard, because &#8220;if&#8221; is a very common word&#8230; It is especially useful for initializing variables in a CGI environment, where the start value of a $var depends on a passed form value.</p>
<p>So here it is. Instead of writing:</p>
<blockquote><p>$form-&gt;{var} = undef;<br />
my $var;<br />
if($form-&gt;{var}){<br />
$var = $form-&gt;{var};<br />
}else{<br />
$var = 1;<br />
}</p></blockquote>
<p>use this elegant shorthand if-clause:</p>
<blockquote><p>my $var = $form-&gt;{var} ? $form-&gt;{var} : 1;</p></blockquote>
<p>which is short for: <em>if $form-&gt;{var} is true ($form-&gt;{var} set), use it, else default to 1</em></p>
<p><strong>Another shorthand variable declaration</strong></p>
<p>The ternary form above is useful when the condition, the value for the true case and the value for the false case are three different expressions. If you just need to test whether a passed variable holds a value/is true and then use that same value, another construct might be useful:</p>
<blockquote><p>$var = $form-&gt;{var} || 1;</p></blockquote>
<p>which is short for: <em>if $form-&gt;{var} holds a (true) value, use it, else default to 1</em></p>
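<p>One caveat is worth a small sketch: both shorthands test for <em>truth</em>, so a legitimate value of 0 or an empty string is replaced by the default as well. On Perl 5.10 or newer, the defined-or operator <em>//</em> only falls back when the value is undefined. The form data below is made up for illustration:</p>
<pre>use strict;
use warnings;
use 5.010;    # the // operator needs Perl 5.10 or newer

# hypothetical form data: the user explicitly passed 0, and no name at all
my $form = { var =&gt; 0, name =&gt; undef };

my $with_or  = $form-&gt;{var}  || 1;             # 0 is false, so the default wins
my $with_dor = $form-&gt;{var}  // 1;             # 0 is defined, so it is kept
my $fallback = $form-&gt;{name} // 'anonymous';   # undef, so the default is used

print "with ||: $with_or\n";     # with ||: 1
print "with //: $with_dor\n";    # with //: 0
print "name: $fallback\n";       # name: anonymous</pre>
<p>So <em>||</em> is fine as long as 0 and the empty string are not valid inputs; otherwise <em>//</em> (or an explicit <em>defined</em> test) is the safer choice.</p>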
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/shorthand-if-clause.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Working with very large hashes</title>
		<link>http://perl.goeszen.com/working-with-very-large-hashes.html</link>
		<comments>http://perl.goeszen.com/working-with-very-large-hashes.html#comments</comments>
		<pubDate>Tue, 17 Jun 2008 17:59:18 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[hashes]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/working-with-very-large-hashes.html</guid>
		<description><![CDATA[Recently, I had to wrangle a large dataset, with over 3 million key-value pairs. I need to iterate over them in a sorted way and I needed the hash-structure to weed out &#8220;already-seen-keys&#8221;.
My first approach was to build a hash in memory, with the usual my %hash, then adding keys and values in a giant [...]]]></description>
			<content:encoded><![CDATA[<p>Recently, I had to wrangle a large dataset with over 3 million key-value pairs. I needed to iterate over them in a sorted way and I needed the hash structure to weed out &#8220;already-seen keys&#8221;.</p>
<p>My first approach was to build a hash in memory, with the usual <em>my %hash</em>, then adding keys and values in a giant loop. It was fast, but also very memory intensive. The problems came when I needed to have a look at the hash again to do some tidying up and when merging it with another, even larger hash. The result was a large resource hog. Time to redesign the code.</p>
<p>So the second iteration was to keep the hash as small as possible. I thought a good idea was to use <em>delete()</em> to throw away unneeded overhead keys&#8230; Wrong! My machine essentially locked up and got into serious disk thrashing. Time to redesign my code again.</p>
<p>Finally, the solution was to operate in chunks and use bits out of the <a href="http://en.wikipedia.org/wiki/MapReduce">MapReduce</a> box of tricks. If you face similar problems, try to keep the actual data the script has in memory small and the design lean.</p>
<p><strong>Lesson 1</strong>: Don&#8217;t try <em>delete()</em>, as it is very resource intensive. Create a new hash instead.</p>
<p><em>for my $key (keys %oldhash){</em></p>
<p><em>    if($some_case_is_true){ </em></p>
<p><em>        $newhash{$key} = $oldhash{$key};</em></p>
<p><em>    } </em></p>
<p><em>}</em></p>
<p>Iterate over the old one, skip what you would normally delete, and write to a new hash. This is especially true for large tied hashes (where every <em>delete</em> seems to start a complete new build of the hash&#8230;)!</p>
<p><strong>Lesson 2</strong>: Don&#8217;t try a <em>sort()</em> while iterating over a large hash. Something like:</p>
<p><em>for my $key (sort { $hash{$a} &lt;=&gt; $hash{$b} } keys %hash){</em></p>
<p><em>}</em></p>
<p>is deadly (especially sorting by value). Think about your design. Why do you need sorted keys and, if so, why didn&#8217;t you sort them in the first place, while <u>building</u> the hash? In my case the hash was so large that the only way to handle it gracefully in my environment was to use a tied hash. This gave me the option to use the BTREE type of hash, which keeps the keys sorted while the hash is being built. By using this construct, I could then access the keys in a nice sorted way.</p>
<p><em>tie %hash, 'DB_File', "hash.dbfile", $flags, $mode, $DB_BTREE;</em></p>
<p><em>&#8230; </em></p>
<p><em>for my $key (keys %hash){</em></p>
<p><em>}</em></p>
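<p>For reference, a minimal self-contained sketch of that tie, assuming DB_File (and the Berkeley DB library behind it) is installed. The temporary directory, the flag values and the sample keys are my own choices for illustration:</p>
<pre>use strict;
use warnings;
use Fcntl;                        # O_RDWR, O_CREAT
use DB_File;                      # exports $DB_BTREE
use File::Temp qw(tempdir);

# keep the example file out of the way; use a real path in production
my $dir = tempdir(CLEANUP =&gt; 1);

my %hash;
tie %hash, 'DB_File', "$dir/hash.dbfile", O_RDWR|O_CREAT, 0666, $DB_BTREE
    or die "Cannot tie hash.dbfile: $!";

# insert in arbitrary order; duplicate keys simply overwrite ("already seen")
$hash{$_} = 1 for qw(pear apple zebra mango apple);

# a BTREE hands the keys back in lexical order, no sort() needed
my @keys = keys %hash;
print "@keys\n";                  # apple mango pear zebra

untie %hash;</pre>
<p>Note that the BTREE is ordered by <em>key</em> and compares keys as strings by default, so numeric keys come out as 1, 10, 2&#8230; unless you install your own comparison routine via <em>$DB_BTREE-&gt;{compare}</em> before the tie.</p>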
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/working-with-very-large-hashes.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>How to format a string with leading zeros?</title>
		<link>http://perl.goeszen.com/how-to-format-a-string-with-leading-zeros.html</link>
		<comments>http://perl.goeszen.com/how-to-format-a-string-with-leading-zeros.html#comments</comments>
		<pubDate>Sun, 01 Jun 2008 16:09:52 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[sprintf]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/how-to-format-a-string-with-leading-zeros.html</guid>
		<description><![CDATA[A quick reminder:
How do I pad a string so that a number or string gets leading zeros?
Answer:
By using sprintf:
 my $number = 123;
$number = sprintf("%07d", $number);
print $number;
Output: "0000123"
After the %: "0" is the padding character, 7 is the minimum field width and "d" formats the value as a decimal integer. See the documentation for sprintf.
]]></description>
			<content:encoded><![CDATA[<p>A quick reminder:</p>
<p><em>How do I pad a string so that a number or string gets leading zeros</em>?</p>
<p>Answer:</p>
<p>By using sprintf:</p>
<blockquote><p> my $number = 123;<br />
$number = sprintf("%07d", $number);<br />
print $number;</p></blockquote>
<p>Output: "0000123"</p>
<p>After the %: "0" is the padding character, 7 is the minimum field width (shorter numbers are padded up to seven characters, longer ones are left alone) and "d" formats the value as a decimal integer. See the <a href="http://perl.active-venture.com/pod/func/sprintf.html">documentation for sprintf</a>.</p>
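<p>A few more cases, as a small sketch of how the width behaves. The variable names and sample values are mine; the last lines show one way to pad an arbitrary string by hand with the repetition operator instead of a numeric format:</p>
<pre>use strict;
use warnings;

my $padded   = sprintf("%07d", 123);      # "0000123"
my $too_long = sprintf("%03d", 12345);    # "12345": the width is only a minimum
my $negative = sprintf("%07d", -42);      # "-000042": the sign counts towards the width

# for non-numeric strings, pad by hand with the x (repetition) operator
my $label        = 'ab7';
my $padded_label = ('0' x (7 - length $label)) . $label;    # "0000ab7"

print "$padded $too_long $negative $padded_label\n";</pre>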
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/how-to-format-a-string-with-leading-zeros.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Converting ANY video with ffmpeg (letterboxing/pillarboxing)</title>
		<link>http://perl.goeszen.com/converting-any-video-with-ffmpeg-letterboxingpillarboxing.html</link>
		<comments>http://perl.goeszen.com/converting-any-video-with-ffmpeg-letterboxingpillarboxing.html#comments</comments>
		<pubDate>Thu, 24 Apr 2008 12:17:26 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[ffmpeg]]></category>

		<category><![CDATA[letterboxing]]></category>

		<category><![CDATA[pillarboxing]]></category>

		<category><![CDATA[transcoding]]></category>

		<category><![CDATA[video]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/converting-any-video-with-ffmpeg-letterboxingpillarboxing.html</guid>
		<description><![CDATA[When you use ffmpeg to transcode videos from various sources and in various sizes, formats and aspect ratios to a given destination format, you can&#8217;t rely on ffmpeg alone to produce the expected results. In this post we will have a look at how we can dynamically letterbox or pillarbox (black bars on the sides [...]]]></description>
			<content:encoded><![CDATA[<p>When you use <a href="http://ffmpeg.mplayerhq.hu/">ffmpeg</a> to transcode videos from various sources and in various sizes, formats and aspect ratios to a given destination format, you can&#8217;t rely on ffmpeg alone to produce the expected results. In this post we will have a look at how we can dynamically <em>letterbox</em> or <em>pillarbox</em> (black bars on the sides of the video, instead of top/bottom) any input video so that we can fit the video source into a fixed dimension target frame without breaking aspect ratio by squeezing the image.</p>
<blockquote><p>my $input_width;<br />
my $input_height;</p>
<p># probe file for its dimensions (ffmpeg prints the stream info on STDERR)<br />
open(FFMPEG, "ffmpeg -i '$inFile' 2&gt;&amp;1 |") or $error .= "Couldn't read from ffmpeg: $!\n";<br />
while (&lt;FFMPEG&gt;) {<br />
if($_ =~ /Video:.+ (\d+)x(\d+)/){<br />
$input_width = $1;<br />
$input_height= $2;<br />
}<br />
}<br />
close(FFMPEG);<br />
# both values are needed, so bail out if either one is missing<br />
if(!$input_width || !$input_height){<br />
die "Could not get input video dimensions\n";<br />
}</p></blockquote>
<p>Now, let&#8217;s define the target video dimensions:</p>
<blockquote><p>my $output_width = 480;<br />
my $output_height= 320;</p>
<p># we need this for the ffmpeg command&#8230; ugly but works<br />
my $padone = '-padtop';<br />
my $padtwo = '-padbottom';</p></blockquote>
<p>Then, here comes the magic that calculates the thickness of the black bars. This assumes that the video is actually full screen or less/letterboxed. If the video is less wide than the target frame size, the below routine will give a negative letterboxing result (which we will correct later on and use the value for pillarboxing):</p>
<blockquote><p># calculate padding (for black bar letterboxing/pillarboxing)<br />
my $input_aspect= $input_width / $input_height;<br />
my $conv_height    = int( ($output_width / $input_aspect) );<br />
$conv_height    -= 1 if $conv_height % 2 == 1;<br />
my $conv_pad    = int( (($output_height - $conv_height) / 2.0) );<br />
$conv_pad    -= 1 if $conv_pad % 2 == 1;</p></blockquote>
<p>Now we know the thickness of the black bars, but we must have a look at the video&#8217;s dimensions to know whether we need to letterbox or pillarbox. The &#8220;breaking point&#8221; in our calculation is the aspect ratio of the target frame, that is $output_width / $output_height. For our 480&#215;320 frame this is 1.5 (480/320); a hard-coded 1.333 would only be right for a 4:3 frame such as 480&#215;360:</p>
<blockquote><p>my $aspect_mode;<br />
my $output_aspect = $output_width / $output_height;    # 1.5 for 480x320<br />
if($input_aspect &lt; $output_aspect){<br />
$aspect_mode = 'pillarboxing';<br />
$conv_pad *= -1;    # negative to positive values<br />
$padone = '-padleft';    # padding on sides<br />
$padtwo = '-padright';<br />
}else{# default<br />
$aspect_mode = 'letterboxing';<br />
}</p></blockquote>
<p>One last hack for the ffmpeg command:</p>
<blockquote><p># need for our command..<br />
my $wxh = $output_width . 'x' . $conv_height;</p></blockquote>
<p>That&#8217;s it! Now let&#8217;s send these results to our ffmpeg command to start transcoding:</p>
<blockquote><p><em>open(FFMPEG, "ffmpeg -y -i '$inFile' -f flv -s $wxh $padone $conv_pad $padtwo $conv_pad -ar  &#8230;");</em></p></blockquote>
<p>(the &#8220;-s&#8221; parameter informs ffmpeg about the target size, $padone and $padtwo change, depending on letterboxing or pillarboxing, to &#8220;-padtop/-padbottom&#8221; and &#8220;-padleft/-padright&#8221; while $conv_pad holds the symmetrical value for the black bars thickness.)</p>
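<p>The geometry can also be wrapped into one routine. The following is only a sketch under my own assumptions, not the code used above: the name <em>fit_into_frame</em> and the returned hash keys are hypothetical, the picture is scaled to the frame <em>height</em> in the pillarbox case, integer arithmetic is used to avoid rounding surprises, and the two bars may differ by two pixels so that picture plus bars always add up to the exact target size:</p>
<pre>use strict;
use warnings;

# Returns the scaled picture size and the two (even) bar sizes needed to
# fit an input of $in_w x $in_h into a frame of $out_w x $out_h.
# Assumes the target frame itself has even dimensions.
sub fit_into_frame {
    my ($in_w, $in_h, $out_w, $out_h) = @_;
    my ($w, $h, $mode) = ($out_w, $out_h, 'none');

    # compare the aspect ratios by cross-multiplying (no floating point)
    if ($in_w * $out_h &gt; $out_w * $in_h) {        # input is wider: bars top/bottom
        $mode = 'letterboxing';
        $h = int($out_w * $in_h / $in_w);
        $h -= $h % 2;                             # keep dimensions even
    } elsif ($in_w * $out_h &lt; $out_w * $in_h) {   # input is narrower: bars left/right
        $mode = 'pillarboxing';
        $w = int($out_h * $in_w / $in_h);
        $w -= $w % 2;
    }

    # split the free space into two even bars that add up exactly
    my $free = $mode eq 'pillarboxing' ? $out_w - $w : $out_h - $h;
    my $pad1 = int($free / 2);
    $pad1 -= $pad1 % 2;
    my $pad2 = $free - $pad1;

    return { mode =&gt; $mode, size =&gt; "${w}x${h}", pad1 =&gt; $pad1, pad2 =&gt; $pad2 };
}

for my $in ([1280, 720], [640, 480], [480, 320]) {
    my $fit = fit_into_frame(@$in, 480, 320);
    print "$in-&gt;[0]x$in-&gt;[1]: $fit-&gt;{mode}, -s $fit-&gt;{size}, bars $fit-&gt;{pad1}/$fit-&gt;{pad2}\n";
}
# 1280x720: letterboxing, -s 480x270, bars 24/26
# 640x480: pillarboxing, -s 426x320, bars 26/28
# 480x320: none, -s 480x320, bars 0/0</pre>
<p>The two bar values would then go to -padtop/-padbottom or -padleft/-padright as before. As far as I know, newer ffmpeg builds have dropped these switches in favour of the <em>pad</em> video filter, so check what your version supports.</p>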
<p>Further: An interesting algorithm to detect aspect ratio can be found in <a href="http://episteme.arstechnica.com/eve/forums/a/tpc/f/96509133/m/659007163831">this thread</a>:</p>
<blockquote><p>my $aspect = $w/$h;<br />
my $size;<br />
if (abs($aspect - 16/9) &lt; 0.02) {<br />
$size = "320x180";<br />
} elsif (abs($aspect - 4/3) &lt; 0.02) {<br />
$size = "320x240";<br />
} else {<br />
die "Weird aspect ratio: ${w}x${h} = $aspect\n";<br />
}</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/converting-any-video-with-ffmpeg-letterboxingpillarboxing.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>What I’ve learned from writing a large scale search engine</title>
		<link>http://perl.goeszen.com/what-ive-learned-from-writing-a-large-scale-search-engine.html</link>
		<comments>http://perl.goeszen.com/what-ive-learned-from-writing-a-large-scale-search-engine.html#comments</comments>
		<pubDate>Wed, 26 Mar 2008 22:09:57 +0000</pubDate>
		<dc:creator>tengo</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<category><![CDATA[design]]></category>

		<category><![CDATA[perl]]></category>

		<category><![CDATA[search]]></category>

		<category><![CDATA[search engine]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/2008/03/26/what-ive-learned-from-writing-a-large-scale-search-engine/</guid>
		<description><![CDATA[Writing a large-scale web-crawling and web-indexing search engine from scratch is a large beast to tame and in many cases a project that is heading for desaster right from the start. As you can read in Alex&#8217;s worklog for the ongoing effort to manage the Majestic-12 distributed search engine, writing a crawler alone can cost [...]]]></description>
			<content:encoded><![CDATA[<p>Writing a large-scale web-crawling and web-indexing search engine from scratch is a large beast to tame and in many cases a project that is heading for disaster right from the start. As you can read in <a href="http://www.majestic12.co.uk/about.php">Alex&#8217;s worklog</a> for the ongoing effort to manage the Majestic-12 distributed search engine, writing a crawler alone can cost you a lot of time and hair, let alone the giant task of writing all three modules: the crawler, the indexer and the search frontend&#8230; but let&#8217;s start from the beginning.</p>
<h3><strong>Care about copyright! And some other essential things</strong></h3>
<p>Before you start, you will want to take some time and think about the implications and legal aspects of running a search engine. Crawling the web at high speed will put a lot of stress on networks and foreign servers. It&#8217;s good to play by the rules. You should do the little extra work to make your robot a good robot and not let it behave like many of the spam-bots out there. As a good start, <a href="http://www.robotstxt.org/wc/guidelines.html">read the robot guidelines here</a>.</p>
<p>OK, now after reading what a good robot should look like, let&#8217;s think about a few more things - first of all: copyright. Crawling and thus downloading large portions of the web is in its essence a breach of copyright. Many pages online forbid such procedures. BUT, they do allow Google, Yahoo and all the other search engines to crawl, download and index their sites. Why? Because these crawlers play by the rules. And the rules for a search engine are that, yes, you can download and store content, but you should do it to send visitors to their sites. All storage of copyrighted content should only be done for the purpose of building the search index, it should be limited in terms of how long you will store this data, and you should give proper indication of where and when you accessed a certain page.</p>
<p>The next thing to keep in mind is that your robot should follow the robots.txt rules. The robots.txt file is, as you already know from reading the &#8220;<a href="http://www.robotstxt.org/wc/guidelines.html">Guidelines for a well-behaved robot</a>&#8221;, a simple ruleset for machines visiting a server. With this simple text file the webmaster informs your agent about what part of a website may be indexed and, more importantly, what part may not. So, by all means, follow the robots.txt rules. In Perl, you can use a module like <a href="http://search.cpan.org/dist/libwww-perl/lib/LWP/RobotUA.pm">LWP::RobotUA</a>, <a href="http://search.cpan.org/~gaas/libwww-perl-5.808/lib/WWW/RobotRules.pm">WWW::RobotRules</a> or similar to effectively do so.</p>
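<p>To get a feeling for it, here is a minimal offline sketch with WWW::RobotRules. The robot name, the robots.txt content and the URLs are invented for the example; in a real crawler you would fetch the file from the server (or let LWP::RobotUA do that for you):</p>
<pre>use strict;
use warnings;
use WWW::RobotRules;

# the name is matched against the User-agent lines in robots.txt
my $rules = WWW::RobotRules-&gt;new('MyCrawler/1.0');

# made-up robots.txt content; normally downloaded from the site
my $robots_txt = join "\n",
    'User-agent: *',
    'Disallow: /private/',
    '';

$rules-&gt;parse('http://www.foo.bar/robots.txt', $robots_txt);

my @verdicts;
for my $url ('http://www.foo.bar/index.html',
             'http://www.foo.bar/private/data.html') {
    my $ok = $rules-&gt;allowed($url) ? 1 : 0;
    push @verdicts, $ok;
    print "$url: ", ($ok ? 'allowed' : 'disallowed'), "\n";
}</pre>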
<p>The last essential thing to keep in mind is: properly identify your crawler as a search engine crawler in the User Agent String! <a href="http://www.robotstxt.org/db.html">This page here</a> has a list of the most important robots out there. You can see: everyone else does it and it&#8217;s good practice, so do it as well. Be sure to identify your crawler, for example, like this:</p>
<pre> require LWP::UserAgent;
my $ua = LWP::UserAgent-&gt;new;
$ua-&gt;agent('My Crawler. Visit http://www.foo.bar/robot.html for more information');</pre>
<p>and please also include some email, url or similar on how you can be contacted so that a webmaster who notices your crawler running wild can contact you and thus help resolve the issue.</p>
<h3>The basic design of a search engine</h3>
<p>When thinking of how a search engine will work and how you will actually implement all that knowledge, you will soon discover that a modern search engine consists of several specialised components. Read about it in <a href="http://www.howstuffworks.com/search-engine.htm">Howstuffworks</a> or <a href="http://citeseer.ist.psu.edu/brin98anatomy.html">The Anatomy of a Large-Scale Hypertextual Web Search Engine</a> by the Google Guys. As I pointed out earlier, searching is pushing the limits of modern computing, so whatever you do: get used to reading scientific material like that.</p>
<p>Although a search engine might appear to be a simple construct, it is really a complex thing to master. And as each part of that complex thing is destined to do certain very special things, you can break a search engine down into two or, as I like to see it, three major parts: the crawler, the indexer/merger and the frontend.</p>
<h4>The crawler</h4>
<p>The crawler, spider or web robot is your content aggregator. It scours the web and systematically downloads websites, discovers links, follows these extracted links and again downloads these newly discovered resources. While traversing the web, a crawler will encounter all kinds of things. You must decide on the protocols the crawler supports (http only, or mms, ftp, etc. as well), whether it downloads only text sources (html, php, asp and all those other extensions) and how it recognizes these content types.</p>
<p>A good start to give the crawler a certain knowledge about the resource is to have it look at the actual extension, like the .html ending. But many resources have no ending at all, like the &#8220;http://www.foo.bar/&#8221; root page and the like. A cleverer approach is to look at the Content-Type the server sends and to have the crawler itself send the header <code>Accept: text/*</code> with its requests.</p>
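<p>A minimal sketch of both halves with LWP. The helper name <em>is_text_response</em> is my own, and the responses are built by hand so that the example runs without network access; in the crawler they would come from <em>$ua-&gt;get($url)</em>:</p>
<pre>use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Response;

# ask servers for textual content on every request of this agent
my $ua = LWP::UserAgent-&gt;new;
$ua-&gt;default_header('Accept' =&gt; 'text/*');

# decide by the Content-Type the server actually sent, not by extension
sub is_text_response {
    my ($response) = @_;
    my $type = $response-&gt;content_type;    # e.g. "text/html", parameters stripped
    return $type =~ m{^text/} ? 1 : 0;
}

# hand-made responses standing in for real downloads
my $page  = HTTP::Response-&gt;new(200, 'OK',
    [ 'Content-Type' =&gt; 'text/html; charset=UTF-8' ], '&lt;html&gt;&lt;/html&gt;');
my $image = HTTP::Response-&gt;new(200, 'OK',
    [ 'Content-Type' =&gt; 'image/jpeg' ], '');

print "page is text:  ", is_text_response($page),  "\n";    # 1
print "image is text: ", is_text_response($image), "\n";    # 0</pre>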
<p>While speaking of headers, a good idea would be to implement support for compressed/gzipped pages. Many servers support this feature and you can download pages at a fraction of the uncompressed size and transfer time. But keep in mind that the Perl LWP library doesn&#8217;t handle gzipped pages by default. You need to have Compress::Zlib and the associated libraries installed, your request has to announce support via the Accept-Encoding header, and the content needs to be accessed via the $response-&gt;decoded_content function of LWP.</p>
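<p>Roughly, this looks as follows. It is a sketch that fakes a gzipped server response locally (the body text is made up) so that it runs offline; the commented line shows what the real request would look like:</p>
<pre>use strict;
use warnings;
use Compress::Zlib ();
use HTTP::Message;
use HTTP::Response;

# the encodings decoded_content() can undo on this installation, e.g. "gzip, x-gzip, deflate"
my $accept_encoding = HTTP::Message::decodable();

# with a real agent you would send it along with the request:
#   my $response = $ua-&gt;get($url, 'Accept-Encoding' =&gt; $accept_encoding);

# stand-in for a compressed answer from a server
my $body     = '&lt;html&gt;&lt;body&gt;hello crawler&lt;/body&gt;&lt;/html&gt;';
my $response = HTTP::Response-&gt;new(200, 'OK',
    [ 'Content-Type'     =&gt; 'text/html; charset=UTF-8',
      'Content-Encoding' =&gt; 'gzip' ],
    Compress::Zlib::memGzip($body));

my $raw  = $response-&gt;content;            # still the compressed bytes
my $text = $response-&gt;decoded_content;    # gunzipped and charset-decoded

print "decoded ", length($text), " characters\n";</pre>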
<p>The next thing is encoding. Once there was a single <a href="http://en.wikipedia.org/wiki/Code_page">code page</a>, ASCII, with 128 characters; that was the 60s and 70s. A bit later, computer science discovered that there are more characters than the ones defined in the ASCII alphabet, like umlauts and such. The internet happened right in between, and the next iteration was an ugly hack called character encoding. Yes, there were now 256 positions on an 8-bit code page to fill with letters, but each country/language used different characters for the same position on the code page. You can only know which code page to use if it is labelled with a name like latin-1 (or iso-8859-1), Shift_JIS etc. A disaster. When browsers arrived it got even worse. A thing called <a href="http://search.cpan.org/~gaas/HTML-Parser-3.56/lib/HTML/Entities.pm">HTML Entities</a> was introduced. This enabled users to stick to the raw US-ASCII code page and render umlauts and special characters with the help of descriptive patterns like &#8220;&amp;auml;&#8221;.</p>
<p>Finally, UTF-8 saw the light of day. The idea was, as can be read <a href="http://en.wikipedia.org/wiki/Utf8#History">here</a>, to use a variable number of bytes per character (one to four) instead of always just one, thus opening up a much larger character set with space for every imaginable character and making code pages and entities a thing of the past. Anyway, there are many pages out there that could be served in plain UTF-8 but are still using HTML entities on top of that, some claim to be in us-ascii while they are in reality latin-1 encoded, and on and on and on.</p>
<p>The lesson you can learn from that is that your crawler should be able to read and recognize all that, as a last resort by <a href="http://search.cpan.org/~dankogai/Encode-2.24/lib/Encode/Guess.pm">guessing</a> the used encoding. A good idea would be to convert everything to UTF-8. You can use decoded_content() for that and all the tools of the <a href="http://search.cpan.org/~dankogai/Encode-2.24/Encode.pm">Encode</a> module. One thing: use a newer, UTF-8 aware version of Perl if you&#8217;d like to do so.</p>
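<p>As a small sketch of such a normalisation step: the byte string below is made up and stands for a page that is latin-1 encoded and uses entities on top. The input encoding is assumed to be known here; a real crawler would take it from the headers, the markup or, failing that, a guess:</p>
<pre>use strict;
use warnings;
use Encode qw(decode encode);
use HTML::Entities qw(decode_entities);

# "Gr&uuml;&szlig;e &amp;amp; K&amp;uuml;sse" as latin-1 bytes, with two entities mixed in
my $bytes = "Gr\xFC\xDFe &amp;amp; K&amp;uuml;sse";

my $text = decode('iso-8859-1', $bytes);    # bytes -&gt; Perl characters
$text    = decode_entities($text);          # entities -&gt; real characters
my $utf8 = encode('UTF-8', $text);          # characters -&gt; UTF-8 bytes for storage

print length($text), " characters, ", length($utf8), " bytes in UTF-8\n";
# 13 characters, 16 bytes in UTF-8</pre>
<p>Work on the decoded character string inside your program and only encode again at the point where the data is written out.</p>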
<p>After your crawler has finally successfully fetched some content (more pitfalls here: 404 errors, redirections, revisits you need to schedule for later) you need to parse the content in one way or another. Use a readily available HTML parser for that; writing a parser is a road you don&#8217;t want to go down! Nested data structures are a thing that can make a man mad. And: Don&#8217;t even think about using regular expressions/regexes for anything in that direction!</p>
<p>The last step is link extraction: You need to decide whether your crawler itself starts the link extraction or whether a dedicated process does so. How will your extractor handle or detect already visited URLs: a <a href="http://search.cpan.org/~mceglows/Bloom-Filter-1.0/Filter.pm">bloom filter</a>? A sorting algorithm with uniq? Think about that!</p>
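<p>To make this concrete, here is a minimal sketch using HTML::LinkExtor and URI. Note the swap: a plain in-memory hash stands in for the bloom filter, which is fine for a demo but will not scale to a real crawl. The sub name <em>extract_new_links</em>, the sample page and the URLs are hypothetical:</p>
<pre>use strict;
use warnings;
use HTML::LinkExtor;
use URI;

my %seen;    # stand-in for a bloom filter or an on-disk URL store

# returns the absolute, canonical http(s) links of a page not seen before
sub extract_new_links {
    my ($html, $base) = @_;
    my $parser = HTML::LinkExtor-&gt;new(undef, $base);    # $base makes links absolute
    $parser-&gt;parse($html);
    $parser-&gt;eof;

    my @new;
    for my $link ($parser-&gt;links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' &amp;&amp; defined $attr{href};
        my $url = URI-&gt;new("$attr{href}")-&gt;canonical;
        next unless defined $url-&gt;scheme &amp;&amp; $url-&gt;scheme =~ /^https?$/;
        $url-&gt;fragment(undef);                          # "#top" is the same page
        push @new, "$url" unless $seen{"$url"}++;
    }
    return @new;
}

my $html = join "\n",
    '&lt;a href="/a.html"&gt;A&lt;/a&gt;',
    '&lt;a href="/a.html#top"&gt;A again&lt;/a&gt;',
    '&lt;a href="http://WWW.foo.bar/b.html"&gt;B&lt;/a&gt;',
    '&lt;a href="mailto:webmaster@foo.bar"&gt;mail&lt;/a&gt;',
    '&lt;img src="logo.png"&gt;';

my @links = extract_new_links($html, 'http://www.foo.bar/');
print "$_\n" for @links;
# http://www.foo.bar/a.html
# http://www.foo.bar/b.html</pre>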
<p>And how do you store all these URLs and all this content for later analysis? You can read about the limits of traditional storage below, but be sure to have a clever structure for the organisation of all this data. My advice, without giving too much away, would be to use buckets and chunks of data. How exactly you can implement this is another homework left for you, but make sure you use compression, some kind of a bucket index (maybe blend in some SQL here, but only some, see my comments below) and clever sorting/footprint reduction techniques for the URL lists may also help.</p>
<p>Phew, you made it through, everything works. But what is that? It is sloooow! Now comes the interesting part: it works, but it is an expensive operation. More homework here: DNS lookups need to be cached, utf8 operations/detection are too expensive, the parser is a resource hog&#8230; Some requests time out, and all of that needs to be scalable, concurrent and running on multiple machines and in multiple instances. You see, the crawler alone can cost you hair&#8230;</p>
<h4>The indexer/merger</h4>
<p>OK, you&#8217;ve made it this far. You have extracted thousands of unique URLs, visited them and downloaded the content, which is really only textual content. The next step is now: indexing it.</p>
<p>Most search engines, at least all the seriously large ones, don&#8217;t really do fulltext searches on the content. Opening all those files and running regexes on the text in them would take ages once you have passed the mark of a few hundred content objects. The solution to this problem is turning the whole idea upside down and using what is called an <a href="http://en.wikipedia.org/wiki/Inverted_index">inverted index</a>.</p>
<p>An inverted index is like the index in a book. Texts are analysed and your process stores each new word it finds in a kind of dictionary, remembering the position of the word along the way. Upon search, this dictionary is used to identify the objects or positions within your body of content that match the keyword, and these are used to compute the results list. For a search that consists of multiple keywords, the searcher identifies the matching entries/objects for each of those words and then computes the intersection of these entries.</p>
<p>So, write a routine that harvests all single words from your fetched content and compiles a list of what are called postings. Then arrange your content in a content db accompanying your word index.</p>
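<p>As a rough illustration of the idea, the following sketch builds a tiny in-memory inverted index with word positions and intersects the postings for a multi-keyword search. The documents, their ids and the function name <code>search</code> are invented for this example, and the tokeniser is deliberately simplistic.</p>

```perl
use strict;
use warnings;

# Toy documents; ids and texts are made up for illustration.
my %docs = (
    1 => "Short films about perl",
    2 => "Perl search engine basics",
    3 => "Search for short films",
);

# Build the index: word => { doc id => [ word positions ] }
my %index;
for my $id (keys %docs) {
    my $pos = 0;
    for my $word (split /\W+/, lc $docs{$id}) {
        next unless length $word;
        push @{ $index{$word}{$id} }, $pos++;
    }
}

# Return the sorted ids of the documents that contain every keyword.
sub search {
    my @words = map { lc $_ } @_;
    return () unless @words;
    my %count;
    for my $word (@words) {
        $count{$_}++ for keys %{ $index{$word} || {} };
    }
    return sort { $a <=> $b } grep { $count{$_} == @words } keys %count;
}

print join(",", search("short", "films")), "\n";    # prints 1,3
```

<p>Note that no document text is opened at search time; only the dictionary is consulted, which is exactly why this approach scales better than running regexes over files.</p>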
<p>As simple as it sounds, it is difficult in reality. How do you differentiate between &#8220;book&#8221; and &#8220;books&#8221;? Will they be treated as the same word, or regarded as different concepts upon search? The quality of your index is crucial for the quality of your search engine&#8217;s results. A user looking for &#8220;short films&#8221; should get similar matches when looking for &#8220;shortfilms&#8221;, at least in theory.</p>
<p>Writing a good indexer is as demanding as writing a good crawler. Just some keywords to get you started: you should be aware of linguistic morphology, <a href="http://en.wikipedia.org/wiki/Stemming">stemming</a>, <a href="http://en.wikipedia.org/wiki/Synonym">synonyms</a> and concepts; a fixed dictionary might also help with popularity ranking. Anybody can write a script that downloads stuff from the web (ok, it&#8217;s a bit more demanding than that), but actually making sense of it is another cup of tea and still a big issue in <a href="http://en.wikipedia.org/wiki/Category:Information_retrieval">information retrieval</a> and research.</p>
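<p>To show where the &#8220;book&#8221; versus &#8220;books&#8221; decision lives in code, here is a deliberately crude term normaliser. It is not a real stemmer, it only handles ASCII, and the function name <code>normalise</code> is an assumption for this sketch; a serious indexer would use a proper stemming algorithm and language detection instead.</p>

```perl
use strict;
use warnings;

# Crude normaliser, NOT a real stemmer: lowercases, strips everything
# except ASCII letters and digits, and removes a plain plural "s".
sub normalise {
    my ($term) = @_;
    $term = lc $term;
    $term =~ s/[^a-z0-9]+//g;
    # Keep short words and words ending in "ss" (e.g. "glass") untouched.
    $term =~ s/s$// if length($term) > 3 && $term !~ /ss$/;
    return $term;
}

# "Book" and "books" now end up under the same dictionary key.
print normalise($_), "\n" for "Book", "books", "Films,", "glass";
```

<p>Apply the same function to the words you index and to the words the user types, otherwise the two sides will never meet. A rule this simple also makes mistakes (it would turn &#8220;this&#8221; into &#8220;thi&#8221;), which is a good reminder of why real stemmers exist.</p>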
<p>Actually, the functionality described above is the more traditional approach to searching. It is fast for keyword-based search, but nearly useless when you need to compute associations (like a human would do), find similar objects or answer semantic search requests. So it is up to you to invent the next best search engine. Tell me about your successes. It might be a <a href="http://www.perl.com/pub/a/2003/02/19/engine.html">Vector Search</a> approach, a semantic engine, or a <a href="http://www.yr-bcn.es/demos/microsearch/">mixture of new techniques</a>.</p>
<p>Anyway, the result of your indexer will, in most cases, be a kind of solid search index, a large construct of your dictionary and your indexed content - in one way or another. This search index will be accessed by your frontend for the search results.<br />
Also, your index needs to be updated, re-checked (do the indexed sources still exist?) and merged with newly discovered content. Merging an index with a new chunk of data is an expensive operation. You will need to develop routines or systems to do it in a quick and feasible manner. Solutions might be to use multiple larger indexes, incremental merging or the like. An interesting problem to meditate on.</p>
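<p>One possible shape of incremental merging, sketched under the assumption that each word maps to a sorted list of numeric document ids: a small &#8220;delta&#8221; index built from fresh content is folded into the main index word by word. The function names and the sample data are hypothetical, and a real merger would stream the postings from disk rather than hold them in memory.</p>

```perl
use strict;
use warnings;

# Merge two sorted lists of document ids into one sorted list without duplicates.
sub merge_postings {
    my ($old, $new) = @_;
    my ($i, $j, @out) = (0, 0);
    while ($i < @$old && $j < @$new) {
        my $cmp = $old->[$i] <=> $new->[$j];
        if    ($cmp < 0) { push @out, $old->[$i++] }
        elsif ($cmp > 0) { push @out, $new->[$j++] }
        else             { push @out, $old->[$i++]; $j++ }
    }
    # Append whatever is left over in either list.
    push @out, @{$old}[$i .. $#{$old}], @{$new}[$j .. $#{$new}];
    return \@out;
}

# Fold a small delta index (word => sorted ids) into the main index in place.
sub merge_index {
    my ($main, $delta) = @_;
    for my $word (keys %$delta) {
        $main->{$word} = merge_postings($main->{$word} || [], $delta->{$word});
    }
    return $main;
}

my $main  = { perl => [1, 4],    search => [2] };
my $delta = { perl => [3, 4, 7], index  => [5] };
merge_index($main, $delta);
print join(",", @{ $main->{perl} }), "\n";    # prints 1,3,4,7
```

<p>Only the words that occur in the delta are touched, so the cost of a merge grows with the size of the new chunk rather than with the size of the whole index.</p>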
<h4>The web front-end (presentation layer)</h4>
<p>Actually, developing the web frontend is the easiest part. In most cases you will have guidelines in terms of design and layout that were developed in conjunction with the overall user experience chosen for the project and the corporate identity that will be in effect. Remember that you should inform the user about returned objects as much as possible while being brief and simple. And &#8220;form follows function&#8221;! <a href="http://www.google.com/">Google</a> is a good example of a text search engine that&#8217;s concise in its text results. The results layout of <a href="http://www.lumerias.com/">video search engine Lumerias</a> is also simplistic by design but has some extensions and modifications that respect the nature of presenting video results.</p>
<p>I am talking mainly about design here because the underlying technology for the search should already be available from writing the indexer. The searcher component is the corresponding negative (or positive, depending on how you put it) to your indexer technology. The searcher reverses the algorithm of your indexer and presents matching results based on the user&#8217;s input. As simple as that.</p>
<h3><strong>The limits of current hardware and software</strong></h3>
<p>When you start to write your code, regardless of the language you do it in, you will soon reach the limits of what&#8217;s possible in information retrieval on current hardware and with the way software works today. Your piece of software will munge billions of URLs, will see thousands of duplicates, errors and pitfalls and will generate terabytes of traffic, so be prepared for a bumpy ride. And this isn&#8217;t only the case when you <a href="http://www.majestic12.co.uk/research/WWW-Search-Engine-But-Not-In-Perl.pp">try to do it all in perl</a>&#8230;</p>
<h4>Rule 1: Filesystem is always too slow</h4>
<p>When thinking about storing stuff, your first idea might be: &#8220;let&#8217;s just save it to the filesystem&#8221;, one data object = one file. Nice idea, but a mistake! The filesystem might work similarly to a database, but it&#8217;s actually too slow when you rely on it for thousands of operations. As a rule of thumb you should try to avoid filesystem accesses as much as you can. Load chunks of data from disk into memory, process them and write the results back. But do it chunk-wise and not for every object. That&#8217;s the way to go!</p>
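<p>A minimal sketch of what &#8220;chunk-wise&#8221; can mean in practice: records are collected in memory and written out as one flat file per chunk instead of one file per object. The chunk size of 1000, the file naming scheme and the function names are arbitrary assumptions for this example, and the temporary directory only serves to keep the demo self-contained.</p>

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

my $dir        = tempdir(CLEANUP => 1);   # demo location; use a real data dir in practice
my $chunk_size = 1000;                    # records per chunk file, an arbitrary choice
my @buffer;
my $chunk_no = 0;

# Write all buffered records to a single flat file, one record per line.
sub flush_chunk {
    return unless @buffer;
    my $file = sprintf "%s/chunk-%06d.txt", $dir, $chunk_no++;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "$_\n" for @buffer;
    close $fh or die "Cannot close $file: $!";
    @buffer = ();
}

# Queue one record; touch the disk only when a full chunk has accumulated.
sub add_record {
    my ($record) = @_;
    push @buffer, $record;
    flush_chunk() if @buffer >= $chunk_size;
}

add_record("http://example.com/page$_") for 1 .. 2500;
flush_chunk();    # write the final, partial chunk

print "$chunk_no files written for 2500 records\n";
```

<p>With one file per object this run would have created 2500 files; here it creates three, and reading the data back is three sequential reads.</p>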
<h4>Rule 2: SQL is too slow</h4>
<p>Regardless of how many indexes you create, how optimized your queries are and how fast and well-scaled your MySQL or whatever server might be: SQL will always be too slow for the job! SQL might be great for complicated queries, but it&#8217;s absolutely useless for storing URLs, page content or anything else that will go into the hundreds of thousands of objects. Believe me, I&#8217;ve been there. In-memory databases are no cure. And even committing 1,000 pooled queries at once won&#8217;t give you that extra speed you&#8217;ll need. Keep it simple! (repeat two times) and just use flat-files combined with a system of chunks of data. Process them one by one and you will be a big step closer to your goal of actually getting it to run longer than a few hours without out-of-memory errors or a frozen system.</p>
<h4>Rule 3: Storage is always too small</h4>
<p>After running your search engine for a few hours, days or weeks, sooner or later you will run out of storage. Your first reflex might be to add a few drives. The common idea is to add an abstraction layer between your OS and your physical drives and <a href="http://en.wikipedia.org/wiki/Logical_volume_management">virtualize your storage volumes</a>. Nice idea, but you will see that when you keep adding drives each day, there will come a day when the failure rate implied by the <a href="http://en.wikipedia.org/wiki/Mean_time_between_failures">mean time between failures</a> (which appears to be quite low from a personal perspective) will start to hit you on the professional level, with thousands of drives spinning. So by all means, expect failure and implement measures to deal with it. Google uses the <a href="http://en.wikipedia.org/wiki/Google_File_System">Google File System</a>, a cluster of commodity hardware that doesn&#8217;t try to be flawless but instead backs itself up and makes disaster recovery for single nodes a simple process. Learn from that and use the <a href="http://en.wikipedia.org/wiki/Hadoop">Hadoop</a> framework for the heavy lifting. The adventurous might as well <a href="http://en.wikipedia.org/wiki/Hadoop#Hadoop_on_Amazon_EC2.2FS3_services">outsource the storage</a> part to Amazon&#8217;s EC2 and S3 services, but be prepared for some hefty invoices when doing so for a search service.</p>
<p>I hope this small guide is a good start for you, the aspiring search engine coder. I was involved in the development of the <a href="http://www.lumerias.com/">video search engine Lumerias.com</a> and these are just a few of the issues we faced while developing that beast. This article is far from complete.</p>
<p>Again, if you would like to learn more about search engines, be sure to milk CiteSeer for all the whitepapers you can get. You might as well read <a href="http://www.acmqueue.com/modules.php?name=Content&amp;pa=showpage&amp;pid=143">Anna&#8217;s amusing article</a> on how to write a search service over at ACM. The out-of-the-box search engine <a href="http://lucene.apache.org/java/docs/">Lucene</a> can be a good source of inspiration. Are you crawling the whole web, or will you run a <a href="http://citeseer.ist.psu.edu/chakrabarti99focused.html">focused crawler</a>? <a href="http://lucene.apache.org/nutch/">Nutch</a> can be a good point to start without writing a single line of code.</p>
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/what-ive-learned-from-writing-a-large-scale-search-engine.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>Install cpan on a server without root access</title>
		<link>http://perl.goeszen.com/install-cpan-on-a-server-without-root-access.html</link>
		<comments>http://perl.goeszen.com/install-cpan-on-a-server-without-root-access.html#comments</comments>
		<pubDate>Thu, 20 Mar 2008 10:31:00 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[setup]]></category>

		<category><![CDATA[installation]]></category>

		<category><![CDATA[non-root]]></category>

		<category><![CDATA[webserver]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/2008/03/20/install-cpan-on-a-server-without-root-access/</guid>
		<description><![CDATA[As you can see from reading this discussion, installing a cpan module without root access can be daunting. Once you have managed to install your own compiled Perl on a non-root account, this is, sooner or later, usually the next problem. Do this:
log in via ssh
ssh remotehost.com -l myusername
create these directories:
myperl/man
myperl/man/man1
myperl/man/man3

then start your perl and [...]]]></description>
			<content:encoded><![CDATA[<p>As you can see from reading <a href="http://www.perlmonks.org/?node_id=610637">this discussion</a>, installing a cpan module without root access can be daunting. Once you have managed to <a href="http://linux.goeszen.com/2008/03/20/compile-your-own-perl-in-a-custom-directory/">install your own compiled Perl on a non-root account</a>, this is, sooner or later, usually the next problem. Do this:</p>
<p>Log in via ssh:</p>
<blockquote><p>ssh remotehost.com -l myusername</p></blockquote>
<p>Create these directories:</p>
<blockquote><p>myperl/man<br />
myperl/man/man1<br />
myperl/man/man3
</p></blockquote>
<p>Then start your perl and the cpan shell:</p>
<blockquote><p>path/to/my/perl -MCPAN -e 'shell'</p></blockquote>
<p>In the cpan shell enter:</p>
<blockquote><p>cpan&gt; o conf makepl_arg "LIB=/path/to/myperl/lib \<br />
INSTALLMAN1DIR=/path/to/myperl/man/man1 \<br />
INSTALLMAN3DIR=/path/to/myperl/man/man3"</p></blockquote>
<p>The easiest way is to paste these lines in via [CTRL]+[SHIFT]+[V] and then remove any stray newlines.</p>
<p>Finally, save this configuration as the default for future sessions:</p>
<blockquote><p>cpan&gt; o conf commit</p></blockquote>
<p>After that, installing things like Bugzilla is simple:</p>
<blockquote><p>install Bundle::Bugzilla</p></blockquote>
<p>Just remember to add the path to your libraries in your scripts:</p>
<blockquote><p>use lib "/path/to/lib/";</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/install-cpan-on-a-server-without-root-access.html/feed</wfw:commentRss>
		</item>
		<item>
		<title>What it is all about</title>
		<link>http://perl.goeszen.com/what-it-is-all-about.html</link>
		<comments>http://perl.goeszen.com/what-it-is-all-about.html#comments</comments>
		<pubDate>Thu, 20 Mar 2008 10:11:05 +0000</pubDate>
		<dc:creator>admin</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://perl.goeszen.com/2008/03/20/what-it-is-all-about/</guid>
		<description><![CDATA[In this blog, I&#8217;d like to publish tricks, tips and hints related to the scripting/programming language Perl, which should not be confused with PEARL.
Although the origin of the name Perl is a more adventurous story, today most agree that Perl stands for Practical Extraction and Report Language, which summarises quite well what it does but [...]]]></description>
			<content:encoded><![CDATA[<p>In this blog, I&#8217;d like to publish tricks, tips and hints related to the scripting/programming language <a href="http://en.wikipedia.org/wiki/Perl">Perl</a>, which should not be confused with <a href="http://en.wikipedia.org/wiki/PEARL_%28programming_language%29">PEARL</a>.</p>
<p>Although the origin of the name Perl is a more adventurous story, today most agree that Perl stands for <strong>P</strong>ractical <strong>E</strong>xtraction and <strong>R</strong>eport <strong>L</strong>anguage, which summarises quite well what it does but hopelessly understates what&#8217;s possible with it. Perl can be used for simple, shell-like programming and automation or for highly complicated tasks in information retrieval, bio-informatics and more.</p>
<p>The Camel has been, since <a href="http://en.wikipedia.org/wiki/Programming_Perl">O&#8217;Reilly&#8217;s books</a>, the official mascot, and <a href="http://search.cpan.org">cpan</a> is the vast repository of helper modules written by contributing authors. Have fun!</p>
]]></content:encoded>
			<wfw:commentRss>http://perl.goeszen.com/what-it-is-all-about.html/feed</wfw:commentRss>
		</item>
	</channel>
</rss>
