Monday, December 14, 2009

MP3 links are volatile.

One of the biggest challenges in maintaining an mp3 search engine is the short life span of most mp3 files on the web.  We call this "volatile" content - it's always changing, and you can't rely on a file staying online for very long.

To combat this, at SkreemR we are continually testing whether the links in our index are still valid.  The trick is to do this without downloading any files, which would eat up our (and the host sites') bandwidth very quickly.  So we have a program that goes through all the links and sends a "HEAD" request for each one to see if the file is still available and hasn't changed.  The HEAD request is quite handy - there's a good post about it here if you are interested:

http://www.greywyvern.com/?post=272

The article sums up the value of HEAD quite well: "You can verify that a file exists, and is the proper MIME-type, without actually downloading all of the data contained within that file."  One problem is that some servers block or do not implement HEAD requests, but mostly it works quite well for us.
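
To make the idea concrete, here is a minimal sketch of a HEAD check in Python using only the standard library.  This is not our actual checker, just an illustration: the names head_check, looks_alive and AUDIO_TYPES are made up for this example, and the list of accepted MIME types is an assumption (many servers send mp3 files as application/octet-stream, so it is included).

```python
import urllib.error
import urllib.request

# Hypothetical list of Content-Type values accepted as "still an mp3".
AUDIO_TYPES = {"audio/mpeg", "audio/mp3", "audio/x-mpeg", "application/octet-stream"}


def looks_alive(status, content_type):
    """Decide from a status code and Content-Type whether a link still looks valid."""
    if status != 200:
        return False
    # Drop parameters such as "; charset=utf-8" before comparing.
    mime = (content_type or "").split(";")[0].strip().lower()
    return mime in AUDIO_TYPES


def head_check(url, timeout=10):
    """Send a HEAD request and return (status, content_type, content_length).

    Only the headers come back, so no file data is downloaded.
    Returns (None, None, None) if the server cannot be reached at all.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return (response.status,
                    response.headers.get("Content-Type"),
                    response.headers.get("Content-Length"))
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 405, ...).
        return (err.code, err.headers.get("Content-Type"), None)
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return (None, None, None)
```

A link would count as working when looks_alive(status, content_type) is true for the values head_check returns.  Comparing the returned Content-Length with the value stored at the previous check is one cheap way to notice that a file has changed.  A 405 or 501 status usually means the server does not support HEAD, which is the problem mentioned above, so such links need separate handling rather than being marked dead.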

We try to validate each link every 24 hours.  With an index of over 10 million links, our program has to be very fast (it is!).  Occasionally you may come across a link in SkreemR that doesn't work, but that just means you have hit it sometime in the 24-hour window between checks.  For the most part, I think we have the best ratio of working to broken links compared to other mp3 search engines.  That's probably why people like to use our API and why people who do not use our API like to try and scrape our site for links...but that's a post for another day ;)
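
To give a feel for what "very fast" means here, the sketch below works out the rate needed to cover 10 million links in 24 hours and shows one simple way to run many checks at once with a thread pool.  Again, this is an illustration and not a description of our program: the function names, the 0.5 second average per check and the default of 200 workers are all assumptions picked for the example.

```python
import math
from concurrent.futures import ThreadPoolExecutor

LINKS = 10_000_000              # "over 10 million links"
WINDOW_SECONDS = 24 * 60 * 60   # one full pass every 24 hours


def required_rate(links=LINKS, window=WINDOW_SECONDS):
    """Checks per second needed to visit every link once per window."""
    return links / window


def workers_needed(rate, seconds_per_check):
    """Concurrent checks needed to sustain `rate` when each check takes
    `seconds_per_check` on average (rate multiplied by latency, rounded up)."""
    return math.ceil(rate * seconds_per_check)


def check_all(urls, check, max_workers=200):
    """Run `check(url)` for every URL on a thread pool and return {url: result}.

    `check` is any callable taking a URL, for example a function that
    sends a HEAD request. max_workers=200 is an assumed value.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(urls, pool.map(check, urls)))
```

required_rate() comes out at roughly 116 checks per second, and at an assumed half second per check that needs about 58 checks in flight at any moment.  Threads work fine for this because the time is spent waiting on the network rather than on the CPU.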
