About two years ago I wrote a series of Media Center add-ins. Two of them relied heavily on screen scraping, and based on how one built Media Center add-ins at the time, I did a large part of them in ECMAScript.
With that design constraint (it's arguably not a constraint, but go with me here), regular expressions were the best way to extract values from pages to build my 10' user experience.
Well, this is a pretty fragile way to go. It's pretty hard to craft specifically vague regular expressions (ones that will survive the most common site changes), plus testing these things is a real PITA. Despite that, I made the plug-in and it worked pretty well for quite a while.
Well, I have been thinking about re-releasing one of these add-ins based on the new Media Center SDK and MCML. As part of this the design paradigm has gotten back to managed code, so this last weekend I installed Visual Studio Express and started working on the screen scraping portion of a re-release (who knows if I will ever finish this; things are pretty busy with work these days).
In any event, back when I was the Program Manager for the classes in the System.Security.Cryptography namespace in .NET, I ran across a set of classes called the HTML Agility Pack. It gives you the ability to navigate HTML programmatically like you would XML, which means you can do cool stuff like apply XPath queries to the documents.
To get an idea of how powerful this is as a concept, consider this: the following excerpt quickly and easily breaks the Blockbuster rental queue into per-movie XML blobs that can easily be parsed using additional XPath queries.
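// doc is an HtmlDocument that has already loaded the rental queue page
HtmlNodeCollection movies = doc.DocumentNode.SelectNodes("//div[@class='disc' or @class='disc ']");
if (movies != null) // SelectNodes returns null when nothing matches
{
    foreach (HtmlNode movie in movies)
    {
        // do stuff of interest here
    }
}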
That's WAY easier than doing this with regular expressions and way less likely to break, as it's not dependent on the sequence of tags, or spaces, or other oddities.
It's still screen scraping and it's not perfect. For example, you can see this XPath query includes two checks, one for divs with the class of 'disc' and the other for divs with the class of 'disc ' (notice the space). There are fancier ways to deal with this, but the above query "gets the job done(tm)".
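For the curious, here is a rough sketch of one of those fancier ways. It uses the common XPath 1.0 trick of padding the normalized class attribute with spaces and then looking for ' disc ' as a whole token, so stray whitespace or extra class names should not matter. The markup below is made up for illustration (it is not the real rental queue page), and I am assuming the stock HtmlDocument.LoadHtml and SelectNodes calls from the HTML Agility Pack.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        // Hypothetical markup standing in for the real rental queue page.
        string html =
            "<div class='disc'>Movie A</div>" +
            "<div class='disc '>Movie B</div>" +
            "<div class='disc featured'>Movie C</div>" +
            "<div class='discount'>Not a movie</div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Pad the whitespace-normalized class list with spaces, then
        // look for " disc " so only the whole class name matches.
        string query =
            "//div[contains(concat(' ', normalize-space(@class), ' '), ' disc ')]";

        HtmlNodeCollection movies = doc.DocumentNode.SelectNodes(query);
        if (movies == null)
        {
            return; // SelectNodes returns null when nothing matches
        }

        foreach (HtmlNode movie in movies)
        {
            Console.WriteLine(movie.InnerText);
        }
    }
}
```

The idea is that the three 'disc' divs get picked up while the 'discount' one is skipped, which a plain contains(@class, 'disc') would not manage.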
Well, back to where I was going in the first place: Simon Mourier (author of the HTML Agility Pack) is my hero...