HTML Agility Pack oh how I love thee...

About two years ago I wrote a series of Media Center add-ins. Two of them relied heavily on screen scraping, and because of how one built Media Center add-ins at the time, I did a large part of them in ECMAScript.

With that design constraint (it's arguably not a constraint, but go with me here), regular expressions were the best way I had to extract values from pages to build my 10' user experience.

Well, this is a pretty fragile way to go. It's hard to craft specifically vague regular expressions (ones that will survive the most common site changes), and testing these things is a real PITA. Despite that, I made the plug-in and it worked pretty well for quite a while.

I have been thinking about re-releasing one of these add-ins based on the new Media Center SDK and MCML. As part of this, the design paradigm has gotten back to managed code, so this last weekend I installed Visual Studio Express and started working on the screen scraping portion of a re-release (who knows if I will ever finish this; things are pretty busy with work these days).

In any event, back when I was the Program Manager for the classes in the System.Security.Cryptography namespace in .NET, I ran across a set of classes called the HTML Agility Pack. It gives you the ability to navigate HTML programmatically like you would XML, which means you can do cool stuff like apply XPath queries to the documents.

To get an idea of how powerful this is as a concept, consider the following excerpt, which quickly and easily breaks the Blockbuster rental queue into per-movie XML blobs that can easily be parsed using additional XPath queries.

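// doc is an HtmlAgilityPack.HtmlDocument already loaded with the queue page
HtmlNodeCollection movies = doc.DocumentNode.SelectNodes("//div[@class='disc' or @class='disc ']");

// SelectNodes returns null (not an empty collection) when nothing matches
if (movies != null)
{
    foreach (HtmlNode movie in movies)
    {
        // do stuff of interest here
    }
}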

 

That's WAY easier than doing this with regular expressions and way less likely to need changing, as it's not dependent on the sequence of tags, spaces, or other oddities.

It's still screen scraping and it's not perfect. For example, you can see that this XPath query includes two checks, one for divs with the class of 'disc' and the other for divs with the class of 'disc ' (notice the space). There are fancier ways to deal with this, but the above query "gets the job done(tm)".
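
For what it's worth, one of those fancier ways might look like the sketch below. It leans on the standard XPath 1.0 string functions (normalize-space, concat, contains) to match any div whose class attribute contains the token "disc", whatever the surrounding whitespace or extra class names. The DiscFinder class and FindDiscs method are hypothetical names made up for illustration, and the sketch assumes the HtmlAgilityPack assembly is referenced; it is not the code from the add-in.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

// Hypothetical helper for illustration; not part of the HTML Agility Pack.
public static class DiscFinder
{
    // Pads the whitespace-normalized class value with spaces, then looks for
    // " disc " so that "disc", "disc " and "disc featured" all match,
    // while "discount" does not.
    private const string DiscXPath =
        "//div[contains(concat(' ', normalize-space(@class), ' '), ' disc ')]";

    public static List<HtmlNode> FindDiscs(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        HtmlNodeCollection nodes = doc.DocumentNode.SelectNodes(DiscXPath);

        // SelectNodes returns null when nothing matches, so hand back an empty list instead.
        return nodes == null ? new List<HtmlNode>() : new List<HtmlNode>(nodes);
    }
}
```

Fed a made-up snippet with divs classed 'disc', 'disc ', 'disc featured' and 'discount', this should return the first three and skip the last. The trade-off is a longer, uglier query in exchange for not having to list every whitespace variant by hand.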

Well, back to where I was going in the first place: Simon Mourier (author of the HTML Agility Pack) is my hero...

 

Posted on Monday, March 12, 2007 9:32 PM

Feedback


 re: HTML Agility Pack oh how I love thee... 1/6/2010 11:45 AM Bill Deihl

Well it looks like it has been a while since you posted this.

I just pulled down the HTML Agility Pack (Jan 2010).

I am writing a Sitefinity control to read some legacy HTML pages and shove the content into the new site. (Don’t ask why, I just work here…)

Anyhow, the HTML Agility Pack was super easy: I passed it the stream, called GetElementbyId to pull the old content div and, voila, content. If you need to pull content out of a webpage, this is a super easy way to go.

It’s a little scary that it’s a beta, but it worked great for my purposes. Hopefully they will get back on it. I may try to run some LINQ stuff against it too.

Enjoy.

Bill
