Some say screen scraping is a lost art because it is no longer an advanced discipline. That may be true, but there is more than one way to do it. Here are three approaches that are all perfectly acceptable, each suited to different situations.

Old school

It’s old school because this approach has existed since .NET 1.0. It is highly flexible and lets you make the request asynchronously.

public static string ScreenScrape(string url)
{
 System.Net.WebRequest request = System.Net.WebRequest.Create(url);
 // set properties of the request
 using (System.Net.WebResponse response = request.GetResponse())
 {
  using (System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream()))
  {
   return reader.ReadToEnd();
  }
 }
}
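Since the asynchronous capability is the main selling point of this approach, here is a sketch of what the non-blocking variant could look like using the standard BeginGetResponse/EndGetResponse pattern (the callback body and what you do with the result are my own illustration):

```csharp
public static void ScreenScrapeAsync(string url)
{
 System.Net.WebRequest request = System.Net.WebRequest.Create(url);
 // Begin the request without blocking the calling thread
 request.BeginGetResponse(delegate(System.IAsyncResult result)
 {
  // This callback runs on a thread-pool thread when the response arrives
  using (System.Net.WebResponse response = request.EndGetResponse(result))
  using (System.IO.StreamReader reader = new System.IO.StreamReader(response.GetResponseStream()))
  {
   string html = reader.ReadToEnd();
   // do something with html here
  }
 }, null);
}
```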

Modern

In .NET 2.0 we can use the WebClient class, which is a cleaner way of solving the same problem. It is equally flexible and can also work asynchronously.

public static string ScreenScrape(string url)
{
 using (System.Net.WebClient client = new System.Net.WebClient())
 {
  // set properties of the client
  return client.DownloadString(url);
 }
}
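The asynchronous version of the WebClient approach uses the DownloadStringAsync method together with the DownloadStringCompleted event. A sketch (the event handler body is illustrative):

```csharp
public static void ScreenScrapeAsync(string url)
{
 System.Net.WebClient client = new System.Net.WebClient();
 // The completed event fires when the download finishes or fails
 client.DownloadStringCompleted += delegate(object sender, System.Net.DownloadStringCompletedEventArgs e)
 {
  if (e.Error == null)
  {
   string html = e.Result;
   // do something with html here
  }
  ((System.Net.WebClient)sender).Dispose();
 };
 client.DownloadStringAsync(new System.Uri(url));
}
```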

The one-liner

This is a shorter version of the Modern approach, but it deserves a place on the list because it is a one-liner. Tell a nineties developer that you can do screen scraping in one line of code and he won't believe you. The approach offers no flexibility, cannot be used asynchronously, and never disposes the WebClient instance, which is the price of the brevity.

public static string ScreenScrape(string url)
{
 return new System.Net.WebClient().DownloadString(url);
}

That concludes the medley of screen scraping approaches. Pick the one you find best for the given situation.

Update February 11th. The class now updates every feed in the cache asynchronously and automatically.

On the login screen of Headlight, we are soon adding news updates so that our customers can see what is going on with the product. The content is delivered from our company blog via RSS, and we were probably going to use FeedBurner’s JavaScript to display the latest items. Then I started thinking about how easy it would be to write a simple RSS feed parser in C# instead.

It should support caching so it doesn’t parse the feed at every page request. I know there are some very good RSS libraries such as RSS.NET, but I wanted to build it myself. Now it is one hour later and this is the result.

Examples of use

The CreateAndCache method of the RssReader class takes a TimeSpan that specifies when the feed should expire from the cache. In this example it expires after two hours. Remember not to dispose the instance or wrap it in a "using" clause when you use the CreateAndCache method; otherwise you dispose the cached instance.

RssReader reader = RssReader.CreateAndCache("http://feeds.feedburner.com/netslave", new TimeSpan(2, 0, 0));
foreach (RssItem item in reader.Items)
{
 Response.Write(item.Title + "<br />");
}
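I don't have the internals of CreateAndCache in front of me (they are in the download below), but one way such a method could work in an ASP.NET context is to keep instances in HttpRuntime.Cache with an absolute expiration, keyed by feed URL. A sketch under that assumption; the structure here is illustrative, not the actual implementation:

```csharp
public static RssReader CreateAndCache(string url, System.TimeSpan expiration)
{
 RssReader reader = (RssReader)System.Web.HttpRuntime.Cache[url];
 if (reader == null)
 {
  reader = new RssReader();
  reader.FeedUrl = url;
  // Keep the instance cached until the absolute expiration time passes
  System.Web.HttpRuntime.Cache.Insert(url, reader, null,
   System.DateTime.Now.Add(expiration), System.Web.Caching.Cache.NoSlidingExpiration);
 }
 return reader;
}
```

This is also why disposing a CreateAndCache instance is dangerous: the same object is handed out to every caller until it expires.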

You can also use the class directly without caching.

using (RssReader rss = new RssReader())
{
 rss.FeedUrl = "http://feeds.feedburner.com/netslave";
 foreach (RssItem item in rss.Execute())
 {
  Response.Write(item.Title + "<br />");
 }
}

It doesn’t parse all the different XML tags of an RSS feed, just the basic ones. However, it is very easy to add more yourself.
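To give an idea of what "adding more yourself" involves, here is a self-contained sketch of the kind of parsing a reader like this does with XmlDocument. The class and property names are illustrative, not necessarily the ones in the zip; supporting an extra tag such as pubDate boils down to reading one more child node per item:

```csharp
using System;
using System.Xml;

public class ParseDemo
{
 public static void Main()
 {
  // Inline sample feed so the sketch runs without a network request
  string xml = "<rss><channel>" +
   "<item><title>Hello</title><link>http://example.org/</link>" +
   "<pubDate>Sun, 11 Feb 2007 00:00:00 GMT</pubDate></item>" +
   "</channel></rss>";

  XmlDocument doc = new XmlDocument();
  doc.LoadXml(xml);

  // Each <item> node becomes one parsed entry; unknown tags are simply ignored
  foreach (XmlNode node in doc.SelectNodes("rss/channel/item"))
  {
   string title = node.SelectSingleNode("title").InnerText;
   string pubDate = node.SelectSingleNode("pubDate").InnerText; // the "extra" tag
   Console.WriteLine(title + " (" + pubDate + ")");
  }
 }
}
```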

Download

RssReader.zip (1.5 KB)