Hi, my name is Gaurav. I am a software developer with over 2 years of experience on platforms like C#.NET, ASP.NET, and WPF. I am excited to share my ideas on developing web scrapers here.

What is Web Scraping

Web Scraping (also known as Screen Scraping, Web Data Extraction, Web Harvesting, etc.), simply put, is a technique used to extract large amounts of data from websites.

Data from third-party websites on the Internet can normally be viewed only in a web browser. For example, consider the data listings at yellow pages directories, real estate sites, social networks, industrial inventories, online shopping sites, contact databases, etc. None of these websites offers the functionality to save a copy of the data to your local storage, so you are left with manually copying and pasting the data displayed on the site. This is, of course, monotonous and time-consuming.

Web Scraping helps you automate this process so that you don't have to copy and paste the information manually every time. Web scraping software can do this for you in just a few seconds.

Web scraping software interacts with websites in the same way as any web browser. The only difference is that it saves the target data to your local system instead of displaying it on screen.

Web Scraping Tools

Generally, scraping of web pages in ASP.NET is done with the HttpWebRequest and HttpWebResponse classes of C#. However, when an application performs server-side navigation using AJAX, it becomes very difficult to fetch the page data with HttpWebRequest alone (we need to perform tricks to fetch the next page's data). The same thing can be done very easily and quickly with the WatiN tool, which drives a real browser. My objective here is not to challenge the HttpWebRequest and HttpWebResponse approach, but to show how effectively we can scrape websites using testing tools like WatiN.
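
To give a flavour of the approach, here is a minimal WatiN sketch. It assumes the WatiN.Core library is referenced and that the target page exposes a "Next" link driven by AJAX; the URL and the element finder are placeholder assumptions, not a real site.

using System;
using WatiN.Core;

class WatinScraperDemo
{
    [STAThread] // WatiN drives Internet Explorer over COM, which requires an STA thread
    static void Main()
    {
        // Open a real browser window and navigate to the listing page (placeholder URL)
        using (var browser = new IE("http://example.com/listing"))
        {
            // Let the browser handle the AJAX navigation: just click "Next"
            browser.Link(Find.ByText("Next")).Click();

            // Read the fully rendered HTML of the current page
            string html = browser.Html;
            Console.WriteLine(html.Length);
        }
    }
}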

C# Web Crawler

Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low level plumbing (multithreading, http requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug-in your own implementations of core interfaces to take complete control over the crawl process.
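
As a rough sketch of that event model, the snippet below registers for Abot's PageCrawlCompleted event. It assumes an Abot 1.x-style synchronous API, and the URL is a placeholder.

using System;
using Abot.Crawler;
using Abot.Poco;

class AbotDemo
{
    static void Main()
    {
        // PoliteWebCrawler adds rate limiting on top of the basic crawler
        var crawler = new PoliteWebCrawler();

        // Register for the event fired after each page has been downloaded
        crawler.PageCrawlCompleted += (sender, e) =>
        {
            CrawledPage page = e.CrawledPage;
            Console.WriteLine("Crawled {0} ({1} chars)",
                page.Uri,
                page.Content.Text == null ? 0 : page.Content.Text.Length);
        };

        // Blocks until the crawl of the whole site is complete
        CrawlResult result = crawler.Crawl(new Uri("http://example.com/"));
        Console.WriteLine("Error occurred: {0}", result.ErrorOccurred);
    }
}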

What’s So Great About It?

C# web crawlers like Abot are known for the following characteristics:

• Open Source (Free for commercial and personal use)
• Speed
• Every part of the architecture is pluggable
• Heavily unit tested (High code coverage)
• Very lightweight
• Easy to get started quickly
• No database required
• Easily customize crawl behavior (When to crawl a page, when not to crawl, when to crawl page’s links, etc.)
• Runs on Mono

Simple Web Crawler in C# to Get HTML Content

using System.IO;
using System.Net;
// ...
// ... some other code
// ...
// Create the request for the page to scrape (type your URL here)
WebRequest request = WebRequest.Create("http://google.com/");
using (WebResponse response = request.GetResponse())
{
    using (StreamReader responseReader = new StreamReader(response.GetResponseStream()))
    {
        // Read the complete HTML of the response
        string responseData = responseReader.ReadToEnd();
        // Choose where you want to save the content; StreamWriter
        // creates the file for you if it does not already exist
        string path = Path.Combine(Server.MapPath("."), "Asptrick.html");
        using (StreamWriter writer = new StreamWriter(path))
        {
            writer.Write(responseData);
        }
    }
}

How to identify Requests

Use a regular browser and Fiddler (if the browser's developer tools are not up to scratch) and take a look at the request and response headers.

Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference). As for getting blocked after a certain number of calls: throttle your calls, and only make one call every x seconds. Behave nicely to the site and it will behave nicely to you.

Chances are good that the site simply looks at the number of calls per second coming from your IP address and blocks the address once that number passes a threshold.
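
Putting both ideas together, here is a minimal sketch that copies browser-like headers onto each request and throttles calls with a fixed delay. The header values, the URLs, and the 5-second interval are illustrative assumptions; take the real values from what you observe in Fiddler.

using System;
using System.Net;
using System.Threading;

class PoliteRequester
{
    static void Main()
    {
        string[] urls = { "http://example.com/page1", "http://example.com/page2" };
        foreach (string url in urls)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
            // Mimic what a regular browser sends (sample values; copy yours from Fiddler)
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)";
            request.Accept = "text/html,application/xhtml+xml";
            request.Headers["Accept-Language"] = "en-US,en;q=0.9";

            using (var response = (HttpWebResponse)request.GetResponse())
            {
                Console.WriteLine("{0} -> {1}", url, response.StatusCode);
            }

            // Only make one call every x seconds (here x = 5)
            Thread.Sleep(TimeSpan.FromSeconds(5));
        }
    }
}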

So you can see how worthwhile it is to design a web scraper: it can save a lot of your time and effort in gathering data for business intelligence or market research.