Hi, my name is Gaurav. I am a software developer with over 2 years of experience on various platforms such as C#.NET, ASP.NET, WPF, etc. I am excited to share my ideas on developing web scrapers here.

What is Web Scraping

Web Scraping (also known as Screen Scraping, Web Data Extraction, Web Harvesting etc.), simply put, is a technique used to extract large amounts of data from websites.

Data from third-party websites on the Internet can normally be viewed only with a web browser. For example, consider the data listings in yellow pages directories, real estate sites, social networks, industrial inventories, online shopping sites, contact databases, etc. These websites typically do not offer a way to save a copy of the data to your local storage, so the only option left is to copy and paste the displayed data manually. This is of course very monotonous and time-consuming.

Web scraping helps you automate this process, so that you don’t have to copy and paste the information manually every time. Web scraping software can do this for you in just a few seconds.

Web scraping software interacts with websites in much the same way as a web browser. The only difference is that it saves the target data to your local system rather than displaying it on the screen.

Web Scraping Tools

Generally, web pages are scraped with the HttpWebRequest and HttpWebResponse classes, called from C# in ASP.NET. However, when navigation between pages is performed using AJAX in the application, it becomes very difficult to fetch the page data with HttpWebRequest (we need to perform tricks to fetch the next page’s data). The same thing can be done very easily and quickly with the WatiN tool, which drives a real browser. My objective here is not to challenge HttpWebRequest and HttpWebResponse, but to show how effectively we can scrape web sites using testing tools like WatiN.


C# Web Crawler

Abot is an open source C# web crawler built for speed and flexibility. It takes care of the low-level plumbing (multithreading, HTTP requests, scheduling, link parsing, etc.). You just register for events to process the page data. You can also plug in your own implementations of core interfaces to take complete control over the crawl process.

What’s So Great About It?

Abot is known for the following characteristics:

• Open Source (Free for commercial and personal use)
• Speed
• Every part of the architecture is pluggable
• Heavily unit tested (High code coverage)
• Very lightweight
• Easy to get started quickly
• No database required
• Easily customizable crawl behavior (when to crawl a page, when not to crawl, when to crawl a page’s links, etc.)
• Runs on Mono

Simple Web Crawler in C# to get HTML content


using System.IO;
using System.Net;
// ...
// ... some other code
// ...
// Type the URL to fetch here
WebRequest request = WebRequest.Create("http://google.com/");
string responseData;
using (WebResponse response = request.GetResponse())
using (StreamReader responseReader = new StreamReader(response.GetResponseStream()))
{
    // Read the whole response body (the page's HTML) into a string
    responseData = responseReader.ReadToEnd();
}
// Where do you want to save the content? (Server.MapPath needs an ASP.NET page context)
string path = Server.MapPath(".");
using (StreamWriter writer = new StreamWriter(Path.Combine(path, "Asptrick.html")))
{
    // Write the downloaded HTML to the file
    writer.Write(responseData);
}
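
Once the HTML of a page is available as a string (like responseData above), the data of interest still has to be pulled out of it. Here is a minimal sketch of that step, assuming we only want the link targets on the page. The class name LinkExtractor and the regular expression are my own illustration and not part of any library. A regular expression is enough for a quick demo, but for messy real-world markup a proper HTML parser is usually the safer choice.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper for illustration: collects href values from anchor tags.
public static class LinkExtractor
{
    // Simplified pattern: an <a ...> tag with a quoted href attribute.
    // It does not handle unquoted attributes or other unusual markup.
    private static readonly Regex HrefPattern = new Regex(
        "<a\\s[^>]*?href\\s*=\\s*[\"']([^\"']*)[\"']",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns every link target found in the given HTML, in document order.
    public static List<string> ExtractLinks(string html)
    {
        var links = new List<string>();
        foreach (Match match in HrefPattern.Matches(html))
        {
            links.Add(match.Groups[1].Value);
        }
        return links;
    }
}
```

Calling LinkExtractor.ExtractLinks(responseData) would then give the list of links, which can be saved to a file or used to decide which page to fetch next.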

How to identify Requests

Use a regular browser and Fiddler (if the browser’s developer tools are not up to scratch) to take a look at the request and response headers.

Build up your requests and request headers to match what the browser sends (you can use a couple of different browsers to assess whether this makes a difference). As for ‘getting blocked after a certain amount of calls’: throttle your calls. Make only one call every x seconds. Behave nicely to the site and it will behave nicely to you.
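
To make the idea of matching the browser's headers concrete, here is a rough sketch using HttpWebRequest. The class name BrowserLikeRequest is hypothetical, and the header values are only placeholders that I assumed for the example. You should copy the real values from what you see in Fiddler or the developer tools. Note that on newer .NET versions WebRequest is marked obsolete in favor of HttpClient, so treat this as an illustration of the idea rather than the recommended API.

```csharp
using System;
using System.Net;

// Hypothetical helper for illustration: builds a GET request whose headers
// resemble what a desktop browser sends. No network call is made here.
public static class BrowserLikeRequest
{
    public static HttpWebRequest Create(string url)
    {
        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";

        // Placeholder values (assumptions): replace them with the headers
        // captured from your own browser session.
        request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/115.0";
        request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8";
        request.Headers["Accept-Language"] = "en-US,en;q=0.5";

        // Browsers accept compressed responses; let .NET decompress them for us.
        request.AutomaticDecompression = DecompressionMethods.GZip | DecompressionMethods.Deflate;
        return request;
    }
}
```

The returned request can be used just like the plain WebRequest shown earlier, by calling GetResponse() on it and reading the response stream.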

Chances are good that they simply look at the number of calls from your IP address per second and if it passes a threshold, the IP address gets blocked.
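
The throttling advice above can be sketched as a small helper that makes the caller wait until a minimum interval has passed since the previous call. The class name RequestThrottle and the chosen interval are my own assumptions for the example. The right delay depends on the site, and this sketch simply blocks the calling thread.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

// Hypothetical helper for illustration: enforces a minimum gap between calls.
public sealed class RequestThrottle
{
    private readonly TimeSpan _minInterval;
    private readonly Stopwatch _sinceLastCall = new Stopwatch();
    private readonly object _lock = new object();

    public RequestThrottle(TimeSpan minInterval)
    {
        _minInterval = minInterval;
    }

    // Blocks until at least minInterval has elapsed since the previous
    // call to Wait() returned. The very first call returns immediately.
    public void Wait()
    {
        lock (_lock)
        {
            if (_sinceLastCall.IsRunning)
            {
                TimeSpan remaining = _minInterval - _sinceLastCall.Elapsed;
                if (remaining > TimeSpan.Zero)
                {
                    Thread.Sleep(remaining);
                }
            }
            // Start (or restart) timing from the moment this call is released.
            _sinceLastCall.Restart();
        }
    }
}
```

In a scraper you would create one throttle, for example new RequestThrottle(TimeSpan.FromSeconds(2)), and call Wait() right before every request to the same site.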

So you see how interesting it is to design a web scraper, which can save you a lot of time and effort in gathering data for business intelligence or market research.
