Using HttpWebRequest & HttpWebResponse to automate web browsing

If you are a developer, there will be times when you want to automate a user's actions in a browser. For example, if you are trying to download a complete website locally (a form of web scraping), you need to fetch pages from the website recursively and follow the links to its sub pages. This can be done manually, but it takes an enormous amount of time if the website has hundreds of pages.

The alternative is to automate the process. Here I demonstrate a simple C# script that fetches one page and reads its content. It needs the System.IO, System.Net and System.Text.RegularExpressions namespaces. Here is the sample code:

1> string URL = "";
3> HttpWebRequest HWR = (HttpWebRequest)HttpWebRequest.Create(URL);
4> HWR.Method = "GET";
5> StreamReader SR = new StreamReader(HWR.GetResponse().GetResponseStream());
6> string Response = SR.ReadToEnd();
8> string Pattern = @"\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?";
9> Regex R = new Regex(Pattern, RegexOptions.Singleline | RegexOptions.IgnoreCase);
10> Match M = R.Match(Response);
11> string IP = M.ToString();
13> SR.Close();

Line 1 sets the URL of the page to download. It can be any page on the site, including a sub page.
Line 3 creates a new HttpWebRequest object. This .NET class lets you make HTTP requests and supports custom headers as well. This example does not set any headers.
Line 4 sets the request method. The most common methods are GET and POST. GET is the basic request type for fetching a page; POST is used when submitting forms.
Line 5 sends the request and wraps the response in a StreamReader. The response body arrives as a stream of bytes, so a StreamReader is needed to read it as text.
Line 6 captures the whole response as an in-memory string. This is fine for smaller pages, but for larger pages you might want to consider saving to disk or to an external database.
Lines 8 to 11 demonstrate parsing the response to extract some data (in this case an IP address). Any text shaped like an IPv4 address, that is, four groups of one to three digits separated by dots, is matched; the pattern does not check that each group is 255 or less. Since Regex.Match is used, only the first match is returned, and it is stored in the string IP.
Line 13 closes the StreamReader, which also closes the underlying response stream.
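
The steps above can also be arranged so that the request picks up a custom header, the reader is disposed automatically, and every IP-like string is collected instead of only the first. The following is a minimal sketch under a few assumptions: the class name PageFetcher, its method names, and the User-Agent value are made up for illustration and are not part of the original script.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Net;
using System.Text.RegularExpressions;

public static class PageFetcher
{
    // Same idea as the pattern on line 8: four groups of one to three digits
    // separated by literal dots.
    private static readonly Regex IpPattern =
        new Regex(@"\d\d?\d?\.\d\d?\d?\.\d\d?\d?\.\d\d?\d?");

    // Builds a GET request with one custom header.
    // Nothing is sent over the network until GetResponse() is called.
    public static HttpWebRequest BuildRequest(string url)
    {
        HttpWebRequest request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        request.UserAgent = "SampleFetcher/1.0"; // hypothetical value, chosen for this sketch
        return request;
    }

    // Downloads the page body as a string. The using blocks close the
    // response and the reader even if reading throws.
    public static string Download(string url)
    {
        using (WebResponse response = BuildRequest(url).GetResponse())
        using (StreamReader reader = new StreamReader(response.GetResponseStream()))
        {
            return reader.ReadToEnd();
        }
    }

    // Returns every IP-like string in the text, in order of appearance.
    public static List<string> ExtractAllIps(string text)
    {
        var result = new List<string>();
        foreach (Match m in IpPattern.Matches(text))
        {
            result.Add(m.Value);
        }
        return result;
    }
}
```

Calling PageFetcher.ExtractAllIps(PageFetcher.Download(url)) would then give a list of all candidates on the page. Keeping the parsing in its own method means it can be tried on any string without touching the network.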

Regex is a widely used class in .NET and worth learning early. It has been very useful for me. I will blog about it later.
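
As one more illustration of Regex in the scraping scenario from the start of this post, the sketch below pulls links out of a downloaded page so that sub pages could be fetched next. It is only a rough sketch: the class and method names are hypothetical, the pattern handles double-quoted href attributes only, and real-world HTML is often better served by a proper HTML parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractor
{
    // Naive pattern: captures the value of href="..." (double quotes only).
    private static readonly Regex HrefPattern =
        new Regex("href\\s*=\\s*\"([^\"]+)\"", RegexOptions.IgnoreCase);

    // Returns the distinct absolute http/https links found in the HTML.
    // Relative links such as "/about" are resolved against baseUri.
    public static List<string> ExtractLinks(string html, Uri baseUri)
    {
        var links = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            Uri absolute;
            if (!Uri.TryCreate(baseUri, m.Groups[1].Value, out absolute))
            {
                continue; // skip values that cannot be turned into a URI
            }

            // Ignore mailto:, javascript: and other non-web schemes.
            bool isWeb = absolute.Scheme == Uri.UriSchemeHttp
                      || absolute.Scheme == Uri.UriSchemeHttps;

            if (isWeb && !links.Contains(absolute.AbsoluteUri))
            {
                links.Add(absolute.AbsoluteUri);
            }
        }
        return links;
    }
}
```

Each returned address could be passed to the same download code shown earlier; a crawler would also need to remember which pages it has already visited so that it does not loop.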
