There are many questions and discussions about scraping websites built with ASP (.aspx extension): how to scrape ajax content, how to emulate button clicks, what __doPostBack is, and so on.

Scraping ASP isn’t that difficult, you just have to be careful. In this article I’ll show how to do it and provide base code for emulating ASP clicks and post requests.

We will use https://ucpi.sco.ca.gov/ucp/Default.aspx as an example site to scrape. As usual our tools are PHP, Curl and SimpleHtmlDom.

## General information about ASP scraping

In order to successfully scrape an ASP website you need to do the following:

  1. Load the page with the form; get the cookies it sets and use them later.
  2. Get the values of all INPUT elements whose name starts with “__” and post them with your request. You should get fields like “__VIEWSTATE”, “__VIEWSTATEGENERATOR”, “__EVENTVALIDATION”, and possibly others. Not all of them will necessarily be present at once.
  3. If you want to get the next page, or click some button with ajax, set “__EVENTTARGET” and “__EVENTARGUMENT” in your post request. For example:

     ```html
     <a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvResults','Page$2')">2</a>
     ```

     Here “__EVENTTARGET” is “ctl00$ContentPlaceHolder1$gvResults” and “__EVENTARGUMENT” is “Page$2”.

  4. Follow redirects and send subsequent requests to the last page you were redirected to.
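The target and argument from step 3 can be pulled out of the link’s href automatically instead of by hand. A minimal sketch (the `parseDoPostBack` helper and its regex are mine, not part of the scraper built below):

```php
<?php
// Extract __EVENTTARGET and __EVENTARGUMENT from a __doPostBack href.
// Returns array($eventTarget, $eventArgument), or false if the href doesn't match.
function parseDoPostBack($href) {
    if(!preg_match("/__doPostBack\\('([^']*)','([^']*)'\\)/", $href, $m)) {
        return false;
    }
    return array($m[1], $m[2]);
}

$href = "javascript:__doPostBack('ctl00\$ContentPlaceHolder1\$gvResults','Page\$2')";
list($target, $argument) = parseDoPostBack($href);
echo $target."\n";   // ctl00$ContentPlaceHolder1$gvResults
echo $argument."\n"; // Page$2
```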

## Download and parse page

First we need to get a SimpleHtmlDom object to work with:

```php
include_once __DIR__.'/simple_html_dom.php';

define('COOKIE_FILE', __DIR__.'/cookie.txt');
@unlink(COOKIE_FILE); // clear cookies before we start
define('CURL_LOG_FILE', __DIR__.'/request.txt');
@unlink(CURL_LOG_FILE); // clear curl log

/** Get simplehtmldom object from url
 * @param $url
 * @param $post
 * @return bool|simple_html_dom
 */
public function getDom($url, $post = false) {
    $f = fopen(CURL_LOG_FILE, 'a+'); // curl session log file
    $header = array();
    if($this->lastUrl) $header[] = "Referer: {$this->lastUrl}";
    $curlOptions = array(
        CURLOPT_ENCODING => 'gzip,deflate',
        CURLOPT_AUTOREFERER => 1,
        CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
        CURLOPT_TIMEOUT => 120, // timeout on response
        CURLOPT_URL => $url,
        CURLOPT_HTTPHEADER => $header, // send the Referer header we built above
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_SSL_VERIFYHOST => false,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 9,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_HEADER => 0,
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        CURLOPT_COOKIEFILE => COOKIE_FILE,
        CURLOPT_COOKIEJAR => COOKIE_FILE,
        CURLOPT_STDERR => $f, // log session
        CURLOPT_VERBOSE => true,
    );
    if($post) { // add post options
        $curlOptions[CURLOPT_POSTFIELDS] = $post;
        $curlOptions[CURLOPT_POST] = true;
    }
    $curl = curl_init();
    curl_setopt_array($curl, $curlOptions);
    $data = curl_exec($curl);
    $this->lastUrl = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL); // get url we've been redirected to
    curl_close($curl);
    if($this->dom) {
        $this->dom->clear();
        $this->dom = false;
    }
    $dom = $this->dom = str_get_html($data);
    fwrite($f, "{$post}\n\n");
    fwrite($f, "-----------------------------------------------------------\n\n");
    fclose($f);
    return $dom;
}
```

I’ll explain a little bit what’s going on:

  1. Log curl sessions to request.txt via `CURLOPT_STDERR => $f`, so we can debug later if something goes wrong.
  2. Tell Curl to use cookies: `CURLOPT_COOKIEFILE => COOKIE_FILE, CURLOPT_COOKIEJAR => COOKIE_FILE`.
  3. Don’t verify SSL certificates: `CURLOPT_SSL_VERIFYPEER => false, CURLOPT_SSL_VERIFYHOST => false`. This allows us to scrape SSL websites; otherwise Curl returns an error when it can’t validate the SSL certificate.
  4. Get the final url after all redirects and save it to `$this->lastUrl`. Then use it as the Referer header on the next request.
  5. Return the simplehtmldom object and also save it to the `$this->dom` field.

## Emulating ASP post requests

Next we need to create post parameters for Curl. They should be in the form of

```php
"postvar1=value1&postvar2=value2&postvar3=value3"
```

You’ll find more examples here: http://curl.haxx.se/libcurl/php/examples/simplepost.html
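As a quick sanity check of that format, PHP’s built-in `http_build_query` with RFC 3986 encoding produces the same kind of string (the field names and values below are made up for illustration):

```php
<?php
// Build a Curl-style post string from a name => value array.
// PHP_QUERY_RFC3986 makes http_build_query rawurlencode keys and values,
// matching the manual rawurlencode() approach used in createASPPostParams.
$fields = array(
    'postvar1' => 'value1',
    'postvar2' => 'value 2',         // spaces must be encoded
    '__EVENTTARGET' => 'ctl00$grid', // ASP control names contain '$'
);
$post = http_build_query($fields, '', '&', PHP_QUERY_RFC3986);
echo $post; // postvar1=value1&postvar2=value%202&__EVENTTARGET=ctl00%24grid
```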

```php
function createASPPostParams($dom, array $params) {
    $postData = $dom->find('input,select,textarea');
    $postFields = array();
    foreach($postData as $d) {
        $name = $d->name;
        if(trim($name) == '' || in_array($name, $this->exclude)) continue;
        $value = isset($params[$name]) ? $params[$name] : $d->value;
        $postFields[] = rawurlencode($name).'='.rawurlencode($value);
    }
    $postFields = implode('&', $postFields);
    return $postFields;
}
```

In order to make a correct post request we get all inputs, selects and textareas from the provided simplehtmldom object. Then we loop through these elements.

If an element is not in the exclude list (which we set beforehand), we check whether we have an override for it in `$params`. The value is taken from `$params[$name]` if it is set, otherwise from the dom:

```php
$value = isset($params[$name]) ? $params[$name] : $d->value;
```

Then we rawurlencode each name and value and join the resulting array with ‘&’. This gives us the post string Curl expects.
We can use this function like this to create the parameters for a lastname search:

```php
$post = $scraper->createASPPostParams($dom, array('ctl00$ContentPlaceHolder1$txtLastName' => 'smith'));
$scraper->getDom($url, $post); // make post request
```

Here “ctl00$ContentPlaceHolder1$txtLastName” is the name of the lastname input tag. You can get it by inspecting the lastname element in Chrome Developer Tools.

## Important notice

Since we put all found inputs into the request, it’s necessary to set up the exclude array. Otherwise we’ll also send “Submit” and “Reset” inputs to the server, which might lead to weird bugs.
Once I spent an hour debugging why the next page button wasn’t working, and then realized I was also sending the “10 items per page” select element, which reset the form!

You can do it the other way though: manually pick every input you need to send and don’t include unnecessary stuff. That’s up to you.
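To make the blacklist-versus-whitelist choice concrete, here is a standalone sketch of both approaches (plain arrays instead of a DOM; the field names and values are illustrative, not from the example site):

```php
<?php
// Fields as scraped from the form, name => value.
$formFields = array(
    '__VIEWSTATE'       => 'some-viewstate-blob',
    'ctl00$txtLastName' => 'smith',
    'ctl00$btnClear'    => 'Clear', // submit-style input that would reset the search
    'ctl00$ddlPageSize' => '10',    // select that can also reset the form
);

// Blacklist: send everything except known-troublesome controls.
function blacklistFields(array $fields, array $exclude) {
    return array_diff_key($fields, array_flip($exclude));
}

// Whitelist: send only the fields you explicitly picked.
function whitelistFields(array $fields, array $keep) {
    return array_intersect_key($fields, array_flip($keep));
}

$post = blacklistFields($formFields, array('ctl00$btnClear', 'ctl00$ddlPageSize'));
// $post now contains only __VIEWSTATE and ctl00$txtLastName
```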
## Emulating __doPostBack ajax requests

To emulate __doPostBack we need to set __EVENTTARGET and __EVENTARGUMENT, which you can also get from the button (or link) element.
```php
function doPostRequest($url, array $params) {
    $post = $this->createASPPostParams($this->dom, $params);
    return $this->getDom($url, $post);
}

function doPostBack($url, $eventTarget, $eventArgument = '') {
    return $this->doPostRequest($url, array(
        '__EVENTTARGET' => $eventTarget,
        '__EVENTARGUMENT' => $eventArgument
    ));
}
```

## Final scraper code

```php
include_once __DIR__.'/simple_html_dom.php';

define('COOKIE_FILE', __DIR__.'/cookie.txt');
@unlink(COOKIE_FILE); // clear cookies before we start
define('CURL_LOG_FILE', __DIR__.'/request.txt');
@unlink(CURL_LOG_FILE); // clear curl log

class ASPBrowser {
    public $exclude = array();
    public $lastUrl = '';
    public $dom = false;

    /** Get simplehtmldom object from url
     * @param $url
     * @param $post
     * @return bool|simple_html_dom
     */
    public function getDom($url, $post = false) {
        $f = fopen(CURL_LOG_FILE, 'a+'); // curl session log file
        $header = array();
        if($this->lastUrl) $header[] = "Referer: {$this->lastUrl}";
        $curlOptions = array(
            CURLOPT_ENCODING => 'gzip,deflate',
            CURLOPT_AUTOREFERER => 1,
            CURLOPT_CONNECTTIMEOUT => 120, // timeout on connect
            CURLOPT_TIMEOUT => 120, // timeout on response
            CURLOPT_URL => $url,
            CURLOPT_HTTPHEADER => $header, // send the Referer header we built above
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_MAXREDIRS => 9,
            CURLOPT_RETURNTRANSFER => 1,
            CURLOPT_HEADER => 0,
            CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
            CURLOPT_COOKIEFILE => COOKIE_FILE,
            CURLOPT_COOKIEJAR => COOKIE_FILE,
            CURLOPT_STDERR => $f, // log session
            CURLOPT_VERBOSE => true,
        );
        if($post) { // add post options
            $curlOptions[CURLOPT_POSTFIELDS] = $post;
            $curlOptions[CURLOPT_POST] = true;
        }
        $curl = curl_init();
        curl_setopt_array($curl, $curlOptions);
        $data = curl_exec($curl);
        $this->lastUrl = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL); // get url we've been redirected to
        curl_close($curl);
        if($this->dom) {
            $this->dom->clear();
            $this->dom = false;
        }
        $dom = $this->dom = str_get_html($data);
        fwrite($f, "{$post}\n\n");
        fwrite($f, "-----------------------------------------------------------\n\n");
        fclose($f);
        return $dom;
    }

    function createASPPostParams($dom, array $params) {
        $postData = $dom->find('input,select,textarea');
        $postFields = array();
        foreach($postData as $d) {
            $name = $d->name;
            if(trim($name) == '' || in_array($name, $this->exclude)) continue;
            $value = isset($params[$name]) ? $params[$name] : $d->value;
            $postFields[] = rawurlencode($name).'='.rawurlencode($value);
        }
        $postFields = implode('&', $postFields);
        return $postFields;
    }

    function doPostRequest($url, array $params) {
        $post = $this->createASPPostParams($this->dom, $params);
        return $this->getDom($url, $post);
    }

    function doPostBack($url, $eventTarget, $eventArgument = '') {
        return $this->doPostRequest($url, array(
            '__EVENTTARGET' => $eventTarget,
            '__EVENTARGUMENT' => $eventArgument
        ));
    }

    function doGetRequest($url) {
        return $this->getDom($url);
    }
}
```

## Example usage code

For example, let’s scrape the Unclaimed Property Search page: we’ll programmatically search for the last name ‘Smith’ and print 4 pages of results.

*Unclaimed property search*

```php
function removeSpaces($s) {
    return trim(preg_replace('!\s+!', ' ', $s));
}

function printTableData(simple_html_dom $dom) {
    foreach($dom->find('tr.gridViewRow, tr.gridViewAlternateRow') as $tr) {
        $td = $tr->find('td');
        echo removeSpaces($td[0]->innertext.';'.$td[1]->innertext.';'.$td[2]->innertext.';');
        echo $td[3]->find('a', 0)->href.';';
        echo $td[4]->find('img', 0)->alt."\n";
    }
}

$url = 'https://ucpi.sco.ca.gov/ucp/Default.aspx';

$browser = new ASPBrowser();
$browser->exclude = array('ctl00$ContentPlaceHolder1$btnClear');
$browser->doGetRequest($url); // get form
$resultPage = $browser->doPostRequest($url, array('ctl00$ContentPlaceHolder1$txtLastName' => 'smith')); // get 1st page of results

$browser->exclude = array('ctl00$ContentPlaceHolder1$btnClearInd', 'ctl00$ContentPlaceHolder1$ddlPageSize', 'ctl00$ContentPlaceHolder1$btnSearchInd');
for($i = 2; $i < 5; $i++) {
    printTableData($resultPage);
    $resultPage = $browser->doPostBack($browser->lastUrl, 'ctl00$ContentPlaceHolder1$gvResults', 'Page$'.$i);
}
printTableData($resultPage);
$resultPage->clear();
```

Note that we exclude the “clear” button (“ctl00$ContentPlaceHolder1$btnClear”) for the initial request, and the “clear”, “page size”, and “search” elements for the next page requests.

The output will look like this:

```
SMITH;6151 MCKNIGHT DR ;LAKEWOOD CA 90713;https://ucpi.sco.ca.gov/ucp/InterimDetails.aspx?propertyRecID=154318;Interim ID
SMITH;2458 ORCHID ST ;FAIRFIELD CA 94533;https://ucpi.sco.ca.gov/ucp/NoticeDetails.aspx?propertyRecID=1460817;Notice ID
SMITH;448 MARSHALL WAY ;ALAMEDA CA 94501;https://ucpi.sco.ca.gov/ucp/NoticeDetails.aspx?propertyRecID=2300011;Notice ID
SMITH;25352 JUANITA AVE MORENO VL;MORENO VALLEY CA 92553-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=6262584;Property ID
SMITH;25352 JUANITA AVE MORENO VL;MORENO VALLEY CA 92553-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=6262584;Property ID
SMITH;520 AVENIDA GRANADA;PALM SPRINGS CA 92264-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=7247733;Property ID
SMITH;615 E UNION ST;PASADENA 00000-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=8588991;Property ID
SMITH;10 MARQUETTE APT 232;IRVINE CA 92715-4207;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=4051511;Property ID
SMITH;721 SO. 8TH ST.;ALHAMBRA CA 91801-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=2614692;Property ID
SMITH; ; ;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=3523067;Property ID
SMITH;11370 ANDERSON ST;LOMA LINDA CA 92354-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=5968026;Property ID
SMITH;1436 W. TRENTON ST.;SAN BERNARDINO CA 92411-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=5388905;Property ID
SMITH;38106 N 12TH ST E;PALMDALE CA 93510-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=6826310;Property ID
SMITH;4886 N MUIRWOOD CT;SIMI VALLEY CA 93063-;https://ucpi.sco.ca.gov/ucp/PropertyDetails.aspx?propertyRecID=6829442;Property ID
```

That’s it! I’ve struggled with ASP scraping for some time, and I hope this article was useful to you.
As usual, you can find the final source code in the Github repo.