Scraping ASP websites in PHP with __doPostBack ajax emulation
There are many questions and discussions about scraping websites built with ASP (.aspx extension). People ask how to scrape ajax content, how to emulate button clicks, what __doPostBack is, and so on.
Scraping ASP isn’t that difficult; you just have to be careful. In this article I’ll show how to do it and provide base code for emulating ASP clicks and post requests.
We will use https://ucpi.sco.ca.gov/ucp/Default.aspx as an example site to scrape. As usual our tools are PHP, Curl and SimpleHtmlDom.
General information about ASP scraping
In order to successfully scrape an ASP website you need to do the following:
- Load the page with the form; get the cookies it sets and use them in later requests.
- Get the values of all INPUT elements whose name starts with “__” and post them with your request. You should get fields like “__VIEWSTATE”, “__VIEWSTATEGENERATOR”, “__EVENTVALIDATION” and possibly others; not all of them will necessarily be present at once.
- If you want to get the next page, or click some button with ajax, set “__EVENTTARGET” and “__EVENTARGUMENT” in your post request (a full example request body follows this list). For example, for this pager link:

```html
<a href="javascript:__doPostBack('ctl00$ContentPlaceHolder1$gvResults','Page$2')">2</a>
```

Here “__EVENTTARGET” is “ctl00$ContentPlaceHolder1$gvResults” and “__EVENTARGUMENT” is “Page$2”.
- Follow redirects and send subsequent requests to the last page you were redirected to.
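To make this concrete, here is roughly what the body of an emulated “page 2” post looks like (wrapped for readability; the viewstate values are placeholders you must copy from the page you just loaded, and note how “$” is urlencoded to “%24”):

```
__VIEWSTATE=(copied from the page)&__VIEWSTATEGENERATOR=(copied)&__EVENTVALIDATION=(copied)
&__EVENTTARGET=ctl00%24ContentPlaceHolder1%24gvResults&__EVENTARGUMENT=Page%242
```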
Download and parse page
First we need to get a SimpleHtmlDom object to work with:
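Here is a minimal sketch of such a download method (the `AspScraper` class name, the `COOKIE_FILE` constant and the log path are my own choices; the complete version is in the repo):

```php
require_once 'simple_html_dom.php';

define('COOKIE_FILE', __DIR__ . '/cookies.txt');

class AspScraper
{
    public $lastUrl = '';
    public $dom = null;

    public function download($url, $postParams = '')
    {
        // log of all curl sessions, for debugging
        $f = fopen(__DIR__ . '/request.txt', 'a');

        $ch = curl_init($url);
        curl_setopt_array($ch, array(
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_FOLLOWLOCATION => true,        // follow redirects
            CURLOPT_VERBOSE        => true,        // write the session log...
            CURLOPT_STDERR         => $f,          // ...to request.txt
            CURLOPT_COOKIEFILE     => COOKIE_FILE, // read cookies from the file
            CURLOPT_COOKIEJAR      => COOKIE_FILE, // and save new ones to it
            CURLOPT_SSL_VERIFYPEER => false,       // don't verify SSL certificates
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_REFERER        => $this->lastUrl,
        ));

        if ($postParams !== '') {
            curl_setopt($ch, CURLOPT_POST, true);
            curl_setopt($ch, CURLOPT_POSTFIELDS, $postParams);
        }

        $html = curl_exec($ch);

        // final url after all redirects; we post to it and send it
        // as the referer of the next request
        $this->lastUrl = curl_getinfo($ch, CURLINFO_EFFECTIVE_URL);

        curl_close($ch);
        fclose($f);

        $this->dom = str_get_html($html);
        return $this->dom;
    }
}
```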
I’ll explain a little bit what’s going on:
1. Log curl sessions to request.txt, so we can debug later if something goes wrong: `CURLOPT_VERBOSE => true, CURLOPT_STDERR => $f`.
2. Tell Curl to use cookies: `CURLOPT_COOKIEFILE => COOKIE_FILE, CURLOPT_COOKIEJAR => COOKIE_FILE`.
3. Don’t verify SSL certificates: `CURLOPT_SSL_VERIFYPEER => false, CURLOPT_SSL_VERIFYHOST => false`. This allows us to scrape SSL websites; otherwise Curl returns an error when it can’t validate the SSL certificate.
4. Get the final url after all redirects and save it to `$this->lastUrl`, then use it as the referer header.
5. Return the simplehtmldom object and also save it to the `$this->dom` field.

Emulating ASP post requests

Next we need to create post parameters for Curl. They should be in the form of `"postvar1=value1&postvar2=value2&postvar3=value3"`.
You’ll find more examples here: http://curl.haxx.se/libcurl/php/examples/simplepost.html
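A sketch of this helper, continuing the `AspScraper` class from above (the `$values` and `$exclude` parameter names are my own):

```php
// $dom     - simplehtmldom object of the page with the form
// $values  - values to set or override, keyed by element name
// $exclude - element names that must not be sent with this request
public function createPostParams($dom, $values = array(), $exclude = array())
{
    $params = array();

    // everything a browser would submit: inputs, selects, textareas
    foreach ($dom->find('input, select, textarea') as $element) {
        $name = $element->name;
        if (!$name || in_array($name, $exclude)) {
            continue; // unnamed or explicitly excluded element
        }

        // a caller-supplied value wins over the element's current value
        if (array_key_exists($name, $values)) {
            $value = $values[$name];
            unset($values[$name]); // mark as used
        } else {
            $value = $element->value;
        }

        $params[] = rawurlencode($name) . '=' . rawurlencode((string) $value);
    }

    // values whose fields weren't found on the page (e.g. __EVENTTARGET,
    // which javascript sometimes creates on the fly) are appended as-is
    foreach ($values as $name => $value) {
        $params[] = rawurlencode($name) . '=' . rawurlencode((string) $value);
    }

    return implode('&', $params);
}
```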
In order to make a correct post request we get all inputs, selects and textareas from the provided simplehtmldom object, then loop through these elements.
If an element is not in the excluded list (which we set beforehand), we check if we have this element in the array of values passed by the caller: if we do, we use the caller’s value, otherwise we keep the element’s current value.
Then we rawurlencode each name and value and build a string from the resulting array, concatenated with ‘&’. This is exactly the “postvar1=value1&postvar2=value2” format Curl expects.
We can use this function like this to create parameters for a lastname search:
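Assuming the form page has already been downloaded, the call looks something like this:

```php
// $scraper->dom holds the page with the search form (loaded earlier);
// the input name comes from the live form, the value is what we search for
$params = $scraper->createPostParams($scraper->dom, array(
    'ctl00$ContentPlaceHolder1$txtLastName' => 'Smith',
));
```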
Here “ctl00$ContentPlaceHolder1$txtLastName” is the name of the lastname input field. You can get it by inspecting the lastname element in Chrome Developer Tools.
Important notice
Since we put all found inputs into the request, it’s necessary to create an exclusion list for each request: elements such as other submit buttons must be left out, otherwise the server may treat them as clicked too.
Final scraper code
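Putting the pieces together, the class skeleton looks like this (the `doPostBack` helper is my own convenience wrapper; the unabridged class is in the Github repo):

```php
class AspScraper
{
    public $lastUrl = ''; // final url after redirects, used as referer
    public $dom = null;   // simplehtmldom object of the last page

    // fetches $url (as a POST if $postParams is given); body shown
    // in "Download and parse page" above
    public function download($url, $postParams = '') { /* ... */ }

    // builds the "name=value&..." post string; body shown in
    // "Emulating ASP post requests" above
    public function createPostParams($dom, $values = array(), $exclude = array()) { /* ... */ }

    // emulates javascript:__doPostBack(target, argument) on the last page
    public function doPostBack($target, $argument, $exclude = array())
    {
        $params = $this->createPostParams($this->dom, array(
            '__EVENTTARGET'   => $target,
            '__EVENTARGUMENT' => $argument,
        ), $exclude);

        // post back to the page we were last redirected to
        return $this->download($this->lastUrl, $params);
    }
}
```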
Example usage code
For example, let’s scrape the Unclaimed Property Search page: we’ll programmatically search for the last name ‘Smith’ and print 4 result pages.
[Screenshot: Unclaimed property search]
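A sketch of the whole run (the page-size and search control names, and the results-grid selector, are assumptions to verify in Developer Tools):

```php
$scraper = new AspScraper();

// 1. load the search form; this sets the ASP session cookies
$scraper->download('https://ucpi.sco.ca.gov/ucp/Default.aspx');

// 2. submit the last-name search, leaving out the "clear" button
$params = $scraper->createPostParams($scraper->dom, array(
    'ctl00$ContentPlaceHolder1$txtLastName' => 'Smith',
), array(
    'ctl00$ContentPlaceHolder1$btnClear',
));
$dom = $scraper->download($scraper->lastUrl, $params);

// elements we must not re-send when paging ("page size" and "search"
// names are assumed; check the real ones in Developer Tools)
$exclude = array(
    'ctl00$ContentPlaceHolder1$btnClear',    // "clear"
    'ctl00$ContentPlaceHolder1$ddlPageSize', // "page size" (assumed name)
    'ctl00$ContentPlaceHolder1$btnSearch',   // "search" (assumed name)
);

for ($page = 1; $page <= 4; $page++) {
    echo "--- page $page ---\n";

    // the results-grid id is assumed; inspect the page for the real one
    foreach ($dom->find('table[id$=gvResults] tr') as $row) {
        echo trim($row->plaintext) . "\n";
    }

    if ($page < 4) {
        // 3. "click" the pager link: __doPostBack('...gvResults', 'Page$N')
        $dom = $scraper->doPostBack(
            'ctl00$ContentPlaceHolder1$gvResults',
            'Page$' . ($page + 1),
            $exclude
        );
    }
}
```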
Note that we exclude the “clear” button (“ctl00$ContentPlaceHolder1$btnClear”) from the initial request, and the “clear”, “page size” and “search” elements from the next-page requests.
The output will be the scraped rows printed for each of the 4 result pages.
That’s it! I’ve struggled with ASP scraping for some time and I hope this article was useful to you.
As usual, you can find the final source code in the Github repo.