Fast scraping with ReactPHP + Curl + Proxies
This post shows how to use ReactPHP, Curl and proxy servers together to speed up the scraping process and make it asynchronous (non-blocking).
Web scraping with PHP and Curl is simple and effective. With a powerful tool like SimpleHtmlDom available, it’s possible to scrape pretty much any website, even one with a complicated login process and AJAX content.
Scraping works great, but speed can become an issue. Suppose you have 1 million pages to crawl, each taking 1 second to load and parse. Sequentially, that is about 11 days of processing. If you can load 4 pages at a time, you will need about 3 days to finish all of them.
The easiest way to speed everything up is to run the same script 4 times and use centralized data storage with the ability to atomically pop one URL at a time (a Redis list, for example). If you have a predefined list of URLs to scrape, this approach works very well.
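The atomic-pop pattern might look like the sketch below, using the phpredis extension. The queue key name and worker logic are illustrative assumptions, not code from this post:

```php
<?php
// Sketch: several copies of this worker script run in parallel.
// Each one atomically pops one URL at a time from a shared Redis list.
// Assumes the phpredis extension; the key name "urls" is arbitrary.

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// Seed the queue once, e.g. from a separate script:
// $redis->rPush('urls', 'http://example.com/page1', 'http://example.com/page2');

while (($url = $redis->lPop('urls')) !== false) {
    // lPop is atomic, so no two workers ever receive the same URL.
    $html = file_get_contents($url); // or a Curl request
    // ... parse $html, optionally rPush newly discovered URLs ...
}
```

Because every worker goes through the same list, adding a fifth or tenth copy of the script requires no coordination beyond Redis itself.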
But what if you have to add URLs as you go? For example, you are scraping the main page and first-level subpages of the same website. With this approach, you can simply push newly discovered URLs into the storage to be scraped later.
Another option is to use the curl_multi_exec function and run several requests in parallel. It enables multiple simultaneous transfers and asynchronous, event-based handling.
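A minimal, self-contained sketch of that pattern (not the code from this post): fetch a batch of URLs in parallel and collect the responses.

```php
<?php
// Sketch: fetch several URLs in parallel with curl_multi_exec.

function fetchAll(array $urls)
{
    $multi   = curl_multi_init();
    $handles = [];

    foreach ($urls as $url) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_multi_add_handle($multi, $ch);
        $handles[$url] = $ch;
    }

    // Event loop: run until every transfer has completed.
    do {
        $status = curl_multi_exec($multi, $running);
        if ($running) {
            // Wait for activity instead of busy-looping.
            if (curl_multi_select($multi) === -1) {
                usleep(1000);
            }
        }
    } while ($running && $status === CURLM_OK);

    $results = [];
    foreach ($handles as $url => $ch) {
        $results[$url] = curl_multi_getcontent($ch);
        curl_multi_remove_handle($multi, $ch);
        curl_close($ch);
    }
    curl_multi_close($multi);

    return $results;
}
```

The downside, as the rest of this post discusses, is that you still have to manage the queue and the loop yourself if you want to add URLs while transfers are in flight.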
Given the docs and our requirements, you will need a queue of URLs to scrape, the ability to add new URLs as needed, and an event loop that checks whether any request has completed. You can manage the event loop yourself or use a library.
You have probably heard of ReactPHP, which bills itself as “Event-driven, non-blocking I/O”. PHP isn’t non-blocking by default; ReactPHP adds that ability with a number of classes and functions.
It turns out somebody has already built this. Check out this awesome library: reactphp-curl. It does exactly what we need: it runs curl_multi_exec, manages the URL queue, and employs the ReactPHP event loop and promises. You can even set different options, such as a proxy, per request.
A simple example looks like this:
Basically, this fetches a URL and runs a callback function on the result.
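In sketch form, based on the calls described later in this post (curl->get() returning a promise), it would look roughly like this. The class names below are assumptions; check the reactphp-curl README for the exact namespace your version uses:

```php
<?php
require 'vendor/autoload.php';

// Hedged sketch: the Curl class name/namespace is an assumption.
$loop = React\EventLoop\Factory::create();
$curl = new KHR\React\Curl\Curl($loop); // assumed constructor

// get() returns a ReactPHP promise; the callback fires when the
// transfer completes.
$curl->get('http://example.com')->then(function ($result) {
    echo $result . PHP_EOL;
});

$loop->run();
```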
Now let’s get our hands dirty with some code and make use of all this. You will need at least PHP 5.5.
Our task is as follows: fetch the first page of https://news.ycombinator.com/, load all topic pages in parallel, and for each topic collect the list of users who commented and the number of comments made by each user. Each request will use a different proxy.
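The per-user counting step can be sketched on its own. The snippet below uses a plain regex instead of SimpleHtmlDom so it is self-contained; the "hnuser" class is how Hacker News marks commenter links in its markup (an observation about the site, not code from this post):

```php
<?php
// Sketch of the parsing step only: count comments per user in a topic
// page's HTML. The "hnuser" class on commenter links is an assumption
// about Hacker News markup.

function countCommentsByUser($html)
{
    $counts = [];
    if (preg_match_all('/class="hnuser"[^>]*>([^<]+)</', $html, $m)) {
        foreach ($m[1] as $user) {
            $counts[$user] = (isset($counts[$user]) ? $counts[$user] : 0) + 1;
        }
    }
    return $counts;
}
```

In the real crawler this would run inside the topic-page callback, merging each page’s counts into a shared result array.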
First, let’s create composer.json:
Save it and run composer install.
This will install reactphp-curl and simplehtmldom. Then create crawler.php with the following code:
- Loads 10 proxy servers from GimmeProxy.com in the loadProxies function
- Creates the event loop and the reactphp-curl object
- Sets default options for all requests
- Requests icanhazip.com to check that a random proxy works (only once, just to show how it’s done)
- Requests the main page of news.ycombinator.com and adds topic pages to the queue on the fly
- Requests the topic pages and parses them
- Prints the results when everything is done
- Makes every request through a new random proxy
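The overall shape of crawler.php is sketched below. This is a structural outline only, with assumed class and property names; the real code lives in the gist linked at the end of the post:

```php
<?php
// Structural sketch of the steps above; names are assumptions.

class HackerNewsCrawler
{
    private $curl;           // reactphp-curl instance
    private $proxies = [];   // list of "ip:port" strings
    public $commentCounts = [];

    // Returns Curl options routing one request through a random proxy.
    private function randomProxyOptions()
    {
        $proxy = $this->proxies[array_rand($this->proxies)];
        return [CURLOPT_PROXY => $proxy];
    }

    // Callbacks passed to the promise API must be public; otherwise
    // $this->curl cannot invoke them (see the note below).
    public function parseMainPage($result)
    {
        // ...extract topic URLs, then queue each one on the fly:
        // $this->curl->get($url, $this->randomProxyOptions())
        //            ->then([$this, 'parseTopicPage']);
    }

    public function parseTopicPage($result)
    {
        // ...count comments per user into $this->commentCounts...
    }
}
```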
To get a list of working proxy servers we use GimmeProxy.com; you can use your own proxies instead, just modify the loadProxies() function.
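A loadProxies() along these lines would do the job. The endpoint and the "curl" field of the JSON response are assumptions based on GimmeProxy’s public API; adjust them for your own proxy source:

```php
<?php
// Sketch: fetch N proxies from GimmeProxy, one API call per proxy.
// The endpoint and the "curl" JSON field are assumptions; check the
// GimmeProxy docs, or swap in your own proxy list here.

function loadProxies($count = 10)
{
    $proxies = [];
    for ($i = 0; $i < $count; $i++) {
        $json  = file_get_contents('https://gimmeproxy.com/api/getProxy');
        $proxy = json_decode($json, true);
        if (!empty($proxy['curl'])) {      // e.g. "http://1.2.3.4:8080"
            $proxies[] = $proxy['curl'];
        }
    }
    return $proxies;
}
```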
The proxy is set in the second parameter (an array of options) of curl->get(url, [additional parameters, like CURLOPT_XXX => yyy, CURLOPT_ZZZ => nnn]). You can add any other Curl parameters there.
Note that in order to pass parseMainPage() and parseTopicPage() as callbacks, we need to make them public; otherwise $this->curl won’t be able to call them. That may seem obvious, but it’s an easy mistake to make.
If you get errors like “PHP Catchable fatal error: Argument 1 passed to React\Promise\Promise::then() must be callable, array given”, check whether your methods are private or protected.
Or you can do it like this:
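One such alternative is to wrap the call in a closure rather than passing the [$this, 'method'] array. This is a fragment from inside a class method (the surrounding crawler code and $options are assumed); since PHP 5.4, closures defined in a method bind $this automatically, so even a private method would be reachable:

```php
// Instead of ->then([$this, 'parseTopicPage']), forward via a closure.
// $this inside the closure is the crawler object itself.
$this->curl->get($url, $options)->then(function ($result) {
    $this->parseTopicPage($result);
});
```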
Note how we handle errors: we pass the error handler as the second argument of the then() function; it receives an Exception object with the result injected into it:
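In sketch form (a fragment from inside a class method; the exact exception type the library throws is an assumption based on the description above):

```php
// The second argument to then() is the rejection handler. Per the text
// above, the library rejects with an Exception carrying the result.
$this->curl->get($url, $options)->then(
    function ($result) {
        // success: parse $result here
    },
    function (Exception $e) {
        // failure: log it, and optionally re-queue the URL
        echo 'Request failed: ' . $e->getMessage() . PHP_EOL;
    }
);
```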
The final version of the code is here: https://gist.github.com/256cats/7a704640f33965a7eb92