This post shows how to use ReactPHP, Curl and proxy servers together to speed up the scraping process and make it asynchronous (non-blocking).

Web scraping with PHP and Curl is simple and effective. With such a powerful tool as SimpleHtmlDom available, it’s possible to scrape pretty much any website, even one with a complicated login process and ajax content.

Scraping works great, but speed can become an issue. Suppose you have 1 million pages to crawl, each taking 1 second to load and parse: that’s 1,000,000 seconds, or about 11.5 days, to process them all sequentially. If you can load 4 pages at a time, you will need about 3 days to finish all pages.

The easiest way to speed everything up would be to run the same script 4 times and use some centralized data storage with the ability to atomically fetch one url at a time (a Redis list, for example). If you have a predefined list of urls to scrape, this approach works very well.
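
For illustration, here is a minimal worker sketch, assuming the phpredis extension and a Redis list named urls pre-filled with the pages to scrape (the list name and the parsing step are placeholders):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

// lPop is atomic: run this script 4 times and no two
// workers will ever receive the same url
while (($url = $redis->lPop('urls')) !== false) {
    $html = file_get_contents($url); // or a curl request
    // ... parse $html and store the results ...
}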

But what if you have to add urls as you go? For example, you are scraping the main page and the 1st-level subpages of the same website. With the first approach, you would just push new urls to the storage as you discover them (with the Redis sketch above, a single $redis->rPush('urls', $href) inside the parser) to be scraped later.

Another option is to use the curl_multi_exec function and run several requests in parallel. It enables multiple simultaneous transfers and asynchronous, event-based handling.

Given the docs and our requirements, you will need a queue of urls to scrape, the ability to add new urls as needed, and an event loop to check whether any request has completed.

You have to either manage the event loop yourself (a bare-bones sketch of what that looks like is below) or use a library.
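
Here is that sketch (the urls are placeholders): a set of handles driven by curl_multi_exec until all transfers finish. A real crawler would also refill the queue inside this loop.

$urls = ['https://example.com/a', 'https://example.com/b'];

$mh = curl_multi_init();
$handles = [];
foreach ($urls as $url) {
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($mh, $ch);
    $handles[] = $ch;
}

// the "event loop": drive the transfers, then wait for socket activity
do {
    curl_multi_exec($mh, $running);
    if ($running > 0) {
        curl_multi_select($mh, 1.0);
    }
} while ($running > 0);

foreach ($handles as $ch) {
    echo curl_getinfo($ch, CURLINFO_EFFECTIVE_URL), ': ', strlen(curl_multi_getcontent($ch)), " bytes\n";
    curl_multi_remove_handle($mh, $ch);
    curl_close($ch);
}
curl_multi_close($mh);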

You have probably heard about ReactPHP, which is “Event-driven, non-blocking I/O”. PHP isn’t non-blocking by default; React adds that ability with a number of classes and functions.
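
To get a feel for it, here is a minimal event-loop sketch with two timers; the loop dispatches both without one blocking the other:

require_once __DIR__ . '/vendor/autoload.php';

$loop = React\EventLoop\Factory::create();

$tick = $loop->addPeriodicTimer(0.5, function() {
    echo "tick\n";
});

$loop->addTimer(2.0, function() use ($loop, $tick) {
    $loop->cancelTimer($tick); // stop ticking so the loop can exit
    echo "done after 2 seconds\n";
});

$loop->run(); // blocks here, dispatching events until none are left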

So it should be possible to combine ReactPHP with Curl and make simultaneous, JavaScript-style requests with callbacks.

Turns out, somebody already did it. Check out this awesome library: reactphp-curl. It does exactly what we need: it runs curl_multi_exec, manages the url queue, and employs the ReactPHP event loop and promises. You can even set different options, like a proxy, for each request.

A simple example looks like this:

require_once __DIR__ . '/vendor/autoload.php';

use \React\EventLoop\Factory;
use \KHR\React\Curl\Curl;
use \KHR\React\Curl\Exception;

$loop = Factory::create();
$curl = new Curl($loop);

$curl->client->setMaxRequest(3);
$curl->client->setSleep(6, 1.0, false); // max 6 requests in 1 second
$curl->client->setCurlOption([CURLOPT_AUTOREFERER => true, CURLOPT_COOKIE => 'fruit=apple; colour=red']); // default options

$curl->get('https://graph.facebook.com/http://www.yandex.ru')->then(function($result) {
    echo (string) $result, PHP_EOL; // echo $result->body; OR echo $result->getBody();
});

$curl->run();
$loop->run();

Basically, this fetches a url and runs a callback on the result.

Code

Now let’s get our hands dirty with some code and make use of this stuff. You will need at least PHP 5.5.

Our task is as follows: get the first page of https://news.ycombinator.com/, load all topic pages in parallel, and for each topic collect the list of users who commented on it and the number of comments made by each user. Each request will go through a different proxy.

First, let’s create composer.json:

{
    "require": {
        "khr/react-curl": "~2.0",
        "sunra/php-simple-html-dom-parser": "~1.5"
    }
}

Save it and run

composer install

This will install reactphp-curl and simplehtmldom. Then create crawler.php with the following code:

require_once __DIR__ . '/vendor/autoload.php';

use \React\EventLoop\Factory;
use KHR\React\Curl\Curl;
use KHR\React\Curl\Result;
use KHR\React\Curl\Exception;
use Sunra\PhpSimple\HtmlDomParser;

date_default_timezone_set('Asia/Jakarta'); // change to your timezone
mb_internal_encoding("UTF-8");

class Crawler {

    private $proxies = [], $topics = [], $curl = null;

    /** Load proxies from gimmeproxy.com **/
    private function loadProxies($num) {
        echo "Load {$num} proxies\n";
        for($i = 0; $i < $num; $i++) {
            $data = json_decode(file_get_contents('http://gimmeproxy.com/api/getProxy'), true);
            $this->proxies[] = $data['curl'];
        }
    }

    /** Get random proxy option **/
    private function getProxyOption() {
        $key = array_rand($this->proxies);
        return [CURLOPT_PROXY => $this->proxies[$key]];
        //return []; // uncomment to work without proxies
    }

    /** Get url from result **/
    private function resultGetUrl(Result $result) {
        return $result->getOptions()[CURLOPT_URL];
    }

    /** Parse main page **/
    public function parseMainPage($result) {
        echo "Loaded ".$this->resultGetUrl($result)."\n";
        $links = [];
        $dom = HtmlDomParser::str_get_html($result);
        foreach($dom->find('a[href^=item]') as $a) { // look for links starting with "item"
            $href = $a->href;
            if(!isset($links[$href])) { // if link is not already visited, crawl it
                echo "Get {$href}\n";
                $links[$href] = 1;
                // Get topic page
                $this->curl->get('https://news.ycombinator.com/'.$href, $this->getProxyOption())->then(
                    array($this, 'parseTopicPage'), // promise resolved, parse topic page
                    function($exception) { // promise rejected, i.e. some error occurred
                        echo "Error loading url ".$this->resultGetUrl($exception->result).": ".$exception->getMessage()."\n";
                    }
                );
            }
        }
        $dom->clear();
    }

    /** Parse topic page **/
    public function parseTopicPage($result) {
        echo "Successfully loaded ".$this->resultGetUrl($result)."\n";
        $topic = ['title' => '', 'users' => []];
        $dom = HtmlDomParser::str_get_html($result);
        $topic['title'] = $dom->find('.title a', 0)->innertext;
        foreach($dom->find('a[href^=user]') as $a) {
            $username = trim($a->innertext);
            $topic['users'][$username] = isset($topic['users'][$username]) ? $topic['users'][$username] + 1 : 1;
        }
        $this->topics[] = $topic;
        $dom->clear();
    }

    /** Run crawler **/
    public function run() {
        $this->loadProxies(10); // load 10 proxies from GimmeProxy.com

        $loop = Factory::create();
        $this->curl = new Curl($loop);
        $this->curl->client->setMaxRequest(5); // number of parallel requests
        $this->curl->client->setSleep(2, 1.0, false); // make maximum 2 requests in 1 second
        $this->curl->client->setCurlOption([
            CURLOPT_AUTOREFERER => true,
            CURLOPT_USERAGENT => 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            CURLOPT_CONNECTTIMEOUT => 10,
            CURLOPT_TIMEOUT => 10,
            CURLOPT_SSL_VERIFYPEER => false,
            CURLOPT_SSL_VERIFYHOST => false,
            CURLOPT_FOLLOWLOCATION => true,
            CURLOPT_MAXREDIRS => 9,
            CURLOPT_RETURNTRANSFER => true,
            CURLOPT_HEADER => 0,
        ]);

        // check that a random proxy server is working
        $this->curl->get('http://icanhazip.com/', $this->getProxyOption())->then(function($result) {
            echo $result."\n";
        });

        $this->curl->get('https://news.ycombinator.com/', $this->getProxyOption())->then(
            array($this, 'parseMainPage') // call $this->parseMainPage
        );

        $this->curl->run();
        $loop->run();

        print_r($this->topics);
    }
}

$crawler = new Crawler();
$crawler->run();

What this code does

  • Loads 10 proxy servers from GimmeProxy.com in the loadProxies() function
  • Creates the event loop and the reactphp-curl Curl object
  • Sets default options for all requests
  • Requests icanhazip.com to check that a random proxy works (only once, just to show how it’s done)
  • Requests the main page of news.ycombinator.com and adds topic pages to the queue on the fly
  • Requests topic pages and parses them
  • Prints the results when everything is done
  • Makes every request through a new random proxy

Notes

To get a list of working proxy servers we use GimmeProxy.com; you can use your own proxies instead, just modify the loadProxies() function.
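
For example, a hypothetical drop-in replacement that reads proxies from a local proxies.txt file (one host:port per line; the filename is made up) could look like this:

/** Load proxies from a local file instead of gimmeproxy.com **/
private function loadProxies($num) {
    $lines = array_filter(array_map('trim', file('proxies.txt')));
    $this->proxies = array_slice($lines, 0, $num);
}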

The proxy is set via the second parameter (an array of options) of curl->get(url, [additional parameters, like CURLOPT_XXX => yyy, CURLOPT_ZZZ => nnn]). You can add any other Curl options there as well.
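
For instance, a single request could be sent through a specific proxy with a custom referer (the option values here are just placeholders):

$this->curl->get('https://news.ycombinator.com/', [
    CURLOPT_PROXY => '127.0.0.1:8080',
    CURLOPT_REFERER => 'https://www.google.com/',
])->then(function($result) {
    echo (string) $result, PHP_EOL;
});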

Note that in order to call parseMainPage() and parseTopicPage() as callbacks, we need to make them public; otherwise the promise won’t be able to invoke them. That seems pretty obvious, but it’s worth mentioning.

If you get an error like “PHP Catchable fatal error: Argument 1 passed to React\Promise\Promise::then() must be callable, array given”, check whether your methods are private or protected.

Alternatively, you can keep them private and wrap the call in a closure:

$this->curl->get('https://news.ycombinator.com/', $this->getProxyOption())->then(
    function($result) {
        $this->parseMainPage($result);
    }
);

Note how we handle errors: we pass the error handler as the second argument of the then() function; it receives an Exception object with the Result injected:

function($exception) { // promise rejected, i.e. some error occurred
    echo "Error loading url ".$this->resultGetUrl($exception->result).": ".$exception->getMessage()."\n";
}

The final version of the code is here: https://gist.github.com/256cats/7a704640f33965a7eb92