Recently I needed to automate reverse image search for my client’s blog, so I ended up writing a simple PHP web scraper for Google Images. In this article I’ll show you how I did it.

We will use cURL and the Simple Html Dom library.

So here is our task: for a given image URL, get its name (the best guess) and the websites that feature this image from the Google Image Search pages.

For scraping I usually use the Simple Html Dom library; you can get it here: http://simplehtmldom.sourceforge.net/. In case you don’t know, Simple Html Dom provides jQuery-like CSS selector functionality in PHP.
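
If you have never used it, here is a tiny illustration of those selectors on a made-up HTML snippet:

``` php
include_once __DIR__.'/simple_html_dom.php';

// str_get_html() builds a DOM from a string; find() takes a CSS selector
$dom = str_get_html('<div class="card"><a href="/wiki/Turkish_Van">Turkish Van</a></div>');
$a = $dom->find('div.card a', 0); // second argument 0 = first match instead of an array
echo $a->plaintext, "\n"; // Turkish Van
echo $a->href, "\n";      // /wiki/Turkish_Van
$dom->clear();            // free memory when you are done
```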

I also use Google Chrome to analyze webpages for patterns, but you can use Firefox or Internet Explorer if you wish.

The image for this example was taken from Wikipedia: http://upload.wikimedia.org/wikipedia/commons/2/22/Turkish_Van_Cat.jpg

Example image

Download and parse a webpage

First, we need a utility function that downloads a web page and creates a simple_html_dom object for it. We set some cURL options, e.g. to follow redirects and to use cookies, then we fetch the webpage and build the simple_html_dom object from it.

``` php
include_once __DIR__.'/simple_html_dom.php';

define('COOKIE_FILE', __DIR__.'/cookie.txt');
@unlink(COOKIE_FILE); // clear cookies before we start

/** Get a simple_html_dom object from a URL
 * @param $url
 * @return bool|simple_html_dom
 */
public function getSimpleHtmlDom($url) {
    $curl = curl_init();
    curl_setopt_array($curl, array(
        CURLOPT_URL => $url,
        CURLOPT_SSL_VERIFYPEER => false,
        CURLOPT_FOLLOWLOCATION => true,
        CURLOPT_MAXREDIRS => 9,
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_HEADER => 0,
        CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
        CURLOPT_COOKIEFILE => COOKIE_FILE,
        CURLOPT_COOKIEJAR => COOKIE_FILE
    ));
    $dom = str_get_html(curl_exec($curl)); // parse the downloaded HTML
    curl_close($curl);
    return $dom;
}
```
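
As a quick sanity check, you can fetch a page and make sure parsing worked. Here $scraper (an instance of the scraper class these methods live in) and the searchbyimage URL format are my assumptions; the real setup is in the repo:

``` php
$imageUrl = 'http://upload.wikimedia.org/wikipedia/commons/2/22/Turkish_Van_Cat.jpg';
$dom = $scraper->getSimpleHtmlDom('https://www.google.com/searchbyimage?image_url='.urlencode($imageUrl));
if (!$dom) {
    die('Could not download or parse the page');
}
echo $dom->find('title', 0)->plaintext, "\n"; // title of the results page
$dom->clear();
```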

Get best guess

To get the best guess, we need to find a pattern for the DOM element that contains it. Let’s right-click on “Best guess for this image: turkish van cat” and inspect that element.

We see several divs and links. I decided that the easiest way to get the best guess would be the following:

``` php
/** Get the best guess text
 * @param simple_html_dom $dom
 * @return array|bool
 */
public function getBestGuess(simple_html_dom $dom) {
    foreach ($dom->find('div[class=card-section] div') as $div) {
        if (stripos($div->innertext, 'Best guess for this image') !== false) {
            $a = $div->find('a', 0); // first link inside the div
            return array($a->innertext, $this->googleDomain.$a->href);
        }
    }
    return false;
}
```

Let’s break this down. We loop through each div inside the div with class “card-section”. If one of them contains the text “Best guess for this image”, we get its first link and return that link’s text and href attribute.
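
For our example image the method returns the guess text and a link to a regular Google search for it, so the caller can do something like this (values illustrative; $scraper and $dom as above):

``` php
$guess = $scraper->getBestGuess($dom);
if ($guess !== false) {
    list($text, $url) = $guess;
    echo $text, "\n"; // e.g. "turkish van cat"
    echo $url, "\n";  // e.g. "https://www.google.com/search?q=turkish+van+cat"
}
```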

Get image search results

To get the image search results, let’s again right-click on one of them and inspect the element.

We see several divs with classes “srg” and “rc”, and our links are inside an h3 tag with class “r”.

``` php
/** Get search results from the current page
 * @param simple_html_dom $dom
 * @return array
 */
public function getSearchResults(simple_html_dom $dom) {
    $result = array();
    // on the first results page there are 2 'srg' divs, and the first one
    // contains some irrelevant links, so we skip it
    $c = count($dom->find('div.srg')) > 1 ? 1 : 0;
    $d = $dom->find('div.srg', $c); // second div on the 1st page, first div otherwise
    foreach ($d->find('div.rc h3.r') as $h3) {
        foreach ($h3->find('a') as $a) { // get links
            $result[] = array(htmlspecialchars_decode($a->plaintext, ENT_QUOTES), $a->href);
        }
    }
    return $result;
}
```

Here I look for the div with the “srg” class and go through all the divs with class “rc” inside it. Then I get the h3 tag, and the link inside it is my search result.
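
Each entry of the returned array is a (title, url) pair, so consuming the results might look like this ($scraper and $dom are again stand-ins):

``` php
foreach ($scraper->getSearchResults($dom) as $item) {
    list($title, $url) = $item;
    echo $title, ' => ', $url, "\n";
}
```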

You’ve probably noticed this part of the code:

``` php
// on the first results page there are 2 'srg' divs, and the first one
// contains some irrelevant links, so we skip it
$c = count($dom->find('div.srg')) > 1 ? 1 : 0;
$d = $dom->find('div.srg', $c); // second div on the 1st page, first div otherwise
```

When we scrape the first page, it has 2 divs with the “srg” class. We don’t need the first one, which is why I added this check. Then I get only the second div (`$dom->find('div.srg', $c)`).
Browse through pages

The final part is pretty simple: we browse through the pages and collect all the search results.
``` php
/** Get the best guess text and loop through pages to get links to websites with the image
 * @param $imageUrl
 * @param int $numPages - number of pages to scrape
 * @return array(
 *     'best_guess' => array(text, url),
 *     'search_results' => array(
 *         array(name, url),
 *         array(name, url),
 *         ...
 *     )
 * )
 */
public function search($imageUrl, $numPages = 1) {
    try {
        $dom = $this->getSimpleHtmlDom($this->searchUrl.$imageUrl); // get first page dom
        $bestGuess = $this->getBestGuess($dom); // get best guess from 1st page
        $searchResults = $this->getSearchResults($dom); // get search results from 1st page
        $nextPageA = $dom->find('#nav a.pn', 0); // check if we have a 'next page' link (if we don't, it's the only page)
        $dom->clear();
        for ($i = 1; $i < $numPages && $nextPageA; $i++) { // loop through pages [2 - $numPages]
            $dom = $this->getSimpleHtmlDom($this->searchUrl.$imageUrl.'&start='.($i * 10));
            $searchResults = array_merge($searchResults, $this->getSearchResults($dom)); // merge with the results we already have
            $nextPageA = $dom->find('#nav a.pn', 0); // check if we have a 'next page' link (if we don't, it's the last page)
            $dom->clear();
            sleep(1); // pause between requests so we don't hammer Google
        }
        return array('best_guess' => $bestGuess, 'search_results' => $searchResults);
    } catch (Exception $e) {
        echo 'Exception for url: ', $imageUrl, "\n", $e->getMessage(), "\n";
        return false;
    }
}
```
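
To tie everything together, here is a rough sketch of the wrapper class and a call to it. The property values below are my assumptions; the real ones are in the GitHub repo:

``` php
class GoogleImageScraper {
    public $searchUrl = 'https://www.google.com/searchbyimage?image_url='; // assumption
    public $googleDomain = 'https://www.google.com'; // assumption
    // getSimpleHtmlDom(), getBestGuess(), getSearchResults() and search() from above go here
}

$scraper = new GoogleImageScraper();
$result = $scraper->search('http://upload.wikimedia.org/wikipedia/commons/2/22/Turkish_Van_Cat.jpg', 2);
if ($result !== false) {
    echo 'Best guess: ', $result['best_guess'][0], "\n"; // assumes a best guess was found
    foreach ($result['search_results'] as $r) {
        echo $r[0], ' => ', $r[1], "\n";
    }
}
```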

Conclusion

That’s pretty much it. Have a look at the GitHub repo; the final source code is there.

One thing to mention: if you run this scraper from your web server several times, Google might decide that it’s a bot (and it is!). I didn’t notice any problems while running the scraper from my home machine, though.

If you really need bulletproof scraping, you might use proxies or Tor, but that’s a topic for another article.

Update

I was asked about using local images with this script. That’s possible. According to this answer, you need to POST your image with some parameters to https://www.google.co.in/searchbyimage/upload, which will redirect you to the page with the results.

To check that, I modified the getDom function to accept POST parameters; I took the code from my “Scraping asp websites” post.

To upload a file with cURL, simply set a POST parameter to @filename, like this:

``` php
'encoded_image' => '@'.realpath($fileName),
```
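
Put together, the whole upload request might look roughly like this; it is a sketch for PHP < 5.5, and the 'encoded_image' field name comes from the answer mentioned above:

``` php
// A minimal sketch (PHP < 5.5): POST the local file and follow the redirect
$post = array('encoded_image' => '@'.realpath($fileName)); // '@' marks a file upload in old cURL
$curl = curl_init();
curl_setopt_array($curl, array(
    CURLOPT_URL => 'https://www.google.co.in/searchbyimage/upload',
    CURLOPT_POST => true,
    CURLOPT_POSTFIELDS => $post,
    CURLOPT_FOLLOWLOCATION => true, // Google redirects to the results page
    CURLOPT_RETURNTRANSFER => true,
));
$dom = str_get_html(curl_exec($curl)); // parse the results page as before
curl_close($curl);
```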

If you are using PHP >= 5.5, a file upload like this (with ‘@’) probably won’t work, since PHP 5.5 added the CURLOPT_SAFE_UPLOAD option and the CURLFile class, which you should use instead of the ‘@’.filename trick.
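
On PHP >= 5.5, the equivalent upload would be something like this:

``` php
// CURLFile replaces the '@' syntax starting with PHP 5.5
$post = array('encoded_image' => new CURLFile(realpath($fileName)));
curl_setopt($curl, CURLOPT_POSTFIELDS, $post);
```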

Check this for more info on this topic. My example is built for PHP < 5.5.

I updated the GitHub repo, check it out :) Hope this helps.

Update 2 - scraping the original image URL

To scrape the original image URL as well, let’s check the HTML code of the search page:

Scraping the original image URL from the Google search page

To see the HTML code, uncomment `echo $dom; die();` in getSearchResults().
The data we are already scraping is marked with a green line, and the new data is marked in red. The 'div.th a' element contains the original image URL; we need to get it and parse the 'imgurl' part of its href attribute.

In order to do this, let’s modify the getSearchResults() function a little bit:
``` php
public function getSearchResults(simple_html_dom $dom) {
    $result = array();
    //echo $dom; die(); // uncomment this if you want to check the search page that Google sends us
    // on the first results page there are 2 'srg' divs, and the first one
    // contains some irrelevant links, so we skip it
    $c = count($dom->find('div.srg')) > 1 ? 1 : 0;
    $d = $dom->find('div.srg', $c); // second div on the 1st page, first div otherwise
    foreach ($d->find('div.rc') as $div) {
        $a = $div->find('h3.r a', 0); // get the link to the website
        // get the original image url
        $originalImg = $div->find('div.th a', 0);
        preg_match('/imgurl=(.+?)&/', $originalImg->href, $matches); // get the part with the original image url
        $result[] = array(htmlspecialchars_decode($a->plaintext, ENT_QUOTES), $a->href, $matches[1]);
    }
    /* // old code
    foreach ($d->find('div.rc h3.r') as $h3) {
        foreach ($h3->find('a') as $a) { // get links
            $r = array(htmlspecialchars_decode($a->plaintext, ENT_QUOTES), $a->href);
            $result[] = $r;
        }
    } */
    return $result;
}
```

Compare it with the old code: we loop through div.rc and get the link to the website as usual. Then we get 'div.th a' and parse the imgurl part with preg_match; $matches[1] contains the link to the original image.
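
To make the regex concrete, here is what it extracts from a made-up href of that shape:

``` php
// Hypothetical href of a 'div.th a' element, shaped like Google's /imgres links
$href = '/imgres?imgurl=http://example.com/cat.jpg&imgrefurl=http://example.com/&h=600&w=800';
preg_match('/imgurl=(.+?)&/', $href, $matches);
echo $matches[1]; // http://example.com/cat.jpg
```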

Note that in `find('h3.r a', 0)`, the zero (0) tells Simple Html Dom to return the first found element instead of an array with all the elements.

10/27/2015: The GitHub repo has been updated to work with Google’s search HTML updates (error on line 100).