Recently I needed to automate reverse image search for my client's blog, and I ended up writing a simple PHP web scraper for Google Images. In this article I'll show you how I did it.
We will use cURL and the Simple Html Dom library.
So here is our task: for a given image URL, get its name (the best guess) and the websites that use this image, from the Google Image Search pages.
For scraping I usually use the Simple Html Dom library; you can get it here: http://simplehtmldom.sourceforge.net/. In case you don't know, Simple Html Dom provides jQuery-like CSS selector functionality in PHP.
I also use Google Chrome to analyze webpages for patterns, but you can use Firefox or Internet Explorer if you wish.
First we need a utility function that downloads a web page and creates a simple_html_dom object for it. We set some cURL options, e.g. to follow redirects and to use cookies, then fetch the page and create the simple_html_dom object from the response.
include_once __DIR__.'/simple_html_dom.php';
define('COOKIE_FILE', __DIR__.'/cookie.txt');
@unlink(COOKIE_FILE); // clear cookies before we start
/** Get simple_html_dom object from url
 * @param $url
 * @return bool|simple_html_dom
 */
function getSimpleHtmlDom($url) {
$curl = curl_init();
curl_setopt_array($curl, array(
CURLOPT_URL => $url,
CURLOPT_SSL_VERIFYPEER => false,
CURLOPT_FOLLOWLOCATION => true,
CURLOPT_MAXREDIRS => 9,
CURLOPT_RETURNTRANSFER => 1,
CURLOPT_HEADER => 0,
CURLOPT_USERAGENT => "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36",
CURLOPT_COOKIEFILE => COOKIE_FILE,
CURLOPT_COOKIEJAR => COOKIE_FILE
));
$dom = str_get_html(curl_exec($curl));
curl_close($curl);
return $dom;
}
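With this helper in place, fetching a reverse image search page could look like the sketch below. The `searchbyimage` endpoint and the `image_url` parameter are assumptions based on Google's search-by-image URLs at the time of writing, and the image URL itself is made up; Google may change this at any time.

```php
<?php
// Hypothetical image we want to reverse-search
$imageUrl = 'http://example.com/cat.jpg';

// Build the search URL (endpoint and parameter name are assumptions)
$searchUrl = 'https://www.google.com/searchbyimage?image_url=' . urlencode($imageUrl);

// Fetch and parse the page with the helper defined above:
// $dom = getSimpleHtmlDom($searchUrl);
// if ($dom === false) { die('Could not fetch or parse the search page'); }
```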
Get best guess
To get the best guess we need to find a pattern for the DOM element that contains it. Let's right-click on the "Best guess for this image: turkish van cat" text and inspect that element.
We see several divs and links. I decided that the easiest way to get the best guess would be the following:
/** Get best guess text
 * @param simple_html_dom $dom
 * @return array|bool
 */
function getBestGuess(simple_html_dom $dom) {
    foreach ($dom->find('div[class=card-section] div') as $div) {
        if (stripos($div->innertext, 'Best guess for this image') !== false) {
            $a = $div->find('a', 0); // first link inside the matching div
            return array('text' => $a->innertext, 'href' => $a->href);
        }
    }
    return false;
}
Let's break this down. We get the div with class "card-section" and loop through each div inside it. If a div contains the text "Best guess for this image", we take its first link and return that link's text and href attributes.
Get image search results
To get the image search results, let's again right-click on one of them and inspect the element.
We see several divs with classes "srg" and "rc", and our links are inside an h3 tag with class "r".
That's pretty much it; have a look at the GitHub repo, the final source code is there.
One thing to mention: if you run this scraper from your web server several times, Google might decide that it's a bot (and it is!). I didn't notice any problems while running the scraper from my home machine, though.
If you really need bulletproof scraping, you might use proxies or Tor, but that's a topic for another article.
Update
I was asked about using local images with this script. That's possible: according to this answer, you need to POST your image with some parameters to https://www.google.co.in/searchbyimage/upload, which will redirect you to the results page.
To check that, I modified the getDom function to accept POST parameters. I took the code from my "Scraping asp websites" post.
To upload a file with cURL, simply set a POST parameter to '@' followed by the file name, like this:
'encoded_image' => '@'.realpath($fileName),
If you are using PHP >= 5.5, a file upload like this (with '@') probably won't work, since PHP added the CURLOPT_SAFE_UPLOAD option and the CURLFile class, which you should use instead of the '@' prefix.
Check this for more info on the topic. My example is built for PHP < 5.5.
I updated the GitHub repo, check it :) Hope this helps.
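For PHP >= 5.5, the upload could be sketched with CURLFile as below. The `encoded_image` field name comes from the answer linked above; the file path is hypothetical, and the surrounding cURL setup is an assumption rather than code tested against Google's endpoint.

```php
<?php
// Sketch for PHP >= 5.5: use CURLFile instead of the '@' prefix.
$fileName = '/path/to/local/image.jpg'; // hypothetical local file

$postData = array(
    // 'encoded_image' is the field name mentioned in the linked answer
    'encoded_image' => new CURLFile(realpath($fileName)),
);

$curl = curl_init('https://www.google.co.in/searchbyimage/upload');
curl_setopt($curl, CURLOPT_POST, true);
curl_setopt($curl, CURLOPT_POSTFIELDS, $postData);
curl_setopt($curl, CURLOPT_FOLLOWLOCATION, true); // the upload redirects to the results page
curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
// $html = curl_exec($curl); // uncomment to actually run the request
```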
Update 2 - scraping the original image URL
To scrape the original image URL as well, let's check the HTML code of the search page:
[Image: scraping the original image from a Google search results page]
To see the HTML code, uncomment `echo $dom; die();` in getSearchResults().
The data that we are already scraping is marked with a green line, and the new data is marked with red. The `div.th a` element contains the original image URL: we need to find it and parse its href attribute, extracting the `imgurl` part.
In order to do this, let's modify getSearchResults() function a little bit:
``` php
function getSearchResults(simple_html_dom $dom) {
    $result = array();
    //echo $dom; die(); // uncomment this if you want to check the search page that Google sends us
    $c = count($dom->find('div.srg')) > 1 ? 1 : 0; // on the first page there are 2 divs, the first with some
    // irrelevant links, so skip the first div
    $d = $dom->find('div.srg', $c); // get the second div (if this is the 1st page), or the first div otherwise
    foreach ($d->find('div.rc') as $div) {
        $a = $div->find('h3.r a', 0); // get the link to the website
        // Get the original image url
        $originalImg = $div->find('div.th a', 0);
        preg_match('/imgurl=(.+?)&/', $originalImg->href, $matches); // get the part with the original image url
        $result[] = array(
            'url' => $a->href,
            'title' => $a->innertext,
            'original_image' => isset($matches[1]) ? $matches[1] : ''
        );
    }
    return $result;
}
```
Compare it with the old code. We loop through `div.rc` and get the link to the website as usual. Then we get `div.th a` and parse the imgurl part with preg_match; $matches[1] contains the link to the original image.
Note that in `find('h3.r a', 0)`, the zero (0) tells Simple Html Dom to return the first found element instead of an array with all elements.
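To make the imgurl extraction concrete, here is a small standalone example. The href below is made up to match the shape of the result links described above; real hrefs will differ.

```php
<?php
// A made-up result href in the shape described above (hypothetical URL)
$href = '/imgres?imgurl=http%3A%2F%2Fexample.com%2Fcat.jpg&imgrefurl=http%3A%2F%2Fexample.com%2Fcats';

// Grab the value of the imgurl parameter (everything up to the next '&')
preg_match('/imgurl=(.+?)&/', $href, $matches);

// Decode %3A, %2F etc. back into a normal URL
$originalImg = urldecode($matches[1]);
```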
10/27/2015: The GitHub repo is updated to work with Google's search HTML updates (error on line 100).