This post shows how to scrape popular photos from Instagram using the Instagram API and PHP, and how to download them quickly in parallel using Redis and cURL.

Prerequisites: PHP, Redis server (http://redis.io/)

Installing the PHP Redis extension

wget https://github.com/nicolasff/phpredis/archive/master.zip
unzip master.zip
cd phpredis-master
phpize
./configure
make
sudo make install
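
After `make install`, the extension usually still has to be enabled by adding a line such as `extension=redis.so` to your php.ini (the file's location depends on your setup). The following is a minimal sketch for checking that PHP actually sees the extension. The file name check_redis.php and the function redis_extension_status() are hypothetical names chosen for this example.

```php
<?php
// check_redis.php: hypothetical helper script, not part of the original post.

function redis_extension_status(): string
{
    // extension_loaded() reports whether PHP has the named extension enabled.
    if (!extension_loaded('redis')) {
        return 'missing';
    }
    // phpversion() with an extension name returns that extension's version string.
    return 'loaded, version ' . phpversion('redis');
}

echo 'phpredis: ' . redis_extension_status() . "\n";
```

Run it with `php check_redis.php`. If it reports "missing", the CLI is probably reading a different php.ini than the one you edited; `php --ini` lists the files in use.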

Scraping Instagram

First, register as an Instagram developer (if you haven’t already) here: https://instagram.com/developer/register/. Then create an application and get your public and secret keys.

Next, create a directory for your project and install the Instagram API wrapper for PHP via Composer:

composer require cosenary/instagram

Then run the bundled example (/vendor/cosenary/instagram/example/index.php) to obtain your access token.

Then create scraper.php:

<?php
require_once 'vendor/autoload.php';

use MetzWeb\Instagram\Instagram;

date_default_timezone_set('UTC');

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

$instagram = new Instagram(array(
    'apiKey'    => 'YOUR_APP_KEY',
    'apiSecret' => 'YOUR_APP_SECRET',
));
$accessToken = 'YOUR_ACCESS_TOKEN';
$instagram->setAccessToken($accessToken);

// Fetch the currently popular media
$search = $instagram->getPopularMedia();
$data = $search->data;

foreach ($data as $d) {
    // Queue images only; skip other media types
    if ($d->type == 'image') {
        $item = array(
            'images'       => $d->images,
            'caption'      => $d->caption,
            'created_time' => $d->created_time,
            'id'           => $d->id,
            'filename'     => $d->id.'.jpg', // file name derived from the media id
        );
        // Store the item as a serialized string at the head of the list
        $redis->lPush('photo:queue', serialize($item));
    }
}

This fetches popular photos from Instagram and pushes their metadata onto a Redis list (photo:queue) so the images can be downloaded later.
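
To see what ends up in the queue, here is a small sketch of the serialize/unserialize round trip that each item goes through. It runs without Redis or the API. The $media object is a hand-made stand-in for one element of $search->data, and its id and URL are made-up placeholder values.

```php
<?php
// Fake media entry standing in for one element of $search->data.
// The id and URL below are placeholder values, not real Instagram data.
$media = (object) [
    'type'   => 'image',
    'id'     => '123_456',
    'images' => (object) [
        'standard_resolution' => (object) ['url' => 'https://example.com/123_456.jpg'],
    ],
];

// Same shape as the item built in scraper.php (trimmed to three fields).
$item = [
    'images'   => $media->images,
    'id'       => $media->id,
    'filename' => $media->id . '.jpg',
];

// serialize() turns the array, nested objects included, into a plain string.
// That string is what lPush stores in the Redis list.
$payload = serialize($item);

// A worker reverses the step and gets the same structure back.
$restored = unserialize($payload);

echo $restored['filename'], "\n";
echo $restored['images']->standard_resolution->url, "\n";
```

Because the nested objects survive the round trip, a worker can read the image URL as $item['images']->standard_resolution->url after unserializing.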

Downloading photos in parallel

Create a ‘photos’ directory and download.php:

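<?php
$dir = __DIR__.'/photos';

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);

function get($url) {
    // usual cURL GET: return the response body, or false on failure
}

while (true) {
    // Block for up to 10 seconds waiting for a new item from Redis
    $item = $redis->brPop('photo:queue', 10);
    $retry = 1;
    if ($item) {
        // brPop returns array(key, value); the value is the serialized item
        $item = unserialize($item[1]);
        $filename = $dir.'/'.$item['filename'];
        // Skip photos that have already been downloaded
        if (!file_exists($filename)) {
            // Keep retrying every 2 seconds until the download succeeds
            while (!($photo = get($item['images']->standard_resolution->url))) {
                echo "retrying download {$retry}\n";
                sleep(2);
                $retry++;
            }
            file_put_contents($filename, $photo);
            echo "Loaded {$filename}\n";
        }
    } else {
        // The timeout expired with nothing in the queue
        echo "no items in Redis\n";
    }
}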

Here get($url) is an ordinary cURL download function. You can get it from the final Gist.
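
If you don't have the Gist at hand, a minimal sketch of such a function might look like the following. It is an assumption about what the helper does, not a copy of the Gist's code. The timeout values are arbitrary choices, and it requires PHP's cURL extension.

```php
<?php
// A possible get(): returns the response body as a string, or false on failure.
// This is a guess at the helper's behaviour, not the code from the Gist.
function get($url)
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_CONNECTTIMEOUT => 10,   // seconds to wait for the connection
        CURLOPT_TIMEOUT        => 60,   // seconds allowed for the whole transfer
        CURLOPT_FAILONERROR    => true, // treat HTTP status >= 400 as a failure
    ]);
    $body = curl_exec($ch);

    // An empty body also counts as a failure, so the caller's retry loop runs.
    return ($body === false || $body === '') ? false : $body;
}
```

Returning false on any failure matches how download.php uses the function: its loop `while(!($photo = get(...)))` keeps retrying as long as the result is falsy.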

Next, start several download workers. Each one is launched like this:

nohup php download.php >/dev/null 2>&1 &

Run this command once for each downloader you need.
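
If you would rather not repeat the command by hand, a small launcher script can do it for you. This is a sketch under a few assumptions: the file name launch.php and the functions worker_command() and launch_workers() are hypothetical, download.php is expected in the same directory, and the system is Unix-like with nohup available.

```php
<?php
// launch.php: hypothetical helper, not part of the original post.

function worker_command(string $script): string
{
    // The same shell command as above, with the script path safely quoted.
    return 'nohup php ' . escapeshellarg($script) . ' >/dev/null 2>&1 &';
}

function launch_workers(string $script, int $count): void
{
    for ($i = 0; $i < $count; $i++) {
        // Output is redirected and the command ends with "&",
        // so exec() returns immediately instead of waiting for the worker.
        exec(worker_command($script));
    }
}

// Start workers only when a count is given on the command line.
if (PHP_SAPI === 'cli' && isset($argv[1])) {
    launch_workers(__DIR__ . '/download.php', max(1, (int) $argv[1]));
}
```

For example, `php launch.php 4` starts four background workers.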

To stop the downloads, first list the running workers:

ps aux | grep download.php

Then kill each of the processes found, using the PID shown in the second column of the output.

The complete code is available in the final Gist.