In this article we will look at how Gimmeproxy.com stores its data. The only datastores used are Redis and ElasticSearch, both chosen for their speed.

Other posts in Gimmeproxy tech series:

Why Redis, you might ask? It is not a “real” database and not persistent enough. While that may be more or less true, depending on Redis settings (see http://oldblog.antirez.com/post/redis-persistence-demystified.html), it doesn’t matter for this project.

As shown in the previous article, Gimmeproxy gets proxies from open sources via a cron job. If some data is lost, it can rescrape the whole database in a couple of minutes. So Redis is a good choice here.

BTW, a dedicated Redis module might be a better way to solve this (https://redislabs.com/blog/writing-redis-modules). I’ll discuss ElasticSearch usage in the next article.

Some of the Redis code is written in Lua to save on network round trips and to achieve a transaction-like workflow for some operations.

I use node-redis-scripto to load Lua scripts into Redis from Node.js. While it has not been updated for a long time, it does the job just fine. Please let me know if you are aware of better alternatives.

npm install redis-scripto --save
npm install bluebird --save

Redis boilerplate

This boilerplate for working with Redis should be prepended to each example. Here I promisify Redis and redis-scripto for convenience and load the Lua scripts from the ./lua directory.

'use strict'
const Promise = require('bluebird')
const path = require('path')
const redis = require('redis')
const Scripto = require('redis-scripto')

const redisClient = redis.createClient()
const redisLua = new Scripto(redisClient)
// Load the scripts from ./lua; they are referenced by file name without the extension
redisLua.loadFromDir(path.resolve(__dirname, 'lua'))
// Add promise-returning *Async variants of the callback methods (e.g. runAsync)
Promise.promisifyAll(redisLua)
Promise.promisifyAll(redis)

Load proxy to Redis

Here we load the initial proxy data from the proxy list into Redis. The scripts use four keys:

  • gimme:proxies:available - set of available proxy ids
  • gimme:proxy:data - hash of proxy data, keyed by proxy id
  • gimme:proxies:tocheck - list of ids to check
  • gimme:proxies:checked - set of checked ids

lua/gimme-add-proxy.lua

local id = ARGV[1]
local value = ARGV[2]
-- don't add if proxy is already added
if tonumber(redis.call('SISMEMBER', 'gimme:proxies:available', id)) == 0 then
  redis.call('HSET', 'gimme:proxy:data', id, value)
  redis.call('SADD', 'gimme:proxies:available', id)
  redis.call('LPUSH', 'gimme:proxies:tocheck', id)
end
return 0
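
To make the script's effect easier to follow, here is a minimal in-memory sketch of the same logic in plain JavaScript. It is only a model for illustration, not code from Gimmeproxy: createStore and addProxy are hypothetical names, and a Set, a Map and an array stand in for the Redis set, hash and list.

```javascript
'use strict'

// Hypothetical in-memory stand-ins for the four Redis keys.
function createStore () {
  return {
    available: new Set(), // gimme:proxies:available (set)
    data: new Map(),      // gimme:proxy:data (hash)
    tocheck: [],          // gimme:proxies:tocheck (list)
    checked: new Set()    // gimme:proxies:checked (set)
  }
}

// Mirrors gimme-add-proxy.lua: an id that is already known is left untouched.
function addProxy (store, id, value) {
  const key = String(id) // Redis stores ids as strings
  if (!store.available.has(key)) {
    store.data.set(key, value) // HSET
    store.available.add(key)   // SADD
    store.tocheck.unshift(key) // LPUSH adds to the head of the list
  }
  return 0
}

const store = createStore()
addProxy(store, 1, JSON.stringify({ id: 1, ip: '127.0.0.1', port: 8080 }))
addProxy(store, 1, JSON.stringify({ id: 1, ip: '10.0.0.1', port: 3128 })) // ignored
console.log(store.tocheck) // [ '1' ]
```

The second call changes nothing, so rerunning the scraper over the same list neither queues an id twice nor overwrites data that is already stored.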

load.js

// ...boilerplate
const id = 1
const proxy = {
  id: 1,
  ip: '127.0.0.1',
  port: 8080,
  // etc
}
redisLua.runAsync('gimme-add-proxy', [], [id, JSON.stringify(proxy)])
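
The empty array passed as the second argument is the list of Redis keys. The scripts here hard-code their key names, so everything travels in ARGV instead. As a rough illustration of how such a call maps onto the Redis EVAL command (EVAL script numkeys key [key ...] arg [arg ...]), the hypothetical helper below builds the argument list. It is a sketch of the command layout, not redis-scripto's actual code.

```javascript
'use strict'

// Hypothetical helper: lays out arguments in the order the Redis EVAL
// command expects: script, number of keys, the keys, then the extra args.
function buildEvalArgs (script, keys, args) {
  return [script, keys.length].concat(keys, args)
}

const proxy = { id: 1, ip: '127.0.0.1', port: 8080 }
const evalArgs = buildEvalArgs('return 0', [], [proxy.id, JSON.stringify(proxy)])

// With no keys, ARGV[1] is the id and ARGV[2] is the JSON payload.
console.log(evalArgs.length) // 4
```

Passing key names through KEYS rather than hard-coding them is what the Redis documentation recommends, and it matters mostly for Redis Cluster. For a single-instance setup like this one, hard-coded names work.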

Add checked proxy

After a proxy has been checked, we update our data with what the check-proxy library returned.

lua/gimme-proxy-checked.lua

local id = ARGV[1]
local ok = tonumber(ARGV[2])
if ok == 0 then -- proxy is not working, remove
  redis.call('SREM', 'gimme:proxies:checked', id) -- remove from checked set
  redis.call('HDEL', 'gimme:proxy:data', id) -- remove data
  redis.call('SREM', 'gimme:proxies:available', id) -- remove from available
else -- proxy is working, add
  redis.call('LPUSH', 'gimme:proxies:tocheck', id) -- queue for the next check
  redis.call('HSET', 'gimme:proxy:data', id, ARGV[3]) -- store fresh check results
  redis.call('SADD', 'gimme:proxies:checked', id)
end

checker.js

// ...boilerplate
const id = 1
const proxy = {
  id: 1,
  ip: '127.0.0.1',
  port: 8080,
  protocol: 'http',
  anonymityLevel: 1
  // etc
}
// Second argument 1 means the proxy is working
redisLua.runAsync('gimme-proxy-checked', [], [id, 1, JSON.stringify(proxy)])

Remove proxy from Redis

If a proxy isn’t working, we remove it from Redis by calling the same script with the ok argument set to 0.

checker.js

// ...boilerplate
const id = 1
redisLua.runAsync('gimme-proxy-checked', [], [id, 0])
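
Both branches of gimme-proxy-checked.lua can be traced with a small in-memory model. As before, this is only an illustrative sketch: proxyChecked and the store object are hypothetical, with JavaScript collections standing in for the Redis keys.

```javascript
'use strict'

// Hypothetical in-memory stand-ins for the Redis keys used by the script.
const store = {
  available: new Set(['1', '2']),            // gimme:proxies:available
  data: new Map([['1', '{}'], ['2', '{}']]), // gimme:proxy:data
  tocheck: [],                               // gimme:proxies:tocheck
  checked: new Set()                         // gimme:proxies:checked
}

// Mirrors gimme-proxy-checked.lua.
function proxyChecked (store, id, ok, value) {
  const key = String(id)
  if (Number(ok) === 0) {
    // Dead proxy: forget it everywhere.
    store.checked.delete(key)
    store.data.delete(key)
    store.available.delete(key)
  } else {
    // Working proxy: requeue for the next check, refresh data, mark as checked.
    store.tocheck.unshift(key)
    store.data.set(key, value)
    store.checked.add(key)
  }
}

proxyChecked(store, 1, 1, JSON.stringify({ id: 1, protocol: 'http' }))
proxyChecked(store, 2, 0)

console.log([...store.checked])   // [ '1' ]
console.log([...store.available]) // [ '1' ]
```

A working proxy goes back onto the tocheck list, so it keeps being rechecked. A dead one disappears from every structure and can be added again by a later scrape.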

Return random proxy

lua/gimme-get-random-proxy.lua

local proxyId = ''
local proxyData = ''
proxyId = redis.call('srandmember', 'gimme:proxies:checked') -- get one random proxy
if proxyId then
  proxyData = redis.call('hget', 'gimme:proxy:data', proxyId)
end
return proxyData

get-proxy.js

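// ...boilerplate
// `request` comes from the surrounding HTTP handler; ip and ts are not used by the script below
const ip = request.headers['x-forwarded-for'] || request.ip
const ts = Math.floor(Date.now() / 1000)
// Pass empty keys and args explicitly, as in the other examples
redisLua.runAsync('gimme-get-random-proxy', [], [])
  .then(data => console.log(data))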
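
The script returns the stored JSON string, or an empty string when the checked set is empty, so the caller still has to decode it. A minimal sketch of that step follows. parseProxyReply is a hypothetical helper, not part of Gimmeproxy or redis-scripto.

```javascript
'use strict'

// Hypothetical helper: turns the script's reply into an object, or null
// when nothing usable came back.
function parseProxyReply (reply) {
  if (!reply) return null // '' (no checked proxies) or null (missing hash field)
  try {
    return JSON.parse(reply)
  } catch (err) {
    return null // stored value was not valid JSON
  }
}

console.log(parseProxyReply('{"id":1,"ip":"127.0.0.1","port":8080}').port) // 8080
console.log(parseProxyReply('')) // null
```

It could be chained onto the call above as .then(parseProxyReply), so the rest of the handler works with an object or null.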

That’s it for Redis. The next article will be about ElasticSearch.
