Then we will stop web scrapers.
Please note that the latest Koa requires at least Node v7.6.0.
|
|
|
|
That’s it. This allows you to show different pages to visitors from different countries.
Here we set app.proxy to true, as I usually have Nginx in front of Node.js. I recommend this setup: Node is single-threaded, and slow clients can tie up connections and resources, while Nginx in front of Node deals with slow clients easily.
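As a rough sketch of such a geolocation middleware (the ip-api.io endpoint path, the response field names, and the node-fetch dependency are my assumptions, not the elided code above):

```js
const Koa = require('koa');
const fetch = require('node-fetch');

const app = new Koa();
app.proxy = true; // trust X-Forwarded-For headers set by Nginx

app.use(async (ctx) => {
  // endpoint path and country_code field assumed; check the ip-api.io docs for the exact format
  const geo = await fetch(`https://ip-api.io/json/${ctx.ip}`).then(res => res.json());
  ctx.body = geo.country_code === 'DE' ? 'Hallo, Deutschland!' : 'Hello, world!';
});

app.listen(3000);
```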
But https://ip-api.io can do so much more! For example, it's possible to prevent web scraping. As you know, scrapers usually use Tor and public proxies. ip-api.io checks Tor, public proxy, and spammer databases and provides this information as well.
A full ip-api.io response looks like this:
|
|
is_proxy, is_tor_node, and is_spam are self-explanatory. is_suspicious is true when any of the aforementioned fields is true.
So let’s also stop scrapers:
|
|
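The code above is elided; a hedged sketch of such a blocking middleware, reusing the is_suspicious flag described earlier (the lookup helper and endpoint path are assumptions):

```js
const fetch = require('node-fetch');

// hypothetical helper; the endpoint path is assumed, not taken from the article
const lookupIP = ip => fetch(`https://ip-api.io/json/${ip}`).then(res => res.json());

// `app` is the Koa application from the previous snippet
app.use(async (ctx, next) => {
  const geo = await lookupIP(ctx.ip);
  // is_suspicious is true when is_proxy, is_tor_node or is_spam is true
  if (geo.is_suspicious) {
    ctx.status = 403;
    ctx.body = 'Forbidden';
    return;
  }
  await next();
});
```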
Other posts in Gimmeproxy tech series:
Let's see how to use Elasticsearch with Node.js. First, we have to connect to Elasticsearch.
|
|
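The connection code is elided above; a minimal sketch, assuming the legacy elasticsearch npm client (host and log settings are illustrative):

```js
const elasticsearch = require('elasticsearch');

const client = new elasticsearch.Client({
  host: 'localhost:9200',
  log: 'error'
});

// verify the connection before doing anything else
client.ping({ requestTimeout: 3000 })
  .then(() => console.log('Elasticsearch is up'))
  .catch(err => console.error('Elasticsearch is unreachable', err));
```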
Create or delete an index, and init the mapping
|
|
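A sketch of this step using the client from the connection snippet above; the proxies index, the proxy type, and the field list are illustrative (Elasticsearch 5.x mapping types assumed):

```js
async function initIndex(index) {
  // `client` is the Elasticsearch client created in the previous snippet
  if (await client.indices.exists({ index })) {
    await client.indices.delete({ index }); // drop the old index if it exists
  }
  await client.indices.create({ index });
  await client.indices.putMapping({
    index,
    type: 'proxy',
    body: {
      properties: {
        ip: { type: 'keyword' },
        port: { type: 'integer' },
        country: { type: 'keyword' },
        lastChecked: { type: 'date' }
      }
    }
  });
}

initIndex('proxies').catch(console.error);
```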
Now we can add or delete a document
|
|
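For example, indexing and deleting a single document by id might look like this (index, type and field names carried over from the sketch above):

```js
async function addAndRemove() {
  // add (or overwrite) a document with a known id
  await client.index({
    index: 'proxies',
    type: 'proxy',
    id: '1.2.3.4:8080',
    body: { ip: '1.2.3.4', port: 8080, country: 'US', lastChecked: new Date() }
  });

  // and delete it again
  await client.delete({ index: 'proxies', type: 'proxy', id: '1.2.3.4:8080' });
}

addAndRemove().catch(console.error);
```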
Or add/remove several documents at once with the bulk API, according to the Elasticsearch docs
|
|
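A bulk sketch under the same assumptions; the body alternates action lines and document lines:

```js
client.bulk({
  body: [
    { index: { _index: 'proxies', _type: 'proxy', _id: '1.2.3.4:8080' } },
    { ip: '1.2.3.4', port: 8080, country: 'US' },
    { index: { _index: 'proxies', _type: 'proxy', _id: '5.6.7.8:3128' } },
    { ip: '5.6.7.8', port: 3128, country: 'DE' },
    { delete: { _index: 'proxies', _type: 'proxy', _id: '9.9.9.9:80' } }
  ]
}).then(resp => console.log('bulk errors:', resp.errors))
  .catch(console.error);
```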
Finally, we are able to return a random item from all Elasticsearch records
|
|
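The article's own code for this is elided above; one common way to do it, which may differ from that code, is a function_score query with random_score:

```js
client.search({
  index: 'proxies',
  body: {
    size: 1,
    query: {
      function_score: {
        query: { match_all: {} },
        random_score: {} // give every document a random score
      }
    }
  }
}).then(result => {
  const hit = result.hits.hits[0];
  console.log(hit ? hit._source : 'index is empty');
}).catch(console.error);
```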
Other posts in Gimmeproxy tech series:
Why Redis, you might ask? It is not a "real" database and not persistent enough. Well, while that may be more or less true depending on Redis settings (see http://oldblog.antirez.com/post/redis-persistence-demystified.html), it doesn't matter for this project.
As shown in the previous article, Gimmeproxy gets proxies from open sources via a cron job. In case of data loss it can efficiently re-scrape the whole database in a couple of minutes, so Redis is a good choice here.
BTW, a dedicated Redis module might be a better way to solve this (https://redislabs.com/blog/writing-redis-modules). I'll discuss Elasticsearch usage in the next article.
Some of the Redis code is written in Lua to save on network round-trips and achieve a transaction-like workflow for some operations.
I use node-redis-scripto to load Lua into Redis from Node.js. While it hasn't been updated for a long time, it does the job just fine. Please let me know if you are aware of better alternatives.
|
|
This is some boilerplate for working with Redis that should be prepended to each example. Here I promisify Redis and redis-scripto for convenience and load Lua scripts from the ./lua directory.
|
|
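The boilerplate itself is elided above; a minimal sketch of that setup, assuming bluebird and redis-scripto's loadFromDir API, might look like this:

```js
const Promise = require('bluebird');
const redis = require('redis');
const Scripto = require('redis-scripto');

// promisify node_redis and redis-scripto so the *Async variants are available
Promise.promisifyAll(redis.RedisClient.prototype);
Promise.promisifyAll(redis.Multi.prototype);
Promise.promisifyAll(Scripto.prototype);

const client = redis.createClient();
const scripts = new Scripto(client);
scripts.loadFromDir(__dirname + '/lua'); // loads every *.lua file from ./lua

module.exports = { client, scripts };
```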
Here we load the initial proxy data from the proxy list into Redis.
lua/gimme-add-proxy.lua
|
|
load.js
|
|
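The actual script and loader are elided above; as a rough sketch of this step, with assumed key names (a proxy:<ip>:<port> hash plus a proxies:new set) and the scripts manager from the boilerplate:

```js
const { scripts } = require('./redis-boilerplate'); // hypothetical path to the snippet above

// a proxy as returned by the proxy-list scraper
const proxy = { ip: '1.2.3.4', port: 8080, source: 'gatherproxy' };

// 'gimme-add-proxy' maps to lua/gimme-add-proxy.lua picked up by loadFromDir
scripts.runAsync(
  'gimme-add-proxy',
  [`proxy:${proxy.ip}:${proxy.port}`, 'proxies:new'], // KEYS
  [JSON.stringify(proxy)]                             // ARGV
)
  .then(added => console.log(added ? 'queued for checking' : 'already known'))
  .catch(console.error);
```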
After a proxy has been checked, we update our data with what the check-proxy library provided.
lua/gimme-proxy-checked.lua
|
|
checker.js
|
|
If a proxy isn't working, we remove it from Redis.
checker.js
|
|
lua/gimme-get-random-proxy.lua
|
|
get-proxy.js
|
|
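The Lua script and get-proxy.js above are elided; one simple way to pick a random working proxy, assuming a proxies:working set of ids plus per-proxy hashes, is SRANDMEMBER followed by HGETALL:

```js
const { client } = require('./redis-boilerplate'); // promisified node_redis client from the boilerplate

async function getRandomProxy() {
  const id = await client.srandmemberAsync('proxies:working'); // random member of the set
  if (!id) return null;
  return client.hgetallAsync(`proxy:${id}`);                   // the full proxy hash
}

getRandomProxy().then(console.log).catch(console.error);
```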
That's it for Redis. The next article will be about Elasticsearch.
Other posts in Gimmeproxy tech series:
Ok, I admit the title is a bit cheesy :) But that's really it. In June I was shopping for a new laptop and considered the available options. At the time, Apple was silent about the specs and release date of the new MBP, as usual.
BTW: I'm using a MacBook Pro 2015 at work every day and it's ok, but I needed a more powerful machine for personal use.
My previous setup included two laptops: one used for development, the other used as a server. It needed an upgrade as I was planning to travel Asia this winter, which didn't happen because I moved to Berlin, Germany in October instead.
I needed a decent travel-friendly laptop:
There were several good options on the market: Asus Zenbook Pro UX501VW, Dell XPS, some MSI offerings, Asus ROG, nice Sager/Clevo machines, etc. The new MacBook Pro was not available and the old one didn't satisfy my needs.
At first I was planning to buy a Dell XPS 15 despite some people reporting issues with it. My spouse already has one and it's a really nice machine.
But luckily I stumbled upon another Dell laptop - Dell Precision 15 5000 Series (5510). It had all the features I wanted and then some.
Dell Precision photos copyright by Pascal Volk - https://www.flickr.com/photos/sigalrm/
It looks exactly like the Dell XPS, but has a more powerful processor and a less powerful graphics card. As far as I understand, everything else is pretty much the same. So I got it and don't regret it.
The particular model I got has i7-6820HQ, 32GB DDR4 RAM, 512GB SSD, Nvidia Quadro M1000M/Intel HD Graphics 530, Windows 10 PRO. I can’t say enough good things about this machine. It’s powerful, fast and light. Works really well for me.
Cons? Windows 10 seems to restart the computer whenever it wants, and there's no way to completely prevent it.
Unfortunately for me, it's not possible to comfortably develop on Windows alone. I need a virtual machine with Linux to be able to work with the latest technologies. It's just a lot easier on Linux.
Microsoft is working on Bash for Windows; maybe it will change the way I work, but for now it's still experimental.
So that's what I did. I installed Oracle VirtualBox. It has a great display integration feature that allows working seamlessly on both operating systems at once.
Then I created a VM with 12GB RAM and a 64GB hard drive.
I installed the Lubuntu 16.04 desktop to use as a dev machine. Lubuntu is supposed to work well on old computers and its interface is quite fast. It's really fast in a VM.
Being Linux, Lubuntu lets me easily install all the server software I need. The terminal and SSH client are also much better than their Windows counterparts.
Visual Studio Code works fine and I develop and run code right inside Lubuntu. In fact I’m even writing this post inside of it.
This setup has worked well for me for several months already. Hopefully this info will help someone in the same position I was in.
[code] blocks and the <!--more--> tag.
But WordPress is great, so why did I switch from it?
While WordPress has a lot of great features and modules, I was not using them. I felt that WordPress had just too many functions and too much code for a blog as simple as mine. And what could be simpler than Markdown and static files?
Also, the WYSIWYG online interface didn't work well for me. I usually post code examples and have to manually prepare them in some other editor before posting them into WordPress, since proper formatting with Tab, Shift+Tab and the like is impossible in the browser.
It would be great to use just one editor to do it all, namely VS Code, which btw is great! :)
Also, since I switched to Node.js/JavaScript a year ago, keeping PHP running on the server was somewhat annoying.
All in all, WordPress seemed like overkill.
Hexo.io is a static blog generator which supports Markdown. It seems pretty mature and stable and has a lot of modules and themes. That means I spend less time setting up the blog and more time writing.
And it's JavaScript! So I already had everything needed to run it.
Installation is pretty simple: install it from npm, init a new blog, and create some markdown files. Several automatic deploy options are provided, or you can just copy all your files to a remote server.
Hexo provides migration options as well. In order to migrate from WordPress, there is the hexo-migrator-wordpress plugin (https://hexo.io/docs/migration.html#WordPress).
My WP blog was using the Yoast SEO plugin, which is not supported by the default migrator plugin, so I had to add it manually. Then I added support for the <!--more--> WordPress tag, as my theme allows it. I also converted [code] tags to markdown code blocks and removed <pre> tags.
If you need it, here is what to do. Feel free to modify as needed.
In your blog directory open node_modules/hexo-migrator-wordpress/index.js
Find the lines with the field definitions and modify them as follows
|
|
That's it: your new blog posts will get nice tags, descriptions, and native markdown code formatting. You will probably still have to fix some things manually, but it's a good place to start.
Some were just a web proxy interface, some required authentication. Some were banned, slow or didn't respond at all. And I needed only working proxies. That's how the Gimmeproxy.com idea was born.
Other posts in Gimmeproxy tech series:
Gimmeproxy is built with Node.js (Koa), Redis, and Elasticsearch, with the ELK stack for monitoring as well. Nginx is used as a reverse proxy.
I’m planning to describe all of the tech used in several articles. I’ll explain how Gimmeproxy.com works and provide some code examples.
This is the first one; let's start with getting and checking proxies. Everything will be done with Node.js, so get it ready.
Gimmeproxy uses a custom proxy-list parsing script which collects free public proxies from websites like HideMyAss or GatherProxy. It's written in JavaScript and is run by cron every 30 minutes.
At the time of development I was not aware of any decent open-source proxy-scraping modules or libraries, but now there is one, so you don't have to write it yourself.
I’m talking about https://github.com/chill117/proxy-lists. Let’s check how to use it.
|
|
Sample code client.js
|
|
Sample foundProxies result
|
|
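The sample code and output above are elided; roughly, proxy-lists usage looks like the following sketch (the event names and item fields come from my reading of the proxy-lists README and should be double-checked):

```js
const ProxyLists = require('proxy-lists');
const foundProxies = [];

// getProxies returns an event emitter; proxies arrive in batches as each source is scraped
ProxyLists.getProxies() // default options: all sources
  .on('data', proxies => {
    foundProxies.push(...proxies); // each item roughly: { ipAddress, port, protocols, ... }
  })
  .on('error', err => console.error(err))
  .once('end', () => {
    console.log(`found ${foundProxies.length} proxies`);
  });
```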
Now we need to check whether the proxies are really working. I developed and open-sourced a library for this - https://github.com/256cats/check-proxy. It only needs an IP address and a port to start working; it will determine the anonymity level, protocol, and country for us.
It works like this: the library has a server and a client. The client tries to access the server through the provided proxy by sending GET and POST requests.
The server checks what was received from the proxy (if anything), whether the client's IP address was leaked, what the headers were, etc., and responds with JSON describing the proxy parameters. This allows us to reliably check that the proxy is indeed working.
Ok, let’s install check-proxy.
|
|
First you will need to run a server, preferably on a different machine, that will respond to check-proxy requests. I've put it on an OpenShift VM.
Install express
|
|
Create server.js
|
|
Here we used the getProxyType function provided by the check-proxy library to determine the proxy server options.
Run server
|
|
Now we are ready to actually check proxies. Let's also check whether the proxy supports Google.
Modify client.js to include the following
|
|
Then you can do something like this to check all proxies.
|
|
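The exact client call is elided above (see the check-proxy README for its options); as a structural sketch, a concurrency-limited loop over the scraped proxies might look like this, where testProxy is a hypothetical wrapper that resolves with the proxy details when the proxy works and rejects otherwise:

```js
async function checkAll(proxies, testProxy, concurrency = 10) {
  const working = [];
  for (let i = 0; i < proxies.length; i += concurrency) {
    const batch = proxies.slice(i, i + concurrency);
    // run one batch in parallel; failed checks resolve to null and are filtered out
    const results = await Promise.all(batch.map(p => testProxy(p).catch(() => null)));
    working.push(...results.filter(Boolean));
  }
  return working;
}

// usage: checkAll(foundProxies, testProxy).then(list => console.log(list.length, 'working'));
```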
This should give you an idea of how gimmeproxy gathers its proxies. In the next articles I'll show how to add them to Redis and Elasticsearch.
Other posts in Gimmeproxy tech series:
Now I'm working as a senior frontend developer for an automotive startup.
Anyway, the next long-overdue post is going to be about Gimmeproxy.
Tschüss!
The proxy-testing library is open-sourced, so check it out - https://www.npmjs.com/package/check-proxy - and feel free to use it in your own projects!
Also, a long-awaited feature, the search API, was finally added. For example, to get a random proxy from the United States, tested in the last 10 minutes and with port 80, do the following: http://gimmeproxy.com/api/getProxy?port=80&maxCheckPeriod=600&country=US.
More examples at the API page: http://gimmeproxy.com/#how.
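For instance, a minimal Node.js call to that endpoint might look like this (node-fetch is my choice here; the exact response fields are documented on the API page):

```js
const fetch = require('node-fetch');

// random US proxy on port 80, checked within the last 10 minutes
const url = 'http://gimmeproxy.com/api/getProxy?port=80&maxCheckPeriod=600&country=US';

fetch(url)
  .then(res => res.json())
  .then(proxy => console.log(proxy))
  .catch(console.error);
```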
A little bit of stats: GimmeProxy has 700-800 working proxies at any given moment. On average it serves 60 requests per second to more than 1300 machines every day, peaking at 150 requests per second. And it's growing!
Next week I’m going to blog about GimmeProxy internals, so stay tuned!
|
|
This post shows how to fix it.
Laravel 5.0 uses the mcrypt extension, which was finally removed from PHP 7. That's why you get errors. In order to fix it you have two options:
1) Upgrade Laravel 5 and your application to the latest version (http://laravel.com/docs/master/upgrade). It's a pretty tedious process, but it should be done anyway.
2) Or you can use a fake mcrypt extension to make Laravel think mcrypt is present.
Install laravel-mcrypt-faker
Note that we use --ignore-platform-reqs, otherwise you'll get errors like
Then open your config/app.php file and comment out Illuminate\Encryption\EncryptionServiceProvider in the providers list.
Replace it with either Thomaswelton\LaravelMcryptFaker\NoEncryptionServiceProvider or Thomaswelton\LaravelMcryptFaker\OpensslEncryptionServiceProvider.
NoEncryptionServiceProvider provides no encryption at all. Set the cipher variable in config/app.php to null (i.e. no encryption). Don't use it in production.
OpensslEncryptionServiceProvider encrypts data using the defuse/php-encryption package. In order to use it, you have to regenerate the app key
This should fix the problem.
|
|
Stop httpd and remove old php version
Install php70
Start apache
If you use php-fpm, restart it
Note that php-fpm config and php.ini files will be in /etc/opt/remi/php70/ and not in /etc as usual.
That’s it. Enjoy your new fast PHP:)
The default memory limit is 512MB on 32-bit systems and ~1.5GB on 64-bit systems. Increase it with
If you run Node in cluster mode, the limit still applies per process.
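A quick way to verify the configured limit from inside Node (the 4096 value and the script name are just examples):

```js
// check-heap.js - run with e.g.: node --max-old-space-size=4096 check-heap.js
const v8 = require('v8');

const limitMB = v8.getHeapStatistics().heap_size_limit / 1024 / 1024;
console.log(`V8 heap size limit: ${limitMB.toFixed(0)} MB`);
```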
|
|
Choose the version of NodeJS you would like to install. I use v4 for now.
NodeJS 5.x
NodeJS 4.x
NodeJS 0.12.x
NodeJS 0.10.x
Then install it
NPM is installed by default along with Node. We will need dev tools to compile and install some native addons from npm.
Install PM2
|
|
That’s it! We’ve got Node v4, NPM and PM2 on Centos 7.
1. Install Apache
|
|
Check that everything works fine by pointing your browser to http://your_server_IP_address/. You should see the Apache test page.
2. Enable Apache to start on boot
|
|
3. Install MariaDB (a popular MySQL fork)
|
|
4. Install PHP 5.6
We will get PHP 5.6 from the Remi repository.
Install EPEL repository
|
|
Install remi repository
|
|
Enable remi
|
|
Install php 5.6, redis and opcache on Centos 7
|
|
Then create the file /var/www/html/test.php containing <?php phpinfo(); ?> and open http://your_server_IP_address/test.php in a browser. You should see information about PHP with all the modules installed.
5. Install composer
|
|
That's it, the LAMP stack is installed! You can now set up a WordPress or Laravel-based website. In the next article we will install NodeJS 4, NPM and PM2 on Centos 7.
0. Prerequisites
|
|
1. Create new user and grant root privileges
|
|
2. Generate key pair to authenticate on the server
On the client (e.g. your home machine) we have to generate a private/public key pair. It will be used to authenticate to the server instead of a password.
This will create two files, ~/.ssh/id_rsa and ~/.ssh/id_rsa.pub. The former, id_rsa, is your private key, and id_rsa.pub is your public key.
We have to change the permissions of the private key on the client
|
|
3. Copy public key to the server
Log in as your new user in a new ssh session. Then the public key should be copied to the server and added to the new user's authorized_keys list:
|
|
Then set public key permissions (on the server)
|
|
Make sure that SELinux contexts are set correctly
|
|
4. Test new user
Don't close the current ssh (root) session yet. Try connecting to your server with the new user and private key.
If everything is ok, ssh won't ask you for a password (assuming you haven't set a passphrase when generating the keys).
If key authentication fails for some reason, ssh will ask for a password.
Now we can protect our server.
5. Protect ssh
Let's protect ssh by changing the default port number and blocking root access and password authentication. We will also force protocol 2.
|
|
Add the following lines to /etc/ssh/sshd_config
|
|
Then restart ssh
|
|
Try connecting to your server on the new port (24653 in this example); it should work fine. You shouldn't be able to connect as the root user anymore and will have to use sudo.
If you have trouble connecting to the new port, check the next section. Sometimes firewalld is already installed and blocks unknown ports, and you will have to add yours.
6. Install firewall
|
|
Since the firewall isn't aware of ssh on our changed port yet, we need to add the port to the firewall.
|
|
We plan to run additional services and have to open the firewall for them explicitly.
Enable http service
|
|
If you need a web server with SSL/TLS, enable the https service
|
|
If you need to open other ports, you can do it the same way we did here.
To check other available services, you can use this command
|
|
Now let’s load all added exceptions
|
|
Again, connect to the server to test everything. Then make the firewall load on boot.
|
|
7. Time synchronization
|
|
8. Installing fail2ban
Fail2ban bans malicious IPs after too many failed logins and the like. We will install it from the EPEL repository, so you will get updates as they are released. Other tutorials show how to install the rpm from other repos; that may work, but do it at your own risk.
Install EPEL & fail2ban for Centos 7
|
|
There are no jails configured by default, so let's create a basic sshd jail.
|
|
Add the following
|
|
Save the file and start fail2ban
|
|
Have it start at boot
|
|
To restart, do the following
|
|
Check the log file
Try to connect to your server again, and if everything works fine, congratulations! You have just configured and protected your Centos 7 server.
In the next article we will install LAMP stack. Stay tuned.
We'll be using CasperJS, but the tips will work equally well with plain PhantomJS or SlimerJS.
The PhantomJS memory leak starts with the browser itself. That might be obvious, but when I stopped loading images on one project, RAM usage dropped from 4GB to 1GB immediately.
In this snippet, onResourceRequested prevents images and other non-HTML files from loading, which saves some memory. The list is extensive and includes even archives and PDFs, but you get the idea.
You can even exclude the website's JavaScript and CSS files. That might be useful in some cases, for example when you need to skip some heavy ads (see fbexternal-a.akamaihd.net).
Note that loadImages is set to true; it's reported to be necessary to stop memory leaks (https://github.com/ariya/phantomjs/issues/12903#issuecomment-143427299).
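The snippet itself is elided above; a minimal sketch of this kind of filter in CasperJS (the URL pattern and target page are illustrative) could look like this:

```js
// run with: casperjs scrape.js
var casper = require('casper').create({
  pageSettings: {
    loadImages: true // kept true, as noted above, to avoid the leak
  }
});

// abort requests for images, archives, PDFs and other non-HTML resources
casper.on('resource.requested', function (requestData, request) {
  if (/\.(png|jpe?g|gif|ico|zip|rar|pdf)(\?.*)?$/i.test(requestData.url)) {
    request.abort();
  }
});

casper.start('http://example.com/', function () {
  this.echo(this.getTitle());
});

casper.run();
```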
One can use casper.page.close() and require('webpage').create() to recreate the webpage. But note that you have to provide all initial options yourself.
It’s possible to add a function to Casper.prototype to do just that (see https://github.com/n1k0/casperjs/pull/826#issuecomment-34950562).
|
|
In theory this should free all the memory used by the current webpage, but the discussion at the aforementioned link questions that. Hopefully this patch will be merged into the CasperJS master branch someday.
This isn't really a PhantomJS tip, but when nothing else helps, you can use an external tool like NodeJS to manage Phantom's jobs and results. You are probably using it anyway to save results to some database.
The algorithm can be something like this:
Some tasks allow running several instances of PhantomJS in parallel, which will increase scraping speed dramatically while keeping memory usage low.
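As a rough illustration of such a supervisor (the phantomjs binary name and the scrape.js batch script are assumptions):

```js
// supervisor.js - restart PhantomJS after each batch to keep memory usage flat
const { spawn } = require('child_process');

function runBatch() {
  // scrape.js is a hypothetical Phantom/Casper script that processes one batch of jobs
  const phantom = spawn('phantomjs', ['scrape.js'], { stdio: 'inherit' });
  phantom.on('exit', code => {
    console.log(`PhantomJS exited with code ${code}, starting the next batch`);
    setTimeout(runBatch, 1000); // short pause, then spawn a fresh process
  });
}

runBatch();
```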
|
|
Usually this error means that your hard drive (or partition) is out of space, or there's something wrong with permissions and MySQL can't write to it.
Since everything worked fine up to that moment, I assumed the former. Okay, let’s check free space:
|
|
What's that? The disk is almost full, but there shouldn't be anything particularly large on that machine. Let's check the space consumed by MySQL.
|
|
That's a lot! I removed a couple of 'backup' tables and checked the space again, but nothing changed.
After some Googling I found out that this issue is related to a very old bug report. Basically, the InnoDB engine (which is the default these days) never shrinks the ibdata files, which contain the data for all tables and indexes. Even when you delete rows or drop tables, they are only marked as deleted, so the space is never freed.
Luckily it's easy to tell MySQL to store each table in its own file and reclaim the free space. But before doing that, make sure you have enough space somewhere to store the dumps.
If you are running Amazon EC2 instance, you may try expanding your disk (see instructions here).
Let’s fix this issue. Here is an adaptation of awesome instructions.
1. Dump all databases
|
|
2. Drop all databases (except information_schema, mysql, performance_schema)
3. Stop Mysql
4. Modify /etc/my.cnf to include the following:
innodb_buffer_pool_size is the size of the memory buffer that caches table data and indexes.
If you have a lot of RAM, you might set it to up to 80% of the available RAM, depending on your use case for this machine.
innodb_log_file_size should be a quarter of innodb_buffer_pool_size.
5. Remove ibdata and ib_logfile files
6. Restart Mysql
7. Reload data back into Mysql
That's it. Now when you drop tables, InnoDB will remove their files as well and the space will be reclaimed.
First, sign up for the Amazon Affiliate Program. Then you need to create AWS public and secret keys.
Open the AWS website and click Account -> Security Credentials. Log in and you'll see the "Access Keys" tab. Create a new key pair and store it somewhere.
Install Amazon Advertising API wrapper.
Create amazon.php
There is a lot of other information in $formattedResponse as well. You can print it or check the Amazon docs.
First, install the Redis and redis-scripto modules as described in the previous article.
Create demo.js. It runs simple-counter.lua and prints the result: the script adds one hit to a sorted set and returns the number of hits made in the last 24 hours.
|
|
Create lua/simple-counter.lua
|
|
It just adds one timestamp to the stat:hits sorted set, removes old timestamps and returns the total count.
You may notice that the score and the value are both the same timestamp.
That leads to incorrect results when two requests are made simultaneously: the timestamps will be the same and only one entry will be stored in the sorted set. To fix it, we need to use some unique value.
For simplicity, we can use some Redis key as a counter and just increment it with every request. Modify simple-counter.lua:
|
|
Here we used stat:counter as a unique value.
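A hedged sketch of what that script might look like, run here via node_redis's EVAL for self-containment (the article loads its scripts through redis-scripto instead):

```js
const redis = require('redis');
const client = redis.createClient();

// KEYS[1] = stat:hits (sorted set), KEYS[2] = stat:counter, ARGV[1] = current unix time
const script = `
local now = tonumber(ARGV[1])
local id = redis.call('INCR', KEYS[2])                    -- unique value for this hit
redis.call('ZADD', KEYS[1], now, id)                      -- score = timestamp, value = unique id
redis.call('ZREMRANGEBYSCORE', KEYS[1], 0, now - 86400)   -- drop hits older than 24h
return redis.call('ZCARD', KEYS[1])                       -- hits in the last 24 hours
`;

client.eval(script, 2, 'stat:hits', 'stat:counter', Math.floor(Date.now() / 1000), (err, hits) => {
  if (err) throw err;
  console.log('hits in the last 24 hours:', hits);
  client.quit();
});
```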
What if we don't need per-second stats and are only interested in minutes? We can save some memory by doing this
Or if we need hours:
Modify simple-counter.lua accordingly.
If your site gets millions of hits, removing from and counting sorted sets might become an expensive operation. If you don't need information on each individual hit (you might want to store the user's request URL, IP address and other information), you can use just a few integer keys and let Redis expire and remove the old ones.
Since there are only 24 hours in a day, we could create 24 keys and store hits for each hour accordingly. But then we can't make the window moving, as we need all 24 keys to get the stats and can't delete 'old' ones.
To fix this, let's pretend that there are 25 hours in a day (timestamp in hours modulo 25). We will write hits to each of the 25 keys in turn, but use only the last 24 of them to get the total. With each write we ask Redis to expire the current hour's key in 86400 seconds (24h).
Since we have 25 hour buckets and each key expires in 24 hours, the least recently written one (written 25 hours ago) will be removed by Redis, and we will start counting from zero when we get to it again.
Thus we have a moving window (rolling counter).
demo.js
key-counter.lua
|
|
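The demo and Lua files above are elided; a self-contained sketch of the same 25-bucket idea in plain Node.js (no Lua), using assumed key names, might look like this:

```js
const Promise = require('bluebird');
const redis = require('redis');
Promise.promisifyAll(redis.RedisClient.prototype);
Promise.promisifyAll(redis.Multi.prototype);

const client = redis.createClient();
const BUCKETS = 25;        // pretend there are 25 hours in a day
const DAY = 24 * 60 * 60;  // 24 hours in seconds

async function hit() {
  const hour = Math.floor(Date.now() / 1000 / 3600);
  const current = `stat:hits:${hour % BUCKETS}`;

  // record the hit and (re)set the 24h expiry on the current bucket
  await client.multi().incr(current).expire(current, DAY).execAsync();

  // sum the latest 24 buckets; the 25th (oldest) is ignored and expires on its own
  const keys = [];
  for (let i = 0; i < 24; i++) {
    keys.push(`stat:hits:${(hour - i) % BUCKETS}`);
  }
  const counts = await client.mgetAsync(keys);
  return counts.reduce((sum, n) => sum + (parseInt(n, 10) || 0), 0);
}

hit().then(total => console.log('hits in the last 24 hours:', total)).catch(console.error);
```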
Actually, my spouse proposed another way of doing the same thing which might be more intuitive. We can use 24 keys and a flag (another key) indicating the current hour we are writing to. When the current hour changes, set the corresponding hour's stat to zero. Redis is single-threaded, so we won't get any race conditions like setting the flag multiple times.
Since we don't need to expire anything, we can use Redis hashes.
demo.js
key-counter2.lua
|
|