PhantomJS (like any modern web browser) is known to consume a lot of memory. Even if it can’t be fixed completely, there are some workarounds to decrease memory usage.

We’ll be using CasperJS, but the tips work equally well with plain PhantomJS or SlimerJS.

Tip 1: don’t load images and other heavy resources

Most of PhantomJS’s memory goes to the resources that the browser loads. That might be obvious, but when I stopped loading images on one project, RAM usage dropped from 4 GB to 1 GB immediately.

var casper = require('casper').create({
    pageSettings: {
        loadImages: true,   // kept on deliberately, see the note below
        loadPlugins: false
    },
    // Called for every resource the page requests; abort the heavy ones.
    onResourceRequested: function(R, req, net) {
        var match = req.url.match(/fbexternal-a\.akamaihd\.net\/safe_image|\.pdf|\.mp4|\.png|\.gif|\.avi|\.bmp|\.jpg|\.jpeg|\.swf|\.fla|\.xsd|\.xls|\.doc|\.ppt|\.zip|\.rar|\.7zip|\.gz|\.csv/gim);
        if (match !== null) {
            net.abort();
        }
    }
    //... add other stuff here
});

In this snippet onResourceRequested prevents images and other non-HTML files from loading, which saves some memory. The list is extensive and even includes archives and PDFs, but you get the idea.

You can even exclude a website’s JavaScript and CSS files. That might be useful in some cases, for example when you need to skip some heavy ads (see fbexternal-a.akamaihd.net).

Note that loadImages is set to true. This is reported to be necessary to stop memory leaks (https://github.com/ariya/phantomjs/issues/12903#issuecomment-143427299).
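For the JavaScript and CSS case, here is a minimal sketch of a blocklist that is a bit stricter than one long regex. The helper name shouldAbort and the two lists are made up for illustration. It only matches an extension at the end of the URL path, so a URL that merely contains ".png" somewhere in the middle is not aborted. Keep in mind that blocking a site's scripts can break pages that render their content with JavaScript.

```javascript
// Hypothetical helper (not part of CasperJS): decide whether a request
// should be aborted, based on its host or on its file extension.
var BLOCKED_HOSTS = ['fbexternal-a.akamaihd.net'];
// The extension must be followed by "?", "#" or the end of the URL.
var BLOCKED_EXTENSIONS = /\.(?:js|css|png|gif|jpe?g)(?:[?#]|$)/i;

function shouldAbort(url) {
    for (var i = 0; i < BLOCKED_HOSTS.length; i++) {
        if (url.indexOf('//' + BLOCKED_HOSTS[i] + '/') !== -1) {
            return true;
        }
    }
    return BLOCKED_EXTENSIONS.test(url);
}

// Wire the helper into CasperJS only when running under PhantomJS/SlimerJS,
// where the global `phantom` object exists.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create({
        pageSettings: {
            loadImages: true,
            loadPlugins: false
        },
        onResourceRequested: function (self, requestData, networkRequest) {
            if (shouldAbort(requestData.url)) {
                networkRequest.abort();
            }
        }
    });
}
```

Keeping the decision in a plain function makes it easy to try the rules against a list of real URLs before running a full scrape.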

Tip 2: close and recreate the web page

One can use casper.page.close() and require('webpage').create() to recreate the web page. Note that you then have to reapply all initial options yourself.
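With plain PhantomJS the idea can be sketched roughly as follows. The function recyclePage and its pageFactory parameter are hypothetical names used for illustration. The factory is passed in so the function can be tried outside PhantomJS as well.

```javascript
// Hypothetical helper: close the old page, create a fresh one and
// reapply the settings that the old page was configured with.
function recyclePage(oldPage, pageFactory, settings) {
    if (oldPage) {
        oldPage.close();
    }
    var page = pageFactory();
    for (var key in settings) {
        if (Object.prototype.hasOwnProperty.call(settings, key)) {
            page.settings[key] = settings[key];
        }
    }
    return page;
}

// Under PhantomJS this could be used along these lines:
//
//   var settings = { loadImages: true, loadPlugins: false };
//   page = recyclePage(page, function () {
//       return require('webpage').create();
//   }, settings);
```

Other state, such as the viewport size or event handlers, is not carried over by this sketch and would have to be reapplied in the same way.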

It’s possible to add a function to Casper.prototype to do just that (see https://github.com/n1k0/casperjs/pull/826#issuecomment-34950562).

//from github.com/n1k0/casperjs/pull/826#issuecomment-34950562
// Note: createPage() and utils are internal to CasperJS's casper.js module,
// so this function is meant to be added there.
Casper.prototype.newPage = function newPage() {
    "use strict";
    this.checkStarted();
    this.page.close();
    this.page = this.mainPage = createPage(this);
    this.page.settings = utils.mergeObjects(this.page.settings, this.options.pageSettings); // loads initial options
    if (utils.isClipRect(this.options.clipRect)) {
        this.page.clipRect = this.options.clipRect;
    }
    if (utils.isObject(this.options.viewportSize)) {
        this.page.viewportSize = this.options.viewportSize;
    }
    return this.page;
};

In theory this should free all memory used by the current web page, but the discussion in the pull request linked above questions that. Hopefully this patch will be merged into CasperJS master someday.

Tip 3: manually restart PhantomJS/SlimerJS after several iterations

This isn’t really a PhantomJS tip, but when nothing else helps, you can use an external tool such as NodeJS to manage Phantom’s jobs and results. You are probably using one already to save results to a database.

The algorithm can look something like this:

  1. NodeJS runs a PhantomJS process with the URLs/parameters to scrape.
  2. PhantomJS scrapes the pages, prints the results as JSON, and exits after 10 page loads.
  3. NodeJS consumes Phantom’s output and runs another PhantomJS process to load the next pages.
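
The steps above can be sketched in NodeJS roughly like this. It is only a sketch under a few assumptions: phantomjs is on the PATH, and a hypothetical script scrape.js takes URLs as command-line arguments and prints one JSON array to stdout. The function names are made up for illustration.

```javascript
var execFile = require('child_process').execFile;

// Split a list into consecutive batches of at most `size` items.
function chunk(items, size) {
    var batches = [];
    for (var i = 0; i < items.length; i += size) {
        batches.push(items.slice(i, i + size));
    }
    return batches;
}

// Run one short-lived PhantomJS process for a batch of URLs.
// Assumption: scrape.js (hypothetical) prints a JSON array and exits.
function runPhantomBatch(urls, callback) {
    var args = ['scrape.js'].concat(urls);
    var options = { maxBuffer: 16 * 1024 * 1024 };
    execFile('phantomjs', args, options, function (err, stdout) {
        if (err) { return callback(err); }
        var results;
        try {
            results = JSON.parse(stdout);
        } catch (parseError) {
            return callback(parseError);
        }
        callback(null, results);
    });
}

// Process the batches one after another; all memory used by a batch is
// returned to the OS when its PhantomJS process exits.
function scrapeAll(urls, batchSize, runBatch, done) {
    var batches = chunk(urls, batchSize);
    var results = [];
    function next(index) {
        if (index >= batches.length) { return done(null, results); }
        runBatch(batches[index], function (err, batchResults) {
            if (err) { return done(err); }
            results = results.concat(batchResults);
            next(index + 1);
        });
    }
    next(0);
}
```

A call such as scrapeAll(urls, 10, runPhantomBatch, callback) would then restart PhantomJS after every 10 page loads. Passing the runner in as a parameter keeps the batching logic testable without PhantomJS installed.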

Some tasks allow you to run several instances of PhantomJS in parallel. That can increase scraping speed considerably while keeping the memory usage of each instance low.
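
A minimal sketch of such a parallel runner in NodeJS, assuming a runBatch(batch, callback) function that starts one PhantomJS process for the given batch and calls back with its parsed results. Both runInParallel and runBatch are hypothetical names. The concurrency parameter caps how many processes are alive at once, so total memory stays roughly bounded by that number times the usage of a single instance.

```javascript
// Run batches with at most `concurrency` of them in flight at any time.
// Results are stored by batch index, so the output order matches the input.
function runInParallel(batches, concurrency, runBatch, done) {
    var results = new Array(batches.length);
    var nextIndex = 0;
    var active = 0;
    var finished = false;

    if (batches.length === 0) { return done(null, []); }

    function startOne(index) {
        active++;
        runBatch(batches[index], function (err, batchResult) {
            active--;
            if (finished) { return; }
            if (err) {
                finished = true;
                return done(err);
            }
            results[index] = batchResult;
            if (nextIndex >= batches.length && active === 0) {
                finished = true;
                return done(null, results);
            }
            launch();
        });
    }

    // Start new batches until the concurrency limit is reached.
    function launch() {
        while (!finished && active < concurrency && nextIndex < batches.length) {
            startOne(nextIndex++);
        }
    }

    launch();
}
```

A sensible concurrency value depends on the available RAM and CPU cores, so it is worth starting low and measuring.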