PhantomJS memory leak: how to decrease memory usage
PhantomJS (like any modern web browser) is known to consume a lot of memory. Even if it can’t be fixed completely, there are some workarounds to decrease memory usage.
We’ll be using CasperJS, but the tips apply equally well to plain PhantomJS or SlimerJS.
A PhantomJS memory leak starts with what the browser loads. That might sound obvious, but when I stopped loading images on one project, RAM usage immediately dropped from 4 GB to 1 GB.
In this snippet, onResourceRequested prevents images and other non-HTML files from loading, which saves some memory. The list of blocked file types is extensive and even includes archives and PDFs, but you get the idea.
Note that loadImages is set to true; it’s reported to be necessary to avoid memory leaks (https://github.com/ariya/phantomjs/issues/12903#issuecomment-143427299).
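Since the original snippet isn’t shown here, below is a sketch of the idea. The extension list, the `skipResource` helper, and the exact callback signature are assumptions, not taken from the original post; check the CasperJS docs for the precise `onResourceRequested` arguments.

```javascript
// Sketch: abort requests for non-HTML resources in CasperJS.
// The extension list is illustrative; extend it (archives, PDFs, ...) as needed.
var BLOCKED = /\.(png|jpe?g|gif|svg|webp|css|woff2?|ttf|mp4|zip|pdf)(\?|$)/i;

function skipResource(url) {
    return BLOCKED.test(url);
}

// The wiring below only runs inside PhantomJS/CasperJS.
if (typeof phantom !== 'undefined') {
    var casper = require('casper').create({
        pageSettings: {
            // Reported to be necessary to avoid leaks, even though image
            // requests are aborted in onResourceRequested below.
            loadImages: true
        },
        onResourceRequested: function (self, requestData, networkRequest) {
            if (skipResource(requestData.url)) {
                networkRequest.abort(); // don't download the resource at all
            }
        }
    });
}
```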
One can use casper.page.close() and require('webpage').create() to recreate the webpage. Note, however, that you have to provide all the initial options yourself.
It’s possible to add a function to Casper.prototype to do just that (see https://github.com/n1k0/casperjs/pull/826#issuecomment-34950562).
In theory this should free all memory used by the current webpage, but the discussion in the aforementioned thread questions that. Hopefully this patch will be merged into the CasperJS master branch someday.
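As a rough illustration of what such a helper looks like (this is a sketch, not the patch from the pull request; the method name `recreateWebPage` and the `copySettings` helper are hypothetical, and the real patch may need to restore more internal state):

```javascript
// Sketch of a page-recycling helper (hypothetical; see the CasperJS PR for the
// real patch). It copies the settings onto a fresh webpage instance, since
// a newly created page does not inherit the old one's options.
function copySettings(oldSettings, overrides) {
    var merged = {}, key;
    for (key in oldSettings) { merged[key] = oldSettings[key]; }
    for (key in overrides) { merged[key] = overrides[key]; }
    return merged;
}

// The wiring below only runs inside PhantomJS/CasperJS.
if (typeof phantom !== 'undefined') {
    var Casper = require('casper').Casper;
    Casper.prototype.recreateWebPage = function () {
        var settings = copySettings(this.page.settings, {});
        this.page.close();                       // release the old page's memory
        this.page = require('webpage').create();
        this.page.settings = settings;           // re-apply initial options ourselves
        return this.page;
    };
}
```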
That isn’t really a PhantomJS tip, but when nothing else helps, you can use an external tool such as Node.js to manage Phantom’s jobs and results. You are probably using one anyway to save results to a database.
The algorithm can look like this:
1. Node.js runs a PhantomJS process with the URLs/parameters to scrape.
2. PhantomJS scrapes pages and exits after 10 page loads, printing results as JSON.
3. Node.js consumes Phantom’s output and runs another PhantomJS process to load the remaining pages.
Some tasks allow running several instances of PhantomJS in parallel, which increases scraping speed dramatically while keeping memory usage low.