DIY AngularJS SEO with PhantomJS (the easy way!)


Setting up your AngularJS development environment needs to include SEO best practices. For JS-rendered applications, take a look at this solid solution using PhantomJS.

I’ve been tasked with recreating a website for a higher education institution, and I want to capitalize on AngularJS technologies to provide a rich user experience. Unfortunately, one of the largest issues with using the SPA approach to business/corporate/education web design is search engine optimization: AngularJS, like any JS-rendered application framework, is not SEO-friendly. To get around this, we need a way to serve search engine bots a set of pre-rendered HTML pages. Our goal is to create a development environment for AngularJS SEO awesomeness. In this tutorial, we’ll walk through how to get PhantomJS up and running right alongside our app, using the Yeoman AngularJS scaffolding that comes with a small development server. We’ll go from having nothing to having a full development environment, complete with a verifiable pre-rendered page cache for bots to eat up and enjoy.

Update – May 2nd: I’ve received a few dozen emails after publishing this article (and lots of generous praise; thank you, everyone!) Please use the comments section to post any questions or concerns you may have, as this is just the start of a huge AngularJS tutorial project that I’ve started.

The Scaffolding

To set up our development environment, we’ll be using Yeoman’s AngularJS generator, which is an all-in-one solution for developing and testing AngularJS applications. Open up a terminal and let’s get started:
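The commands look roughly like this (the package names are those of the standard Yeoman AngularJS generator, and the lawsonry folder name just matches the application root used later in this article):

```shell
# Install Yeoman and its AngularJS generator globally
npm install -g yo generator-angular

# Create the project folder used throughout this tutorial and scaffold the app
mkdir lawsonry && cd lawsonry
yo angular
```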

Wait for the scaffolding to start, and definitely go ahead and answer those questions about whether you want to include Bootstrap (that’s what I use) and any of those other AngularJS modules. Once it’s finished, you’ll have a development environment set up.

From the terminal you were just in, let’s test our development server:

grunt server

You should see some automation kick in, then your default browser will open up and you’ll see the default Yeoman scaffolding for the AngularJS template. Right now we’re all set to develop an SPA with AngularJS, but we’re going to take this one step further and create an HTML pre-rendering workflow to serve pre-rendered HTML pages to bots for SEO purposes.

The SEO Setup

I have to be honest: I’m only writing this tutorial because I could not find a decent tutorial on setting up an environment for AngularJS SEO awesomeness. It couldn’t be that hard, right? We need something that will tell crawlers to eat up pre-rendered pages, and thankfully, a lot of the heavy lifting in terms of module design has already been taken care of by Steeve on GitHub. Steeve’s code is actually the bulk of what we’re going to be using here.

The first thing we’ll need to do is get a hold of this angular-seo package from GitHub:

git clone https://github.com/steeve/angular-seo.git

Inside this folder you’ll have two core files:

  • angular-seo.js, which you need to put into your /lawsonry/app folder, and
  • angular-seo-server.js, which you need to put in your /lawsonry folder (or wherever your application root folder is; you know, the one with the Gruntfile.js file in it).

You can follow Steeve’s instructions, but I found them a little unhelpful at 6:00 AM. So let’s do this setup together.

The idea is simple: we’re going to have our application running from our application port, and a PhantomJS instance of our application running from a snapshot port. Requests from non-bots will be served directly from our application port (it doesn’t matter what port that is), and requests from bots and search engines will be served pre-rendered HTML content via the snapshot port.

To do this, we’ll have to do three things:

  • tell our application to enable AJAX indexing by crawlers;
  • include our seo module and tell our application to let us know when we’re done rendering the page; and
  • install and run PhantomJS.

Making our Site Crawlable

This couldn’t be easier. Go to your index.html file and add the following line inside the <head> section:

<meta name="fragment" content="!"/>

This basically tells search engines that, while you’re technically a SPA, you have the ability to interpret a special URL structure that they will request in order to ask for pre-rendered HTML pages. Here’s the gist of what’s happening:

  • A crawler hits your site and sees that it’s not pre-rendered HTML, but finds the fragment meta tag. This tag tells it to alter the way it requests information from your server by changing the hashbang in the URL structure to ?_escaped_fragment_=.
  • Now your server, with this request for a new URL, serves the request from a pre-rendered set of pages instead of from the application. This latter procedure gives the search engine a full HTML page to work with, rather than just an empty JS-rendered page.[1]
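The URL rewrite a crawler performs can be sketched in a few lines of JavaScript (the helper name here is purely illustrative; crawlers do this on their end):

```javascript
// Sketch of the AJAX-crawling rewrite: everything after the hashbang (#!)
// is moved into the ?_escaped_fragment_= query parameter.
function toEscapedFragment(url) {
  var i = url.indexOf('#!');
  if (i === -1) return url; // no hashbang: nothing for the crawler to rewrite
  return url.slice(0, i) + '?_escaped_fragment_=' + url.slice(i + 2);
}

// 'http://example.com/#!/about' becomes
// 'http://example.com/?_escaped_fragment_=/about'
```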

Adding the SEO Code

The next thing we’ll do is go into our app.js file and find the module inclusions part of our declaration. However you do it, you’ll need to include the seo module that comes inside the angular-seo.js file we put in our lawsonry/app folder earlier. For example, here’s what my module declarations block looks like:
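As a sketch (the app name and the other module names are assumptions based on a default Yeoman scaffold; only 'seo' is the addition), the declarations block can look like this:

```javascript
// app.js — module declaration with the 'seo' module appended
angular.module('lawsonryApp', [
  'ngCookies',
  'ngResource',
  'ngSanitize',
  'ngRoute',
  'seo' // provided by angular-seo.js
]);
```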

Notice that I’ve added the seo module up there. Make sure you do, too!

The last thing we’ll do in the app is set a scope-level declaration that all the HTML has been rendered. This is super easy: depending on how you organize your controllers, simply call $scope.htmlReady() whenever you are certain that the HTML page is done loading. This is often done at the end of the main controller. For example, with the controller that comes with Yeoman’s AngularJS scaffolding, your main.js file would look like this:
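As a sketch (the controller name and the awesomeThings array are what the default Yeoman scaffold generates; the htmlReady() call is the only addition):

```javascript
// main.js — the scaffold's main controller, signalling angular-seo
// once the page content is in place
angular.module('lawsonryApp')
  .controller('MainCtrl', function ($scope) {
    $scope.awesomeThings = [
      'HTML5 Boilerplate',
      'AngularJS',
      'Karma'
    ];
    // Tell the seo module that rendering is finished
    $scope.htmlReady();
  });
```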

Finally, we need to actually include the angular-seo.js file manually in our index.html file, toward the bottom where the includes for our controllers go. In an unedited scaffolding, my new index.html file looks like this (at the bottom):
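As a sketch, assuming the default Yeoman script layout (the exact paths in your scaffold may differ), the bottom of index.html ends up looking something like:

```html
<!-- existing scaffold includes -->
<script src="scripts/app.js"></script>
<script src="scripts/controllers/main.js"></script>
<!-- the angular-seo module we copied into the app folder -->
<script src="angular-seo.js"></script>
```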

Now we’re complete with app-level changes, so let’s move to our command line to deal with the server-side requirements. Don’t worry; we’re almost done!

Setting up PhantomJS

The last part of our setup is to install and run PhantomJS alongside our development environment. You should already have npm, so install PhantomJS like this:

npm install phantomjs

Once that’s completed, navigate to your application root directory (the one where we put the other angular-seo file, angular-seo-server.js) and run the following command:

phantomjs --disk-cache=no angular-seo-server.js 9090 http://127.0.0.1:9000

This will start a PhantomJS server with no disk caching (we’ll use that during production in another tutorial) on port 9090. It’s important to note that PhantomJS’s port needs to be different from the port that your application runs on. Notice that we have set the last parameter (the application URL) to be running on port 9000; that port number comes from the Grunt file native to Yeoman’s AngularJS scaffolding.

In other words, yo angular gives us the option to run grunt server, which sets up a localhost webserver to test our app on port 9000.

So think of it like this:

  • PhantomJS listens on port 9090 and fetches pages from our application on port 9000.
  • If a request contains the ?_escaped_fragment_= URL instead of the hashbang URL, then PhantomJS knows to pre-render the page and serve it, because the only way we wouldn’t be asking for a hashbang URL is if the requester is a crawler.
  • If a request contains a hashbang, then this is a human (browser) accessing the app, and we can go ahead and bypass PhantomJS altogether.

Now that we’ve got PhantomJS running, let’s go ahead and run our development server, too:

grunt server

Now we’ve got our development environment running a web server on 127.0.0.1 at port 9000 (or localhost, depending on what you like to call it), and a second web server running on port 9090 that renders pages from the app on port 9000 whenever a crawler asks for them. Fantastic!

Testing Your Pre-Rendered HTML

The last thing I would encourage everyone to do is test whether your site is serving pre-rendered HTML to requests that contain the ?_escaped_fragment_= URL. You do this by going back to your terminal and typing:

curl 'http://localhost:9090/?_escaped_fragment_='

This will request from your PhantomJS server whatever is routed to the '/' route, which should be (if you haven’t modified the Yeoman AngularJS scaffolding) the views/main.html file. The terminal should output a fully rendered HTML page. Check the contents of the <div class="container" ng-view=""> tag, and you should see a bunch of HTML underneath. It works!

Going Live

To take this to production, you’ll need to make one more adjustment on the server. Add a detection block in your site’s configuration that checks whether the _escaped_fragment_ parameter is being requested, because if it is, we’ll want to proxy the request over to PhantomJS instead of serving from our main server on port 80. If you’re using Nginx[2] (like I am), you can do this:

if ($args ~ escaped_fragment) {
    # Proxy to PhantomJS instance here
}
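As a rough sketch of what a filled-in proxy block can look like (the location and try_files lines are assumptions about a typical static-site config, not part of the original; the port matches this tutorial’s PhantomJS instance):

```nginx
location / {
    # Crawler requests carry the escaped fragment in the query string;
    # hand those to the PhantomJS prerender server.
    if ($args ~ escaped_fragment) {
        proxy_pass http://127.0.0.1:9090;
    }
    # Everyone else gets the app served normally.
    try_files $uri $uri/ /index.html;
}
```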

However you do it, just remember to have PhantomJS running on a different port than your web server.

Common Problems

(This section is reserved for commenters whose problems are solved. If you have any questions or concerns, leave a comment and let’s sort it out together!)

Problem: The curl test is not outputting pre-rendered HTML pages.

Solution: You need to ensure that your root route '/' is what you used for the server address when you instantiated PhantomJS. For example, if you’re routing your application’s root to '/index.html', you need to change the server address from the example above to http://localhost:9090/index.html?_escaped_fragment_=

Whew! I know it seems daunting, but once you have it all set up, it’s really very simple.

Footnotes

  1. If you want to see what a page looks like without pre-rendered HTML, just open up your AngularJS app in view-source:your-app-url and take a look. Notice anything? Your view partials are not loaded on this page; JavaScript loads these HTML files dynamically. If you think about this from a search engine’s point of view, how would you be able to see the web content unless you were viewing the site from a browser?
  2. In a future tutorial, we’ll talk about capitalizing on Nginx’s amazing static file serving capabilities by setting up a failover for a prerendered cache of HTML files in a snapshots/ folder. Basically, Nginx receives the request and checks a local snapshots/ folder to see if the requested file exists. If it does, it will check its last cache time. If it’s too old, it will recache the file with PhantomJS and then serve the cached file. If it’s not too old, it will simply serve the cached file.
  • Marcin Koczorowski

    Does this solution require hash-bangs in the URLs? What about a cleaner, non-hash solution? I have html5Mode on.

  • Howard.zuo

    The same question as Marcin Koczorowski.

  • Esteban Vera

    How can I achieve this on an Apache server?

  • Florian Cellier

    Cool, but what goes in “# Proxy to PhantomJS instance here”? I tried this:

    rewrite .* /$request_uri? break;
    proxy_pass http://localhost:9090;

    But the URL is not interpreted and it says it can’t get the route!

    • Try omitting the localhost and use 127.0.0.1 instead


  • I get this error:
    Failed to execute ‘postMessage’ on ‘DOMWindow’: The target origin provided (‘http://localhost:9000’) does not match the recipient window’s origin (‘http://localhost:9090’)

    Of course no JS files load, but apart from taking ages to load, it doesn’t fetch the data completely.

  • Abdul Khalid

    I have followed each step religiously, but Google is not crawling my site. I am getting the correct response when I do a curl on the same URL, but when I search for the link on Google, the URL does not show any description.

    • Use Google’s webcrawler/SEO checker to see what it sees. Is it a caching problem?

  • Hendra Kurniawan

    I followed all the steps, but when I test with curl I’m not getting prerendered HTML; instead I get non-prerendered HTML.

    Can someone help me?

    😀

  • Mukesh

    This will start a PhantomJS server with no disk caching (we’ll use that during production in another tutorial) on port 9090

    Do you have the cached ver­sion of the tuto­ri­al?

    • I do not. But the PhantomJS docs are fantastic and you can probably get all you need from their site.

  • CHALAKA ELLAWALA

    Hi, when I entered the command that contains curl, it outputs empty HTML. For incorrect URLs it also outputs empty HTML.

    • CHALAKA ELLAWALA

      Hi, I was finally able to sort out this issue. The issue was exactly what is mentioned in Common Problems. Thanks for this awesome guide; it saved a lot of time. But there is still an issue I’m working on: some of the content, like images, is not loading. I’m getting ‘Failed to load resource: net::ERR_CONNECTION_RESET’ in the console for image files.


  • Omi Amarwal

    Hello Jesse Lawson,
    this solution is working fine as I follow your code flow in my live project, but it’s only working on the home page, e.g. (view-source:http://www.handicrunch.com:8888/); the meta data of the home page renders on port 8888, but when I open a product details page it’s not working there, e.g. (view-source:http://www.handicrunch.com:8888/en/product-TWT7009/sun-face-printed-tapestry.html). Please provide me a solution for this. Thanks

  • Amir

    Some questions:
    1. Do I still have to do this? Google has said they read JS-based sites. (Looks like they don’t, though.)

    2. I do not use hashbang syntax with the #!, so my site looks like any other site: http://www.mysite.com/userpage.
    How will I detect the starting point? What should I use instead of

    3. Did you publish that article about how to configure Nginx to hold the static pages? A link would be great.
    Thanks!

  • Nirav

    Port 9000 is asking for a username and password? How can we make it work without HTTP auth?

    • That’s a problem with your server config, not with Angular or Phantom.
