Alex MacCaw

Ruby/JavaScript developer & entrepreneur. O'Reilly writer and open source developer. Working for Stripe.


I'm @maccman on Twitter.

I'm maccman on Github.

I'm oldmanorhouse on Skype.


I am now blogging at blog.alexmaccaw.com

HTML/XML Parsing with Node & jQuery

November 13, 2011

I recently had to do some HTML parsing to convert some Markdown into the format required for Nettuts+ tutorials. This meant moving various elements around, adding classes, and appending some new elements.

Normally I'd go with Nokogiri, Ruby's de facto solution to XML parsing. However, I quickly ran into issues which, combined with the library's class-based excuse for documentation, made me decide to take a different approach.

One thing I realized was that jQuery's API is perfect for this scenario, especially when it comes to traversal and manipulation. If only there were a Ruby equivalent with a similar interface.

Then it struck me: forget Ruby, let's just use Node and jQuery. In fact, there's already a jQuery npm package for this, which includes an HTML parser and DOM emulator.

First, install the necessary npm dependencies (in the app's directory):

npm install -g coffee-script
npm install jquery node-markdown

Then create a CoffeeScript Cakefile:

fs = require('fs')
$  = require('jquery')
md = require('node-markdown').Markdown

task 'build', 'Build index.html', ->
  # Read in the Markdown source
  markdown = fs.readFileSync('./index.md', 'utf8')

  # Convert the Markdown to HTML
  html = md(markdown)

  # Create jQuery object
  doc  = $('<body />').append(html)

  # Insert <hr /> before all <h2 /> elements
  doc.find('h2').before('<hr />')
  doc.find('hr:first').remove()

  # Correct pre syntax
  doc.find('pre code').each ->
    $(@).parent().html $(@).html()
  doc.find('pre').attr('name', 'code').addClass('cs')

  # Remove images from p tags, and wrap them correctly
  doc.find('p img').each ->
    parent = $(@).parent()
    parent.after $(@)
    parent.remove()
  doc.find('img').wrap('<div class="tutorial_image" />')

  # Add required class to blockquotes
  doc.find('blockquote').addClass('pullquote pqRight')

  # Write out file
  fs.writeFileSync('./index.html', doc.html())

Now tell me that syntax isn't concise and beautiful: a vast improvement over XML parsing with other libraries.

Our build task can be invoked by running cake build, generating the resultant index.html file.

Of course, this approach won't suit every use case. For example, I have no idea how the script performs. However, for my needs, where it only has to run once, it's ideal. If need be, we could even pipe the resultant HTML back to Ruby via STDOUT.
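To make the STDOUT idea concrete, here is a minimal sketch in plain JavaScript rather than CoffeeScript. It is not part of the Cakefile above: `renderToStream` and `buildHtml` are hypothetical names made up for illustration, with `buildHtml` standing in for whatever produces the final markup (in the Cakefile, that would be `doc.html()`).

```javascript
// Sketch only: write generated HTML to a stream instead of a file, so a
// parent process (a Ruby script, say) can capture it from STDOUT.
// `renderToStream` and `buildHtml` are hypothetical names, not from the Cakefile.
function renderToStream(buildHtml, out = process.stdout) {
  // Assumption: buildHtml is a zero-argument function returning an HTML string.
  const html = buildHtml();

  // End with exactly one trailing newline so line-oriented readers
  // on the other end of the pipe see a complete final line.
  out.write(html.endsWith('\n') ? html : html + '\n');
}

// Example call (commented out so that loading this file prints nothing):
// renderToStream(() => '<p>Hello</p>');
```

In the Cakefile itself, the equivalent change would probably be as small as swapping the `fs.writeFileSync` call for `process.stdout.write(doc.html())`. The Ruby side could then capture the output with backticks or `IO.popen`.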