Robust HTML parsing the Groovy way

With Groovy, it’s very easy to parse XML data and extract arbitrary information. This works great as long as the input data is well-formed, but you can’t always guarantee that in real-world scenarios. Think of extracting data from HTML pages. They are very often a mess when it comes to XML validity and that’s where the TagSoup library comes to the rescue.

There are two major problems with HTML input:

  • DTD resolution
  • Missing closing tags
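
The second problem is easy to reproduce offline. The following is a minimal sketch, not part of the original script: both inline strings are made-up samples, and the import assumes Groovy 3 or newer, where XmlSlurper lives in groovy.xml (on older versions the import can simply be dropped).

```groovy
import groovy.xml.XmlSlurper
import org.xml.sax.SAXParseException

// Hypothetical sample markup, not taken from any real page
def wellFormed = '<html><body><p class="note">Hello</p></body></html>'
def malformed  = '<html><body><p class="note">Hello<br></body></html>'

// Well-formed input parses without complaint
def page = new XmlSlurper().parseText(wellFormed)
println page.body.p.text()

// An unclosed <br> is perfectly normal HTML, but fatal for a strict XML parser
def failed = false
try {
    new XmlSlurper().parseText(malformed)
} catch (SAXParseException e) {
    failed = true
    println "Parse failed: ${e.message}"
}
```

The only difference between the two strings is the unclosed br tag, which is enough to make the default parser give up.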

We are going to build a simple Groovy script that prints the list of questions on StackOverflow’s start page. The straightforward solution looks something like this:

def slurper = new XmlSlurper()
def htmlParser = slurper.parse("http://stackoverflow.com/")

htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println	it
}

We parse http://stackoverflow.com with XmlSlurper, loop over all tags whose class attribute is ‘question-hyperlink’, and print each of them. But when running the script, we get the following exception:

Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd
	at html_parser.run(html_parser.groovy:7)

XmlSlurper has problems with HTML DTDs: the underlying parser tries to download the DTD referenced by the page, and the W3C server answers with a 503. Switching off the load-external-dtd feature gets rid of the exception.

def slurper = new XmlSlurper()
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def htmlParser = slurper.parse("http://stackoverflow.com/")

htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println	it
}

So, next try. The DTD exception is gone, but we get another one saying that the closing link tag is missing. This is where TagSoup comes in. It’s a library that tries to transform invalid HTML data into well-formed XML. And best of all, it works great together with XmlSlurper. Here is the final script:

@Grab(group='org.ccil.cowan.tagsoup',
      module='tagsoup', version='1.2' )
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("http://stackoverflow.com/")

htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each {
	println	it
}

The first command uses the @Grab annotation to load the TagSoup library. Next, we create a TagSoup Parser instance and pass it as a constructor parameter to XmlSlurper. That’s all, and we even got rid of the setFeature workaround.
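
To try the same combination without depending on a live page, here is a small sketch that feeds TagSoup an inline string. The markup is invented for illustration (unclosed p and br tags, an unquoted attribute, no closing html tag), and the import assumes Groovy 3 or newer; @Grab still needs network access once to download TagSoup.

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import groovy.xml.XmlSlurper

// Hypothetical, deliberately sloppy markup: unclosed <p> and <br>,
// an unquoted attribute value and a missing </html>
def html = '''<html><body>
<p>Latest questions<br>
<a class=question-hyperlink href="/q/1">First question</a>
<a class="question-hyperlink" href="/q/2">Second question</a>
</body>'''

// TagSoup repairs the markup before XmlSlurper ever sees it
def slurper = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser())
def page = slurper.parseText(html)

// Same GPath query as in the script above, collected into a list of link texts
def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }*.text()
titles.each { println it }
```

Passing the same string to a plain XmlSlurper fails on the first unclosed tag, so the only moving part here is the parser handed to the constructor.
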
Do you know other tricks to make HTML parsing more robust? Then please leave them in the comments.

11 thoughts on “Robust HTML parsing the Groovy way”

  1. I’ve been using NekoHTML lately. The syntax is a bit different though and I haven’t benchmarked yet if it’s faster or slower than XmlSlurper.

    FYI, the syntax is a little like this in Groovy:


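    import org.cyberneko.html.parsers.DOMParser
    import org.w3c.dom.Document
    import org.xml.sax.InputSource
    import static groovyx.net.http.ContentType.TEXT
    import static groovyx.net.http.Method.GET

    // Fetch the raw page text through the HTTPBuilder instance
    String get(def uri) {
        builder.request(uri, GET, TEXT, {}).text
    }

    // Parse the fetched text into a W3C DOM document with NekoHTML
    Document document(def uri) {
        DOMParser parser = new DOMParser()
        parser.parse(new InputSource(new StringReader(get(uri))))
        parser.document
    }

    (builder is an HTTPBuilder instance.)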

    • Thanks for pointing this out. I know some other libraries aiming at the same goal, e.g. TidyHTML, but I had never heard of NekoHTML.
      It looks like this one is the way to go if you would like to use a DOMParser, though I really like XmlSlurper’s syntax in Groovy.

      Performance is not really an issue in my projects, so I wouldn’t really care which one is faster. Reliability is more important: how close is the result to the real intention of the page?

  2. Very interesting article, makes what I was doing with Java way shorter. But I was wondering how would I select an element that is, say, all tags after a certain class, or all bold text on the page. What I’m trying to scrape isn’t a class unfortunately…

  3. I’ve found a partial solution to selecting other elements. You can use it.name() == 'p' for all p tags, or replace 'p' with 'h1' for all h1 tags. If anyone has more info on how to select page elements more specifically, I’d still like to hear it…

  4. #!/usr/bin/env groovy

    @Grapes([
    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15'),
    @Grab(group='xerces', module='xercesImpl', version='2.9.1'),
    ])

    def slurper = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
    def htmlParser = slurper.parse("http://stackoverflow.com/")

    htmlParser.'**'.findAll{ it.@class == 'question-hyperlink'}.each { println it }

  5. I am getting the following exception for the second code snippet:

    [Fatal Error] :20:187: The reference to entity "A" must end with the ';' delimiter.
    Caught: org.xml.sax.SAXParseException; systemId: http://stackoverflow.com/; lineNumber: 20; columnNumber: 187; The reference to entity "A" must end with the ';' delimiter.
    org.xml.sax.SAXParseException; systemId: http://stackoverflow.com/; lineNumber: 20; columnNumber: 187; The reference to entity "A" must end with the ';' delimiter.
    at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1236)
    at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
    at HtmlParsers.run(HtmlParsers.groovy:3)

  6. This article is a life saver. Don’t know why there are so many other blogs on the web indicating things like the Neko parser is good for screen scraping when we all know the web isn’t a perfect world and there’s a high chance the XML is malformed.

    Thanks for sharing.

  7. Jsoup is pretty neat as well:
    @Grab(group='org.jsoup', module='jsoup', version='1.7.2')
    def document = org.jsoup.Jsoup.connect("http://stackoverflow.com/").get();
    println document.select('.question-hyperlink')*.text()
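
Picking up the question from comments 2 and 3 about selecting elements by something other than a class: the it.name() check can be combined freely with the depth-first '**' traversal. The sketch below is only an illustration on a made-up, already well-formed snippet, so it needs no extra parser; the import assumes Groovy 3 or newer.

```groovy
import groovy.xml.XmlSlurper

// Hypothetical sample page, well-formed on purpose
def page = new XmlSlurper().parseText('''<html><body>
<h1>Questions</h1>
<p>Some <b>bold</b> text</p>
<p>More <b>important</b> words</p>
</body></html>''')

// All paragraph elements, wherever they sit in the tree
def paragraphs = page.'**'.findAll { it.name() == 'p' }

// All bold text on the page
def boldTexts = page.'**'.findAll { it.name() == 'b' }*.text()

// Several element names at once
def headings = page.'**'.findAll { it.name() in ['h1', 'h2'] }*.text()

println "${paragraphs.size()} paragraphs, bold: ${boldTexts}, headings: ${headings}"
```

The same closures work unchanged on a page parsed through TagSoup, since only the parser differs and the resulting tree is queried the same way.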
