With Groovy, it’s very easy to parse XML data and extract arbitrary information. This works great as long as the input data is well-formed, but you can’t always guarantee that in real-world scenarios. Think of extracting data from HTML pages. They are very often a mess when it comes to XML validity and that’s where the TagSoup library comes to the rescue.
There are two major problems with HTML input:
- DTD resolution
- Missing closing tags
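The second problem can be reproduced without touching the network. The following is a minimal sketch, not part of the original script: the HTML fragment is made up, and the `groovy.xml.XmlSlurper` import assumes Groovy 3 or later (on Groovy 2.x, drop that import, since `groovy.util.XmlSlurper` is available by default).

```groovy
import groovy.xml.XmlSlurper          // assumption: Groovy 3+; remove on Groovy 2.x
import org.xml.sax.SAXParseException

// Hypothetical fragment: <br> and the first <p> are never closed.
// Browsers accept this, but it is not well-formed XML.
def html = '<html><body><p>first<br><p>second</p></body></html>'

def failed = false
try {
    new XmlSlurper().parseText(html)
} catch (SAXParseException e) {
    // The default XML parser stops at the first mismatched tag.
    failed = true
    println "XmlSlurper gave up: ${e.message}"
}
```

The default parser is strict by design, so any page that relies on the browser's leniency should end up in the catch block.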
We are going to build a simple Groovy script that prints the list of questions on StackOverflow’s start page. The straightforward solution looks something like this:
def slurper = new XmlSlurper()
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
We parse http://stackoverflow.com with XmlSlurper, loop over all tags whose class attribute is ‘question-hyperlink’ and print each of them. But when running the script we get the following exception:
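Caught: java.io.IOException: Server returned HTTP response code: 503 for URL: http://www.w3.org/TR/html4/strict.dtd
    at html_parser.run(html_parser.groovy:7)
The parser tries to download the DTD referenced in the page’s DOCTYPE declaration, and the W3C server refuses the request. We can tell the parser not to load external DTDs: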
def slurper = new XmlSlurper()
slurper.setFeature("http://apache.org/xml/features/nonvalidating/load-external-dtd", false)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
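To see the effect of the DTD switch in isolation, here is a small sketch that parses an inline, well-formed document carrying the same DOCTYPE. The sample markup is invented. The first `setFeature` call is an assumption about newer Groovy versions, which, as far as I know, make XmlSlurper reject DOCTYPE declarations by default; the import assumes Groovy 3 or later.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; remove on Groovy 2.x

// Hypothetical, well-formed sample that references the external HTML 4 DTD.
def xhtml = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html><body><a class="question-hyperlink" href="#">Sample question</a></body></html>'''

def slurper = new XmlSlurper()
// Assumption: needed on newer Groovy versions, which disallow DOCTYPE by default.
slurper.setFeature('http://apache.org/xml/features/disallow-doctype-decl', false)
// The feature from the article: do not fetch the DTD named in the DOCTYPE.
slurper.setFeature('http://apache.org/xml/features/nonvalidating/load-external-dtd', false)

def page = slurper.parseText(xhtml)
def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }*.text()
println titles
```

With external DTD loading switched off, the parse should complete without any request to www.w3.org.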
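That takes care of the DTD, but the parser still chokes on markup that is not well-formed XML. This is where TagSoup comes in: it is a SAX-compliant parser for real-world HTML that can be handed to XmlSlurper:
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')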
def tagsoupParser = new org.ccil.cowan.tagsoup.Parser()
def slurper = new XmlSlurper(tagsoupParser)
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each {
    println it
}
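The same approach can be tried on an inline string, which makes it easier to experiment with different kinds of broken markup. This is only a sketch: the fragment below is invented, the import of `groovy.xml.XmlSlurper` assumes Groovy 3 or later, and `@Grab` still needs access to a Maven repository to fetch TagSoup.

```groovy
@Grab(group='org.ccil.cowan.tagsoup', module='tagsoup', version='1.2')
import org.ccil.cowan.tagsoup.Parser
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; remove on Groovy 2.x

// Hypothetical fragment with unclosed <p> and <br> tags and an unquoted attribute value.
def html = '<html><body><p>first<br><p>second ' +
           '<a class=question-hyperlink href="#">Sample question</a></body></html>'

// TagSoup repairs the markup and feeds well-formed SAX events to XmlSlurper.
def page = new XmlSlurper(new Parser()).parseText(html)
def titles = page.'**'.findAll { it.@class == 'question-hyperlink' }*.text()
println titles
```

The GPath expression is identical to the one used with the default parser; only the parser passed to the constructor changes.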

I’ve been using NekoHTML lately. The syntax is a bit different though and I haven’t benchmarked yet if it’s faster or slower than XmlSlurper.
FYI, the syntax is a little like this in Groovy:
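import static groovyx.net.http.ContentType.TEXT
import static groovyx.net.http.Method.GET
import org.cyberneko.html.parsers.DOMParser
import org.w3c.dom.Document
import org.xml.sax.InputSource

String get(def uri) {
    builder.request(uri, GET, TEXT, {}).text
}

Document document(def uri) {
    DOMParser parser = new DOMParser()
    parser.parse(new InputSource(new StringReader(get(uri))))
    parser.document
}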
(Builder is an HTTP Builder)
Thanks for pointing this out. I know some other libraries aiming at the same goal, e.g. TidyHTML, but I had never heard of NekoHTML.
Looks like this one is the way to go if you would like to use a DOMParser, though I really like XmlSlurper’s syntax in Groovy.
Performance is not really an issue in my projects, so I wouldn’t really care which one is faster. Reliability is more important: how close is the result to the real intention of the page?
great :) helped me right now with some simple custom HTML testing
Thank you!
Really useful for some web automation!
Very interesting article, makes what I was doing with Java way shorter. But I was wondering how would I select an element that is, say, all tags after a certain class, or all bold text on the page. What I’m trying to scrape isn’t a class unfortunately…
I’ve found a partial solution to selecting other elements. You can use it.name() == 'p' for all p tags, or replace 'p' with 'h1' for all h1 tags. If anyone else knows how to select page elements more specifically, I’d still like more info…
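For what it's worth, both cases from the question can probably be handled with plain GPath. The sketch below uses an invented, well-formed sample so that it runs with the default parser; the class name `intro` is hypothetical, and the import assumes Groovy 3 or later. For real pages, the TagSoup parser from the article would be passed to the XmlSlurper constructor.

```groovy
import groovy.xml.XmlSlurper   // assumption: Groovy 3+; remove on Groovy 2.x

// Hypothetical, well-formed sample page.
def page = new XmlSlurper().parseText('''<html><body>
  <h1>Title</h1>
  <p class="intro">Some <b>bold</b> text</p>
  <p>More <b>bold words</b> here</p>
</body></html>''')

// All bold text on the page: match on the element name instead of a class.
def bold = page.'**'.findAll { it.name() == 'b' }*.text()

// All sibling elements that come after the element with class "intro".
def siblings = page.body.children().list()
def idx = siblings.findIndexOf { it.@class == 'intro' }
def after = siblings.drop(idx + 1)

println bold
println after*.text()
```

Since `findAll` takes an arbitrary closure, conditions on name, attributes and text can be combined freely, e.g. `it.name() == 'a' && it.@href.text().startsWith('/questions')`.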
#!/usr/bin/env groovy
@Grapes([
@Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15'),
@Grab(group='xerces', module='xercesImpl', version='2.9.1'),
])
def slurper = new XmlSlurper(new org.cyberneko.html.parsers.SAXParser())
def htmlParser = slurper.parse("http://stackoverflow.com/")
htmlParser.'**'.findAll { it.@class == 'question-hyperlink' }.each { println it }
I am getting following exception for second code snippet:
[Fatal Error] :20:187: The reference to entity “A” must end with the ‘;’ delimiter.
Caught: org.xml.sax.SAXParseException; systemId: http://stackoverflow.com/; lineNumber: 20; columnNumber: 187; The reference to entity “A” must end with the ‘;’ delimiter.
org.xml.sax.SAXParseException; systemId: http://stackoverflow.com/; lineNumber: 20; columnNumber: 187; The reference to entity “A” must end with the ‘;’ delimiter.
at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1236)
at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:568)
at HtmlParsers.run(HtmlParsers.groovy:3)
This article is a life saver. Don’t know why there are so many other blogs on the web indicating things like the Neko parser is good for screen scraping when we all know the web isn’t a perfect world and there’s a high chance the XML is malformed.
Thanks for sharing.
Jsoup is pretty neat as well:
@Grab(group='org.jsoup', module='jsoup', version='1.7.2')
def document = org.jsoup.Jsoup.connect("http://stackoverflow.com/").get()
println document.select('.question-hyperlink')*.text()