Sunday, March 18, 2012

open source conundrum

I needed some code to perform a specific task (parse out a particular set of web pages). Obviously something someone has already done. A Google search, a Stack Overflow post, and a couple of candidates appear. High praise, both listed in the Maven repo (a major plus), but low documentation. Try the first, JTidy -- mostly works but doesn't quite parse out the way I want. Take a look at the second -- javadoc hard to find and sparse.

I could do the basic task directly with regular expression matching -- but it always takes a bit longer than you think to work out regex kinks. On the other hand, I could fiddle with the open source for even longer, only to find it doesn't quite do what I need or that it's buggy.

Thus it often is with second-tier open source. Never obvious what the best route is.

In the end I've decided to go with a hybrid JTidy/roll-my-own solution -- I can use JTidy to find the right chunks of the web page and then just String.substring to find what I need.
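
For the roll-my-own half, here is a rough sketch of the substring step. It assumes the text of the right chunk has already been pulled out (in my case that part is JTidy's job and isn't shown here). The class name ChunkExtractor, the method extractBetween, and the marker strings are all made up for illustration; none of them come from JTidy.

```java
import java.util.Optional;

// Hypothetical helper (not part of JTidy): given the text of a page chunk,
// return whatever sits between two known marker strings.
final class ChunkExtractor {
    private ChunkExtractor() {}

    static Optional<String> extractBetween(String chunk, String startMarker, String endMarker) {
        // Locate the start marker; give up if the chunk doesn't contain it.
        int start = chunk.indexOf(startMarker);
        if (start < 0) {
            return Optional.empty();
        }
        // The wanted text begins right after the start marker.
        int from = start + startMarker.length();
        // Look for the end marker only after that point.
        int end = chunk.indexOf(endMarker, from);
        if (end < 0) {
            return Optional.empty();
        }
        // Plain String.substring does the actual extraction.
        return Optional.of(chunk.substring(from, end).trim());
    }
}
```

Calling ChunkExtractor.extractBetween("Price: 42 USD", "Price: ", " USD") gives back an Optional holding "42". Returning an empty Optional when a marker is missing means a page whose layout has changed shows up as "nothing found" rather than a StringIndexOutOfBoundsException.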