I’ve been spending some time at work scraping data. Long story short: government transparency is not transparent when the only access they give you is a pile of poorly structured HTML. That’s better than government opacity but not past the level of frosted glass: titillating but unsatisfying. If your expected audience is pencil pushers, please release your data in a spreadsheet. That’s what I did.

Notes for nerds:

**Regular Expressions vs. Parsing Engines:** I wrote the first parser in Python with regular expressions, then rewrote it with BeautifulSoup (a Python HTML parsing library). The RegExp version took me about 2 hours to write. The BeautifulSoup version took me about 2 days. It’s slightly easier to maintain now, but you tell me which one is more semantically correct:

project_title = re.search('<tr><td><b>Project&nbsp;title</b></td><td>(.+)</td></tr>', line)


project_title = app.find(text="Project&nbsp;title").parent.parent.nextSibling.string

Yep, it’s written in 2-column tables with each row being a different data-set: the first column holds a key (if there is a key; sometimes there isn’t) and the second column holds the data. With RegExp, I know exactly what I’m looking for. With the parser, I have to find the element in the tree, then traverse up, over and down (if there isn’t a key, I have to go up, up, over, over, over, down, over, down). The data itself is a big set of applications (2000+ total) and each application has about 15 different data-sets (some with keys, some that just follow a consistent-ish pattern).
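To make the comparison concrete, here is a minimal sketch of both approaches against a made-up two-row table. It is not my actual scraper: the sample HTML, the `County` row and the helper names are all hypothetical, and the parser half uses the standard library’s `html.parser` rather than BeautifulSoup so that it runs with no extra installs. The idea is that the parser version walks every row into a key/value dictionary instead of climbing the tree from one matched string.

```python
import re
from html import unescape
from html.parser import HTMLParser

# Hypothetical sample in the shape described above: one key/value pair per row.
SAMPLE = (
    "<table>"
    "<tr><td><b>Project&nbsp;title</b></td><td>Bridge repair</td></tr>"
    "<tr><td><b>County</b></td><td>Mackinac</td></tr>"
    "</table>"
)


def title_by_regex(html):
    """Regex approach: match the literal markup around the value."""
    match = re.search(
        r"<tr><td><b>Project&nbsp;title</b></td><td>(.+?)</td></tr>", html
    )
    return unescape(match.group(1)) if match else None


class RowCollector(HTMLParser):
    """Parser approach: collect the cell text of every table row."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag == "td" and self._cell is not None and self._row is not None:
            # &nbsp; decodes to U+00A0; normalise it to a plain space.
            text = "".join(self._cell).replace("\xa0", " ").strip()
            self._row.append(text)
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def rows_to_dict(html):
    """Return {key: value} for every row that has exactly two cells."""
    parser = RowCollector()
    parser.feed(html)
    parser.close()
    return {row[0]: row[1] for row in parser.rows if len(row) == 2}
```

Calling `rows_to_dict(SAMPLE)["Project title"]` and `title_by_regex(SAMPLE)` should give the same value. The regex here uses a non-greedy `(.+?)` because the sample sits on one line; matching line by line, as in the snippet above, avoids that problem. Rows without a key would still need their own handling in either version.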

Fortunately, I have an appreciative audience for my troubles and the data lets me draw pretty maps like the ones above. Those were also done with Python, by parsing an SVG vector image.

Michigan boaters beware: there is now an isthmus between Mackinaw City and St. Ignace. Michigan comes in 2 parts, and rather than rewrite the process to handle grouped shapes, it was good enough to make Michigan 1 shape. Hawaii somehow endured.
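For the curious, here is a rough sketch of how colouring an SVG map could work in Python, including one way grouped shapes might be handled without redrawing anything. This is not the process I actually used. It assumes each state is either a `<path>` with an `id`, or a `<g>` with an `id` wrapping several paths; the ids, colours and the `colour_map` name are all made up for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
# Keep the default namespace unprefixed when the tree is written back out.
ET.register_namespace("", SVG_NS)

# Hypothetical two-state map: Ohio is one path, Michigan is a group of two.
SAMPLE_SVG = (
    '<svg xmlns="http://www.w3.org/2000/svg">'
    '<path id="OH" d="M0 0 L1 0 L1 1 Z"/>'
    '<g id="MI">'
    '<path d="M2 0 L3 0 L3 1 Z"/>'
    '<path d="M2 2 L3 2 L3 3 Z"/>'
    "</g>"
    "</svg>"
)


def colour_map(svg_text, fills):
    """Set a fill colour on every path belonging to each id in `fills`."""
    root = ET.fromstring(svg_text)
    for element in root.iter():
        fill = fills.get(element.get("id"))
        if fill is None:
            continue
        # iter() yields the element itself (if it is a path) plus any
        # descendant paths, so single shapes and groups take the same route.
        for shape in element.iter(f"{{{SVG_NS}}}path"):
            shape.set("fill", fill)
    return ET.tostring(root, encoding="unicode")
```

Something like `colour_map(SAMPLE_SVG, {"OH": "#cc0000", "MI": "#0000cc"})` would colour both halves of Michigan without building a bridge first. One caveat: if the source image sets colours through a `style` attribute, that takes precedence over a plain `fill` attribute, so the style string would need rewriting instead.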