I spend a lot of time digging around in other people's data. If I'm not hunting for anything in particular, it's a bit of a crapshoot, to be honest, if only because you never know what you're in for. You can pretty much take it to the bank that if you didn't assemble it yourself, you can't count on it being complete, well formed, or anything approximating the output of a human being (it usually came out of a database, but I think you see what I'm getting at). Sometimes, if I'm really lucky, I'll get hold of a JSON dump of the database, which to be fair is better than nothing when there isn't even an API to use. From time to time I'll make an attempt at fitting the data into a database of some kind, sometimes MySQL, sometimes SQLite, or occasionally an API layer like Sandman2. This is all well and good, but it winds up being more of an adventure than I'm looking for. I'd much rather be Indiana Jones prowling around in the temple than Rambo stuck in a preparation montage; Indy, at least, was actually getting stuff done.
Wow, this article went a little off the rails. I was never good at writing intros to new code... anyway.
Not too long ago I was looking at interfacing some of my search tech with the CIA's World Factbook (anonymized link), but, as is not uncommon, they don't have any way of getting at the data other than loading the site in a browser or writing a scraper for it. No RSS feed, no API, which sort of makes sense when you consider that it was meant as a reference work. Load it up, maybe run a search, and go about your business. A coder named Ian Coleman wrote a utility which periodically scrapes the entire Factbook and writes the contents out as a JSON document. Once I discovered this, I spent some time trying to figure out how best to use it and threw some of my usual tools at it. To say that it just wasn't working is probably being polite. So, a few days ago I decided to sit down and write a custom tool that did exactly what I wanted. Let's look at a little snippet of what I'm dealing with:
{ "countries": { "world": { "data": { "name": "World", "introduction": { "background": "Globally..." }, "geography": { "overview": "The surface of the earth...", "map_references": "Physical Map of the World", "area": { ... }, ... } }, "metadata": { "date": "2019-06-24", "source": "https://web.archive.org/web/20190624073508/https://www.cia.gov/library/publications/the-world-factbook/geos/xx.html", "nearby_dates": "https://web.archive.org/web/20190624000000*/https://www.cia.gov/library/publications/the-world-factbook/geos/xx.html" } },As you can see, we have keys ("countries", "world", "data", and so forth) and values ("Physical Map of the World", "2019-06-24", et al) organized hierarchically. It seems pretty straightforward, logical, and internally consistent. But I didn't have a tool that would let me explore the data the way I wanted, which is to say "Let me use keys in a logical manner to find values with a minimum of bullshit." I didn't want to mess with having to write a parser or write any more JSONpaths than absolutely necessary, I wanted a thin layer over top of the file that let me treat it as a database with a REST API on top. So, I wrote one. I call it the Universal JSON Server.
I was going to call it something silly, like Fred the Frog (because, JSON -> Jason -> Blaster Master (what can I say, my mind is a weird machine)), but cute names and in-jokes that make sense to only two people on the entire Net don't make for good search terms. What I basically did was take another piece of code I'd written which implemented a small REST API and cut out the stuff I didn't need. In its place I added a little bit of code that loads the JSON document into a nested hash table, takes the URI it receives, and chops it up into a set of keys to traverse the hash table with. If a key matches, it drops one more level inward; if it doesn't, it throws an error. So, you can explore the data to find what you want, then save off the path and reference it in the future with as little pain in the ass as possible. I made the code as lightweight as possible: it has no external dependencies, which is to say that if you have a basic install of Python 3 you have everything the Universal JSON Server needs to run, and there are no config files to hand-hack.
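If you're curious what that looks like in code, here's a rough sketch of the approach rather than the actual UJS source: a standard-library-only server that loads whatever file you point --json at, splits each request path into keys, and walks the nested dictionaries one key at a time. The hard-coded port and the listing of top-level keys at the root mirror the examples below; everything else is my own simplification.

#!/usr/bin/env python3
# Rough sketch of the idea, not the real Universal JSON Server: load a JSON
# document into nested dictionaries, then walk them using the components of
# each request's path. Python 3 standard library only.

import argparse
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class JSONRequestHandler(BaseHTTPRequestHandler):
    document = {}   # the loaded JSON document, set at startup

    def do_GET(self):
        # Chop "/countries/belarus/data" up into ["countries", "belarus", "data"].
        keys = [key for key in self.path.split("/") if key]
        node = self.document
        for key in keys:
            if isinstance(node, dict) and key in node:
                node = node[key]    # key matched, so drop one more level inward
            else:
                self.send_error(404, "No matching key: %s" % key)
                return
        if not keys:
            # Hitting the root just lists the top-level keys so you can start exploring.
            node = sorted(self.document.keys())
        body = json.dumps(node, indent=4).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--json", required=True, help="JSON document to serve")
    args = parser.parse_args()
    with open(args.json, "r") as jsonfile:
        JSONRequestHandler.document = json.load(jsonfile)
    HTTPServer(("localhost", 11000), JSONRequestHandler).serve_forever()

Here's how I start it up for my intended use case: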
{14:01:02 @ Sun Jul 28} [drwho @ windbringer universal_json_server] () $ ./universal_json_server.py --json ~/cia_world_factbook_api/data/factbook.json

Let's look at some top-level keys:
{12:29:55 @ Sun Jul 28} [drwho @ windbringer ~] () $ curl http://localhost:11000/
[ "countries", "metadata" ]

Nice. Let's take a closer look at the document metadata. Paste the string "metadata" onto the end of the URL we pass to cURL:
{14:02:07 @ Sun Jul 28} [drwho @ windbringer ~] () $ curl http://localhost:11000/metadata
{
    "date": "2019-06-24",
    "parsed_time": "2019-07-02 02:05:00 UTC",
    "parser_version": "0.0.3-beta"
}

Whee. Isn't that nice. Now let's grab the timestamp of when this dump of the CIA World Factbook was made:
[drwho @ windbringer ~] () $ curl http://localhost:11000/metadata/parsed_time
"2019-07-02 02:05:00 UTC"

Using these somewhat trivial examples as object lessons, we can drill down into the data for a very specific piece of information with the same techniques. I wonder what the budget of Belarus was the last time the Factbook was updated...
{14:07:29 @ Sun Jul 28} [drwho @ windbringer ~] () $ curl http://localhost:11000/countries/belarus/data/economy/budget
{
    "date": "2017",
    "expenditures": {
        "units": "USD",
        "value": 20570000000
    },
    "revenues": {
        "units": "USD",
        "value": 22150000000
    }
}

You can do this kind of drilling down with just about anything that speaks HTTP, from a web browser to custom tools you've written to communicate with API services (there's a quick example of the latter at the end of this post). Yes, the UJS has a couple of bugs. There are some edge cases that I haven't accounted for yet, which I'll get to as I run into them with different data sets. If you happen to find any, please open a bug ticket to let me know about it. Or, better yet, file a pull request with a fix so I can incorporate it.
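As promised, here's that last query again from Python instead of curl, as a quick illustration of talking to the UJS from your own code. It assumes the server is still running on port 11000 as in the examples above, and uses nothing but the standard library:

#!/usr/bin/env python3
# Fetch the Belarus budget figures from a running Universal JSON Server
# instance (assumed to be listening on localhost:11000, as above).
import json
from urllib.request import urlopen

url = "http://localhost:11000/countries/belarus/data/economy/budget"
with urlopen(url) as response:
    budget = json.loads(response.read().decode("utf-8"))

print(budget["expenditures"]["value"])   # 20570000000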