eddology • by edd dumbill

If your brain works the way mine does, you often find yourself doing a variety of related but not necessary tasks along the way to fulfilling a larger goal. For some reason, the history of which I’m not entirely sure, this is called “yak shaving.” Some of my finest work has been done that way, so I’m happy to embrace it, even if I do add hours to the job at hand.

This week’s yak shaving has neared epic proportions, but proved eminently satisfying. I’ve been working on an overview of big data products. It’s early days yet for Hadoop-centered offerings, but 2012 will be the year that Microsoft and Oracle join IBM and EMC Greenplum in having launched products in the marketplace.

Surveying who-offers-what is a bit like writing code. I naively started off with a spreadsheet, but I quickly realized that real-life data doesn’t fit into grids. Even if you can coerce it that way, the effort required to get there is massive. Instead, I wanted a structured description of the products that I could amend and refactor as the whole landscape came into perspective.

To get this done, I revived an old interest. Some years ago I contributed a little to Dave Beckett’s Redland RDF project, and so I returned to this technology. RDF is simply a machine-readable way of writing descriptions about things.

The task at hand was to collect knowledge about each vendor and their products, and have this in a format that could be used to create feature tables and comparison charts.

In my little project, each vendor has a file in which I’ve written a machine-readable description of their products. Here’s a little excerpt from my description of Cloudera:

<http://www.cloudera.com/hadoop/>
    a :HadoopDistribution ;
    dc:title "Cloudera's Distribution including Apache Hadoop" ;
    :homePage <http://www.cloudera.com/hadoop/> .

You can probably figure out this is the beginning of a description of Cloudera’s Hadoop Distribution. This is in the Turtle dialect of RDF, which makes the whole thing a lot more readable. I’m not an RDF purist, and don’t have plans yet to publish this data publicly. I was simply using the technology as a shorthand for writing down a large amount of semi-structured information.
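
If Turtle's shorthand is new to you, it may help to see the excerpt spelled out as plain subject-predicate-object triples. The sketch below does that in Python. The `a` keyword is standard Turtle for `rdf:type`; the excerpt doesn't show its prefix declarations, so the `DC` and `EX` namespace values here are assumptions made for illustration, and `expand` is a hypothetical helper rather than part of any RDF library.

```python
# RDF_TYPE is what Turtle's "a" keyword stands for. The other two prefixes
# are not declared in the excerpt, so these bindings are assumptions.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
DC = "http://purl.org/dc/elements/1.1/"  # assumed binding for "dc:"
EX = "http://example.org/bigdata#"       # hypothetical binding for ":"


def expand(subject, pairs):
    """Expand a Turtle-style 'subject p1 o1 ; p2 o2 .' group into triples."""
    return [(subject, predicate, obj) for predicate, obj in pairs]


cloudera = "http://www.cloudera.com/hadoop/"
triples = expand(cloudera, [
    (RDF_TYPE, EX + "HadoopDistribution"),
    (DC + "title", "Cloudera's Distribution including Apache Hadoop"),
    (EX + "homePage", cloudera),
])

# Each ";" in the Turtle source starts another statement about the same subject.
for triple in triples:
    print(triple)
```

Every statement shares one subject, which is why the Turtle version only needs to name it once.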

After collating all the data into a Redland database, I was able to write a small amount of Python code that queried the data, fed it into some templates, and spat out some chunks of HTML for inclusion in my article. For the curious, I used a combination of SPARQL and the regular Redland APIs to query the data, and chose the jinja2 templating engine to create my HTML. Excerpted screenshots are attached to this post.
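
For a rough feel of that query-then-template step, here is a sketch that uses only Python's standard library. It is not the Redland, SPARQL, or jinja2 code described above: a list of tuples stands in for the RDF store, a small pattern matcher stands in for the query, and `string.Template` stands in for the templating engine. The predicate names, the second product, and the helper functions are all hypothetical.

```python
from html import escape
from string import Template

# Hypothetical vocabulary and sample data. Only the Cloudera title and URL
# come from the post; the second entry is invented for contrast.
TYPE, TITLE, HOME = "rdf:type", "dc:title", ":homePage"

TRIPLES = [
    ("http://www.cloudera.com/hadoop/", TYPE, ":HadoopDistribution"),
    ("http://www.cloudera.com/hadoop/", TITLE,
     "Cloudera's Distribution including Apache Hadoop"),
    ("http://www.cloudera.com/hadoop/", HOME, "http://www.cloudera.com/hadoop/"),
    ("http://example.org/other-product", TYPE, ":Database"),
    ("http://example.org/other-product", TITLE, "Example <Other> Product"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]


def value(triples, subject, predicate, default=""):
    """Return the first object for (subject, predicate), or a default."""
    found = match(triples, s=subject, p=predicate)
    return found[0][2] if found else default


ROW = Template('<tr><td><a href="$home">$title</a></td></tr>')


def render_table(triples, rdf_class):
    """Render one HTML table row per resource of the given class."""
    rows = []
    for subject, _, _ in match(triples, p=TYPE, o=rdf_class):
        rows.append(ROW.substitute(
            title=escape(value(triples, subject, TITLE)),
            home=escape(value(triples, subject, HOME, default=subject)),
        ))
    return "<table>\n" + "\n".join(rows) + "\n</table>"


table_html = render_table(TRIPLES, ":HadoopDistribution")
print(table_html)
```

The shape is the useful part: the data stays in one place, and each table or chart is just another query plus another template.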

Yes, I could have chosen XML & XSLT for this, or JSON, or, or… but that’s not really the point. The point was to have fun and learn stuff at the same time as achieving my goals. I brushed up on my Python, got some SPARQL experience in, and have content that can be easily kept up to date and repurposed.

Oh, and if you’re going anywhere near RDF technology, I can recommend Redland as a pragmatic and straightforward toolkit.

Look out for the Hadoop survey to be published tomorrow!