From the perspective of someone who uses government data pretty often, Data.gov and its state progeny (Massachusetts, D.C., Minnesota – the “data deli” is a great name – among them) are better than what we used to have. They make the acquisition of data by journalists, regulated communities and the general public much easier than it has been. But there are two related issues with such efforts that I can see: in general, both producers and consumers of government data tend to operate in vacuums.
Many governments want to put data online. It’s a bit of a PR win for them, and it does provide a service that otherwise might occupy the time of a government employee who has other tasks to accomplish. But, as Rob Goodspeed notes, many of them are trying to figure out what exactly to post online. And in dealing with that question, the transparency movement isn’t exactly providing a lot of clarity, or even the right kind of input.
Rob accurately reports that the common answer to the question of what to post is “everything,” but he quickly points out that this isn’t possible in many cases. And sometimes – I’m looking at you, Data.gov – posting everything actually obscures the absence of good stuff. Sure, it seems impressive that nearly 275,000 datasets are available on Data.gov, but that has the secondary effect of making the most popular dataset on the site (as I write this) the “US Topo 7.5-minute map for Imperial, TX.” Seriously, more than 200 people have downloaded it.
The state of Data.gov – which will greatly influence state and local government data efforts – is skewed in a way that gives rise to frustration among users of government data. Even folks in the transparency community – who previously celebrated the sheer number of datasets released – are starting to recognize this situation for what it is. Ellen Miller, executive director of the Sunlight Foundation, wrote that “[t]he torrent of data we expected to see at Data.gov isn’t materializing.” She cites a colleague’s calculation that 99 percent of the files available there are GIS data, which, while useful, are not quite as accessible as, say, a CSV or Excel file. Other datasets are flawed in significant ways, rendering them nearly useless. More Miller: “Call it a hot Friday afternoon, and maybe I’m cranky, but I think it’s time to begin to ask some tough questions of the White House. Whither or wither transparency?” I’m sure many folks who work with government data have thought or uttered similar sentiments.
But before we prepare the tar and feathers, let’s pause for a bit and consider a few things. First, what is Data.gov supposed to be? Here’s what the site says: “The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”
The only bit you can argue with there is the term “high value.” And yes, Sunlight and its allies have a point. There’s very little in the way of data from the Justice or Interior departments (although the DOJ’s regular survey of jails is quite well represented, dating back to 1985). Since Data.gov is an executive branch joint, there’s nothing from Congress or the judicial branch, either, but that’s pretty clearly stated.
So the valid objection is that the tens of thousands of datasets on Data.gov tend to clutter up the joint, and some of the best data isn’t there. To which I say, “Hello, my name is crucial information. I’m really valuable, which is why it’s probably hard to get your hands on me.” In Washington, as in state capitals, this is no surprise at all. Data.gov is the fulfillment of the phrase “low-hanging fruit”: you can fill up on it, but there always will be stuff out of reach.
Rob Goodspeed suggests five categories of data that should come first in government transparency efforts. They’re all worthy candidates, and it’s important to note that many are related to current policy questions. But leaving it up to the government to decide which data fulfills those criteria (or others) – which is what we’re really doing when we emphasize the sheer number of datasets available – seems to be inviting more of the same.
So what can the good folks at Sunlight and others interested in a more transparent and accountable government do? One option would be to stop highlighting the number of datasets and start focusing on quality. The National Data Catalog is an interesting and useful project with a great goal, but you know what would be better to do in parallel? The National Really Useful Data Catalog. An NRUDC (admittedly, a lousy acronym) would include a targeted selection of existing datasets *and* a list of really useful data that isn’t currently available. It would focus less on the number of datasets available and more on what’s good, bad and ugly about ones of particular note. What’s more, the NRUDC would be more than a catalog – it would include a group of people tasked with investigating, testing and analyzing the uses and flaws of those datasets. People like, say, Sunlight’s Reporting Group.
Not saying the Reporting Group doesn’t do good work. But the various data catalogs are crying out for some semblance of an editor function – someone who can say, “Hey, don’t bother with this, it’s missing key elements,” or “This agency’s forms contain much more information than is available in the data, and here’s why.” What we get now are lists of datasets, sometimes with links to documentation.
Some of this information can already be found in, say, the NICAR-L archives or IRE’s tipsheets, where reporters have helped each other to understand the caveats of using this dataset or that. But in general, we haven’t done enough to make this kind of helpful information for users available to the broader public. Sunlight and other groups could really help with that effort, and in the process make a valuable contribution to meaningful transparency, the kind that isn’t so impressed by big numbers.
And what about government’s role? It’s the rare person who works with data who doesn’t want users of that information to understand how it works and why. By digging deeper into individual datasets instead of growing horizontally toward an ever-larger number of them, it might be possible to put the expertise of those individuals to better use. Yes, it’s far easier to count every dataset and say you’re accomplishing transparency. But what kind of transparency, and to what end?