Big Numbers, Low Impact
June 27th, 2010 | Published in Data, Fed Data | 6 Comments
From the perspective of someone who uses government data pretty often, Data.gov and its state progeny (Massachusetts, D.C., Minnesota – the “data deli” is a great name – among them) are better than what we used to have. They make the acquisition of data by journalists, regulated communities and the general public much easier than it has been. But there are two related issues with such efforts that I can see: in general, both producers and consumers of government data tend to operate in vacuums.
Many governments want to put data online. It’s a bit of a PR win for them, and it does provide a service that otherwise might occupy the time of a government employee who has other tasks to accomplish. But, as Rob Goodspeed notes, many of them are trying to figure out what exactly to post online. And in dealing with that question, the transparency movement isn’t exactly providing a lot of clarity, or even the right kind of input.
Rob accurately reports that the common answer to the question of what to post is “everything,” but he quickly points out that this isn’t possible in many cases. And sometimes – I’m looking at you, Data.gov – posting everything actually obscures the absence of good stuff. Sure, it seems impressive that nearly 275,000 datasets are available on Data.gov, but that has the secondary effect of making the most popular dataset on the site (as I write this) the “US Topo 7.5-minute map for Imperial, TX.” Seriously, more than 200 people have downloaded it.
The state of Data.gov – which will greatly influence state and local government data efforts – is skewed in a way that gives rise to frustration among users of government data. Even folks in the transparency community – who previously celebrated the sheer number of datasets released – are starting to recognize this situation for what it is. Ellen Miller, executive director of the Sunlight Foundation, wrote that “[t]he torrent of data we expected to see at Data.gov isn’t materializing.” She cites a colleague’s calculation that 99 percent of the files available there are GIS data, which, while useful, are not quite as accessible as, say, a CSV or Excel file. Other datasets are flawed in significant ways, rendering them nearly useless. More Miller: “Call it a hot Friday afternoon, and maybe I’m cranky, but I think it’s time to begin to ask some tough questions of the White House. Whither or wither transparency?” I’m sure many folks who work with government data have thought or uttered similar sentiments.
But before we prepare the tar and feathers, let’s pause for a bit and consider a few things. First, what is Data.gov supposed to be? Here’s what the site says: “The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”
The only bit you can argue with there is the term “high value”. And yes, Sunlight and its allies have a point. There’s very little in the way of data from the Justice or Interior departments (although the DOJ’s regular survey of jails is quite well-represented, dating back to 1985). Since Data.gov is an executive branch joint, there’s nothing from Congress or the judicial branch, either, but that’s pretty clearly stated.
So the valid objection is that the hundreds of thousands of datasets on Data.gov tend to clutter up the joint, and some of the best data isn’t there. To which I say, “Hello, my name is crucial information. I’m really valuable, which is why it’s probably hard to get your hands on me.” In Washington, as in state capitals, this is no surprise at all. Data.gov is the fulfillment of the phrase “low-hanging fruit”: you can fill up on it, but there always will be stuff out of reach.
Rob Goodspeed suggests five categories of data that should come first in government transparency efforts. They’re all worthy candidates, and it’s important to note that many are related to current policy questions. But leaving it up to the government to decide which data fulfills those criteria (or others) – which is what we’re really doing when we emphasize the sheer number of datasets available – seems to be inviting more of the same.
So what can the good folks at Sunlight and others interested in a more transparent and accountable government do? One option would be to stop highlighting the number of datasets and start focusing on quality. The National Data Catalog is an interesting and useful project with a great goal, but you know what would be better to do in parallel? The National Really Useful Data Catalog. An NRUDC (admittedly, a lousy acronym) would include a targeted selection of existing datasets *and* a list of really useful data that isn’t currently available. It would focus less on the number of datasets available and more on what’s good, bad and ugly about ones of particular note. What’s more, the NRUDC would be more than a catalog – it would include a group of people tasked with investigating, testing and analyzing the uses and flaws of those datasets. People like, say, Sunlight’s Reporting Group.
Not saying the Reporting Group doesn’t do good work. But the various data catalogs are crying out for some semblance of an editor function – someone who can say, “Hey, don’t bother with this, it’s missing key elements,” or “This agency’s forms contain much more information than is available in the data, and here’s why.” What we get now are lists of datasets, sometimes with links to documentation.
Some of this information can already be found in, say, the NICAR-L archives or IRE’s tipsheets, where reporters have helped each other to understand the caveats of using this dataset or that. But in general, we haven’t done enough to make this kind of helpful information for users available to the broader public. Sunlight and other groups could really help with that effort, and in the process make a valuable contribution to meaningful transparency, the kind that isn’t so impressed by big numbers.
And what about government’s role? It’s the rare person who works with data who doesn’t want users of that information to understand how it works and why. By digging deeper into individual datasets instead of growing horizontally toward an ever-larger number of them, it might be possible to extend the expertise of those individuals to better use. Yes, it’s far easier to count every dataset and say you’re accomplishing transparency. But what kind of transparency, and to what end?
June 27th, 2010 at 6:37 pm
I’ve had similar thoughts. I get the sense that Data.gov represents the cleaning out of some agencies’ data closets. “Hey look — here’s something we can post!” But the most helpful data to journalists probably remains that which we have to file a FOIA request to get or have a deeper knowledge of agency workings to know about at all.
June 28th, 2010 at 10:00 am
Great post. I think what people are realizing is that there is a difference between data and information. I recently created the Center for Digital Information (http://digitalinfo.org), which focuses not just on government research and data but on policy research generally, including think tanks, agencies, foundations, nonprofits, etc. The goal is to start making these important distinctions among “research,” “data” and “information,” terms that are often used synonymously. To qualify as “information,” I maintain it needs to be effectively *communicated* (in digital media). Perhaps that’s where data distribution such as this falls short of information?
June 28th, 2010 at 12:42 pm
Amen.
June 28th, 2010 at 1:28 pm
Bravo, Derek. I agree with a good deal of what you say here, and rest assured, the Reporting Group is on top of this issue. We funded a project, led by Jim Morris at our old alma mater the Center for Public Integrity, to identify high value data sets that aren’t being released (called the Data Mine–check it out here: http://www.publicintegrity.org/data_mine/). We’ve been critical of the Open Government Directive data sets that have been released to date: See here for an example (http://reporting.sunlightfoundation.com/2010/ogd-labor-releases-five-enforcement-datasets/).
We’re pretty much in favor of pushing the envelope and, yes, asking for everything. We know why we ask for everything: until we journalists get our grubby little paws on the data, we don’t know what’s there and what’s not there, and what kind of stories it can tell. That’s the problem with creating an NRUDC: We don’t know what data is going to be valuable tomorrow. As I keep saying, until a plane crash-lands in the Hudson River, how many people want to see the FAA’s bird strike database? Until a mine explosion or a deepwater drilling disaster, how many people want to dig through MMS data?
One of the key concepts behind the National Data Catalog (which, by the way, is still in its alpha phase) is that the data would be curated. Developers, journalists and others (and yes, Sunlight staff will be providing a ton of curation ourselves) can comment on datasets, talking about their quality, caveats, uses, potential pitfalls, errors, and so on. Think of it as an annotated Data.gov that, by the way, can also import congressional, judicial, state, local and even third party (e.g., OpenSecrets.org) data.
Something else I’ve been up to for Sunlight: Working with my colleagues towards a definition of what accountability data actually is. Because I’m no good at abstract thinking, I’ve tried to work backwards from specific examples. I’ll be very curious to know what you think — I’ll ping you when I post.
June 28th, 2010 at 1:39 pm
Nice post, Derek.
One idea that came to me recently is to stop using the word “data” entirely — it just confuses things. We, as journalists, tend to want records. These are the actual administrative byproducts of running or monitoring a program or an agency.
“Data”, at least as envisioned in data.gov and elsewhere, is usually cleansed and anonymous or aggregated statistics crosscut in some useful ways, but too far away from the original source to make its flaws apparent and its uses meaningful. It’s usually not as accurate as the original, since it’s been created solely for public consumption instead of as a way to do agency work.
Maybe we should press agencies to get their own recordkeeping in order, building in transparency, rather than press them for relatively convenient access to dubious data. That sounds harsh, but in the long run might make government more open and useful.
June 28th, 2010 at 4:09 pm
“what accountability data actually is”
A good place to start might be here: http://www.icgfm.org/, the International Consortium on Governmental Financial Management. They’re trying to come up with standards of accounting and transparency, including the kind of records that should be made public. I agree that Data.gov is disappointing — you get better records from individual agencies (like the SEC’s Edgar database).
Maybe coming from an academic background makes this more obvious, but for me it’s always been clear that data refers to raw facts, sometimes gathered together in a database. Information is processed, and has value added, usually by spotting patterns or supplying context. Information has meaning.