A GitHub for Data?
July 31st, 2010 | Published in Data | 3 Comments
Clay Johnson, late of Sunlight Labs and now writing at the splendidly-named InfoVegan, says that what the “Open Data” movement needs is a better way to store data on the Web. Something like a GitHub for data:
Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?
Technically, there’s not much reason why this couldn’t happen. Sure, some government datasets are very large, and some are in arcane and oddball formats, but these are technical problems that can be overcome. But the biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.
In a sense, a GitHub for data could help solve this problem, too, because you can write documentation, and many GitHub projects have excellent documentation. But there are also projects with very limited documentation – heck, some of them are mine. The biggest obstacle to better apps is that so few people really understand the data and its pitfalls. I’d like to see what Clay wants to see, too, but right now I’m more interested in:
gitdata install census-2010
If the person executing that command is, say, Paul Overberg.
That’s not to say that I’m in favor of a situation where only those with expertise have access to data. What I’m saying is that the very act of what Clay describes as a hassle:
A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.
Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
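As a rough illustration of how much the import step itself can teach you, here is a minimal Python sketch that loads a CSV into SQLite and keeps notes on the quirks it trips over along the way. Everything specific in it is an assumption made for the example: the sample rows, the column names, the `import_with_notes` helper and the `raw_data` table name are all hypothetical, and a real government file would need its own rules.

```python
import csv
import io
import sqlite3
from collections import Counter

# Hypothetical sample standing in for a downloaded government file.
SAMPLE = """county,population,median_income
Adams, 18728 ,41250
Baker,,38900
Clark,N/A,(X)

Dixon,6000,52,100
"""


def import_with_notes(text, conn, table="raw_data",
                      numeric=("population", "median_income")):
    """Load CSV text into SQLite and return notes on the quirks found.

    `numeric` names the columns we assume should hold whole numbers;
    that is an assumption of this sketch, not something the file tells us.
    """
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    notes = Counter()
    rows = []
    for fields in reader:
        # Prune: skip rows that are entirely blank.
        if not any(f.strip() for f in fields):
            notes["blank rows pruned"] += 1
            continue
        # Fix (or at least flag): rows that do not match the header,
        # e.g. an unquoted comma inside a number.
        if len(fields) != len(header):
            notes["rows with wrong field count"] += 1
            continue
        # Massage: strip stray whitespace around each value.
        cleaned = [f.strip() for f in fields]
        for name, value in zip(header, cleaned):
            if value == "":
                notes[f"{name}: empty"] += 1
            elif name in numeric and not value.isdigit():
                # Placeholder codes such as "N/A" end up counted here.
                notes[f"{name}: non-numeric"] += 1
        rows.append(cleaned)
    # Convert: store everything as text so nothing is silently coerced.
    columns = ", ".join(f'"{h}" TEXT' for h in header)
    conn.execute(f'CREATE TABLE "{table}" ({columns})')
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    return dict(notes)


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    for quirk, count in import_with_notes(SAMPLE, connection).items():
        print(f"{quirk}: {count}")
```

None of those notes would appear in a typical data dictionary; they surface only because someone had to push the rows through an importer, which is the kind of learning I have in mind.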
August 2nd, 2010 at 10:32 am
Amen. The most ready examples would be for already-open data created by government agencies. However, I see this as amazingly useful for academic and private research as well. There could be a feature like GitHub organizations where you only provide access to a research team, etc.
I’m definitely in favor of more transparent data, but there is a lot of data that can’t be open based on funding sources. However, the same concept you’re advocating would be great for distributed teams to make processing and analysis much faster. This could be a great way to speed up the time for peer review.
January 19th, 2011 at 12:54 pm
Have you seen datapkg: http://blog.okfn.org/2010/02/23/introducing-datapkg/
datapkg is a user tool for distributing, discovering and installing data (and content) ‘packages’.
datapkg is a simple way to ‘package’ data building on existing packaging tools developed for code (e.g. Debian apt, PyPI, CRAN, Gems, CPAN). datapkg is designed to integrate closely with the CKAN (Comprehensive Knowledge Archive Network).
In terms of the big picture, datapkg is the “apt-get/aptitude/dpkg” part of the vision for a ‘Debian of Data’ (i.e. scalable, distributed, open data infrastructures! — for more see this post or these recent slides):
Documentation: http://packages.python.org/datapkg/
October 31st, 2011 at 5:28 pm
If you’re looking for a GitHub for data, you should check out http://BuzzData.com