The Scoop


A GitHub for Data?

July 31st, 2010  |  Published in Data  |  3 Comments

Clay Johnson, late of Sunlight Labs and now writing at the splendidly-named InfoVegan, says that what the “Open Data” movement needs is a better way to store data on the Web. Something like a GitHub for data:

Why can I not type into a console gitdata install census-2010 or gitdata install census-2010 --format=mongodb and have everything I need to interface with the coming census data?

Technically, there’s not much reason why this couldn’t happen. Sure, some government datasets are very large, and some are in arcane and oddball formats, but these are technical problems that can be overcome. But the biggest issue, for data-driven apps contests and pretty much any other use of government data, is not that data isn’t easy to store on the Web. It’s that data is hard to understand, no matter where you get it.

In a sense, a GitHub for data could help solve this problem, too, because it gives you a place to write documentation, and many GitHub projects have excellent documentation. But there are also projects with very limited documentation; heck, some of them are mine. The biggest obstacle to better apps is that so few people really understand the data and its pitfalls. I’d like to see what Clay wants to see, too, but right now I’m more interested in:

gitdata install census-2010

If the person executing that command is, say, Paul Overberg.

That’s not to say that I’m in favor of a situation where only those with expertise have access to data. What I’m saying is that the very act of what Clay describes as a hassle:

A developer has to download some strange dataset off of a website like data.gov or the National Data Catalog, prune it, massage it, usually fix it, and then convert it to their database system of choice, and then they can start building their app.

Is in fact what helps a user learn more about the dataset he or she is using. Even a well-documented dataset can have its quirks that show up only in the data itself, and the act of importing often reveals more about the data than the documentation does. We need to import, prune, massage, convert. It’s how we learn.
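
To make that concrete, here is a minimal sketch in Python of the kind of check an import step tends to force on you. The sample rows, the column names and the profile_csv helper are all invented for illustration; real Census files have their own layouts and their own markers for missing or suppressed values, so treat this as one possible approach rather than a recipe.

```python
import csv
import io
from collections import Counter


def profile_csv(text):
    """Return, per column, a count of blank cells and a tally of non-numeric values."""
    reader = csv.DictReader(io.StringIO(text))
    report = {name: {"blank": 0, "non_numeric": Counter()} for name in reader.fieldnames}
    for row in reader:
        for name, value in row.items():
            if name is None:
                # Extra cells beyond the header row; ignored in this sketch.
                continue
            value = (value or "").strip()
            if not value:
                report[name]["blank"] += 1
                continue
            try:
                # Allow thousands separators such as "18,728".
                float(value.replace(",", ""))
            except ValueError:
                report[name]["non_numeric"][value] += 1
    return report


# Hypothetical sample: the columns and the "(X)" and "N/A" markers are made up.
SAMPLE = """county,population,median_income
Adams,"18,728",41250
Baker,(X),38900
Clark,,N/A
"""

report = profile_csv(SAMPLE)
for column, stats in report.items():
    print(column, stats["blank"], dict(stats["non_numeric"]))
```

A text column like county will show every value as non-numeric, which is expected. The interesting output is in the columns that ought to be numbers: a blank here, a placeholder there. Those are the quirks you end up chasing back to the documentation, or to someone who knows the data.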

Responses

  1. Dave Stanton says:

    August 2nd, 2010 at 10:32 am (#)

    Amen. The most ready examples would be for already-open data created by government agencies. However, I see this as amazingly useful for academic and private research as well. There could be a feature like GitHub organizations where you only provide access to a research team, etc.

    I’m definitely in favor of more transparent data, but there is a lot of data that can’t be open based on funding sources. However, the same concept you’re advocating would be great for distributed teams to make processing and analysis much faster. This could be a great way to speed up the time for peer review.

  2. Rufus Pollock says:

    January 19th, 2011 at 12:54 pm (#)

    Have you seen datapkg: http://blog.okfn.org/2010/02/23/introducing-datapkg/

    datapkg is a user tool for distributing, discovering and installing data (and content) ‘packages’.

    datapkg is a simple way to ‘package’ data building on existing packaging tools developed for code (e.g. Debian apt, PyPI, CRAN, Gems, CPAN). datapkg is designed to integrate closely with the CKAN (Comprehensive Knowledge Archive Network).

    In terms of the big picture, datapkg is the “apt-get/aptitude/dpkg” part of the vision for a ‘Debian of Data’ (i.e. scalable, distributed, open data infrastructures! — for more see this post or these recent slides):

    Documentation: http://packages.python.org/datapkg/

  3. malcolm tesla says:

    October 31st, 2011 at 5:28 pm (#)

    if you’re looking for a github for data, you should check http://BuzzData.com

©2012 The Scoop