The Scoop

On Legislative Data Transparency

February 4th, 2012  |  Published in Data, Fed Data  |  5 Comments

This week I was honored to speak at the Legislative Data and Transparency Conference put on by the Committee on House Administration. If you’re so inclined, the videos of the presentations are online at the conference site, although I must warn you that they contain heavy doses of XML references and other fun stuff. What follows is not my presentation, strictly speaking, but most of it along with some other thoughts.

Being a former Congressional Quarterly staffer, I have an innate fondness for House Admin, one of the lesser-known committees but one with a large influence over what kinds of information the public can see about the House side of the legislative process. The committee has jurisdiction over the Library of Congress, which by extension means Thomas, the online home of so much Congressional information.

There are many other posts about what folks committed to transparency want from Congress, but Daniel Schuman of the Sunlight Foundation sums it up pretty well: “To the maximum extent possible, legislative information must be available online, in real time, and in machine readable formats.”

I don’t disagree, and I am sympathetic to complaints that Congress has been slow to address the availability of bulk data. People such as Josh Tauberer have been screen-scraping Thomas since 2004, and I joined in the process a year later at washingtonpost.com. In 2012, we’re both still doing it, now joined by Sunlight, OpenCongress and who knows how many others (speaking of OpenCongress, if you want a less patient restatement of Schuman’s thoughts, OC’s David Moore has a stem-winder of a post for you).

I, too, long for the day when I don’t have to wonder when my HTML parsers will break after a seemingly innocuous change to Thomas’ styles, or when I don’t have to enter three different IDs for a new Senator (Bioguide, LIS and Thomas’ own unique sequential number). But my presentation on Thursday concentrated on a more fundamental need. Before bulk data can become really useful, it has to be more consistent, understandable and accurate. Right now, if you’re not willing to put in a lot of time studying the quirks of Congress, you will always face the likelihood that your data, however lovingly collected, has plenty of errors.

For example, in the Senate it is possible for the Majority Leader and Minority Leader to alter the rules of math when it comes to how many senators constitute a three-fifths majority. The death of Sen. Ted Kennedy in 2009 reduced the number of Democrats in the chamber at that time to 59, and the total number of senators “duly elected and sworn” to 99. For votes requiring a three-fifths majority (thanks, Malcolm), a 99-member Senate would need 59.4 senators for passage, or at least 59. But the party leaders agreed to keep the three-fifths threshold at 60 votes throughout the period when the Senate had 99 senators, not 100. For much of that period, any three-fifths vote displayed on nytimes.com had the wrong number of votes required for passage, because I was relying on math. I could not find any place in the Congressional Record or anywhere else where this agreement was documented.
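To make the arithmetic concrete, here is a minimal sketch of how a vote display might compute the threshold. It is not the code that ran on nytimes.com; the function names and the `agreed` parameter are invented for illustration. How a fractional result should be rounded is itself an assumption, so the sketch makes the rounding explicit and lets an agreed number win over the math.

```python
from fractions import Fraction
import math


def raw_threshold(sworn, fraction=Fraction(3, 5)):
    """Exact, unrounded number of votes implied by the fraction."""
    return fraction * sworn


def required_votes(sworn, fraction=Fraction(3, 5), rounding=math.ceil, agreed=None):
    """Votes needed for passage.

    `agreed` is a hypothetical override for a threshold set by agreement
    rather than by arithmetic; when given, it is returned unchanged.
    """
    if agreed is not None:
        return agreed
    return rounding(raw_threshold(sworn, fraction))


print(float(raw_threshold(99)))                  # 59.4
print(required_votes(99, rounding=math.floor))   # 59
print(required_votes(99, rounding=math.ceil))    # 60
print(required_votes(99, agreed=60))             # 60
```

The point of the override is that the number in force cannot always be derived from the membership count alone; it has to be recorded somewhere as data.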

An edge case, you might say. But when it comes to Congress, there are loads of them. A reporter called me several weeks ago to ask why a seemingly simple question about three members of her state’s delegation was maddeningly hard to answer. All three had been elected to the House the same year, and had served since then. But each of them had a different total number of votes he or she had been eligible to cast. How could that be?

It took me a little while, but the only explanation I could find was that their dates of service had to differ in some way, and my guess was that not all of them were sworn in for each session on the same day. It happens. Unfortunately, neither Thomas nor the Clerk of the House provides an easy way to find out when a particular member was sworn in, despite the fact that it is a basic element of what makes someone a Member of Congress. At the conference, I heard someone say that it would be possible to provide a list of swearing-in dates for every lawmaker. That’s good, and needed, but it’s not good enough. I need, and the data demands, timestamps in this case. That’s the only way I can be sure of what votes a member was or was not eligible to vote on.
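A small sketch shows why a date alone is not enough. The roll calls and swearing-in times below are made up for illustration, as is the `eligible_votes` helper; nothing here comes from Thomas or the Clerk. Two members sworn in on the same day, a few hours apart, end up with different eligible totals, which a date-only field could never explain.

```python
from datetime import datetime


def eligible_votes(sworn_in, votes, left_office=None):
    """Return the roll call numbers held while the member was in office.

    `votes` is a list of (roll_call_number, timestamp) pairs.
    """
    return [
        number
        for number, held_at in votes
        if held_at >= sworn_in and (left_office is None or held_at < left_office)
    ]


# Invented roll calls, all on one invented day.
votes = [
    (1, datetime(2011, 3, 1, 14, 0)),
    (2, datetime(2011, 3, 1, 16, 30)),
    (3, datetime(2011, 3, 1, 18, 45)),
]

# Two hypothetical members sworn in on the same date, hours apart.
member_a_sworn = datetime(2011, 3, 1, 12, 0)
member_b_sworn = datetime(2011, 3, 1, 15, 0)

print(eligible_votes(member_a_sworn, votes))  # [1, 2, 3]
print(eligible_votes(member_b_sworn, votes))  # [2, 3]
```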

You might think that you could find the total number of House votes for a given year by looking at the Clerk’s votes site. In 2011, the last vote was roll call 949. Alas, officially, there were 948 votes that year, because roll call 484 was vacated and replaced by vote 485, and thus never really happened.
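In code, the difference between the highest roll call number and the official count looks something like this. The `official_vote_count` function is a hypothetical helper, and it assumes someone has already compiled the list of vacated roll calls by hand; the figures are the ones cited above.

```python
def official_vote_count(last_roll_call, vacated=()):
    """Count roll calls that officially happened, skipping vacated numbers."""
    vacated_in_range = {n for n in vacated if 1 <= n <= last_roll_call}
    return last_roll_call - len(vacated_in_range)


print(official_vote_count(949))                 # 949
print(official_vote_count(949, vacated=[484]))  # 948
```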

In my presentation I cited a few other examples, but they mostly boil down to this: unless we can make congressional information easier to use and understand by people outside the small circle of legislative wonks, bulk data access by itself won’t solve our problems. Today the most creative uses of congressional information, such as Sunlight’s Capitol Words project, suffer from this limitation. I love Capitol Words, but right now the Congressional Record – the source for it – cannot reliably tell me in a machine-readable form whether a particular word or phrase or speech was even spoken out loud on the floor of the House or Senate. That’s kind of a big deal, for reporters, historians and the public.

If we can’t use congressional data to answer what should be straightforward questions, or can’t agree on what the answers should be, providing immediate access to that data in bulk form may not be as helpful as we would think, and in some cases risks adding to the confusion. It may expose more of those problems, which is of some usefulness, but if the ultimate goal is not just access but understanding, we need to address the fundamental issues of accuracy and consistency before we switch on the firehose.

Responses

  1. Malcolm Tredinnick says:

    February 4th, 2012 at 11:35 pm

    Um… the senate supermajority number for cloture votes is a three-fifths, not two-thirds. Otherwise it would require 66 votes in a 99 member Senate. It was changed in 1975, so it’s probably appropriate to move on from the older name now. :-)

  2. Eric Mill says:

    February 6th, 2012 at 11:32 pm

    It seems to me that releasing bulk data, even flawed data, would help people like us discover and draw attention to errors more quickly and effectively. If an agency commits to publish data, they have a more obvious responsibility to address problems in it, and critics have a more compelling podium.

    It makes more sense to represent the goals of accuracy and parse-ability as symbiotic, than as in tension.
