The Scoop

How APIs Help the Newsroom

July 12th, 2010  |  Published in API | Comments (3)

As nice as it is to get praised for the civic-mindedness of your work, the not-so-secret secret about APIs at The Times is that we’re the biggest consumer of them. The flexibility and convenience that the APIs provide make it easier to cut down on repetitive manual work and bring new ideas to fruition. Other news organizations can do the same.

This week, for example, we launched a page to track Republican senators’ positions on the nomination of Elena Kagan to the Supreme Court. The fabulous graphics department has done things like this in the past, such as with the House vote on health care. Both of those graphics were assembled from lots of different pieces of information – electoral results and previous votes among them – and the Kagan data includes stuff like whether the senator in question is running for re-election this year.

You could, of course, ask people to gather up all that information, but if you’re going to do something like this more than once, it makes sense to have a way to automate as much as possible. That’s where the APIs come in. For the Kagan graphic, we used the NYT Congress API to pull in information on senators and their votes, which leaves the gathering of information about their statements on Kagan as the lone manual task. In other words, only the stuff that is specific to this app requires manual effort.
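That split between automated and hand-gathered data can be sketched in a few lines of Python. This is a rough illustration, not the code behind the Kagan page: the records are made-up placeholders rather than real API output or real positions, and the field names (`id`, `name`, `position`) are assumptions.

```python
# Hypothetical records standing in for what an API call might return.
api_members = [
    {"id": "A000001", "name": "Senator A", "party": "R"},
    {"id": "B000002", "name": "Senator B", "party": "R"},
]

# The only hand-maintained piece: positions keyed by member id.
manual_positions = {"A000001": "Opposes"}


def merge_positions(members, positions, default="Undeclared"):
    """Attach a hand-entered position to a copy of each member record."""
    return [dict(m, position=positions.get(m["id"], default)) for m in members]


tracker = merge_positions(api_members, manual_positions)
```

Anything that comes from the API can be refreshed by re-running the request; only `manual_positions` needs a person to keep it current.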

Similarly, the new Districts API we released plays well with our other APIs, so that I was able to build a simple demo app that takes advantage of the fact that our Congress API, among others, can return the current member for a particular district.

For newsrooms, the utility of APIs goes beyond creating Web apps. Making data available via APIs is a little like giving the newsroom the ability to ask and answer questions without having to tie down a CAR person for long periods of time. APIs can provide data in whatever format you choose, which means that a wider range of people can take advantage, from graphic artists used to working with XML to reporters comfortable with CSV files. When your data is more accessible and flexible, the possibilities for doing things with it expand.
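To make the "whatever format you choose" point concrete, here is a minimal sketch that writes the same records as JSON, CSV and XML using only Python's standard library. The rows, the field names and the `results`/`result` tag names are all invented for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Made-up rows for illustration only.
rows = [
    {"candidate": "Candidate A", "office": "Mayor", "total_raised": 1000},
    {"candidate": "Candidate B", "office": "Mayor", "total_raised": 2500},
]


def to_json(records):
    """Serialize the records as a JSON array."""
    return json.dumps(records, indent=2)


def to_csv(records):
    """Serialize the records as CSV text (assumes at least one record)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records, root_tag="results", row_tag="result"):
    """Serialize the records as XML, one child element per field."""
    root = ET.Element(root_tag)
    for record in records:
        row = ET.SubElement(root, row_tag)
        for key, value in record.items():
            ET.SubElement(row, key).text = str(value)
    return ET.tostring(root, encoding="unicode")
```

The data only has to be gathered once; which serializer runs can depend on who is asking.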

So if you have a big local election coming up, having an API for candidate summary data makes it easier to do a quick-and-dirty internal site for reporters and editors to browse, but also gives graphics folks a way to pull in the latest data without having to ask for a spreadsheet. Chances are that if serious data analysis is what you need, that’ll be done in some desktop application or database server anyway. The API is just a messenger, albeit one that is always on and able to spawn lots of ideas and experiments.

If you’re looking to build an API, remember that it’s just a Web application delivering data in a structured format (XML and JSON being two popular formats these days). There are lots of options in terms of what you use to build and serve an API, so it’s important to pay attention to the design: which information you’ll deliver, and how. Being a significant user of your own API is really important, too; it’ll give you the best sense of how well you’ve designed your responses, and what you might be missing.
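As a rough sketch of "just a Web application delivering data in a structured format," here is about the smallest API that will run on Python's standard library alone. It is not how any of the Times APIs are built. The candidate data, the `office` query parameter and the `num_results`/`results` response keys are assumptions made up for this example.

```python
import json
from urllib.parse import parse_qs

# Placeholder data; a real API would read from a database.
CANDIDATES = [
    {"name": "Candidate A", "office": "mayor"},
    {"name": "Candidate B", "office": "council"},
]


def app(environ, start_response):
    """WSGI app: return candidates as JSON, optionally filtered by ?office=..."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    office = query.get("office", [None])[0]
    results = [c for c in CANDIDATES if office is None or c["office"] == office]
    body = json.dumps({"num_results": len(results), "results": results}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]
```

To try it locally, hand `app` to `wsgiref.simple_server.make_server("127.0.0.1", 8000, app)` and call `serve_forever()` on the result. The design questions from the paragraph above show up even here: which fields to return, how to filter, and what the response envelope looks like.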

Big Numbers, Low Impact

June 27th, 2010  |  Published in Data, Fed Data | Comments (6)

From the perspective of someone who uses government data pretty often, Data.gov and its state progeny (Massachusetts, D.C., Minnesota – the “data deli” is a great name – among them) are better than what we used to have. They make the acquisition of data by journalists, regulated communities and the general public much easier than it has been. But there are two related issues with such efforts that I can see: in general, both producers and consumers of government data tend to operate in vacuums.

Many governments want to put data online. It’s a bit of a PR win for them, and it does provide a service that otherwise might occupy the time of a government employee who has other tasks to accomplish. But, as Rob Goodspeed notes, many of them are trying to figure out what exactly to post online. And in dealing with that question, the transparency movement isn’t exactly providing a lot of clarity, or even the right kind of input.

Rob accurately reports that the common answer to the question of what to post is “everything,” but he quickly points out that this isn’t possible in many cases. And sometimes – I’m looking at you, Data.gov – posting everything actually obscures the absence of good stuff. Sure, it seems impressive that nearly 275,000 datasets are available on Data.gov, but that has the secondary effect of making the most popular dataset on the site (as I write this) the “US Topo 7.5-minute map for Imperial, TX.” Seriously, more than 200 people have downloaded it.

The state of Data.gov – which will greatly influence state and local government data efforts – is skewed in a way that gives rise to frustration among users of government data. Even folks in the transparency community – who previously celebrated the sheer number of datasets released – are starting to recognize this situation for what it is. Ellen Miller, executive director of the Sunlight Foundation, wrote that “[t]he torrent of data we expected to see at Data.gov isn’t materializing.” She cites a colleague’s calculation that 99 percent of the files available there are GIS data, which while useful, are not quite as accessible as, say, a CSV or Excel file. Other datasets are flawed in significant ways, rendering them nearly useless. More Miller: “Call it a hot Friday afternoon, and maybe I’m cranky, but I think it’s time to begin to ask some tough questions of the White House. Whither or wither transparency?” I’m sure many folks who work with government data have thought or uttered similar sentiments.

But before we prepare the tar and feathers, let’s pause for a bit and consider a few things. First, what is Data.gov supposed to be? Here’s what the site says: “The purpose of Data.gov is to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government.”

The only bit you can argue with there is the term “high value”. And yes, Sunlight and its allies have a point. There’s very little in the way of data from the Justice or Interior departments (although the DOJ’s regular survey of jails is quite well-represented, dating back to 1985). Since Data.gov is an executive branch joint, there’s nothing from Congress or the judicial branch, either, but that’s pretty clearly stated.

So the valid objection is that the tens of thousands of datasets on Data.gov tend to clutter up the joint, and some of the best data isn’t there. To which I say, “Hello, my name is crucial information. I’m really valuable, which is why it’s probably hard to get your hands on me.” In Washington, as in state capitals, this is no surprise at all. Data.gov is the fulfillment of the phrase “low-hanging fruit”: you can fill up on it, but there always will be stuff out of reach.

Rob Goodspeed suggests five categories of data that should come first in government transparency efforts. They’re all worthy candidates, and it’s important to note that many are related to current policy questions. But leaving it up to the government to decide which data fulfills those criteria (or others) – which is what we’re really doing when we emphasize the sheer number of datasets available – seems to be inviting more of the same.

So what can the good folks at Sunlight and others interested in a more transparent and accountable government do? One option would be to stop highlighting the number of datasets and start focusing on quality. The National Data Catalog is an interesting and useful project with a great goal, but you know what would be better to do in parallel? The National Really Useful Data Catalog. A NRUDC (admittedly, a lousy acronym) would include a targeted selection of existing datasets *and* a list of really useful data that isn’t currently available. It would focus less on the number of datasets available and more on what’s good, bad and ugly about ones of particular note. What’s more, the NRUDC would be more than a catalog – it would include a group of people tasked with investigating, testing and analyzing the uses and flaws of those datasets. People like, say, Sunlight’s Reporting Group.

Not saying the Reporting Group doesn’t do good work. But the various data catalogs are crying out for some semblance of an editor function – someone who can say, “Hey, don’t bother with this, it’s missing key elements,” or “This agency’s forms contain much more information than is available in the data, and here’s why.” What we get now are lists of datasets, sometimes with links to documentation.

Some of this information can already be found in, say, the NICAR-L archives or IRE’s tipsheets, where reporters have helped each other to understand the caveats of using this dataset or that. But in general, we haven’t done enough to make this kind of helpful information for users available to the broader public. Sunlight and other groups could really help with that effort, and in the process make a valuable contribution to meaningful transparency, the kind that isn’t so impressed by big numbers.

And what about government’s role? It’s the rare person who works with data who doesn’t want users of that information to understand how it works and why. By digging deeper into individual datasets instead of growing horizontally toward an ever-larger number of them, it might be possible to extend the expertise of those individuals to better use. Yes, it’s far easier to count every dataset and say you’re accomplishing transparency. But what kind of transparency, and to what end?

Using the NYT Congress API with … Excel?

May 11th, 2010  |  Published in Car Tools, XML | Comments (2)

It’s true that Excel has been a decreasing part of my toolkit for several years now, and that I never quite had the love for it that I do for various database managers. But I’m guessing that’s the exception, not the rule, in the broader journalism community. So when it came time to propose a lightning talk for the 2010 CAR Conference last week, I chose to pull out the ol’ spreadsheet and show how you could get started with the NYT’s Congress API with a familiar tool.

To do this, I had to not only drag out Excel but also do it on Windows, since Excel’s Web Query feature isn’t available on the Mac. (You could also do this, albeit in a slightly different manner, using OpenOffice and Google Spreadsheets. In the comments, Chris Amico shows you how using Google Spreadsheets.) Here’s how it works using Excel.

First, you’ll need an API key. To get one, go to The Times Developer Network and register (note: you’ll need to be a registered user of nytimes.com first).

You’re registering an “application”, and then you can add specific API keys to that account. Let’s add one for the Congress API. The key itself is a longish string of letters and numbers that gets appended to every API request URL, including the ones we’ll make from Excel. Let’s copy the API key so we can easily grab it (note that this particular key has been disabled, so using it won’t work).

Let’s find an API call that we can use by looking at the Congress API’s documentation. Let’s pick the “members leaving office” response, otherwise known as the casualty list. All that’s required is the chamber (‘house’ or ‘senate’) and the congress (currently only the 111th is supported). If we choose the House, the URL will look like this, except that you’ll need to specify your Congress API Key.

The version number should be “v3” and you don’t need to specify a format after “leaving” (xml is the default). You should quickly get an xml file that looks roughly like this:
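The request URL described above can also be assembled in code, which helps once you want more than one chamber. This is only a sketch: the host is a placeholder and the order of the path segments is an assumption, so check the Congress API documentation for the real pattern. The pieces taken from this post are the version (“v3”), the chamber, the congress number, the “leaving” endpoint with its optional format, and the API key appended to the URL.

```python
from urllib.parse import urlencode

# ASSUMPTION: placeholder host and path layout; confirm both against the API docs.
PLACEHOLDER_BASE = "https://api.example.com/politics"


def members_leaving_url(api_key, chamber="house", congress=111,
                        version="v3", fmt=None, base_url=PLACEHOLDER_BASE):
    """Build a 'members leaving office' request URL (hypothetical layout)."""
    if chamber not in ("house", "senate"):
        raise ValueError("chamber must be 'house' or 'senate'")
    # The format suffix is optional; the post says xml is the default.
    endpoint = "leaving" if fmt is None else f"leaving.{fmt}"
    path = f"{base_url}/{version}/congress/{congress}/{chamber}/members/{endpoint}"
    return f"{path}?{urlencode({'api-key': api_key})}"


url = members_leaving_url("YOUR_API_KEY")
```

Swapping `chamber="senate"` gives the Senate version of the same request, and the resulting string is what you would paste into Excel's import box.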

To get that xml into Excel, we’re going to use Excel’s Import Data feature. I’m not one of those cool kids who has Excel 2007 at their fingertips, so I’m going to use Excel 2002. Import Data can be found at Data -> Get External Data -> Import Data.

Then change the file type to xml and paste the full API url into the box just above the file type.

It works for local files and Web urls. Then click on “Open” to start the process. The import process consists of Excel asking you where to put the file. Just click “OK” and you should soon see something like this:

The header row in row 2 isn’t perfect, but it should suffice. You probably don’t need the copyright statement in column A. But now you’ve got a way to pull data into Excel from an API! If you have questions or comments, please don’t hesitate to post them below. If you’re having issues with the API, the forum is the best place to head.
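For readers who would rather script the same import, here is a hedged Python equivalent that flattens an XML response into rows and writes them out as CSV, which Excel opens directly. The sample XML is invented for illustration; the element names (`result_set`, `member`, `status` and so on) are assumptions and not the API's actual response.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Made-up XML standing in for an API response; the element names are assumptions.
SAMPLE_XML = """<result_set>
  <copyright>Copyright (c) Example</copyright>
  <results>
    <members>
      <member><id>X000001</id><name>Member One</name><status>Retiring</status></member>
      <member><id>Y000002</id><name>Member Two</name><status>Defeated in primary</status></member>
    </members>
  </results>
</result_set>"""


def xml_to_rows(xml_text, row_tag):
    """Turn every <row_tag> element into a dict of its child tags and text."""
    root = ET.fromstring(xml_text)
    return [
        {child.tag: (child.text or "").strip() for child in row}
        for row in root.iter(row_tag)
    ]


def rows_to_csv(rows):
    """Write rows as CSV text, using the first row's keys as the header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = xml_to_rows(SAMPLE_XML, "member")
print(rows_to_csv(rows))
```

Because only the repeating `member` elements become rows, the copyright statement never lands in the spreadsheet and the header row comes straight from the tag names.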

An Even Better CAR Conference?

May 3rd, 2010  |  Published in IRE | Comments (7)

It’s dangerous to blog late at night, so take what follows with a grain of salt. It’s more stream of (semi-)consciousness than anything else, but I’m curious what other folks think. Note: I wrote this before Aron’s thread about conferences on Hacks/Hackers, but some of it relates to that question, too.

Give IRE a lot of credit for what is a difficult task: putting on a computer-assisted reporting conference that appeals to both novice and expert. The CAR confab in Phoenix in March, like every year’s, tries to bring together a set of journalists who can teach each other and themselves about everything from basic spreadsheets to Web frameworks. For the most part it works, even if it means that a sizable chunk of the attendees are also instructors and speakers (often for more than one session). The energy and enthusiasm at Phoenix was great to see, and the sessions had a good variety of topics and formats.

But an IRE veteran would notice that quite a few of the big names in CAR work weren’t in Phoenix, or even at the last few conferences. Shrinking travel budgets are a factor, of course, but the IRE conferences have always had a segment of people who paid their own way. I suspect that for a small number of “high-end” CAR practitioners, however, the CAR conference doesn’t offer them much anymore, because of its long-standing tradition of appealing to a broad base of people.

That tradition is no bad thing at all: journalism needs to keep bringing in more and more people who want to learn these techniques, and expose those who already have some experience to greater challenges. This isn’t a call to drop introductory sessions. But I wonder if there aren’t some changes that could be made to make the conference irresistible for those who don’t see many chances for growth in the schedule.

For example, there were at least three sessions this year generically focused on “new tools” for reporting: machine learning, advanced methods and new frontiers. What if, instead, we blocked out some afternoon time on Thursday and actually tried out some of this software together? Bring a laptop and some legislation, and with a group of people figure out entity extraction and other classification techniques and then present it later in the conference and/or write it all up for Uplink. What if we voted on some new federal or state dataset and ran the traps on it together, finding out its pitfalls and uses, or brainstormed about better tools for newsrooms? What if some sessions were recast to produce something – the best documentation for a particular data source, for example – rather than a collection of tipsheets that might never be assembled into a coherent guide (or say, a beat book)? What if we turned the evening bar sessions – ok, ok, too much change. But still.

It’ll be difficult to appeal to absolutely everyone, but if we made it easier to do more than talk for 50 minutes at a time, perhaps by providing the opportunity to get together with a range of folks and produce something that we couldn’t do alone, IRE might be able to attract even more people new and old. There are an increasing number of people attending the CAR conference who are in a position to evaluate and develop tools for newsrooms, and they want to do this. Pairing them with folks who have spent years combing through data and documents while reporting can only be a good thing – we might end up with a base FEC data parser that newsrooms could customize, or the best set of documentation for IRS migration data, or even some cool dashboards to help reporters spot trends. Maybe we could designate a theme for a particular conference.

If you spotted the influence of open-source development in this post, you’d be a keen observer. One of IRE’s defining moments, the Arizona Project, was in part about doing a public service by marshaling a wide set of talent. It was a fairly radical act of selflessness that not all IRE members agreed with, but to me it represents a key strength of the organization: collaborative learning. A lot of open source software projects out there would kill for the dedication of IRE members. What else can we do together so that we all benefit even more?

2010 CAR Conference

March 7th, 2010  |  Published in IRE | Comments (2)

The 2010 CAR Conference begins on Thursday, and here are some of the sessions I’m trying not to miss:

Thursday, March 11

  • Big Data: Analyzing legislation with machine learning. Always good to hear what Chase Davis has been up to.
  • Open Source GIS. Now that mapping is more and more accessible, it pays to stay on top of what people are using.
  • Juice up your stories with advanced methods. After a few months in the academy, Sarah Cohen should have some good stuff to share.
  • Some lightning talks!

Friday, March 12

  • Semantic tagging and DocumentCloud. Really need to get more in-depth on this.
  • New Frontiers in Reporting Tools. More of the New New Stuff.
  • Forensic Accounting for Reporters. Nice to have some outside expertise.

Saturday, March 13

  • GeoDjango & OpenLayers. Yes, I’m on this panel. But you should come anyway to hear from Ben Welsh of the LA Times.
  • Not a programmer? Not a worry. The new software from ProPublica that helps publish data on the Web. Very interested to see this in action.

There’s also the Django bootcamp and plenty of opportunities for demos, discussions and debates. Hope to see you there!

Previously


Feb 23, 2010
A Gentle Introduction to Google App Engine

by Derek | Read | No Comments

As part of our roll-out of version 3 of the NYT Congress API, I was tasked with coming up with a sample application that uses the API to do something mildly interesting, or at least functional. I had gotten a book on Google App Engine for my birthday and was pretty excited to see that [...]


Feb 18, 2010
Lightning Talks at NICAR

by Derek | Read | 2 Comments

This year’s computer-assisted reporting conference in Phoenix has a couple of new sessions on the schedule. One of them is an idea a couple of us have been pushing for a few years: lightning talks. A staple of technical conferences, lightning talks are based on the notion that while 45-50 minute presentations are good, sometimes [...]

About The Scoop

Derek Willis’ weblog on investigative and computer-assisted reporting.

©2010 The Scoop