The Scoop

Lost in the Weeds

May 13th, 2012  |  Published in Campaign Finance, Car Tools, Work | Comments (1)

The indefatigable Alex Howard posted a link today about a draft academic paper on open source and journalism by Nikki Usher of George Washington University and Seth Lewis of the University of Minnesota-Twin Cities. Alex’s tweets are worth a look, so I pulled up the paper and began reading.

Although I didn’t finish graduate school, I have written my fair share of academic papers, so I’m not a complete novice in the area. And despite the fact that it is a draft, as Usher points out, the idea that you would post even a draft on the web and then profess surprise when someone links to it or reads it is a little off, particularly for someone writing about a) journalism and b) open source. If you weren’t expecting critics, maybe it’s a good idea not to let them see the material. Update: in fairness, Nikki Usher was not aware that the paper was online at all. (I should also note that criticism is my full-time non-job; I probably should have sought work in the field, not due to any real talent on my part but based purely on personal enthusiasm.)

There is a good bit to like in the paper, which will be presented this month at the International Communication Association’s annual convention. There is a tendency for those working at the intersection of technology and journalism to focus on the tools – the stuff that actually exists now – rather than systemic changes to journalism itself (there are, ahem, some exceptions to this tendency). In part that’s because systemic changes are a hard problem too easily pushed to the background by the demands of doing journalism today. But it’s also in part because incremental changes, as the authors note, can be valuable in changing the whole. So, ok, I get it, and in many ways agree (except for the utopian one-open-source-CMS-to-rule-them-all idea – can we just kill that fantasy and focus on making information portable?).

Then I read the part about Fech. As a contributor to the project, I am admittedly biased in its favor, but I do not think my reaction is solely or even mainly based on that fact. Here’s what they wrote about the project:

The New York Times developed Fech, a tool that helps journalists crawl financial disclosures by political candidates just by knowing a filing number (Strickland, 2011). Just as the discourse around open source tools emphasizes their pro-social benefits, Fech’s creators note that more access to these filings will lead to better journalism. But Fech also gives one more tool to journalists eager for the horse-race style journalism that is divisive and counterproductive for democracy (Patterson, 1993; Cappella & Jamieson, 1997). There is another problem with what could be a strength of Fech: While the source code is posted on Github for other developers, the tool has been built to help people in the newsroom, not to encourage participation by ordinary people.

Where to begin? First, and perhaps least important, it vastly understates what Fech enables journalists and developers to do, particularly with regard to what can be done programmatically with disclosure data. I do not think that it is a stretch to say that the easier it is to examine and search campaign finance disclosures, the easier it will be for reporters and the public to discover interesting and useful pieces of information. Indeed, the use of Fech by news organizations like ProPublica, Reuters and The Associated Press – to say nothing of our use at The Times – has borne that out.

But here’s the line that got my back up: “But Fech also gives one more tool to journalists eager for the horse-race style journalism that is divisive and counterproductive for democracy.” The evidence for that? There isn’t any. This is pure speculation; I would argue that it is refuted by the examples I just cited. Can the authors — or anyone — cite an example of where Fech has been used to enable more horse-race style journalism (in its pejorative sense, which is what I assume the authors meant)? I’m 17 years out of graduate journalism school, but I’m pretty sure that assertions like that need a bit more than a citation to work that says horse-race journalism is bad for democracy. In theory, Fech could be used to run nuclear reactors, I guess, but since there is no evidence of that actually happening, I’m going to discount that as a possibility.

And finally, the authors write: “While the source code is posted on Github for other developers, the tool has been built to help people in the newsroom, not to encourage participation by ordinary people.” Well, yes and no. I would be hard-pressed to describe those people interested in campaign finance data as “ordinary,” but we open sourced Fech so that it could be used in newsrooms and in any other situation. I don’t understand how exactly the authors presume that we built it only to help people in the newsroom, or to discourage participation by non-newsroom folks. The fact that the contributors to Fech come mainly (but not exclusively) from newsrooms that cover campaigns is understandable to me (and to Jeff Larson). I’m not entirely clear how what we’ve done makes it harder for non-newsroom people to participate. I’d love to read about that, but there are no examples or further discussion in the paper (nor was any user of Fech contacted by the authors, from what I can tell. I wonder if they installed it and tried it themselves).

Usher, to her credit, offered several explanations as to why this passage was in the draft. They include the fact that this paper, like many, undergoes blind review. Fair enough, but it’s worth asking whether the reviewers are able to evaluate these claims, since most of them could be debunked by reading posts on Open. It also, like many of my own projects, seemed to have been a bit of a rush job. I know all about that, but was it really so difficult to talk to anyone involved in Fech? Finally, my objections, however valid, don’t damage the overall point of the paper, but reflect the possibility that I “may be getting lost in the weeds.”

That’s the tricky thing about journalism, data and even open source. Weeds matter. If you get the weeds wrong, the eventual result usually suffers. If I’m lost in the weeds, maybe the garden needs some attention.

Our Mark Knoller Problem

May 1st, 2012  |  Published in Journalism | Comments (0)

My colleagues at The Times (and other folks I know who cover the White House) tell me that Mark Knoller, the CBS Radio reporter who reports on the president, is a genuinely nice man and someone who has always been extraordinarily generous about sharing what he knows with other news organizations. Knoller is such a fixture at the White House – he’s covered every administration since Gerald Ford’s – that he’s moved beyond simply being a reporter into the realm of providing a public service: he’s very often cited in other outlets’ stories about presidential travel. A sample:

CBS’s Mark Knoller, who keeps detailed notes on Obama’s travels, recently told The New York Times that since the president filed for re-election, he’s taken 60 domestic trips and 26 of them involved fundraisers.

But Mark Knoller of CBS, the unofficial keeper of presidential work schedules, reported that President George W. Bush had taken more time off than Obama at this point in his first term.

According to presidential watcher Mark Knoller of CBS, George W. Bush, at this time of his presidency, had made 30 visits to his Texas ranch spanning all or part of 220 days. The Obama’s vacation day count is less than half of that.

This isn’t about Knoller as a person or as a reporter. It’s just that this situation – where one person has become the official source of public knowledge about the travels of the President of the United States – is far from ideal. Forget that the government is occasionally off-base on presidential travel statistics; how is it that other news organizations, including my own, have relied on a system in which one person – however diligent and generous – holds such important information?

From an information management standpoint, having Knoller be the keeper of presidential travel information is not only inefficient – what happens if Knoller is on vacation, or busy? – but makes it harder to regularly review the data or incorporate it into other inquiries. In reality, this is our problem, not Knoller’s, and his generosity has enabled us to carry on as if we’d been collecting this information all the time. But we haven’t. It’s easy enough to just ask Knoller, especially since we don’t use the information all that often.

We’re not talking about uncovering classified information here, but the daily whereabouts of the President of the United States. And yet somehow, every other news organization has decided that it’s perfectly ok not to have this information at its fingertips. It probably won’t happen as long as Knoller remains in his job, but what happens if someday CBS decides not to share that information anymore? Or Knoller decides he’s tired of doing this and retires? In the “weak link in the chain” scenario, the rest of us are the weak links, not him. He’s doing his part. Why are we shirking ours?

The Programmer-Reporter

April 21st, 2012  |  Published in Code, Data, Journalism | Comments (1)

Update: If you want a better visual presentation of this idea, check out Ben Welsh’s ISOJ presentation.

I finally have something tangible from work to show to my mother: an A1, above-the-fold story in today’s New York Times. It doesn’t really help explain what I do, but it’s something that’s a bit easier to understand than, say, a listing of git commits.

A colleague of mine at The Times, Michael Strickland, responded on Twitter: “Enough about the designer-programmer. More about the programmer-writer.”

As usual, he’s onto something, particularly when it comes to news organizations. Literate programming has been around quite a while, and I’m lucky enough to work with people who approach code in a way that seems closer to artistry than to engineering. I’m no expert on such things. But I was a full-time reporter for nearly a decade, and I still do my share of reporting, and that’s where I see the greater potential of applying programming: the Programmer-Reporter.

Let’s stipulate right up front that I, like a lot of folks, am a sucker for an Anne Hull story, the kind built on hours and hours of listening, watching and reflecting. What follows is no knock on what my former Palm Beach Post colleague Ron Hayes used to call “notebook-assisted reporting” built on talking to people, writing down what they say and then turning that into a great story. The only reason this post is not about that kind of journalism is that I was never really much good at it.

There are other stories to tell, and other ways to find them. I usually tell classes that I teach that if any of them can write like Hull or has the source development skills of, say, Bob Woodward, then they probably don’t need to learn what I’m teaching them. But the rest — the vast majority, from my experience — may want to pay attention.

A lot of daily beat reporting – from sports to government to business – relies heavily on reporters knowing the habits and schedules of the people and institutions they cover. Certain events happen in a relatively predictable pattern and a lot of the reporting revolves around keeping tabs on them. But news, the stuff we talk about, often consists of things that go against that pattern, the unusual event in a sea of regularity.

It follows then, that journalists should prize methods that would help unearth such anomalies, those needles in haystacks that we hold dear. Some do, and in other cases there are few real methods other than examining every document or attending every meeting. But way too often, across topics and beats, we remain unaware of or ignore practices that could help us spot news and make sense of the larger picture. If we have a system of story development, it’s a system that seems to value serendipity and entropy. Meanwhile, Donald Rumsfeld’s line about “known unknowns” remains stubbornly in effect.

This is true even in areas where reporting relies heavily on data, such as political campaigns. Many of the stories relating to campaign contributions, for example, are a result of reporters meticulously poring over pages of filings, applying the Potter Stewart test: “I’ll know it when I see it.” This is, in too many cases, a waste of time, since we often do have an idea of what we’re looking for, but believe in this idea of “data serendipity” when practice shows us that asking specific questions, or at least about specific ideas, is a better way to go. The easiest question for any source to deflect is, “Anything interesting going on?” Unfortunately, it’s also the easiest one to ask.

One thing that I’ve learned from writing software is that you don’t really want to “make news,” as it were. Predictability is a good thing, and edge cases – when things get weird or different – are what you want to avoid. Reporting, on the other hand, seeks out the edge cases, the departures from the norm. How to make those two come together? Here’s a way: make it possible to expect the unexpected when it comes to analyzing patterns in data. What we need are easily configurable systems that enable reporters to ask questions of data in a consistent manner and then provide results in a way that makes sense for journalists.

I’m not talking about bland TPS reports but something that turns a piece of data into a potential story. For example, if a local congressman has received donations from executives of XYZ Corp. every March in previous election years, but not this year, then that’s potentially newsworthy; maybe XYZ isn’t giving as much, but maybe they no longer support the local politician, or are a bellwether of a lack of business support. In this case, the absence of data – something that’s very hard for people to spot in pages of filings – is the trigger event that can cause a reporter to follow up.
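The XYZ Corp. scenario can be sketched in a few lines of Python. This is only an illustration of the “absence of data” check, not a description of any existing tool: the records, the names and the `missing_regulars` helper are all made up for the example, and a real version would read from actual filings.

```python
from collections import defaultdict
from datetime import date

# Hypothetical, made-up contribution records: (donor employer, recipient, date).
contributions = [
    ("XYZ Corp.", "Rep. Example", date(2008, 3, 14)),
    ("XYZ Corp.", "Rep. Example", date(2010, 3, 9)),
    ("ABC Inc.", "Rep. Example", date(2008, 3, 20)),
    ("ABC Inc.", "Rep. Example", date(2010, 3, 2)),
    ("ABC Inc.", "Rep. Example", date(2012, 3, 5)),
]


def missing_regulars(records, month, past_years, current_year):
    """Return employers that gave in `month` of every past year but not this year."""
    years_seen = defaultdict(set)
    for employer, _recipient, when in records:
        if when.month == month:
            years_seen[employer].add(when.year)
    return sorted(
        employer
        for employer, years in years_seen.items()
        if set(past_years) <= years and current_year not in years
    )


# Employers whose usual March donations did not show up in 2012.
print(missing_regulars(contributions, month=3, past_years=[2008, 2010], current_year=2012))
```

The result is a lead to follow up on, not a finding: a gap could mean a late check as easily as a change of heart.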

As much as journalists love to wax about serendipity, much of life is based on our habits and patterns, which are predictable enough to be tested against data. The same idea applies to scenarios that may not have happened in the past but could be defined and applied to the data. Reporters are testing out theories all the time, often by calling up sources and putting a question or theory out there. There’s no reason why we can’t enable the same ease of inquiry with data. In fact, it seems possible to do it on a much broader scale and in more precise ways.

Will this lead directly to stories? Not in many cases. Again, this is a typical process for reporters: float an idea, test its merits and polish. Rinse and repeat. If we could build interfaces that would assist in that process by making it easier to ask questions (or define patterns), how many more stories could we find? How many known unknowns could be crossed off the list?

A lot, I think. And despite the thrill of seeing your name on the front page, this is the thing that keeps me energized. This is where programmers can make an impact – either by diving into a subject area or pairing with reporters who have that expertise – by making it easier to find better stories. The news is out there, in most cases. Finding it, nailing it down, that’s the challenge. It is, in programming terms, a hard problem. But the programmers I know love to tackle hard problems.

Investigating House Freshmen Voting Patterns

March 23rd, 2012  |  Published in Fed Data, Work | Comments (0)

One of the great things about what is known as computer-assisted reporting is the chance it offers to prove or disprove conventional wisdom. To separate anecdote – no matter how compelling – from reality. My colleague Jennifer Steinhauer and I got an opportunity to do that last weekend using House voting data that we collect and make available via The New York Times Congress API.

The working question going into the story was whether, despite the headlines garnered by the freshman class of Republican lawmakers in the House, they were really the ones mainly responsible for opposing the GOP leadership, particularly House Speaker John Boehner. Jennifer first outlined a series of votes that attracted Republican opposition. (In the House, the majority’s first obligation is to satisfy most, if not all, of its caucus; a significant number of “No” votes from the majority’s ranks is a clear sign that there are some issues to resolve.) Then I pulled the vote data from our API, although the same data is available from other sources, too.

We looked at the freshmen Republicans as a voting bloc, and then at members of the Republican Study Committee, the largest sub-group within the House GOP. Since such organizations aren’t technically legislative committees, we had to rely upon the RSC’s own list of members (which turned out to add a wrinkle to the story). Here’s what we found:

But an analysis of voting patterns on the most contentious bills in the 112th Congress shows that House members of the Republican Study Committee — a group of both veterans and newcomers that meets weekly to hammer out a conservative agenda — have cast the bulk of “no” votes on big bills, including those important to Speaker John A. Boehner of Ohio.

The freshmen who have joined the study committee — which was founded in 1973 — play an important role in its renewed clout, having increased its membership to 163 from roughly 110 two years ago. As a group, however, the freshmen are less homogenous and less apt to buck the leadership than the study committee itself is as a whole.

It was relatively straightforward, as data analysis goes, since all we needed to do was pick the universe of votes, pull the data and identify freshmen and RSC members. Identifying freshmen lawmakers seems pretty easy, but there are a few questions to bear in mind. For example, are lawmakers who were elected in special elections before November 2010 freshmen? Are former members who reclaimed their seats in 2010 freshmen? What about people elected after 2011? (We said yes in all cases, which enlarges the usual number cited as the 2010 freshman class).
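For readers who want to try something similar, here is a rough Python sketch of the tallying step once the votes are chosen and the members are labeled. It is not the code used for the story, and the members, flags and positions below are invented; only the shape of the calculation is the point.

```python
# Hypothetical members; the ids and flags are made up for illustration.
members = {
    "A": {"freshman": True, "rsc": True},
    "B": {"freshman": True, "rsc": False},
    "C": {"freshman": False, "rsc": True},
    "D": {"freshman": False, "rsc": False},
}

# Hypothetical roll calls: vote id -> {member id: position}.
votes = {
    "vote-1": {"A": "No", "B": "Yes", "C": "No", "D": "Yes"},
    "vote-2": {"A": "No", "B": "Yes", "C": "No", "D": "No"},
}


def no_vote_share(members, votes, flag):
    """Fraction of all 'No' votes that were cast by members with `flag` set."""
    total = 0
    in_group = 0
    for positions in votes.values():
        for member_id, position in positions.items():
            if position == "No":
                total += 1
                if members[member_id][flag]:
                    in_group += 1
    return in_group / total if total else 0.0


print(no_vote_share(members, votes, "rsc"))
print(no_vote_share(members, votes, "freshman"))
```

Because a member can be both a freshman and an RSC member, the two shares overlap and will not add up to 1.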

With RSC members, it was a little more involved, in that an earlier version of the RSC member page indicated that membership had changed somewhat since last July. That became an element of the story:

Since then, over a dozen members have resigned from the study group, including a few freshmen. In interviews, some said that the episode had angered them and that they had tired of the committee’s attempts to define who is worthy of being called a conservative. Most were afraid to tackle the group on the record.

Vote analysis can seem like a dry exercise – there are so many votes, and most of them are fairly lopsided affairs. But voting is important not only for the individual decisions that lawmakers take, but for what it can tell us about the collective behavior of members.

On Legislative Data Transparency

February 4th, 2012  |  Published in Data, Fed Data | Comments (5)

This week I was honored to speak at the Legislative Data and Transparency Conference put on by the Committee on House Administration. If you’re so inclined, the videos of the presentations are online at the conference site, although I must warn you that they contain heavy doses of XML references and other fun stuff. What follows is not my presentation, strictly speaking, but most of it along with some other thoughts.

Being a former Congressional Quarterly staffer, I have an innate fondness for House Admin, one of the lesser-known committees but one with a large influence over what kinds of information the public can see about the House side of the legislative process. The committee has jurisdiction over the Library of Congress, which by extension means Thomas, the online home of so much Congressional information.

There are many other posts about what those folks committed to transparency want when it comes to Congress, but Daniel Schuman of the Sunlight Foundation sums them up pretty well: “To the maximum extent possible, legislative information must be available online, in real time, and in machine readable formats.”

I don’t disagree, and I am sympathetic to complaints that Congress has been slow to address the availability of bulk data. People such as Josh Tauberer have been screen-scraping Thomas since 2004, and I joined in the process a year later at washingtonpost.com. In 2012, we’re both still doing it, now joined by Sunlight, OpenCongress and who knows how many others (speaking of OpenCongress, if you want a less patient restatement of Schuman’s thoughts, OC’s David Moore has a stem-winder of a post for you).

I, too, long for the day when I don’t have to wonder when my HTML parsers will break after a seemingly innocuous change to Thomas’ styles, or when I don’t have to enter three different IDs for a new Senator (Bioguide, LIS and Thomas’ own unique sequential number). But my presentation on Thursday concentrated on a more fundamental need. Before bulk data can become really useful, it has to be more consistent, understandable and accurate. Right now, if you’re not willing to put in a lot of time studying the quirks of Congress, you will always face the likelihood that your data, however lovingly collected, has plenty of errors.

For example, in the Senate it is possible for the Majority Leader and Minority Leader to alter the rules of math when it comes to how many senators constitute a three-fifths majority. The death of Sen. Ted Kennedy in 2009 reduced the number of Democrats in the chamber at that time to 59, and the total number of senators “duly elected and sworn” to 99. For votes requiring a three-fifths majority (thanks, Malcolm), a 99-member Senate would need 59.4 senators for passage, or at least 59. But the party leaders agreed to keep the three-fifths threshold at 60 votes throughout the period when the Senate had 99 senators, not 100. For much of that period, any three-fifths vote displayed on nytimes.com had the wrong number of votes required for passage, because I was relying on math. I could not find any place in the Congressional Record or anywhere else where this was documented.
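The arithmetic is easy to reproduce. The Python sketch below shows how the same 59.4 turns into 59 or 60 depending on how it is rounded, and one way an agreed threshold could be recorded as data rather than derived. How the site actually computed its figure is not stated here, so the rounding-down path is an assumption, and `votes_required` with its `agreed` parameter is a hypothetical helper.

```python
import math
from fractions import Fraction

THREE_FIFTHS = Fraction(3, 5)


def computed_thresholds(sworn):
    """Return the exact three-fifths figure plus its rounded-down and rounded-up values."""
    exact = THREE_FIFTHS * sworn
    return float(exact), math.floor(exact), math.ceil(exact)


def votes_required(sworn, agreed=None):
    """Prefer a recorded agreement (hypothetical field) over arithmetic; else round up."""
    if agreed is not None:
        return agreed
    return math.ceil(THREE_FIFTHS * sworn)


print(computed_thresholds(99))       # exact, rounded down, rounded up
print(votes_required(100))           # full Senate
print(votes_required(99, agreed=60)) # threshold the leaders agreed to keep
```

Using `Fraction` sidesteps floating-point surprises in the multiplication; the larger point is that a threshold set by agreement has to be stored somewhere, because no formula will recover it.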

An edge case, you might say. But when it comes to Congress, there are loads of them. A reporter called me several weeks ago to ask why a seemingly simple question about three members of her state’s delegation was maddeningly hard to answer. All three had been elected to the House the same year, and had served since then. But each of them had a different number of total votes he or she was eligible to vote on. How could that be?

It took me a little while, but the only explanation I could find was that their dates of service had to differ in some way, and my guess was that not all of them were sworn in for each session on the same day. It happens. Unfortunately, neither Thomas nor the Clerk of the House provides an easy way to find out when a particular member was sworn in, despite the fact that it is a basic element of what makes someone a Member of Congress. At the conference, I heard someone say that it would be possible to provide a list of swearing-in dates for every lawmaker. That’s good, and needed, but it’s not good enough. I need, and the data demands, timestamps in this case. That’s the only way I can be sure of what votes a member was or was not eligible to vote on.
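To see why a date is not enough, consider this small Python sketch. The members, roll calls and times are invented; the comparison is the only thing being demonstrated.

```python
from datetime import datetime

# Hypothetical roll calls with the time each vote was held.
votes = [
    ("roll-1", datetime(2011, 1, 5, 14, 0)),
    ("roll-2", datetime(2011, 1, 5, 16, 30)),
    ("roll-3", datetime(2011, 1, 6, 10, 0)),
]

# Hypothetical swearing-in times: same day, different hours.
sworn_in = {
    "Member A": datetime(2011, 1, 5, 12, 0),
    "Member B": datetime(2011, 1, 5, 15, 0),
}


def eligible_by_timestamp(sworn_at, votes):
    """Votes held at or after the moment the member was sworn in."""
    return [vote_id for vote_id, held_at in votes if held_at >= sworn_at]


def eligible_by_date(sworn_at, votes):
    """Same check using only calendar dates, which loses the time of day."""
    return [vote_id for vote_id, held_at in votes if held_at.date() >= sworn_at.date()]


for name, sworn_at in sworn_in.items():
    print(name, len(eligible_by_timestamp(sworn_at, votes)), len(eligible_by_date(sworn_at, votes)))
```

With dates alone, both members appear eligible for all three votes; with timestamps, the member sworn in at 3 p.m. was not yet a member for the 2 p.m. vote.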

You might think that you could find the total number of House votes for a given year by looking at the Clerk’s votes site. In 2011, the last vote was roll call 949. Alas, officially, there were 948 votes that year, because roll call 484 was vacated and replaced by vote 485, and thus never really happened.
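In code, the difference is a single set subtraction, but only if the vacated roll call is recorded somewhere. A minimal Python sketch using the 2011 numbers above (the function name is made up):

```python
def official_vote_count(last_roll_call, vacated):
    """Count roll calls numbered 1..last_roll_call, skipping any that were vacated."""
    valid = set(range(1, last_roll_call + 1)) - set(vacated)
    return len(valid)


# 2011: the last roll call was 949, and roll call 484 was vacated.
print(official_vote_count(949, {484}))  # 948
```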

In my presentation I cited a few other examples, but they mostly boil down to this: unless we can make congressional information easier to use and understand by people outside the small circle of legislative wonks, bulk data access by itself won’t solve our problems. Today the most creative uses of congressional information, such as Sunlight’s Capitol Words project, suffer from this limitation. I love Capitol Words, but right now the Congressional Record – the source for it – cannot reliably tell me in a machine-readable form whether a particular word or phrase or speech was even spoken out loud on the floor of the House or Senate. That’s kind of a big deal, for reporters, historians and the public.

If we can’t use congressional data to answer what should be straightforward questions, or can’t agree on what the answers should be, providing immediate access to that data in bulk form may not be as helpful as we would think, and in some cases risks adding to the confusion. It may expose more of those problems, which is of some usefulness, but if the ultimate goal is not just access but understanding, we need to address the fundamental issues of accuracy and consistency before we switch on the firehose.

About The Scoop

Derek Willis’ weblog on investigative and computer-assisted reporting.
