The Scoop

What We Don’t Know About Elections

October 17th, 2011  |  Published in Data, Journalism, Presentations | Comments (6)

If you happened to be at the recent Online News Association conference in Boston and happened to attend the session on covering the 2012 elections, then a good bit of this will be repetitive. Since there wasn’t a ton of time to expand on what I said, and I don’t want to leave the impression that I’m critical of all election coverage, consider this the write-through.

First, I stand by what I said about how little we understand about the way that elections are won or lost these days. It’s not that political journalism has strayed from its roots, or stopped covering important elements of a modern campaign. It’s that the elements of a modern campaign have changed, and as journalists, we have not kept pace.

You might respond that campaigns still involve quite a lot of stuff that we do understand, such as debates and visits to state fairs and town hall meetings. True. But the nature of media and technology has brought extensive changes to the electoral system, and I don’t believe that we as journalists devote enough attention to understanding those changes. Remember the Dean campaign in 2003? Most of the coverage was on the, for then, staggering online fundraising managed by some doctor from Vermont. But that Wired piece I referenced had it right; Dean’s accomplishment was less a mastery of the Internet than a willingness to embrace its fundamental aspect: you give up some control by bringing other people in, and you gain a host of possibilities. You may, of course, choose badly or falter in some other way, but the lessons and possibilities are becoming clear. At the time, as a Web geek who loved politics, I felt that journalists couldn’t really explain the Dean campaign, because it was so alien to us. Today’s campaigns make me long for the simplicity of 2003.

But let’s stick with fundraising for a bit. Political fundraising can be hugely expensive, because campaigns need to amass a large number of donors. Unless you’re the President, it’s hard to repeatedly gather the wealthiest Americans and have them fork over $2,500 or more for the pleasure of your company. So a smart campaign sticks with what works: direct mail is costly, for example, but it’s also effective. Telemarketing takes time and money, but it also works pretty well. Let’s not mess with the script too much. But what if you can mess with the script? Now it’s possible, even trivial, to experiment with Web site design or even advertisements in order to gauge their effectiveness and improve upon them. President Obama had a Director of Analytics for his 2008 campaign, and has been hiring data scientists experienced in predictive modeling.

White men smoking cigars in cramped rooms making gut calls is how we’ve usually understood campaign decision-making. This? Whole new ballgame. Yes, there is still a mass audience that is shaped by the media and big events. But there are now thousands and thousands of “small” audiences – or rather, they always were there. Now campaigns can identify them and deliver precision messages to them. And they can find them online in different ways; an hour after posting on Twitter about the Obama campaign’s use of GitHub, the campaign’s Digital Director was following me. And that’s the easy part.

While campaigns have a public presence that is mostly recorded and observed, the stuff that goes on behind the scenes is so much more sophisticated than it has been. In 2008 we were fascinated by the Obama campaign’s use of iPhones for data collection; now we’re entering an age where campaigns don’t just collect information by hand, but harvest it and learn from it. An “information arms race,” as GOP consultant Alex Gage puts it.

For most news organizations, the standard approach to campaign coverage is tantamount to bringing a knife to a gun fight. How many data scientists work for news organizations? We are falling behind, and we risk not being able to explain to our readers and users how their representatives get elected or defeated.

None of this is to say that we need to completely abandon our ways of covering elections. Horse-race coverage is and should be a part of campaign coverage, because in many respects elections are like horse races. Things can change rapidly, and small things can have big impact. We still should be on the ground, talking to voters, showing up at town halls and covering debates. We still need to show up and do the legwork.

But if we can’t appreciate, much less understand, what modern campaigns are doing to win elections, how can we hope to explain elections? If we don’t collect at least some of the information available to us – realizing that we can’t get our hands on everything that the campaigns do – we’ll miss the story. Elections will become even bigger surprises to us, and then how long will it be before readers start to ask whether we actually know the people and places we cover?

Surprises make the news. Some of my favorite stories from the 2004 presidential election are in a book by my friends Peter Wallsten and Tom Hamburger, then of the Los Angeles Times. Here’s one anecdote from the key state of Ohio:

One suburban African American woman in Ohio, for example, told us that though she tends to vote Democratic, she was deluged in 2004 with calls, e-mail messages and other forms of communication by Republicans who somehow knew that she was a mother with children in private schools, an active church attendee, an abortion opponent and a golfer.

Think about what this kind of thing means. It means that we cannot assume that the campaign visible to the mass audience is the same campaign that’s being pitched to individuals and groups around the nation, and that winning coalitions can be built not just by harnessing large groups (unions, religious voters, etc.) but also by piecing them together in small units. President Bush’s margin in Ohio in 2004? About 2.5 percent. The only thing that I don’t like about this anecdote is that Wallsten and Hamburger’s book appeared nearly two years later. Is there any evidence that we as journalists have closed the gap since then?

To understand how elections are now being waged, we need to have as many of the tools as do the campaigns. We need to build our own storehouses of data – voter registration, voter history, Census, campaign finance, advertisements and more. We need to be able to tap into the rich stream of material that’s being created and disseminated every day. We need to be able to see the value in small data points that can lead to bigger things.

Elections are great stories. They deserve to be told from a position of confidence and knowledge. We have work to do.

RemoteTable Is Your Friend

October 4th, 2011  |  Published in Car Tools, Data, Ruby | Comments (0)

Assuming you regularly work with data found online – and if you don’t, you’re probably here by mistake, so welcome! – then you realize what a pain it can be to grab structured files from some site, save them and import them. I have more methods in more apps than I can count that download a CSV file, run it through a parser and then save some objects as a result. And it always seems to be a 2-3 step process.

If you’re a fan of the useful CSVKit, a command-line tool written in Python, but you’re a Rubyist, then please do yourself a favor and take a look at RemoteTable, a gem by Seamus Abshere. The two libraries aren’t entirely identical, and CSVKit has been rightly praised by others, so let me dive a little deeper into why RemoteTable is a data parser’s friend.

RemoteTable is essentially a set of useful wrappers around the common process I outlined above: grab a structured file and use its contents. Except that when I say “structured file”, I don’t just mean a CSV. I mean CSVs inside of zip files. Excel spreadsheets of the .xls and .xlsx varieties. Google Spreadsheets and Open Office spreadsheets. Web pages with HTML tables in them. XML. And, what the hell, fixed-width files, too. You’ll want to see the examples.

To those options it adds some useful utilities, much as CSVKit does. You can cut certain columns using the Unix cut utility, skip or crop rows using tail, make the entire file UTF-8 or remove “useless” characters.

It’s simple to try out. If you have Ruby installed and are not on Windows (sorry!), then gem install remote_table and wait a minute or two, as it installs quite a few dependencies for managing all this stuff. Then, in a console session, try grabbing a CSV file:

Each row is converted into an Ordered Hash so that you can refer to columns by name instead of position.
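The same idea, rows addressed by column name rather than by position, can be sketched outside of Ruby as well. What follows is a minimal Python illustration using only the standard library, not RemoteTable's API; the sample data and the `rows_from_csv` helper are made up for the example, and a real script would fetch the text from a URL first.

```python
import csv
import io

# Made-up sample data standing in for a downloaded CSV file.
SAMPLE_CSV = """name,state,amount
Smith,OH,2500
Jones,VT,100
"""

def rows_from_csv(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))

rows = rows_from_csv(SAMPLE_CSV)
for row in rows:
    # Columns are looked up by name, so reordering them upstream is harmless.
    print(row["name"], row["state"], row["amount"])
```

Note that every value comes back as a string, so numeric columns such as `amount` still need converting before you do any math on them.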

Now, what about a fixed-width file? Maybe one in which I’m only interested in some of the columns? Easy, thanks to RemoteTable’s nice DSL:

Notice that :cut option I passed? That’s where you include arguments to cut. In this case, I removed the financial columns from this file. You can also pass a :select option to it to do on-the-fly filtering of certain columns based on a matcher such as a regular expression or plain text.
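To make the cut-and-select idea concrete, here is a rough Python sketch of the same workflow: slice a fixed-width file into named columns, keep only some of them, and filter with a regular expression. It is not RemoteTable's DSL. The layout, the sample lines and the `parse_fixed_width` helper are all hypothetical, and I am assuming the filter is applied row by row.

```python
import re

# Hypothetical fixed-width layout: (column name, start offset, end offset).
LAYOUT = [("name", 0, 10), ("state", 10, 12), ("amount", 12, 20)]

# Made-up sample lines that follow the layout above.
SAMPLE_LINES = [
    "Smith     OH    2500",
    "Jones     VT     100",
]

def parse_fixed_width(lines, layout, keep=None, select=None):
    """Yield one dict per line, restricted to the columns named in `keep`
    and to the rows for which `select(row)` is true."""
    keep = set(keep) if keep else {name for name, _, _ in layout}
    for line in lines:
        row = {name: line[start:end].strip()
               for name, start, end in layout if name in keep}
        if select is None or select(row):
            yield row

# Drop the financial column and keep only states starting with "O".
rows = list(parse_fixed_width(
    SAMPLE_LINES,
    LAYOUT,
    keep=["name", "state"],
    select=lambda row: re.match(r"O", row["state"]) is not None,
))
print(rows)
```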

RemoteTable can tell you when you’re not being very smart about its use, too, such as causing it to re-fetch the source file multiple times (it warns you when it does this). It also handles local files, so if you want to avoid the downloading bit (or your files aren’t on the Web), that’s fine, too. Honestly, I have no idea why this library isn’t more popular, but you can help make it so on Github.

Measuring Vocabulary Richness (or, Trying Out Django on Heroku)

October 1st, 2011  |  Published in Code, Python | Comments (3)

When Heroku announced Python support this past week, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the Python Weekly newsletter caught my eye.

Swizec Teller’s entry, “Measuring vocabulary richness with Python”, described an algorithm by George Udny Yule in a 1944 paper entitled “The statistical study of literary vocabulary.” Yule created a way to quantify the diversity of vocabulary in a given text, and Teller translated that formula into straightforward Python code.

So I made a simple Django app that accepts text via a form and uses Teller’s code to calculate Yule’s I score of vocabulary richness. It uses the really useful Natural Language Toolkit; the only oddity is that when developing locally on a Mac, the standard installation of NLTK via pip is borked, so you need to specify a file to download in your requirements.txt. You can find the demo app here.
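For a sense of what the calculation involves, here is a small standard-library sketch. It assumes one common formulation of Yule's I, namely I = M1² / (M2 − M1), where M1 is the total number of words and M2 is the sum of the squared frequencies of each distinct word. It is not Teller's code or the demo app's code, and it skips NLTK entirely in favor of a crude regular-expression tokenizer, so the scores will not necessarily match either.

```python
import re
from collections import Counter

def yule_i(text):
    """Return Yule's I for `text`; higher values mean a richer vocabulary."""
    # Crude tokenizer: lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    freqs = Counter(words)
    m1 = len(words)                                      # total words
    m2 = sum(count * count for count in freqs.values())  # sum of squared frequencies
    if m2 == m1:
        # Every word appears exactly once, so the ratio is unbounded.
        return float("inf")
    return (m1 * m1) / (m2 - m1)

print(yule_i("the cat and the hat"))  # 12.5
```

In the five-word example, "the" appears twice and the other three words once each, so M1 is 5, M2 is 4 + 1 + 1 + 1 = 7, and the score is 25 / 2.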

I’m not offering a judgment on using Heroku or other instant deployment-type services; most of them seem pretty easy to use but out of my price range for anything significant. But it’s nice to know that services like Heroku, ep.io and others offer enough flexibility to do stuff like natural language parsing.

In Defense of Building Tools

August 10th, 2011  |  Published in Car Tools, Journalism, Work | Comments (10)

My first job in Web development was as a member of washingtonpost.com’s “Tools Team.” I was, in title if not in practice, a Tool.

Done snickering? Let’s move on.

The Tools Team built mostly internal applications and services that helped the Web site run better. I mainly got to work on front-facing projects like the Congress Votes Database, the 2008 presidential campaign and an innovative series on lobbyist Gerald Cassidy. But I did work on a few internal tools, and since I joined The Times in late 2007 I’ve built a few more. I’ve found that such tools are not so different from what we now consider to be journalism by Web development. Chosen wisely and done well, they can have impacts that go far beyond a single story or series. We should not dismiss them as “not journalism.”

If you’re at the geekier end of the journalism spectrum, then chances are your colleagues know about the stuff you can do. They may not understand it or be able to explain it; a former managing editor of mine, when told about the various technical steps to accomplish something useful, would invariably respond with a touch of wonder: “Fuckin’ Internet!” You can explain your work to a decent percentage of your colleagues by invoking Harry Potter or the Lord of the Rings and leave it at that.

But that doesn’t mean that building tools that can be used by broad segments of the newsroom is a one-way street or has to lead to a divide between you and the other journalists. There will be people in every newsroom who mainly take and rarely give, and in those situations being a technologist is no different from being a clerk. Good tools, like good apps, are a product of collaboration and improve the ability of the newsroom in general. They also make for more and better apps.

Case in point: At The Times we have an Inside Congress app that displays information about votes and bills in Congress. The tool that underlies that app is enormous – it has tons more information, and we’re working to surface more and more of it. But the tool – an internal interface – has uses for our congressional reporters, our graphics editors and for me as a developer. I can point a reporter to the vote record comparison tool instead of having to run a database query or, worse, asking someone else to manually recreate something. We use the tool as a sort of canary in the mine to alert us to odd or interesting events, from committee assignment changes to bill sponsorship withdrawals to unusual voting patterns. In some cases, having the data internally has led to improvements in the app itself, such as our “key amendments” pages for certain bills. I didn’t think of that, but someone else who saw the internal tool did, and we built it together.

Perhaps most important to me as a developer, building the internal tool has broadened the number of people I work with and has given me a range of ideas for making apps easier to build and better. Not all of them pan out, but some of them do. Put another way: the tool actually helps me develop closer working relationships with my colleagues.

A good tool doesn’t just make it easier for a reporter to create a story. It actually seeds the story, or makes it possible for more people in a newsroom to collaborate. When you have data but no tool, you become a gatekeeper of sorts – which is appropriate in many circumstances, but not all. I can’t possibly know what my colleagues are thinking about, considering or being alerted to, but I can make it easier for them to test out theories and do some exploration on their own. Some of them prefer to do their own work, and we certainly miss some opportunities for apps that way. But others consult with me quite a bit, since they now have a much better idea of what we have and what we might be able to do with it.

Skeptics might respond that there is a difference between tools built around journalistic content, like the Congress app, and those that “merely” solve a technical problem. This is a short-sighted argument. What we do as builders of Web applications (external or internal) is informed by everything we touch. Pulling a piece of one tool for use someplace else is a useful technique because it reinforces the value of not repeating yourself and because it sometimes enables you to look at an old problem or situation from a new vantage point.

Back at washingtonpost.com, my former colleague Adrian Holovaty liked to say that we didn’t build internal versions of our apps because the public version was the internal version. Fair enough, to a point, but I think that line can veer into the data ghetto when not rigidly policed.

Most of my colleagues, I’m confident, have very little idea what it is that I specifically do. Sometimes I spend the time educating, and sometimes I let our tools help with the evangelization process. However they see my work, I’m pretty happy as long as it contributes to our journalism together. App developer? Sure. Tool maker? Why not. Labels don’t interest me much, and most of my colleagues don’t seem to care. The results – the journalism – are what matter.

Why Teach SQL?

July 27th, 2011  |  Published in Car Tools, Teaching | Comments (4)

There was an interesting discussion on the NICAR-L listserv today about teaching database skills. More specifically, which software to teach and how to teach it. Should you go with SQLite, as I do? What about MS Access (the consensus seemed to lean against)? Is it too much to ask students to install database server software such as MySQL or PostgreSQL?

These are complicated questions, made more so by the options now available for teaching database skills. When I attended an IRE database bootcamp in 1997 (taught by my now-colleague Jo Craven McGinty), there were basically three options: the then-young Access, FoxPro or Paradox. Hard to believe, but back then I worked in a newsroom that had FoxPro and Paradox, but not really Access. (Note: if you are under 30 and reading this, you may not even know what FoxPro and Paradox are. That’s ok. They, and in particular FoxPro, were wonderful database managers in their day.)

Not only do we now have open source options (SQLite, MySQL, Postgres) and SQL Server, but we also have a variety of “database-like” Web applications, like Fusion Tables and Google Refine, that can do some of the things that only desktop software used to do. And let’s face it, Excel is a very powerful tool for data analysis. Many of the things a reporter might want to do to a data file, such as sorting and filtering, are arguably a lot easier in Excel or another spreadsheet.

So why even teach SQL, then? The reasons I do it, and will continue to, are these:

  1. SQL is an excellent and relatively simple way to enhance your data interviewing skills. When you have to write out your questions, you tend to think about them a little more than if you’re just pointing and clicking around. This is why when I had to teach Access, I bypassed the visual query builder. Yes, SQL queries involve writing more than doing an Excel filter, but those syntax errors also make you consider what you’re doing, and that’s a good thing.
  2. SQL is still common enough on the Web that teaching it provides an additional branch, if you will, of learning, or at least makes it easier. When I explain how Facebook assembles all your friends’ posts, comments and pictures, I usually do so by pointing out the existence of FQL. If you already know SQL, it’s a very small leap to understanding, at a basic level, how Facebook works.
  3. There are some times when you will absolutely need to use a SQL database. Or, at least, something that’s not Excel. Multi-million-row tables. Regular expression-based pattern matching. Intensive, complicated queries. If you haven’t explored SQL, you might not know these are even possible, and you might give up.
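As a small illustration of the first and third points, here is a sketch that writes the question out as a query and uses regular expression matching, run against SQLite through Python's built-in `sqlite3` module. The `contributions` table and its rows are made up. SQLite does not ship a `REGEXP` implementation by default, so the example registers one from Python.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")

def regexp(pattern, value):
    """Back SQLite's `value REGEXP pattern` operator with Python's re module."""
    if value is None:
        return 0
    return 1 if re.search(pattern, value) else 0

# SQLite calls the function as regexp(pattern, value).
conn.create_function("REGEXP", 2, regexp)

# Made-up table and rows, purely for illustration.
conn.execute("CREATE TABLE contributions (donor TEXT, employer TEXT, amount INTEGER)")
conn.executemany("INSERT INTO contributions VALUES (?, ?, ?)", [
    ("Smith", "Acme Corp.", 2500),
    ("Jones", "ACME Corporation", 1000),
    ("Lee", "Self-employed", 250),
])

# The question, written out: how many donors list an employer that starts
# with "acme" in any capitalization, and how much did they give in total?
query = """
    SELECT COUNT(*) AS donors, SUM(amount) AS total
    FROM contributions
    WHERE employer REGEXP '(?i)^acme'
"""
donors, total = conn.execute(query).fetchone()
print(donors, total)  # 2 3500
```

Having to spell out the pattern is the point: the inconsistent employer names that a spreadsheet filter would quietly split apart become something you have to think about.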

As to what to use when teaching SQL, I stick with SQLite despite Sarah Cohen’s completely valid point that date and time support is much more complicated than it should be. Perhaps a new installment of Troy Thibodeaux’s excellent tutorial will help address that issue. In the meantime, let’s keep teaching SQL – and asking questions.
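To show what the date complaint looks like in practice: SQLite has no dedicated date column type, so dates are commonly stored as ISO-8601 text and picked apart with functions such as `strftime`. Here is a minimal sketch using Python's built-in `sqlite3` module. The `votes` table and its rows are made up, and this is only one way of handling dates, not a recommendation drawn from the discussion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dates are stored as plain text in YYYY-MM-DD form (made-up rows).
conn.execute("CREATE TABLE votes (bill TEXT, voted_on TEXT)")
conn.executemany("INSERT INTO votes VALUES (?, ?)", [
    ("Bill A", "2011-02-19"),
    ("Bill B", "2011-01-19"),
    ("Bill C", "2011-08-02"),
])

# ISO-8601 text sorts in date order, so a plain string comparison works
# as a date filter; strftime() extracts the month for grouping.
by_month = conn.execute("""
    SELECT strftime('%m', voted_on) AS month, COUNT(*)
    FROM votes
    WHERE voted_on >= '2011-02-01'
    GROUP BY month
    ORDER BY month
""").fetchall()
print(by_month)  # [('02', 1), ('08', 1)]
```

The catch is that nothing enforces the format. A value like `2/19/2011` inserts without complaint and then silently breaks both the comparison and `strftime`.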

Previously


May 1, 2011
Interviewing Data

by Derek Willis | Read | 2 Comments

To my mother’s regret, I was never the literature lover she is. And I am not remotely the writer I might have been expected to be, given that my parents both taught English, one at the high school level and the other at college. I also am not the most graceful of interviewers, as my [...]


Mar 28, 2011
On Technical Challenges to Accessing Government Information

by Derek Willis | Read | 2 Comments

If you’re in D.C. on April 12 and are interested in government records, you may want to consider attending the Media Access to Government Information Conference (MAGIC) being held at National Archives building on Pennsylvania Ave. I’ll be one of the panelists there, but don’t let that dissuade you; there are far brighter people who [...]

About The Scoop

Derek Willis’ weblog on investigative and computer-assisted reporting.
