<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Scoop</title>
	<atom:link href="http://blog.thescoop.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.thescoop.org</link>
	<description>Derek Willis' weblog on investigative and computer-assisted reporting.</description>
	<lastBuildDate>Tue, 18 Oct 2011 02:10:16 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
		<item>
		<title>What We Don&#8217;t Know About Elections</title>
		<link>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 01:46:07 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Journalism]]></category>
		<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5631</guid>
		<description><![CDATA[If you happened to be at the recent Online News Association conference in Boston and happened to attend the session on covering the 2012 elections, then a good bit of this will be repetitive. Since there wasn&#8217;t a ton of time to expand on what I said, and I don&#8217;t want to leave the impression [...]]]></description>
			<content:encoded><![CDATA[<p>If you happened to be at the recent <a href="http://ona11.journalists.org">Online News Association conference in Boston</a> and happened to attend the <a href="http://ona11.journalists.org/sessions/innovative-ways-to-cover-the-2012-election/">session on covering the 2012 elections</a>, then a good bit of this will be repetitive. Since there wasn&#8217;t a ton of time to expand on what I said, and I don&#8217;t want to leave the impression that I&#8217;m critical of all election coverage, consider this the write-through.</p>
<p>First, I stand by <a href="http://twitter.com/#!/kzhu91/status/117273879105384449">what I said</a> about how little we understand about the way that elections are won or lost these days. It&#8217;s not that political journalism has strayed from its roots, or stopped covering important elements of a modern campaign. It&#8217;s that the elements of a modern campaign have changed, and as journalists, we have not kept pace.</p>
<p>You might respond that campaigns still involve quite a lot of stuff that we <em>do</em> understand, such as debates and visits to state fairs and town hall meetings. True. But the nature of media and technology has brought extensive changes to the electoral system, and I don&#8217;t believe that we as journalists devote enough attention to understanding those changes. Remember the <a href="http://www.wired.com/wired/archive/12.01/dean.html">Dean campaign in 2003</a>? Most of the coverage focused on the then-staggering online fundraising managed by some doctor from Vermont. But that Wired piece I referenced had it right; Dean&#8217;s accomplishment was less a mastery of the Internet than a willingness to embrace its fundamental aspect: you give up some control by bringing other people in, and you gain a host of possibilities. You may, of course, choose badly or falter in some other way, but the lessons and possibilities are becoming clear. At the time, as a Web geek who loved politics, I felt that journalists couldn&#8217;t really explain the Dean campaign, because it was so alien to us. Today&#8217;s campaigns make me long for the simplicity of 2003.</p>
<p>But let&#8217;s stick with fundraising for a bit. Political fundraising can be hugely expensive, because campaigns need to amass a large number of donors. Unless you&#8217;re the President, it&#8217;s hard to repeatedly gather the wealthiest Americans and have them fork over $2,500 or more for the pleasure of your company. So a smart campaign sticks with what works: direct mail is costly, for example, but it&#8217;s also effective. Telemarketing takes time and money, but it also works pretty well. Let&#8217;s not mess with the script too much. <a href="http://blog.optimizely.com/how-obama-raised-60-million-by-running-an-exp">But what if you <em>can</em> mess with the script</a>? Now it&#8217;s possible, even trivial, to experiment with Web site design or <a href="http://themonkeycage.org/blog/2011/08/23/rick-perrys-eggheads/">even advertisements</a> in order to gauge their effectiveness and improve upon them. President Obama had a <a href="http://www.youtube.com/watch?v=71bH8z6iqSc">Director of Analytics</a> for his 2008 campaign, and <a href="http://www.datashaping.com/jobs18843x.shtml">has been hiring data scientists experienced in predictive modeling</a>.</p>
<p>White men smoking cigars in cramped rooms making gut calls is how we&#8217;ve usually understood campaign decision-making. This? Whole new ballgame. Yes, there is still a mass audience that is shaped by the media and big events. But there are now thousands and thousands of &#8220;small&#8221; audiences &#8211; or rather, they always were there. Now campaigns can identify them and deliver precision messages to them. And they can find them online in different ways; an hour after <a href="http://twitter.com/#!/derekwillis/status/126084472666984448">posting on Twitter</a> about the Obama campaign&#8217;s use of Github, the <a href="http://twitter.com/#!/teddygoff">campaign&#8217;s Digital Director</a> was following me. And that&#8217;s the easy part.</p>
<p>While campaigns have a public presence that is mostly recorded and observed, the stuff that goes on behind the scenes is so much more sophisticated than it has been. In 2008 we were fascinated by <a href="http://www.jackandjillpolitics.com/2008/10/obama-launches-iphone-app-makes-everyone-a-campaign-worker/">the Obama campaign&#8217;s use of iPhones for data collection</a>; now we&#8217;re entering an age where campaigns don&#8217;t just collect information by hand, but harvest it and learn from it. An &#8220;<a href="http://www.targetpointconsulting.com/ToThePoint/2011/09/27/the-information-arms-race">information arms race</a>,&#8221; as GOP consultant Alex Gage puts it.</p>
<p>For most news organizations, the standard approach to campaign coverage is tantamount to bringing a knife to a gun fight. How many data scientists work for news organizations? We are falling behind, and we risk not being able to explain to our readers and users how their representatives get elected or defeated.</p>
<p>None of this is to say that we need to completely abandon our ways of covering elections. Horse-race coverage is and should be a part of campaign coverage, because in many respects elections are like horse races. Things can change rapidly, and small things can have big impact. We still should be on the ground, talking to voters, showing up at town halls and covering debates. We still need to show up and do the legwork.</p>
<p>But if we can&#8217;t appreciate, much less understand, what modern campaigns are doing to win elections, how can we hope to explain elections? If we don&#8217;t collect at least some of the information available to us &#8211; realizing that we can&#8217;t get our hands on everything that the campaigns do &#8211; we&#8217;ll miss the story. Elections will become even bigger surprises to us, and then how long will it be before readers start to ask whether we actually know the people and places we cover?</p>
<p>Surprises make the news. Some of my favorite stories from the 2004 presidential election are in <a href="http://www.amazon.com/One-Party-Country-Republican-Dominance/dp/0471776726">a book</a> by my friends Peter Wallsten and Tom Hamburger, then of the Los Angeles Times. Here&#8217;s <a href="http://www.latimes.com/news/opinion/commentary/la-op-hamburger25jun25,0,906381.story">one anecdote from the key state of Ohio</a>:</p>
<blockquote><p>One suburban African American woman in Ohio, for example, told us that though she tends to vote Democratic, she was deluged in 2004 with calls, e-mail messages and other forms of communication by Republicans who somehow knew that she was a mother with children in private schools, an active church attendee, an abortion opponent and a golfer.</p></blockquote>
<p>Think about what this kind of thing means. It means that we cannot assume that the campaign visible to the mass audience is the same campaign that&#8217;s being pitched to individuals and groups around the nation, and that winning coalitions can be built not just by harnessing large groups (unions, religious voters, etc.) but also by piecing them together in small units. President Bush&#8217;s margin in Ohio in 2004? <a href="http://www.nytimes.com/packages/html/politics/2004_ELECTIONRESULTS_GRAPHIC/">About 2.5 percent</a>. The only thing that I don&#8217;t like about this anecdote is that Wallsten and Hamburger&#8217;s book appeared nearly two years later. Is there any evidence that we as journalists have closed the gap since then?</p>
<p>To understand how elections are now being waged, we need to have as many of the tools as do the campaigns. We need to build our own storehouses of data &#8211; <a href="http://www.wakegov.com/elections/8data.htm">voter registration</a>, <a href="http://www.sos.georgia.gov/elections/voter_registration/voterhistory.asp">voter history</a>, Census, campaign finance, <a href="http://transition.fcc.gov/mb/audio/decdoc/public_and_broadcasting.html#_Toc202587585">advertisements</a> and more. We need to be able to tap into the rich stream of material that&#8217;s being created and disseminated every day. We need to be able to see the value in small data points that can lead to bigger things.</p>
<p>Elections are great stories. They deserve to be told from a position of confidence and knowledge. We have work to do.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>RemoteTable Is Your Friend</title>
		<link>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 01:14:51 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5640</guid>
		<description><![CDATA[Assuming you regularly work with data found online &#8211; and if you don&#8217;t, you&#8217;re probably here by mistake, so welcome! &#8211; then you realize what a pain it can be to grab structured files from some site, save them and import them. I have more methods in more apps than I can count that download [...]]]></description>
			<content:encoded><![CDATA[<p>Assuming you regularly work with data found online &#8211; and if you don&#8217;t, you&#8217;re probably here by mistake, so welcome! &#8211; then you realize what a pain it can be to grab structured files from some site, save them and import them. I have more methods in more apps than I can count that download a CSV file, run it through a parser and then save some objects as a result. And it always seems to be a 2-3 step process.</p>
<p>If you&#8217;re a fan of the useful <a href="http://csvkit.readthedocs.org/en/latest/tutorial/examining_the_data.html">CSVKit</a>, a command-line tool written in Python, but you&#8217;re a Rubyist, then please do yourself a favor and take a look at <a href="https://github.com/seamusabshere/remote_table">RemoteTable</a>, a gem by <a href="http://workingwithrails.com/person/15878-seamus-abshere">Seamus Abshere</a>. The two libraries aren&#8217;t entirely identical, and <a href="http://www.anthonydebarros.com/2011/09/11/csvkit-data-files/">CSVKit has been rightly praised by others</a>, so let me dive a little deeper into why RemoteTable is a data parser&#8217;s friend.</p>
<p>RemoteTable is essentially a set of useful wrappers around the common process I outlined above: grab a structured file and use its contents. Except that when I say &#8220;structured file&#8221;, I don&#8217;t just mean a CSV. I mean CSVs inside of zip files. Excel spreadsheets of the .xls and .xlsx varieties. Google Spreadsheets and Open Office spreadsheets. Web pages with HTML tables in them. XML. And, what the hell, fixed-width files, too. You&#8217;ll want to <a href="https://github.com/seamusabshere/remote_table/blob/master/README.rdoc">see the examples</a>.</p>
<p>To those options it adds some useful utilities, much as CSVKit does. You can cut certain columns using the <a href="http://www.softpanorama.org/Tools/cut.shtml">Unix cut utility</a>, skip or crop rows using tail, make the entire file UTF-8 or remove &#8220;useless&#8221; characters.</p>
<p>It&#8217;s simple to try out. If you have Ruby installed and are not on Windows (sorry!), then gem install remote_table and wait a minute or two, as it installs quite a few dependencies for managing all this stuff. Then, in a console session, try grabbing a CSV file:</p>
<p><script src="https://gist.github.com/1263335.js"> </script></p>
<p>Each row is converted into an Ordered Hash so that you can refer to columns by name instead of position.</p>
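<p>If you live outside Ruby, the rows-by-column-name idea is easy to sketch in Python with nothing but the standard library. To be clear, this is not RemoteTable&#8217;s API: the sample data and the <code>rows_by_column_name</code> helper are made up for illustration, and the download step is left out so the snippet runs anywhere.</p>

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file.
SAMPLE_CSV = "name,state,amount\nAlice,OH,250\nBob,PA,100\n"


def rows_by_column_name(csv_text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [dict(row) for row in reader]


rows = rows_by_column_name(SAMPLE_CSV)
# Columns are addressed by name, not by position.
print(rows[0]["state"])
```

<p>Every value comes back as a string, so converting amounts to numbers is a separate step you would add yourself.</p>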
<p>Now, what about a fixed-width file? Maybe one in which I&#8217;m only interested in some of the columns? Easy, thanks to RemoteTable&#8217;s nice DSL:</p>
<p><script src="https://gist.github.com/1263343.js"> </script></p>
<p>Notice that :cut option I passed? That&#8217;s where you include arguments to cut. In this case, I removed the financial columns from this file. You can also pass a :select option to it to do on-the-fly filtering of certain columns based on a matcher such as a regular expression or plain text.</p>
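<p>To show what is going on underneath, here is a rough Python sketch of the same kind of work: slice each line of a fixed-width file by character position, keep only the columns you care about, and drop rows that do not match a pattern. The layout, the sample lines and the <code>parse_fixed_width</code> function are all hypothetical, and the <code>keep</code> and <code>select</code> arguments are my own stand-ins rather than RemoteTable options.</p>

```python
import re

# Hypothetical layout: (column name, start offset, end offset).
LAYOUT = [("name", 0, 10), ("state", 10, 12), ("amount", 12, 20)]

# Hypothetical fixed-width records, 20 characters each.
SAMPLE_LINES = [
    "Alice     OH     250",
    "Bob       PA     100",
    "Carol     OH      75",
]


def parse_fixed_width(lines, layout, keep=None, select=None):
    """Slice lines by position, keep chosen columns, filter rows by regex."""
    pattern = re.compile(select) if select else None
    rows = []
    for line in lines:
        row = {name: line[start:end].strip() for name, start, end in layout}
        if keep is not None:
            # Drop every column that was not asked for.
            row = {name: row[name] for name in keep}
        if pattern and not any(pattern.search(v) for v in row.values()):
            # Skip rows where no kept value matches the pattern.
            continue
        rows.append(row)
    return rows


rows = parse_fixed_width(
    SAMPLE_LINES, LAYOUT, keep=["name", "state"], select=r"^OH$"
)
print(rows)
```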
<p>RemoteTable can tell you when you&#8217;re not being very smart about its use, too, such as causing it to re-fetch the source file multiple times (it warns you when it does this). It also handles local files, so if you want to avoid the downloading bit (or your files aren&#8217;t on the Web), that&#8217;s fine, too. Honestly, I have no idea why this library isn&#8217;t more popular, but you can help make it so on <a href="https://github.com/seamusabshere/remote_table">Github</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Measuring Vocabulary Richness (or, Trying Out Django on Heroku)</title>
		<link>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/#comments</comments>
		<pubDate>Sun, 02 Oct 2011 02:14:52 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5635</guid>
		<description><![CDATA[When Heroku announced Python support this past week, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the Python Weekly newsletter caught my eye. Swizec Teller&#8217;s entry, &#8220;Measuring vocabulary richness with Python&#8221;, described an algorithm by George Udny Yule in a 1944 [...]]]></description>
			<content:encoded><![CDATA[<p>When Heroku <a href="http://blog.heroku.com/archives/2011/9/28/python_and_django/">announced Python support this past week</a>, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the <a href="http://www.pythonweekly.com/">Python Weekly newsletter</a> caught my eye.</p>
<p>Swizec Teller&#8217;s entry, &#8220;<a href="http://swizec.com/blog/measuring-vocabulary-richness-with-python/swizec/2528">Measuring vocabulary richness with Python</a>&#8221;, described an algorithm by <a href="http://statprob.com/encyclopedia/GeorgeUdnyYule.html">George Udny Yule</a> in a 1944 paper entitled &#8220;<a href="http://scholar.google.com/scholar?q=The+statistical+study+of+literary+vocabulary">The statistical study of literary vocabulary</a>.&#8221; Yule created a way to quantify the diversity of vocabulary in a given text, and Teller translated that formula into straightforward Python code.</p>
<p>So I made a <a href="https://github.com/dwillis/Rich-Vocab">simple Django app</a> that accepts text via a form and uses Teller&#8217;s code to calculate Yule&#8217;s I score of vocabulary richness. It uses the really useful <a href="http://www.nltk.org/">Natural Language Toolkit</a>; the only oddity is that when developing locally on a Mac, the standard installation of NLTK via pip is borked, so you need to <a href="https://github.com/dwillis/Rich-Vocab/blob/master/requirements.txt">specify a file to download in your requirements.txt</a>. You can find <a href="http://richvocab.herokuapp.com/">the demo app here</a>.</p>
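<p>For a feel of what the score measures, here is a minimal sketch of one commonly cited form of Yule&#8217;s I: the square of the total word count divided by the sum of squared word frequencies minus the total word count. This is not Teller&#8217;s code and not what the demo app runs; it skips NLTK and stemming entirely, the regex tokenizer is my own shortcut, and write-ups differ on the exact definitions, so treat the numbers as illustrative.</p>

```python
import re
from collections import Counter


def yules_i(text):
    """Return Yule's I for the text, or None when no word repeats."""
    # Assumption: lowercase words split on non-letters, no stemming.
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    m1 = len(words)  # total number of word tokens
    m2 = sum(freq * freq for freq in counts.values())  # squared frequencies
    if m2 == m1:
        # Every word appears once (or the text is empty): ratio undefined.
        return None
    return (m1 * m1) / (m2 - m1)


print(yules_i("the cat and the hat"))
```

<p>A higher score means less repetition, so a text that reuses the same few words over and over scores low.</p>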
<p>I&#8217;m not offering a judgment on using Heroku or other instant deployment-type services; most of them seem pretty easy to use but out of my price range for anything significant. But it&#8217;s nice to know that services like Heroku, ep.io and others offer enough flexibility to do stuff like natural language parsing.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>In Defense of Building Tools</title>
		<link>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/</link>
		<comments>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/#comments</comments>
		<pubDate>Thu, 11 Aug 2011 00:45:16 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Journalism]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5622</guid>
		<description><![CDATA[My first job in Web development was as a member of washingtonpost.com&#8217;s &#8220;Tools Team.&#8221; I was, in title if not in practice, a Tool. Done snickering? Let&#8217;s move on. The Tools Team built mostly internal applications and services that helped the Web site run better. I mainly got to work on front-facing projects like the [...]]]></description>
			<content:encoded><![CDATA[<p>My first job in Web development was as a member of washingtonpost.com&#8217;s &#8220;Tools Team.&#8221; I was, in title if not in practice, a Tool.</p>
<p>Done snickering? Let&#8217;s move on.</p>
<p>The Tools Team built mostly internal applications and services that helped the Web site run better. I mainly got to work on front-facing projects like the <a href="http://projects.washingtonpost.com/congress/112/">Congress Votes Database</a>, the <a href="http://projects.washingtonpost.com/2008-presidential-candidates/">2008 presidential campaign</a> and an innovative series on <a href="http://projects.washingtonpost.com/fec/specials/cassidy/">lobbyist Gerald Cassidy</a>. But I did work on a few internal tools, and since I joined The Times in late 2007 I&#8217;ve built a few more. I&#8217;ve found that such tools are not so different from what we now consider to be journalism by Web development. Chosen wisely and done well, they can have impacts that go far beyond a single story or series. We should not <a href="https://twitter.com/hbillings/status/101391263248560128">dismiss them as &#8220;not journalism.</a>&#8221;</p>
<p>If you&#8217;re at the geekier end of the journalism spectrum, then chances are your colleagues know about the stuff you can do. They may not understand it or be able to explain it; a former managing editor of mine, when told about the various technical steps to accomplish something useful, would invariably respond with a touch of wonder: &#8220;Fuckin&#8217; Internet!&#8221; You can explain your work to a decent percentage of your colleagues by invoking Harry Potter or the Lord of the Rings and leave it at that.</p>
<p>But that doesn&#8217;t mean that building tools that can be used by broad segments of the newsroom is a one-way street or has to lead to a divide between you and the other journalists. There will be people in every newsroom who mainly take and rarely give, and in those situations being a technologist is no different from being a clerk. Good tools, like good apps, are a product of collaboration and improve the ability of the newsroom in general. They also make for more and better apps.</p>
<p>Case in point: At The Times we have <a href="http://politics.nytimes.com/congress/">an Inside Congress app</a> that displays information about votes and bills in Congress. The tool that underlies that app is enormous &#8211; it has tons more information, and we&#8217;re working to surface more and more of it. But the tool &#8211; an internal interface &#8211; has uses for our congressional reporters, our graphics editors and for me as a developer. I can point a reporter to the vote record comparison tool instead of having to run a database query or, worse, asking someone else to manually recreate something. We use the tool as a sort of canary in the mine to alert us to odd or interesting events, from committee assignment changes to bill sponsorship withdrawals to unusual voting patterns. In some cases, having the data internally has led to improvements in the app itself, such as our <a href="http://politics.nytimes.com/congress/bills/111/hr3590/amendments">&#8220;key amendments&#8221;</a> pages for certain bills. I didn&#8217;t think of that, but someone else who saw the internal tool did, and we built it together.</p>
<p>Perhaps most important to me as a developer, building the internal tool has broadened the number of people I work with and has given me a range of ideas for making apps easier to build and better. Not all of them pan out, but <a href="http://www.nytimes.com/interactive/2010/07/07/us/politics/20100707-kagan-vote-tracker.html">some of them do</a>. Put another way: the tool actually helps me develop closer working relationships with my colleagues.</p>
<p>A good tool doesn&#8217;t just make it easier for a reporter to create a story. It actually seeds the story, or makes it possible for more people in a newsroom to collaborate. When you have data but no tool, you become a gatekeeper of sorts &#8211; which is appropriate in many circumstances, but not all. I can&#8217;t possibly know what my colleagues are thinking about, considering or being alerted to, but I can make it easier for them to test out theories and do some exploration on their own. Some of them prefer to do their own work, and we certainly miss some opportunities for apps that way. But others consult with me quite a bit, since they now have a much better idea of what we have and what we might be able to do with it.</p>
<p>Skeptics might respond that there is a difference between tools built around journalistic content, like the Congress app, and those that &#8220;merely&#8221; solve a technical problem. This is a short-sighted argument. What we do as builders of Web applications (external or internal) is informed by everything we touch. Pulling a piece of one tool for use someplace else is a useful technique because it <a href="https://twitter.com/yurivictor/status/101419578936147968">reinforces the value of not repeating yourself</a> and because it sometimes enables you to look at an old problem or situation from a new vantage point.</p>
<p>Back at washingtonpost.com, my former colleague <a href="http://www.holovaty.com/">Adrian Holovaty</a> liked to say that we didn&#8217;t build internal versions of our apps because the public version was the internal version. Fair enough, to a point, but I think that line can veer into the <a href="http://www.mattwaite.com/posts/2008/jan/03/data-ghettos/">data ghetto</a> when not rigidly policed.</p>
<p>Most of my colleagues, I&#8217;m confident, have very little idea what it is that I specifically do. Sometimes I spend the time educating, and sometimes I let our tools help with the evangelization process. However they see my work, I&#8217;m pretty happy as long as it contributes to our journalism together. App developer? Sure. Tool maker? Why not. Labels don&#8217;t interest me much, and most of my colleagues don&#8217;t seem to care. The results &#8211; the journalism &#8211; are what matter.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Why Teach SQL?</title>
		<link>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/</link>
		<comments>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/#comments</comments>
		<pubDate>Thu, 28 Jul 2011 00:53:19 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Teaching]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5615</guid>
		<description><![CDATA[There was an interesting discussion on the NICAR-L listserv today about teaching database skills. More specifically, which software to teach and how to teach it. Should you go with SQLite, as I do? What about MS Access (the consensus seemed to lean against)? Is it too much to ask students to install database server software [...]]]></description>
			<content:encoded><![CDATA[<p>There was an interesting discussion on the NICAR-L listserv today about teaching database skills. More specifically, which software to teach and how to teach it. Should you go with <a href="http://blog.thescoop.org/archives/2008/01/23/teaching-sqlite/">SQLite</a>, as I do? What about MS Access (the consensus <a href="http://blog.thescoop.org/archives/2009/06/02/the-case-against-teaching-access/">seemed to lean against</a>)? Is it too much to ask students to install database server software such as MySQL or PostgreSQL?</p>
<p>These are complicated questions, made more so by the options now available for teaching database skills. When I attended an IRE database bootcamp in 1997 (taught by my now-colleague <a href="http://topics.nytimes.com/topics/reference/timestopics/people/m/jo_craven_mcginty/index.html">Jo Craven McGinty</a>), there were basically three options: the then-young Access, FoxPro or Paradox. Hard to believe, but back then I worked in a newsroom that had <a href="http://en.wikipedia.org/wiki/FoxPro_2">FoxPro</a> and <a href="http://en.wikipedia.org/wiki/Paradox_%28database%29">Paradox</a>, but not really Access. (Note: if you are under 30 and reading this, you may not even know what FoxPro and Paradox are. That&#8217;s ok. They, and in particular FoxPro, were wonderful database managers in their day.)</p>
<p>Not only do we now have open source options (SQLite, MySQL, Postgres) and SQL Server, but we also have a variety of &#8220;database-like&#8221; Web applications, like Fusion Tables and Google Refine, that can do some of the things that only desktop software used to do. And let&#8217;s face it, Excel is a very powerful tool for data analysis. Many of the things a reporter might want to do to a data file, such as sorting and filtering, are arguably a lot easier in Excel or another spreadsheet.</p>
<p>So why even teach SQL, then? The reasons I do it, and will continue to, are these:</p>
<ol>
<li>SQL is an excellent and relatively simple way to enhance your <a href="http://blog.thescoop.org/archives/2011/05/01/interviewing-data/">data interviewing</a> skills. When you have to write out your questions, you tend to think about them a little more than if you&#8217;re just pointing and clicking around. This is why when I had to teach Access, I bypassed the visual query builder. Yes, SQL queries involve writing more than doing an Excel filter, but those syntax errors also make you consider what you&#8217;re doing, and that&#8217;s a good thing.</li>
<li>SQL is still common enough on the Web that teaching it provides an additional branch, if you will, of learning, or at least makes it easier. When I explain how Facebook assembles all your friends&#8217; posts, comments and pictures, I usually do so by pointing out the existence of <a href="https://developers.facebook.com/docs/reference/fql/">FQL</a>. If you already know SQL, it&#8217;s a very small leap to understanding, at a basic level, how Facebook works.</li>
<li>There are some times when you will absolutely need to use a SQL database. Or, at least, something that&#8217;s not Excel. Multi-million-row tables. Regular expression-based pattern matching. Intensive, complicated queries. If you haven&#8217;t explored SQL, you might not know these are even possible, and you might give up.</li>
</ol>
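<p>To make the list above concrete, here is a small sketch using SQLite from Python. The <code>donations</code> table and its rows are invented for the example. One assumption worth flagging: SQLite understands the REGEXP operator but does not ship an implementation of it by default, so the snippet registers a Python function to supply one.</p>

```python
import re
import sqlite3


def regexp(pattern, value):
    """Back SQLite's REGEXP operator with Python's re module."""
    return int(value is not None and re.search(pattern, value) is not None)


conn = sqlite3.connect(":memory:")
conn.create_function("REGEXP", 2, regexp)
conn.execute("CREATE TABLE donations (donor TEXT, state TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO donations VALUES (?, ?, ?)",
    [
        ("Smith, John", "OH", 250),
        ("Smyth, Jon", "OH", 100),
        ("Jones, Ann", "PA", 500),
    ],
)

# Question 1: how much came from each state?
totals = conn.execute(
    "SELECT state, SUM(amount) FROM donations GROUP BY state ORDER BY state"
).fetchall()

# Question 2: which donors have a name that looks like Smith or Smyth?
matches = conn.execute(
    "SELECT donor FROM donations WHERE donor REGEXP ? ORDER BY donor",
    ("^Sm[iy]th",),
).fetchall()

print(totals)
print(matches)
```

<p>Each query is a question written out in full, which is the point of the first reason above: you have to decide what you are asking before you can ask it.</p>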
<p>As to what to use when teaching SQL, I stick with SQLite despite <a href="http://fds.duke.edu/db/Sanford/sarah.cohen">Sarah Cohen</a>&#8217;s completely valid point that <a href="http://www.sqlite.org/lang_datefunc.html">date and time support</a> is much more complicated than it should be. Perhaps a new installment of <a href="https://github.com/tthibo/SQL-Tutorial">Troy Thibodeaux&#8217;s excellent tutorial</a> will help address that issue. In the meantime, let&#8217;s keep teaching SQL &#8211; and asking questions.</p>
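<p>For anyone who has not run into the date issue, here is a short illustration. SQLite has no dedicated date column type, so dates are commonly stored as ISO 8601 text and handled through functions such as <code>strftime</code>, <code>date</code> and <code>julianday</code>. The dates below are arbitrary examples.</p>

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Dates kept as YYYY-MM-DD text work with SQLite's built-in date functions.
year, next_month, days_between = conn.execute(
    """
    SELECT strftime('%Y', '2011-07-27'),
           date('2011-07-27', '+1 month'),
           julianday('2011-10-17') - julianday('2011-07-27')
    """
).fetchone()

print(year, next_month, days_between)
```

<p>It works, but students have to learn that the year comes back as text and that date arithmetic means subtracting Julian day numbers, which is a lot to ask in a first SQL class.</p>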
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Interviewing Data</title>
		<link>http://blog.thescoop.org/archives/2011/05/01/interviewing-data/</link>
		<comments>http://blog.thescoop.org/archives/2011/05/01/interviewing-data/#comments</comments>
		<pubDate>Mon, 02 May 2011 00:35:57 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[IRE]]></category>
		<category><![CDATA[Journalism]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5606</guid>
		<description><![CDATA[To my mother&#8217;s regret, I was never the literature lover she is. And I am not remotely the writer I might have been expected to be, given that my parents both taught English, one at the high school level and the other at college. I also am not the most graceful of interviewers, as my [...]]]></description>
			<content:encoded><![CDATA[<p>To my mother&#8217;s regret, I was never the literature lover she is. And I am not remotely the writer I might have been expected to be, given that my parents both taught English, one at the high school level and the other at college. I also am not the most graceful of interviewers, as my questions tend to run on for too long instead of zeroing in on clear questions.</p>
<p>You might ask, &#8220;How is it that you&#8217;ve managed to keep a job in journalism, then?&#8221;</p>
<p>There&#8217;s no single answer to that, although the majority of it would have to be everything I learned from being a member of <a href="http://www.ire.org/">Investigative Reporters &#038; Editors</a>. And of that part, what I&#8217;ve really learned to love and work at is the other kind of interviewing. The one you don&#8217;t hear much about in journalism school: interviewing data.</p>
<p>To be fair, you really don&#8217;t hear all that much about the craft of interviewing people at journalism school, either. There is the occasional class, but the way that most people I know get better at it is simply by doing. When people ask me how I can approach complete strangers and ask them detailed and occasionally personal questions, I&#8217;m quick to reply that I spent four summers delivering breakfast in bed to newlyweds in the Poconos. When you&#8217;ve had a naked man answer the door at 8 a.m. and tell you to put the trays down next to the tripod-mounted video camera, talking to even partially-clothed strangers gets pretty easy.</p>
<p>Interviewing data takes practice, too, although I can&#8217;t really find a parallel from my days waiting tables. Both kinds of interviewing have much in common: you want to be as prepared as possible so as to better evaluate the results and be able to adapt your questions to the situation. Both require you to place a solid block of skepticism, even suspicion, on your shoulders as you embark. And both, if done well, can result in an unexpected admission &#8211; something even the subject of the interview didn&#8217;t really &#8220;know&#8221;.</p>
<p>This is why I continue to teach spreadsheets in classes, because they make for excellent initial interview tools. Looking at some data in a spreadsheet, you can easily size it up with basic sorting and filtering. That&#8217;s kind of the &#8220;getting-to-know-you&#8221; phase of the data interview. What are the ranges of this data? What looks unusual? Just as you get first impressions upon meeting someone, you get similar feelings about data.</p>
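<p>The same getting-to-know-you questions can be asked in code. This is only a sketch in Python, with made-up salary data standing in for a spreadsheet export; the column names and values are assumptions.</p>

```python
import csv
import io

# Made-up sample data standing in for a spreadsheet export.
raw = io.StringIO(
    "agency,employee,salary\n"
    "Parks,Adams,41000\n"
    "Parks,Baker,39500\n"
    "Police,Clark,58000\n"
    "Police,Davis,580000\n"
)
rows = list(csv.DictReader(raw))
for row in rows:
    row["salary"] = int(row["salary"])

# What are the ranges of this data?
salaries = [row["salary"] for row in rows]
low, high = min(salaries), max(salaries)

# What looks unusual? Sort descending and look at the top.
ranked = sorted(rows, key=lambda row: row["salary"], reverse=True)

# Filter to one group, the way a spreadsheet filter would.
police = [row for row in rows if row["agency"] == "Police"]

print(low, high)
print(ranked[0])
print(len(police))
```

<p>Here the sort surfaces a salary that looks like it has an extra zero, the kind of first impression worth checking before going any further.</p>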
<p>With data you have to ask all the basic questions you do with a person, just so you know exactly what you&#8217;re dealing with. Questions like: &#8220;How old are you?&#8221;, &#8220;Where were you born?&#8221;, &#8220;Who do you report to?&#8221; work for both people and data (although I suppose &#8220;made&#8221; is a better word than &#8220;born&#8221;). And then, once you&#8217;ve got a solid foundation, you ask the trickier questions, the ones that you need to really think about. The ones that, when you&#8217;re planning a big interview with the subject of your investigation, you game-plan and write out as if they were lines in a soap opera.</p>
<p>And that&#8217;s where the big difference is: with data, you can ask a lot of potentially embarrassing questions, and the data won&#8217;t complain, walk out or threaten to sue. You can ask variations of the same question 20 times and the data won&#8217;t mind. When I say that I prefer interviewing data to people, this is why. Data will only lie to you if it&#8217;s just bad data or if you misunderstand the question. Unfortunately, almost every data set is &#8220;bad&#8221; in some way. But once you find that out, you usually can deal with it.</p>
<p>With the increased availability of information in structured forms, the skill of interviewing data is even more valuable now than it has been in the past. And yet it&#8217;s still considered a niche, a specialty skill. It&#8217;s odd, because what makes a good interviewer is not whether she uses a digital recorder or a pen. The technology itself is a tool. The crucial factor is the skill in being an interviewer &#8211; preparation, knowing what questions to ask and knowing when something isn&#8217;t right.</p>
<p>You wouldn&#8217;t stumble into an interview with a source having done no research, no preparation. Why in the world should journalists treat data sources any differently?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/05/01/interviewing-data/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>On Technical Challenges to Accessing Government Information</title>
		<link>http://blog.thescoop.org/archives/2011/03/28/on-technical-challenges-to-accessing-government-information/</link>
		<comments>http://blog.thescoop.org/archives/2011/03/28/on-technical-challenges-to-accessing-government-information/#comments</comments>
		<pubDate>Mon, 28 Mar 2011 17:40:55 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[API]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5601</guid>
		<description><![CDATA[If you&#8217;re in D.C. on April 12 and are interested in government records, you may want to consider attending the Media Access to Government Information Conference (MAGIC) being held at the National Archives building on Pennsylvania Ave. I&#8217;ll be one of the panelists there, but don&#8217;t let that dissuade you; there are far brighter people who [...]]]></description>
			<content:encoded><![CDATA[<p>If you&#8217;re in D.C. on April 12 and are interested in government records, you may want to consider attending the <a href="http://www.archives.gov/ncast/news/events/magic.html">Media Access to Government Information Conference</a> (MAGIC) being held at the National Archives building on Pennsylvania Ave. I&#8217;ll be one of the panelists there, but don&#8217;t let that dissuade you; there are far brighter people who will be there.</p>
<p>As part of our participation, panelists were asked to write a 1,000-word comment on the topic of their panel. Mine is &#8220;What are the common technical challenges journalists face in making sense of government documents and analyzing government actions, and how could those be overcome?&#8221; Faithful readers will know that I could probably go on for hours on this, but here is what I sent to the conference organizers (sans links):<br />
<span id="more-5601"></span><br />
I very much doubt that the conference organizers intended this, but the fact that our responses to these questions were requested in either PDF or MS Word formats is an excellent example of one of the technical challenges for journalists when dealing with government documents. So in the spirit of openness, I wrote this in Google Docs.</p>
<p>Both journalists and government employees who create and manage information need to know about more than the usual options for the collection and dissemination of information. Part of the technical failure rightly belongs to journalists &#8212; too often, we don’t ask or don’t know how to ask for information in a way that makes it easy to use. But far too often, government officials are either unaware of their format options or, more perniciously, all too aware and resort to distributing documents in, for example, a locked PDF.</p>
<p>I have been told many times that to release information in a format that would allow it to be copied is not the policy of a government agency. Those government agencies fail to understand what public information is. I also have been the recipient of records that clearly were stored in spreadsheet software but, for purposes of public release, have been printed out, scanned into images and stored as a PDF. Obfuscation and paranoia are not technical challenges, but they contribute to them, forcing journalists to acquire costly software or spend additional time overcoming an artificial roadblock.</p>
<p>This challenge is not due to deficiencies in software produced by any particular company but rather in the understanding of how information can and should be made available to citizens. To the greatest extent possible, government information should be made available in formats that allow its users to copy, sift and reorganize it as they please. In practice, this means favoring text-based and open formats over images, PDFs and closed formats. I am less concerned about how agencies store their data, as long as they are able to export it in common formats or reliable workarounds exist, but there are exceptions to this.</p>
<p>Map data, for example, is commonly stored at the government level in <a href="http://en.wikipedia.org/wiki/Shapefile">ESRI’s shapefile format</a>, which, from the point of view of a journalist, has advantages and disadvantages. ESRI is a large, well-known company with products in use at most government agencies that have geographic data, so it makes sense that GIS data would be provided in that format. But not all newsrooms, and certainly not all journalists, are able to obtain ESRI’s software or have access to someone who can easily convert from shapefiles to other formats, such as the <a href="http://code.google.com/apis/kml/documentation/kml_tut.html">KML standard</a> now owned by Google. <a href="http://www.google.com/search?sourceid=chrome&#038;ie=UTF-8&#038;q=filetype%3Akml+site%3Agov">Some government agencies</a> already produce useful geographic data in KML format, but many others could join that list.</p>
<p>Fixing this situation will require educating journalists and government employees about the benefits and ease of working with open formats. The benefits for journalists are apparent: faster access to information that they can immediately put to use. Training journalists is a time-consuming and inefficient process, but journalists must break out of the mindset that government information only comes in documents.</p>
<p>Doing so means that journalists need to become as comfortable interviewing data as they are interviewing people. The benefits for government may need some more time to explain, but they exist. The <a href="http://www.fec.gov/">Federal Election Commission</a> is a case in point.</p>
<p>Thanks to a commitment to maintaining stable, available and well-documented data, the FEC makes it possible for its users to obtain and analyze information when they want to, even on a late-night deadline. This isn’t new-fangled technology; the <a href="http://fec.gov/finance/disclosure/ftp_download.shtml">FEC’s FTP site</a> has been operating for years. But the agency operates as if it trusts its data users, not from a defensive standpoint. As a result, the FEC is rightfully seen as an agency that makes it possible for reporters to do their jobs, not as an impediment to that goal.</p>
<p>A more recent, but increasingly significant, technical challenge is that too many government agencies fail to make better use of the best information distribution platform they have: the Internet. In a digital age, some agencies continue to treat all records as documents or, when they do make data available, release it as a single dump. In many cases, journalists do not need an entire dataset; they are more likely to want to answer a single question or small set of questions. In those cases where government agencies make this possible, it is usually through a Web form of their own design &#8211; one which often is tailored to heavy users such as the regulated community.</p>
<p>Providing <a href="http://en.wikipedia.org/wiki/Application_programming_interface">Application Programming Interfaces</a> (APIs) to government data via XML or JSON feeds would make it possible for journalists and Web developers to take advantage of government data without having to download and process enormous files. And while adding an API will incur an up-front cost, it will also save agencies employee time handling requests that could be done computer-to-computer.</p>
<p>Yet such APIs are very rare in government, even though they would make it easier for users and journalists to combine disparate data, and would make it possible to build more useful applications from government data. We know this to be true, because in the absence of any meaningful government approach to disseminating legislative data, several outside organizations, including my own, have developed APIs to help spur the use and spread of congressional data.</p>
<p>But in order to do this, we have had to essentially reverse-engineer the <a href="http://thomas.loc.gov/">Thomas site</a> operated by the Library of Congress, writing fragile HTML parsers that can break should the LoC change the structure of individual pages. So, in order to answer anything beyond the most basic question on legislative matters, a journalist must either spend hours looking up information one page at a time or be able to write a computer program to parse those pages. There has to be a better way.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/03/28/on-technical-challenges-to-accessing-government-information/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A Feed for WSJ A-Heds</title>
		<link>http://blog.thescoop.org/archives/2011/03/23/a-feed-for-wsj-a-heds/</link>
		<comments>http://blog.thescoop.org/archives/2011/03/23/a-feed-for-wsj-a-heds/#comments</comments>
		<pubDate>Wed, 23 Mar 2011 13:18:17 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Code]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5594</guid>
		<description><![CDATA[Updated: WSJ&#8217;s Zach Seward points out that there is an official A-Hed feed, albeit not promoted. Most folks in journalism know about the Wall Street Journal&#8217;s famed &#8216;A-Hed&#8217; stories &#8211; the ones that used to be in the middle column of the paper that are off-beat, sometimes funny and extremely well-written. It turns out that [...]]]></description>
			<content:encoded><![CDATA[<p><em>Updated: WSJ&#8217;s Zach Seward <a href="http://blog.thescoop.org/archives/2011/03/23/a-feed-for-wsj-a-heds/#comment-164533">points out</a> that there is <a href="http://online.wsj.com/xml/rss/3_7826.xml">an official A-Hed feed</a>, albeit not promoted.</em></p>
<p>Most folks in journalism know about the Wall Street Journal&#8217;s famed <a href="http://online.wsj.com/article/SB10001424052702303362404575580494180594982.html?mod=WSJ_Ahed_RIGHTTopCarousel_1">&#8216;A-Hed&#8217; stories</a> &#8211; the ones that used to be in the middle column of the paper that are off-beat, sometimes funny and extremely well-written. It turns out that the Journal has <a href="http://online.wsj.com/public/page/page-one-ahed.html">a landing page for those pieces</a>, which is a great idea. My Northwestern colleague Matt Mansfield pointed it out to me after reading <a href="http://online.wsj.com/article/SB10001424052748704076804576180681670512722.html?mod=WSJ_Ahed_AutomatedTypes">one such piece</a> by <a href="http://twitter.com/#!/dannyyadron">Danny Yadron</a>, a former student.</p>
<p>It&#8217;s nice to have a single spot for the A-Heds, but alas, there is no feed, as my former colleague Amanda Zamora <a href="http://twitter.com/#!/amzam/status/49840130835496960">pointed out</a>. Problem solved. Thanks to Google App Engine and Python, <a href="http://wsjaheds.appspot.com/">there is now</a>. It&#8217;s updated daily and the code (such as it is) can be found on <a href="https://github.com/dwillis/AHeds">Github</a>. Add it to your feed reader today!</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/03/23/a-feed-for-wsj-a-heds/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Trying Out Exhibit for WordPress</title>
		<link>http://blog.thescoop.org/archives/2011/03/11/trying-out-exhibit-for-wordpress/</link>
		<comments>http://blog.thescoop.org/archives/2011/03/11/trying-out-exhibit-for-wordpress/#comments</comments>
		<pubDate>Sat, 12 Mar 2011 01:48:32 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5549</guid>
		<description><![CDATA[At the invitation of David Karger and his team at MIT, I&#8217;ve been playing around with the WordPress plugin for Exhibit to do some basic data visualizations on this blog. I got a chance to meet and talk to Karger in Raleigh last month and hear about his work on Exhibit and other projects. Here&#8217;s [...]]]></description>
			<content:encoded><![CDATA[<p>At the invitation of <a href="http://people.csail.mit.edu/karger/">David Karger</a> and his team at MIT, I&#8217;ve been playing around with the <a href="http://code.google.com/p/wordpress-exhibit/">WordPress plugin for Exhibit</a> to do some basic data visualizations on this blog. I got a chance to meet and talk to Karger in Raleigh last month and hear about his work on <a href="http://www.simile-widgets.org/exhibit/">Exhibit</a> and other projects.</p>
<p>Here&#8217;s a basic example of a chart using the WordPress plugin; it charts the percentage of drives that ended in touchdowns for six not-quite-randomly-selected college football teams from the 2010 season (data from <a href="http://www.cfbreference.com/">College Football Reference</a>).</p>
<p><iframe src='http://blog.thescoop.org/wp-content/plugins/datapress/wp-exhibit-only.php?iframe&exhibitid=1&postid=5549&currentview=inline' width='100%' height='700' scrolling='auto' frameborder='0'>
                                      <p>Your browser does not support iframes.</p>
                                      </iframe><p><b>Note: This post contains an interactive data presentation that may not show up in your feed reader.</b> For the full experience, visit <a href='http://blog.thescoop.org/archives/2011/03/11/trying-out-exhibit-for-wordpress/'>this article</a> in your web browser.</p>  <ul><li> <a href='http://spreadsheets.google.com/feeds/list/0AkEVCRLyVeSJdEdxTGxtT21qamVsOEV3R0JmMXhlMGc/od6/public/basic?alt=json-in-script'>drive outcomes - Google &#25991;&#26723;</a>
</ul></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/03/11/trying-out-exhibit-for-wordpress/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>What APIs Mean for Data Journalists</title>
		<link>http://blog.thescoop.org/archives/2011/03/06/what-apis-mean-for-data-journalists/</link>
		<comments>http://blog.thescoop.org/archives/2011/03/06/what-apis-mean-for-data-journalists/#comments</comments>
		<pubDate>Mon, 07 Mar 2011 02:05:08 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Journalism]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5566</guid>
		<description><![CDATA[Anthony DeBarros of USA Today and I talked about APIs at this year&#8217;s CAR conference in Raleigh. We got a lot of &#8220;Web people&#8221;, to use a lame expression, in the audience. If you&#8217;re a reporter who works with data, why should you care? The simple answer is that APIs are an extension of what [...]]]></description>
			<content:encoded><![CDATA[<p>Anthony DeBarros of USA Today and I <a href="http://cwu.me/fXBFLr">talked about APIs</a> at this year&#8217;s CAR conference in Raleigh. We got a lot of &#8220;Web people&#8221;, to use a lame expression, in the audience. If you&#8217;re a reporter who works with data, why should you care?</p>
<p>The simple answer is that APIs are an extension of what reporters do every day: ask questions. The difference is that instead of forcing reporters to gather data from multiple sources, format it to fit their local database needs and then update that database when new releases are available, APIs allow reporters to query live data from all over the Web. If you have experience working with, say, Microsoft Access and setting up an ODBC connection to a remote database, APIs are kind of like that &#8211; except that you have near-instant access to more sources of data, more useful tools (like geocoders) and more timely information than ever before.</p>
<p>My path working with data went something like this: spreadsheets came first, which I routinely describe as the &#8220;gateway drug&#8221; of computer-assisted reporting. Some people become such Excel wizards that it almost doesn&#8217;t make sense for them to move beyond that expertise; there is so much you can do in a spreadsheet that alone it would be worth the time to learn. But there were things about spreadsheets that annoyed and frustrated me. Pivot tables were a clumsy fit for me &#8211; they got me close to what I wanted in many instances but never quite there. And so I moved onto databases.</p>
<p>Databases are still one of my favorite things. They are powerful, relatively flexible and range in utility from the ultra-portable SQLite to the transactional goodness that is PostgreSQL. But they take time and effort to build, maintain and &#8211; perhaps most importantly in the long run &#8211; connect to additional sources of information. APIs are not a complete solution to these problems, but they provide a very good one that data journalists should be familiar with and consider incorporating into their work.</p>
<p>A simple example is the reporter who wants to track the votes of his or her state&#8217;s delegation in Congress. There are several APIs for this data, including <a href="http://developer.nytimes.com/docs/read/congress_api">the one I work on</a> and another by <a href="http://www.opencongress.org/api">OpenCongress</a>. The reporter could build a database of these votes by hand or write scripts to parse the House and Senate vote data and insert them into it. But why, when the data is freely available via HTTP?</p>
<p>It can&#8217;t be that simple, can it? Well, no. But it can be simpler. The data you get from APIs usually comes in XML or JSON. Data journalists have, for better or worse, been dealing with XML for a while now. JSON may be less familiar, but it is quite nice to deal with and there are plenty of libraries with which to do so. But even better than that is the fact that other people have already solved that problem for you. Not long after we released the NYT Congress API I noticed <a href="https://github.com/hoverbird/ny-times-congress">a Ruby client library for it on Github</a>. I had never met the author; he had never contacted me. Just the same, he made it easier for people using Ruby to query the API and get back data. There&#8217;s also an <a href="https://github.com/eyeseast/python-nytcongress">excellent Python library for it</a>, written by NPR&#8217;s <a href="http://www.chrisamico.com/about/">Chris Amico</a>.</p>
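<p>As a rough sketch of what working with JSON looks like in Python: the endpoint, parameters and response below are all invented for illustration and are not the actual format of the NYT Congress API or any other. The point is only that a JSON response becomes ordinary lists and dictionaries once parsed.</p>

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint and parameters, invented for illustration;
# a real API documents its own URL scheme and usually requires a key.
BASE_URL = "http://api.example.org/congress/votes.json"
query = urlencode({"state": "MD", "chamber": "house", "api-key": "YOUR_KEY"})
request_url = BASE_URL + "?" + query

# A made-up response body, standing in for what the server would send back.
response_body = """
{
  "results": [
    {"member": "Rep. A", "state": "MD", "position": "Yes"},
    {"member": "Rep. B", "state": "MD", "position": "No"},
    {"member": "Rep. C", "state": "MD", "position": "Yes"}
  ]
}
"""

data = json.loads(response_body)

# Tally how the delegation voted.
tally = {}
for vote in data["results"]:
    tally[vote["position"]] = tally.get(vote["position"], 0) + 1

print(request_url)
print(tally)
```

<p>A client library like the ones mentioned above wraps these steps, building the URL, fetching it and parsing the result, so that you don&#8217;t have to write them yourself.</p>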
<p>Thus can you, the data journalist, benefit from other people who need and use APIs. Check out <a href="https://github.com/opengovernment/govkit">GovKit</a>, a Ruby wrapper to multiple government and political APIs, created by the folks at the <a href="http://www.participatorypolitics.org/">Participatory Politics Foundation</a>. Go play with it, and figure out what sorts of things you can do when the number of data sources you&#8217;re able to tap into multiplies overnight. The possibilities for journalists are only limited by the kinds of questions we can imagine and try to answer. APIs can make it easier to act on that greatest of questions: What if?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/03/06/what-apis-mean-for-data-journalists/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>


