<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>The Scoop</title>
	<atom:link href="http://blog.thescoop.org/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.thescoop.org</link>
	<description>Derek Willis' weblog on investigative and computer-assisted reporting.</description>
	<lastBuildDate>Sun, 13 May 2012 22:59:09 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Lost in the Weeds</title>
		<link>http://blog.thescoop.org/archives/2012/05/13/lost-in-the-weeds/</link>
		<comments>http://blog.thescoop.org/archives/2012/05/13/lost-in-the-weeds/#comments</comments>
		<pubDate>Sun, 13 May 2012 20:53:03 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Campaign Finance]]></category>
		<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5709</guid>
		<description><![CDATA[The indefatigable Alex Howard posted a link today about a draft academic paper on open source and journalism by Nikki Usher of George Washington University and Seth Lewis of the University of Minnesota-Twin Cities. Alex&#8217;s tweets are worth a look, so I pulled up the paper and began reading. Although I didn&#8217;t finish graduate school, [...]]]></description>
			<content:encoded><![CDATA[<p>The indefatigable Alex Howard <a href="https://twitter.com/digiphile/status/201712378356838401">posted a link</a> today about <a href="http://conservancy.umn.edu/bitstream/123292/1/Usher-Lewis%20ICA%202012.pdf">a draft academic paper on open source and journalism</a> by Nikki Usher of George Washington University and Seth Lewis of the University of Minnesota-Twin Cities. Alex&#8217;s tweets are worth a look, so I pulled up the paper and began reading.</p>
<p>Although I didn&#8217;t finish graduate school, I have written my fair share of academic papers, so I&#8217;m not a complete novice in the area. And despite the fact that <a href="https://twitter.com/nikkiusher/status/201721394898280449">it is a draft</a>, as Usher points out, the idea that you would post even a draft on the web and then <a href="https://twitter.com/nikkiusher/status/201721394898280449">profess surprise</a> when someone links to it or reads it is a little off, particularly for someone writing about a) journalism and b) open source. If you weren&#8217;t expecting critics, maybe it&#8217;s a good idea not to let them see the material. <em>Update: in fairness, Nikki Usher was not aware that the paper was online at all.</em> (I should also note that criticism is my full-time non-job; I probably should have sought work in the field, not due to any real talent on my part but based purely on personal enthusiasm.)</p>
<p>There is a good bit to like in the paper, which will be presented this month at the International Communication Association&#8217;s annual convention. There is a tendency for those working at the intersection of technology and journalism to focus on the tools &#8211; the stuff that actually exists now &#8211; rather than systemic changes to journalism itself (there are, ahem, <a href="http://blog.thescoop.org/thefix/">some exceptions</a> to this tendency). In part that&#8217;s because systemic changes are a hard problem too easily pushed to the background by the demands of doing journalism today. But it&#8217;s also in part because incremental changes, as the authors note, can be valuable in changing the whole. But, ok, I get it, and in many ways agree (except for the utopian one-open-source-CMS-to-rule-them-all idea &#8211; can we just kill that fantasy and focus on making information portable?).</p>
<p>Then I read the part about <a href="http://nytimes.github.com/Fech/">Fech</a>. As a contributor to the project, I am admittedly biased in its favor, but I do not think my reaction is solely or even mainly based on that fact. Here&#8217;s what they wrote about the project:</p>
<blockquote><p>The New York Times developed Fech, a tool that helps journalists crawl financial disclosures by political candidates just by knowing a filing number (Strickland, 2011). Just as the discourse around open source tools emphasizes their pro-social benefits, Fech’s creators note that more access to these filings will lead to better journalism. But Fech also gives one more tool to journalists eager for the horse-race style journalism that is divisive and counterproductive for democracy (Patterson, 1993; Cappella &#038; Jamieson, 1997). There is another problem with what could be a strength of Fech: While the source code is posted on Github for other developers, the tool has been built to help people in the newsroom, not to encourage participation by ordinary people.</p></blockquote>
<p>Where to begin? First, and perhaps least important, it vastly understates what Fech enables journalists and developers to do, particularly with regard to what can be done <em>programmatically</em> with disclosure data. I do not think that it is a stretch to say that the easier it is to examine and search campaign finance disclosures, the easier it will be for reporters and the public to discover interesting and useful pieces of information. Indeed, the use of Fech by news organizations like <a href="http://www.propublica.org/article/campaign-spending-shows-political-ties-self-dealing">ProPublica</a>, <a href="http://www.reuters.com/article/2012/05/10/us-usa-campaign-superpacs-idUSBRE8490K820120510">Reuters</a> and The Associated Press &#8211; to say nothing of our use at The Times &#8211; has borne that out.</p>
<p>But here&#8217;s the line that got my back up: &#8220;But Fech also gives one more tool to journalists eager for the horse-race style journalism that is divisive and counterproductive for democracy.&#8221; The evidence for that? There isn&#8217;t any. This is pure speculation; I would argue that it is refuted by the examples I just cited. Can the authors &#8212; or anyone &#8212; cite an example of where Fech has been used to enable more horse-race style journalism (in its pejorative sense, which is what I assume the authors meant)? I&#8217;m 17 years out of graduate journalism school, but I&#8217;m pretty sure that assertions like that need a bit more than a citation to work that says horse-race journalism is bad for democracy. In theory, Fech could be used to run nuclear reactors, I guess, but since <em>there is no evidence of that actually happening</em>, I&#8217;m going to discount that as a possibility.</p>
<p>And finally, the authors write: &#8220;While the source code is posted on Github for other developers, the tool has been built to help people in the newsroom, not to encourage participation by ordinary people.&#8221; Well, yes and no. I would be hard-pressed to describe those people interested in campaign finance data as &#8220;ordinary,&#8221; but we open sourced Fech so that it could be used in newsrooms and in any other situation. I don&#8217;t understand how exactly the authors presume that we built it only to help people in the newsroom, or to discourage participation by non-newsroom folks. The fact that the contributors to Fech come mainly (but not exclusively) from newsrooms that cover campaigns is understandable to me (and to <a href="https://twitter.com/thejefflarson/status/201734341200588800">Jeff Larson</a>). I&#8217;m not entirely clear how what we&#8217;ve done makes it harder for non-newsroom people to participate. I&#8217;d love to read about that, but there are no examples or further discussion in the paper (nor was any user of Fech contacted by the authors, from what I can tell. I wonder if they installed it and tried it themselves).</p>
<p>Usher, to her credit, offered several explanations as to why this passage was in the draft. They include the fact that this paper, like many, undergoes <a href="https://twitter.com/nikkiusher/status/201740095206862848">blind review</a>. Fair enough, but it&#8217;s worth asking whether the reviewers are able to evaluate these claims, since most of them could be debunked by reading <a href="http://open.blogs.nytimes.com/?s=campaign+finance">posts on Open</a>. It also, like many of my own projects, seemed to have been <a href="https://twitter.com/nikkiusher/status/201740279647186945">a bit of a rush job</a>. I know all about that, but was it really so difficult to talk to anyone involved in Fech? Finally, my objections, however valid, don&#8217;t damage the overall point of the paper, but reflect the possibility that I &#8220;<a href="https://twitter.com/nikkiusher/status/201741437094735873">may be getting lost in the weeds</a>.&#8221;</p>
<p>That&#8217;s the tricky thing about journalism, data and even open source. Weeds matter. If you get the weeds wrong, the eventual result usually suffers. If I&#8217;m lost in the weeds, maybe the garden needs some attention.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2012/05/13/lost-in-the-weeds/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Our Mark Knoller Problem</title>
		<link>http://blog.thescoop.org/archives/2012/05/01/our-mark-knoller-problem/</link>
		<comments>http://blog.thescoop.org/archives/2012/05/01/our-mark-knoller-problem/#comments</comments>
		<pubDate>Tue, 01 May 2012 12:14:22 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Journalism]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5703</guid>
		<description><![CDATA[My colleagues at The Times (and other folks I know who cover the White House) tell me that Mark Knoller, the CBS Radio reporter who reports on the president, is a genuinely nice man and someone who has always been extraordinarily generous about sharing what he knows with other news organizations. Knoller is such a [...]]]></description>
			<content:encoded><![CDATA[<p>My colleagues at The Times (and other folks I know who cover the White House) tell me that <a href="http://www.cbsnews.com/8301-18564_162-524942/mark-knoller/">Mark Knoller</a>, the CBS Radio reporter who reports on the president, is a genuinely nice man and someone who has always been extraordinarily generous about sharing what he knows with other news organizations. Knoller is such a fixture at the White House &#8211; he&#8217;s covered every administration since Gerald Ford&#8217;s &#8211; that he&#8217;s moved beyond simply being a reporter <a href="http://dekerivers.wordpress.com/2011/08/18/meet-cbs-radios-mark-knoller-who-counts-president-obamas-vacation-days/">into the realm of providing a public service</a>: he&#8217;s very often cited in other outlets&#8217; stories about presidential travel. A sample:</p>
<blockquote><p>
CBS&#8217;s Mark Knoller, <a href="http://www.huffingtonpost.com/2012/04/25/obama-travel-gop-fraud-complaintl_n_1453713.html">who keeps detailed notes on Obama&#8217;s travels</a>, recently told The New York Times that since the president filed for re-election, he&#8217;s taken 60 domestic trips and 26 of them involved fundraisers.</p>
<p>But Mark Knoller of CBS, <a href="http://www.suntimes.com/news/politics/7127735-418/obamas-vacation-ripped-but-hes-taken-less-time-off-than-bush.html">the unofficial keeper of presidential work schedules</a>, reported that President George W. Bush had taken more time off than Obama at this point in his first term.</p>
<p>According to <a href="http://washingtonexaminer.com/politics/washington-secrets/2012/02/michelles-ski-trip-marks-16-obama-vacations/294051">presidential watcher Mark Knoller of CBS</a>, George W. Bush, at this time of his presidency, had made 30 visits to his Texas ranch spanning all or part of 220 days. The Obama&#8217;s vacation day count is less than half of that.
</p></blockquote>
<p>This isn&#8217;t about Knoller as a person or as a reporter. It&#8217;s just that this situation &#8211; where one person has become the official source of public knowledge about the travels of the President of the United States &#8211; is far from ideal. Forget that the government is occasionally off-base on presidential travel statistics; how is it that other news organizations, including my own, have relied on a system in which one person &#8211; however diligent and generous &#8211; holds such important information?</p>
<p>From an information management standpoint, having Knoller be the keeper of presidential travel information is not only inefficient &#8211; what happens if Knoller is on vacation, or busy? &#8211; but makes it harder to regularly review the data or incorporate it into other inquiries. In reality, this is our problem, not Knoller&#8217;s, and his generosity has enabled us to carry on as if we&#8217;d been collecting this information all the time. But we haven&#8217;t. It&#8217;s easy enough to just ask Knoller, especially since we don&#8217;t use the information all that often.</p>
<p>We&#8217;re not talking about uncovering classified information here, but the daily whereabouts of the President of the United States. And yet somehow, every other news organization has decided that it&#8217;s perfectly ok not to have this information at its fingertips. It probably won&#8217;t happen as long as Knoller remains in his job, but what happens if someday CBS decides not to share that information anymore? Or Knoller decides he&#8217;s tired of doing this and retires? In the &#8220;weak link in the chain&#8221; scenario, the rest of us are the weak links, not him. He&#8217;s doing his part. Why are we shirking ours?</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2012/05/01/our-mark-knoller-problem/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Programmer-Reporter</title>
		<link>http://blog.thescoop.org/archives/2012/04/21/the-programmer-reporter/</link>
		<comments>http://blog.thescoop.org/archives/2012/04/21/the-programmer-reporter/#comments</comments>
		<pubDate>Sat, 21 Apr 2012 22:48:34 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Journalism]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5658</guid>
		<description><![CDATA[Update: If you want a better visual presentation of this idea, check out Ben Welsh&#8217;s ISOJ presentation. I finally have something tangible from work to show to my mother: an A1, above-the-fold story in today&#8217;s New York Times. It doesn&#8217;t really help explain what I do, but it&#8217;s something that&#8217;s a bit easier to understand [...]]]></description>
			<content:encoded><![CDATA[<p><em>Update: If you want a better visual presentation of this idea, check out <a href="http://www.youtube.com/watch?v=iP-On8PzEy8">Ben Welsh&#8217;s ISOJ presentation</a>.</em></p>
<p>I finally have something tangible from work to show to my mother: an <a href="http://www.nytimes.com/2012/04/21/us/politics/obama-campaign-faces-dropoff-in-big-donations.html">A1, above-the-fold story in today&#8217;s New York Times</a>. It doesn&#8217;t really help explain what I do, but it&#8217;s something that&#8217;s a bit easier to understand than, say, a listing of git commits.</p>
<p>A colleague of mine at The Times, Michael Strickland, <a href="https://twitter.com/moriogawa/status/193733886432387072">responded on Twitter</a>: &#8220;Enough about the designer-programmer. More about the programmer-writer.&#8221;</p>
<p>As usual, he&#8217;s onto something, particularly when it comes to news organizations. <a href="http://en.wikipedia.org/wiki/Literate_programming">Literate programming</a> has been around quite a while, and I&#8217;m lucky enough to work with people who approach code in a way that seems closer to artistry than to engineering. I&#8217;m no expert on such things. But I was a full-time reporter for nearly a decade, and I still do my share of reporting, and that&#8217;s where I see the greater potential of applying programming: the Programmer-Reporter.</p>
<p>Let&#8217;s stipulate right up front that I, like a lot of folks, am a sucker for <a href="http://www.washingtonpost.com/wp-dyn/content/article/2008/10/04/AR2008100402333.html">an Anne Hull story</a>, the kind built on hours and hours of listening, watching and reflecting. What follows is no knock on what my former Palm Beach Post colleague Ron Hayes used to call &#8220;notebook-assisted reporting&#8221; built on talking to people, writing down what they say and then turning that into a great story. The only reason this post is not about that kind of journalism is that I was never really much good at it.</p>
<p>There are other stories to tell, and other ways to find them. I usually tell classes that I teach that if any of them can write like Hull or has the source development skills of, say, Bob Woodward, then they probably don&#8217;t need to learn what I&#8217;m teaching them. But the rest &#8212; the vast majority, from my experience &#8212; may want to pay attention.</p>
<p>A lot of daily beat reporting &#8211; from sports to government to business &#8211; relies heavily on reporters knowing the habits and schedules of the people and institutions they cover. Certain events happen in a relatively predictable pattern and a lot of the reporting revolves around keeping tabs on them. But news, the stuff we talk about, often consists of things that go against that pattern, the unusual event in a sea of regularity.</p>
<p>It follows then, that journalists should prize methods that would help unearth such anomalies, those needles in haystacks that we hold dear. Some do, and in other cases there are few real methods other than examining every document or attending every meeting. But way too often, across topics and beats, we remain unaware of or ignore practices that could help us spot news and make sense of the larger picture. If we have a system of story development, it&#8217;s a system that seems to value serendipity and entropy. Meanwhile, Donald Rumsfeld&#8217;s line about &#8220;known unknowns&#8221; remains stubbornly in effect.</p>
<p>This is true even in areas where reporting relies heavily on data, such as political campaigns. Many of the stories relating to campaign contributions, for example, are a result of reporters meticulously poring over pages of filings, applying the Potter Stewart test: &#8220;I&#8217;ll know it when I see it.&#8221; This is, in too many cases, a waste of time, since we often do have an idea of what we&#8217;re looking for, but believe in this idea of &#8220;data serendipity&#8221; when practice shows us that asking specific questions, or at least about specific ideas, is a better way to go. The easiest question for any source to deflect is, &#8220;Anything interesting going on?&#8221; Unfortunately, it&#8217;s also the easiest one to ask.</p>
<p>One thing that I&#8217;ve learned from writing software is that you don&#8217;t really want to &#8220;make news,&#8221; as it were. Predictability is a good thing, and edge cases &#8211; when things get weird or different &#8211; are what you want to avoid. Reporting, on the other hand, seeks out the edge cases, the departures from the norm. How to make those two come together? Here&#8217;s a way: make it possible to expect the unexpected when it comes to analyzing patterns in data. What we need are easily configurable systems that enable reporters to ask questions of data in a consistent manner and then provide results in a way that makes sense for journalists. </p>
<p>I&#8217;m not talking about bland TPS reports but something that turns a piece of data into a potential story. For example, if a local congressman has received donations from executives of XYZ Corp. every March in previous election years, but not this year, then that&#8217;s potentially newsworthy; maybe XYZ isn&#8217;t giving as much, but maybe they no longer support the local politician, or are a bellwether of a lack of business support. In this case, the absence of data &#8211; something that&#8217;s very hard for people to spot in pages of filings &#8211; is the trigger event that can cause a reporter to follow up.</p>
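<p>As a rough illustration of that kind of check, here is a minimal Python sketch. The records, employer names and the <code>missing_regulars</code> function are all hypothetical, and real filings would need parsing and name standardization before anything like this could work:</p>

```python
from collections import defaultdict

# Hypothetical contribution records: (donor_employer, year, month).
# Real filings would need parsing and name standardization first.
contributions = [
    ("XYZ Corp.", 2008, 3),
    ("XYZ Corp.", 2010, 3),
    ("Acme Inc.", 2008, 3),
    ("Acme Inc.", 2010, 3),
    ("Acme Inc.", 2012, 3),
]


def missing_regulars(records, month, past_years, current_year):
    """Return employers that gave in `month` of every past year but not this year."""
    years_by_employer = defaultdict(set)
    for employer, year, rec_month in records:
        if rec_month == month:
            years_by_employer[employer].add(year)
    return sorted(
        employer
        for employer, years in years_by_employer.items()
        if set(past_years).issubset(years) and current_year not in years
    )


print(missing_regulars(contributions, month=3, past_years=[2008, 2010], current_year=2012))
```

<p>With this made-up data, only XYZ Corp. is flagged, because Acme Inc. gave again in 2012. The point is that the absence becomes something a program reports, not something a reporter has to notice.</p>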
<p>As much as journalists love to wax about serendipity, much of life is based on our habits and patterns, which are predictable enough to be tested against data. The same idea applies to scenarios that may not have happened in the past but could be defined and applied to the data. Reporters are testing out theories all the time, often by calling up sources and putting a question or theory out there. There&#8217;s no reason why we can&#8217;t enable the same ease of inquiry with data. In fact, it seems possible to do it on a much broader scale and in more precise ways.</p>
<p>Will this lead directly to stories? Not in many cases. Again, this is a typical process for reporters: float an idea, test its merits and polish. Rinse and repeat. If we could build interfaces that would assist in that process by making it easier to ask questions (or define patterns), how many more stories could we find? How many known unknowns could be crossed off the list?</p>
<p>A lot, I think. And despite the thrill of seeing your name on the front page, this is the thing that keeps me energized. This is where programmers can make an impact &#8211; either by <a href="https://github.com/bycoffe/fec-guide">diving into a subject area</a> or pairing with reporters who have that expertise &#8211; by making it easier to find better stories. The news is out there, in most cases. Finding it, nailing it down, that&#8217;s the challenge. It is, in programming terms, a hard problem. But the programmers I know love to tackle hard problems.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2012/04/21/the-programmer-reporter/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Investigating House Freshmen Voting Patterns</title>
		<link>http://blog.thescoop.org/archives/2012/03/23/investigating-house-freshmen-voting-patterns/</link>
		<comments>http://blog.thescoop.org/archives/2012/03/23/investigating-house-freshmen-voting-patterns/#comments</comments>
		<pubDate>Fri, 23 Mar 2012 14:16:57 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Fed Data]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5690</guid>
		<description><![CDATA[One of the great things about what is known as computer-assisted reporting is the chance it offers to prove or disprove conventional wisdom. To separate anecdote &#8211; no matter how compelling &#8211; from reality. My colleague Jennifer Steinhauer and I got an opportunity to do that last weekend using House voting data that we collect [...]]]></description>
			<content:encoded><![CDATA[<p>One of the great things about what is known as computer-assisted reporting is the chance it offers to prove or disprove conventional wisdom. To separate anecdote &#8211; no matter how compelling &#8211; from reality. My colleague Jennifer Steinhauer and I <a href="http://www.nytimes.com/2012/03/17/us/politics/house-freshmen-not-as-defiant-as-their-reputation-suggests.html">got an opportunity to do that last weekend</a> using House voting data that we collect and make available via <a href="http://developer.nytimes.com/docs/read/congress_api">The New York Times Congress API</a>.</p>
<p>The working premise going into the story was that, despite the headlines garnered by the freshman class of Republican lawmakers in the House, it wasn&#8217;t clear that they were mainly responsible for opposing the GOP leadership, particularly House Speaker John Boehner. Jennifer first outlined a series of votes that attracted Republican opposition (in the House, the majority&#8217;s first obligation is to satisfy most, if not all, of its caucus. A significant number of &#8220;No&#8221; votes from the majority&#8217;s ranks is a clear sign that there are some issues to resolve). Then I pulled the vote data from our API, although the same data is available from other sources, too.</p>
<p>We looked at the freshmen Republicans as a voting bloc, and then at members of the <a href="http://rsc.jordan.house.gov/">Republican Study Committee</a>, the largest sub-group within the House GOP. Since such organizations aren&#8217;t technically legislative committees, we had to rely upon <a href="http://rsc.jordan.house.gov/AboutRSC/Members/">its list of members</a> (which turned out to add a wrinkle to the story). Here&#8217;s what we found:</p>
<blockquote><p>But an analysis of voting patterns on the most contentious bills in the 112th Congress shows that House members of the Republican Study Committee — a group of both veterans and newcomers that meets weekly to hammer out a conservative agenda — <a href="http://www.nytimes.com/imagepages/2012/03/17/us/17conferencegrx.html">have cast the bulk of &#8220;no&#8221; votes on big bills</a>, including those important to Speaker John A. Boehner of Ohio.</p>
<p>The freshmen who have joined the study committee — which was founded in 1973 — play an important role in its renewed clout, having increased its membership to 163 from roughly 110 two years ago. As a group, however, the freshmen are less homogenous and less apt to buck the leadership than the study committee itself is as a whole.</p></blockquote>
<p>It was relatively straightforward, as data analysis goes, since all we needed to do was pick the universe of votes, pull the data and identify freshmen and RSC members. Identifying freshmen lawmakers seems pretty easy, but there are a few questions to bear in mind. For example, are lawmakers who were elected in special elections before November 2010 freshmen? Are former members who reclaimed their seats in 2010 freshmen? What about people elected after 2011? (We said yes in all cases, which enlarges the usual number cited as the 2010 freshman class).</p>
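<p>The tallying step can be sketched in a few lines of Python. This is not the code we used; the member IDs, vote positions and the <code>no_votes_by_group</code> function are invented for illustration, and a real analysis would pull positions from a vote data source and key members by a stable ID:</p>

```python
from collections import Counter

# Hypothetical member IDs; a real analysis would use Bioguide IDs or similar.
freshmen = {"A", "B", "C"}
rsc_members = {"B", "C", "D", "E"}

# Hypothetical roll calls: each maps a Republican member ID to a vote position.
votes = [
    {"A": "Yes", "B": "No", "C": "Yes", "D": "No", "E": "No"},
    {"A": "Yes", "B": "Yes", "C": "No", "D": "No", "E": "Yes"},
]


def no_votes_by_group(roll_calls, groups):
    """Count "No" positions cast by members of each named group."""
    tally = Counter()
    for positions in roll_calls:
        for member, position in positions.items():
            if position != "No":
                continue
            # A member who belongs to several groups counts once in each.
            for name, members in groups.items():
                if member in members:
                    tally[name] += 1
    return dict(tally)


result = no_votes_by_group(votes, {"freshmen": freshmen, "rsc": rsc_members})
print(result)
```

<p>Because the groups overlap, a freshman who is also in the RSC is counted in both tallies, so the two numbers should be compared, not added together.</p>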
<p>With RSC members, it was a little more involved, in that an <a href="http://web.archive.org/web/20110717124830/http://rsc.jordan.house.gov/AboutRSC/Members/">earlier version of the RSC member page</a> indicated that membership had changed somewhat since last July. That became an element of the story:</p>
<blockquote><p>Since then, over a dozen members have resigned from the study group, including a few freshmen. In interviews, some said that the episode had angered them and that they had tired of the committee’s attempts to define who is worthy of being called a conservative. Most were afraid to tackle the group on the record.</p></blockquote>
<p>Vote analysis can seem like a dry exercise &#8211; there are so many votes, and most of them are fairly lopsided affairs. But voting is important not only for the individual decisions that lawmakers take, but for what it can tell us about the collective behavior of members.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2012/03/23/investigating-house-freshmen-voting-patterns/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>On Legislative Data Transparency</title>
		<link>http://blog.thescoop.org/archives/2012/02/04/on-legislative-data-transparency/</link>
		<comments>http://blog.thescoop.org/archives/2012/02/04/on-legislative-data-transparency/#comments</comments>
		<pubDate>Sun, 05 Feb 2012 02:35:38 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Fed Data]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5667</guid>
		<description><![CDATA[This week I was honored to speak at the Legislative Data and Transparency Conference put on by the Committee on House Administration. If you&#8217;re so inclined, the videos of the presentations are online at the conference site, although I must warn you that they contain heavy doses of XML references and other fun stuff. [...]]]></description>
			<content:encoded><![CDATA[<p>This week I was honored to speak at the <a href="http://cha.house.gov/about/contact-us/legislative-data-conference">Legislative Data and Transparency Conference</a> put on by the <a href="http://cha.house.gov">Committee on House Administration</a>. If you&#8217;re so inclined, the videos of the presentations are online at the conference site, although I must warn you that they contain heavy doses of XML references and other fun stuff. What follows is not my presentation, strictly speaking, but most of it along with some other thoughts.</p>
<p>Being a former Congressional Quarterly staffer, I have an innate fondness for House Admin, one of the lesser-known committees but one with a large influence over what kinds of information the public can see about the House side of the legislative process. The committee has jurisdiction over the Library of Congress, which by extension means <a href="http://thomas.loc.gov/home/thomas.php">Thomas</a>, the online home of so much Congressional information.</p>
<p>There are many other posts about the general desires of those folks committed to transparency when it comes to Congress, but Daniel Schuman of the Sunlight Foundation <a href="http://sunlightfoundation.com/blog/2012/02/02/benchmarks-for-measuring-success-for-legislative-data-transparency/">sums them up pretty well</a>: &#8220;To the maximum extent possible, legislative information must be available online, in real time, and in machine readable formats.&#8221;</p>
<p>I don&#8217;t disagree, and I am sympathetic to complaints that Congress has been slow to address the availability of bulk data. People such as <a href="http://www.quora.com/How-does-Joshua-Tauberers-Congress-tracker-www-GovTrack-us-work-and-what-are-its-pros-and-cons">Josh Tauberer have been screen-scraping Thomas since 2004</a>, and I joined in the process a year later at washingtonpost.com. In 2012, we&#8217;re both still doing it, now joined by <a href="http://services.sunlightlabs.com/docs/Sunlight_Congress_API/">Sunlight</a>, <a href="http://www.opencongress.org/">OpenCongress</a> and who knows how many others (speaking of OpenCongress, if you want a less patient restatement of Schuman&#8217;s thoughts, <a href="http://www.opencongress.org/articles/view/2470-Liberate-OpenGovData-Now">OC&#8217;s David Moore has a stem-winder of a post for you</a>).</p>
<p>I, too, long for the day when I don&#8217;t have to wonder when my HTML parsers will break after a seemingly innocuous change to Thomas&#8217; styles, or when I don&#8217;t have to enter three different IDs for a new Senator (Bioguide, LIS and Thomas&#8217; own unique sequential number). But my presentation on Thursday concentrated on a more fundamental need. Before bulk data can become really useful, it has to be more consistent, understandable and accurate. Right now, if you&#8217;re not willing to put in a lot of time studying the quirks of Congress, you will always face the likelihood that your data, however lovingly collected, has plenty of errors.</p>
<p>For example, in the Senate it is possible for the Majority Leader and Minority Leader to alter the rules of math when it comes to how many senators constitute a three-fifths majority. The death of Sen. Ted Kennedy in 2009 reduced the number of Democrats in the chamber at that time to 59, and the total number of senators &#8220;duly elected and sworn&#8221; to 99. For votes requiring a three-fifths majority (thanks, Malcolm), a 99-member Senate would need 59.4 senators for passage, or at least 59. But the party leaders agreed to keep the three-fifths threshold at 60 votes throughout the period when the Senate had 99 senators, not 100. For much of that period, any three-fifths vote displayed on nytimes.com had the wrong number of votes required for passage, because I was relying on math. I could not find any place in the Congressional Record or anywhere else where this was documented.</p>
<p>An edge case, you might say. But when it comes to Congress, there are loads of them. A reporter called me several weeks ago to ask why a seemingly simple question about three members of her state&#8217;s delegation was maddeningly hard to answer. All three had been elected to the House the same year, and had served since then. But each of them had a different total number of votes he or she was eligible to cast. How could that be?</p>
<p>It took me a little while, but the only explanation I could find was that their dates of service had to differ in some way, and my guess was that not all of them were sworn in for each session on the same day. <a href="http://www.politico.com/news/stories/0111/47170.html">It happens</a>. Unfortunately, neither Thomas nor the <a href="http://clerk.house.gov/">Clerk of the House</a> provides an easy way to find out when a particular member was sworn in, despite the fact that it is a basic element of what makes someone a Member of Congress. At the conference, I heard someone say that it would be possible to provide a list of swearing-in dates for every lawmaker. That&#8217;s good, and needed, but it&#8217;s not good enough. I need, and the data demands, timestamps in this case. That&#8217;s the only way I can be sure of what votes a member was or was not eligible to vote on.</p>
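<p>To show why a date alone falls short, here is a minimal Python sketch. The member names, oath times and roll call times are all made up for illustration; no official source publishes them in this form, which is the problem.</p>

```python
from datetime import datetime

# Hypothetical data: when each member took the oath and when each roll call was held.
sworn_in = {
    "Member A": datetime(2011, 1, 5, 14, 30),
    "Member B": datetime(2011, 1, 7, 10, 0),
}
roll_calls = [
    (1, datetime(2011, 1, 5, 13, 0)),
    (2, datetime(2011, 1, 5, 16, 45)),
    (3, datetime(2011, 1, 6, 11, 15)),
    (4, datetime(2011, 1, 7, 15, 20)),
]

def eligible_votes(member):
    """Return the roll call numbers held at or after the member's swearing-in."""
    start = sworn_in[member]
    return [number for number, held_at in roll_calls if held_at >= start]

print(eligible_votes("Member A"))
print(eligible_votes("Member B"))
```

<p>In this invented example, a date-only comparison would credit Member A with roll call 1, which was held earlier on the same day the oath was taken. Only the time of day settles it.</p>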
<p>You might think that you could find the total number of House votes for a given year by looking at the Clerk&#8217;s votes site. <a href="http://clerk.house.gov/evs/2011/index.asp">In 2011, the last vote was roll call 949</a>. Alas, officially, there were 948 votes that year, <a href="http://clerk.house.gov/evs/2011/roll484.xml">because roll call 484 was vacated</a> and replaced by vote 485, and thus never really happened.</p>
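<p>The arithmetic is trivial once you know which roll calls were vacated. A short Python sketch, using only the two numbers cited above (the function name is my own invention):</p>

```python
# Numbers from the paragraph above: 2011 numbering ran to 949, and roll call 484 was vacated.
last_roll_call = 949
vacated = {484}

def official_vote_count(last_number, vacated_numbers):
    """Count roll calls 1..last_number, skipping any that were vacated."""
    return sum(1 for n in range(1, last_number + 1) if n not in vacated_numbers)

print(official_vote_count(last_roll_call, vacated))
```

<p>The hard part is not the subtraction but knowing that the set of vacated roll calls exists and has to be maintained by hand.</p>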
<p>In my presentation I cited a few other examples, but they mostly boil down to this: unless we can make congressional information easier to use and understand by people outside the small circle of legislative wonks, bulk data access by itself won&#8217;t solve our problems. Today the most creative uses of congressional information, such as <a href="http://capitolwords.org">Sunlight&#8217;s Capitol Words project</a>, suffer from this limitation. I love Capitol Words, but right now the Congressional Record &#8211; the source for it &#8211; cannot reliably tell me in a machine-readable form whether a particular word or phrase or speech was even spoken out loud on the floor of the House or Senate. That&#8217;s kind of a big deal, for reporters, historians and the public.</p>
<p>If we can&#8217;t use congressional data to answer what should be straightforward questions, or can&#8217;t agree on what the answers should be, providing immediate access to that data in bulk form may not be as helpful as we would think, and in some cases risks adding to the confusion. It may expose more of those problems, which is of some usefulness, but if the ultimate goal is not just access but understanding, we need to address the fundamental issues of accuracy and consistency before we switch on the firehose.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2012/02/04/on-legislative-data-transparency/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>What We Don&#8217;t Know About Elections</title>
		<link>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/#comments</comments>
		<pubDate>Tue, 18 Oct 2011 01:46:07 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Data]]></category>
		<category><![CDATA[Journalism]]></category>
		<category><![CDATA[Presentations]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5631</guid>
		<description><![CDATA[If you happened to be at the recent Online News Association conference in Boston and happened to attend the session on covering the 2012 elections, then a good bit of this will be repetitive. Since there wasn&#8217;t a ton of time to expand on what I said, and I don&#8217;t want to leave the impression [...]]]></description>
			<content:encoded><![CDATA[<p>If you happened to be at the recent <a href="http://ona11.journalists.org">Online News Association conference in Boston</a> and happened to attend the <a href="http://ona11.journalists.org/sessions/innovative-ways-to-cover-the-2012-election/">session on covering the 2012 elections</a>, then a good bit of this will be repetitive. Since there wasn&#8217;t a ton of time to expand on what I said, and I don&#8217;t want to leave the impression that I&#8217;m critical of all election coverage, consider this the write-through.</p>
<p>First, I stand by <a href="http://twitter.com/#!/kzhu91/status/117273879105384449">what I said</a> about how little we understand about the way that elections are won or lost these days. It&#8217;s not that political journalism has strayed from its roots, or stopped covering important elements of a modern campaign. It&#8217;s that the elements of a modern campaign have changed, and as journalists, we have not kept pace.</p>
<p>You might respond that campaigns still involve quite a lot of stuff that we <em>do</em> understand, such as debates and visits to state fairs and town hall meetings. True. But the nature of media and technology has brought extensive changes to the electoral system, and I don&#8217;t believe that we as journalists devote enough attention to understanding those changes. Remember the <a href="http://www.wired.com/wired/archive/12.01/dean.html">Dean campaign in 2003</a>? Most of the coverage focused on the online fundraising, staggering for the time, managed by some doctor from Vermont. But that Wired piece I referenced had it right; Dean&#8217;s accomplishment was less a mastery of the Internet than a willingness to embrace its fundamental aspect: you give up some control by bringing other people in, and you gain a host of possibilities. You may, of course, choose badly or falter in some other way, but the lessons and possibilities are becoming clear. At the time, as a Web geek who loved politics, I felt that journalists couldn&#8217;t really explain the Dean campaign, because it was so alien to us. Today&#8217;s campaigns make me long for the simplicity of 2003.</p>
<p>But let&#8217;s stick with fundraising for a bit. Political fundraising can be hugely expensive, because campaigns need to amass a large number of donors. Unless you&#8217;re the President, it&#8217;s hard to repeatedly gather the wealthiest Americans and have them fork over $2,500 or more for the pleasure of your company. So a smart campaign sticks with what works: direct mail is costly, for example, but it&#8217;s also effective. Telemarketing takes time and money, but it also works pretty well. Let&#8217;s not mess with the script too much. <a href="http://blog.optimizely.com/how-obama-raised-60-million-by-running-an-exp">But what if you <em>can</em> mess with the script</a>? Now it&#8217;s possible, even trivial, to experiment with Web site design or <a href="http://themonkeycage.org/blog/2011/08/23/rick-perrys-eggheads/">even advertisements</a> in order to gauge their effectiveness and improve upon them. President Obama had a <a href="http://www.youtube.com/watch?v=71bH8z6iqSc">Director of Analytics</a> for his 2008 campaign, and <a href="http://www.datashaping.com/jobs18843x.shtml">has been hiring data scientists experienced in predictive modeling</a>.</p>
<p>White men smoking cigars in cramped rooms making gut calls is how we&#8217;ve usually understood campaign decision-making. This? Whole new ballgame. Yes, there is still a mass audience that is shaped by the media and big events. But there are now thousands and thousands of &#8220;small&#8221; audiences &#8211; or rather, they always were there. Now campaigns can identify them and deliver precision messages to them. And they can find them online in different ways; an hour after <a href="http://twitter.com/#!/derekwillis/status/126084472666984448">posting on Twitter</a> about the Obama campaign&#8217;s use of Github, the <a href="http://twitter.com/#!/teddygoff">campaign&#8217;s Digital Director</a> was following me. And that&#8217;s the easy part.</p>
<p>While campaigns have a public presence that is mostly recorded and observed, the stuff that goes on behind the scenes is so much more sophisticated than it has been. In 2008 we were fascinated by <a href="http://www.jackandjillpolitics.com/2008/10/obama-launches-iphone-app-makes-everyone-a-campaign-worker/">the Obama campaign&#8217;s use of iPhones for data collection</a>; now we&#8217;re entering an age where campaigns don&#8217;t just collect information by hand, but harvest it and learn from it. An &#8220;<a href="http://www.targetpointconsulting.com/ToThePoint/2011/09/27/the-information-arms-race">information arms race</a>,&#8221; as GOP consultant Alex Gage puts it.</p>
<p>For most news organizations, the standard approach to campaign coverage is tantamount to bringing a knife to a gun fight. How many data scientists work for news organizations? We are falling behind, and we risk not being able to explain to our readers and users how their representatives get elected or defeated.</p>
<p>None of this is to say that we need to completely abandon our ways of covering elections. Horse-race coverage is and should be a part of campaign coverage, because in many respects elections are like horse races. Things can change rapidly, and small things can have big impact. We still should be on the ground, talking to voters, showing up at town halls and covering debates. We still need to show up and do the legwork.</p>
<p>But if we can&#8217;t appreciate, much less understand, what modern campaigns are doing to win elections, how can we hope to explain elections? If we don&#8217;t collect at least some of the information available to us &#8211; realizing that we can&#8217;t get our hands on everything that the campaigns do &#8211; we&#8217;ll miss the story. Elections will become even bigger surprises to us, and then how long will it be before readers start to ask whether we actually know the people and places we cover?</p>
<p>Surprises make the news. Some of my favorite stories from the 2004 presidential election are in <a href="http://www.amazon.com/One-Party-Country-Republican-Dominance/dp/0471776726">a book</a> by my friends Peter Wallsten and Tom Hamburger, then of the Los Angeles Times. Here&#8217;s <a href="http://www.latimes.com/news/opinion/commentary/la-op-hamburger25jun25,0,906381.story">one anecdote from the key state of Ohio</a>:</p>
<blockquote><p>One suburban African American woman in Ohio, for example, told us that though she tends to vote Democratic, she was deluged in 2004 with calls, e-mail messages and other forms of communication by Republicans who somehow knew that she was a mother with children in private schools, an active church attendee, an abortion opponent and a golfer.</p></blockquote>
<p>Think about what this kind of thing means. It means that we cannot assume that the campaign visible to the mass audience is the same campaign that&#8217;s being pitched to individuals and groups around the nation, and that winning coalitions can be built not just by harnessing large groups (unions, religious voters, etc.) but also by piecing them together in small units. President Bush&#8217;s margin in Ohio in 2004? <a href="http://www.nytimes.com/packages/html/politics/2004_ELECTIONRESULTS_GRAPHIC/">About 2.5 percent</a>. The only thing that I don&#8217;t like about this anecdote is that Wallsten and Hamburger&#8217;s book appeared nearly two years later. Is there any evidence that we as journalists have closed the gap since then?</p>
<p>To understand how elections are now being waged, we need to have as many of the tools as do the campaigns. We need to build our own storehouses of data &#8211; <a href="http://www.wakegov.com/elections/8data.htm">voter registration</a>, <a href="http://www.sos.georgia.gov/elections/voter_registration/voterhistory.asp">voter history</a>, Census, campaign finance, <a href="http://transition.fcc.gov/mb/audio/decdoc/public_and_broadcasting.html#_Toc202587585">advertisements</a> and more. We need to be able to tap into the rich stream of material that&#8217;s being created and disseminated every day. We need to be able to see the value in small data points that can lead to bigger things.</p>
<p>Elections are great stories. They deserve to be told from a position of confidence and knowledge. We have work to do.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/17/what-we-dont-know-about-elections/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>RemoteTable Is Your Friend</title>
		<link>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/#comments</comments>
		<pubDate>Wed, 05 Oct 2011 01:14:51 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Data]]></category>
		<category><![CDATA[Ruby]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5640</guid>
		<description><![CDATA[Assuming you regularly work with data found online &#8211; and if you don&#8217;t, you&#8217;re probably here by mistake, so welcome! &#8211; then you realize what a pain it can be to grab structured files from some site, save them and import them. I have more methods in more apps than I can count that download [...]]]></description>
			<content:encoded><![CDATA[<p>Assuming you regularly work with data found online &#8211; and if you don&#8217;t, you&#8217;re probably here by mistake, so welcome! &#8211; then you realize what a pain it can be to grab structured files from some site, save them and import them. I have more methods in more apps than I can count that download a CSV file, run it through a parser and then save some objects as a result. And it always seems to be a 2-3 step process.</p>
<p>If you&#8217;re a fan of the useful <a href="http://csvkit.readthedocs.org/en/latest/tutorial/examining_the_data.html">CSVKit</a>, a command-line tool written in Python, but you&#8217;re a Rubyist, then please do yourself a favor and take a look at <a href="https://github.com/seamusabshere/remote_table">RemoteTable</a>, a gem by <a href="http://workingwithrails.com/person/15878-seamus-abshere">Seamus Abshere</a>. The two libraries aren&#8217;t entirely identical, and <a href="http://www.anthonydebarros.com/2011/09/11/csvkit-data-files/">CSVKit has been rightly praised by others</a>, so let me dive a little deeper into why RemoteTable is a data parser&#8217;s friend.</p>
<p>RemoteTable is essentially a set of useful wrappers around the common process I outlined above: grab a structured file and use its contents. Except that when I say &#8220;structured file&#8221;, I don&#8217;t just mean a CSV. I mean CSVs inside of zip files. Excel spreadsheets of the .xls and .xlsx varieties. Google Spreadsheets and Open Office spreadsheets. Web pages with HTML tables in them. XML. And, what the hell, fixed-width files, too. You&#8217;ll want to <a href="https://github.com/seamusabshere/remote_table/blob/master/README.rdoc">see the examples</a>.</p>
<p>To those options it adds some useful utilities, much as CSVKit does. You can cut certain columns using the <a href="http://www.softpanorama.org/Tools/cut.shtml">Unix cut utility</a>, skip or crop rows using tail, make the entire file UTF-8 or remove &#8220;useless&#8221; characters.</p>
<p>It&#8217;s simple to try out. If you have Ruby installed and are not on Windows (sorry!), then run gem install remote_table and wait a minute or two, as it installs quite a few dependencies for managing all this stuff. Then, in a console session, try grabbing a CSV file:</p>
<p><script src="https://gist.github.com/1263335.js"> </script></p>
<p>Each row is converted into an Ordered Hash so that you can refer to columns by name instead of position.</p>
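<p>If you live on the Python side of the fence, the same by-name access can be sketched with the standard library. To be clear, this is not RemoteTable and not the code in the gist above; the sample data is made up and stands in for a downloaded file.</p>

```python
import csv
import io

# Hypothetical sample data standing in for a downloaded CSV file.
raw = io.StringIO(
    "name,state,amount\n"
    "Committee One,OH,2500\n"
    "Committee Two,VT,100\n"
)

# DictReader uses the header row as keys for every following row.
rows = list(csv.DictReader(raw))
for row in rows:
    # Columns are looked up by header name, not by position.
    print(row["name"], row["state"])
```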
<p>Now, what about a fixed-width file? Maybe one in which I&#8217;m only interested in some of the columns? Easy, thanks to RemoteTable&#8217;s nice DSL:</p>
<p><script src="https://gist.github.com/1263343.js"> </script></p>
<p>Notice that :cut option I passed? That&#8217;s where you include arguments to cut. In this case, I removed the financial columns from this file. You can also pass a :select option to it to do on-the-fly filtering of certain columns based on a matcher such as a regular expression or plain text.</p>
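<p>For comparison, here is roughly the same idea as a plain Python sketch: slice fixed-width lines by position, keep only some columns, then filter with a regular expression. The layout, column names and sample lines are all hypothetical, and the two helper functions are my own stand-ins, not RemoteTable&#8217;s API.</p>

```python
import re

# Hypothetical fixed-width layout: (column name, start, end) as Python slice offsets.
LAYOUT = [("id", 0, 4), ("name", 4, 16), ("state", 16, 18), ("amount", 18, 26)]

# Hypothetical sample lines, built with format() so the widths are exact.
LINES = [
    "{:4}{:12}{:2}{:>8}".format("0001", "Committee 1", "OH", "2500"),
    "{:4}{:12}{:2}{:>8}".format("0002", "Committee 2", "VT", "100"),
]

def parse(line, keep):
    """Slice one line by LAYOUT and keep only the named columns."""
    return {name: line[start:end].strip() for name, start, end in LAYOUT if name in keep}

def select(rows, column, pattern):
    """Keep rows whose column matches the regular expression."""
    return [row for row in rows if re.search(pattern, row[column])]

# Drop the financial column, then keep only the Ohio rows.
rows = [parse(line, keep={"id", "name", "state"}) for line in LINES]
ohio = select(rows, "state", r"^OH$")
print(ohio)
```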
<p>RemoteTable can tell you when you&#8217;re not being very smart about its use, too, such as causing it to re-fetch the source file multiple times (it warns you when it does this). It also handles local files, so if you want to avoid the downloading bit (or your files aren&#8217;t on the Web), that&#8217;s fine, too. Honestly, I have no idea why this library isn&#8217;t more popular, but you can help make it so on <a href="https://github.com/seamusabshere/remote_table">Github</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/04/remotetable-is-your-friend/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Measuring Vocabulary Richness (or, Trying Out Django on Heroku)</title>
		<link>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/</link>
		<comments>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/#comments</comments>
		<pubDate>Sun, 02 Oct 2011 02:14:52 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Code]]></category>
		<category><![CDATA[Python]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5635</guid>
		<description><![CDATA[When Heroku announced Python support this past week, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the Python Weekly newsletter caught my eye. Swizec Teller&#8217;s entry, &#8220;Measuring vocabulary richness with Python&#8221;, described an algorithm by George Udny Yule in a 1944 [...]]]></description>
			<content:encoded><![CDATA[<p>When Heroku <a href="http://blog.heroku.com/archives/2011/9/28/python_and_django/">announced Python support this past week</a>, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the <a href="http://www.pythonweekly.com/">Python Weekly newsletter</a> caught my eye.</p>
<p>Swizec Teller&#8217;s entry, &#8220;<a href="http://swizec.com/blog/measuring-vocabulary-richness-with-python/swizec/2528">Measuring vocabulary richness with Python</a>&#8221;, described an algorithm by <a href="http://statprob.com/encyclopedia/GeorgeUdnyYule.html">George Udny Yule</a> in a 1944 paper entitled &#8220;<a href="http://scholar.google.com/scholar?q=The+statistical+study+of+literary+vocabulary">The statistical study of literary vocabulary</a>.&#8221; Yule created a way to quantify the diversity of vocabulary in a given text, and Teller translated that formula into straightforward Python code.</p>
<p>So I made a <a href="https://github.com/dwillis/Rich-Vocab">simple Django app</a> that accepts text via a form and uses Teller&#8217;s code to calculate Yule&#8217;s I score of vocabulary richness. It uses the really useful <a href="http://www.nltk.org/">Natural Language Toolkit</a>; the only oddity is that when developing locally on a Mac, the standard installation of NLTK via pip is borked, so you need to <a href="https://github.com/dwillis/Rich-Vocab/blob/master/requirements.txt">specify a file to download in your requirements.txt</a>. You can find <a href="http://richvocab.herokuapp.com/">the demo app here</a>.</p>
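<p>For a feel of what the score measures, here is a stripped-down sketch. It is not Teller&#8217;s code and it skips NLTK and stemming entirely. It assumes one common statement of the formula, I = N&#178; / (M2 &#8211; N), where N is the total number of words and M2 is the sum, over each observed frequency, of the frequency squared times the number of distinct words that occur that often. Other write-ups define the terms slightly differently, so treat the numbers as illustrative.</p>

```python
import math
import re
from collections import Counter

def yule_i(text):
    """Yule's I from raw text, using lowercase word tokens and no stemming."""
    words = re.findall(r"[a-z']+", text.lower())
    n = len(words)                                 # total number of word tokens
    word_counts = Counter(words)                   # word -> times it appears
    freq_spectrum = Counter(word_counts.values())  # frequency -> number of words with it
    m2 = sum(freq * freq * how_many for freq, how_many in freq_spectrum.items())
    if m2 == n:
        # No word repeats (or the text is empty), so the denominator is zero.
        return math.inf
    return (n * n) / (m2 - n)

print(yule_i("the cat and the dog"))
```

<p>Higher scores mean fewer repeated words relative to the length of the text. Returning infinity when nothing repeats is my own choice for the sketch.</p>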
<p>I&#8217;m not offering a judgment on using Heroku or other instant deployment-type services; most of them seem pretty easy to use but out of my price range for anything significant. But it&#8217;s nice to know that services like Heroku, ep.io and others offer enough flexibility to do stuff like natural language parsing.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/10/01/measuring-vocabulary-richness-trying-out-django-on-heroku/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>In Defense of Building Tools</title>
		<link>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/</link>
		<comments>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/#comments</comments>
		<pubDate>Thu, 11 Aug 2011 00:45:16 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Journalism]]></category>
		<category><![CDATA[Work]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5622</guid>
		<description><![CDATA[My first job in Web development was as a member of washingtonpost.com&#8217;s &#8220;Tools Team.&#8221; I was, in title if not in practice, a Tool. Done snickering? Let&#8217;s move on. The Tools Team built mostly internal applications and services that helped the Web site run better. I mainly got to work on front-facing projects like the [...]]]></description>
			<content:encoded><![CDATA[<p>My first job in Web development was as a member of washingtonpost.com&#8217;s &#8220;Tools Team.&#8221; I was, in title if not in practice, a Tool.</p>
<p>Done snickering? Let&#8217;s move on.</p>
<p>The Tools Team built mostly internal applications and services that helped the Web site run better. I mainly got to work on front-facing projects like the <a href="http://projects.washingtonpost.com/congress/112/">Congress Votes Database</a>, the <a href="http://projects.washingtonpost.com/2008-presidential-candidates/">2008 presidential campaign</a> and an innovative series on <a href="http://projects.washingtonpost.com/fec/specials/cassidy/">lobbyist Gerald Cassidy</a>. But I did work on a few internal tools, and since I joined The Times in late 2007 I&#8217;ve built a few more. I&#8217;ve found that such tools are not so different from what we now consider to be journalism by Web development. Chosen wisely and done well, they can have impacts that go far beyond a single story or series. We should not <a href="https://twitter.com/hbillings/status/101391263248560128">dismiss them as &#8220;not journalism.</a>&#8221;</p>
<p>If you&#8217;re at the geekier end of the journalism spectrum, then chances are your colleagues know about the stuff you can do. They may not understand it or be able to explain it; a former managing editor of mine, when told about the various technical steps to accomplish something useful, would invariably respond with a touch of wonder: &#8220;Fuckin&#8217; Internet!&#8221; You can explain your work to a decent percentage of your colleagues by invoking Harry Potter or the Lord of the Rings and leave it at that.</p>
<p>But that doesn&#8217;t mean that building tools that can be used by broad segments of the newsroom is a one-way street or has to lead to a divide between you and the other journalists. There will be people in every newsroom who mainly take and rarely give, and in those situations being a technologist is no different from being a clerk. Good tools, like good apps, are a product of collaboration and improve the ability of the newsroom in general. They also make for more and better apps.</p>
<p>Case in point: At The Times we have <a href="http://politics.nytimes.com/congress/">an Inside Congress app</a> that displays information about votes and bills in Congress. The tool that underlies that app is enormous &#8211; it has tons more information, and we&#8217;re working to surface more and more of it. But the tool &#8211; an internal interface &#8211; has uses for our congressional reporters, our graphics editors and for me as a developer. I can point a reporter to the vote record comparison tool instead of having to run a database query or, worse, asking someone else to manually recreate something. We use the tool as a sort of canary in the mine to alert us to odd or interesting events, from committee assignment changes to bill sponsorship withdrawals to unusual voting patterns. In some cases, having the data internally has led to improvements in the app itself, such as our <a href="http://politics.nytimes.com/congress/bills/111/hr3590/amendments">&#8220;key amendments&#8221;</a> pages for certain bills. I didn&#8217;t think of that, but someone else who saw the internal tool did, and we built it together.</p>
<p>Perhaps most important to me as a developer, building the internal tool has broadened the number of people I work with and has given me a range of ideas for making apps easier to build and better. Not all of them pan out, but <a href="http://www.nytimes.com/interactive/2010/07/07/us/politics/20100707-kagan-vote-tracker.html">some of them do</a>. Put another way: the tool actually helps me develop closer working relationships with my colleagues.</p>
<p>A good tool doesn&#8217;t just make it easier for a reporter to create a story. It actually seeds the story, or makes it possible for more people in a newsroom to collaborate. When you have data but no tool, you become a gatekeeper of sorts &#8211; which is appropriate in many circumstances, but not all. I can&#8217;t possibly know what my colleagues are thinking about, considering or being alerted to, but I can make it easier for them to test out theories and do some exploration on their own. Some of them prefer to do their own work, and we certainly miss some opportunities for apps that way. But others consult with me quite a bit, since they now have a much better idea of what we have and what we might be able to do with it.</p>
<p>Skeptics might respond that there is a difference between tools built around journalistic content, like the Congress app, and those that &#8220;merely&#8221; solve a technical problem. This is a short-sighted argument. What we do as builders of Web applications (external or internal) is informed by everything we touch. Pulling a piece of one tool for use someplace else is a useful technique because it <a href="https://twitter.com/yurivictor/status/101419578936147968">reinforces the value of not repeating yourself</a> and because it sometimes enables you to look at an old problem or situation from a new vantage point.</p>
<p>Back at washingtonpost.com, my former colleague <a href="http://www.holovaty.com/">Adrian Holovaty</a> liked to say that we didn&#8217;t build internal versions of our apps because the public version was the internal version. Fair enough, to a point, but I think that line can veer into the <a href="http://www.mattwaite.com/posts/2008/jan/03/data-ghettos/">data ghetto</a> when not rigidly policed.</p>
<p>Most of my colleagues, I&#8217;m confident, have very little idea what it is that I specifically do. Sometimes I spend the time educating, and sometimes I let our tools help with the evangelization process. However they see my work, I&#8217;m pretty happy as long as it contributes to our journalism together. App developer? Sure. Tool maker? Why not. Labels don&#8217;t interest me much, and most of my colleagues don&#8217;t seem to care. The results &#8211; the journalism &#8211; are what matter.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/08/10/in-defense-of-building-tools/feed/</wfw:commentRss>
		<slash:comments>10</slash:comments>
		</item>
		<item>
		<title>Why Teach SQL?</title>
		<link>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/</link>
		<comments>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/#comments</comments>
		<pubDate>Thu, 28 Jul 2011 00:53:19 +0000</pubDate>
		<dc:creator>Derek Willis</dc:creator>
				<category><![CDATA[Car Tools]]></category>
		<category><![CDATA[Teaching]]></category>

		<guid isPermaLink="false">http://blog.thescoop.org/?p=5615</guid>
		<description><![CDATA[There was an interesting discussion on the NICAR-L listserv today about teaching database skills. More specifically, which software to teach and how to teach it. Should you go with SQLite, as I do? What about MS Access (the consensus seemed to lean against)? Is it too much to ask students to install database server software [...]]]></description>
			<content:encoded><![CDATA[<p>There was an interesting discussion on the NICAR-L listserv today about teaching database skills. More specifically, which software to teach and how to teach it. Should you go with <a href="http://blog.thescoop.org/archives/2008/01/23/teaching-sqlite/">SQLite</a>, as I do? What about MS Access (the consensus <a href="http://blog.thescoop.org/archives/2009/06/02/the-case-against-teaching-access/">seemed to lean against</a>)? Is it too much to ask students to install database server software such as MySQL or PostgreSQL?</p>
<p>These are complicated questions, made more so by the options now available for teaching database skills. When I attended an IRE database bootcamp in 1997 (taught by my now-colleague <a href="http://topics.nytimes.com/topics/reference/timestopics/people/m/jo_craven_mcginty/index.html">Jo Craven McGinty</a>), there were basically three options: the then-young Access, FoxPro or Paradox. Hard to believe, but back then I worked in a newsroom that had <a href="http://en.wikipedia.org/wiki/FoxPro_2">FoxPro</a> and <a href="http://en.wikipedia.org/wiki/Paradox_%28database%29">Paradox</a>, but not really Access. (Note: if you are under 30 and reading this, you may not even know what FoxPro and Paradox are. That&#8217;s ok. They, and in particular FoxPro, were wonderful database managers in their day.)</p>
<p>Not only do we now have open source options (SQLite, MySQL, Postgres) and SQL Server, but we also have a variety of &#8220;database-like&#8221; Web applications, like Fusion Tables and Google Refine, that can do some of the things that only desktop software used to do. And let&#8217;s face it, Excel is a very powerful tool for data analysis. Many of the things a reporter might want to do to a data file, such as sorting and filtering, are arguably a lot easier in Excel or another spreadsheet.</p>
<p>So why even teach SQL, then? The reasons I do it, and will continue to, are these:</p>
<ol>
<li>SQL is an excellent and relatively simple way to enhance your <a href="http://blog.thescoop.org/archives/2011/05/01/interviewing-data/">data interviewing</a> skills. When you have to write out your questions, you tend to think about them a little more than if you&#8217;re just pointing and clicking around. This is why when I had to teach Access, I bypassed the visual query builder. Yes, SQL queries involve writing more than doing an Excel filter, but those syntax errors also make you consider what you&#8217;re doing, and that&#8217;s a good thing.</li>
<li>SQL is still common enough on the Web that teaching it provides an additional branch, if you will, of learning, or at least makes it easier. When I explain how Facebook assembles all your friends&#8217; posts, comments and pictures, I usually do so by pointing out the existence of <a href="https://developers.facebook.com/docs/reference/fql/">FQL</a>. If you already know SQL, it&#8217;s a very small leap to understanding, at a basic level, how Facebook works.</li>
<li>There are some times when you will absolutely need to use a SQL database. Or, at least, something that&#8217;s not Excel. Multi-million-row tables. Regular expression-based pattern matching. Intensive, complicated queries. If you haven&#8217;t explored SQL, you might not know these are even possible, and you might give up.</li>
</ol>
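<p>A small sketch of what written-out questions look like, using SQLite through Python&#8217;s built-in sqlite3 module. The table and its rows are invented. One assumption worth flagging: SQLite understands the REGEXP keyword but ships without an implementation behind it, so the sketch registers one from Python.</p>

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contributions (donor TEXT, state TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO contributions VALUES (?, ?, ?)",
    [("Smith, John", "OH", 2500), ("Smyth, Jon", "OH", 500), ("Jones, Ann", "VT", 100)],
)

# "X REGEXP Y" calls regexp(Y, X), so the pattern arrives as the first argument.
conn.create_function("regexp", 2, lambda pattern, value: re.search(pattern, value) is not None)

# Question 1: how much did each state give?
totals = conn.execute(
    "SELECT state, SUM(amount) FROM contributions GROUP BY state ORDER BY state"
).fetchall()

# Question 2: which donors have a surname spelled Smith or Smyth?
matches = conn.execute(
    "SELECT donor FROM contributions WHERE donor REGEXP ? ORDER BY donor", ("^Sm[iy]th",)
).fetchall()

print(totals)
print(matches)
```

<p>Each query is a question you had to spell out, which is the point of the first item in the list. The second query is the kind of pattern matching that a spreadsheet filter makes awkward.</p>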
<p>As to what to use when teaching SQL, I stick with SQLite despite <a href="http://fds.duke.edu/db/Sanford/sarah.cohen">Sarah Cohen</a>&#8217;s completely valid point that <a href="http://www.sqlite.org/lang_datefunc.html">date and time support</a> is much more complicated than it should be. Perhaps a new installment of <a href="https://github.com/tthibo/SQL-Tutorial">Troy Thibodeaux&#8217;s excellent tutorial</a> will help address that issue. In the meantime, let&#8217;s keep teaching SQL &#8211; and asking questions.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.thescoop.org/archives/2011/07/27/why-teach-sql/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>


