The Scoop


Measuring Vocabulary Richness (or, Trying Out Django on Heroku)

October 1st, 2011  |  Published in Code, Python  |  3 Comments

When Heroku announced Python support this past week, I was interested in seeing how the deployment process worked compared to how Heroku handles Ruby apps. Then a post highlighted by the Python Weekly newsletter caught my eye.

Swizec Teller’s entry, “Measuring vocabulary richness with Python,” described an algorithm from George Udny Yule’s 1944 paper, “The statistical study of literary vocabulary.” Yule created a way to quantify the diversity of vocabulary in a given text, and Teller translated that formula into straightforward Python code.
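To give a sense of the calculation, here is a minimal sketch of Yule’s I using only the standard library. It is not Teller’s code: the function name `yule_i` and the regex tokenizer are my own assumptions, there is no stemming, and it assumes the common definition I = M1² / (M2 − M1), where M1 is the total number of words and M2 is the sum of the squared frequency of each distinct word. A higher score suggests a more varied vocabulary.

```python
import re
from collections import Counter


def yule_i(text):
    """Rough Yule's I score; higher suggests richer vocabulary.

    Assumption: I = M1^2 / (M2 - M1), with M1 = total word count and
    M2 = sum of squared frequencies of each distinct word.
    """
    # Naive tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)

    m1 = len(tokens)
    m2 = sum(freq * freq for freq in counts.values())

    # If no word repeats (or the text is empty), M2 equals M1 and the
    # ratio is undefined; return 0.0 rather than divide by zero.
    if m2 == m1:
        return 0.0
    return (m1 * m1) / (m2 - m1)


print(yule_i("the cat the dog"))  # 8.0
```

Swapping the regex for an NLTK tokenizer and stemmer would get closer to what the demo app does, at the cost of an extra dependency.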

So I made a simple Django app that accepts text via a form and uses Teller’s code to calculate Yule’s I score of vocabulary richness. It uses the really useful Natural Language Toolkit; the only oddity is that when developing locally on a Mac, the standard installation of NLTK via pip is borked, so you need to specify a file to download in your requirements.txt. You can find the demo app here.

I’m not offering a judgment on using Heroku or other instant deployment-type services; most of them seem pretty easy to use but out of my price range for anything significant. But it’s nice to know that services like Heroku, ep.io and others offer enough flexibility to do stuff like natural language parsing.

Responses

  1. Chris Amico says:

    October 2nd, 2011 at 11:24 am (#)

    Was just about to do a similar experiment myself, probably using this little demo app. I’ve been enjoying ep.io, but I’m curious how the setup time, cost and performance of Heroku compare.

    One thing these types of services could end up being good for is building specific services–like language analysis or text extraction–that require time to process and which can be abstracted out of other apps and used via web service. Something to think on, anyway.

  2. Ben says:

    October 3rd, 2011 at 10:07 am (#)

    So what’s the next step? Having it slurp up people’s Twitter accounts and give them a score?

  3. Christopher says:

    November 27th, 2011 at 1:19 am (#)

    I am curious, what does such a thing cost? I never really thought about it until I read this page. Is it that expensive?

    I would hope that as CPU performance improves, the cost and performance will improve as well.

    Interesting reading.

    Christopher

©2012 The Scoop