The Scoop

Xpdf on the Mac

March 18th, 2005  |  Published in Car Tools  |  1 Comment

Last year I wrote a piece for Uplink on using Xpdf to convert PDF documents into text tables, but that piece focused on using Xpdf on Win32 systems. Here’s an adapted guide to installing and using Xpdf on the Mac.

Here’s a question that should have a familiar ring: How do I get text out of a PDF file?

Painfully, if your experience has been anything like mine.

The mere existence of Adobe Acrobat files has been a boon for reporters because governments and agencies everywhere have been able to make documents broadly available over the Internet. It’s hard not to love that.

But if you’ve ever tried getting tables out of a PDF document - and we’ve all tried - the results usually aren’t worth the effort. Until now.

A free command line utility called Xpdf will save you time and aggravation. It will, in most circumstances, enable you to go from PDF to Excel in a matter of seconds, rather than minutes or hours. Did I mention that it’s free?

Xpdf is available for download in packages for Windows and Linux/Unix. OS X users should download the source code (at this writing the file is xpdf-3.00.tar.gz) to their desktop. That file will expand into a folder labeled xpdf-3.00. Open up the Terminal and type the following (hit return after each line):

cd Desktop
cd xpdf-3.00
./configure

This will take a minute or so. Then type:

make

Again, you’ll wait a few minutes until it finishes, then:

make install

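You may have to use “sudo make install” instead and, when prompted, enter the password for your computer, because the install step copies files into system directories (by default under /usr/local).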

Once Xpdf installs, you can put a PDF file anywhere in your home directory (I usually have a single folder for this). Navigate to that directory in the Terminal using “cd /location of file”, and then type:

pdftotext -layout pdfname.pdf

Depending on the size of the PDF file, your output text file (with the same name as the original, but ending in .txt instead of .pdf) will be in the same folder in a matter of seconds.
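If you have a whole folder of PDFs, a small shell function can save some typing. What follows is only a rough sketch, not something that ships with Xpdf: the names txt_name and convert_folder are made up for this example. It relies on the fact that pdftotext accepts an optional second argument naming the output file, and it quotes every file name so that names with spaces still work.

```shell
# Print the text-file name that goes with a PDF: report.pdf -> report.txt
# (txt_name is a hypothetical helper, not part of Xpdf.)
txt_name() {
    printf '%s\n' "${1%.pdf}.txt"
}

# Convert every .pdf file in a folder to layout-preserving text.
# Usage: convert_folder /path/to/folder   (defaults to the current folder)
convert_folder() {
    dir=${1:-.}
    for pdf in "$dir"/*.pdf; do
        # If the folder holds no PDFs the pattern stays unexpanded; skip it.
        [ -e "$pdf" ] || continue
        # Name the output file explicitly so it lands next to the original.
        pdftotext -layout "$pdf" "$(txt_name "$pdf")"
    done
}
```

After pasting those two functions into the Terminal, something like “convert_folder ~/pdfs” (the folder name is a placeholder) should leave one .txt file beside each PDF.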

Let’s go through the command line syntax. First, the command “pdftotext” is required for this process, and “pdf2text” won’t work. The “-layout” option tells Xpdf that you want to preserve the layout that the PDF file uses, which keeps the text in those nice, clean tables. And you need to have the full name of the file, including the .pdf extension (I recommend a single-word name, even though OS X supports filenames with spaces). That’s it.

The resulting text file will be the entire text of the PDF, meaning that you may have to wade through pages of text in order to get to your tables. The preservation of the PDF’s layout means that if a page contained two tables side-by-side, that’s the way they will look in the text file, too.
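Two things can cut down on the wading. First, pdftotext has “-f” and “-l” options for the first and last page to convert, and it writes to standard output when the output name is a single dash. Second, a few lines of awk can turn the space-padded columns into tabs, which Excel opens cleanly. The sketch below is a heuristic of my own rather than an Xpdf feature: layout_to_tsv is a made-up name, and it assumes that two or more spaces in a row always mark a column boundary, which will not hold for every table (side-by-side tables will come out as one wide row).

```shell
# Turn layout-preserved text (on standard input) into tab-separated columns.
# Assumption: a run of two or more spaces separates one column from the next.
layout_to_tsv() {
    awk '{
        sub(/^ +/, "")       # drop leading indentation
        sub(/ +$/, "")       # drop trailing spaces
        gsub(/  +/, "\t")    # each run of 2+ spaces becomes a single tab
        print
    }'
}

# Example with placeholder names and page numbers: convert only pages 12
# through 14, send the text to standard output, and save it as tabs.
# pdftotext -layout -f 12 -l 14 report.pdf - | layout_to_tsv > tables.tsv
```

Single spaces inside a cell, such as a two-word county name, are left alone, so the cell stays in one column.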

Xpdf doesn’t work in all instances; specifically, it won’t convert PDFs that have been locked by their creators. Don’t bother asking the author of Xpdf, either, as he has posted a message on his Web site indicating that he will not add that ability.
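You can check for a lock before you try a conversion. Xpdf also installs a tool called pdfinfo, which prints a summary of a PDF that includes an “Encrypted:” line. The sketch below assumes that a locked file shows “copy:no” on that line; that is how I understand the output to look, but treat the exact wording as an assumption and compare it against what your copy of pdfinfo prints. The name copy_locked is made up for this example.

```shell
# Read pdfinfo output on standard input and succeed (exit status 0) when
# the "Encrypted:" line says copying is not allowed.
# Assumption: a locked file shows "copy:no" on that line.
copy_locked() {
    grep '^Encrypted:' | grep -q 'copy:no'
}

# Example with a placeholder file name:
# if pdfinfo report.pdf | copy_locked; then
#     echo "report.pdf is locked; pdftotext will likely refuse it"
# fi
```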

But for most government documents, Xpdf can be a huge time-saver and allow you to spend more time actually analyzing data rather than trying to free it from the confines of the PDF.

Responses

  1. Hanspeter says:

    October 10th, 2008 at 3:02 pm (#)

    For the record, Fink already has xpdf in its list of packages, so ‘fink install xpdf’ will go through all the necessary steps.
