Using RSS in the Newsroom
Derek Willis
The Washington Post
NICAR 2006
(a copy of this handout can be found at http://www.thescoop.org/projects/irenicar/)
What is RSS?
RSS is a syndication format for delivering content over the Web: text, pictures, audio, video, whatever. It’s all enclosed in XML, which means you can turn it into data. You can also scrape HTML or other formats into RSS to push content out to the newsroom.
Learning About RSS
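The technical definition and spec (RSS 2.0) can be found at Harvard. Pay attention to the required elements: every channel must have a title, a link and a description, and every item needs at least a title or a description.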
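To make the required elements concrete, here is a minimal sketch that checks a feed for the three required channel elements using only Python's standard library. The sample feed, its URLs and the function name missing_channel_elements are made up for illustration; they are not from the spec or from any real feed.

```python
import xml.etree.ElementTree as ET

# A made-up minimal RSS 2.0 document, for illustration only.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Newsroom Feed</title>
    <link>http://example.com/</link>
    <description>Hypothetical feed used to show required elements.</description>
    <item>
      <title>First item</title>
      <link>http://example.com/first</link>
      <description>An item needs at least a title or a description.</description>
    </item>
  </channel>
</rss>"""

# RSS 2.0 requires these three elements inside <channel>.
REQUIRED_CHANNEL_ELEMENTS = ("title", "link", "description")


def missing_channel_elements(rss_text):
    """Return the required channel elements that are absent or empty."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return list(REQUIRED_CHANNEL_ELEMENTS)
    return [
        name
        for name in REQUIRED_CHANNEL_ELEMENTS
        if not (channel.findtext(name) or "").strip()
    ]


print(missing_channel_elements(SAMPLE_RSS))
```

An empty list means the channel has everything the spec requires. A check like this is handy before you push a home-built feed out to the newsroom.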
Turning RSS into Data
I recommend Feed Parser, a Python module. Feed Parser turns RSS elements into Python objects, which can be inserted into SQL databases.
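The idea of turning feed items into Python objects and then into database rows can be sketched without any third-party module. This is not Feed Parser: it is a rough stand-in built on the standard library's XML parser, with an in-memory SQLite database in place of a real SQL server so it runs anywhere. The sample feed, the table name items and both function names are assumptions for illustration.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Made-up feed content, for illustration only.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Hypothetical Feed</title>
    <link>http://example.com/</link>
    <description>Made-up items for illustration.</description>
    <item>
      <title>HB 1</title>
      <link>http://example.com/hb1</link>
      <description>An example bill.</description>
    </item>
    <item>
      <title>SB 2</title>
      <link>http://example.com/sb2</link>
      <description>Another example bill.</description>
    </item>
  </channel>
</rss>"""

FIELDS = ("title", "link", "description")


def entries_from_rss(rss_text):
    """Turn each <item> into a dict holding its title, link and description."""
    root = ET.fromstring(rss_text)
    return [
        {field: (item.findtext(field) or "") for field in FIELDS}
        for item in root.iter("item")
    ]


def load_entries(conn, entries):
    """Create the table if needed and insert one row per entry."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (title TEXT, link TEXT, description TEXT)"
    )
    # Named placeholders let each dict supply its own values safely.
    conn.executemany(
        "INSERT INTO items (title, link, description) "
        "VALUES (:title, :link, :description)",
        entries,
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real database server
load_entries(conn, entries_from_rss(SAMPLE_RSS))
rows = conn.execute("SELECT title, link FROM items ORDER BY title").fetchall()
print(rows)
```

Feed Parser handles far more than this sketch does (different feed formats, dates, malformed XML), which is why it is the better choice for real feeds.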
Some other options:
Perl: http://www.petercooper.co.uk/archives/000995.html
Perl: http://tageloehner.de/rss2sql.pl.txt
PHP: http://kynikeren.com/tech/category/feedparser/
Turning HTML into RSS – an example
FEC press releases. This Python script scrapes HTML from the FEC web site and builds an RSS feed.
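The general scrape-and-build approach can be sketched as follows. This is not the FEC script itself: the HTML sample, the example.com URLs and the names LinkCollector and build_rss are invented for illustration, and a real page would need its own rules for picking out the right links.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin
import xml.etree.ElementTree as ET

# Made-up page fragment standing in for a list of press releases.
SAMPLE_HTML = """
<html><body>
<ul>
  <li><a href="/press/release1.html">Hypothetical release one</a></li>
  <li><a href="/press/release2.html">Hypothetical release two</a></li>
</ul>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def build_rss(html_text, base_url, feed_title, feed_description):
    """Scrape links from html_text and return an RSS 2.0 document as a string."""
    parser = LinkCollector()
    parser.feed(html_text)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # The three required channel elements.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = base_url
    ET.SubElement(channel, "description").text = feed_description

    for href, text in parser.links:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = text
        # Relative links are made absolute so feed readers can follow them.
        ET.SubElement(item, "link").text = urljoin(base_url, href)

    return ET.tostring(rss, encoding="unicode")


feed = build_rss(
    SAMPLE_HTML,
    "http://example.com/",
    "Hypothetical press releases",
    "Scraped from a made-up page",
)
print(feed)
```

Run on a schedule and saved to a file on a web server, a script along these lines gives the newsroom a feed for a site that doesn't offer one.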
Turning RSS into SQL
Another example: Delaware legislation signed into law (feed). The following Python code, using Feed Parser, inserts it into a MySQL database:
"""
delaws.py - an example of turning RSS into SQL data.
Derek Willis, The Washington Post
dwillis@gmail.com
March 2006
This script turns an RSS feed into a MySQL database table, using the feedparser module to select specific attributes within the feed.
"""
# import required modules
import feedparser, MySQLdb
# set up connection to MySQL database
db = MySQLdb.connect(host='host',user='user',passwd='pass',db='dbname')
cursor=db.cursor()
# fetch feed and turn into feedparser object
d = feedparser.parse('http://www.legis.state.de.us/LIS/LIS143.NSF/GovSignedFeed.xml')
# determine number of entries in feed, which is used in processing loop
x = len(d.entries)
# processing loop - for each entry, selects certain attributes and inserts them into a MySQL table. Also converts the RSS entry date into a MySQL date
for i in range(x):
    d2 = d.entries[i].date_parsed
    d3 = str(d2[0])+'-'+str(d2[1])+'-'+str(d2[2])
    cursor.execute("""INSERT INTO delaws (bill, date, description, url) VALUES (%s,%s,%s,%s)""", (d.entries[i].title, d3, d.entries[i].description, d.entries[i].link))
# commit insertions to table (needed for transactional tables such as InnoDB; has no effect on MyISAM tables)
db.commit()
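A note on the date conversion in the loop: the parsed date is a nine-item time tuple, and the script joins the first three items by hand. As an alternative sketch, the standard library's time.strftime produces a zero-padded YYYY-MM-DD string from the same kind of tuple. The sample tuple and both function names below are made up for illustration.

```python
import time

# A made-up parsed date in time-tuple form:
# (year, month, day, hour, minute, second, weekday, day of year, DST flag)
SAMPLE_PARSED = (2006, 3, 9, 14, 30, 0, 3, 68, 0)


def joined_date(parsed):
    """The approach used in delaws.py: join year, month and day with hyphens."""
    return str(parsed[0]) + '-' + str(parsed[1]) + '-' + str(parsed[2])


def mysql_date(parsed):
    """Alternative: format the time tuple as a zero-padded 'YYYY-MM-DD' string."""
    return time.strftime("%Y-%m-%d", parsed)


print(joined_date(SAMPLE_PARSED))  # 2006-3-9
print(mysql_date(SAMPLE_PARSED))   # 2006-03-09
```

MySQL accepts both forms for a DATE column; the zero-padded form is just easier to sort and compare if the string ends up anywhere else.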