Recipe 11.16. A Simple Feed Aggregator


Credit: Rod Gaither


XML is the basis for many specialized languages. One of the most popular is RSS, an XML format often used to store lists of articles from web pages. With a tool called an aggregator, you can collect weblog entries and articles from several web sites' RSS feeds, and read all those web sites at once without having to skip from one to the other. Here, we'll create a simple aggregator in Ruby.


Before aggregating RSS feeds, let's start by reading a single one. Fortunately, we have several options for parsing RSS feeds into Ruby data structures. The Ruby standard library has built-in support for the three major versions of the RSS format (0.9, 1.0, and 2.0). This example uses the standard rss library to parse an RSS 2.0 feed and print out the titles of the items in the feed:



require 'rss/2.0'
require 'open-uri'

url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
feed = RSS::Parser.parse(open(url).read, false)
puts "=== Channel: #{feed.channel.title} ==="
feed.items.each do |item|
  puts item.title
  puts " (#{item.link})"
  puts
  puts item.description
end
# === Channel: O'Reilly Network Articles ===
# How to Make Your Sound Sing with Vocoders
# (http://digitalmedia.oreilly.com/2006/03/29/vocoder-tutorial-and-tips.html)
# …



Unfortunately, the standard rss library is a little out of date. There's a newer syndication format called Atom, which serves the same purpose as RSS, and the rss library doesn't support it. Any serious aggregator must support all the major syndication formats.


So instead, our aggregator will use Lucas Carlson's Simple RSS library, available as the simple-rss gem. This library supports the three main versions of RSS, plus Atom, and it does so in a relaxed way so that ill-formed feeds have a better chance of being read.


Here's the example above, rewritten to use Simple RSS. As you can see, little changes apart from the requires and the name of the parsing class:



require 'rubygems'
require 'simple-rss'
require 'open-uri'

url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
feed = SimpleRSS.parse(open(url).read)
puts "=== Channel: #{feed.channel.title} ==="
feed.items.each do |item|
  puts item.title
  puts " (#{item.link})"
  puts
  puts item.description
end



Now we have a general method of reading a single RSS or Atom feed. Time to work on some aggregation!


Although the aggregator will be a simple Ruby script, there's no reason not to use Ruby's object-oriented features. Our approach will be to create a class to encapsulate the aggregator's data and behavior, and then write a sample program to use the class.


The RSSAggregator class that follows is a bare-bones aggregator that reads from multiple syndication feeds when instantiated. It uses a few simple methods to expose the data it has read.



#!/usr/bin/ruby
# rss-aggregator.rb - Simple RSS and Atom Feed Aggregator

require 'rubygems'
require 'simple-rss'
require 'open-uri'

class RSSAggregator
  def initialize(feed_urls)
    @feed_urls = feed_urls
    @feeds = []
    read_feeds
  end

  protected

  def read_feeds
    @feed_urls.each { |url| @feeds.push(SimpleRSS.new(open(url).read)) }
  end

  public

  def refresh
    @feeds.clear
    read_feeds
  end

  def channel_counts
    @feeds.each_with_index do |feed, index|
      channel = "Channel(#{index}): #{feed.channel.title}"
      articles = "Articles: #{feed.items.size}"
      puts channel + ', ' + articles
    end
  end

  def list_articles(id)
    puts "=== Channel(#{id}): #{@feeds[id].channel.title} ==="
    @feeds[id].items.each { |item| puts ' ' + item.title }
  end

  def list_all
    @feeds.each_with_index { |feed, index| list_articles(index) }
  end
end



Now we just need a few more lines of code to instantiate and use an RSSAggregator object:



test = RSSAggregator.new(ARGV)
test.channel_counts
puts "\n"
test.list_all



Here's the output from a run of the test program against a few feed URLs:



$ ruby rss-aggregator.rb http://www.rubyriver.org/rss.xml \
    http://rss.slashdot.org/Slashdot/slashdot \
    http://www.oreillynet.com/pub/feed/1 \
    http://safari.oreilly.com/rss/
Channel(0): RubyRiver, Articles: 20
Channel(1): Slashdot, Articles: 10
Channel(2): O'Reilly Network Articles, Articles: 15
Channel(3): O'Reilly Network Safari Bookshelf, Articles: 10
=== Channel(0): RubyRiver ===
Mantis style isn't eas…
It's wonderful when tw…
Red tailed hawk
37signals




While it's a long way from a fully functional RSS aggregator, this program illustrates the basic requirements of any real aggregator. From this starting point, you can expand and refine the features of RSSAggregator.


One very important feature missing from the aggregator is support for the If-Modified-Since HTTP request header. When you call RSSAggregator#refresh, your aggregator downloads the specified feeds, even if it just grabbed the same feeds and none of them have changed since then. This wastes bandwidth.


Polite aggregators keep track of when they last grabbed a certain feed, and when they request it again they do a conditional request by supplying an HTTP request header called If-Modified-Since. The details are a little beyond our scope, but basically the web server serves the requested feed only if it has changed since the last time the RSSAggregator downloaded it.
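
To make the idea concrete, here's one way a conditional fetch might look with Ruby's standard net/http library. This is only a sketch, not part of the recipe's code: it assumes you hang on to the Last-Modified value from the previous response and pass it back on the next request.

require 'net/http'
require 'uri'

# Fetch a feed only if it has changed since last_modified, which is
# whatever Last-Modified value we saved from the previous response.
def fetch_feed(url, last_modified=nil)
  uri = URI.parse(url)
  headers = {}
  headers['If-Modified-Since'] = last_modified if last_modified
  response = Net::HTTP.start(uri.host, uri.port) do |http|
    http.get(uri.request_uri, headers)
  end
  # A 304 Not Modified response means our cached copy is still current.
  return nil if response.is_a?(Net::HTTPNotModified)
  [response.body, response['Last-Modified']]
end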


Another important feature our RSSAggregator is missing is the ability to store the articles it fetches. A real aggregator would store articles on disk or in a database to keep track of which stories are new since the last fetch, and to keep articles available even after they become old news and drop out of the feed.
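
For instance, you might remember which articles you've already seen by keeping them in a PStore file between runs. The following is only a sketch: the store file name and the use of each item's link as a key are assumptions for illustration, and feed stands for a SimpleRSS object like the ones RSSAggregator builds.

require 'pstore'

# Persist articles keyed by link so a later run can spot new ones.
store = PStore.new('articles.pstore')
new_items = []
store.transaction do
  feed.items.each do |item|
    unless store.root?(item.link)
      store[item.link] = { :title => item.title, :body => item.description }
      new_items << item
    end
  end
end
puts "#{new_items.size} new articles since the last run."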


Our simple aggregator counts the articles and lists their titles for review, but it doesn't actually provide access to the article detail. As seen in the first example, each SimpleRSS item has a link attribute containing the URL of the article, and a description attribute containing the (possibly HTML) body of the article. A real aggregator might generate a list of articles in HTML format for use in a browser, or convert the body of each article to text for output to a terminal.
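
As a rough illustration of the HTML route, a helper like the following (not part of RSSAggregator, and assuming feed is a SimpleRSS object) could turn one feed into a simple linked list of articles:

require 'cgi'

# Render one feed's items as an HTML list of links to the full articles.
def articles_to_html(feed)
  html = "<h1>#{CGI.escapeHTML(feed.channel.title)}</h1>\n<ul>\n"
  feed.items.each do |item|
    html << %{<li><a href="#{item.link}">#{CGI.escapeHTML(item.title)}</a></li>\n}
  end
  html << "</ul>\n"
end

File.open('articles.html', 'w') { |f| f.write(articles_to_html(feed)) }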



See Also


  • Recipe 14.1, "Grabbing the Contents of a Web Page"

  • Recipe 14.3, "Customizing HTTP Request Headers"

  • Recipe 11.15, "Converting HTML Documents from the Web into Text"

  • A good comparison of the RSS and Atom formats (http://www.intertwingly.net/wiki/pie/Rss20AndAtom10Compared)

  • Details on the Simple RSS project (http://simple-rss.rubyforge.org/)

  • The FeedTools project has a more sophisticated aggregator library that supports caching and If-Modified-Since; see http://sporkmonger.com/projects/feedtools/ for details

  • "HTTP Conditional Get for RSS Hackers" is a readable introduction to If-Modified-Since (http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers)












