There are blogs on the web that are written or configured in a way that the RSS
or Atom feed contains only a teaser (or no content at all), and one must open
a link to get the real content – and thus load all the crap on the page,
something RSS feeds were designed to avoid. Dittygirl has added one of
those sites to her feed reader, and told me that it takes lots of resources on
her netbook to load the whole page – not to mention the discomfort of leaving
the feed reader.
I accepted the challenge, and decided to write a Python RSS gateway in less
than 30 minutes. I chose plain WSGI, something I wanted to play with, and
this project was a perfect match for its simplicity and light weight. Plain
WSGI applications are Python modules with a callable named application, which
the web server calls every time an HTTP request is made. The callable gets
two parameters:
- a dictionary of environment values (including the path of the query, the
IP address of the browser, etc.), and
- a callable, which is used to tell the web server the HTTP status and the
response headers of the reply.
In this case, the script ignores the path, so only the second parameter is used.
def application(environ, start_response):
    rss = getfeed()
    response_headers = [('Content-Type', 'text/xml; charset=UTF-8'),
                        ('Content-Length', str(len(rss)))]
    start_response('200 OK', response_headers)
    return [rss]
Simple enough: the function emits a successful HTTP status and the necessary
headers, then returns the content. The list (array) wrapping is needed because
a WSGI application can also be a generator (using a yield statement), which can
be handy when rendering larger content, so the server expects an iterable result.
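By the way, such an application can be tried without a real web server: the
wsgiref module of the standard library can serve it directly. A minimal
sketch, assuming the script above is saved as rssgw.py (both the module name
and the port are my inventions):
from wsgiref.simple_server import make_server
from rssgw import application  # hypothetical module name

httpd = make_server('', 8051, application)  # arbitrary port
httpd.serve_forever()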
The real “business logic” is in the getfeed function, which first tries to
load a cache to avoid abusing the resources of the target server. I chose JSON
as it's included in the standard Python libraries, and easy to debug.
try:
    # load the cached copy of the feed and the ETag it was fetched with
    with open(CACHE, 'rb') as f:
        cached = json.load(f)
    etag = cached['etag']
except:
    # missing or unreadable cache: fall back to an unconditional fetch
    etag = ''
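Since the cache is a single JSON object, peeking into it while debugging is
trivial; a quick sketch (the file name stands for whatever CACHE points at,
and the values are made up):
import json
cached = json.load(open('feedcache.json'))  # hypothetical CACHE value
print cached['etag']   # e.g. '"4f6f-1998"'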
Next, I load the original feed, using the cached ETag value to encourage
conditional HTTP GET. The urllib2.urlopen function can operate on a Request
object, which takes a third parameter that can be used to add HTTP headers.
If the server responds with an HTTP 304 Not Modified, urlopen raises an
HTTPError, and the script knows that the cache can be used.
try:
    feedfp = urlopen(Request('http://HOSTNAME/feed/',
                             None, {'If-None-Match': etag}))
except HTTPError as e:
    if e.code != 304:
        raise
    return cached['content'].encode('utf-8')
I used lxml to handle the contents, as it's a really convenient and fast
library for XML/HTML parsing and manipulation. I compiled the XPath queries
that are used for every item at the top of the module, for performance reasons.
GUID = etree.XPath('guid/text()')   # URL of the original post
IFRAME = etree.XPath('iframe')      # the embedded Like button frame
DESC = etree.XPath('description')   # the teaser to be replaced
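A compiled XPath object is a callable that can be applied to any element and
returns the list of matches; a throwaway example, not part of the gateway:
item = etree.fromstring('<item><guid>http://example.com/post-1</guid></item>')
print GUID(item)   # ['http://example.com/post-1']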
To avoid unnecessary copying, lxml's etree can parse the object returned by
urlopen directly, and returns an object that behaves like a DOM on steroids.
The GUID XPath extracts the URL of the current feed item, and lxml's HTML
parser takes care of downloading and parsing the page. The actual contents of
the post are helpfully put in a div with the class post-content, so I took
advantage of lxml's HTML helper functions to get the div I needed.
While I was there, I also removed the first iframe from the post, which
contains the Facebook Like button (a tracker bug). Finally, I cleared the
class attribute of the div element, and serialized its contents to HTML to
replace the useless description of the feed item.
feed = etree.parse(feedfp)
for entry in feed.xpath('/rss/channel/item'):
    # fetch and parse the page the feed item points at
    ehtml = html.parse(GUID(entry)[0]).getroot()
    div = ehtml.find_class('post-content')[0]
    # drop the Like button iframe and the now useless class attribute
    div.remove(IFRAME(div)[0])
    div.set('class', '')
    # replace the teaser with the full post body
    DESC(entry)[0].text = etree.CDATA(etree.tostring(div, method="html"))
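The two lxml helpers doing the heavy lifting here, find_class and CDATA, are
easy to try in isolation; a standalone toy example:
snippet = html.fromstring('<div class="post-content"><p>Hello</p></div>')
div = snippet.find_class('post-content')[0]
desc = etree.Element('description')
desc.text = etree.CDATA(etree.tostring(div, method='html'))
print etree.tostring(desc)   # the div wrapped in a CDATA section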
There are two things left. First, the URL that points to the feed itself needs
to be modified to produce a valid feed, and the result needs to be serialized
into a string.
link = feed.xpath('/rss/channel/a:link',
                  namespaces={'a': 'http://www.w3.org/2005/Atom'})[0]
link.set('href', 'http://URL_OF_FEED_GATEWAY/')
retval = etree.tostring(feed)
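By default etree.tostring returns a plain byte string without an XML
declaration; if a picky feed reader insists on one, it can be requested
explicitly (a possible refinement, not something this script needed):
retval = etree.tostring(feed, xml_declaration=True, encoding='UTF-8')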
The second and final step is to save the ETag we got from the HTTP response and
the transformed content to the cache in order to minimize the amount of
resources (ab)used.
with open(CACHE, 'wb') as f:
    json.dump(dict(etag=feedfp.info()['ETag'], content=retval), f)
return retval
You might say that it's not fully optimized, the design is monolithic, and so
on – but it was done in less than 30 minutes, and it's been working perfectly
ever since. It's a typical quick-and-dirty hack, and although it contains no
technical breakthrough, I learned a few things, and I hope someone else might
too by reading it. Happy hacking!