VSzA techblog

Contact

vsza at vsza dot hu
Twitter: dn3t

F33dme vs. Django 1.4 HOWTO

2013-05-31

Although asciimoo unofficially abandoned it for potion, I've been using f33dme with slight modifications as a feed reader since May 2011. On 4th May 2013, Debian released Wheezy, so when I upgraded the server that runs my f33dme instance, I got Django 1.4 along with it. As usual with major upgrades, nothing worked afterwards, so I had to tweak the code to make it work with the new release of the framework.

First of all, the database configuration in settings.py was just a set of simple key-value pairs like DATABASE_ENGINE = 'sqlite3'. These had to be replaced with a more structured block like the one below.

DATABASES = {
    'default': {
        'ENGINE': 'django.db.backends.sqlite3',  # Django 1.4 expects the full backend path
        ...
    }
}

Then starting the service using manage.py displayed the following error message.

Error: One or more models did not validate:
admin.logentry: 'user' has a relation with model
    <class 'django.contrib.auth.models.User'>, which
    has either not been installed or is abstract.

Abdul Rafi wrote on Stack Overflow that such issues could be solved by adding django.contrib.auth to INSTALLED_APPS. In the case of f33dme, it was already there, I just had to uncomment it. After this modification, manage.py started without problems, but rendering the page resulted in the error message below.

ImproperlyConfigured: Error importing template source loader
    django.template.loaders.filesystem.load_template_source: "'module'
    object has no attribute 'load_template_source'"

Searching the web for the text above led me to another Stack Overflow question, and correcting the template loaders section in settings.py solved the issue. Although it's not strictly a Django-related problem, another component called feedparser also got upgraded, and it started returning values that resulted in TypeError exceptions, so the handler in fetch.py also had to be extended to deal with such cases.
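
For the template loader part, a minimal sketch of what a corrected section can look like. The error above refers to the old function-based loaders (load_template_source), which Django replaced with Loader classes. The exact list f33dme needs is an assumption here; these two are Django's stock loaders.

```python
# settings.py fragment (sketch): class-based template loaders.
# Which loaders f33dme actually needs is an assumption; these are Django's stock ones.
TEMPLATE_LOADERS = (
    'django.template.loaders.filesystem.Loader',
    'django.template.loaders.app_directories.Loader',
)
```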

With the modifications described above, f33dme now works like a charm. Deprecation warnings from both Django and feedparser still get written to the logs, but these can wait until the next Debian upgrade, and until then, I have a working feed reader.

Mangling RSS feeds with Python

2011-10-28

There are blogs on the web that are written or configured in such a way that the RSS or Atom feed contains only a teaser (or no content at all), and one must open a link to get the real content, and thus load all the crap on the page, something RSS feeds were designed to avoid. Dittygirl has added one of those sites to her feed reader, and told me that it takes lots of resources on her netbook to load the whole page, not to mention the discomfort of leaving the feed reader.

I accepted the challenge, and decided to write a Python RSS gateway in less than 30 minutes. I chose plain WSGI, something I wanted to play with, and this project was a perfect match for its simplicity and light weight. Plain WSGI applications are Python modules with a callable named application, which the web server calls every time an HTTP request is made. The callable gets two parameters:

a dictionary of environment values (including the path of the query, the IP address of the browser, etc.), and
a callable, which is used to pass the HTTP status and the response headers to the web server.

In this case, the script ignores the path, so only the second parameter is used.

def application(environ, start_response):
  rss = getfeed()
  response_headers = [('Content-Type', 'text/xml; charset=UTF-8'),
                      ('Content-Length', str(len(rss)))]
  start_response('200 OK', response_headers)
  return [rss]

Simple enough, the function emits a successful HTTP status, the necessary headers, and returns the content. The list (array) format is needed because a WSGI application can be a generator too (using a yield statement), which can be handy when rendering larger content, so the server expects an iterable result.
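
To illustrate the generator variant mentioned above, here is a minimal sketch written for Python 3 (the gateway itself targets Python 2). The call_app helper is hypothetical: it plays the role of the web server so that the application can be tried without one.

```python
def application(environ, start_response):
    """Generator-style WSGI application: the body is yielded in chunks."""
    start_response('200 OK', [('Content-Type', 'text/plain; charset=UTF-8')])
    for number in range(3):
        # WSGI response bodies consist of byte strings
        yield ('chunk %d\n' % number).encode('utf-8')


def call_app(app, environ=None):
    """Hypothetical helper: call a WSGI app like a server would."""
    captured = {}

    def start_response(status, headers, exc_info=None):
        captured['status'] = status
        captured['headers'] = headers

    # Iterating over the result runs the generator, which calls start_response
    body = b''.join(app(environ or {}, start_response))
    return captured['status'], captured['headers'], body


status, headers, body = call_app(application)
```

Because the server only iterates over the result, the same helper works with the list-returning version shown earlier.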

The real “business logic” is in the getfeed function, which first tries to load a cache, to avoid abusing the resources of the target server. I chose JSON as it's included in the standard Python libraries, and easy to debug.

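try:
  # Load the previously saved ETag and content, if any
  with open(CACHE, 'rb') as f:
    cached = json.load(f)
  etag = cached['etag']
except (IOError, ValueError, KeyError):
  # No usable cache yet, so no ETag can be sent
  etag = ''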

Next, I load the original feed, using the cached ETag value to make a conditional HTTP GET. The urllib2.urlopen function can operate on a Request object, whose constructor takes the HTTP headers to add as its third parameter. If the server responds with an HTTP 304 Not Modified, urlopen raises an HTTPError, and the script knows that the cache can be used.

try:
  feedfp = urlopen(Request('http://HOSTNAME/feed/',
      None, {'If-None-Match': etag}))
except HTTPError as e:
  if e.code != 304:
    raise
  return cached['content'].encode('utf-8')
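
The same conditional GET can be sketched for Python 3, where urllib2 became urllib.request and urllib.error. The function names and the example.com URL are made up for illustration, and skipping the header when there is no ETag is my own choice rather than what the script above does.

```python
from urllib.error import HTTPError
from urllib.request import Request, urlopen


def conditional_request(url, etag):
    """Build a GET request that lets the server skip unchanged content."""
    headers = {'If-None-Match': etag} if etag else {}
    return Request(url, None, headers)


def fetch_if_changed(url, etag):
    """Return the open response, or None if the server answered 304."""
    try:
        return urlopen(conditional_request(url, etag))
    except HTTPError as e:
        if e.code != 304:
            raise
        return None


# Building the request does not touch the network
req = conditional_request('http://example.com/feed/', '"abc123"')
```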

I used lxml to handle the contents, as it's a really convenient and fast library for XML/HTML parsing and manipulation. For performance reasons, I compiled the XPath queries that are used for every item once, at the top of the module.

GUID = etree.XPath('guid/text()')
IFRAME = etree.XPath('iframe')
DESC = etree.XPath('description')

To avoid unnecessary copying, lxml's etree can parse the object returned by urlopen directly, and returns an object that behaves like a DOM on steroids. The GUID XPath extracts the URL of the current feed item, and the HTML parser of lxml downloads and parses the page behind it. The actual content of the post is helpfully put in a div with the class post-content, so I took advantage of lxml's HTML helper functions to get the div I needed.

While I was there, I also removed the first iframe from the post, which contains the Facebook ~~tracker bug~~ Like button. Finally, I cleared the class attribute of the div element, and serialized its contents to HTML to replace the useless description of the feed item.

feed = etree.parse(feedfp)
for entry in feed.xpath('/rss/channel/item'):
  ehtml = html.parse(GUID(entry)[0]).getroot()
  div = ehtml.find_class('post-content')[0]
  div.remove(IFRAME(div)[0])
  div.set('class', '')
  DESC(entry)[0].text = etree.CDATA(etree.tostring(div, method="html"))
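
The loop above needs lxml and network access. For experimenting offline, the same idea can be sketched with the standard library's ElementTree on Python 3, which is a swapped-in technique and not what the gateway uses. The sample feed and the FULL_POSTS lookup are made up, and ElementTree escapes the markup where the original wraps it in CDATA.

```python
import xml.etree.ElementTree as ET

# Made-up sample feed with a teaser-only description
RSS = """<rss version="2.0"><channel>
<item><guid>http://example.com/post/1</guid><description>teaser</description></item>
</channel></rss>"""

# Hypothetical stand-in for downloading a post and extracting its content div
FULL_POSTS = {'http://example.com/post/1': '<div><p>Full text</p></div>'}


def expand_descriptions(rss_text, fetch_post):
    """Replace the description of every item with the full post content."""
    root = ET.fromstring(rss_text)
    for item in root.findall('channel/item'):
        guid = item.findtext('guid')
        # Markup assigned to .text gets escaped on output (no CDATA support)
        item.find('description').text = fetch_post(guid)
    return ET.tostring(root, encoding='unicode')


result = expand_descriptions(RSS, FULL_POSTS.__getitem__)
```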

There are two things left. First, the URL that points to the feed itself needs to be modified to produce a valid feed, and the result needs to be serialized into a string.

link = feed.xpath('/rss/channel/a:link',
  namespaces={'a': 'http://www.w3.org/2005/Atom'})[0]
link.set('href', 'http://URL_OF_FEED_GATEWAY/')
retval = etree.tostring(feed)
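
The namespace mapping is the part that is easy to get wrong: the prefix used in the query (a) does not have to match the one in the document, only the namespace URI matters. A small sketch with the standard library's ElementTree on Python 3 (again a stand-in for lxml), using a made-up feed and gateway URL:

```python
import xml.etree.ElementTree as ET

ATOM = 'http://www.w3.org/2005/Atom'
# Keep a readable prefix when serializing (the default would be ns0)
ET.register_namespace('atom', ATOM)

# Made-up feed that declares the Atom namespace with the prefix "atom"
RSS = ('<rss version="2.0" xmlns:atom="%s"><channel>'
       '<atom:link href="http://example.com/feed/" rel="self"/>'
       '</channel></rss>' % ATOM)

root = ET.fromstring(RSS)
# The query uses the prefix "a", mapped to the same namespace URI
link = root.find('channel/a:link', namespaces={'a': ATOM})
link.set('href', 'http://gateway.example.org/')
output = ET.tostring(root, encoding='unicode')
```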

The second and final step is to save the ETag we got from the HTTP response and the transformed content to the cache in order to minimize the amount of resources (ab)used.

with open(CACHE, 'wb') as f:
  json.dump(dict(etag=feedfp.info()['ETag'], content=retval), f)
return retval
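
The two halves of the cache handling can be tried in isolation. Below is a minimal round-trip sketch for Python 3; the function names and the temporary cache path are assumptions made for the example, not parts of the gateway.

```python
import json
import os
import tempfile


def save_cache(path, etag, content):
    """Store the ETag and the transformed feed as JSON."""
    with open(path, 'w', encoding='utf-8') as f:
        json.dump({'etag': etag, 'content': content}, f)


def load_cache(path):
    """Return the cached dict, or an empty cache if the file is missing or broken."""
    try:
        with open(path, encoding='utf-8') as f:
            return json.load(f)
    except (OSError, ValueError):
        return {'etag': '', 'content': None}


path = os.path.join(tempfile.mkdtemp(), 'cache.json')
missing = load_cache(path)               # nothing saved yet
save_cache(path, '"abc123"', '<rss/>')
restored = load_cache(path)              # same values come back
```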

You might say that it's not fully optimized, the design is monolithic, and so on, but it was done in less than 30 minutes, and it's been working perfectly ever since. It's a typical quick-and-dirty hack, and although it contains no technical breakthrough, I learned a few things, and I hope someone else might also learn something by reading it. Happy hacking!

Org-mode to RSS and custom HTML

2011-07-01

Org-mode is one of the many outliner solutions I've seen, and I prefer it because of its slogan "Your Life in Plain Text". It allows me to keep track of my life in the form of notes, lists and plans using only a text editor. As the name suggests, it has its origins in Emacs, and it can export these files to a number of formats. My problem was that I found no easy way to customize the HTML output, and I wanted to create a solution that'd allow me to generate an HTML page and an RSS feed from an .org file of mine.

My first hack was a really rudimentary solution that used ugly regular expressions and sed tied together in a shell script. It had many problems: first of all, it depended heavily on the formatting of the document, even small deviations would've made it fail. Also, it was difficult (e.g. required less-readable constructs) to achieve things that are usually trivial using any XML-friendly environment, for instance closing tags if there are no remaining items, but before closing the whole document.

#!/bin/bash
INFILE="input.org"
OUTFILE="output.html"
rm -f $OUTFILE
cat <<HTML >$OUTFILE
... HTML header ...
HTML
N=1
while read LINE; do
    OUT=$(echo "$LINE" | sed \
        -e 's/^[^*].*$//' \
        -e 's/^\*\ \(.*\)$/<h2>\1<\/h2>/' \
        -e 's/^\*\*\ DONE\ \[\[\([^]]*\)\]\[\([^]]*\)\]\]/<li class="done"><a href="\1">\2<\/a><\/li>/' \
        -e 's/^\*\*\ \[\[\([^]]*\)\]\[\([^]]*\)\]\]/<li><a href="\1">\2<\/a><\/li>/')
    echo "$OUT" | grep -v '/h2' >/dev/null 2>&1 || [ $N -eq 1 ] || echo "</ul>" >>$OUTFILE
    N=$((N + 1))
    echo "$OUT" >>$OUTFILE
    echo "$OUT" | grep -v '/h2' >/dev/null 2>&1 || echo "<ul>" >>$OUTFILE
done <$INFILE
echo "</ul></body></html>" >>$OUTFILE

I thought of playing around with third-party org-mode parsers, such as Orgnode, the Python one I had used and improved, but that would also have been a compromise between a clean solution that involves no external org parsing and the simple but crude shell script, having few pros, but the cons of both. In the end I took a look at the export options of org-mode and found that it's capable of creating DocBook output. I hadn't used DocBook before, I had only heard of it from vmiklos, but I figured out that it's an XML-based document markup language.

The remaining task was transforming XML to XML, and XSLT is a powerful tool for that. I created two stylesheets, one for XHTML output and one for RSS; they worked almost instantly and produced the same (or better) output as the shell script. I also found a bug/feature in the DocBook export: it converts URL-quoted special characters to their literal equivalents (such as %C3%B6 to ö), which may cause incompatibilities in some browsers and also makes the RSS invalid.

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:db="http://docbook.org/ns/docbook"
  xmlns:xlink="http://www.w3.org/1999/xlink">
  <xsl:output method="xml" />
  <!-- Build date, passed in on the command line -->
  <xsl:param name="lbd" />
  <xsl:template match="/">
    <rss version="2.0">
      <channel>
        <title>...</title>
        <description>...</description>
        <link>http://...</link>
        <lastBuildDate><xsl:value-of select="$lbd" /></lastBuildDate>
        <xsl:for-each select="db:article/db:section/db:section/db:title">
          <xsl:if test="not(contains(text(), 'DONE'))">
            <item>
              <link><xsl:value-of select="db:link/@xlink:href" /></link>
              <guid><xsl:value-of select="db:link/@xlink:href" /></guid>
              <description><xsl:value-of select="db:link" /></description>
              <title><xsl:value-of select="db:link" /></title>
            </item>
          </xsl:if>
        </xsl:for-each>
      </channel>
    </rss>
  </xsl:template>
</xsl:stylesheet>
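
As for the URL quoting issue mentioned before the stylesheet, one possible workaround is to re-quote the links in a post-processing step. This is a sketch of the idea in Python 3, not something the stylesheets above do; the requote name and the example URL are made up.

```python
from urllib.parse import quote

# Characters to leave alone: URL delimiters plus '%', so that
# sequences that are already percent-encoded are not encoded twice
SAFE = ":/?#[]@!$&'()*+,;=%"


def requote(url):
    """Percent-encode non-ASCII characters of a URL as UTF-8."""
    return quote(url, safe=SAFE)


fixed = requote('http://example.com/s\u00f6r')      # literal ö in the path
unchanged = requote('http://example.com/s%C3%B6r')  # already quoted
```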

RSS needs the build date in RFC 2822 format, which is much easier to produce in the shell than in some weird XSLT way. I used xsltproc, which allows parameters to be passed on the command line, and the date command can print the date in just the right format. This way the RSS can be published using the following command.

$ xsltproc --stringparam lbd "$(date --rfc-2822)" \
    rss.xsl docbook.xml >feed.rss
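
If the feed were generated from a Python script instead of the shell, the standard library could produce the same kind of timestamp. This is a small sketch for Python 3, offered as an alternative and not as a part of the setup above.

```python
from email.utils import formatdate

# RFC 2822 style timestamp for a fixed moment (the Unix epoch)
epoch_stamp = formatdate(0, usegmt=True)

# The same format for the current time, usable as lastBuildDate
now_stamp = formatdate(usegmt=True)
```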