VSzA techblog

Tracking history of docx files with Git

2012-03-27

As with PHP, OOXML (and docx in particular) is not my favorite format, but when I have to use it, I prefer tracking its history with my SCM of choice, Git. What makes Git well suited to tracking documents is not only that setting up a repository takes one command and a few milliseconds, but also its ability to use an external program to transform artifacts (files) to text before displaying differences, which results in meaningful diffs.

The process of setting up such an environment is best described in Chapter 7.2 of Pro Git. The best tool I found for converting docx files to plain text was docx2txt, especially since it's available as a Debian package in the official repositories, so a single apt-get install docx2txt is enough to install it on a Debian/Ubuntu box.

The only problem was that Git executes the text conversion program with the name of the input file as its first and only argument and expects the converted text on standard output. When called that way, docx2txt (in contrast with catdoc or antiword, which write to standard output) saves the text content of foo.docx in foo.txt. When it reads the document from standard input instead, the text goes to standard output, so I needed a wrapper in the form of the following small shell script.

#!/bin/sh
# Git passes the file name as $1; feeding it on stdin makes docx2txt print to stdout.
docx2txt <"$1"
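The calling convention can be tried by hand before involving Git. The following is only a sketch: cat stands in for docx2txt so that it runs on any machine, and the temporary directory and the file name "my file.docx" are made up for illustration. It also shows why quoting "$1" matters for file names that contain spaces.

```shell
#!/bin/sh
# Sketch: call a textconv-style wrapper by hand, the way Git would.
# Assumption: `cat` stands in for docx2txt so this runs anywhere;
# a real wrapper would call docx2txt instead.
dir=$(mktemp -d)

# Write the wrapper: read the file named by $1, print text on stdout.
cat >"$dir/wrapper.sh" <<'EOF'
#!/bin/sh
cat <"$1"
EOF
chmod +x "$dir/wrapper.sh"   # the wrapper has to be executable

# A hypothetical file name with a space in it.
printf 'hello\n' >"$dir/my file.docx"

# Git passes the path as the first and only argument.
out=$("$dir/wrapper.sh" "$dir/my file.docx")
echo "$out"
```

With an unquoted $1, the shell would split the path at the space and the redirection would fail.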

That being done, the only thing left is to configure Git to use this wrapper for docx files by issuing the following commands in the root of the repository.

$ git config diff.docx.textconv /path/to/wrapper.sh
$ echo "*.docx diff=docx" >>.git/info/attributes
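To check that the settings took effect, Git can be asked which diff driver it would pick for a given file name and which command that driver runs. This is a minimal sketch in a throwaway repository; report.docx is a hypothetical file name that does not need to exist, and /path/to/wrapper.sh is the same placeholder as above.

```shell
#!/bin/sh
# Sketch: verify the setup in a throwaway repository.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q .
mkdir -p .git/info   # normally created by git init; harmless if present

# The same two configuration steps; the wrapper path is a placeholder.
git config diff.docx.textconv /path/to/wrapper.sh
echo "*.docx diff=docx" >>.git/info/attributes

# Which diff driver applies to a (hypothetical) file name?
attr=$(git check-attr diff -- report.docx)
echo "$attr"

# Which command does that driver run?
conv=$(git config --get diff.docx.textconv)
echo "$conv"
```

If the first command reports the docx driver and the second prints the wrapper path, git diff and git log -p should show text differences for docx files.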

