Bryan Lawrence : xml handling in python

Bryan Lawrence

... personal wiki, blog and notes

xml handling in python

Much of NDG will depend on code for handling xml documents. This will probably need to be done in two architectures: python and java (to support two communities: scientific application programmers and web development of access tools).

Up to now there have been two packages in the frame from the python perspective:

I have been using libxml2 in my certificate checking and signing code for NDG (e.g. xmlsec), but even for such a simple task as signing a document and checking the signature, we had a nightmare getting a java and a python implementation to agree ... it was also apparent that using libxml2 in python sucks. However, we're going to stick with libxml2 for the next six months anyway ...

There is some good news on the horizon: until I read Ryan Tomayko's blog entry I was not aware of the lxml project, which links the elementtree api to libxml2 ...

Even more interesting in some ways is the sequence of discussion between Nelson Minar and Uche Ogbuji on the plethora of un-pythonic ways of doing xml handling in python. Nelson has code snippets from PyXML, libxml2 and elementtree. This was followed up by Uche, initially in two blog entries, the first of which is summarised with:

If you're coming more from a Python background, and XML is just something that's getting in your way, try Amara. If you're coming from an XML background, and you think in DOM, XSLT and all that, try 4Suite.

In Uche's second blog entry he added some snippets from 4Suite and Amara.

This is followed up by Nelson again who concludes:

There are too many XML choices in Python. And the obvious ones aren't right!

Nelson also goes on to comment that even the Amara syntax isn't particularly easy to get started with. Uche's comeback (number three) is interesting, and would appear to demonstrate that Amara is incredibly efficient and easy to use.
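For a flavour of what "un-pythonic" means in this argument, here is a minimal sketch contrasting DOM-style access with elementtree-style access. It is not one of Nelson's or Uche's snippets. It assumes a modern python, where both `xml.dom.minidom` and `xml.etree.ElementTree` ship in the standard library, and the sample document and function names are made up for illustration.

```python
from xml.dom import minidom
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this illustration.
DOC = (
    "<feed>"
    "<entry><title>first</title></entry>"
    "<entry><title>second</title></entry>"
    "</feed>"
)


def titles_dom(text):
    """DOM style: text lives in child text nodes that must be collected by hand."""
    dom = minidom.parseString(text)
    titles = []
    for node in dom.getElementsByTagName("title"):
        parts = [
            child.data
            for child in node.childNodes
            if child.nodeType == child.TEXT_NODE
        ]
        titles.append("".join(parts))
    return titles


def titles_etree(text):
    """elementtree style: elements behave like simple objects with a .text attribute."""
    root = ET.fromstring(text)
    return [title.text for title in root.iter("title")]


if __name__ == "__main__":
    print(titles_dom(DOC))
    print(titles_etree(DOC))
```

Both functions extract the same titles; the difference is how much tree plumbing the caller has to write to get there.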

Buried in the comments to Uche number two is the following on lxml (which is where I started on this from Ryan's blog entry):

    from lxml import etree
    tree = etree.parse('ot.xml')
    tree.xpath('(//v)[5]/text()')

which returns

    [u'And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\n']

The point being that this is a much more pythonesque interface to libxml2 than the existing binding, and is compatible with elementtree. This might allow us some real flexibility for NDG.
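To make the compatibility point concrete, here is a minimal sketch of code written against the shared elementtree api, so that it runs on lxml when it is installed and falls back to the pure-python tree otherwise. The assumptions: a modern python (where elementtree lives in the standard library as `xml.etree.ElementTree`), a tiny made-up document standing in for `ot.xml`, and hypothetical helper names `nth_verse` and `nth_verse_xpath`. Full XPath via `.xpath()` is an lxml extension, so the second helper checks for it rather than assuming it.

```python
try:
    from lxml import etree  # libxml2-backed implementation, if installed
except ImportError:
    import xml.etree.ElementTree as etree  # standard-library elementtree api

# Hypothetical stand-in for ot.xml: chapters (<c>) containing verses (<v>).
SAMPLE = (
    "<book>"
    "<c><v>one</v><v>two</v><v>three</v></c>"
    "<c><v>four</v><v>five</v></c>"
    "</book>"
)


def nth_verse(xml_text, n):
    """Return the text of the n-th <v> in document order (1-based).

    Uses only calls common to lxml.etree and ElementTree.
    """
    root = etree.fromstring(xml_text)
    verses = list(root.iter("v"))
    return verses[n - 1].text


def nth_verse_xpath(xml_text, n):
    """Same result via full XPath when the backend offers it (lxml only)."""
    root = etree.fromstring(xml_text)
    if hasattr(root, "xpath"):
        hits = root.xpath("(//v)[%d]/text()" % n)
        return hits[0] if hits else None
    # No .xpath() on the standard-library tree: fall back to the portable path.
    return nth_verse(xml_text, n)


if __name__ == "__main__":
    print(nth_verse(SAMPLE, 5))
    print(nth_verse_xpath(SAMPLE, 5))
```

Keeping application code on the portable subset, and isolating the lxml-only calls behind a check like this, is one way the choice of backend could be left open.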

However, on libxml2 the final word may come from Uche number one:

... libxml2 is a miracle of function, but alas in a form that doesn't suit Python one bit. I know that folks are working on better libxml2 wrappers, but familiar as I am with the C code, I honestly don't believe they can produce anything truly Pythonesque without losing all the performance gains.

Well, what are we going to do? We're committed to libxml2 in NDG phase one, but that doesn't commit us to how we interact with it: we can either stick with dedicated wrappers (the current approach) or investigate lxml some more.

Now that celementtree is out, and it seems also to be blindingly quick, we'll investigate that for NDG2, so lxml, being compatible with elementtree, has significant advantages. Obviously we'll also investigate Amara, but our previous experience of trying to get 4Suite to work on our Suse boxes (it failed) has not been good (to be fair, we put virtually no effort into it, but the whole point is that these tools need to be trivial to install and use).

Categories: python xml ndg

This page last modified Wednesday 19 January, 2005
DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.