Bryan's Blog 2006/05
Suffering with unicode
Our discovery portal needs to handle discovery documents which may not include XML with the correct encoding declarations.
To see the sort of problem this introduces, consider the following Python code:
    import ElementTree as ET
    a = unicode('André', 'latin-1')
    b = '<test>%s</test>' % a
    c = ET.fromstring(b)
    Traceback (most recent call last):
      File "<stdin>", line 1, in ?
      File "ElementTree.py", line 960, in XML
        parser.feed(text)
      File "ElementTree.py", line 1242, in feed
        self._parser.Parse(data, 0)
    UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)
(amusingly I couldn't use my embed handler to format the python code for this wiki entry because of the same problem)
Now, I can fix this, sort of, using:
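The original snippet hasn't survived here, but (translated into modern Python 3 terms, with an invented example string) the "sort of" fix was essentially the encode('ascii','replace') approach mentioned below:

```python
import xml.etree.ElementTree as ET

a = "Andr\xe9"  # non-ASCII content (an invented example)
# Degrade anything non-ASCII rather than letting the parser choke on it:
b = "<test>%s</test>" % a.encode("ascii", "replace").decode("ascii")
c = ET.fromstring(b)
assert c.text == "Andr?"  # the é has been replaced by '?'
```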
I say sort of because, for arbitrary content (the content a in this example), it still breaks ... it would seem that the resulting string b can still fail to import into ElementTree ...
For example, in one document we have 1.3 <= tau < 3.6 in some encoding ... and even after using an encode('ascii','replace') option we get this error:
    self = <ElementTree.XMLTreeBuilder instance>
    self._parser = <pyexpat.xmlparser object>
    self._parser.Parse = <built-in method Parse of pyexpat.xmlparser object>
    data = '<DIF><Entry_ID>badc.nerc.ac.uk:DIF:...</DIF>
    ExpatError: not well-formed (invalid token): line 1, column 11389
        args = ('not well-formed (invalid token): line 1, column 11389',)
        code = 4
        lineno = 1
        offset = 11389
(and the string above includes column 11389) which at least is no longer a unicode error.
Well, I'm not going to solve this now, because I'm off on holiday for a week, but if anyone has solved it by the time I get back, I'll be very happy!
Update: well I haven't even gone on holiday yet, but Dieter Maurer on the xml-sig mailing list has pointed out that the < sign needs to be escaped in XML ... and I'm pretty certain that it isn't ... so now I can go on holiday knowing that it's a relatively simple thing to fix!
So the bottom line was that I had two sets of problems with my input docs: genuine unicode problems and unescaped content ... and I thought it was just one class of problem which is why I struggled with it ...
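With hindsight, both classes of problem can be dealt with up front; a minimal sketch in modern Python 3 terms (the content string is invented for illustration):

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "1.3 <= tau < 3.6, courtesy of Andr\xe9"

# Problem 1 (unescaped content): escape &, < and > before embedding.
# Problem 2 (genuine unicode): hand the parser bytes in a known encoding.
doc = ("<test>%s</test>" % escape(raw)).encode("utf-8")
elem = ET.fromstring(doc)
assert elem.text == raw  # round-trips cleanly
```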
ndg security and ubuntu
The current version of NDG security has a number of Python dependencies, including m2crypto and pyxmlsec ... the good news is that it's relatively easy to get things in place under Ubuntu; the bad news is that it's a pain working out which packages you need, hence this list (which is mainly for my benefit). You've probably already got libxml2 installed, but you also need:
all of which can be installed using apt-get. You will also need to get a tar file of pyxmlsec which you can install with
sudo python setup.py build install
and we need cElementTree, which is also a download and a python setup.py install task ...
We know this will be a pain in some circumstances, and we plan next year to try and significantly streamline our package dependencies ...
One of the things we have grappled with rather unsatisfactorily in the NDG is how to declare in discovery and browse metadata that specific services are available to manipulate the described data entities, and, for a given service, what the binding between the service and the data id is in order to invoke a service instance.
This is pretty important, as has been pointed out numerous times before, but probably most eloquently in an FGDC Geospatial Interoperability Reference Model discussion:
For distributed computing, the service and information viewpoints are crucial and intertwined. For instance, information content isn't useful without services to transmit and use it. Conversely, invoking a service effectively requires that its underlying information be available and its meaning clear. However, the two viewpoints are also separable: one may define how to represent information regardless of what services carry it; or how to invoke a service regardless of how it packages its information.
Thus far in NDG, where in discovery we have been using the NASA GCMD DIF, we have been pretty limited in what we can do, so we extended the DIF schema to support a hack in the related URL ...
Basically what we did is add into the related URL the following:
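The fragment itself hasn't survived here, but it was something along these lines (a hypothetical reconstruction: the element names mimic the GCMD DIF Related_URL structure, and the host, service name and dataset id are all invented):

```xml
<Related_URL>
  <URL_Content_Type>NDG_DataExtractor</URL_Content_Type>
  <URL>http://dataextractor.example.ndg/browse?service=SomeService&datasetid=some.dataset.id</URL>
  <Description>Invoke the NDG data extractor on this dataset</Description>
</Related_URL>
```

(note the unescaped ampersand in the query string)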
Leaving aside the fact that we've embedded a naughty character (&) in what should be XML, we then create a binding for a user in the GUI between that service and the dataset id ... it's clumsy, ugly, and of no use to anyone else who might obtain our record via OAI.
Ideally of course the metadata needs to be useful both to a human and to automatic service discovery and binding tools. In the example above, we (NDG) know how to construct the binding between the service and the dataset id to make a human-usable (clickable from a GUI) URL, but no one else would. Likewise, there is no possibility of interoperability based on automatic tools. Such tools would be likely to use something like WSDL, or ISO19119, or both, or more ... (none of which provides much information about the semantics of the operations on offer; one needs a data access query model (DAQM), which we've termed "Q" in our metadata taxonomy).
However, if we step back from the standards and ask ourselves what we need, I think it's something like the following:
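For instance (an invented sketch, showing a service description alongside an explicit binding to a data id):

```xml
<DataEntityRecord>
  <DataDescription>
    <DataID>example.dataset.id</DataID>
  </DataDescription>
  <ServiceDescription>
    <ServiceType>DataExtractor</ServiceType>
    <ServiceAddress>http://services.example.ndg/extractor</ServiceAddress>
    <QueryModel>Q</QueryModel>
  </ServiceDescription>
  <ServiceBinding>
    <!-- how to combine the service address and the data id into an invocation -->
    <InvocationTemplate>http://services.example.ndg/extractor?datasetid=example.dataset.id</InvocationTemplate>
  </ServiceBinding>
</DataEntityRecord>
```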
where I've made up the tags to get across some semantic meaning (yes, I know, I should have done it in UML).
OK, I think I know what we need, now how does this work in the standards world, what have I forgotten? What do we need to do to make it interoperable, and what are the steps along the way?
Well, those are rhetorical questions; I know some of the first things I need to do: starting with a chat to my mate Jeremy Tandy at the Met Office, who is wrestling with the same questions for the SIMDAT project, and then I think I'll be off reading the standards documents again. I suspect I'll have to find out more about OWL-S, as I'm pretty sure there will be more tools in that area (given that ISO19139 is only just arriving for ISO19115, and there is no matching equivalent that I'm aware of for ISO19119).
Evaluating Climate Cloud
Jonathan Flowerdew at the University of Oxford has been working with me for the last two and a half years on methods of evaluating clouds in climate models. We've recently submitted a paper on this work to Climate Dynamics. If you want a copy of a preprint, contact him (or contact me to get his contact details).
The use of nudging and feature tracking to evaluate climate model cloud, by J.P. Flowerdew, B.N. Lawrence and D.G. Andrews
A feature tracking technique has been used to study the large-scale structure of North Atlantic low-pressure systems. Composite anomaly patterns from ERA-40 reanalyses and the International Satellite Cloud Climatology Project (ISCCP) match theoretical expectations, although the ISCCP low cloud field requires careful interpretation. The same technique has been applied to the HadAM3 version of the UK Met Office Unified Model. The major observed features are qualitatively reproduced, but statistical analysis of mean feature strengths reveals some discrepancies. To study model behaviour under more controlled conditions, a simple nudging scheme has been developed where wind, temperature and humidity fields are relaxed towards ERA-40 reanalyses. This reduces the aliasing of interannual variability in comparisons between model and observations, or between different versions of the model. Nudging also permits a separation of errors in model circulation from those in diagnostic calculations based on that circulation.
ISO 21127 aka CIDOC CRM - more metadata dejavu
Most of my colleagues in the environmental sciences won't have come across ISO 21127 (to be fair, it may not yet exist, but heck, most of my colleagues in environmental science haven't come across any ISO standard ...). I was introduced to the concepts behind it by my colleague Matthew Stiff, from the NERC Centre for Ecology and Hydrology (CEH), and I've just been nosing through a powerpoint tutorial introducing the CIDOC Conceptual Reference Model (CRM) (ppt), which would appear to be the heart of it. Maybe the topic: Information and documentation -- A reference ontology for the interchange of cultural heritage information isn't going to engage too many of my colleagues, but maybe it should, because the key concept is that:
Semantic interoperability in culture can be achieved by an "extensible ontology of relationships" and explicit event modeling, that provides shared explanation rather than prescription of a common data structure.
That sounds familiar, if we change "events" to "observations", and replace "in culture" with "in environmental science" we'd all be on the same page ... although maybe some of my hard science mates wouldn't like the word relationships ...
Reading on we find that the CIDOC CRM aims to approximate a conceptualisation of real world phenomena ... sounds like the feature type model approximating the universe of discourse to me ... One key difference from the GML world though is the early acceptance of objects (features in my language) having multiple inheritance (which is hard to do in XML schema, hence a problem for GML).
I'm not the first to make the link between the ISO/OGC world and the CIDOC world of course: Martin Doerr, one of the CIDOC authors, made the connection explicit in a comparison of the CIDOC CRM with an early version of what became GML (pdf). Regrettably his conclusions are a bit dated now (five years is a long time in our business). It'd be interesting if someone did a comparison of the OGC Observations and Measurements spec (or the new draft) with the CIDOC CRM ... meanwhile, when I can get my hands on the standard itself, it may make interesting reading to help inform our semantic web developments.
kubuntu dapper beta broken
I really wanted beagle, and the breezy beagle was broken (or to be precise it seemed to be progressively corrupting my kde directories), so I upgraded to the new dapper beta ...
It all seemed fine when working from home, and the new beagle (with kerry-beagle) is great, but I came into work this morning and stuck the laptop into the docking station, and two significant problems occurred:
cups only sees my local printer, and I can't make it browse for the network printers (despite browsing being on in my cups config file).
and nothing I do seems to change anything; I can't change anything via kcontrol or the web interface to cups.
These problems don't seem new! It's pretty poor if a major release like kubuntu dapper (even in beta) can't get printing right!
Update: OK, some of my problems were to do with an incompatibility between my (pre-existing) cupsd.conf file and the new release. Using the new release's file means I can now edit the file, but browsing is still broken ...
Xorg has broken ... I used to be able to slap the laptop down, and get my 1920x1200 big screen working by restarting the x-server ... but now nothing I can do is making this screen anything but a clone of my laptop 1024x768. I've just spent rather more time than I should have trying to fix it ...
Note to self: the radeon driver has a man page ...
by Bryan Lawrence : 2006/05/02 (permalink)