Bryan's Blog 2008/05
Identifiers, Persistence and Citation
Identifiers are pretty important things when you care about curation and citation, and they'll be pretty important for metafor too.
At the beginning of the NDG project we came up with an identifier schema which said that all NDG identifiers would look like this: idowner__schema__localID, e.g. badc.nerc.ac.uk__DIF__xyz123, where the idowner governed the uniqueness of the localID. If we unpack that a bit, we are asserting that there is an object in the real world which the badc has labelled with local identifier xyz123, and we're describing it with a DIF format object. In practice, the object is always an aggregation, so in ORE terms, there is a resource map which has the above ID.
Sam would argue that we blundered by doing that: he thinks the "real identifier" of the underlying object could be thought of as badc.nerc.ac.uk__xyz123, and we might be better using a restful convention, so that we wrote it as badc.nerc.ac.uk__xyz123.DIF. One of the reasons for his argument is that over time (curation), the format of our description of xyz123 will disappear: for example, we know that we are going to replace DIF with ISO (note that we have lots of other format descriptions of objects as well). This matters, because we now have to consider how we are persisting things, and how we are citing them. I would argue that the semantics are the same even if the syntax has changed, but I concede the semantics are more obvious in his version, so it's not too late to change; in particular, I suspect it's more obvious that badc.nerc.ac.uk__xyz123.atom and badc.nerc.ac.uk__xyz123.DIF are two views (resource maps) of the same object (aggregation).
Either way, taking the persistence issue first: once we bed this identifier scheme down, we're stuck with it, so what do we do when we retire the DIF format view of the resource? Well, I would argue that we should issue an HTTP 301 (Moved Permanently) redirect to the ISO view. This means we can never retire the IRI which includes the DIF, but we can retire the actual document. Note that this is independent of the two versions of the syntax above.
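To make the comparison concrete, here's a quick sketch (illustrative only, not production NDG code) parsing the two candidate syntaxes and showing they carry exactly the same information:

```python
# Illustrative sketch: the two candidate NDG identifier syntaxes carry the
# same information (idowner, localID, description format); only the
# packaging differs.

def parse_original(identifier):
    """Parse idowner__schema__localID, e.g. badc.nerc.ac.uk__DIF__xyz123."""
    idowner, schema, local_id = identifier.split("__")
    return {"idowner": idowner, "schema": schema, "localID": local_id}

def parse_restful(identifier):
    """Parse idowner__localID.schema, e.g. badc.nerc.ac.uk__xyz123.DIF."""
    rest, schema = identifier.rsplit(".", 1)   # last dot separates the format
    idowner, local_id = rest.split("__")
    return {"idowner": idowner, "schema": schema, "localID": local_id}

a = parse_original("badc.nerc.ac.uk__DIF__xyz123")
b = parse_restful("badc.nerc.ac.uk__xyz123.DIF")
assert a == b  # same semantics, different syntax
```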
Secondly, what about citation? Well that's an interesting issue (obviously). You might recall (yeah, unlikely I know), that we came up with a citation syntax that looks like this:
Lawrence, B.N. (1990): My Radar Data, [http://featuretype.registry/verticalProfile anotherID]. British Atmospheric Data Centre [Available from http://badc.nerc.ac.uk/data/mst/v3/ URN:badc.nerc.ac.uk/localID]
and we made the point we could use a DOI at the end of that instead. That citation identifier looks a lot like Sam's form (funny that, Sam was sneaking his ideas in while I wasn't looking). A legitimate question is what is the point of the URN in that citation? If the data hasn't moved from http://badc.nerc.ac.uk/data/mst/v3/ the URN is redundant, and if it has, how do I actually use the URN? Well, I suspect we need to make the URN dereferenceable, and it should be a splash page to all the current views - which is essentially the same functionality that you get from a DOI. The only question then is what handle system one trusts enough: a DOI, an HDL, or do we trust that even if the badc changes our name, we can keep our address and use redirects? In which case we have
Lawrence, B.N. (1990): My Radar Data, [http://featuretype.registry/verticalProfile anotherID]. British Atmospheric Data Centre [Available from http://badc.nerc.ac.uk/somepath/badc.nerc.ac.uk/localID].
with or without the "Available from", which I reckon is redundant, and noting that http://badc.nerc.ac.uk/somepath is semantically the same as doi:. Sam's form of URI wins (it doesn't necessarily have to, but it's more economical, in which case the argument for using a handle for citation economy alone is given some weight).
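For what it's worth, making the identifier dereferenceable needs very little machinery. Here's an entirely hypothetical WSGI resolver sketch: it 301-redirects a retired DIF view to its ISO successor, and would otherwise serve the splash page of current views (the identifiers and mapping are made up for illustration):

```python
# Hypothetical sketch of a dereferenceable handle service: 301-redirect
# retired format views to their successors, otherwise serve a splash page
# listing current views. The RETIRED mapping is illustrative only.

RETIRED = {"badc.nerc.ac.uk__xyz123.DIF": "badc.nerc.ac.uk__xyz123.ISO"}

def resolver(environ, start_response):
    identifier = environ.get("PATH_INFO", "").lstrip("/")
    if identifier in RETIRED:
        # The IRI is never retired, but the document behind it is.
        start_response("301 Moved Permanently",
                       [("Location", "/" + RETIRED[identifier])])
        return [b""]
    start_response("200 OK", [("Content-Type", "text/html")])
    body = "<p>Current views of %s: ...</p>" % identifier
    return [body.encode("utf-8")]
```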
EA and Subversion, Resolved
I'd got to the point of recommending, in a support request at codeweavers, that a workaround might be to replace the call to subversion with a windows bat file that invoked linux subversion, rather than trying to get windows native subversion working properly.
Jeremy was far smarter than that. Yes, we've ended up invoking linux subversion, but via a different route.
The first step we took was to replace the native subversion.exe call with a simple linux script (I had no idea that one could even do that, having assumed that from a windows cmd.exe one had to call windows stuff ... but note that the trick was to make sure the script had no filename extension, and point to it in the EA configuration as if it were an executable). Having done that, we could see what EA was up to, and we found a few wrinkles.
Jeremy then came up with a winelib application (svngate) which handles all the issues with windows paths, and also works around a bug in the way EA uses the subversion config-dir (a bug which doesn't seem to cause problems under windows, even though it ought to). In passing, Jeremy also fixed a wee bugette in the wine cmd.exe which was also necessary to make things work. All the code is on the crossover wiki.
So I'm a happy codeweavers client. I'm less happy with how Sparx dealt (commercially) with their end of this, but that's a story for another day. (Update 06/06/08: I'm probably being unfair, their technical support are now taking this and running with it; their linux product will be svngate aware and is getting linux specific bug fixes.)
Update 02/06/08. There was another wrinkle I discovered after a while ... the old cr/lf unix/windows problem. This can be relatively easily fixed, as Jeremy had seen this coming. I created my own version of subversion (/home/user/bsvn) with
#!/bin/bash
svn "$*" | flip -m -
and set SVNGATE_BIN to /home/user/bsvn!
Update 02/06/09. Actually that previous script doesn't quite work in all cases (i.e. where the svn content has blanks and hyphens in filenames). Better seems to be:
#!/bin/bash
svn "$@" | flip -m -
by Bryan Lawrence : 2008/05/24 : 1 trackback : 6 comments (permalink)
The EU has recently seen fit to fund a new project called METAFOR.
The main objective of metafor is:
to develop a "Common Information Model (CIM)" to describe climate data and the models that produce it in a standard way, and to ensure the wide adoption of the CIM. Metafor will address the fragmentation and gaps in availability of metadata (data describing data) as well as the duplication of information collection and problems of identifying, accessing and using climate data that are currently found in existing repositories.
Our main role is in deploying services to connect CIM descriptions of climate and data across European (and hopefully wider) repositories, but right now the team is concentrating on an initial prototype CIM - building on previous work done by many projects (including Curator).
A number of my recent activities have already been aimed at metafor ... in particular the standards review that is currently underway will inform both metafor and MOLES.
For some reason my fellow project participants want to do much of their cogitation on private email lists and fora. While I don't think that's the best way forward, I have to respect the joint position. However, I will blog about METAFOR as much as I can, and I'll obviously be keen to take any feedback to the team.
Metadata, Effort and Scope
I keep on harping on about how metadata management is time intensive, and the importance of standards.
Users keep on harping on about wanting access to all our data, and our funders keep wanting to cut our funding because "surely we can automate all that stuff now".
I've written elsewhere about information requirements, this is more a set of generic thoughts.
So here are some rules of thumb (even with a great information infrastructure) about information projects:
If you need to code up something to handle a datastream, you might be able to handle o(10) instances per year.
If you have to do something o(hundreds of times), it's possible provided each instance can be done quickly (at most a few hours).
o(thousands) are feasible with a modicum of automation (and each instance of automation falls into the first category), and
o(tens of thousands and bigger) are highly unlikely without both automation and no requirement for human involvement.
What about processes (as opposed to projects):
In the UK 220 working days a year is about standard. Let's remove about 20 days for courses, staff meetings etc ... so that leaves about 200 days or, for a working day of 7.5 hours, a working year of about 1500 hours.
So a job that takes a few hours per item can only be done a few hundred times a year. A case in point: in the last year 260 standard names were added to the CF conventions. One person (Alison Pamment) read every definition, checked sources, sent emails for revision of name and definition etc. Alison works half time, so she only had 750-ish hours for this job, so I reckon she had a pretty good throughput: averaging roughly three hours per standard name.
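The arithmetic, for the record:

```python
# Checking the working-year and throughput arithmetic above.
working_days = 220 - 20            # remove courses, staff meetings etc
hours_per_year = working_days * 7.5
assert hours_per_year == 1500.0    # the full-time working year

half_time_hours = hours_per_year / 2   # Alison works half time: ~750 hours
names_added = 260
hours_per_name = half_time_hours / names_added
print(round(hours_per_name, 1))    # roughly three hours per standard name
```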
Now that job is carried out by NCAS/BADC for the global community, and the reasonable expectation is that names that are proposed have been through some sort of design, definition, and internal/community vetting process before even getting to CF.
So, from a BADC point of view, we have to go through every datastream, identify all the parameters, compare with the CF definitions, and propose new names etc as necessary. If all our data had standard names, we'd be able to exploit that effort to produce lovely interfaces where people could find all the data we hold without much effort, if they only cared about the parameter that was measured. But they don't. Unfortunately for us (workload), and fortunately for science, people do care about how things are measured (and/or predicted). So the standard names are the thin end of the wedge. We also have to worry about all the rest of the metadata: the instruments, activities, data production tools, observation stations etc (the MOLES entities).
As a backwards looking project: last year, we think we might have ingested about 20 million files. Give or take. Not all of which are marked up with CF standard names, and nearly none (statistically) were associated with MOLES entities. Truthfully, we don't know how much data we have for which our metadata is inadequate (chickens and eggs). As always, we were under-resourced for getting all that information at the time.
My rule of thumb says our only hope of working it out and doing our job properly by getting the information is to identify a few dozen datastreams (at most), and then automate some way of finding out what the appropriate entities and parameters were. If it's manual, we're stuffed. Sam has another rule of thumb: if something has to be done (as a backwards project, rather than as a forward process) more than a thousand times, unless it's trivial it won't get done, even with unlimited time, because it's untenable for one human to do such a thing, and we don't have enough staff to share it out.
Fortunately, for a domain expert, some of these mappings are trivial. But some won't be, and even distinguishing between them is an issue ...
Still, I genuinely believe we can get this right going forward, and do it right for some of our data going backwards. Do I believe non-domain experts could do this at all? No I don't. So where does that leave the UKRDS which, at least on the face of it, has grandiose aims for all research data? (As the intellectual inheritor of the UKDA, I'm all in favour of it; as a project for all research data, forget it!)
Sam and I had a good chat about cost models today - he's in the process of honing the model we use for charging NERC for programme data management support.
Most folk think the major cost in data management is storage, and yes, for a PB-scale repository that might be true, but even then it might not: it all depends on the amount of diversity in the data holdings. If you're in it for the long haul, then the information management costs trump the data storage costs.
Some folk also think that we should spend a lot of time on retention policies, and actively review and discard data when it's no longer relevant. I'm on record that from the point of view of the data storage cost, the data that we hold is a marginal cost on the cost of storing the data we expect. So I asserted to Sam that it's a waste of time to carry out retention assessment, since the cost of doing so (in person time) outweighs the benefits of removing the data from storage. I then rapidly had to caveat that when we do information migration (from, for example, one metadata system to another), there may be a significant cost in doing so, so it is appropriate to assess datasets for retention at that point. (But again, this is not about storage costs, it's about information migration costs.)
Sam called me on that too! He pointed out that not looking at something is the same as throwing it out, it just takes longer. His point was that if the designated community associated with the data is itself changing, then their requirements of the dataset may be changing (perhaps the storage format is obsolete from a consumer point of view even if it is ok from a bit-storage point of view, perhaps the information doesn't include key parameters which define some aspect of the context of the data production etc). In that case, the information value of the data holding is degrading, and at some point the data become worthless.
I nearly argued that the designated communities don't change faster than our information systems, but while it might be true now for us, it's almost certainly not true of colleagues in other data centres with more traditional big-iron-databases as both their persistence and information stores ... and I hope it won't remain true of us (our current obsession with MOLES changes needs to change to an obsession with populating a relatively static information type landscape).
However, the main cost of data management remains in the ingestion phase, gathering and storing the contextual information and (where the data is high volume) putting the first copy into our archive. Sam had one other trenchant point to make about this: the information gathering phase cost is roughly proportional to the size of the programme generating the data: if it includes lots of people, then finding out what they are doing, and what the data management issues are, will be a significant cost, nearly independent of the actual amount of data that needs to be stored: human communication takes time and costs money!
In a maze of twisty little standards, all alike
I'm in the process of revisiting MOLES and putting it into an atom context, with a touch of GML and ORE on the side. I thought I'd take five minutes to add a proper external specification for people and organisations.
Five minutes! Hmmm !! When will I get back to atmospheric related work? Can I get someone else to go down this hole?
Train of thought follows:
So, there's FOAF.
Do any folk other than semantic web geeks actually use FOAF? (I say geeks advisedly, it's hard to take seriously a specification that explicitly includes a geekcode.) Even from a semantic web perspective wouldn't it be better to use a more fully featured specification and extract RDF from that?
So then there is OASIS CIQ (thanks Simon), which includes
Extensible Name and Address Language (xNAL)
Extensible Name Language (xNL) to define a Party's name (person/company)
Extensible Address Language (xAL) to define a Party's address(es)
Extensible Party Information Language (xPIL) for defining a Party's unique information (tel, e-mail, account, url, identification cards, etc. in addition to name and address)
Extensible Party Relationships Language (xPRL) to define party relationships namely, person(s) to person(s), person(s) to organisation(s) and organisation(s) to organisation(s) relationships.
Well, that feels overhyped, and beyond what is needed. Perhaps xNAL would be OK, but xPIL probably won't gain me much, and I think xPRL is heading into RDF territory in an over-constrained way.
What about reverting to the ISO19115 specification? Arguably ISO19115 shouldn't be defining party information (in the same way I was complaining about it not using Dublin Core), but it does. What would that give us? CI_ResponsibleParty. Well, am I comfortable with this?
What about Atom! Atom settles down to a very simple person specification: name, email address and an IRI. Well that's pretty good, because I have an IRI which doesn't even need to exist (and if it does could point to any of the above). What it does is unambiguously identify a person, unlike name which can differ according to the phase of the moon (Bryan N Lawrence, Bryan Lawrence, B.N. Lawrence, B. Lawrence etc). But I do want some of the other stuff, role and so on, so I could use CI_ResponsibleParty, in the knowledge that I can extract an atomPersonConstruct from that for serialisation into Atom (I think I'd have to stuff the IRI into the id attribute of CI_ResponsibleParty).
OK, I'm going to stop there, it seems that with an IRI to avoid ambiguity, and CI_ResponsibleParty, I can do what I need to do.
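The extraction I have in mind is trivial; a sketch (the dict merely stands in for CI_ResponsibleParty content, and the name, email and IRI are made up):

```python
# Hypothetical sketch: reducing a CI_ResponsibleParty-like record to the
# three-element Atom person construct (name, email, IRI), with the IRI
# carried in the id attribute as suggested above. All values are made up.

def to_atom_person(party):
    """party: a dict standing in for CI_ResponsibleParty content."""
    return {
        "name": party["individualName"],           # free text, varies by mood
        "email": party.get("electronicMailAddress"),
        "uri": party.get("id"),                    # the disambiguating IRI
    }

party = {
    "individualName": "Bryan N Lawrence",
    "electronicMailAddress": "someone@example.org",       # hypothetical
    "role": "author",                                     # kept by ISO, dropped by Atom
    "id": "http://badc.nerc.ac.uk/people/someperson",     # hypothetical IRI
}
person = to_atom_person(party)  # the IRI survives; the role does not
```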
But I could have spent a couple of hours in a far more pleasant way.
(Update 09/06/08: I note that KML 2.2 allows an <xal:AddressDetails> structured address ... but it doesn't use it ... yet.)
Beginning to get a grip on ORE
This Friday afternoon I was trying to get to the bottom of ORE. ORE is pretty much defined in RDF and lots of accompanying text. I've been trying to find a way of boiling down the essence of it. UML (at least as I use it) doesn't quite do the job, so this is the best I could do:
The next step is to look through the atom implementation documentation.
Ignorance is xhtml bliss
Wow, I was mucking with some validation for this site (in passing), and I thought "while I'm here, I might as well change this site to deliver application/xhtml+xml rather than text/html". What a blunder.
I'll fix it for you poor IE folk ... soon.
From ORE to DublinCore
Standards really are like buses, there's another along every minute, exactly which one should you choose? I'm deep in a little "standards review" as part of our MOLES upgrade. I plan to muse on the role of standards another day, this post is really about Dublin Core!
You've seen me investigate atom. You know I've been delving in ISO19115. You know I'm deep into the OGC framework of GML and application schema and all that. You know I think that Observations and Measurements is a good thing.
Today's task was to investigate ORE a little more, and the first thing I did was try to chase down the ORE vocabulary, which, surprisingly, isn't in the data model per se: it lives in its own document. Anyway, in doing so, I discovered something that I must have known once, and forgotten: Dublin Core is itself an ISO standard (ISO15836:2003). Of course no one refers to DC via its ISO roots, because they're toll barred (i.e. the ISO version costs money), whereas the public Dublin Core site stands proud.
What amazes me of course is that Dublin Core and ISO19115 use different vocabularies for the same things, even though Dublin Core preceded ISO19115. What was TC211 thinking? Of course ISO19115 covers a lot more, but why wasn't ISO15836 explicitly in the core of ISO19115? The situation is stupid beyond belief: someone even had to convene a working group to address mapping between them. I've extracted the key mapping here.
Mind you, Dublin Core is evolving, unlike ISO15836 which by definition is static. We might come back to that issue. Anyway, the current DublinCore fifteen which describe a Resource look like this:
|term||what it is||type|
|contributor||a contributors name||A|
|coverage||spatial and or temporal jurisdiction, range, or topic||B|
|creator||the primary author's name||A|
|date||of an event applicable to the resource||C|
|description||of the Resource||D or E|
|format||format, physical medium or dimensions (!)||F|
|identifier||reference to the resource||G|
|language||a language of the resource||B (best is RFC4646)|
|publisher||name of an entity making the resource available||A|
|relation||a related resource||B|
|rights||rights information||D (G)|
|source||a related resource from which the described resource is derived||G|
|subject||describes the resource with keywords||B|
|title||the name of the resource||D|
|type||nature or genre of the resource||H|
We can see the "types" of the Dublin Core elements have some semantics which reduce to
|A||free text (names)|
|B||free text (best to use arbitrary controlled vocab)|
|C||free text (dates)|
|D||really free text|
|F||free text (best to use MIME types)|
|G||free text (best to use a URI)|
|H||free text, B, but best to use dcmi-types|
The last vocabulary consists of Collection, Dataset, Event, Image, InteractiveResource, MovingImage, PhysicalObject, Service, Software, Sound, StillImage, Text. (Note that StillImage differs from Image in that the former includes digital objects as well as "artifacts").
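By way of a sanity check on the table, serialising a handful of the fifteen terms (with made-up values) in the usual Dublin Core XML namespace takes only a few lines:

```python
# Sketch: serialising a few Dublin Core terms for a made-up dataset,
# using the standard dc elements namespace. Values are illustrative only.
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
record = {
    "title": "My Radar Data",
    "creator": "Lawrence, B.N.",             # type A: free text name
    "type": "Dataset",                       # type H: from dcmi-types
    "identifier": "http://badc.nerc.ac.uk/data/mst/v3/",  # type G: a URI
    "language": "en",                        # best practice: an RFC4646 tag
}

root = ET.Element("metadata")
for term, value in sorted(record.items()):
    ET.SubElement(root, "{%s}%s" % (DC, term)).text = value

xml = ET.tostring(root, encoding="unicode")
```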
RDF software stacks.
So we want an RDF triple store with all the trimmings!
We're running postgres as our preferred RDB. We've got some experience with Tomcat as a java service container. We prefer python in a pylons stack and from scripts as our interface layers (ideally we don't want to program our applications in Java).
There appear to be four candidate technologies to consider as part (or all) of our stack: Jena, Sesame, RDFLib, and RDFAlchemy. The first two provide java interfaces to persistence stores, and both support postgres in the backend. RDFLib provides a python interface to a persistence store, but might not support postgres. RDFAlchemy provides a python interface to support RDFLib, Sesame via both the http interface and a SPARQL endpoint, and Jena via the same SPARQL endpoint (and underlying Joseki implementation).
Would using postgres as our backend database perform well enough? Our good friend Katie Portwin (and colleague) think so.
There appear to be three different persistence formats, insofar as RDFLib, Jena and Sesame lay out their RDF content in different ways. Even within Java there is no consistent API:
Currently, Jena and Sesame are the two most popular implementations for RDF store. Because there is no RDF API specification accepted by the Java community, Programmers use either Jena API or Sesame API to publish, inquire, and reason over RDF triples. Thus the resulted RDF application source code is tightly coupled with either Jena or Sesame API.
Are there any (recent) data which compare the performance of the three persistence formats and their API service stacks? It doesn't look like it, but I think we can conclude that either Jena or Sesame will perform OK, and I suspect RDFLib will too. Which of these provide the most flexibility into the future? Well, there are solutions to the interface problem on the Java side: Weijian Fang's Jena Sesame Model which provides access to a Sesame repository through the Jena API, and the Sesame-Jena Adaptor; and clearly from a python perspective RDFAlchemy is designed to hide all the API and persistence variability from the interface developer. I think if we went down the RDFLib route we'd either be stuck with python all the way down (not normally a problem), or we'd have to use its SPARQL interface.
I have slight reservations about RDFAlchemy in that the relevant google group only has 14 members (including me), and appears to be in a phase of punctuated equilibrium as development revolves around one bloke.
Conclusions: if we went down postgres -> tomcat(sesame) -> RDFAlchemy, we'd be able to upgrade our interface layers if RDFAlchemy died by plugging in something based on pysesame and/or some bespoke python sparql implementation (it's been done, so we could use it, or build it; others have built their own pylons thin layers to sesame too). We'd obviously be able to change our backends rather easily too in this situation. (Meanwhile, I intend to play with RDFLib in the interest of learning about manipulating RDFa.)
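For anyone wondering what all these stacks actually have to do, the kernel of any triple store is pattern matching over (subject, predicate, object) tuples. A toy, purely illustrative sketch (RDFLib's Graph.triples does the same with None as a wildcard, though real stores index rather than scan):

```python
# Toy in-memory triple store: illustrates the pattern matching at the
# heart of any RDF stack. None acts as a wildcard in queries.

class TripleStore:
    def __init__(self):
        self.triples = set()

    def add(self, s, p, o):
        self.triples.add((s, p, o))

    def match(self, s=None, p=None, o=None):
        # A real store would index on s, p and o; we just scan.
        for t in self.triples:
            if all(q is None or q == v for q, v in zip((s, p, o), t)):
                yield t

store = TripleStore()
store.add("badc:xyz123", "dc:title", "My Radar Data")
store.add("badc:xyz123", "dc:creator", "Lawrence, B.N.")
titles = [o for (s, p, o) in store.match(p="dc:title")]
```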
Link of the day: State of the Semantic Web, March 2008.
atom for moles
As we progress with our MOLES updating, the issue of how best to serialise the MOLES content becomes rather crucial, as it impacts storage, presentation, and yes, semantic content: some buckets are better than other buckets!
Atom (rfc4287) is all the rage right now, which means there will be tooling for creating Atom and parsing etc, and Atom is extensible. It's also simple. Just how simple? Well, the meat of it boils down to one big UML diagram or three smaller diagrams which address:
The basic entities (feeds and entries),
The basic content (note that the xhtml could include RDFa!)
and links (note that while atom has its own link class for "special links", xhtml content can also contain "regular" html links).
These three diagrams encapsulate what I think I need to know to move on with bringing MOLES, observations and measurements, and Atom together.
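Just to show how simple it really is: a minimal Atom entry (id, title, updated, author, per rfc4287) can be built with nothing but the python standard library. The identifier and values below are made up; a MOLES record would extend this with its own namespaced elements:

```python
# Sketch: the minimal required content of an Atom entry (rfc4287),
# built with the standard library. All values are illustrative.
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
entry = ET.Element("{%s}entry" % ATOM)
ET.SubElement(entry, "{%s}id" % ATOM).text = "badc.nerc.ac.uk__xyz123.atom"
ET.SubElement(entry, "{%s}title" % ATOM).text = "My Radar Data"
ET.SubElement(entry, "{%s}updated" % ATOM).text = "2008-05-24T00:00:00Z"
author = ET.SubElement(entry, "{%s}author" % ATOM)
ET.SubElement(author, "{%s}name" % ATOM).text = "Bryan Lawrence"

xml = ET.tostring(entry, encoding="unicode")
```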
You know, those of us out there in the Ruby/Python/Erlang fringes might think we're building the Next Big Thing, and we might be right too, but make no mistake about it: as of today, Java is the Big Leagues, the Show, where you find the most Big Money and Big Iron and Big Projects. You don't have to love it, but you'd be moronic to ignore it.
... and the programmers ask for Big Money, write Big Code, and we can't afford the Big Money to pay for them, or the Big Time to read the code ... let alone maintain it.
(Which is not to say I have any problems with someone else giving/selling me an application in Java which solves one of my problems - provided they maintain it, I'm not that moronic :-)
There's nothing like a big volcano to remind one of our precarious hold on planet earth. Thanks to James for drawing my attention to Chaiten, and the fabulous pictures, via: Alan Sullivan, nuestroclima and the NASA earth observatory.
Along with genetics, volcanism was the other thing that could have kept me from physics ... neither quite made it :-).
Anyway, I'm not quite sure what sort of volcano it is, nor of the real import of the explosions thus far, but as a caldera type volcano it could be more impressive yet ... if it's even vaguely similar to Taupo. When I was a kid we grew up with stories of the 7m deep pyroclastic flow in Napier (a little over 100 km away).