
Bryan's Blog 2006

bye bye Google

Our role in the web ecosystem is to make it easy for scientists to do science, which means making it easy for them to find data, to manipulate it, and to do new things. Making it easy to find things means giving them as many tools as possible!

The Google SOAP API was my initial motivation for our data discovery deployment. However, along with our SOAP API (which will lie at the backend), we'll have a RESTful vanilla API and probably a suitable version of OpenSearch too. So it's a crying shame (via Tim Bray) that Google is backing away from their leadership position in this area. With every such step they lose a little bit of their magic, and a little bit of the good will that the community had for them - maybe I will try that other search engine after all!

by Bryan Lawrence : 2006/12/19 : 0 trackbacks : 0 comments (permalink)

Access Control

I've said it before, and I'll say it again. If you have high volume or high value real resources on the web, you need access control!

Back in August I introduced the simple "gatekeeper" methodology we have planned within the DEWS project. The idea could be used to "protect" any resource, and in particular, along with the WCS, we could use it for OPeNDAP.

We could do it the following way:

  • Deploy pydap.org.

  • Introduce a layer of WSGI middleware that calls ndg security, providing the gatekeeping functionality directly within that application (a minimal sketch follows this list).
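
To make that concrete, the gatekeeper might look something like this. It's a sketch only: check_access stands in for whatever the real ndg security interface provides (token extraction, session manager lookup, attribute certificate checks); it is not NDG code.

# Sketch only: a WSGI gatekeeper wrapped around an unmodified pydap
# application. check_access() stands in for the real ndg security calls.

def check_access(environ):
    """Placeholder: decide whether this request may reach pydap."""
    # The real version would extract the security token from the request,
    # consult the session manager, and test the requestor's roles against
    # the policy for the requested resource.
    return environ.get('HTTP_X_NDG_TOKEN') is not None

class Gatekeeper(object):
    """WSGI middleware enforcing access control in front of pydap."""
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        if check_access(environ):
            # Authorised: hand the request straight on to pydap.
            return self.application(environ, start_response)
        start_response('401 Unauthorized', [('Content-Type', 'text/plain')])
        return ['Access denied by the NDG gatekeeper\n']

# usage (hypothetical): application = Gatekeeper(pydap_wsgi_application)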

We'd never have to modify pydap. But we would need to think about how the clients interact with the pydap server. Realistically, no one uses pydap with a browser: people either use bespoke OPeNDAP clients, or an application linked to a client library (the most popular of which appear to be the various netCDF bindings).

In either case, it's unlikely that folk would be rebuilding their applications to take advantage of ndg security, so we'd need to deal with an out-of-band establishment of the security context.

ndg security requires the gatekeeper to have access to a proxy certificate (to identify the data requestor) and an ndg attribute certificate (to assert what roles the data requestor has). If a browser were to contact a gatekeeper, the normal sequence of events would be a redirect to a login service, which would instantiate a session manager instance; that would obtain the credentials and load them into an ndg wallet instance, before redirecting back to the gatekeeper with URL arguments, which would then be used to populate a client-side cookie with the session id and the session manager address (for future requests).

In the case of OPeNDAP (or any other non-ndg-specific applications which we might wish to secure with ndg security) we need to 1) obtain the credentials independently of the data request, and 2) communicate those credentials to the gatekeeper with every request, without the benefit of a cookie.

We could do the first of these with a command equivalent to grid_proxy_init in the Globus Toolkit (although our version would probably need the URI of the data object, in case specific attribute certificates were required). That command, say ndg_securitycontext_init, could populate a local file-based ndg wallet, but there'd still be no way of getting the certificates through a client application to our gatekeeper server except via the URL. So we'd not be able to use the contents of the local wallet directly, although it could be queried to provide a token as an argument in the URL, which could then be intercepted by the gatekeeper.

Ideally the token would be the proxy and attribute certificates themselves, but that would make already cumbersome OPeNDAP URLs horrendous. It'd probably be as easy to utilise a remote session manager instantiated by the ndg_securitycontext_init command, and simply provide the address of that wallet in the URL, although we'd have to do this with each request (no cookie, remember :-).

Obviously that token would be worth intercepting by the bad guys, because if you can get it, you can pretend to be the data requestor until the token contents time out. We could minimise the risk of this by insisting on using https to communicate with our gatekeeper (and pydap) server, but this would be an overhead on the data transport, which might be a problem for high volume transmissions.

Alternatively, we could try to make the token useless if intercepted. A lightweight way of doing this, which would stop at least the casual threats, would be to ensure the token includes the IP address of the data requestor (see the sketch below).
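
For illustration only, such a token might be no more than an HMAC over the requestor's IP address and an expiry time, checked again at the gatekeeper. The shared secret, the token format and the ndg_token parameter name below are all invented for the sketch; none of this is ndg security itself.

# Sketch only: a lightweight token bound to the requestor's IP address and
# an expiry time. Everything here (secret, format, parameter name) is
# invented for illustration.
import hmac, hashlib, time

SECRET = 'shared-between-token-issuer-and-gatekeeper'

def make_token(client_ip, lifetime=3600):
    """Issue a token tied to an IP address, valid for lifetime seconds."""
    expires = str(int(time.time()) + lifetime)
    sig = hmac.new(SECRET, client_ip + ':' + expires, hashlib.sha1).hexdigest()
    return expires + ':' + sig

def check_token(token, client_ip):
    """Gatekeeper-side check: reject expired tokens or a different IP."""
    try:
        expires, sig = token.split(':')
    except ValueError:
        return False
    if int(expires) < time.time():
        return False
    good = hmac.new(SECRET, client_ip + ':' + expires, hashlib.sha1).hexdigest()
    return sig == good

# The token would then be appended to an otherwise normal OPeNDAP URL, e.g.
# http://ourgatekeeper.address/some/dataset.nc.dods?ndg_token=<expires>:<sig>

Replay from the same machine (or from behind the same NAT) within the token lifetime would still be possible, which is exactly the residual risk discussed further down.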

We could also make sure that the data request itself (the URL) was signed and encrypted with the proxy certificate in the wallet, leaving only replay attacks possible (the token wouldn't be independently available). We'd end up with a URL something like:

http://ourgatekeeper.address/encrypted_data_request

but it'd be a nightmare for the user: imagine doing a sequence like the following for every transaction in a Matlab or IDL session (assuming we've already instantiated our ndg security context):

  1. identify the data uri

  2. run ndg_encrypt_uri programme (which we'd have to provide)

  3. copy the new cumbersome url to the application

  4. get an error

  5. wonder whether we got our copy and paste right

  6. realise we had the original data uri wrong (for example)

  7. ...

So, that's going to be too hard for real people. We might have to live with a one-time token produced by the ndg_securitycontext_init command, which can be appended to "normal" OPeNDAP URLs ...

I'm sure the real security folks wouldn't like this as much as full ndg security, as it would be possible to spoof the originating IP, and bad guys can exist on the same machine as good guys. However, in practice I think this would be a sufficient level of security for all our datasets - but it'd still be ugly!

Ideally of course the OPeNDAP folks would modify their libraries to have an access control call-out which would allow any security infrastructure to be bolted in. There is a rumour that the Australian Bureau of Meteorology has contracted them to do something like that, but I've yet to see any details. (And of course ideally we wouldn't have to have our own bespoke ndg security either, but until the mainstream meets our access control needs, we'll continue to have to roll our own!)

by Bryan Lawrence : 2006/12/19 : Categories ndg python : 0 trackbacks : 1 comment (permalink)

Service Descriptions

A little over six months ago, I introduced the thorny problem of service binding to my blog. Of course it hasn't gone away. Last week I gave a talk (see the SEEGRID talk on my talks page) about "Grid-OGC collision", in which I made some specific statements, amongst which were:

At the moment WSDL2 is where we are investing our thinking time (ISO19119 is a meta-model for services rather than a SDL)

GRID has more sophisticated service binding, access control and authentication, workflow! The OGC community should not reinvent tooling!

I also made a throwaway comment about Service Description Languages, along the lines of "there's another one along every week", by which I meant that folk appear to keep getting pissed off with what's available and building their own cut-down, simpler SDLs.

This was all in the context of knowing that many in the OGC community see the ebXML registry information model (ebRIM, pdf) as a key part of the way forward for solving this sort of problem. (The reason why ebRIM is so attractive is that it allows independent description of services and metadata, with the associations in the ebRIM catalogue being used to produce late binding of what can be done to specific datasets. Of course using ebRIM only postpones the question of what is used to describe the services, and what associations are put in the system.)

Josh Lieberman called me on those assertions, and asked me specifically to say exactly what the Grid community had that the OGC community should use! Of course I dissembled: one of the problems with being in my position is I know just enough to make assertions (and decisions) based on what I've read and heard (via the UK e-science community), but I make little attempt to remember the detail.

Before I get going, I have to confess to some lazy use of the word grid in the context of the talk and this note. Frankly, what I really wanted to do in the talk was to encourage the OGC community to avoid reinventing wheels, and when I said Grid, I really meant "everyone outside the OGC community who is dealing with tightly coupled, strongly typed systems and the semantics of their interoperation"; that is, I meant it in the cyberinfrastructure/e-science/research sense. I should also confess that not being part of the OWS4 initiative means that there are probably sections of the OGC community who would both agree with my position and, even better, have done something about it!

It turns out that Josh already knows about everything below, but I've persisted in writing it here, because a) it's a useful collection of links for me, and b) it might provoke some useful discussion, either now or at the inception of a future NDG project (if it happens) ... Flawed or not, this is where I am:

  1. Ideally we have to put some sort of online-resource service endpoint description in our dataset metadata. In this context we can't rely on the RIM alone, because the ISO19139 documents may be harvested and used elsewhere, where the RIM association content may not be available.

  2. From various corridor gossip I've heard that populating and interacting with ebRIM systems may be fiendishly hard.

  3. Ideally we want service descriptions themselves to be independent of the dataset endpoint descriptions.

  4. If we believe in Service Orientated Architectures at all, then one of the reasons for doing it is to allow orchestration.

  5. Orchestration won't spring up ab initio; we need well-described services and things that humans can work with first!

  6. The hardest part of service description is producing a machine-readable (and understandable) description of what the service does and what its inputs and outputs are.

  7. The OGC community probably leads the world in strong typing of complex data features (based on the concepts of ISO19101 and ISO19109). However, while the domain modelling in UML is quite mature (e.g. via the HollowWorld), and the serialisation of data objects into GML is mature, the serialisation of feature interfaces has been forgotten.

  8. There have been many attempts at building web service description languages (too many to list), and even the odd (incomplete) attempt to evaluate some of them.

  9. There are two types of coupling that we care about (tight/loose), and two types of typing (strong/weak) in the services we care about. I've introduced these before. Ideally we should be using something which can support services built around all four combinations.

  10. In the case of tight coupling and strong typing, we can build sophisticated orchestration frameworks with real workflows. In this situation a human may be able to build complex workflows without inspecting all the service details themselves, simply by relying on tools. However, we need to remember that because the resulting service interfaces are likely to be quite brittle, changing one service interface will mean considering rebuilding all the client service interfaces!

  11. In the case of loose coupling and weak typing, orchestration is more likely to be done by well-informed humans, since there is unlikely to be enough machine-understandable information in the web service descriptions (however presented) to allow complex automatic detection of what is possible and allowed (i.e. mash-ups are great, but depend on humans to work out what is possible).

  12. Many of us will be building services that belong in both camps, so being able to use the same tooling for both jobs would be helpful!

  13. I can't find a description of the OGC GetCapabilities document which would allow one to construct any generic tooling which is not tightly bound to the service it describes (i.e. as far as I can tell the GetCapabilities document can only be consumed by something which already understands the service-type it is looking at; see the sketch after this list). Nonetheless, GetCapabilities does at least expose what data exists under a service (if it is a service which exposes data at all).

  14. Even the OGC invented GetCapabilities expecting it to be replaced (page 10 of 04-016r3):

    A specific OWS Implementation Specification or implementation can provide additional operation(s) returning service metadata for a server. Such operations can return service metadata using different data structures and/or formats, such as WSDL or ebRIM. When such operation(s) have been sufficiently specified and shown more useful, the OGC may decide to require those operation(s) instead of the current GetCapabilities operation.

  15. WSDL2 is a vast improvement on WSDL1.1 in terms of support for the semantic description of data types. In particular, of relevance to the OGC community, the data types can be described by reference to external xml-schema ... i.e. we can use our GML application schema to constrain service types. (Incidentally, the WSDL2 primer is one of the clearest documents I've found in terms of exposing what some of these things do!).

  16. The grid community has been developing clear concepts of how to build semantic descriptions of web services (e.g. OWL-S - see especially the description of the relationship between OWL-S and other web service description languages).

  17. While there are a number of different ways of annotating WSDL2 to produce semantic descriptions of the services exposed (e.g. SAWSDL and WSDL-S), there are already tools being developed to create and exploit them.

  18. People have said really good things to me about SCUFL (Simple Conceptual Unified Flow Language) and Taverna. I understand from the folks at Newcastle that they might not be applicable to the OGC services processes, but the key point is that folk find these sorts of tools useful in real workflows. Even if SCUFL/Taverna aren't appropriate, there is also some work being done with BPEL at OMII which may be more relevant. It sounds like the SAW-GEO project will be looking at this stuff in great detail, and I'm looking forward to their outputs.

  19. We've built the concept of affordance into our next version of CSML1. However, we'll not be able to use those affordances in service descriptions within the NDG2 project.
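
To make point 13 a little more concrete, the fragment below is about the limit of service-type-agnostic processing of a GetCapabilities document: fetch it and list the element names it contains. The URL is a placeholder, and anything more useful requires knowing the service type in advance, which is exactly the point.

# Sketch only: "generic" inspection of a GetCapabilities document. Without
# knowing the service type in advance, about all we can do is report the
# element names that appear. The URL below is a placeholder.
import urllib
from xml.etree import ElementTree

def capability_element_names(url):
    """Return the sorted set of element local names in a capabilities doc."""
    tree = ElementTree.parse(urllib.urlopen(url))
    return sorted(set(el.tag.split('}')[-1] for el in tree.getiterator()))

# e.g. capability_element_names(
#          'http://example.org/wms?SERVICE=WMS&REQUEST=GetCapabilities')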

So, what's next?

Of course not all web services are OGC web services, and that's especially true of us in the NDG. The best we will be able to do in the next few months is to provide human readable service descriptions within metadata documents. However, if we ever get to build an NDG3, choosing the right service description methodology will be crucial. We would expect to use WSDL2 and ebRIM, but are hopeful that simpler interfaces to ebRIM will be available by then.

1: which isn't visible at that link yet, but will be in a month or so; you can see pre-release stuff at http://proj.badc.rl.ac.uk/ndg (ret).

by Bryan Lawrence : 2006/12/07 : Categories ndg metadata computing : 0 trackbacks : 1 comment (permalink)

Swivel Data

I've just found out about swivel.com (via Savas).

The idea is that apparently one can "upload data, any data" and "display it to others visually". It's a nice idea, but it's also grossly false advertising. One cannot upload "any" data and "visualise" it. Maps? Binary data? That said, I can still see utility for some.

It's also not a particularly new idea. Looking through the comments on a techcrunch article about swivel shows a number of other places trying to do a similar thing (http://www.data360.org and http://www.1010data.com). There's also a pointer to a now defunct company with a similar model.

While I wish them well, it's frustrating for those of us with complicated datasets and/or complex long-term data management problems. It leads to unrealistic expectations of what can be done with generic data archives.

Issues that they will eventually need to consider:

  • Exactly what is the underlying data model? In what format should data be submitted to conform with it? (No, I'm not going to register to find out whether answers to these questions exist behind the login screens. That sort of stuff needs to be upfront!)

  • One of the reasons we demand our data providers don't give us spreadsheet data is because inevitably they come with inadequate metadata, and it's damned hard to do any automatic cataloguing1 of data contents etc. So I suspect they may struggle with their cataloguing once they have a lot of data, and the users may struggle with actually using the data for anything real - particularly if they last a year or two and their metadata links start breaking!

  • Just like YouTube, they're going to have to worry about licensing! Folk are quite precious about their data IPR!

1: The IR folk are always claiming this isn't a problem - any old text search mechanism will work fine - and that's true provided the text exists and the data model is bloody simple, which it might well be in the case of swivel. (ret).

by Bryan Lawrence : 2006/12/07 : 0 trackbacks : 1 comment (permalink)

Service Orientated Architecture - Two Years On!

I've been blogging for slightly over two years. After I wrote my SOA article earlier today, I realised that my second ever blog post was on SOA. Then, as now, I had been reading Savas's blog.

Unfortunately, in the intervening time two things have happened:

  • I have spent more and more time down the SOA hole, to the detriment of my Atmospheric Science, and

  • I forgot this:

    Share schema and contract, not type (avoid the baggage of explicit type, allow the things on either end to implement them how they like).

This was in a discussion of distributed objects versus service orientated architectures. I should have reread that entry and the links within (especially this one) before wading into deep waters! It would have helped focus my argument.

Going back, then, the original article that inspired my second blog post said:

The motivation behind service-orientation is to reduce the coupling between applications.

Well, I'm not so sure about that: I think it's really about what Lesley Wyborn in her SEEGRID III presentation called "decomposition" - and the reason why we want decomposition is so that we can have "orchestration". So it's actually about reducing the coupling within applications, so the components of application(s) can be reassembled to do new and interesting things - potentially across system and/or organisational boundaries. Having done so, we might still want strong typing (although this paints an interesting perspective on typing and the use of xml schema).

Anyway, I need to get back to thinking about other activities here for a while and forget about SOA details. Fortunately Simon Cox sent me a photo which puts SOA in a proper perspective, and he's kindly allowed me to put it up here:

Image: static/2006/12/07/SOA.jpg

by Bryan Lawrence : 2006/12/07 : Categories ndg : 1 trackback : 2 comments (permalink)

who pays the watcher?

The Observer/Guardian reports that the Met Office budget for climate change "has been slashed".

Or more accurately, that's what the headline says. The main story is a bit more confusing: the budget cuts appear to be over the whole Met Office, not just for climate change, and the slash is a proposed cut of 3% on the main forecasting activity. The story points out that Defra, who pay for most of the climate change work, have not yet indicated whether they are planning a cut. The story quotes an unnamed Met Office spokesperson.

Before I say any more let me firstly state that 1) I haven't spoken to anyone at the Met Office about this, 2) I work with the Met Office, and 3) I am trying to get funding from Defra as well (for supporting the communication of climate change results and storing the data ... I'll blog more of that another day) so I'm not a completely disinterested bystander.

This sort of story is incredibly annoying on a number of levels. The reporters have conflated climate change research with forecasting - the climate change work depends on the supercomputing, but the Met Office has a complex internal market, so it's not obvious that there is much, if any, risk to the climate change work in this cut. Maybe there is a genuine issue there, but I can't tell from the article1. I'll have to ask someone (but may not be in a position to blog the answer).

It also provides plenty of ammunition for the septics ... the reason why we should worry about this cut is because this is the hottest year on record? Should we not worry about cuts in a cold year?

There's obviously a problem here. Most of us in the business of climate change research (and support I suppose ... I can hardly claim to be doing climate science per se any more) would argue that more funding is needed to minimise uncertainty in prediction - especially at the regional scale. However, the septics always claim that we preach gloom and doom in order to increase the funding. Catch-22. But articles like this one, which barely make grammatical sense and conflate lots of issues (there are others mixed in there too), don't help anyone!

1: Although obviously climate science is a bit of an ecosystem activity: cut the weather research in the wrong place and you'd definitely influence the climate science, but there are bound to be some parts of weather research at the met office that don't affect climate science! (Not that I'm arguing that cutting weather research is a good idea either, just asking for clarity in what's actually at risk here) (ret).

by Bryan Lawrence : 2006/12/04 : Categories climate environment : 0 trackbacks : 0 comments (permalink)

No Silver Bullet Exists

Another bout of web services "religious war" has broken out again. We've been here before! This time it's based on one funny and accurate diatribe about SOAP. The resulting frenzy in the blogosphere has yielded some quality comments, and even some declarations of victory by those who think REST is the one true way.

Amongst the furore, there were four key comments:

  1. Nelson Minar, whose opinion on SOAP web services has to be respected, states:

    The deeper problem with SOAP is strong typing. WSDL accomplishes its magic via XML Schema and strongly typed messages. But strong typing is a bad choice for loosely coupled distributed systems. The moment you need to change anything, the type signature changes and all the clients that were built to your earlier protocol spec break. And I don't just mean major semantic changes break things, but cosmetic things like accepting a 64 bit int where you used to only accept 32 bit ints, or making a parameter optional. SOAP, in practice, is incredibly brittle. If you're building a web service for the world to use, you need to make it flexible and loose and a bit sloppy. Strong typing is the wrong choice.

  2. This is backed up by Joe Gregorio:

    ... if you don't have control of both ends of the wire then loosely typed documents beat strongly typed data-structure serializations.

  3. And finally, Sam Ruby pointed out that it's just much easier to handle problems with developing restful applications. He also makes the following throw away:

    In addition to all the architectural benefits of REST, as well as all the pragmatic experience the web has built up over time with caching and intermediaries - benefits and experience that WS-* forsakes ...

  4. Gunnar Peterson makes the point that it's not just about unsecured applications (via Sam Ruby):

    ... But if you are going to say that REST is so much simpler than SOAP then you should compare REST with HMAC, et al. to the sorts of encryption and signature services WS-Security gives you and then see how much simpler it is.

Which brings me to why I wanted to say something. It's just not as simple as some might say, and even Roy Fielding didn't claim that REST solved all the problems in the world! I wonder how many of the vociferous RESTful advocates have actually read his thesis?

Some choice quotes:

Some architectural styles are often portrayed as "silver bullet" solutions for all forms of software. However, a good designer should select a style that matches the needs of the particular problem being solved. Choosing the right architectural style for a network-based application requires an understanding of the problem domain and thereby the communication needs of the application, an awareness of the variety of architectural styles and the particular concerns they address, and the ability to anticipate the sensitivity of each interaction style to the characteristics of network-based communication.

The REST interface is designed to be efficient for large-grain hypermedia data transfer, optimizing for the common case of the Web, but resulting in an interface that is not optimal for other forms of architectural interaction.

REST is not intended to capture all possible uses of the Web protocol standards. There are applications of HTTP and URI that do not match the application model of a distributed hypermedia system.

So where do I stand on all this? In practice, some of the applications of this stuff are simply not "distributed hypermedia applications" (which is what REST was designed for). Some of them really are distributed object activities (the OGC things), and some of the assumptions of REST are violated in the grid world - for example, I'm not terribly interested in automatic caching when my data objects are huge (10 GB+), write performance is as important as read performance, and latency doesn't matter because machines are doing the work. (I'm happy to provide quotes from Fielding's thesis where he lists these things as reasons for REST!)

But we don't just do big science data moving at the BADC. In trying to get my thoughts together on this, I came up with the following classification of communities based on what they are trying to do:

                     Strong Typing              Weak Typing
  Tight Coupling     Grid (Implicit1 Typing)    "Secured"2 Web Applications
  Loose Coupling     OGC (Explicit3 Typing)     "Web 2" & Mash-ups

where

  1. Implicit Typing: Nearly all grid applications are file based at the "usage" point or use OGSA/DAI. Either way the applications are tightly coupled to an implicit knowledge of exactly what the contents of the data resources are - the semantic content of the data resources is known to the (human) builder of workflows. I'm not aware of any real attempt to do late binding of services based on the semantic content1 of the data resources.

  2. Secured: Here I'm implying more than just the use of https as a transport mechanism. There is usage of sophisticated AAA mechanisms which include role-based access control - but in the final analysis the actual processes or transactions are relatively loosely coupled.

  3. Explicit: The OpenGeospatial Community has built very sophisticated mechanisms for building detailed descriptions of computational objects (features) which match onto the features of the real world. These objects have detailed structures and multiple attributes, may be decomposable, and have implicit interfaces which afford behaviour.

(As an aside, note that in general the tightly coupled systems have strong security. Also that recent work within the OGC community has been building SOAP bindings and strong security into the OGC web service paradigm. It's unfortunate that that work is being done in the context of a project known as GeoDRM ... which has all sorts of bad connotations ... mostly it's about GeoAAA2, not DRM!).

All the really good arguments for REST lie in the bottom right corner of the table ... and to be fair, its also where the majority of web usage should lie too!

The reality is that the availability of tooling plus the type of task makes decisions for me! We're building things that are a mash-up of web-services (which we secure using WS-security) and plain old XML type services (which we secure using gatekeepers that exploit WS-*). We do do some REST things. We're not using much "pure" grid tooling, because the python tooling isn't mature enough yet. But we will.

So there is no silver bullet right now! For me, any sort of fundamentalism sucks. The final word belongs to Nelson Minar:

Truly, none of this protocol fiddling matters. Just do something that works.

(Update, 4th Dec. Aaargh: the material in italics above somehow got lost presumably because I missed the last seconds of my wireless time in the great southern blackspot.)

1: Where I'm meaning semantic content at the level of detail exposed by, for example, GML Application Schema (ret).
2: Authentication, Authorisation, Access (ret).

by Bryan Lawrence : 2006/12/01 : Categories ndg computing : 2 trackbacks : 2 comments (permalink)

A proposal for profiling ISO19139

I've been flagging issues with profiling ISO19139 for some time - see Oct 19 and Aug 15 and especially the comments in the latter.

Over this week a group of us (Rob Atkinson, Simon Cox, Clemens Portele and myself) have been reviewing the reality of problems with profiling ISO19139 and conforming with appendix F of ISO19115.

The following is my summary of what I think we agreed.

extensions

We believe the situation is relatively straightforward for extensions to ISO19139:

  1. A given community decides on a profile

  2. The extensions are defined and documented in UML, so that an extended element/class is a specialised class which

    1. lives in a specific profile package

    2. has a different name, and

    3. has an attribute which documents which iso type it is intended to replace.

  3. This new package is then serialised into a new schema,

    • where the new class retains the different name from the UML definition and maintains the gco:isoType attribute to identify the parent (extended iso) element.

  4. and instances are validated against that schema using standard mechanisms.

  5. Interoperability requires that when these instances are made available outside the community for which the profile is understood, the producer should transform the document back to vanilla ISO19139 (easily done, since all extended elements can be renamed using the isoType attribute, along with removal of material from the new namespace - see the sketch below).
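
For what it's worth, that renaming step is mechanical. Something like the following would do it (a sketch only, using lxml, with an invented profile namespace; a real implementation would also need to tidy up the leftover namespace declarations):

# Sketch only: turn a profiled instance back into vanilla ISO19139 by
# renaming extended elements to the element named in their gco:isoType
# attribute. The profile namespace is invented for illustration.
from lxml import etree

GMD = 'http://www.isotc211.org/2005/gmd'
GCO = 'http://www.isotc211.org/2005/gco'
PROFILE = 'http://example.org/our-profile'

def to_vanilla_iso19139(doc):
    for element in doc.iter('{%s}*' % PROFILE):
        iso_name = element.get('{%s}isoType' % GCO)
        if iso_name is not None:
            # gco:isoType holds something like "gmd:MD_Metadata"
            element.tag = '{%s}%s' % (GMD, iso_name.split(':')[-1])
            del element.attrib['{%s}isoType' % GCO]
    return doc

# usage: to_vanilla_iso19139(etree.parse('profiled_record.xml'))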

restrictions

We believe that most communities will be profiling ISO19139 by restriction.

Typical restrictions will include (but not be limited to):

  • limiting the cardinality of attributes (e.g. making an optional attribute mandatory),

  • restricting the type of some attributes (e.g. forcing something to be a date where currently a date or a string is allowed).

  • replacing string options with codelists

In all these cases, the resulting instance documents ought to conform directly to the parent ISO19139 schema, since these restrictions have not introduced new semantics to any element/class - they are simply constraints.

Accordingly, where a community profile wishes to restrict a given element/class, we recommend that:

  1. The restrictions are defined and documented in UML, so that a restricted element/class is a specialised class which

    1. lives in a specific profile package

    2. has a different name, and

    3. the constraints are modelled using the Object Constraint Language (OCL), and

    4. the specialisation is stereotyped as a <<restriction>> to guide serialisation1,2.

  2. These constraints are then serialised as schematron3 rules, which are maintained separately from the profile schema (if it exists - a purely restrictive profile would not require a new schema).

  3. Instances are then validated using the standard XML techniques (which should ensure that they are valid ISO19139) and by a schematron processor (a minimal sketch of this step follows the figure below).

  4. Interoperability of these instances is trivial given they directly conform to ISO19139 (there is no requirement by the consumer of such a profile to see the constraint serialisation).

For example:

Image: static/2006/11/30/iso19139_restriction.jpg
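
To illustrate the validation step, here is the sort of schematron rule the OCL might be serialised into (the particular restriction - making the optional gmd:hierarchyLevel mandatory - is invented for the example), together with one way an instance might be checked against it with lxml, alongside the ordinary schema validation:

# Sketch only: an invented schematron rule of the kind the OCL constraints
# would be serialised into, plus validation of an instance with lxml.
from lxml import etree
from lxml.isoschematron import Schematron

RULES = """
<schema xmlns="http://purl.oclc.org/dsdl/schematron">
  <ns prefix="gmd" uri="http://www.isotc211.org/2005/gmd"/>
  <pattern>
    <rule context="gmd:MD_Metadata">
      <assert test="gmd:hierarchyLevel">
        This profile makes hierarchyLevel mandatory.
      </assert>
    </rule>
  </pattern>
</schema>
"""

def conforms_to_profile(instance_file):
    """True if the instance satisfies the profile's restrictions."""
    schematron = Schematron(etree.fromstring(RULES))
    return schematron.validate(etree.parse(instance_file))

# Run after (not instead of) normal XSD validation against vanilla ISO19139.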

Right now, constructing the schematron serialisation requires manual (human) interpretation of the OCL. It would appear that a direct serialisation of the full power of OCL would be non-trivial, so we are recommending that

  1. communities attempt to use the most simple OCL commensurate with their restricting requirements, and

  2. publicise their best practice, so that eventually

  3. it would be possible to codify the best practice into a "recommended OCL for profile restrictions" document, which would then be amenable to

  4. the development of an automatic parser (that, for example, could be built into a future version of ShapeChange).

Thus far the only significant criticism of this schematron serialisation approach is that it might not be possible to trivially build a metadata editor which conforms to any arbitrary profile since such tooling would need to be able to both parse the schema and the schematron.

The only other possible approach would be "clone-and-modify" at serialisation. In this case, at serialisation time the schema name changes and the element definitions are directly restricted in the new schema. This new schema then looks like, and behaves like, ISO19139, but isn't ISO19139: we believe that inevitable governance issues would arise in the maintenance of the serialisation. Further, as in the extension case, instances would need transformation when shared outside the community.

However, it would appear that deploying schematron constraints may not be that difficult in some tools. For example, tools exist that can use schematron with XForms.

Further, it is our belief that it will be easier to deal with hierarchical (and even multiple-inheritance) profiling using the OCL approach as well. For example, an organisation generating metadata may belong to multiple governance domains (e.g. the BADC is a British institution producing atmospheric data: one might expect our ISO19139 profiles to conform to both the British and WMO standard profiles). It would be easy to test this for restrictions: we simply validate using both schematrons independently!

1: The use of a stereotype for serialisation guidance is the methodology followed in ISO19139 itself. We've chosen restriction here. Although we're not totally comfortable with the particular choice of word, none of us could come up with a better one. (ret).
2: It was suggested that we could do without the stereotype, since the serialisation code could identify targets for serialisation into schematron by the fact that the specialisation consisted of OCL alone. However, it is possible that profile maintainers might choose to create real specialisations in an extension (for example, by categorising a particular class into a number of different restricted specialisations, each of which would need to be named). Although that's an extension case, the parser needs to be agnostic as to whether it's an extension or restriction as it processes each element from the UML. (ret).
3: It may be there are other ways of implementing OCL, but schematron seems to be the tool of choice at the moment (ret).

by Bryan Lawrence : 2006/11/30 : Categories ndg metadata iso19115 : 0 trackbacks : 10 comments (permalink)

Wireless Internet Blackspot - Australia

I've spent most of the week in Canberra, Australia, attending three different events - a standards workshop, AUKEGGS and the SEEGRID III conference (programme pdf).

It's been quite a frustrating experience for communication:

  • The hotel I'm staying in has no high-speed internet (despite claiming to do so on its website!). I would have moved, but there appear to be no other hotel rooms available in this town for love nor money (one of the conference attendees has been sleeping on local sofas!).

  • None of the three venues had wireless that could be configured for public access, nor people available to deal with registration onto the available networks.

  • The public hotspots (like the one I'm sitting in now) are filthy expensive ($26 Australian for two hours connectivity!), and provide poor connectivity (I can't get my VPN to get up and stay up).

I can get email (at this hotspot), but I have to do it through the braindead outlook web interface ... which in practice means I'm cherry picking a handful of things that a) I spot amongst the deluge, and b) I can deal with. The rest is just building up ... ready to be a millstone around my neck next week.

In a wired and wireless world, connectivity matters. I can't afford to be this out of touch. Do I want to come back to Australia under these circumstances? Nope!

by Bryan Lawrence : 2006/11/30 : 0 trackbacks : 2 comments (permalink)

Common search terms at badc

Here are the top twenty search terms recently requested on the badc site.

  1    rainfall  
  2    wind  
  3    temperature  
  4    sst  
  5    wind speed  
  6    hadisst  
  7    ecmwf  
  8    rain  
  9    hadcm3  
  10    cet  
  11    radiosonde  
  12    precipitation  
  13    ozone  
  14    midas  
  15    soil temperature  
  16    solar radiation  
  17    humidity  
  18    faam  
  19    sunshine  
  20    solar  

Proof positive, I think, that we need to call the ontology server in our search backend, so that search terms actually match the right datasets (currently rainfall matches 5 datasets, rain 13, and precipitation 3).
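
In caricature, this is the sort of expansion I mean - the hard-wired synonym table below merely stands in for a call to the real vocabulary/ontology server:

# Caricature only: expand a user's search term via an "ontology" before it
# hits the discovery backend. The table stands in for the real ontology
# server; it is not how the NDG vocabulary service works.
SYNONYMS = {
    'rain': ['rainfall', 'precipitation'],
    'rainfall': ['rain', 'precipitation'],
    'precipitation': ['rain', 'rainfall'],
}

def expand_query(term):
    """Return the user's term plus its ontological equivalents."""
    term = term.lower().strip()
    return [term] + SYNONYMS.get(term, [])

# expand_query('rainfall') -> ['rainfall', 'rain', 'precipitation'], so all
# of the rain-related datasets turn up whichever of the three terms is used.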

by Bryan Lawrence : 2006/11/21 : Categories ndg badc : 0 trackbacks : 0 comments (permalink)

status of OGC specs

At a recent internal meeting we found the following little table helpful:

  name    commonly implemented    current spec    draft spec
  WCS     -                       1.0             1.1
  WFS     1.0                     1.1             1.2?
  WMS     1.1.1                   1.3             -
  WPS     -                       -               0.4

The important things to note are

  • that implementers lag behind specs, and that's ok

  • specs sometimes are not backward compatible, and that's ok (if there is a good reason),

  • the newer specs are much better for our community than the old ones, but there is much to do.

Update, December 7th: I'm told that even where implementations claim to adhere to the older specs, many are not even complete implementations of those specs.

by Bryan Lawrence : 2006/11/20 : Categories ndg : 0 trackbacks : 0 comments (permalink)

browser crap

As is fairly obvious from the range of dribble that appears on this website, I dabble in many of the technologies my team works with. One of the things I've been doing is developing the interface to the new NDG discovery service (this is the sort of thing I find I can do on the train home from meetings when my brain is too addled to think properly, as it's kind of mechanical: make a mistake, fix it, move slowly on - it beats sudoku and patience and the other stuff I see tired folk do at the end of the day).

Anyway, this is about browser crap. The following simple piece of CSS is designed to support identical buttons (whether "real" buttons, or pseudo-buttons which are really hyperlinks). I developed it on Firefox, and then looked at how it rendered under IE and Konqueror.

Oh how glad I am that I don't do interfaces for a living.

Here's the CSS:

a.button {border: 1px solid black; background-color: #F4F4F4; 
          padding-bottom: 3px; padding-top:3px; 
          padding-left: 4px; padding-right: 4px; 
          text-decoration:none; color:black;}

button.button {border: 2px solid black; background-color: #F4FFFF;                  
               padding-bottom: 1px; padding-top:1px; 
               padding-left: 4px; padding-right: 4px; 
               text-decoration:none; color:black;}

And this is what we get:

  • Firefox:

    Image: 2006/11/15/ndgCSSfirefox.jpg

  • Konqueror:

    Image: 2006/11/15/ndgCSSkonqueror.jpg

  • Internet Explorer:

    Image: 2006/11/15/ndgCSSie.jpg

Note the vertical alignment of the "real" button, and the font differences under IE. Yuck. As it happens I've decided I don't really want them to look alike, but I felt like bleating about how crap browsers are at consistent CSS support (and this for a really simple piece of CSS usage).

by Bryan Lawrence : 2006/11/15 : Categories ndg computing : 0 trackbacks : 2 comments (permalink)

Answers.Yahoo.Com to the rescue

Once upon a time I mentioned that we now have a Megane estate car. I raved about its fuel economy.

Today I was confronted with two light bulb failures to fix: the reversing light, and the front left indicator. No problem, I thought. Well, thus far I've failed to fix the reversing light (but it looks doable), and I failed to replace the indicator light too. Fortunately I didn't have to, because at the point where I'd just about given up trying to fit my hand up an impossibly small hole, while turning the air blue about the quality of French engineering, I resorted to the Internet. Luckily (thanks Google too) I found this, which I'm repeating in its entirety because I trust myself more than Yahoo to keep this for the lifetime of my car:

... lock your steering out left or right depending on which side is blown you will see a removable cover which you might need a screwdriver to pop off..now this is tricky and make sure you take off your watch or your arm will get stuck put your hand up and feel around for a flat plastic leg about an inch long this is the holder that the bulb sits into then screws into the lamp most of the time the bulb isn't blown its just this holder comes loose so turn on your indicators and twist the holder to tighten it and they more than likely will start working again as this closes the contacts between the bulb and the lamp...if the bulb is blown twist the other way and keep pulling back while twisting so it will pop out when the legs get into position you will have to fiddle around the bulb/holder to get it down and out the gap in the panel as there isn't much room usually u have to take your hand out to give room for the bulb to come down then change the bulb and you should grease the rubber seal on the holder just makes twisting it back easier..fitment is the exact opposite of removal but it is tricky and will have to bend your hand over and back and twist around and in and out but if you have the patients you should have no problem...but say a prayer you just have to tighten it not change the bulb!

by Bryan Lawrence : 2006/11/12 : 0 trackbacks : 0 comments (permalink)

taking OGSA DAI seriously again

Early on in the evolution of the NERC DataGrid we investigated OGSA/DAI, which is a "data access and integration" component of the Globus stable. We rejected it for a number of reasons, chief of which were that the software was immature, and it didn't seem to offer much more than what has recently been termed WS-JDBC (albeit perhaps with a dash of WS-XML:DB API).

Of course the years roll on, and maybe we should have revisited it, but it has always seemed like the rest of the globus toolkit - a good idea, a bit immature, maturity just over the horizon - maybe useable next year ... with the same fatal flaw: for a group with relatively little spare engineering time, it was always over the horizon, every year!

It still looks like for us it'll be next year (technically "next year" now means "in some successor project to NDG2"), despite the fact that even within the Met community it's gaining some traction. However if it's going to be next year, we have to know some more about it sooner than that, so it seemed good to see that Ian Foster had chased up on the criticism by getting Malcolm Atkinson to say a few words.

The trouble is, even after reading the article, I'm not convinced we should investigate further. Malcolm's points were essentially:

  1. it supports multiple backend formats (OK, but this actually just means you still have to know how to query and understand the schema which defines the backend content) ...

  2. it is extensible ("OGSA-DAI has three popular extensibility points, the data resource adapters, the activities and the client libraries") ... hmmm in what way is JDBC not extensible in the same sense? (OK, I know the answer to that, it's obvious that OGSA-DAI would allow a consistent framework for access to multiple backends, but the other two are surely the same? Still, score a point for OGSA/DAI).

  3. "... OGSA-DAI contains a variety of multiple-data source functions, such as DQP (still in prototype), and multi-site query facilities which deal with partial availability." Hmmm, well I understand the words "Distributed Query Processing" but don't really understand how that is much use unless the backend resources are pretty homogeneous (never the case for me). The rest is a mystery to me.

So, I'm still not enlightened. Which is a source of frustration (it always seems like Globus nearly offers us what we want). However, I can see one minor place where OGSA/DAI might have helped us in NDG2, but I can't help wondering whether the investment in effort would have been commensurate with the reward ... given that my problem still remains that for all the problems which I find interesting, we worry about the nature of the things we store (their "feature-type"). For us, describing those features in a way that is queryable and can be interpreted by client software is the domain of the OGC web services. It's not clear to me what role OGSA/DAI could play for us in that context (although there is a "delivery" component of OGSA/DAI which is nagging at me in the context of asynchronous delivery of data, which could be important for WPS or WCS with big data objects).

In the UK we have a couple of projects running under the banner of "Grid-OGC collision", at least one of which I think seems to be aiming to confront OGSA/DAI with OGC WFS/WCS. As I say, I can't see the mileage in it myself, but I've been wrong before, will be wrong again, and am glad someone else is doing the investigating. If there is an NDG3, we'll be looking to those projects to guide us as to whether there really is a role for OGSA/DAI in our activities which is beyond a putative "WS-JDBC".

by Bryan Lawrence : 2006/11/12 : Categories ndg : 0 trackbacks : 4 comments (permalink)

climate change speed

Climate change is one of those (many) phrases that every reader/listener interprets in their own way. James Annan has interpreted it in one way in a discussion which I'll summarise as "the a priori1 position that climate change would be detrimental has no scientific/logical underpinning". (Well, actually I think he's really expressing the converse: "that no change is the best possible outcome, is not necessarily wrong"; but it amounts to the same thing.)

Put yet another way: given that it's unlikely that some cosmic conspiracy has put us at a maximum of some sort of climate-ecosystem-human-society efficiency, it's hard to argue a priori that change would be a bad thing, as it might take us somewhere better. (The metric of exactly what is better is irrelevant to the argument, but I was disappointed that James' arguments were all pretty western hemisphere-acious.)

However, I think there's a fatal flaw in this argument. I think one can argue a priori that the speed of climate change could be detrimental, even if the place one ended up was somehow "better" (always assuming we reached an equilibrium before economic/ecosystem meltdown). It seems quite clear that neither existing human societies, nor natural ecosystems (if such still exist), can respond to a rapidly changing environment without detrimental impact.

So, my point is that climate change is a problem both in terms of magnitude and rate of change. Some of us are as much worried about the speed of change as we are about the eventual magnitude. (James does allude to the issue of rapid climate change in passing, but I think misses its importance in discussing whether the status quo may or may not be better than the result of climate change.)

Given that the speed of change is at least as important as (if not, at some point, more important than) the magnitude of the change, arguments that rest on our ability to adapt need to be taken with a grain of salt. The first question should not be whether we (or our ecosystems) can adapt enough, but whether they can adapt as quickly as things are going to be changing.

I think we now know that the speed of change is going to be a problem, but I would argue that actually the most likely a priori position would have been to expect just that: Anthropogenic climate change is most likely to be detrimental - if only because of the speed! And unlike James, I do think this is a logical position!

1: Here my definition (and I think that of James) is that a priori means "before we do/did any calculations that actually show the expected climate change would be detrimental" (ret).

by Bryan Lawrence : 2006/11/12 : Categories climate environment : 1 trackback : 7 comments (permalink)

Goodbye Korea

I'm sitting typing this in Incheon airport (which is unbelievably quiet, nearly everything is closed, and while it's now 10 pm, stuff was closed when I got here around 8.30 pm).

Sure enough I didn't get out of the hotel ... not enough time at lunchtime, and complete exhaustion last night. But I had a fabulous view out of the hotel window, and all my colleagues were raving about how nice a city Seoul is. I must come back (and I must remember not to even consider driving ... the traffic is a nightmare).

The conference itself was a bizarre mixture of overview talks with (from my perspective) little content, and some detailed descriptions of testbeds and exemplar systems (like our NERC DataGrid). I think nearly everyone in the audience must have spent half their time frustrated: the technical folks bored with the overview talks, and the WMO representatives trying to find out about the planned WMO Information System buried with acronyms and unfamiliar concepts in the technical talks. Still, most folk seemed to be getting something out of it.

From an NDG perspective, what was interesting is how much everyone is converging on the same technologies. OAI is everywhere. Everyone is struggling with ISO19139 (some more aware of the ramifications than others). OGSA/DAI is making a come-back. The WIS will interact with academia far more successfully than earlier WMO systems (and while WMO Resolution 40 rears its ugly head as far as inter-country interoperability is concerned, at least it protects the doing of science).

Now for the ludicrously long flight back to the UK (via Dubai - don't ask!)

by Bryan Lawrence : 2006/11/07 : Categories ndg : 1 trackback : 0 comments (permalink)

Sanity from Hulme

Today's posts may be giving some the wrong impression about what I think about climate change. I've been decrying attacks on foundations, not supporting what Mike Hulme terms "The Discourse of Catastrophe", that is, the whipping up of a "State Of Fear" (ouch, could it be that Crichton at least got that bit right?).

Of course what we really need is reasoned discussion about adaptation and mitigation, and the development of a proper understanding of what climate change might actually be.

Just to set the record straight (thanks for pointing this out James), I thoroughly agree with Mike's article, which concludes with:

I believe climate change is real, must be faced and action taken. But the discourse of catastrophe is in danger of tipping society onto a negative, depressive and reactionary trajectory.

(And still no sleep ... I go back to bed to try between posts ... honest).

by Bryan Lawrence : 2006/11/05 : Categories climate environment crichton : 0 trackbacks : 0 comments (permalink)

Jetlagged thoughts on amateurism

I knew that I might come into some criticism for using words like "Pretentious" and "dribble" and "drivel" about Monckton's "article" in the Sunday Telegraph. It has started already (see the comments). One of the problems about my using this sort of language is that it opens the door to others using the same sort of language about me ...

Fair enough. I suppose I should have tried to be more temperate. I will try (for example, I've severely censored my original version of what follows). But it makes me very VERY annoyed that newspapers give this sort of stuff oxygen.

Why?

Last week I was driving from somewhere to somewhere, and en route I listened to an interesting piece on Radio Four about Poincaré. One of the things they said resonated: Poincaré was working at a time (the last time!) when it was possible for an outstanding mathematician to understand in detail the entire breadth of mathematics.

I don't know when the last time was that it was possible for someone to understand in detail the entire breadth of atmospheric science, but it sure isn't now! An observational atmospheric scientist will not understand numerical modelling in detail (heck, even a modeller is unlikely to understand in detail all of the GCM she/he is using; one has to rely on experts in other fields - a GCM has an atmosphere and an ocean, Q.E.D.)

So how do we cope? Peer review is how we cope. We rely on a mechanism which ensures (as far as possible, it's not perfect) that the pieces we put together are validated by peers - that is, by people who do understand in detail those pieces - and the joining together of the pieces is validated by people who understand the joining together, and then the interpretation is validated by people who understand the methodology of interpretation. And all the while we try and include quantified uncertainty, and probability estimates, and caveats, caveats and more caveats. And sometimes we find fault, we find errors in what has been done and published, and so we redo, we improve, and we move forward.

Then someone comes along, drives a truck and trailer through this, simply cannot have done due diligence, and hacks away at some poorly understood detail. The whole process is damned in a few words. The work of hundreds of scientists is attributed to "the UN" as if some civil servant somewhere had produced a briefing paper, rather than the IPCC process being about the best synthesis of knowledge about climate it is possible to create (I'm not saying it's perfect, but I am saying there is currently nothing better). I'm sorry, but the day of the educated amateur has pretty much gone (and Richard Lindzen, if you ever read this, the day of the MIT professor knowing all there is to know about everything has gone too!)

With no peer review, just a dose of editorial whimsey, and Monckton's thoughts get read by more people than any of the thoughts of the hundreds of individuals who have spent years contributing to understanding the climate.

Don't anyone mention "balance" either. If the ST published reviews of the hundreds of papers that have been peer reviewed and gone into this work, then fair enough, they could publish the other stuff too - but this isn't balance, it's the equivalent of putting Monckton's gnat on a giant seesaw with a few battleships on the other end, and claiming his end would sit on the ground.

So, should I waste my time reading his article through properly, and damning the arguments? No I shouldn't. I blog for many reasons, sometimes for fun, sometimes to contribute my pieces to public understanding of science, sometimes for catharsis - like now - and sometimes professionally to provide notes to myself and my colleagues. Responding might fit in the Public Understanding category, but I think my time would be better spent elsewhere. And when he gets his thoughts published in a serious journal, I'll take them seriously. Frankly, it's this sort of nonsense that stops some of my colleagues blogging. They simply don't want to open themselves up to the tedium of responding. The nice thing about peer review is that rubbish only has to be rejected by two or three reviewers. We don't all have to waste our time.

I'm sorry if that sounds elitist. It's not a cult of the elite. I'm more than happy for anyone to try and distil the state of knowledge, and in my own little way, I sometimes try and contribute to the public understanding of science, but damn it, enough's enough ...

Yes, it's five a.m., and I'm jetlagged, and can't sleep, and this hasn't helped.

by Bryan Lawrence : 2006/11/05 : Categories environment : 1 trackback : 1 comment (permalink)

Another thing I would do if I had time ...

... would be to take this load of pretentious dribble apart. I can't actually bring myself to read it through. It's introduced by an equally drivellacious article in the Sunday Telegraph. That article included this gem:

Dick Lindzen emailed me last week to say that constant repetition of wrong numbers doesn't make them right.

What Monckton needs to consider is that the constant repetition of discredited arguments doesn't make them credible. It's a delicious irony that he can't see that ...

What's not amusing is that some hardworking folk are going to have to take more time to discredit it (and it'll be easy). I just wish I had the time to add my voice, but I suspect it's better for me to be working on making sure that the raw material (credible numbers) is discoverable, documented and easily manipulated.

I came across this stuff following up from William (especially the comments), which led me to Tim Worstall. I was interested in trying to understand the arguments about the economic scenarios. I have to confess that this issue is my Achilles heel: I've never taken the time to really understand the economic plausibility or otherwise of the various scenarios. I plan to change that over the next six months or so, in particular as the fourth assessment report is released (I've promised myself to read every word).

by Bryan Lawrence : 2006/11/05 : Categories environment climate : 1 trackback : 5 comments (permalink)

Not getting to see Seoul

I'm in Seoul for two days, and will have spent nearly as long getting here and going home as I've got here. When I was younger I used to want a jetset lifestyle, it sounded so glamorous. The truth is far more mundane: long distance air travel is uncomfortable, of dubious morality1, and oftentimes one doesn't see anything at all. This trip I've arrived in the dark, will leave in the dark, and have two full days work to do. I might, just might, leave the hotel and see a local restaurant tomorrow night (but I may collapse from jetlag and do room service, as I've done tonight) ... Sorry Korea, you deserve more time (I have the interest, but this is not the time!)

I'm here to give a talk on the data discovery work we are doing in the NERC DataGrid, as part of the World Meteorological Organisation's Technical Conference on WMO Information Systems (TECO-WIS). The talk I'm giving is this:

The NERC Metadata Gateway

The Natural Environment Research Council's NERC DataGrid (NDG) brings together a range of data archives in institutions responsible for research in atmospheric and oceanographic science. This activity, part of the UK national e-science programme, has both delivered an operational data discovery service, and built and deployed a number of new metadata components based on the ISO TC211 standards aimed at allowing the construction of standards compliant data services. In addition, because each of the partners has existing large user databases which cannot be shared because of privacy law, the NDG has developed a completely decentralized access control structure based on simple web services and standard security tools.

One of the applications has been the redeployment of an earlier data discovery portal, the NERC Metadata Gateway, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI/PMH). In this presentation we concentrate on the practicalities of building that discovery portal, report on early experiences involved with interacting with, and harvesting, ISO19139 documents, and discuss issues associated with deploying services from within that portal.

1: I only do it in those situations when I think "being there" will make a difference (ret).

by Bryan Lawrence : 2006/11/05 : 0 trackbacks : 0 comments (permalink)

The Future of Physics and Science

  • A few days ago the BBC carried a news item: science students have to work (much) harder (in terms of time) than arts students at university.

  • Student fees are starting to bite.

  • Not many jobs in the paper have "Physicist" in the title, even though many employers may well hire physicists in preference to other graduates to fill posts.

  • Yesterday I discovered that the University of Reading is closing its physics department - not enough students enrolling.

These things are not unconnected! Deja vu. As a former physics academic, I've seen it all before ... (in another country :-).

But it's not just physics that suffers: who's going to do all the hard environmental science then?

(Update: trackback closed due to excessive trackback spam)

by Bryan Lawrence : 2006/11/03 : Categories environment : 0 trackbacks : 2 comments (permalink)

The Stern Way Forward

Yesterday I waded through the first two thirds of the Executive Summary ... I thought it best to finish it today, otherwise I would be at risk of not only not reading the whole thing, but not making it through the executive summary :-)

The way forward he proposes depends on three pillars:

  • Establishing a price for carbon,

  • A technology policy, and

  • The removal of barriers to behavioural change.

In terms of pricing Stern recommends use of one or more of tax, trading or regulation, with the mix depending on choices within specific jurisdictions. I have to say, without reading the main text, I don't understand how he can argue that different jurisdictions can achieve carbon prices in different ways (surely it would lead to some form of carousel fraud) ... but I'm no economist, so ok ...

Policy incentives include

... technology policy, covering the full spectrum from research and development, to demonstration and early stage deployment ... but closer collaboration between government and industry will further stimulate the development of a broad portfolio of low carbon technologies and reduce costs.

I find it (amusing, sad, worrying) that the report suggests that existing policy incentives to support the market should only increase by two to five times. This at a time when the UK incentives run out half way through the year. That suggests to me that an order of magnitude increase is necessary (after all, the take up is relatively low, and even with incentives one has to be pretty wealthy to get into home generation).

In terms of behavioural change, he makes the point that:

Even where measures to reduce emissions are cost-effective, there may be barriers preventing action. These include a lack of reliable information, transaction costs, and behavioural and organisational inertia. ... Regulatory measures can play a powerful role in cutting through these complexities, and providing clarity and certainty. Minimum standards for buildings and appliances have proved a cost-effective way to improve performance, where price signals alone may be too muted to have a significant impact.

The clear message throughout is that the market can't do this alone! From the obvious point that carbon costs are an "externality" (the producer of carbon dioxide does not themselves pay the costs), through to the reality that regulation and taxation are going to be necessary to begin to change minds - I reckon hearts will follow (if they're not already there!)

Another obvious (to me) point is that we need to start thinking and planning about adaptation now! We have some decades of climate change ahead of us, regardless of what we can achieve in changing emissions!

We hear a lot about how there is no point in the UK doing anything because it contributes only 2% of the global emissions. However, it was good to read

... China's goals to reduce energy used for each unit of GDP by 20% from 2006-2010 and to promote the use of renewable energy. India has created an Integrated Energy Policy for the same period that includes measures to expand access to cleaner energy for poor people and to increase energy efficiency.

It would be nice to hear concrete proposals from the U.S. and Australia!

I'll leave the last word to Stern:

Above all, reducing the risks of climate change requires collective action. It requires co-operation between countries, through international frameworks that support the achievement of shared goals. It requires a partnership between the public and private sector, working with civil society and with individuals. It is still possible to avoid the worst impacts of climate change; but it requires strong and urgent collective action. Delay would be costly and dangerous.

by Bryan Lawrence : 2006/10/31 : Categories environment (permalink)

Stern Facts

Like William Connolley I doubt I'll ever read the whole thing, but it's intriguing to wade through the 27 page executive summary at least.

The Bad News

Under a BAU scenario, the stock of greenhouse gases could more than treble by the end of the century, giving at least a 50% risk of exceeding 5°C global average temperature change during the following decades. This would take humans into unknown territory. An illustration of the scale of such an increase is that we are now only around 5°C warmer than in the last ice age.

Well, that 5 degree figure is a bit hard to fathom (although I think he uses that with respect to the period 2100-2200), but the comparison with the scale of warming since the last ice age is rather a good one. Even if the real number might be 2 to 3 degrees C, it puts things in perspective somewhat - even those of us who may claim to be professionals still have to get a grip on the emotional reaction that a few degrees C isn't that much really. Put like that, it obviously is!

The disaster list is pretty awesome:

  • Water supply problems (melting glaciers: initially a flood risk, then a fall in water availability, not to mention changes in water availability associated with changing weather patterns) ...

  • Declining crop yields (particularly in the higher range of predictions)

  • Death rates from malnutrition, heat stress and vector-borne diseases (malaria, dengue fever, etc.) increase ...

  • Rising sea levels ... threatening the homes of 1 in 20 people!

  • Ecosystem meltdown (15-40% of species for only a 2°C increase!) (plus ocean acidification with unquantifiable impact on fish stocks)

Then:

Impacts on this scale could spill over national borders, exacerbating the damage further. Rising sea levels and other climate-driven changes could drive millions of people to migrate ...rise in sea levels, which is a possibility by the end of the century... Climate-related shocks have sparked violent conflict in the past, and conflict is a serious risk in areas such as West Africa, the Nile Basin and Central Asia.

But maybe this will get more attention in the City:

At higher temperatures, developed economies face a growing risk of large-scale shocks - for example, the rising costs of extreme weather events could affect global financial markets through higher and more volatile costs of insurance.

I find all the arguments about changes in GDP difficult to follow, possibly because one never really knows where the baseline is (unless one is an economist), but this seems a pretty straightforward statement:

In summary, analyses that take into account the full ranges of both impacts and possible outcomes - that is, that employ the basic economics of risk - suggest that BAU climate change will reduce welfare by an amount equivalent to a reduction in consumption per head of between 5 and 20%. Taking account of the increasing scientific evidence of greater risks, of aversion to the possibilities of catastrophe, and of a broader approach to the consequences than implied by narrow output measures, the appropriate estimate is likely to be in the upper part of this range.

I've blogged before (Jan 2005a, Jan 2005b, Jul 2005, and Aug 2005) about the future of the oil economy, but maybe I've been on the wrong tack:

The shift to a low-carbon global economy will take place against the background of an abundant supply of fossil fuels. That is to say, the stocks of hydrocarbons that are profitable to extract (under current policies) are more than enough to take the world to levels of greenhouse-gas concentrations well beyond 750ppm CO2e, with very dangerous consequences. Indeed, under BAU, energy users are likely to switch towards more carbon-intensive coal and oil shales, increasing rates of emissions growth.

The economic analysis makes it clear that there is a high price attached to delay. As he says:

Delay in taking action on climate change would make it necessary to accept both more climate change and, eventually, higher mitigation costs. Weak action in the next 10-20 years would put stabilisation even at 550ppm CO2e beyond reach - and this level is already associated with significant risks.

The Good News

He thinks there is a way out:

Yet despite the historical pattern and the BAU projections, the world does not need to choose between averting climate change and promoting growth and development. Changes in energy technologies and the structure of economies have reduced the responsiveness of emissions to income growth, particularly in some of the richest countries. With strong, deliberate policy choices, it is possible to "decarbonise" both developed and developing economies on the scale required for climate stabilisation, while maintaining economic growth in both.

I think the Aussie and the American governments understand the last part of this (opportunities), but they want to somehow avoid the first part ... (costs):

Reversing the historical trend in emissions growth, and achieving cuts of 25% or more against today's levels is a major challenge. Costs will be incurred as the world shifts from a high-carbon to a low-carbon trajectory. But there will also be business opportunities as the markets for low-carbon, high-efficiency goods and services expand.

For those of us in paranoid Europe, worried about Russian control of our gas supplies:

National objectives for energy security can also be pursued alongside climate change objectives. Energy efficiency and diversification of energy sources and supplies support energy security, as do clear long-term policy frameworks for investors in power generation.

In Britain today the headlines are all about the taxation that will result from doing something about this, but Stern also points out that while

the social cost of carbon will also rise steadily over time ... This does not mean that consumers will always face rising prices for the goods and services that they currently enjoy, as innovation driven by strong policy will ultimately reduce the carbon intensity of our economies, and consumers will then see reductions in the prices that they pay as low-carbon technologies mature.

I've got that in the good news section on the grounds that the clear message is that the cost of doing something about this isn't going to rise and rise, but it isn't good news for our next thirty years. I fear for the tourism and export agriculture of countries a long way from anywhere else (e.g. New Zealand!)

At this point I'm on page seventeen, and I'm tired ... more soon!

Update, 31 Oct: See James Annan for a critique of the science part ..

by Bryan Lawrence : 2006/10/30 : Categories climate environment : 1 trackback : 0 comments (permalink)

Subtle Discipline Drift

Recently Oxford University advertised for a Lectureship in Atmospheric Physics, and Imperial College is currently advertising for a raft of positions, from lectureships to professorships. In some ways I was and am tempted (despite having supped at the font of "proper" academia before, and having been burned ... a lectureship, in NZ at least, being far far more stressful than what I do now :-). I do love atmospheric science.

However, while I'm tempted, I'm also realistic. In the past five years, I've drifted (and sometimes been pushed) towards what has recently in the UK been called e-science (that name is now deprecated, who knows what we'll call it next year!). I no longer read the atmospheric science journals, not even the title pages, so I haven't a clue what's in the literature. One has only to look at what I blog about nowadays to realise that whatever I'm doing now, it's not atmospheric science per se - although everything I do is predicated towards making the doing of atmospheric science easier.

I keep telling my staff that one should always be appraising one's career options, assessing what doors are closing and opening as time goes by, and carpe diem etc. So, here's my assessment of one of my options right now. Given I'm so out of touch, I'm not sure I'll even feel comfortable supervising atmospheric science students, which is really scary. I think that's the sound of my atmospheric science door closing, if not permanently, pretty tightly anyway - it'd take some effort and time to prise it open again! But I like working with students, so if you're an academic in a computer science department in my part of the world, and fancy getting them working on some atmospheric science related problems, get in touch, we could talk about co-supervision. Hopefully that's another door opening ...

by Bryan Lawrence : 2006/10/26 : 0 trackbacks : 0 comments (permalink)

More Stupid Patent Litigation

I'm with Tim Bray on this. Why isn't the internet in an uproar? IBM is litigating against Amazon over patent violations. It's all pretty incredible, but the two most silly are:

If Amazon is found guilty of this, then the entire edifice of data distribution in science will be violating these patents too. In fact, pretty much all e-commerce is covered in these patents (and the others they're claiming are violated).

IBM should be ashamed, even if the only good to come of it is overturning that ludicrous one-click patent ...

by Bryan Lawrence : 2006/10/26 : 0 trackbacks : 0 comments (permalink)

Exploring Web Server Backends - installing fastcgi and lighttpd

A few months ago, I was investigating web server options (one, two, three). I finished that series saying I needed to investigate wsgi. Well, that time has come: there are a number of reasons why wsgi and fastcgi (or scgi) may be important to us. However, I'm a little bit wary about Apache and fastcgi, having got the impression that lighttpd may be the way to go for fastcgi. So, this note is a list of my experiences getting a wsgi hello world going on my dapper laptop. (As usual, my interest in doing this for myself is to understand the major issues, not because I'm personally going to be working on this.)

(I'm doing this using my own /usr/local/bin/python2.5 rather than the system default python.)

Got Lighttpd:

sudo apt-get install lighttpd

(This started a process running under www-data, and put scripts in /etc/init.d, so I may well have this automatically starting when I boot, which isn't really what I want on a laptop ... I'll investigate that later).
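(Presumably something like the following would deal with the autostart - standard debian init tooling - though I haven't actually tried it yet:

sudo /etc/init.d/lighttpd stop
sudo update-rc.d -f lighttpd remove

the second command removing the boot-time rc links.)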

Got flup; noting that no 2.5 egg existed, I had to get a tarball and install it with setup.py (NB: remember, using my local python):

wget http://www.saddi.com/software/flup/dist/flup-r2030.tar.gz
tar xzvf flup-r2030.tar.gz
cd flup-r2030
sudo python setup.py install

Following cleverdevil (Jonathan Lacour) I grabbed scgi while I was at it.

sudo easy_install scgi

but for my first steps, I'm planning on getting vanilla fastcgi working. I may play with scgi later. Meanwhile, for fastcgi, I'm basically following cleverdevil again, adjusted for my ubuntu apt-installed lighttpd.

I modified the file 10-fastcgi.conf in /etc/lighttpd/conf-available to be

## FastCGI programs have the same functionality as CGI programs,
## but are considerably faster through lower interpreter startup
## time and socketed communication
##
## Documentation: /usr/share/doc/lighttpd-doc/fastcgi.txt.gz
##                http://www.lighttpd.net/documentation/fastcgi.html

server.modules   += ( "mod_fastcgi" )

## Start a FastCGI server for python test example
fastcgi.debug = 1
fastcgi.server    = ( ".fcgi" =>
                      ( "localhost" =>
                                        (
                          "socket" => "/tmp/fcgi.sock",
                          "min-procs" => 2
                                        )
                                      )
                                )

and put a symlink to this file into my /etc/lighttpd/conf-enabled directory. (Update 27 Oct: Oops, I had a non-working version of 10-fastcgi.conf here until today. The one above is the one I have working ... today).
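For completeness, the enabling step is just that symlink plus a restart; something like this should do it (the paths being those from the apt install above):

sudo ln -s /etc/lighttpd/conf-available/10-fastcgi.conf /etc/lighttpd/conf-enabled/
sudo /etc/init.d/lighttpd force-reload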

I put the test file in my /var/www directory as test.fcgi:

#!/usr/local/bin/python
from flup.server.fcgi import WSGIServer

def myapp(environ, start_response):
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['Hello World!\n']

WSGIServer(myapp, bindAddress = '/tmp/fcgi.sock').run()


And I ran it:

python test.fcgi

and it sits there running.

Now, trying to access it on http://localhost.localdomain/test.fcgi results in a 500 Internal Server Error. A check in the error log showed many instances of this (associated with much head scratching and time wasting):

2006-10-25 21:54:06: (mod_fastcgi.c.2669) fcgi-server re-enabled: unix:/tmp/fcgi.sock
2006-10-26 08:09:30: (mod_fastcgi.c.1739) connect failed: Permission denied on unix:/tmp/fcgi.sock
2006-10-26 08:09:30: (mod_fastcgi.c.2851) backend died, ...

Eventually the penny dropped. The server is running as www-data, which has no access permissions to the unix domain socket (/tmp/fcgi.sock) created by the user (whether me or root) running the python test.fcgi server code ...

So, I changed the permissions on /var/www to allow www-data access, and reran the python command:

sudo su www-data
python test.fcgi

And lo and behold, I get a "Hello World" on http://localhost.localdomain/test.fcgi.

by Bryan Lawrence : 2006/10/26 : Categories badc ndg computing python : 1 trackback : 0 comments (permalink)

Citing data with ISO19139

I thought I might try and work out exactly what tags I might use for my previous citation example, if I was using ISO19139 (i.e. in the metadata of another dataset).

The appropriate piece of ISO19139/19115 is the CI_Citation element, which defines the metadata describing authoritative reference information ... which in my mind should also include other datasets!

Some of it is "straightforward" (I don't plan to admit how long it took to work this out :-) :

<CI_Citation xmlns="http://www.isotc211.org/2005/gmd" xmlns:gco="http://www.isotc211.org/2005/gco" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.isotc211.org/2005/gmd http://www.isotc211.org/2005/gmd/citation.xsd">
<title>
<gco:CharacterString>Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </gco:CharacterString>
</title>
<alternateTitle>
<gco:CharacterString>MST </gco:CharacterString>
</alternateTitle>
<date>
<CI_Date>
<date>
<gco:Date>2006 </gco:Date>
</date>
<dateType>
<CI_DateTypeCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_DateTypeCode" codeListValue="publication"> </CI_DateTypeCode>
</dateType>
</CI_Date>
</date>
<identifier>
<MD_Identifier>
<code>
<gco:CharacterString>badc.nerc.ac.uk/data/mst/v3/upd15032006 </gco:CharacterString>
</code>
</MD_Identifier>
</identifier>
<citedResponsibleParty>
<CI_ResponsibleParty>
<organisationName>
<gco:CharacterString>Natural Environment Research Council </gco:CharacterString>
</organisationName>
<role>
<CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="author"> </CI_RoleCode>
</role>
</CI_ResponsibleParty>
</citedResponsibleParty>
<citedResponsibleParty>
<CI_ResponsibleParty>
<organisationName>
<gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString>
</organisationName>
<role>
<CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="publisher"> </CI_RoleCode>
</role>
</CI_ResponsibleParty>
</citedResponsibleParty>
<citedResponsibleParty>
<CI_ResponsibleParty>
<organisationName>
<gco:CharacterString>British Atmospheric Data Centre </gco:CharacterString>
</organisationName>
<contactInfo>
<CI_Contact>
<onlineResource>
<CI_OnlineResource>
<linkage>
<URL>http://badc.nerc.ac.uk/data/mst/v3/</URL>
</linkage>
<function>
<CI_OnLineFunctionCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_OnlineFunctionCode" codeListValue="download"></CI_OnLineFunctionCode>
</function>
</CI_OnlineResource>
</onlineResource>
</CI_Contact>
</contactInfo>
<role>
<CI_RoleCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_RoleCode" codeListValue="custodian"></CI_RoleCode>
</role>
</CI_ResponsibleParty>
</citedResponsibleParty>
<presentationForm>
<CI_PresentationFormCode codeList="http://www.isotc211.org/2005/resources/CodeList/gmxCodelists.xml#CI_PresentationFormCode" codeListValue="profileDigital"> </CI_PresentationFormCode>
</presentationForm>
</CI_Citation>

OK, it's pretty nasty in terms of verbiage, but as (some) folk keep saying, this is for computers not humans - never mind that a human has to write some code to handle it - but it's not as bad as I feared!
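As an aside on the "computers not humans" point: getting the basics back out again is at least straightforward. Here's a minimal sketch using the elementtree that now ships with python 2.5 (the filename is made up; it's just the CI_Citation instance above saved to disk):

import xml.etree.ElementTree as ET

GMD = '{http://www.isotc211.org/2005/gmd}'
GCO = '{http://www.isotc211.org/2005/gco}'

# parse the CI_Citation instance above, saved as ci_citation.xml
cit = ET.parse('ci_citation.xml').getroot()
title = cit.findtext(GMD + 'title/' + GCO + 'CharacterString').strip()
date = cit.findtext(GMD + 'date/' + GMD + 'CI_Date/' +
                    GMD + 'date/' + GCO + 'Date').strip()
ident = cit.findtext(GMD + 'identifier/' + GMD + 'MD_Identifier/' +
                     GMD + 'code/' + GCO + 'CharacterString').strip()
print title, date, ident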

In getting that far, we see that I've nearly managed to get the same information content, but there are some pretty important omissions (I think; caveat emptor, I'd be glad to be wrong about this):

  1. I don't see any way to indicate that the dataset is being updated (the "ongoing" attribute in my previous example; ideally this would require a spot for an MD_MaintenanceFrequencyCode in the citation).

  2. I don't see any way of indicating a particular part of a dataset (that is, having separate identifiers both for the dataset and for particular features within it).

  3. Despite support for feature-type descriptions within an ISO19139 document proper (in the MD_FeatureTypeDescription tag), one can't identify which features are in a cited dataset. We're reduced to using CI_PresentationFormCode, which strikes me as a completely ugly compromise between feature descriptions and a text element. The one I've chosen here (profile) is partly right, but doesn't get across that this dataset consists of timeseries of vertical profiles!

  4. One can't, as far as I can see, identify when the dataset was accessed (or the date of the citation's validity), and I think this is rather crucial for citation of online material.

I guess those are the minimum extensions we'd need to support citeable datasets! (By the way, I've ignored the option of using otherCitationDetails as one is only allowed one of those in the citation!)

Update: Note that BADC appears as both a publisher and a custodian. Actually, following my discussion of the distinction, I think at the moment one would want to remove the publisher role ... and leave only the custodian role (in the ISO19139, the text citation form can't distinguish between these roles).

by Bryan Lawrence : 2006/10/25 : Categories iso19115 metadata ndg claddier : 1 comment (permalink)

Weird unicodeness

For some reason my blog has suddenly developed some sort of unicode problem, which is making a large number of pages core dump. I don't know why. I'm investigating ... meanwhile, I'm trapping the error, but you may assume strange things will happen today!

Update (10:15am): At the moment, I'm trapping the page content in the wiki formatter, forcing it to utf-8, doing my wiki formatting (in wikiBNL), and then forcing it back to ascii before returning the content to the leonardo page provider. I need to do the last step because there is some error higher up which is breaking with a strict ascii code conversion. What is utterly weird is that this was fine yesterday! Since then I have changed my embedhandler (to support namespaces in xml pretty printing, something I'll blog about sometime), but I fail to understand how that would lead to this problem ...
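For the record, the trap amounts to something like this (a sketch only, with wiki_format standing in for the real wikiBNL call):

def safe_format(content):
    # force whatever arrived into unicode, replacing anything undecodable
    if not isinstance(content, unicode):
        content = content.decode('utf-8', 'replace')
    html = wiki_format(content)          # the wikiBNL formatting step
    # and back to plain ascii, since something further up chokes on non-ascii
    return html.encode('ascii', 'replace')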

Update (10:35am): Well, it's not the new embedhandler. I commented out the new xml stuff, and the problem still exists ... (I never thought it was, but I think it's the only thing I've touched ...)

Update (11am): I don't understand this, and haven't time to fix it right now. This means that some pages with non-ascii may have some spurious ? characters until I can fix this ...

by Bryan Lawrence : 2006/10/25 : Categories python : 0 trackbacks : 0 comments (permalink)

Persistence

Just after I wrote my last post on data citation, I found Joseph Reagle's blog entry on bibliography and citation. He's making a number of points, one of which was about transience. In the comments to his post, and in Joseph's comment on my post, two solutions to deal with internet transience are mentioned: the wayback machine and webcite.

I've looked at the wayback machine in the past, but there is no way that it represents any realistic full sample of the internet (for example, as of today, it has exactly one impression of home.badc.rl.ac.uk/lawrence - from 2004!) ... but how could it? It's an unrealistic task. What I do see it as is a (potentially very useful) set of time capsules ... that is samples!

By contrast, webcite allows the creator of content to submit URLs for archival, thus ensuring that when one writes an academic document, the material will be archived and the citation will be persistent. This is a downright excellent idea, provided you believe in the persistence of the webcitation consortium (and I have no reason not to). The subtext, however, is that the cited object is a document; it won't help us with data - and not just because data may be large: the other issue is that the webcitation folk would have to take on support for data access tools, and I think the same argument applies to them as applies to libraries in this regard!

This brings me back to my point about data citation: we had better only allow it when we believe in the persistence of the organisation making the data available, and that will consist of rather more than just having the bits and bytes available for an http GET!

by Bryan Lawrence : 2006/10/23 : Categories curation claddier (permalink)

Citation, Hosting and Publication

Returning to my series on citation (parts one, two, and three).

My last example was an MST data set held at the BADC, and I was suggesting something like this (for a citation):

Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn badc.nerc.ac.uk/data/mst/v3/upd15032006, feature 200409031205 [http://featuretype.registry/verticalProfile] [downloaded Sep 21 2006, available from http://badc.nerc.ac.uk/data/mst/v3/]

which I could also write like this to give some hint of the semantics:

<citation>
<Author> Natural Environment Research Council </Author>
<Title> Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth </Title>
<Medium> Internet </Medium>
<Publisher> British Atmospheric Data Centre (BADC) </Publisher>
<PublicationDate status="ongoing"> 1990 </PublicationDate>
<Identifier> badc.nerc.ac.uk/data/mst/v3/upd15032006 </Identifier>
<Feature>
<FeatureType>http://featuretype.registry/verticalProfile </FeatureType>
<LocalID>200409031205 </LocalID>
</Feature>
<AccessDate> Sep 21 2006 </AccessDate>
<AvailableAt>
<url>http://badc.nerc.ac.uk/data/mst/v3/</url>
</AvailableAt>
</citation>

The tags are made up, but hopefully identify the important semantic content of the citation. As I said last time, there is some redundant information there, but maybe not (there is no guarantee that the Identifier and the AvailableAt carry the same semantic content).

Inherent in that example, and my meaning, was a concept of publication, and I introduced that distinction by comparing the MST and our ASHOE dataset (which is really "published" elsewhere). In the library world, there is a concept of "Version of Record", which isn't exactly analogous, but I would argue BADC holds the dataset equivalent of the version of record for the MST, and NASA AMES the equivalent for the ASHOE dataset.

Generally, in scholarly publication, in the past one distinguished between the refereed literature, the published literature and the grey literature1, where the latter might not have been allowed as a valid citation. The situation has become more complicated with the urge to cite digital material, but one of the reasons for the old rules was about attempting to ensure permanence and access - something that is obviously becoming a problem again. Thus, we should explore the concepts of publication and version of record a bit further, before we create new problems. Cathy Jones, working on the CLADDIER project, has made the point in email that a publisher does something to the original that adds value, and I think in the case of digital data, that something should include at least:

  • provision of catalogue metadata

  • some commitment to maintenance of the resource at the AvailableAt url

  • some commitment to the resource being conformant to the description of the Feature

  • some commitment to the maintenance of the mapping between the identifier and the resource.

And so, in a reputable article (whatever that means), or in the metadata of a published dataset, I wouldn't allow the citation of a dataset that didn't meet at least those criteria, but once we have met those criteria, then that first version should be the version of record, and copies held elsewhere should most definitely distinguish between the publisher and the availability URI.

Arguably the 2nd and 4th of these criteria could be collapsed down to the use of a DOI. While that's true, I think the use of both helps the citation user (just as I think it best to do a journal citation with all of the volume, page number and DOI). However, if the publisher does choose to use a DOI, it would help if the holders of other copies did not! Whether or not it's true, the use of a DOI does imply some higher level of ownership than simply making a copy available.

Implicit in my discussion of the metadata of a published dataset, is the idea that just as in the document world, we could introduce the concept of some sort of kite-mark or refereeing of datasets. A refereed dataset would be

  • available at a persistent location

  • accompanied by more comprehensive metadata (which might include calibration information, algorithm descriptions, the algorithm codes themselves etc)

  • quality controlled, with adequate error and/or uncertainty information

and it would have been

  • assessed as to its adherence to such standards.

There might or might not be a special graphical interface to the data, and other well-known interfaces (e.g. WCS) ought probably to be provided.

Datasets published after going through such a procedure would essentially have come from a "Data Journal", and so in my example above, the <Publisher> would become the name of the organisation responsible for the procedure, and the <Title> might well become the title of the "Data Journal".

1: Grey Literature: i.e. documents, bound or otherwise, produced by individuals and/or institutions, but which were not commercially available, and therefore, by implication, not very accessible. (ret).

by Bryan Lawrence : 2006/10/20 : Categories claddier metadata ndg curation : 3 trackbacks : 7 comments (permalink)

The Economist Goes Green

I'm a bit slow to find out about this (which is down to not enough time reading my feeds): Anyway, The Economist is arguing for action on greenhouse gas emissions: Editorial (7 Sep).

Is this a tipping point of sorts?

(Thanks to Andrew Dessler).

by Bryan Lawrence : 2006/10/20 : Categories climate environment (permalink)

On substitution groups and ISO19139

I have bleated already about the difficulties of using ISO19139 with restrictions which introduce new tag names.

Now the official way to do this is probably to exploit substitution groups in the new xml schema associated with your restriction. So, if one wanted to restrict, for example, gmd:MD_Metadata, one might start in the new schema with something like

<element substitutionGroup="gmd:MD_Metadata" type="ourMD_Metadata" name="restricted_MD_Metadata"> stuff </element>

(See w3schools, the xml schema primer, or Walmsley, 2001, to explain the syntax). Then in the instance document, one would have

<restricted_MD_Metadata> stuff </restricted_MD_Metadata>

At this point one could use the xml schema validation machinery to ensure one had a nice valid instance of the new restricted schema.

My beef is how we use this. The gurus will tell you there is no problem, and maybe there isn't if one wants to invest an enormous amount of time in complex handlers (even so, maybe it's not that straightforward for data binding, and perhaps the tools aren't really that mature - or weren't in 2004).

So, if I'm writing code to handle ISO19139 documents, I'm going to be writing xslt, or using xpath or xquery to get at particular content, or I'm going to have to invest in brute force if I want to handle things in a high level language like python (as far as I know there are no pythonic tools that get close to this sort of requirement internally).

Let's just explore the brute force method, and a simple use case: I have harvested ISO19139 profiles (I'm starting to think "variants" - complete with quotes - would be a better term :-) from a number of places, and want to deliver the titles to a web page ... so I need to find the titles. I can't assume I can use a simple xpath expression (which is supported in python) to find all the titles. I have to parse all the relevant schemas, and do something complex to find the new title elements. In practice, I have to support each profile as a completely different schema; they might as well not share the ISO19139 heritage - even though there are advantages in the ISO19115 content heritage. Yuck.
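To make the simple case concrete: if everyone stuck to vanilla ISO19139, pulling the titles out of a pile of harvested records would be a few lines of xpath. This sketch assumes lxml and a directory of harvested files; the point is that the moment a profile renames the title element via a substitution group, the expression silently returns nothing:

import glob
from lxml import etree

NS = {'gmd': 'http://www.isotc211.org/2005/gmd',
      'gco': 'http://www.isotc211.org/2005/gco'}

for fname in glob.glob('harvested/*.xml'):
    doc = etree.parse(fname)
    # fine for vanilla ISO19139; a renamed (restricted) title element won't match
    titles = doc.xpath('//gmd:CI_Citation/gmd:title/gco:CharacterString/text()',
                       namespaces=NS)
    print fname, [t.strip() for t in titles]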

OK, now suppose I hand this off to an xquery engine. How easy is that? Let's assume it's not buggy ... This is essentially the use case described as 1.9.4.7 Q7 in the June 2006 use case document. I'm not that familiar with xquery, but it appears that

schema-element(gmd:MD_Metadata)

should then match any element which is linked to it via a substitution group declaration like that above. If it really is that simple, then this is much easier than the brute force method, and a good reason for passing my problems to a real xquery engine.

However, this may well work fine for handling document reshaping type tasks, but returning to the use case, I could well have tens of thousands of harvested documents, if not millions, and so I may well be considering indexing. I don't know, but it would appear that one has to rewrite all substitution group elements when producing an index - does our eXist native xml database technology do this automatically for me? I don't know, and that's the point.

All this marvellous xml technology is bloody complicated, and all to handle the case that a community wants to restrict the usage of some tags or lists! Why make all this grief? Wouldn't it be far easier to give community guidance, but accept perfectly valid ISO19139 documents which fall outside that guidance? We could all simply follow the simple rules in David Orchard's article, and especially the one I've highlighted before:

Document consumers must ignore any XML attributes or elements in a valid XML document that they do not recognise.

We could rewrite that as:

Communities should give guidance on those ISO19139 attributes or elements that need populating for usage within the community (and which might need to be handled by community tools).

Job done. No complicated machinery. More tools available, easier indexing, and much easier human parsing of everything ... (from the schemas, to the instances, and all the code that handles them).

by Bryan Lawrence : 2006/10/19 : Categories xml iso19115 metadata ndg : 1 trackback : 0 comments (permalink)

Two completely unconnected things

I've just had a burst of rather intensive work over the last couple of weeks (hence the silence), and this lunch time I've rather ground to a halt. So, by way of light entertainment, I clicked on my akregator and started reading from the enormous number of unread things from the various feeds I think/thought I need/want to follow ...

Herewith are two links which, apart from how I found them, are completely unconnected. I'm drawing them to your attention because, for very different reasons, I valued reading them.

1) The first, is from John Fleck's blog, and is actually a higher profile restatement of a comment on an earlier entry by Nick Brooks discussing cultural responses to climate change. Go read the entire thing, but this is a taster:

... even if we accept that environmental crises led to the emergence of civilisation, we are still looking at collapse - the collapse of the societies that preceded these new cultures. In the Sahara we know that lifestyles based on mobile cattle herding collapsed, as did hunting and gathering. It's the survivors who adapt, after the event.

It's that last sentence that got me thinking ...

2) And now from the April the first collection at the bmj, where quite clearly there is some sort of calendar problem, we have a series of fabulous articles and responses, but John Fleck (again) pointed out this one on the half-life of teaspoons ...

I wonder if they would get the same results with plastic teaspoons!

One serious. One frivolous. Viva Blogging! How else would one find this sort of stuff?

by Bryan Lawrence : 2006/10/12 : Categories climate : 0 trackbacks : 2 comments (permalink)

Programmers should be miserable

Steve Yegge (via Tim Bray):

... I don't like Java, but I do use it. Liking and using are mostly orthogonal dimensions, and if you like the language you're using even a little bit, you're lucky. That, or you just haven't gotten broad enough exposure to know how miserable you ought to be.

by Bryan Lawrence : 2006/09/27 : 0 trackbacks : 0 comments (permalink)

More on citation - part three, delving

Following on ...

In my last example, I considered a (made up) uri which was supposed to be of meaning to the BADC that would allow cleaner citing of our MST radar dataset. The URI looked like this: badc.nerc.ac.uk/data/mst/v3/upd15032006

There are two obvious components to this uri, the first is a domain address, badc.nerc.ac.uk, the second is the string /data/mst/v3/upd15032006. This latter uri was constructed with a number of local implicit subcomponents. I've included

  • "data" as a local identifier to identify the schema for my uri construction.

  • "mst" identifies a dataset

  • "v3" is a (made-up) version

  • "upd15032006" is a (made-up) last modification date. (for some datasets updating through the date, we probably ought to have a modification time in the URI)

Now the format of this uri is rather unimportant. What I'm asserting here is that all these components are really useful for citing data. Rather like the concepts of volume and page number and issue date for a journal. Now I don't really care too much how different sites do their uri schemes, but I do care a lot that they contain these concepts.
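Just to labour the point that these are components rather than an opaque string, here's a toy parser for the made-up scheme above (the field names are mine):

def parse_data_uri(uri):
    # e.g. 'badc.nerc.ac.uk/data/mst/v3/upd15032006'
    authority, scheme, dataset, version, updated = uri.split('/')
    return {'authority': authority,   # who minted the uri
            'scheme': scheme,         # the local construction scheme ("data")
            'dataset': dataset,       # which dataset ("mst")
            'version': version,       # which version ("v3")
            'updated': updated[3:]}   # last modification date ("15032006")

print parse_data_uri('badc.nerc.ac.uk/data/mst/v3/upd15032006')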

This uri is an example of what I think of as a "base source". That is, I expect I might well consider citing something within this "database". We might also worry about what I mean by "downloaded" in that citation ...

I've suggested elsewhere that a priori one needs a concept of a feature-type (ISO19109) before there is much point citing within a database. In this case what I mean is there is no point citing a "thingie" within a database, unless there is a (generally or otherwise) understood concept of what I mean when I say "thingie". It could be a gene-sequence, it could be a radar-profile, it could be a chemical equation ... but if it's any of those things, somewhere there is a definition of what that means, and a description of the format of the result I'll get when I go get one of those things. That somewhere is (or should be) a feature-type registry (ISO19110) in our world, but it doesn't matter if it's not, the point is that the concepts and some realisation of them must exist for citation within a database to have meaning.

This whole discussion is necessary because we have no a priori concept like page or chapter ... with documents we know what we are going to get back ... a document ... but if we cite something that is data, we don't. This is a problem. For example, take opendap, which is rather popular. If I give you an opendap url, e.g.

http://www.cdc.noaa.gov/cgi-bin/nph-nc/Datasets/reynolds_sst/sst.mnmean.nc

As the opendap pages say:

The simplest thing you can do with this URL is to download the data it points to. You could feed it to a DODS-enabled data analysis package like Ferret, or you could append .asc, and feed the URL to a regular web browser like Netscape. This will work, but you don't really want to do it because in binary form, there are about 28 megabytes of data at that URL.

So clearly one doesn't want to go pointing at arbitrary urls without the right software and a bit of knowledge about what might exist at the url ... so in practice it's much better to cite a document which tells you about the data that was cited ... but now that document itself had better be as permanent as the data.

That being so, even better might be to have some way in the citation of doing this in a meaningful manner. If we push our MST example a bit further, what do we get?

Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn badc.nerc.ac.uk/data/mst/v3/upd15032006, feature 200409031205 [http://featuretype.registry/verticalProfile] [downloaded Sep 21 2006, available from http://badc.nerc.ac.uk/data/mst/v3/]

where 200409031205 is yet another made up identifier, which points to an object of type "verticalProfile" in registry http://featuretype.registry ... while I know the identifier aficionados don't like semantics in the identifier, me, I think there should be some, to help with the meaning of citations, and so in this case it might be the timing of the radar profile.

It looks like we have some redundant information, because I could give the exact url of the data object (which could incorporate the uri), but remember there is no guarantee that the BADC uri and the download url are the same! Also, we may need the feature type uri to be different from either the base-source uri or the download url. (I haven't thought this bit through yet, but it seems conceptually possible.)

by Bryan Lawrence : 2006/09/26 : Categories curation ndg metadata claddier : 1 trackback : 0 comments (permalink)

Hiding the Heat - Solar Dimming.

First we had global dimming. Now we have Solar Dimming. I've blogged before about the relationship between sunspot numbers and global mean temperature.

New Scientist has an article entitled Saved by the Sun 1, where they are reporting that

Some astronomers are predicting that the sun is about to enter another quiet period.

which would lead to a bit of cooling. There are a large number of quotes, but no actual references (of course), so I'm curious as to the state of these predictions. Some of the quotes came from people I know and respect (especially Jo Haigh, but she isn't in the solar prediction game). A quick google yielded more "news stories" (e.g. this russian one, click only if you like flashing adverts). (It also yielded this crap ... don't waste your time clicking on this one unless you want to throw up).

Eventually, I came to this in Space Weather News, which has a real reference. I haven't read that, but the story suggests the next sunspot cycle will be comparable with 1906. Well, that doesn't sound like it will buy us much time to think (which is the basic thread of the New Scientist article). What about predictions beyond the next cycle? Well, I couldn't find any of those ...

So, so much for being able to put off hard choices!

1: New Scientist, 16 Sep 2006, page 32 (ret).

by Bryan Lawrence : 2006/09/26 : Categories climate environment : 0 trackbacks : 0 comments (permalink)

Academic Citation of Blogs

I know I sort of predicted this, but New Scientist1 claims that:

Formal scientific papers are now even beginning to cite blogs as references

I would like to see some examples of this!

(As an aside, it is good to see that Peter Murray Rust is blogging, and even more so on the importance of blogging in scientific communication and slagging off PDF too!)

1: Amanda Gefter in New Scientist, 16 Sep 2006, page 48 (ret).

by Bryan Lawrence : 2006/09/25 : Categories curation : 0 trackbacks : 0 comments (permalink)

More on citation - part two, MST

Yesterday I started talking about how I think one should cite data we hold on behalf of someone else. That discussion isn't yet finished: issues of citing within datasets, and other "standards" for citation, still need to be discussed (including how we parse and store them electronically), and I still haven't gotten to addressing any of my points from the original blog entry. Before we go there, it's helpful to discuss our mst radar data set.

The mst radar dataset consists of data from 1990 to the present. Over time the data format and methodology of data collection (i.e. how the raw radar returns are converted to, for example, wind) have changed. Nonetheless, one can imagine someone wanting to cite a timeseries extending back to the beginning, or a particular day's data, or perhaps the whole thing.

Do we publish the data? We will hold the data in perpetuity, and make it available for scientific use, so in some senses yes 1 (In the same sense as we publish the ashoe data). I'm going to get back to this issue of what I think publication should be for data.

Meanwhile, how should one cite this? Still using the U.S. National Library of Medicine recommendations (pdf), we should probably consider this as an online database.

Things to think about:

  • Our first issue is: "Who is the author?". I think this is a case where this is a NERC facility, so it is NERC.

  • What is the title? Well, at the BADC we call it "The NERC Mesosphere-Stratosphere-Troposphere Radar Facility at Aberystwyth", but actually I think we should consider renaming the dataset to something which isn't about a facility, or a funder. Better would be "The Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth".

  • What is the urn, what is the update date, what is the actual url of the data? Actually, despite being a data centre, the answer to none of these three questions is especially obvious. It should be ... but that's part of what claddier is about, to expose these sorts of issues.

The whole thing would then be, for example:

Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn http://badc.nerc.ac.uk/data/mst/, [updated Mar 15 2006, downloaded Sep 21 2006, Available from http://badc.nerc.ac.uk/getdata/data_browser/badc/mst/data/mst-products-v2/cartesian]

Well, that's obviously horrible. Let's just for a moment imagine our retrieval system was a bit more citation friendly.

  • We might, for example, have a clean urn which helps identify the version and length of the data. Ideally it might look something like: badc.nerc.ac.uk/data/mst/v3/upd15032006. In which case we don't need the updated phrase.

  • We might have a cleaner url to the location of the data, which hides the method of download

In which case the following would be legitimate:

Natural Environment Research Council, Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth, [Internet], British Atmospheric Data Centre (BADC), 1990-, urn badc.nerc.ac.uk/data/mst/v3/upd15032006, [downloaded Sep 21 2006, available from http://badc.nerc.ac.uk/data/mst/v3/]
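(As an aside, once the components are clean, the text form is trivially machine-generatable; a toy sketch, with made-up argument names:)

def cite(author, title, publisher, startdate, urn, downloaded, url):
    # produce the text form of the citation used above
    return ('%s, %s, [Internet], %s, %s-, urn %s, [downloaded %s, available from %s]'
            % (author, title, publisher, startdate, urn, downloaded, url))

print cite('Natural Environment Research Council',
           'Mesosphere-Stratosphere-Troposphere Radar at Aberystwyth',
           'British Atmospheric Data Centre (BADC)', 1990,
           'badc.nerc.ac.uk/data/mst/v3/upd15032006',
           'Sep 21 2006', 'http://badc.nerc.ac.uk/data/mst/v3/')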

(In all this, note that today and yesterday I have completely removed any reference to the physical location of the database (e.g. London, or in our case, Chilton). The physical location is so irrelevant. It made sense to include it when one could physically go to an archive and get a copy of a document, but you can't do that with our data. Coming here will get you nothing. Further, we could move the badc from Chilton to Oxford tomorrow, and it would make NO difference to the accuracy of the citation. Why include it then? I think the location convention has to die when applied to electronic retrievals.)

Well, that's enough for now. Next we'll consider citing into that archive, and some of the other issues ...

1: Wikipedia thinks so: "Publishing is the activity of putting information into the public arena. ..." (ret).

by Bryan Lawrence : 2006/09/22 : Categories claddier curation metadata : 1 trackback : 2 comments (permalink)

More on citation - part one, ashoe

We've just been having an interesting conversation about the citation of datasets in the context of claddier. I think I've got the use of we and I correct in what follows to indicate what I think as opposed to what we discussed and agreed ...

I've wittered on about this before, where I considered six issues. Today we discussed two things: how we might cite some specific BADC datasets (and we came up with some examples), which led on to how we distinguish between datasets we hold on someone else's behalf, rather like a library, and those which we publish ourselves.

So, we were considering two of our datasets: the MST radar data, and the ASHOE mission data (which I was a participant in, hence my choosing it as an example).

The latter dataset is essentially an online copy of a CD which we obtained from NASA, and so while we host it online for the UK academic community, we are certainly not the publishers. The former is a dataset which we hold as the primary dataset for NERC, who require us to make it available. In neither case have we done anything ourselves to allow me to feel comfortable with the grandiose phrase that "we publish it" (indeed, in the former case I feel quite uncomfortable with that concept).

So how do we expect folk to cite these datasets? Here, I'm going to discuss the ashoe dataset alone. The ashoe CD is essentially a compilation of data from a bunch of principal investigators. The compilation was produced by Gaines and Hipskind, so I think this should be dealt with by treating it like an anthology. So that's the authors (or creators) dealt with.

Following the U.S. National Library of Medicine Recommended Formats for Bibliographic Citation1 (pdf), we have:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], Nasa Ames Research Centre, Earth Science Project Archives, c1994 [cited September 21, 2006] Available from http://badc.nerc.ac.uk/data/ashoe/

Now the real data is actually at the Earth Science Project Office Archive but note that this is an updated version of the data (I think, the file dates seem to indicate this) from that on the cdrom, which is described online. (I think this means we should add the new data to our archive as well ... but that's an issue for another day).

Anyway, this shows several issues:

  • The date of the data is actually unknown (at least in the public information I found, without rooting around too hard). One can't rely on the filesystem time ...

  • Why should folk indicate that they found it at our site, when strictly the ESPO themselves "published" it online? I would argue there are three possible reasons why this is a good idea, which might or might not apply:

    • One knows the original version is no longer online (as appears to be the case here).

    • One knows that the original online version is somewhat more ephemeral and the place where you are using the citation cares about this issue (e.g. the American Geophysical Union at one time only allowed citation of datasets in registered repositories2).

    • With data, I think there is always some risk that the version of data you downloaded from one place will be different from that downloaded from somewhere else (it has certainly happened that we have hosted corrupted versions of data that were non-trivial to identify 3). Now this might argue for always going back to the original source before conducting the analysis which results in the citation, and if possible that's the best idea, but again, with data, that may not be possible for reasons of volume, bandwidth, or server capacity or whatever. I think it best then to indicate which version you used by indicating where it came from, and the download date.

Having said all that, there is a way forward. Going back to the actual citation, I don't much like it, nor do I like much else I've seen. So here's what I think we should do: I think the "cited" should be "obtained on" for data, and it should follow the available at, so better would be:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], Nasa Ames Research Centre, Earth Science Project Archives, c1994 [Available from http://badc.nerc.ac.uk/data/ashoe/, obtained on September 21, 2006]

In our discussions today we pushed on to thinking about the ESPO version as the data equivalent of a NASA Tech Note document, and in our discussion decided it would be best if an identifier which had meaning to the publisher (ESPO) was also included (to deal with, for example, versioning). Now arguably that might be the original http URI, but it could be anything with meaning to the publisher. In this case we only have the http URL, so we could have:

Gaines and Hipskind, The Airborne Southern Hemisphere Ozone Experiment; and Measurements for Assessing the Effects of Stratospheric Aircraft (ASHOE/MAESA) CDROM [Internet], NASA Ames Research Centre, Earth Science Project Archives, c1994, urn http://cloud1.arc.nasa.gov/ashoe_maesa/project/cdrom.html [Available from http://badc.nerc.ac.uk/data/ashoe/, obtained on September 21, 2006]

Note that I'm implying that I want to use a urn which has meaning to the original publisher if I can, which means ideally data publishers should publish the base source citation urn for their datasets to help get this right. If the base urn is in fact a uri or url, then the original site attribution has magically appeared if that's what the publisher wants.
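To make the proposed structure a bit more concrete, here's a trivial sketch which assembles a citation in the form above; the field names are mine (not from any standard), and the values are just the ASHOE example again.

 # A toy sketch of the citation structure argued for above.
 citation = {
     "creators": "Gaines and Hipskind",
     "title": ("The Airborne Southern Hemisphere Ozone Experiment; and "
               "Measurements for Assessing the Effects of Stratospheric "
               "Aircraft (ASHOE/MAESA) CDROM [Internet]"),
     "publisher": "NASA Ames Research Centre, Earth Science Project Archives",
     "date": "c1994",
     "urn": "http://cloud1.arc.nasa.gov/ashoe_maesa/project/cdrom.html",
     "available_from": "http://badc.nerc.ac.uk/data/ashoe/",
     "obtained_on": "September 21, 2006",
 }

 def format_citation(c):
     """Render the fields in the order proposed in the text."""
     return ("%(creators)s, %(title)s, %(publisher)s, %(date)s, "
             "urn %(urn)s [Available from %(available_from)s, "
             "obtained on %(obtained_on)s]" % c)

 print(format_citation(citation))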

What I mean by base source, and all the other hanging threads will have to wait for another time.

Update (26/10/07): The NLM's Recommended Formats for Bibliographic Citation has been updated, and the new revision is available here. I have yet to check whether the revision has any implications for our citation format!

1: Thanks to Chris in comments to my original blog entry. (ret).
2: An early version of their data policy appears here, Update, 22nd Sep, the current version is here ... NB: BADC has been approved by AGU. (ret).
3: And regrettably you can't always tell this from our metadata, we really have to do better! (ret).

by Bryan Lawrence : 2006/09/21 : Categories curation claddier metadata : 1 trackback : 1 comment (permalink)

Diesel v Petrol

Over the summer we replaced our petrol-engined Renault Megane hatchback with a diesel Megane estate (the dCi-100 version). We've gone from 40 miles per gallon (sorry about the imperial units) to between 55 and 60 miles per gallon. (Our first full tank did 59.5 mpg.)

One of the things I like about the new1 car is the computer mpg readout ... we both tend to drive with that showing, and both of us have started driving slower and more carefully so that the mpg figure will go up. What a simple innovation ... knowing when you're being a gas hog makes one do it less!

I've noted that driving at 55-60 mph is the optimum, and 70-ish mph results in mpg figures in the high forties ...
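For what it's worth, the fuel saving implied by the old and new mpg figures above is easy to put a number on (a rough sketch; everything here is just arithmetic on those numbers):

 # Fuel used per 100 miles at the old and new economy figures (imperial gallons).
 old_mpg, new_mpg = 40.0, 57.5       # petrol hatchback vs diesel estate (midpoint of 55-60)
 old_fuel = 100.0 / old_mpg          # 2.5 gallons per 100 miles
 new_fuel = 100.0 / new_mpg          # about 1.74 gallons per 100 miles
 saving = 1.0 - new_fuel / old_fuel  # roughly a 30% reduction in fuel volume
 print("fuel saving: %.0f%%" % (saving * 100))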

All of this has me wondering why, as an easy and effective way of dropping greenhouse gas emissions, we in Britain don't:

  • Make the motorway speed limit 60 mph (yes, I'll hate it too from a driving point of view, but from a conscience point of view I'll feel better), and

  • Make diesel cheaper than petrol, to encourage more folk to move over ...

As far as I understand it, here in the UK the price difference is down to taxation. The only explanation I've heard is that the higher diesel price is a hangover from the days when diesel engines delivered bad particulate pollution. It would seem to me, given modern engine technologies, there is a prima facie case for putting the taxation boot on the other foot, making petrol more expensive per litre (yes, we really do measure our petrol prices per litre and our performance in miles per gallon).

Are there any reasons why diesel production is more expensive economically or in greenhouse gas terms to make my argument wrong?

1: Strictly, it's a newer car, since we didn't buy it new (ret).

by Bryan Lawrence : 2006/09/13 : Categories environment : 1 trackback : 0 comments (permalink)

Final Version of the CF Paper

After much delay1, most of it occasioned by my workload, we've produced a final version of the CF paper: Maintaining and Advancing the CF Standard for Earth System Science Community Data, Lawrence, B.N., R. Drach, B.E. Eaton, J. M. Gregory, S. C. Hankin, R.K. Lowry, R.K. Rew, and K. E. Taylor.

A pdf version can be found here.

The abstract reads:

The Climate and Forecast (CF) conventions governing metadata appearing in netCDF files are becoming ever more important to earth system science communities. This paper outlines proposals for the future of CF, based on discussions at an international meeting held at the British Atmospheric Data Centre in 2005. The proposal presented here is aimed at maintaining the scientific integrity of the CF conventions, while transitioning to a community governance structure (from the current situation where CF is maintained informally by the original authors).

1: The first version was in November last year (ret).

by Bryan Lawrence : 2006/09/12 : Categories cf (permalink)

Normal Service Will Be Resumed

Yes, I'm here, had a wonderful two weeks in Wales (well, wonderful except for the weather), and arrived back to work to a major panic trying to get a proposal ready. Expect that to be over next week, and me to restart blogging. I have much to talk about (including my new toucan T60P from emperor linux).

by Bryan Lawrence : 2006/09/09 (permalink)

to extend or not to extend ...

A few months ago I wrote a few words about practicalities with ISO19139. The key reason for using ISO19139 is that it is a standard for metadata interoperability. A key a priori assumption is that one has a community who have agreed to interoperate ... because in practice interoperating will require a profile of ISO19139 which defines for a community what interoperability means to them.

Well, that's clear then. Communities will build profiles. This is a good thing, because it means ISO19139 will have direct value to those communities, providing an infrastructure where their concepts can be specialised by limiting broad concepts in the parent schema, or by extending concepts in the parent schema to add new attributes etc., or both. Two communities doing this for real that I care about are the IOC and the WMO.

The problem with profiles, though, is that they probably won't be easily consumed by other communities (i.e. will the WMO be able to consume the IOC instances?). There may be things one can do when building a profile that will help consumability (I'm trying desperately hard to avoid food puns ...), and I think it's going to be really important that the designers of these profiles think about this, because nearly all the interesting problems we have to solve in the world are on the boundaries between communities.

These boundaries are particularly important for catalogue builders, because catalogues are for finding things that you don't know a lot about already; for most of us, by definition, those are the things furthest from our core experience (and so the least likely to have been built into our core profiles).

So, building for interoperability outside our communities should be just as important as building for interoperability within our communities (modulo the reality that the core communities always pay for these things).

Well, philosophy aside, what's on the table?

Last time I made the point that if you extend a schema, for interoperability it will help if you can export your records in vanilla ISO19139 as well as your custom-ISO19139.

John Hockaday was a bit more explicit in an email to the metadata mailing list. If the header to an ISO19139 profile instance includes the link to the parent profile schema (as it must), then:

If the profiles are extensions, i.e. add extra elements to the ISO19115 metadata standard, then they will have to be translated using XSL to the ISO19139 format for other people around the world to use. The translation will remove the extra elements because ISO19139 will not recognise them as valid elements.

If profiles are restrictive, i.e., don't allow some ISO19115 metadata records, then they should also validate against the ISO19139 XSDs without translation using an XSL except that the header will have to be changed to point to the ISO19139 namespaces or the parser is told to use the ISO19139 XSDs.

He goes on to say (as I did) that

I expect the common format for exchange of metadata is the ISO19139 format. All profiles should provide and XSL for translation of their profile XML to the ISO19139 XML format. This will allow exchange of any metadata to be easily achieved.

BUT

I wonder whether we were right. I think we all understand that metadata translation is (generally) a lossy activity.

What do we want our interoperable records to achieve? Generally, I think we want discovery between communities. So the most important thing to do is find information.

In practice, we can exchange any derivative of the ISO19139 profiles using OAI and store them in xml databases - we don't necessarily need to parse them into a relational schema or into "our" xml schema. The question should really be "How do I consume instances from someone else's schema?". As I've said before, we should ignore what we don't understand, and look for what we do! That means my discovery tool may want to look for familiar tags and use those, but when the discovery client wants to consume the record (perhaps having navigated using the tools I've provided), they want the original record (without any losses introduced by translation).

So, what we need to do in practice is index using common tags; ideally the developers of profiles and standards should extend or restrict all they like, but where possible do so by declaring new (specialised) elements in new namespaces with the same tag names!
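As a crude sketch of that "index on the tags you recognise, ignore the rest" approach (the tag names and namespace below are illustrative, not real ISO19139 element names):

 from xml.etree import ElementTree as ET

 # Harvest "discovery" content from a profile instance by matching local tag
 # names only, ignoring whichever namespace the profile declared them in.
 DISCOVERY_TAGS = set(['title', 'abstract', 'keyword'])

 def local_name(tag):
     """Strip the {namespace} prefix ElementTree attaches to qualified tags."""
     return tag.split('}')[-1]

 def index_record(xml_text):
     index = {}
     for element in ET.fromstring(xml_text).iter():
         name = local_name(element.tag)
         if name in DISCOVERY_TAGS and element.text:
             index.setdefault(name, []).append(element.text.strip())
     return index

 record = """<MD_Metadata xmlns="urn:example:some-profile">
               <title>Some gridded dataset</title>
               <abstract>A record from an extended profile.</abstract>
             </MD_Metadata>"""
 print(index_record(record))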

But the advice to the IOC includes this:

Some of the metadata packages have also restrictions. E.g. dataQuality. We would have expected that there should have been a specialisation of the DQ_DataQuality class called MP_DataQuality that shows these constraints.

And every time one does this, it results in an invisible tag that won't work in a portable environment (so we will ignore it in code). What this says to me is that specialisation by restriction and name change leads to a lack of interoperability ... exactly where it should be easiest ... after all, the instance document should conform exactly to the parent schema too ... and I had thought we would only have to change the header to make our indexes and validations work (as John suggested). But instead this advice results in an instance that does need to be transformed by XSLT. It would be nice to avoid that where possible. Wouldn't it be better in this situation to NOT change the tag name in the new schema? Could it be that the IOC has got bad advice, or is there some subtlety in the rules for restriction that I (and presumably John) haven't spotted?

I need to get back into this for the NumSim project and for some collaborations with our WMO partners ...

by Bryan Lawrence : 2006/08/15 : Categories iso19115 metadata ndg : 2 trackbacks : 4 comments (permalink)

On Access Control

As part of the DEWS project, we need to deliver access control for OGC Web Services. In particular, we're planning on limiting access to resources delivered by geoserver. The current concept for dealing with this is displayed in some simple UML:

Image: static/2006/08/15/LegacyApplications.jpg

The bottom line is that a normal request will first involve a redirection to establish a security context, followed by a re-request using it, and then calling the application itself. More details are on the ndg trac site.
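Sketched as code, and only as a rough illustration of the sequence in the diagram (the names, the token handling, and the idea of doing this as WSGI-style middleware are my simplifications, not the actual ndg security implementation), the logic looks something like this:

 # Rough sketch only: check for an established security context, redirect to a
 # (hypothetical) login service if it is missing, otherwise call the protected
 # OGC application. The real exchange is more involved than this.
 LOGIN_SERVICE = 'https://login.example.org/'     # made-up URL

 class AccessControlMiddleware(object):
     """WSGI-style middleware in front of a downstream OGC application."""

     def __init__(self, application, is_authorised):
         self.application = application
         self.is_authorised = is_authorised       # callable that checks the context

     def __call__(self, environ, start_response):
         context = environ.get('HTTP_COOKIE', '') # stand-in for the security context
         if not context:
             # No security context yet: redirect so one can be established,
             # after which the client re-requests the original resource.
             location = LOGIN_SERVICE + '?return=' + environ.get('PATH_INFO', '/')
             start_response('302 Found', [('Location', location)])
             return [b'']
         if not self.is_authorised(context):
             start_response('403 Forbidden', [('Content-Type', 'text/plain')])
             return [b'Access denied']
         # Context established and acceptable: call the application itself.
         return self.application(environ, start_response)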

by Bryan Lawrence : 2006/08/15 : Categories ndg : 1 trackback : 0 comments (permalink)

On printing this blog

I had a read of this and implemented a simple print.css so it should be nicer to print things out from this blog now ...

by Bryan Lawrence : 2006/08/15 : Categories python (permalink)

Granule Concepts

We've been struggling with a few concepts in mapping how we want variables and datasets to be related. The struggle, as in most technical discussions, is that one needs to be very exact about what one is saying. To try and simplify our discussions (and maybe yours), I've tried to produce some UML which relates concepts like datasets and files to the things we actually deal with; a rough code sketch of the same relationships follows the list below.

Image: static/2006/08/11/GranuleConcepts.jpg

The key relationships:

  • datasets can be composed of other datasets

  • datasets have discovery records

  • datasets can be composed of granules

  • granules are not independently discoverable

  • granules are composed of phenomena

  • granules are associated with (potentially multiple) files

  • files can be associated with (potentially multiple) granules

  • granules are associated with variables

  • a phenomenon is associated with (potentially multiple) variables

  • a variable has exactly one CF standard name and is associated with exactly one member of the BODC vocabulary

  • a CF variable may be associated with CF coordinate variables, and if so,

  • the relationship between the variable and the coordinate may have a CF cell method associated with it.

  • a CF coordinate variable is simply a special case of a CF variable

  • a MOLES dgDataEntity is an implementation of the concept of a dataset

  • a MOLES dgGranule is an implementation of the concept of a granule.
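To make those relationships a little more concrete, here's the rough code sketch promised above; the class and attribute names are mine, not MOLES or CF element names:

 class Variable(object):
     def __init__(self, cf_standard_name, bodc_term):
         self.cf_standard_name = cf_standard_name   # exactly one CF standard name
         self.bodc_term = bodc_term                 # exactly one BODC vocabulary member
         self.coordinates = []                      # optional CF coordinate variables
         self.cell_methods = {}                     # coordinate -> CF cell method, where relevant

 class Phenomenon(object):
     def __init__(self, variables):
         self.variables = list(variables)           # potentially multiple variables

 class Granule(object):                             # not independently discoverable
     def __init__(self, phenomena, files):
         self.phenomena = list(phenomena)
         self.files = list(files)                   # many-to-many with files

 class Dataset(object):
     def __init__(self, discovery_record, granules=(), datasets=()):
         self.discovery_record = discovery_record   # datasets have discovery records
         self.granules = list(granules)             # datasets can be composed of granules
         self.datasets = list(datasets)             # ... and of other datasets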

by Bryan Lawrence : 2006/08/11 : Categories ndg metadata moles (permalink)

New Plans for Leonardo

One of the things that came out of today's meeting about campaign support within the BADC was a requirement to provide a "campaign diary", which would:

  • be persistent beyond the duration of a campaign,

  • allow the upload of files for sharing (providing a directory interface),

  • provide a timeline of entries, activities and links,

  • support multiple authors,

  • allow annotation,

  • support notification, and

  • support mathematics.

Most of this would be supported by any blogging package, but Leonardo has a couple of significant advantages, not least of which is my own personal involvement in the development. Hopefully we'll test drive leonardo in this context for a real campaign later this year.

To do that, we'll need to:

  • vastly improve the documentation

  • provide a print.css

  • better handle trackback and comments to allow nesting

  • support version control

  • provide atom feeds of the categories

  • allow a directory view of the url space, including icons for non xhtml mime types.

Is it worth doing the engineering? Should we find something that already does all of this? My gut feeling is that it is still worth the effort. Apart from my dilettante-like efforts:

  • Leonardo is standalone (that's important for getting folk to deploy their own instances),

  • Leonardo does support maths,

  • Leonardo is easy to modify ...

Something to think about when I come back from hols I think .. not for now!

by Bryan Lawrence : 2006/08/09 : Categories leonardo python badc (permalink)

dilettante

A long time ago I was given some good advice which I have consistently ignored: "to get a reputation, you need to be an expert in something, and the last thing you want to be is a dilettante".

I think it's good advice, but sometimes you can get away with it, and mostly I have ... but the last eight working days have stretched my belief that I have got away with it!

I've been taking advantage of having five-day weeks (no child care days during school holidays when Anne is at home to look after Elizabeth) to have a succession of day-long "special issue meetings" as well as a bunch of other shorter meetings. The biggish meetings have been:

  • processing affordance (in the context of GML and NDG),

  • security paradigms (in the context of NDG),

  • content management systems (for NCAS and BADC),

  • python package management (for NDG).

The smaller meetings have covered

  • Campaign Management by the BADC for the UK atmospheric science community,

  • Parallel processing plans for climate diagnostics,

  • Data Assimilation Theory,

  • Ocean/Atmosphere Coupling

  • Metadata objects for environmental science,

  • The CCLRC environmental strategy.

... and I've had the usual raft of scheduled meetings ...

If that's not evidence of a dilettante's workload, I don't know what is. I just about feel like I'm keeping my head above water, but I know what I haven't done ... meanwhile I sure know that

  1. I need a holiday, and

  2. I need some sustained time on some single issues.

Fortunately the holidays are nearly upon me ... I'll be spending the last two weeks of August in Wales, a week in the mountains, and a week on the beach. No laptop. No internet. Will it be bliss?

I don't know what I'm going to do about the single issue thing ...

by Bryan Lawrence : 2006/08/09 (permalink)

what is bt up to this time?

For the last three nights (including tonight), konqueror has been misbehaving on both my home linux systems (one running Suse, one kubuntu). Neither O/S has been modified for a long time ...

... yet all of a sudden, nearly all web pages return "Unexpected end of data, some information may be lost."

Image: static/2006/08/08/konqError.jpg

Mozilla seems fine to the same sites. My laptop, which is the ubuntu system, is fine when I vpn into work ... to the same sites ...

Conclusion: BT has been up to something with caching or some sort of nasty middleware, but I can't think of any way of getting this fault description through their microsoftacious technical support route ...

Has anyone else seen this? Or anything like it?

(Update: it looks like the css files are not being downloaded. I can't even view a css file from konqueror via bt, even if it is visible in firefox ... now I'm really confused .. nor can I view page source, even though the dom is inspectable and visible).

(Update, 11th, and now it's all fine again ...)

by Bryan Lawrence : 2006/08/09 : Categories broadband : 0 trackbacks : 7 comments (permalink)

Affording Interfaces

... and I don't mean how much it costs :-)

Yesterday I tried to get across the concept of processing affordance, a construct which it appears we have to invent because of limitations in GML.

The big deal we have left to thrash around with is how to describe these affordances and link them to interfaces ... Well, I think they're different ideas, as I've tried to express in this diagram:

Image: static/2006/07/28/SimpleAffordances3b.jpg

It has been suggested that we use a registry formalism to describe the association between operations and features (affordance), but that was when we (I) hadn't made a distinction between operations and interfaces. If we use a registry, how do we deal with the choices that need to be made about inheritance of affordance? (Update: Andrew tells me that I should look at WSDL2, so I will ... and note that the diagram has also been altered to avoid an incorrect UML use of the interface stereotype that Andrew pointed out to me).

Obviously the computational science world has wrestled with this sort of problem for ages, so there has to be a clean solution; we just need to work out how to deploy it in our "model-driven" architecture environment.

by Bryan Lawrence : 2006/07/28 : Categories ndg : 1 trackback (permalink)

What does zero emission mean?

physorg is reporting that Texas and Illinois will compete for the world's first near-zero-emissions coal plant.

Fabulous, I thought, until I asked myself what zero-emission meant, and chased it up. Actually it means "near zero emission into the atmosphere right now". It would seem the project is about building plants that do carbon sequestration. That's not necessarily a bad thing. Frankly, I have no idea whether it will work long term or not. But what is a bad thing is to be duplicitous while you're doing it: these things are not actually zero emission, they're still polluters. What this actually means is that the pollution will be buried! Again, nothing necessarily wrong with that - humanity has been doing it since civilization was invented - but burying pollution is not the same as not polluting, and it must eventually be a problem. If we're going to talk zero-emission, then we ought to be explicit about what it means.

(It must be the humidity, I'm feeling pedantic today).

by Bryan Lawrence : 2006/07/27 : Categories environment (permalink)

On Processing Affordance

When we produced the Exeter Communique, we spent a lot of time talking about something that Simon Cox has termed "processing affordance". A processing affordance is a property of a feature1 type which expresses "what can be done to or with it".

Some of us (ok, maybe just me) find it useful to distinguish between intrinsic affordances and extrinsic affordances, by which we (me) mean things that depend on the properties of the feature that were anticipated by the feature type creator (intrinsic), and things that may be independent of the properties of the feature, and are certainly things that are not described and maintained by the person or organisation that governs the description of a specific feature type (extrinsic).

This blog entry is effectively the record of a conversation that Andrew Woolf, Jeremy Tandy and I had, where we were trying to tie down (mainly for my benefit I fear), exactly why we need this affordance concept. (Of course, it's my record, so it may be nonsense ...)

If we concentrate only on intrinsic affordances, then they are certainly something we describe in our UML domain modelling. A simple example would be that if we had a gridseries feature, we clearly anticipate an operation of subsetting which allows one to extract a series ... (we anticipate a lot of other things too, but that's fairly obvious). Suppressing extraneous detail (and arguments about whether a geometry like "grid" should be a feature), in UML, we might have our simple domain model looking like:

Image: static/2006/07/27/SimpleAffordances.jpg

So far so good, but when we want to serialise it, and if we want to use GML, then we hit a serious snag. The design and history of GML as a data modelling activity means that the GML coverage type has already ditched the operations in its version of the parent CV_Coverage class, and the GML schema has no inherent mechanism for describing operations. (I spent a lot of time being confused by this, as folk had insisted on saying xml-schema has no mechanism for describing the operations, and that's obviously daft ... what they had meant was that the GML-xml-schema has no mechanism).

Because we don't believe in multiple inheritance (the ISO mechanism explicitly says we should only inherit from one class, which, amongst other reasons, makes the software easier to automate), we're stuck. In practice we have to serialise via GML schema, so we have to come up with an independent method of serialising the operations (which, when they are inherently tied to a feature, become "affordances") and then creating the link. Something like the following:

Image: static/2006/07/27/SimpleAffordances2.jpg
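In code terms, the separation in that second diagram amounts to something like the sketch below (the names are mine, and this is only an illustration of the idea, not the OGC serialisation): the feature type carries properties only, and the subsetting operation is described separately and linked back to the feature type it affords.

 class GridSeriesFeature(object):
     """Properties only - no operations, as in the GML serialisation."""
     def __init__(self, domain, range_values):
         self.domain = domain
         self.range_values = range_values

 class SubsetOperation(object):
     """An affordance described independently and linked to its feature type."""
     affords = GridSeriesFeature          # the explicit link back to the feature type

     def __call__(self, feature, selector):
         # extract a smaller feature; the actual subsetting logic is elided
         return GridSeriesFeature(selector, feature.range_values)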

The question we are left with then is how exactly to implement these relationships between operations and their parent features in a way that is likely to be consistent with the OGC methodology and one that others will buy into. That will be the topic of another note ...

1: where we use the word feature in the ISO19123 sense (ret).

by Bryan Lawrence : 2006/07/27 : Categories ndg : 1 trackback (permalink)

NASA earth exploration

A couple of days ago, I reported that NASA was killing off earth observation. Chris, in comments to that post, pointed out that despite the overall change in emphasis the current strategic plan (pdf) in section 3a listed eight new missions. Given I'm supposed to know a bit about this stuff, I thought I'd make sure I was up with the play with NASA's plans. So for the record the NASA planned missions are:

  • the "National Polar Orbiting Operational Environmental Satellite System Preparatory Project" (phew) ... NPP ... which is a tri-agency thing, aiming to ''eliminate the financial redundancy of acquiring and operating polar-orbiting environmental satellite systems, while continuing to satisfy U.S. operational requirement for data ...". A new polar orbiter will be launched in 2009 with a five year design life.

  • Cloudsat and CALIPSO - a pretty important mission from a climate perspective (and being a cloud guy nowadays of significant interest to me), and it launched earlier this year.

  • The Glory Mission - which I have to admit to not having heard of before a bit of research today - is an on-again, off-again mission. It looks like it's still scheduled for a 2008 launch, and the mission aims are (from one of the glory web sites):

    1. The determination of the global distribution, microphysical properties, and chemical composition of natural and anthropogenic aerosols and clouds with accuracy and coverage sufficient for a reliable quantification of the aerosol direct and indirect effects on climate;

    2. The continued measurement of the total solar irradiance to determine the Sun's direct and indirect effect on the Earth's climate.

  • The Global Precipitation Measurement Mission (GPM). As their web site states, precipitation is a key part of the climate system, and this is a pretty important joint activity with the Japanese Aerospace Exploration Agency (JAXA) and other international agencies to develop a constellation of satellites to measure precipitation on a global scale. (It appears that Bush's influence on this activity has already been a hindrance, and it's been further delayed this year! - 2013+?).

  • The Ocean Surface Topography Mission (OSTM) - this is about operational altimetry (i.e. measuring sea surface height with enough regularity to enable the use of the data directly in models via assimilation). It's another massive joint effort involving EUMETSAT, NOAA, as well as NASA and France's CNES. Due for launch in 2008!

  • Aquarius, to be launched in 2009, will be a joint mission with Argentina to measure global sea surface salinity. Again, this is pretty important, because this will help quantify the physical processes and exchanges between the atmosphere, land surface and ocean (i.e. runoff, sea ice freezing and melting, evaporation and precipitation over the oceans).

  • The Orbiting Carbon Observatory (OCO, scheduled for launch in late 2008) will be producing vertical atmospheric profiles of aerosol content, temperature, CO2 and water vapour. Retrieved data will also include scalar measures of albedo, surface pressure and the column-averaged dry air CO2. With a planned resolution of gridded data products of one degree monthly averages, and sources and sinks mapped at a slightly lower resolution, it will still provide enough information to start evaluating the real greenhouse gas polluters. The data may well be politically embarrassing!

The document also reports a number of other key NASA activities which are directly related to earth observation (and climate change):

  • Significant work on advancing radar, laser etc technologies to enable new space instruments.

  • They are planning to work with the US Geological Survey to secure long-term data continuity of Landsat type observations,

  • They are making significant progress assimilating EO data into models

  • They are working on policy and management decision support tools, and

  • Working on interagency cooperation to mature their research instruments into operational systems (and to utilise operational systems for research)

So there is no doubt that NASA is doing and will do fabulous earth science that will make huge contributions to understanding anthropogenic (and other) effects on the earth system.

However, what is noticeable is that all these missions (with the exception of GPM) are due for launch in the next three years (and GPM has been heavily delayed, so it's an "older" mission already). One might argue that the baleful influence of politicians is already visible in the lack of new earth observation missions for the next period (or maybe I just don't know about them and they're not yet visible in the strategic plan because they're at an earlier state of development ... I promise to ask some of my colleagues who ought to know more, and if there are things in the pipeline I'll post about them).

Clearly, now the EO words have gone from the mission statement, one can anticipate that it will be even harder to get new missions ... and even harder to justify (with NASA funding) working up the data. However, one can hope that the planned decadal survey by the US National Research Council, referred to at the end of section 3a in the NASA strategic plan, might have strong enough words in it that NASA just has to do the work!

(One might argue that as a kiwi working in Britain I have no right to be bleating about what NASA is and isn't doing, but the fact of the matter is that the climate and environment are global problems, requiring global solutions, and we all need to pull together and use whatever instruments and data we can to make progress, so we all need to know what the key players are up to!)

by Bryan Lawrence : 2006/07/27 : Categories environment : 0 trackbacks : 0 comments (permalink)

Oreskes responds

Very early in my blogging career, I reported the clear consensus in the climate community over the realism and causes of global warming. Recently a smart, famous man attempted to refute the Science paper I quoted, in a crappy article in a crappy paper (well, any paper that lets such a poor piece get published is a crappy paper - sorry, Wall Street Journal).

Anyway, the original author (Naomi Oreskes) has written a response piece in the Los Angeles Times (not one of my normal reads, so thanks to John Fleck). Anyway, some choice bits:

My study demonstrated that there is no significant disagreement within the scientific community that the Earth is warming and that human activities are the principal cause.

To be sure, there are a handful of scientists ... who disagree with the rest of the scientific community ... this is not surprising. In any scientific community, there are always some individuals who simply refuse to accept new ideas and evidence. This is especially true when the new evidence strikes at their core beliefs and values ... Scientific communities include tortoises and hares, mavericks and mules.

I've conflated a couple of her paragraphs to produce the following, but I think it's a fair summary of the situation:

... the panel (the IPCC) has issued three assessments (1990, 1995, 2001), representing the combined expertise of 2,000 scientists from more than 100 countries, and a fourth report is due out shortly. Its conclusions - global warming is occurring, humans have a major role in it - have been ratified by scientists around the world in published scientific papers, in statements issued by professional scientific societies and ... Yet some climate-change deniers insist that the observed changes might be natural, perhaps caused by variations in solar irradiance or other forces we don't yet understand. Perhaps there are other explanations for the receding glaciers. But "perhaps" is not evidence.

by Bryan Lawrence : 2006/07/24 : Categories environment climate (permalink)

America sliding towards Australia

More evidence of convergence can be found in the fact that the U.S. is trying to kill off earth observation ... yet again.

The press release quotes Jim Hansen:

"They're making it clear that they ... prefer that NASA work on something that's not causing them a problem."

by Bryan Lawrence : 2006/07/24 : Categories climate environment : 1 comment (permalink)

lurking in Exeter ... in the cool

It's my month for lengthy train journeys: last week to Plymouth for a couple of days, and this week to Exeter for a couple of days. So far (three out of the four I will have completed by tomorrow) the train journeys have been pretty much on time, and good value (in terms of time spent working and minimising my CO2 production). I mention the timeliness to see if Murphy reads blogs! (Update: Murphy not a blogreader: four out of four timely trains!)

I feel guilty. I just spoke to Anne back in Oxfordshire where the digital thermometer inside the house is reading 32C ... and this at a time when Benson (EGUB) is reporting 33C ... so obviously it has been hotter (generally inside our house doesn't get nearly as hot as it is outside) ... according to the BBC today is the hottest July day - 36.3C somewhere - since records began (beating a record that has stood since 1911). I'm guilty because it's only 24C here in Exeter, and I can imagine how uncomfortable it is at home ...

by Bryan Lawrence : 2006/07/19 (permalink)

Laptop Purchase

My laptop is nearing the end of its life. Switches are breaking. Applications run slowly, and the fan seems to be perpetually on, what with beagle and all the other stuff I seem to depend on (expensive memory hogs like the java versions of freemind and the oxygen xml editor, along with a CrossOver Office instantiation of Enterprise Architect1).

So, I've been considering how to get what I want, which is linux pre-installed on hardware which meets my criteria. I want:

  • Core Duo 2.0 GHz up ...

  • 80 GB Hard disk 7200 rpm

  • kubuntu

  • 1 GB memory

  • 1.8 to 2.3 kg in weight (4 to 5 pounds in old money, sorry American readers, your units are behind the times)

  • A docking station

  • Decent battery performance.

  • Support for external monitor at 1900x1200 or a display projector (beamer) without hassle.

  • Software suspend and hibernate.

  • Higher resolution display than 1024x768 ... preferably in both dimensions!

I've had a look at the Dell 620, but the docking station options on the dell site seem to imply I can't plug my monitor cable in ... what's the point of that? Also, I can't find anyone who will preinstall linux for me on one of those (yes, I could do it myself, but this thread suggests that I might have more work than I want ... or have time for).

A Lenovo Thinkpad T60 looks like a starter, and emperorlinux appear to offer what I need (albeit with an American keyboard). But EL are very slow at replying to email ... not sure about giving them money! (Update 20th of July: slow they may be, but they can do what I want and they have got back to me with a compelling quote - now all I have to do is get it through our local purchasing.) (EL have a Sony that fits the specs too, and while I currently have a Sony, they don't have anything approaching a sensible global warranty.)

The only company I could find in the UK which preinstalled linux had too small a set of offerings to deliver what I need. I've heard rumours of HP and lenovo preinstalling linux but HP doesn't have any suitable hardware (whatever the O/S) and Lenovo don't actually offer it to individuals (yet?) as far as I can tell.

Anyone able to give me any advice?2 ... Certainly seems like there is a commercial opportunity here somewhere ...

1: Given that only one of those four key components is free, I think no one can argue that I only ever espouse free and/or open software (ret).
2: JP, don't suggest a mac again, we're not that wealthy at the BADC :-) (ret).

by Bryan Lawrence : 2006/07/18 : Categories computing (permalink)

openlayers

I think the advent of openlayers might be more important than Google Earth ... it's certainly something we'll have to implement for ndg.

by Bryan Lawrence : 2006/07/18 : Categories ndg computing (permalink)

Whither Our Web Servers - Part III

In which we return to performance ... having been discussing web server technology from first principles to proxy servers.

So, we have two things for which we need to improve performance:

  • our web servers per se, and

  • the data delivery services they expose,

and we need to do this in a way that preserves efficient software development and maintenance (we need to avoid the curse of premature optimisation1, at the same time as improving services).

So the first thing we should do is ask what we mean by "improving performance". Well, in this context I do mean improving the user experience, but I don't mean ensuring accessibility and standards compliance. One quick win will be to use more web servers (i.e. buy more hardware), and simply spread the load across more systems. That's underway. On the slightly longer timescale (weeks to months), we're investigating ultra monkey, which will allow us to address serving data more efficiently. We're also looking at replacing NFS in our storage systems as much as possible with more efficient protocols. On the even longer timescale, we need to work out how to migrate some of our python cgi stuff to faster server environments, which is really what this series of blog notes is about. While the server environment is not really a pinch point for many of our data delivery services (which are generally constrained by NFS I/O, processing code, and the FTP protocol in sequence), it is the point at which user frustration can begin ... while a slow website annoys folks more than is warranted, it is still the perception that counts!

Ok, so back to the server, starting with python server frameworks. A framework is something that helps build a web application, minimally including something to resolve the URLs, deal with HTTP headers etc, give access to context variables and provide session management. Most frameworks provide some sort of database access system ... we'll come back to that.

Given that we need to be able to go from demonstration to production quickly, it makes sense that we deploy frameworks that allow fast development and good efficiency. Obviously given our environment (and budget), we're looking at open source solutions (with the caveat that I'm nervous about open source frameworks that don't have big communities).

But there are a zillion python frameworks, and that's the biggest problem python has ... too much choice (and fragmented development and support). However, python also has the Web Server Gateway Interface (WSGI), which provides a protocol to allow frameworks to communicate with applications. In principle one can develop an application that doesn't know too much about the framework in which it is embedded ... but it probably still needs to talk to a database. The database access could be built into middleware (which could lie between a parent framework and your application, all communicating with WSGI), but more usually the database interaction is integral to the code, is framework specific, and needs to be in the application. That breaks some of the framework integration possibilities of WSGI. (Incidentally, database integration is the big problem with most of the frameworks we've investigated: they work fine if you are developing a new application, but may be less flexible in deploying an application which needs to talk to an existing database.)

Nonetheless, it appears that WSGI still allows some significant flexibility if you're trying to work up from CGI to something which performs better. Frameworks have generally had their own way of talking to the server: minimally using CGI, or via generic (Apache) modules like mod_python, mod_fastcgi and mod_scgi (of which more in a later post). However, most of them now offer WSGI, which means one can develop code in any given framework that can be embedded in most others, and talk to servers either via such middleware frameworks, or directly via a WSGI server.
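Just to make the WSGI point concrete, here's a minimal application plus a trivial piece of middleware (the names are mine, and the built-in wsgiref server is for development only); the same application callable can sit behind CGI, FastCGI/SCGI adapters, or another framework without change:

 def application(environ, start_response):
     """A minimal WSGI application: framework agnostic."""
     start_response('200 OK', [('Content-Type', 'text/plain')])
     return [b'Hello from behind whatever server adapter you chose\n']

 class LoggingMiddleware(object):
     """Trivial WSGI middleware: wraps any application and adds behaviour."""
     def __init__(self, app):
         self.app = app

     def __call__(self, environ, start_response):
         print(environ.get('PATH_INFO', '/'))      # e.g. logging, security, etc.
         return self.app(environ, start_response)

 if __name__ == '__main__':
     from wsgiref.simple_server import make_server
     make_server('localhost', 8000, LoggingMiddleware(application)).serve_forever()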

Once again, it's time to stop, and we still haven't gotten to performance. I might as well be writing a book. That wasn't my intention :-(. Anyway, next time we'll talk about deploying WSGI etc.

1: I used Cook Computing as a link for this, because I like the fact he tends to go back and get the quotes right, e.g. Postel's Principle (ret).

by Bryan Lawrence : 2006/07/18 : Categories ndg computing badc : 1 trackback (permalink)

the scope of blogging

Allan Doyle pointed me to a provoking piece asking why the GIS community don't have blog conversations? Well, I'm not really a GIS person but as is obvious from the direction that ndg is taking, I'm heading in that direction. Dave Bouwman's piece essentially posits two explanations:

  • It's a numbers thing ... the community is too small.

  • It's a technical savvy thing. The GIS community isn't up with the play (clearly clever enough, but maybe it's just one tool - albeit an easy one - too far?)

The links in Allan's post point to some rebuttal of these points, and add some more explanations, including teasing out "the newness of it all" as a (the?) fundamental explanation.

I think these are all fair explanations, but I have another (which is a variation of the size thing). From what I can see there are two classes of (technical) blog conversation going on:

  • what I might term open source discussions where the work and conversation is not the core activity of the folk involved, and

  • corporate communication blogging - both managed and unmanaged (by which I mean some of which is planned and controlled by management, and some of which is actively encouraged, but not guided).

These two classes then represent

  • enthusiastic folk talking about what they like doing beyond their jobs, and

  • enthusiastic folk talking about what they do because either the company makes it happen or it accepts that it's a good thing (TM) for the company.

These enthusiastic folk are generally good communicators who get a lot of out of conversations, virtual or otherwise. There are another two classes of folks:

  • folks who plain don't communicate well. Never mind the medium. So let's forget them, they'll never blog effectively, if at all.

  • folks who could and would be enthusiastic bloggers, but are not in a position to be publicly so ... in order to protect IPR for later commercial (or academic) exploitation. Generally, but not always, there will be an employer proscription (either real or imagined) behind these folk not blogging (even if they were savvy enough and the community big enough).

I think this last reason is directly coupled to the newness thing, and the maturity of any given company's understanding of the open revolution; that is, that progress, in some markets, will be faster and more financially successful by working in public than in private. This sort of thing is espoused by, amongst others, Simon Phipps of Sun1.

Note that this is not just about open source, it's about open working; my guess is that effective bloggers are more effective workers and deliver the goods faster ... by exploiting the new learning/feedback medium. However, that exploitation depends on critical mass, and both GIS and climate science are still missing that critical mass, and that's one reason why GIS workers aren't blogging: there aren't yet enough open source GIS blogs creating a critical mass so that the more commercially orientated folks become even better at meeting their commercial goals by blogging ...

So, my variation of the size thing ends up needing a bigger pool of bloggers, and so Dave's redux post is exactly right (for all communities). Those of us who blog should:

  • make an effort to bring more people into blog / rss / aggregator awareness,

  • suggest that people start their own blogs so we have wider diversity, and

  • compose posts such that they invite conversation.

I only wish I had the time to write something shorter, and thus more inviting of a reply ...

1: via Tim Bray who also adds a bit of sanity on the reality of open source development, although he neglects the academic development of open source things that seeds some really successful projects and businesses. (ret).

by Bryan Lawrence : 2006/07/18 (permalink)

Trac Macro Hacking

I make a lot of use of freemind, and because ndg uses trac, I wanted to be able to put my notes straight up on the wiki.

I had a look around for relevant macros, and found this one, but while it seems to make nice flash diagrams, I just wanted to be able to simply list my freemind... so I delved into trac and wrote a simple wiki macro.

The code is here. To use it, simply add the mindmap as an attachment to a trac wiki page and then put

 [[SimpleFreeMind(attachmentFileName)]]

on your wiki page. It won't preview, but it will be there for real ... It's a bit ropey and untested, so I'm not putting it up on the trac-hacks site until it has had some significant use.
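For the curious, the guts of the macro are no more than walking the FreeMind XML: a .mm file is just nested node elements with TEXT attributes, so the core is something like this sketch (which leaves out the trac-specific plumbing for fetching the attachment and emitting wiki markup):

 from xml.etree import ElementTree as ET

 def freemind_to_list(mm_text):
     """Turn a FreeMind .mm document into an indented plain-text list."""
     root = ET.fromstring(mm_text)
     return '\n'.join(_walk(root.find('node'), 0))

 def _walk(node, depth):
     lines = ['%s* %s' % ('  ' * depth, node.get('TEXT', ''))]
     for child in node.findall('node'):
         lines.extend(_walk(child, depth + 1))
     return lines

 example = '<map><node TEXT="notes"><node TEXT="todo"/><node TEXT="done"/></node></map>'
 print(freemind_to_list(example))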

by Bryan Lawrence : 2006/07/17 : Categories python ndg : 2 comments (permalink)

Curious as to how this forecast goes.

The met office via the BBC is forecasting another hot week:

Image: static/forecasts/bbc.forecast.5day.060716.jpg

it'll be interesting to see how this one goes ...

Update (Tuesday night): A pretty good forecast thus far, at least as far as the maxima go. It hit at least 31 yesterday, and at least 32 today. I'm not so sure that the minimum temperature forecast is that good though! Anyway, this is the next five days' forecast:

Image: static/forecasts/bbc.forecast.5day.060718.jpg

One day I might dig out some real data and look into whether or not this really is an uncommonly hot summer ...

by Bryan Lawrence : 2006/07/16 : Categories environment (permalink)

hapless bt strike again

Some of you may remember my bt broadband saga ... Last week my voyager 2091 modem died ... all the lights flashed properly but there was no one home inside ... and it no longer reacted to the reset button (or anything actually; dhcp from ethernet and wireless got a great big nothing back) ...

So, I rang BT broadband technical support, and they (after an hour on the phone) agreed with me that it was buggered ... agreed that it was still under warranty, and supposedly ordered me one (to arrive within three to four working days).

You'll not be surprised to know that it didn't turn up, and so I phoned up last night, got fobbed off till today, and then eventually I discovered that they weren't sending me one because there was no record of my having been sent one on this telephone line ... (so why did they think it was under warranty?)

How can they

  • have no record of having sent it to me? (Maybe some crapola between the home-highway team and the broadband team?)

  • possibly not ring me up and tell me the product that they promised isn't going to come? (Shades of engineer visits that don't happen).

So today, after two phone calls to two different numbers talking to two people who had no clues ... I finally got transferred to someone who sounded sensible, and she's ordered me one (for free!), to arrive Monday ... we'll see.

How can such a big company survive with such crap customer-management systems?

by Bryan Lawrence : 2006/07/13 : Categories broadband : 1 comment (permalink)

Whither Our Web Servers - Part II

Last week I introduced our problem (how to move from python-cgi to python-"something faster", but in a framework with other stuff), and some things I've learned. This week's episode covers mod_proxy and a bit more ...

OK, so we left it with mod_xxx (insert your language) not recommended as a solution, and perhaps a taste of an implicit recommendation for lighttpd and/or scgi if you're python based. But before we go there, we have some more things to consider.

Apparently what Zope and Java servers actually do is create their own http servers which do their own:

URL parsing/generation for you to make sticking them behind a HTTP proxy (like Apache's mod_proxy, or Squid, or whatever) at an arbitrary point in the URI a piece of cake. Typically you redirect some URL's traffic (a virtual host, subdirectory, etc.) off to the dedicated app server the same way a proxy server sits between your web browser and the web server. It works just like directing requests off to a Handler in Apache, except the request is actually sent off to another HTTP server instead of handed off to a module or CGI script. And of course the reply comes back as a HTTP object that's sent back to the originator.

Mark seems to argue that expanding on this approach, which I would summarise as "using a proxy server to farm requests off to dedicated servers deploying http interfaces", ought to be the road down which most should travel. However he also predicts that what he calls the technically less impressive options of FastCGI and/or SCGI will get improved implementations in Apache instead and become more important. Pedro Melo summarises this choice as:

should we continue with (FS)CGI and friends or use basic HTTP between the front-end web server and the back-end application servers?

and goes on to conclude that if we use HTTP between the front and back end servers, we end up having to write, test and debug the http stack for each back end server, instead of using tried and tested implementations (e.g. Apache, lighttpd etc.) to do the http serving, and something specific for the application, joined by a relatively simple protocol.

Ian Bicking in a typically well argued piece makes a strong argument for the (X)CGI school:

What FastCGI and SCGI provide that HTTP doesn't is a clear separation of the original request and the delegated request. REMOTE_ADDR is the IP address of original request, not the frontend server, and HTTP_HOST is also the original host. SCRIPT_NAME and PATH_INFO are separated out, giving you some idea of context.

Ok, at this point I'm sold on the argument that we use a proxy server to farm things out to a backend service (or server) which might be an (X)CGI module running on the initial server, or it might be running on a different server (still with Apache or lighttpd or somesuch interfacing it), allowing more sophisticated access control and/or load distribution.
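The python end of that arrangement can be pretty small. A sketch, assuming the flup package's FastCGI WSGI server and a front end (Apache or lighttpd) configured to pass requests through to the socket below (the address is of course made up):

 # A WSGI application exposed over FastCGI on a local TCP socket, with the
 # front-end web server doing the proxying and access control.
 from flup.server.fcgi import WSGIServer

 def application(environ, start_response):
     start_response('200 OK', [('Content-Type', 'text/plain')])
     # SCRIPT_NAME/PATH_INFO and REMOTE_ADDR arrive as the front end saw them
     body = 'You asked for %s\n' % environ.get('PATH_INFO', '/')
     return [body.encode('utf-8')]

 if __name__ == '__main__':
     WSGIServer(application, bindAddress=('127.0.0.1', 8009)).run()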

That's probably a good place to stop today's missive. Next time we'll see if we can get any evidence on the efficiency of these backend servers and compare them with the jsp backend servers. I have a friend who has offered to eat his hat if java on the backend server isn't the most efficient option ... I don't particularly want him to eat his hat, but I'd rather folk didn't offer up such risky eating options!

by Bryan Lawrence : 2006/07/10 : Categories ndg badc python : 2 trackbacks : 0 comments (permalink)

Microsoft goes open document

Finally, microsoft have come out and admitted they're being forced to play in the open document sandpit. Tim Bray has a roundup of relevant links. Bob Sutor has a set of history links that define how we got here. I've been quiet on this issue for a long time, from pressure of work. But you can see that I care (and why) if you check out my msxml category page.

The bottom line is that Microsoft will initially deliver an import routine so Office can read ODF documents ... it isn't fully featured yet, and it's not really clear that it will be, but it is a good step. From my point of view however, I will only be happy when ms-documents are saved natively in ODF with no loss of information! Having seen an example of the two formats, there is no way I want an archive full of the MS openXML product ...

by Bryan Lawrence : 2006/07/10 : Categories msxml (permalink)

What does a forecast icon mean?

Well, the forecast is beginning to go pear-shaped, although not through lack of heat ... or perhaps not. Here are four MetOffice via BBC weather forecast snapshots, but I'm not really sure what they mean ...

Firstly, here is a 24 hour forecast issued late Monday, followed by a five day forecast (today is Wednesday):

Image: static/forecasts/bbc.forecast.24h.060703.jpg

Image: static/forecasts/bbc.forecast.5day.060703.jpg

What's interesting about this is that the 24 hour forecast makes it clear that the weather may be foul at some point during Tuesday ... but the 5 day forecast average symbol (actually, I think the word on the website is "predominant") ... suggests sunshine. Me, I think I would want to know in a five day forecast that rain and thunderstorms might occur. Mind you, the website states:

This is calculated based on a weighting of different types of weather, so if a day is forecast to be sunny with the possibility of a brief shower, then we will see a sunny or partly cloudy symbol rather than a rain cloud.

Well ok, at least they did explain the methodology, but compare this with today's forecast:

Image: static/forecasts/bbc.forecast.24h.060705.jpg

Image: static/forecasts/bbc.forecast.5day.060705.jpg

Now, the average symbol tells me what I want to know - risk of rain, but the 24 hour forecast makes it clear that it's not expected to be a big part of the day. Now that does contradict what the website says, so there is an element of inconsistency here (unless we believe that there are thunderstorms between the sunny spells, since the hourly values appear to be instantaneous).

Leaving aside the fact that a look out the window suggests the average is more correct, one is left wondering exactly what the algorithm is for constructing these things. (Clearly it's autogenerated, because you can get these on a postcode basis).

We've spotted this sort of thing in the past, one day I must chase this up ... because I'm getting interested in what folk expect out of climate forecasts.

by Bryan Lawrence : 2006/07/05 (permalink)

Whither Our Web Servers - Part I

BADC currently deploys a bunch of web services (in the loosest definition), most of which are CGIs, and nearly all of which are served up by apache (except some specific ZSI web services which for the moment run inside the firewall on their own ports). We also have some tomcat (for sure) and mod_perl (I think).

Nearly all our development is being done in python in the cgi environment, but now some of our development is being associated with real services, so it's time for us to look into server-side options to make them run fast.

I'm pig-ignorant about this sort of thing - CGI has been enough for me thus far - and while I expect to get educated by my team, I wanted to get the vocab in my head, so I did a bit of googling. Not surprisingly there's a lot of stuff out there, so I needed to (and still need to) make some notes, which I'll put here over the next couple of weeks so I can get corrected if I get the wrong end of the stick.

Firstly, I found a really good article by Mark Mayo1 and two useful responses by Pedro Melo and Ian Bicking. What did I learn from these guys? Lots, but worth noting for now include:

Mark Mayo on why not CGI?:

... you have to load the whole shebang from scratch ON EVERY REQUEST. Performance will be shit, trust me, and so for anything but development with these "megaframeworks" CGI is completely unfeasable. You need a way of getting all that code loaded into memory and have it stay persistent across requests. That's what you get with an Apache module, and that's what you get with FastCGI... now you have these compelling competing frameworks, written in competing languages, needing persistance on the web server and they're not going to get it with an Apache module.

Huh? You get it with an Apache module, and then you don't. Hmm. We might have to come back to that. Meanwhile, Mark Mayo on alternatives:

FastCGI: What's wrong with FastCGI in Apache? ... The UNIX Domain Sockets are unreliable, for unknown reasons. Switch to TCP runners and they sometimes hang. Unexplicably. ... Matters aren't helped by the fact that the FastCGI C code itself is crufty ... Finally, FastCGI in Apache just isn't as flexible as we've come to expect things in Apache to be. ... SCGI came about as a simpler FastCGI replacement in the Python world ... But the FastCGI technology itself is clearly quite a bit better than most people's experience with it under Apache would lead them to believe. It's been rock solid in Zeus for many years. Recently lighttpd has also proven that FastCGI can be quite robust and quick in an open source web server, to the point that it's superior FastCGI implementation propelled lighttpd into the limelight from nowhere as the way to run the new wave of web frameworks.

OK, so maybe we need to look into lighttpd. But Mark also pointed out that Apache may well improve their fastcgi support.

Back to the "why not Apache modules?" question. Mark again:

Apache modules are not a viable path forward. I think most experienced sysadmins already know this. Building and maintaining Apache gets exponentially more complex as you add modules, and that's reason enough to avoid it in my books without even considering the memory consumption issue.

What memory consumption issue? Actually, that seems to be best described in a post Mark Mayo made in comments (number 24) to the one about apache and fastcgi, in response to someone else recommending mod_whatever (python, perl, etc.):

... the problem of getting two different versions of the Ruby interpreter into memory as Apache modules at the same time. Smart coding isn't going to magically eliminate the fact that each Apache process (4 per connection on average) carries the interpreter (via a shared object, no less!) and since requests are round-robin'ed across available processes each process will end up tickling all the memory pages required to run your code. Which means each process, even if it's just sending out a little .css file, is going to be taking up dozens of megabytes of memory.

(but at least it will be persistent, yeah?)

Mark's post goes on some, but this is where I have to stop for now. Look out for my next thrilling installment ... covering the rest of Mark's post, Ian and Pedro's replies, and moving forward ...
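
To make the persistence point concrete, here's a minimal WSGI sketch of my own (nothing to do with Mark's post; it assumes Python 2.5's wsgiref for the CGI case, and something like flup would play the FastCGI role). Run as plain CGI the module is re-imported for every request, so the counter never gets past 1; run under a persistent server the counter keeps climbing.

import wsgiref.handlers

hits = 0  # module-level state: only useful if the process persists between requests

def application(environ, start_response):
    global hits
    hits += 1
    start_response('200 OK', [('Content-Type', 'text/plain')])
    return ['request number %d\n' % hits]

if __name__ == '__main__':
    # plain CGI deployment: a fresh process per request, so hits is reset every time
    wsgiref.handlers.CGIHandler().run(application)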

1: Yes I did read the whole thing, though I can't claim to have taken it all in (ret).

by Bryan Lawrence : 2006/07/05 : Categories ndg badc python : 2 trackbacks : 0 comments (permalink)

The Great British Summer

The great British summer gets a pretty bad press from folks who come from my part of the world, but in my experience it's pretty unjustified. I think British weather gets its crap reputation (deservedly) from the other seasons, when it's not too unusual to go weeks without seeing even a glimpse of the sun. Having said that, summer does occasionally go awol here, but those years are getting fewer and further between ...

Meanwhile, what's to complain about in this met office (via bbc) forecast?:

Image: static/2006/07/02/weather.forecast.crop.jpg

by Bryan Lawrence : 2006/07/03 : 1 comment (permalink)

Useful TCDL papers

I've just discovered the bulletin of the IEEE Technical Committee on Digital Libraries. The current issue has a number of interesting papers from my perspective:

  • Egger discusses shortcomings of the OAIS (Open Archival Information System) model in terms of what it means to be "conformant". His analysis essentially identifies the mixing of technical and management concepts as a key problem, along with a strange admixture of abstraction. One of his references is to a java API for a content repository for java technology (jsr170, 2004).

  • Golub discusses using controlled vocabularies to automate classification of textual documents (web pages) in the context of browsing. I'm interested in that because we have a wee problem in rationalising the (hopefully eventually automated) metadata that we will get from systems like the Numerical Model Metadata with the more textual material we would get if we asked a scientist to write a document covering some of the same ground, but with a very loose set of constraints (NumSim). This of course is just an example of a more generic problem. At some point or another we will want to rationally browse from human generated material to similar automatically generated material and vice versa ... that means we'll need to classify the former automatically. This is of direct scientific importance because interdisciplinary science depends on being able to make connections (find data and information) from areas where one is less familiar with the ideas, concepts and nomenclature - exactly what browsing by classification is all about. Golub has some references on this point, and I've been on about this for ages.

  • There are also a couple of papers on searching that looked interesting too. I haven't yet read them, but I'm bookmarking them here: Ramirez and Kriewel. I probably ought to read Frommholz's paper on annotation too.

(I liked the concept of this issue too: getting a group of doctoral students together for a consortium meeting and creating a special issue out of the discussion and papers presented ... these sorts of things are good for both the readers and the creators. Of course every time I read this sort of thing, I wish that atmospheric science papers were as accessible ... roll on open access!)

by Bryan Lawrence : 2006/06/28 (permalink)

Good Standards Development - Whither web services

Well, I'm back from the go-essp meeting at Livermore (where the temperatures hovered around 40C) ... and hopefully back to some blogging. At the meeting, Russ Rew drew my attention to Michi Henning's article on the rise and fall of CORBA. I didn't get time to read it then, but Tim Bray also linked to it, and jogged my memory (and provided the other links for this post). While I'm not really interested in CORBA, since I've never had to know anything about it (and it seems dead), the advice on standards seems right on, and so I'm repeating it here:

  • Standards consortia need iron-clad rules to ensure that they standardize existing best practice. There is no room for innovation in standards. Throwing in "just that extra little feature" inevitably causes unforeseen technical problems, despite the best intentions.

  • No standard should be approved without a reference implementation. This provides a first-line sanity check of what is being standardized.

  • No standard should be approved without having been used to implement a few projects of realistic complexity.

  • To create quality software, the ability to say "no" is usually far more important than the ability to say "yes."

Tim also links to Bruce Eckel, who compares corba with WS-*:

Of course, the fundamental idea of web services is a good one. We really do want to be able to extract useful information from a remote site. That's happening, and it will continue to happen. But it must be simple, because even if programmers are smart and can figure things out, the CORBA experience proves that we don't want everything to be as complex as it can possibly be.

Then Tim also points out Steve Loughran who summarises where we are now with:

Out there in the intra-organisation land, you either have REST, relatively simple SOAP (no WSRF, WS-Eventing, etc), or something custom ...

and

One issue is whether distributed objects are the right metaphor for distributed computing. CORBA .. et al (including WSRF) ... all implement the idea that you are talking to instances of stateful things at the end, with operations to read/write state, and verbs/actions/method calls to tell the remote thing to do something. These protocols invariably come up against the object lifetime problem, the challenge of cleaning up instances when callers go away... They also have the dealing with change problem ... either end of the system may be suddenly replaced by an upgraded version, possibly with a changed interface, possibly with changed semantics ...

He goes on to say that REST handles this by freezing things with a low number of verbs ... and claims that this is the only thing that has been shown to work.

The thing is that this is an incomplete analysis, because he's essentially saying that the only thing that has been shown to work is a custom solution - the RESTful stuff doesn't actually give you enough to build anything that relies on anything sophisticated and discoverable at the other end ... without external semantics bound up in some sort of registry. So does this mean building autonomous discoverable grid services is well-nigh impossible?

Well, I'd argue not. Within NDG we're building a custom solution built on a mixture of SOAP, XML-RPC and REST ... and a bunch of specs for "useful" services (from the OGC). The key to the interoperable-ness of our solution will be our reliance on the semantics built into those service descriptions and the quality of our registries. Roll on ebrim!

by Bryan Lawrence : 2006/06/26 : Categories ndg metadata (permalink)

Atmospheric Science will return

I'm conscious that most of my blog entries lately have been about computing/metadata type issues ... which I think reflects the huge effort I and my ndg team have been putting into trying to reach our alpha milestone in time for next week's Global Organisation for Earth System Science Portals meeting.

I hope that once that's out of the way I'll be able to get back to spending half my time on "real" science ...

by Bryan Lawrence : 2006/06/15 (permalink)

ructions in the ogc periphery

There have been some interesting issues around georss lately (via Allan Doyle). The wrap ups are here and here.

What was interesting about this was the perception that OGC doesn't want to hear from the little guy in standards development. I've blogged about this issue before, but then I was rather negative about the problem (i.e. I think it's not a big deal). Although it appears that the georss issue was a sin of carelessness by OGC, there are obviously some ongoing issues, the most significant of which, in my mind, are that the OGC needs to be careful about

  1. licensing (given what OGC wants to contribute, how much better it would be if they used Creative Commons), and

  2. how individuals can contribute and be involved in OGC activities.

With regard to the latter, I don't necessarily think the issue is about voting, it's about whether or not sensible technical contributions can come from anyone (which requires rather more openness than OGC produces by default - although at least it is better than ISO). At the end of the day, interoperability is about more than just the big companies. I do agree with this from Howard Butler:

Interoperating down in the trenches ... is still a rough, unforgiving, and frequently frustrating experience.

(where I have slightly broadened the applicability of what he actually said).

by Bryan Lawrence : 2006/06/15 : Categories metadata (permalink)

Australia drifts even closer to America

Being a kiwi, who has lived a long time in the UK, with lots of close ties with (continental) Europeans, Americans, and Australians, I believe I can say something about the "closeness" of the relative cultures.

Most folk make the assumption that Kiwis and Australians are close to each other, and closer to poms than they are to Americans. Not so. In sport Australia and NZ are close (and that's a big part of culture, but it's not all of it). In my mind, if we plotted points on a plane for each "culture", we'd probably have a line from the UK to the US, with Australia closer to the US end, and NZ closer to the UK end (and with Germany and the rest of Europe off the line, but a similar sort of distance from NZ as the UK is ... much closer to NZ than to the US)!

Anyway, more evidence for my hypothesis about the Australians being closer to the US is their behaviour on climate issues. I discovered over the weekend that the Australian CSIRO (www.csiro.au) had binned most if not all of its world-famous climate group. Unbelievable. Howard appoints a bunch of coal and oil blokes to the CSIRO board, and the climate group is canned. Not hard to make that connection. What amazes me is that either I've been even more myopic than usual, or the Australian environmental community has failed to make nearly enough noise about this (or both).

So it's all evidence that Australian federal political culture is moving down that line closer to the American federal political culture (and thus the cultures themselves are converging ...). However, one has to make a strong distinction ... at least the American climate community is strong and fighting back!

by Bryan Lawrence : 2006/06/12 : Categories environment (permalink)

Back from Greece

I'm getting withdrawal symptoms from blogging - both reading feeds, and writing something of my own - but I guess that's the result of trying to keep too many balls in the air. It's also the result of having a wonderful week on holiday in Greece - Antiparos to be precise. We (wife and thirteen-month-old daughter) had only five full days on the island, but they were great. I can't imagine a much better place to introduce a child to the delights of swallowing salt water and eating sand ...

Anyway, I'm back, and depressed by the mountain of work ahead. The expectation of the locals (and most of the tourists) on Antiparos was that one should spend at least a fortnight on the islands, and they were right ... I don't think the work mountain could have been discernibly bigger from my perspective looking up at it today, and I would have had more time to get slower and slower (on day four I said to Anne that I was just getting the hang of the pace of life ...).

I'll post some pictures when we've got the film processed (unfortunately the shutter on my digital camera jammed shut on the ferry to the island, so we're reliant on old technology).

by Bryan Lawrence : 2006/06/06 (permalink)

Suffering with unicode

Our discovery portal needs to handle discovery documents which may not include XML with the correct encoding declarations.

To see the sort of problem this introduces, consider the following python code

import ElementTree as ET                # the standalone (pre-Python 2.5) ElementTree
a = unicode('André', 'latin-1')         # a unicode object containing a non-ASCII character
b = '<test>%s</test>' % a               # mixing str and unicode promotes b to unicode
c = ET.fromstring(b)                    # expat is handed unicode, which gets encoded as ascii - boom

which produces

Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "ElementTree.py", line 960, in XML
    parser.feed(text)
  File "ElementTree.py", line 1242, in feed
    self._parser.Parse(data, 0)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 10-11: ordinal not in range(128)

(amusingly I couldn't use my embed handler to format the python code for this wiki entry because of the same problem)

Now, I can fix this, sort of, using:

c = ET.fromstring(b.encode('utf-8', 'replace'))

I say sort of, because for arbitrary content (in this example content a), it still breaks ... because it would seem that the resulting string b can still fail to import into ElementTree ...

For example, in one document we have 1.3 <= tau < 3.6 in some encoding ... and even after using an encode('ascii','replace') option we get this error:

 self = <ElementTree.XMLTreeBuilder instance>, self._parser = <pyexpat.xmlparser object>, 
self._parser.Parse = <built-in method Parse of pyexpat.xmlparser object>, 
data = '<DIF><Entry_ID>badc.nerc.ac.uk:DIF:...</DIF>
  ExpatError: not well-formed (invalid token): line 1, column 11389 
      args = ('not well-formed (invalid token): line 1, column 11389',) 
      code = 4 
      lineno = 1 
      offset = 11389

(and the string above includes column 11389) which at least is no longer a unicode error.

Well, I'm not going to solve this now, because I'm off on holiday for a week, but if anyone has solved it by the time I get back, I'll be very happy!

Update: well I haven't even gone on holiday yet, but Dieter Maurer on the xml-sig mailing list has pointed out that the < sign needs to be escaped in XML ... and I'm pretty certain that it isn't ... so now I can go on holiday knowing that it's a relatively simple thing to fix!

So the bottom line was that I had two sets of problems with my input docs: genuine unicode problems and unescaped content ... and I thought it was just one class of problem which is why I struggled with it ...
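
For the record, here's a minimal sketch of what the combined fix looks like (my reconstruction, not the actual portal code): escape the markup characters in the content, and hand ElementTree an explicitly encoded byte string rather than a raw unicode object.

from xml.sax.saxutils import escape
import ElementTree as ET

content = unicode('1.3 <= tau < 3.6 for Andr\xe9', 'latin-1')  # both problems at once
safe = escape(content)                      # fixes the unescaped markup problem (&lt; etc)
doc = u'<test>%s</test>' % safe
tree = ET.fromstring(doc.encode('utf-8'))   # fixes the unicode problem: parse bytes in a known encoding
assert tree.text == content                 # the content round-trips cleanly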

by Bryan Lawrence : 2006/05/27 : Categories python ndg : 0 comments (permalink)

ndg security and ubuntu

The current version of NDG security has a number of python dependencies, including m2crypto and pyxmlsec ... the good news is that it's relatively easy to get the things in place under ubuntu, the bad news is that it's a pain working out what packages you need, hence this list (which is mainly for my benefit). You've probably already got libxml2 installed, but you also need:

  • libxml2-dev

  • xmlsec1

  • libxmlsec1-dev

  • m2crypto

all of which can be installed using apt-get. You will also need to get a tar file of pyxmlsec which you can install with

sudo python setup.py build install  

and we need cElementTree, which is also a download and a python setup.py install task ...
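
Pulling that together, the sequence is roughly the following (a sketch only: the exact Ubuntu package name may be python-m2crypto rather than m2crypto depending on the release, and the pyxmlsec version number is a placeholder):

sudo apt-get install libxml2-dev xmlsec1 libxmlsec1-dev m2crypto
tar xzf pyxmlsec-x.y.z.tar.gz && cd pyxmlsec-x.y.z
sudo python setup.py build install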

We know this will be a pain in some circumstances, and we plan next year to try and significantly streamline our package dependencies ...

by Bryan Lawrence : 2006/05/22 : Categories ndg (permalink)

Service Binding

One of the things we have grappled with rather unsatisfactorily in the NDG is how to declare in discovery and browse metadata

  • that specific services are available to manipulate the described data entities

  • and, for a given service, what the binding between the service and data id is to invoke a service instance.

This is pretty important, as has been pointed out numerous times before, but probably most eloquently in an FGDC Geospatial Interoperability Reference Model discussion:

For distributed computing, the service and information viewpoints are crucial and intertwined. For instance, information content isn't useful without services to transmit and use it. Conversely, invoking a service effectively requires that its underlying information be available and its meaning clear. However, the two viewpoints are also separable: one may define how to represent information regardless of what services carry it; or how to invoke a service regardless of how it packages its information.

Thus far in NDG, where in discovery we have been using the NASA GCMD DIF, we have been pretty limited in what we can do, so we extended the DIF schema to support a hack in the related URL ...

Basically what we did is add into the related URL the following:

<Related_URL>
<URL_Content_Type>NDG_A_SERVICE </URL_Content_Type>
<URL>http://dmgdev1.esc.rl.ac.uk/cgi-bin/ndgDataAccess%3FdatasetSource=dmgdev1.esc.rl.ac.uk%26datasetID= </URL>
<Description>The NDG service delivering data via NDG A metadata. </Description>
</Related_URL>

Leaving aside the fact that we've embedded a naughty character (&) in what should be XML, we then create a binding for a user in the GUI between that service and the dataset id ... it's clumsy, ugly, and of no use to anyone else who might obtain our record via OAI.
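
For what it's worth, the "binding" the GUI performs amounts to something like the following sketch (the dataset id is invented, and urllib.quote stands in for whatever escaping the portal actually does):

import urllib

# the Related_URL above, with %3F and %26 decoded back to ? and &
service_prefix = ('http://dmgdev1.esc.rl.ac.uk/cgi-bin/ndgDataAccess'
                  '?datasetSource=dmgdev1.esc.rl.ac.uk&datasetID=')
dataset_id = 'badc.nerc.ac.uk__DIF__some_dataset'  # hypothetical id

clickable_url = service_prefix + urllib.quote(dataset_id)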

Ideally of course the metadata needs to be useful to both a human and automatic service discovery and binding tools. In the example above, we (NDG) know how to construct the binding between the service and the dataset id to make a human-usable (clickable from a GUI) URL, but no one else would. Likewise, there is no possibility of interoperability based on automatic tools. Such tools would be likely to use something like WSDL, or ISO19119, or both, or more ... (neither provides much information about the semantics of the operations on offer; one needs a data access query model - DAQM - which we've termed "Q" in our metadata taxonomy).

However, if we step back from the standards and ask ourselves what we need, I think it's something like the following:

<OnlineResource>
<Service>
<ServiceName>NDG_DataExtractor </ServiceName>
<ServiceBinding>
<ServiceLocation>http://DeployedServiceURL </ServiceLocation>
<HumanInterfaceURL>http://DeployedServiceURL/dataid </HumanInterfaceURL>
</ServiceBinding>
</Service>
<Service>
<ServiceName> NDG_DataExtractorWS </ServiceName>
<ServiceDescription>
<DescriptionURL type="WSDL"> address </DescriptionURL>
<DescriptionURL type="ISO19119"> address </DescriptionURL>
<DescriptionURL type="DAQM"> address </DescriptionURL>
<Description> Provides a web service interface to the
<a href="http://ndg.nerc.ac.uk">csml</a> features provided in the dataset, allowing an application (for example the NDG DataExtractor GUI, but others as well) to subset and extract specific features from the dataset.
</Description>
</ServiceDescription>
<ServiceBinding>
<ServiceLocation>http://DeployedServiceURL </ServiceLocation>
</ServiceBinding>
</Service>
</OnlineResource>

where I've made up the tags to get across some semantic meaning (yes, I know, I should have done it in UML).

OK, I think I know what we need, now how does this work in the standards world, what have I forgotten? What do we need to do to make it interoperable, and what are the steps along the way?

Well, those are rhetorical questions; I know some of the first things I need to do: starting with a chat to my mate Jeremy Tandy at the Met Office, who is wrestling with the same questions for the SIMDAT project, and then I think I'll be off reading the standards documents again. I suspect I'll have to find out more about OWL-S as I'm pretty sure there will be more tools in that area (given that ISO19139 is only just arriving for ISO19115 and there is no matching equivalent that I'm aware of for ISO19119).

by Bryan Lawrence : 2006/05/19 : Categories metadata ndg : 1 trackback : 0 comments (permalink)

Evaluating Climate Cloud

Jonathan Flowerdew at the University of Oxford has been working with me for the last two and a half years on methods of evaluating clouds in climate models. We've recently submitted a paper on this work to Climate Dynamics. If you want a copy of a preprint, contact him (or contact me to get his contact details).

The use of nudging and feature tracking to evaluate climate model cloud - J.P. Flowerdew, B.N. Lawrence, D.G. Andrews

A feature tracking technique has been used to study the large-scale structure of North Atlantic low-pressure systems. Composite anomaly patterns from ERA-40 reanalyses and the International Satellite Cloud Climatology Project (ISCCP) match theoretical expectations, although the ISCCP low cloud field requires careful interpretation. The same technique has been applied to the HadAM3 version of the UK Met Office Unified Model. The major observed features are qualitatively reproduced, but statistical analysis of mean feature strengths reveals some discrepancies. To study model behaviour under more controlled conditions, a simple nudging scheme has been developed where wind, temperature and humidity fields are relaxed towards ERA-40 reanalyses. This reduces the aliasing of interannual variability in comparisons between model and observations, or between different versions of the model. Nudging also permits a separation of errors in model circulation from those in diagnostic calculations based on that circulation.

by Bryan Lawrence : 2006/05/16 : Categories climate (permalink)

ISO 21127 aka CIDOC CRM - more metadata dejavu

Most of my colleagues in the environmental sciences won't have come across ISO 21127 (to be fair, it may not yet exist, but heck, most of my colleagues in environmental science haven't come across any ISO standard ...). I was introduced to the concepts behind it by my colleague Matthew Stiff, from the NERC Centre for Ecology and Hydrology (CEH), and I've just been nosying through a powerpoint tutorial which introduces the CIDOC Conceptual Reference Model (CRM) (ppt), which would appear to be the heart of it. Maybe the topic: Information and documentation -- A reference ontology for the interchange of cultural heritage information isn't going to engage too many of my colleagues, but maybe it should, because the key concept is that:

Semantic interoperability in culture can be achieved by an "extensible ontology of relationships" and explicit event modeling, that provides shared explanation rather than prescription of a common data structure.

That sounds familiar, if we change "events" to "observations", and replace "in culture" with "in environmental science" we'd all be on the same page ... although maybe some of my hard science mates wouldn't like the word relationships ...

Reading on we find that the CIDOC CRM aims to approximate a conceptualisation of real world phenomena ... sounds like the feature type model approximating the universe of discourse to me ... One key difference from the GML world though is the early acceptance of objects (features in my language) having multiple inheritance (which is hard to do in XML schema, hence a problem for GML).

I'm not the first to make the link between the ISO/OGC world and the CIDOC world of course; Martin Doerr, who is one of the CIDOC authors, made the connection explicit in a comparison of the CIDOC CRM with an early version of what became GML (pdf). Regrettably his conclusions are a bit dated now (five years is a long time in our business). It'd be interesting if someone did a comparison of the OGC Observations and Measures spec (or the new draft) with the CIDOC CRM ... meanwhile, when I can get my hands on the standard itself, it may make interesting reading to help inform our semantic web developments.

by Bryan Lawrence : 2006/05/14 : Categories metadata (permalink)

kubuntu dapper beta broken

I really wanted beagle, and the breezy beagle was broken (or to be precise it seemed to be progressively corrupting my kde directories), so I upgraded to the new dapper beta ...

It all seemed fine when working from home, and the new beagle (with kerry-beagle) is great, but I came into work this morning and stuck the laptop into the docking station, and two significant problems occurred:

  • cups only sees my local printer, and I can't make it browse for the network printers (despite browsing being on in my cups config file).

    • and nothing I can do seems to change anything. I can't change anything via kcontrol or the web interface to cups.

    • These problems don't seem new! It's pretty poor if a major release like kubuntu dapper (even in beta) can't get printing right!

    • Update: OK, some of my problems were to do with some incompatibility between my (pre-existing) cupsd.conf file and the new release. Using the new release file means I can now edit the file, but browsing is still broken ...

  • Xorg has broken ... I used to be able to slap the laptop down, and get my 1920x1200 big screen working by restarting the x-server ... but now nothing I can do is making this screen anything but a clone of my laptop 1024x768. I've just spent rather more time than I should have trying to fix it ...

    • Note to self: the radeon driver has a man page ...

by Bryan Lawrence : 2006/05/02 (permalink)

times higher education pays attention

claddier in the news ...

by Bryan Lawrence : 2006/05/01 : Categories claddier (permalink)

More on Data Citation

There was a lot of interest at the e-science meeting in a throwaway comment I made about our claddier project.

In talking to folk, it became clear that it helps to draw a distinction between those who interpret data and those who produce data. Often these are the same people, and there is a clear one-to-one link between some data, the data creator(s) and a resulting paper which describes the data in some detail. However, in this situation, the key reason why the paper is published will normally be the interpretation that is presented, not the data description per se (although there are some disciplines in which the paper may simply describe the data, this is not the normal case).

So what CLADDIER is about is the situation where the data creator(s) has (have) only been involved in the creation of the data, not the resulting interpretation. Again, in some disciplines the data creators may end up as authors on relevant papers (and may never even see or comment on the text). However, in most disciplines this either doesn't happen, or is severely frowned upon, and in these situations the data creators are somehow second class academic citizens (despite being an essential part of the academic food chain). What we want to be able to do is have recognition for these folk via citation of the dataset itself, not a journal paper ... that way important datasets will be measurably important.

There are a lot of complications, some of which I've addressed before. Some of the additional things we discussed yesterday included

  • How to handle small contributions (such as annotations). My take on this is that small contributions are visible via authorship within the cited entity, but probably ought not be visible outside of it (although in the semantic web world one ought to be able to count the number of these things any individual does). At some point folk have to decide on the definition of small though (probably a discipline dependent decision) ...

  • The situation is rather more complicated with meta-analyses. Arguably we're in the same situation as an anthology or book of papers ... in either case we would expect the contributions to be citable as is the aggregated work.

  • One new concept for me was that the taxonomy world might want to count (in some way) the number of times a specific taxon is mentioned, and use that as a metric of the work done categorising the species. It struck me that this might not be too clever in their world - surely the description of some rare species ought to be pretty important, but it might not get cited as frequently as some pretty routine work on, for example, some agriculturally important species. (In the journal paper world this is the same reason why citation impact alone never used to be the whole story in the UK Research Assessment Exercise.)

by Bryan Lawrence : 2006/04/28 : Categories claddier (permalink)

NDG Status Talk

As yesterday's last blog intimated, I'm in the middle of a two-day meeting of the NERC e-science community. Yesterday I gave a talk on the status of the NERC DataGrid project:

NERC DataGrid Status

In this presentation, I present some of the motivation for the NERC DataGrid development (the key points being that we want semantic access to distributed data with no centralised user management), link it to the ISO TC211 standards work, and take the listener through a tour of some of the NDG products as they are now. There is a slightly more detailed look at the Climate Sciences Modelling Language, and I conclude with an overview of the NDG roadmap.

by Bryan Lawrence : 2006/04/27 : Categories ndg (permalink)

Another Missed Anniversary

As of last Tuesday I have been running the BADC for six years! I can't believe it's been that long! My first tour of duty in the UK (1990-1996) was just under six years, and my return to Godzone (1996-2000) was only four years. Yet the last six years have gone by incredibly quickly, and seem to have been much faster (and in some ways less full) than those other two chunks of time. (Mind you, my wife and I were talking this over, and when we listed what we had done in the last six years it became obvious that the key word in that last sentence was seems.)

When I was interviewed for the post they asked me what I thought I would be doing in five years. I said I didn't know, but I was sure I would have moved on to new challenges. I guess I got that mostly right, I have moved on to new challenges, but I've moved on by staying physically in the same place and growing the job (and taking on a daughter) ...

... and I have no plans right now of jumping back down under, despite having done another long tour of duty :-)

by Bryan Lawrence : 2006/04/27 (permalink)

Baker Report

Treasury via the Office of Science and Innovation1 is putting a good deal of pressure on the research councils2 to contribute to UK wealth by more effective knowledge transfer. The key document behind all of this is the Baker Report. It's been hanging around since 1999, having more and more influence in the way the research councils behave, but from my perspective it's finally really beginning to bite, so I decided I'd read the damn thing, so people would stop blindsiding me with quotes.

The first, and most obvious, thing to note is that it really is about commercialisation, and it's driven by the government policy objective of improving the contribution of publicly funded science to wealth creation. But right up front (section 1.9) Baker makes the point that the free dissemination of research outputs can be an effective means of knowledge transfer, with the economic benefits accruing to industry as a whole, rather than to individual players. Thus the Baker report is about knowledge transfer in all its forms.

The second obvious point is that with all the will in the world, the research councils can't push knowledge into a vacuum: along with push via knowledge transfer initiatives, there needs to be an industry and/or a market with a will to pull knowledge out! Where such industry is weak or nonexistent there is the strongest case to make research outputs freely available as a methodology for knowledge transfer.

Some of my colleagues will also be glad to know that the presumption is that the first priority of the research councils should be to deliver their science objectives:

Nothing I advocate in this report is intended to undermine the capacity of (the Research Councils) to deliver their primary outputs.

Baker actually defines what he calls knowledge transfer:

  • collaboration with industry to solve problems (often in the context of contract research for industry)

  • the free dissemination of information, normally by way of publication

  • licensing of technology to industry users

  • provision of paid consultancy advice

  • the sale of data

  • the creation and sale of software

  • the formation of spin out companies

  • joint ventures with industry

  • the interchange of staff between the public and private sector.

Given that Baker recognises the importance of free dissemination of information, it's disappointing that he implies that data and software are not candidates for free dissemination. Of course, he was writing in 1999, when the world of open source software was on the horizon but not really visible to the likes of Baker. I would argue that, had he been writing today, the creation of open source software by the public sector research establishment would not only fit squarely within these definitions, he might well have explicitly included it (indeed he probably would have been required to). In terms of free dissemination of data, most folk will know I'm working towards the formal publication of data, so that fits in this definition too.

I was also pleased to see (contrary to what others have said to me) that Baker explicitly (3.17 and 3.18) makes the point that knowledge transfer is a global activity, and the benefit to the UK economy will flow whether or not knowledge is transferred directly into UK entities or via global demand. The key point seems to be that the knowledge is transferred, not where it is transferred to (although he sensibly makes the point that where possible direct UK benefit should be engendered).

Where it starts to go wrong, or at least where the reader can get carried away, is the emphasis in the report on protecting and exploiting Intellectual Property. At one point he puts it like this:

the management of intellectual property is a complex task that can be broken down into three steps; identification of ideas with commercial potential; the protection and defence of these ideas and their exploitation.

There is a clear frame of thinking that protecting and defending leads to exploitation, and this way of thinking can very easily lead one astray. It certainly doesn't fit naturally with all the methods of knowledge transfer that he lists! It can also cause no end of problems for those of us with legislative requirements to provide data at no more than cost to those who request it (i.e. particularly the environmental information regulations, e.g. see the DEFRA guidance - although note that the EIR don't allow you to take data from a public body and sell it or distribute it without an appropriate license, so the conflict isn't unresolvable).

Baker does realise some of this of course, he makes the point that:

There is little benefit in protecting research outputs where there is no possibility of deriving revenues from the work streams either now or in the future.

I was amused to get to the point where he recognises that modest additional funding would reap considerable reward, but of course that money hasn't materialised (as far as I can see, but I may not be aware of it). As usual with this government, base funding has had to stump up for new policy activities. (This may be no bad thing, but it's more honest to admit it - the government is spending core science money on trying to boost wealth creation. Fine, and indeed we have been doing knowledge transfer, and will continue to do so, from our core budget, but the policy is demanding more.)

The final thing to remember is that the Baker report is about the public sector research establishment itself; my reading of it definitely didn't support the top-slicing of funds from the grant budgets that go to universities to support knowledge transfer, but that's what is happening. Again, perhaps no bad thing, but I don't see Baker asking for it (although there is obvious ambiguity, since it covers the research councils, but when a research council issues a grant, the grant-holding body gets to exploit the intellectual property).

So the Baker report was written in 1999, but government policy is being driven by rather more recent things too. Over the next couple of months, I'll be blogging about those as well (if I have time). One key point to make in advance is that knowledge transfer can and now does include the concept of science leading to policy (which is of course a key justification for the NERC activities)

1: that's the new name for the Office of Science and Technology - such a new name that as of today their website hasn't caught up (ret).
2: Actually it applies to the entire public sector research establishment, so it includes all the directly funded science activities of the government departments as well as the research councils (ret).

by Bryan Lawrence : 2006/04/26 : Categories strategy (permalink)

I'm still naive

I've just read the real climate post on how not to write a press release. I was staggered to read the actual press release that caused all the fuss (predictions of 11C climate sensitivity etc). The bottom line is that had I read that press release without any prior knowledge I too might have believed that an 11 degree increase in global mean temperature was what they had predicted (which is not what they said in the paper). I can't help putting some of the blame back on the ClimatePrediction.net team - the press release didn't reflect the message of their results at all properly, and they shouldn't have let that happen. I'm still naive enough to believe it's incumbent on us as scientists to at least make sure the release is accurate, even if we can't affect the resulting reporting.

Having said that, I thought Carl Cristenson made a fair point in the comments to the RC post (number 13): to a great extent, the fuss is all a bit much, we should concentrate on the big picture - the London Metro isn't the most significant organ of the press :-)

(I stopped reading the comments around number 20 ... does anyone have time to read 200 plus comments?)

Amusingly I read the RC post (and the final point: "not all publicity is good publicity") less than four hours after I listened to Nick Faull (the new project manager for cp.net) ruefully review the press coverage (in a presentation at a nerc e-science meeting). He finished up with "but at least they say no publicity is bad publicity" ... one can find arguments to support both points of view!

As an aside, we spent some time at the NERC meeting discussing the fact that the current climateprediction.net experiment is completely branded as a BBC activity, even though NERC is still providing significant funding (and seeded the project in the first place as part of the COAPEC programme) ... this at a time when NERC needs to get its knowledge transfer activities visible.

by Bryan Lawrence : 2006/04/26 : Categories climate environment (permalink)

Data Storage Strategy

Recently I was asked to come up with a vision of where the UK research sector needed to be in terms of handling large datasets in ten years time. This is being fed into the deliberations of various committees who will come up with a bid for infrastructure in the next government comprehensive spending plan. Leaving aside that the exam question is unanswerable, this is what we1 came up with:

The e-Infrastructure for Research in next 10 years (Data Storage Issues)

The Status Quo

In many fields data production is doubling every two years or so, but there are a number of fields where in the near future, new generations of instruments are likely to introduce major step changes in data production. Such instruments (e.g. the Large Hadron Collider and the Diamond Light Source) will produce data on a scale never before managed by their communities.

Thus far disk storage capacity increases have met the storage demands, and both tape and disk storage capacities are likely to continue to increase although there are always concerns about being on the edge of technological limits. New holographic storage systems are beginning to hit the market, but thus far are not yet scalable nor fast enough to compare with older technologies.

While storage capacities are likely to meet demand, major problems are anticipated in both the ability to retrieve and the ability to manage the data stored. Although storage capacities and network bandwidth have been roughly following Moore's Law (doubling every two years), neither the speed of I/O subsystems nor the storage capacities of major repositories have kept apace. Early tests of data movement within the academic community have found in many cases that the storage systems, not the network, have been the limiting factor in moving data. While many groups are relying on commodity storage solutions (e.g. massive arrays of cheap disks), as the volumes of data stored have gone up, random bit errors are beginning to accumulate, causing reliability problems. Such problems are compounded by the fact that many communities are relying on ad hoc storage management software, and few expect their solutions to scale with the oncoming demand. As volumes go up, finding and locating specific data depends more and more on sophisticated indexing and cataloguing. Existing High Performance Computing facilities do not adequately provide for users whose main codes produce huge volumes of data. There is little exploitation of external off-site backup for large research datasets, and international links are limited to major funded international projects.

The Vision

In ten years, the UK community should expect to have available both the tools and the infrastructure to adequately exploit the deluge of data expected. The infrastructure is likely to consist of

  • One or more national storage facilities that provide reliable storage for very high-volumes of data (greater than tens of PB each).

  • A number of discipline specific data storage facilities which have optimised their storage to support the access paradigms required by their communities, and exploit off site backups, possibly at the national storage facilities.

All these storage facilities will have links with international peers that enable international collaborations to exploit distributed storage paradigms without explicitly bidding for linkage funding. Such links will include both adequate network provision and bilateral policies on the storage of data for international collaborators. The UK high performance computing centres will have large and efficient I/O subsystems that can support data intensive high performance computing, and will be linked by high performance networks (possibly including dedicated bandwidth networks) to national and international data facilities.

The UK research community will have invested both in the development of software tools to efficiently manage large archives on commodity hardware, and in the development of methodologies to improve the bandwidth to storage subsystems. At the same time, the investment in technologies to exploit both distributed searching (across archives) and server-side data selection (to minimise downloading) will have continued, as the demand for access to storage will have continued to outstrip the bandwidth.

Many communities will have developed automatic techniques to capture metadata, and both the data and metadata will be automatically mirrored to national databases, even as they are exploited by individual research groups.

As communities will have become dependent on massive openly accessible databases, stable financial support will become vital. Both the UK and the EU will have developed funding mechanisms that reflect the real costs and importance of strategic data repositories (to at least match efforts in the U.S., which have almost twenty years of heritage). Such mechanisms will reflect the real staffing costs associated with the data repositories.

Key Drivers

In all communities, data volumes of PB/year are expected within the next decade:

  • In the environmental sciences, new satellite instruments and more use of ensembles of high resolution models will lead to a number of multi-PB archives in the next decade (within the U.K. the Met Office and the European Centre for Medium Range Weather Forecasting already have archives which exceed a PB each). Such archives will need to be connected to the research community with high-speed networks.

  • In the astronomy community, there are now of the order of 100 experiments delivering 1 or more TB/year with the largest at about 20 TB/year, but in the near future the largest will be providing 100 TB/yr and by early in the next decade PB/year instruments will be deployed.

  • In the biological sciences, new microscopic and imaging techniques, along with new sensor arrays and the exploitation of new instruments (including the Diamond Light Source), are leading to, and will continue to lead to, an explosion in data production.

  • Within the particle physics community, the need to exploit new instruments at CERN and elsewhere is leading to the development of new storage paradigms, but they are continually on the bleeding edge, both in terms of software archival tools, and the hardware which is exploited.

Legal requirements to keep research data as evidential backup will become more prevalent. Communities are recognising the benefits of meta-analyses and inter-disciplinary analyses across data repositories. Both legality issues and co-analysis issues will lead to data maintenance periods becoming mandated by funding providers.

With the plethora of instruments and simulation codes within each community, each capable of producing different forms of data, heterogeneity of data types coupled with differing indexing strategies will become a significant problem for cross-platform analyses (and the concomitant data retrieval). The problem will be exacerbated for interdisciplinary analyses.

1: I polled a few key folk who contributed significantly to this document, it's certainly not all mine (ret).

by Bryan Lawrence : 2006/04/26 : Categories strategy : 3 comments (permalink)

Government Policy on Open Source Software

All this strategy stuff is making my head hurt. But meanwhile, I think I'll have to start a series of blog entries to provide me with relevant notes. To start with, here is a verbatim quote from the UK government policy on open source software (pdf, October 2004):

The key decisions of this policy are as follows:

  • UK Government will consider OSS solutions alongside proprietary ones in IT procurements. Contracts will be awarded on a value for money basis.

  • UK Government will only use products for interoperability that support open standards and specifications in all future IT developments.

  • UK Government will seek to avoid lock-in to proprietary IT products and services.

  • UK Government will consider obtaining full rights to bespoke software code or customisations of COTS (Commercial Off The Shelf) software it procures wherever this achieves best value for money.

  • Publicly funded R&D projects which aim to produce software outputs shall specify a proposed software exploitation route at the start of the project. At the completion of the project, the software shall be exploited either commercially or within an academic community or as Open Source Software

(although the last point has some exemptions, the key one - from my perspective - being trading funds like the Met Office)

by Bryan Lawrence : 2006/04/26 : Categories strategy (permalink)

Orthogonal Achievement

Well, I may not have done much blogging this month, but I've achieved a bunch of other stuff:

  • Firstly, and most importantly, Elizabeth turned one, and we had a big family and friends party over Easter to celebrate. She contributed in her normal way by smiling and laughing and eating in roughly equal proportions. At the same time, we built a deck outside our back door so we can get out without negotiating a temporary step that I put there nearly two years ago ...

  • Secondly, I'm suffering a barrage of strategy meetings, starting with a NERC information strategy meeting a couple of weeks ago, then a knowledge strategy meeting on Monday, and a one hour telecon today on technology strategy. I'm all strategied out.

  • Thirdly, I've probably written more code this month than in any month for a long time - an analysis of who was doing what on the way to the alpha release of our ndg code and services made it clear that there were a couple of gaps, and believe it or not my toying with the code for this blog made me the person best fitted to fill them.

  • Fourthly, I enjoyed a bit of time reading and making some contributions to some scientific papers.

All this working four days a week. Since the beginning of March I've been looking after Elizabeth on Fridays, and been fitting my nominal 40 hours in the previous four days and nights (I say nominal, because I'd like to be only doing 40 hours). It's been an interesting exercise, because I think I'm actually being more productive doing fewer longer days. However, for all that, I've still got too many balls in the air, so if you're reading this, and wondering why I haven't done something I should have, sorry ... hopefully it's still in the queue!

by Bryan Lawrence : 2006/04/25 (permalink)

ISO19115 extensions

I keep having to look up the iso document for this list of allowable extensions to iso19115:

  1. adding a new metadata section;

  2. creating a new metadata codelist to replace the domain of an existing metadata element that has free text listed as its domain value;

  3. creating new metadata codelist elements (expanding a codelist);

  4. adding a new metadata element;

  5. adding a new metadata entity;

  6. imposing a more stringent obligation on an existing metadata element;

  7. imposing a more restrictive domain on an existing metadata element.

by Bryan Lawrence : 2006/04/18 : Categories iso19115 (permalink)

More Meteorological Time

My first post describing the time issues for meteorological metadata led to some confusion, so I'm trying again. I think it helps to consider a diagram:

Image: static/2006/04/11/time.jpg

The diagram shows a number of different datasets that can be constructed from daily forecast runs (shown, for an arbitrary month, from the 12th till the 15th as forecasts 1 through 4). If we consider forecast 2, we are running it to give a forecast from the analysis time (day 13.0) forward past day 14.5 ... but you can see that the simulation began earlier (in this case 1.0 days earlier), at a simulation time of 12.0 (T0). In this example

  • we've allowed the initial condition to be day 12.0 from forecast 1.

  • we've imagined the analysis was produced at the end of a data assimilation period (let's call this time Ta).

  • the last time for which data was used (the datum time, td) corresponds to the analysis time.

(Here I'm using some nomenclature defined earlier as well as using some new terms).

Anyway, the point here was to introduce a couple of new concepts. These forecast datasets can be stored and queried relatively simply ... we would have a sequence of datasets, one for each forecast, and the queries would simply then be on finding the forecasts (using discovery metadata, e.g. ISO19139) and then on extracting and using the data itself (using archive aka content aka usage metadata, e.g. an application schema of GML such as CSML).

What's more interesting is how we provide, document and query the synthesized datasets (e.g. the analysis and T+24 datasets). Firstly, if we look at the analysis dataset, we could extract the Ta data points and have a new dataset, but often we need the interim times as well, and you can see that we have two choices of how to construct them - we can use the earlier time from the later forecast (b), or the later time from the earlier forecast (a). Normally we choose the latter, because the diabatic and non-observed variables are usually more consistent outside the assimilation period when they have had longer to spin up. Anyway, either way we have to document what is done. This is a job for a new package we plan to get into the WMO core profile of ISO19139 as an extension - NumSim.
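
To make the (a)/(b) choice concrete, here's a toy sketch (invented data layout, not real NDG code) of synthesising the analysis series from overlapping forecasts:

# each forecast, keyed by its analysis time, carries fields from the start
# of its assimilation window onwards (values are just labels here)
forecasts = {
    12.0: {11.5: 'f1@11.5', 12.0: 'f1@12.0 (analysis)', 12.5: 'f1@12.5',
           13.0: 'f1@13.0', 13.5: 'f1@13.5'},
    13.0: {12.5: 'f2@12.5', 13.0: 'f2@13.0 (analysis)', 13.5: 'f2@13.5',
           14.0: 'f2@14.0', 14.5: 'f2@14.5'},
}

def analysis_series(times, option='a'):
    series = {}
    for t in times:
        if t in forecasts:            # an analysis time: take the analysis itself
            series[t] = forecasts[t][t]
            continue
        starts = sorted(t0 for t0, data in forecasts.items() if t in data)
        if starts:
            if option == 'a':         # later time from the earlier forecast
                series[t] = forecasts[starts[0]][t]
            else:                     # (b) earlier time from the later forecast
                series[t] = forecasts[starts[-1]][t]
    return series

print analysis_series([12.0, 12.5, 13.0])   # option (a) takes 12.5 from forecast 1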

From a storage point of view, as I implied above, we can extract and store the new datasets, or we can try and do this as a virtual dataset, described in CSML, and extractable by CSML-tools. We don't yet know how to do this, but there is obvious utility in saving storage in doing so.

by Bryan Lawrence : 2006/04/13 : Categories badc curation ndg (permalink)


Curating metadata aka XML documents

Andy Roberts (via Sam Ruby) has defined a markup language for describing changes to xml documents.

One of the things that has worried me for a long time about relational systems for storing metadata is that it is non-trivial to reliably maintain metadata provenance. I don't want relational integrity, I want historical integrity (and don't give me rollback as an answer). One of the reasons why I like XML as a way of dealing with our scientific metadata is that it is much more intuitively obvious how one deals with provenance (one keeps old records), but differencing XML documents is tedious, and in the long run keeping multiple copies of large documents with minor changes is expensive.
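
As a toy illustration (invented records), the trade-off is between keeping whole copies of every version, which is simple but storage-hungry, and keeping only the differences, which is compact but fragile because XML can change its serialisation without changing its meaning:

import difflib

v1 = '<dataset id="example"><title>Old title</title></dataset>'
v2 = '<dataset id="example"><title>New title</title></dataset>'

whole_copies = [v1, v2]                                      # historical integrity by brute force
delta = list(difflib.unified_diff([v1], [v2], lineterm=''))  # store only the change
print '\n'.join(delta)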

I'll be interested to see if the delta web gains any traction.

by Bryan Lawrence : 2006/04/06 : Categories metadata ndg curation (permalink)

Standard Invention

Allan Doyle's blog introduced me to the SurveyOS project. These guys purport to be

sponsoring the development of Geospatial Standards For Free Software. These standards will provide a way to share geospatial data between oepn (sic) source applications in a vendor-neutral, XML based format.

Sound like dejavu? Well these folks recognise this up front:

Many GIS users will be critical of setting up an additional set of open standards for geospatial technology, when the OGC already maintains a set of similar standards. While we recognize the benefits of a unified approach to standards design, we believe there are some serious and fundamental flaws in the approach OGC takes to standards development.

Well, I'm not a GIS user (yet), but I'm sure critical of this. What we don't need is even more nugatory effort reinventing wheels. Anyway, they give their reasons why they don't like OGC, so let's see if we can make some sense of them:

  1. The OGC is about geospatial standards, but it is not about free and open source software.

    • True, but as the SurveyOS guys admit: OGC promotes the development and use of consensus-derived publicly available and open specifications that enable different geospatial systems (commercial or public domain or open source) to interoperate. So this might be a good case for folk to build open source implementations of OGC standards, not a good case for creating new standards.

  2. Membership to the OGC is expensive and exclusive.

    • True. But it's not hard to get involved, even without investing money. The issue would appear to be that the surveyOS folks want to influence the standards without paying to sit at the table. I understand the motivation. We've paid the minimum we can to sit close, and we don't have a vote, but I've seen no evidence that the OGC community ignore good technical advice.

  3. The OGC focus a great deal of its efforts on GIS for the web. Some of us still use GIS on the desktop. GIS and the internet are a powerful combination, but they are not the solution to every problem. The GSFS will also focus its efforts on GIS for the desktop.

    • This might be valid, but frankly what sits under the hood of a gui can have the same engine as something that sits under the hood of a browser. But even if we believe this point, it's an argument to develop a complementary set of protocols not a competing set.

  4. OGC specifications are difficult to read and understand.

    • True. And there are a lot of them as well. But why not get involved in documenting what exists, rather than developing new things? The SurveyOS web page says: We will give explanations, go into details, provide examples, and explain technical jargon. Our specifications will be designed to read like a book or tutorial, not like a specification. So why not write a book or a tutorial for the OGC specs?

  5. The OGC Provides A Standard But No Implementation. There are a few open source projects that implement the OGC standards ... Every standard adopted as a part of GSFS will be developed in conjuction with an open source implementation that can be used as an example for other developers.

    • This is the worst of the lot. Let's rephrase this to tell it like it is: Because some folk have spent years developing some technically excellent standards and have done the initial implementations in the commercial world, and not built us an open source implementation, we will take a few months (years?) to develop our own standards, and build our own implementations which won't interoperate with the others. How does that help anyone? There is nothing to stop the surveyOS guys from building their own open source implementations of any OGC spec.

They conclude with this statement:

The SurveyOS will make an effort to learn from and adopt portions of the OGC standards, but believes the problems listed above require a separate effort at creating geospatial standards.

What a waste of effort. I agree that the OGC standards are daunting in their complexity and volume, and agree that they depend heavily on the ISO standards, which are expensive to obtain, but in my experience they have covered such a lot of ground that repeating the effort will be really nugatory.

I'm a big fan of open source developments. Indeed the NDG is open source, and building on OGC protocols. I recognise the advantages of competition in developing implementations, but if one wants to repeat the standards work, one should base the argument for doing so around where the standards are deficient, not whether or not there are open source implementations of the standards!

Is the bottom line that these people are too lazy to read the specs properly?

by Bryan Lawrence : 2006/04/06 : Categories ndg : 5 comments (permalink)

Schedule Chicken

I found Jim Carson's write up on schedule chicken via Dare Obasanjo.

Food for thought in the software industry, or any industry actually!

by Bryan Lawrence : 2006/04/06 : Categories management (permalink)

Moving Office

Today I'm moving office - only about 50 yards along a corridor, but it's the usual trauma: do I need to keep this? Can I throw it away?

Ideally most of the paper would become electronic ... what I would give to be able to search my photocopied papers and grey material and books etc? But even the academic papers are beyond help .... I have a thousand plus of the things, all nicely entered in a bibtex file with a box name associated with them ... but they'll never become digital I'm sad to say. What will be interesting will be when I'll feel comfortable enough with finding digital copies (for free) on the web, to biff them. I wonder how long that will be?

by Bryan Lawrence : 2006/04/03 (permalink)

konqueror, safari, and xslt

Can it really be true that neither konqueror nor safari can render xml files with xslt as the stylesheet? (Yes it can!) Can it really be true that only IE and mozilla can do this? (Not even opera. Apparently!) Does anyone know a timescale for the kde or apple folks to sort this out? Do they care? Should I care?

by Bryan Lawrence : 2006/03/28 : Categories computing : 0 trackbacks : 0 comments (permalink)

Some practicalities with ISO19139

A number of us in the met community are rushing towards implementing a number of standards that should improve interoperability across the environmental sciences. One of those is ISO19139, which (we hope) will soon be the standard XML implementation of ISO19115 (the content standard for discovery metadata).

Both ISO19139 and ISO19115 have built into them the concept that communities will want to build profiles which are targeted to their own uses. Such profiles may consist of a combination of constraints (subsets of the standards) and/or extensions (to the standards). I've introduced some of these concepts before (parameters in metadata and xml and dejavu - and the conclusions of both still stand).

It has been suggested that this ability to constrain and/or extend ISO(19115/19139) could impact on interoperability, because by definition an extended document cannot conform to the ISO19139 schema, and so one could not consume a document in an unfamiliar profile (and possibly thereby lose the benefit of a standard in the first place).

I think this is a red herring. I don't believe anyone is going to have an archive of "pre-formed" xml instance documents which conform to ISO19139 (but see below). In nearly all cases folks will have their own metadata, which they will export into XML instances as necessary (from something, a database, flatfiles, whatever).

Where those xml instances are going to be shared by a community, they will be in the community profile - which can use xml schema to validate according to the community schema (which had better have all the extension material in a new package in a new namespace). But that package can import the gmd namespace (gmd is the official prefix of the iso19139 specific schema).

So, these will be valid instances of the schema against which they are declared, which means the xml schema machinery (love it or hate it) will be useable. Consumers of these instances can either parse the file for stuff they understand, or exporters can choose to ensure that the instances are transformed from the community profile into vanilla iso19139 on export, either from the source or from the community profile instance.

The first option (consuming an instance of a schema which we don't fully understand but which imports and uses - significantly - a namespace that we do understand) is that discussed by Dave Orchard in an xml.com article.

Orchard has some useful rules, of which number 5 is relevant:

5. Document consumers must ignore any XML attributes or elements in a valid XML document that they do not recognise.
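
By way of illustration, here's a minimal sketch (not NDG code, and the namespace URI is only indicative) of a consumer following that rule with ElementTree: it keeps the gmd elements it knows about and silently skips anything from an unfamiliar extension namespace.

from xml.etree import ElementTree

GMD = 'http://www.isotc211.org/2005/gmd'   # indicative gmd namespace URI

def harvest_known(instance_file):
    """Walk a (possibly extended) profile instance and keep only the
    elements in the gmd namespace; extension elements are ignored,
    which is exactly Orchard's rule 5."""
    tree = ElementTree.parse(instance_file)
    known = []
    for element in tree.getroot().iter():
        if element.tag.startswith('{%s}' % GMD):
            known.append(element)
    return known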

The second option (export a valid iso19139 document from your profile when sharing outside your community) will be much easier if those who create community profiles always produce an accompanying xslt for producing vanilla iso19139 ... this shouldn't be overly onerous for someone who can build a valid schema in the first place :-).

This last option will be absolutely necessary when considering the role of community, national and international portals. They will have archives of ISO19139-like documents, and it will be incumbent on the data providers to ensure that they can harvest the right documents - I don't think it makes sense to rely on the portals consuming non-specific documents and then processing them to get the right records for their portals! (This paragraph added as an update in response to email from Simon Cox who reminded me about this application.)

But all this technical machinery is the easy part. The hard part is defining a dataset so we can decide what needs an ISO19139 record! (Some of the issues for citing a dataset are discussed here, and most of them come down to: What is a dataset?)

by Bryan Lawrence : 2006/03/28 : Categories metadata ndg iso19115 : 1 trackback : 3 comments (permalink)

Novel Escapism

James Aaach in an off topic comment on my recent post about journals and blogging introduced me to www.LabLit.com. I suspect I'll waste a lot of time there on quiet evenings.

The site is full of interesting essays, but what caught my attention immediately, on a day when Gordon Brown is trying to up spending on science and technology education in the UK was the forum discussion on Gregory Benford's contention that the rise of fantasy fiction has

... led to a core lessening of ... the larger genre, with a lot less real thinking going on about the future. Instead, people choose to be horrified by it, or to run away from it into medieval fantasy ... all of this as a retreat from the present, or rather, from the implications of the future ... (no) accident that fantasy novels dominate a market that once was plainly that of Heinlein, Clarke, Asimov, and Phil Dick ... (led) to the detriment of the total society, because science fiction, for decades really, has been the canary in the mineshaft for the advanced nations, to tell us what to worry about up ahead.

The forum had some interesting threads in it (not least a list of science fiction that Benford considers has significant science in it ... which includes much I haven't read ... to my surprise). I was particularly struck by Benford's point:

that our culture has uplifted much of humanity with technology, but needs to think about the ever faster pace of change and bring along those who may well react against it. Genomics, climate change, biotechs, etc--all need realistic treatment in what-if? scenarios.

This got me thinking about the role of science fiction per se. Surely first and foremost it's about entertainment? I don't buy Benford's argument that people are consciously avoiding thinking about the future - I never once read a science fiction novel because I thought it was about a possible future, I read them for fun. Which brings me back to Gordon Brown ... While I think Benford's right nonetheless that science fiction can illuminate the future, I think an even more important cultural reason to bring people back to science fiction from fantasy is that for many of us, science fiction is what got us (dare I say) turned on to the fact that science is interesting. It's not much practical use to anyone if all forms of escapism are contemporary, historical or fantasy. More good quality science fiction will go a long way to enticing those young people into science that Gordon Brown and the UK need for the future. And good quality science fiction (that is readable science fiction which has digestible facts in it) might just help even the folk without science genes to appreciate what it's about.

I guess the problem is it's hard to write science fiction, and much of the potential audience hasn't the appetite for it any more. What I don't really understand is why not.

(As an aside, in response to Benford's position, further down, in the same link, Darrell Schweitzer makes some points about the dangers of the drift to fantasy amongst which he lists:

.. And of course the other real danger comes from writers like Michael Crichton who tell their readers that all scientific innovation is bad and scary, carried out by naive or corrupt people, and will likely kill you. THAT is socially harmful.

Amen!)

by Bryan Lawrence : 2006/03/22 : 1 comment (permalink)

New Word - Bliki

I've just learnt a new word. Apparently this blog (a leonardo instance) is a bliki. Having learnt it I propose to forget it. It seems like an unnecessary distinction to make and some more geek jargon ... but it does have a ring to it :-)

by Bryan Lawrence : 2006/03/22 : 2 comments (permalink)

Parameterisation of Orographic Cloud Dynamics in a GCM

Sam Dean, Jon Flowerdew, Steve Eckermann and I have just submitted a paper to Climate Dynamics:

Abstract. A new parameterisation is described that predicts the temperature perturbations due to sub-grid scale orographic gravity waves in the atmosphere of the 19 level HADAM3 version of the United Kingdom Met Office Unified Model. The explicit calculation of the wave phase allows the sign of the temperature perturbation to be predicted. The scheme is used to create orographic clouds, including cirrus, that were previously absent in model simulations. A novel approach to the validation of this parameterisation makes use of both satellite observations of a case study, and a simulation in which the Unified Model is nudged towards ERA-40 assimilated winds, temperatures and humidities. It is demonstrated that this approach offers a feasible way of introducing large scale orographic cirrus clouds into GCMs.

Anyone interested should contact Sam for a preprint. If you haven't got his details, let me know.

by Bryan Lawrence : 2006/03/21 : Categories climate : 0 trackbacks (permalink)

The Lockin results in the Exeter Communique

A few weeks ago I went quiet for a week while I attended a workshop at the Met Office. We called our workshop "the lock-in", as the original proposal was that we would be fed pizzas through a locked door until we came out with a GML application schema for operational meteorology. Well, we were allowed out, but being the Met Office we had no effective internet connection, and we were too knackered for anything in the evenings ...

Anyway, the important thing is that we had a really excellent week, and made some real progress on a number of issues. The results are summarised in the Exeter Communique (pdf):

This document summarises the discussions and recommendations of a workshop convened by the UK Met Office to examine the application of OGC Web Services and GML Modelling to Operational Meteorology. The workshop addressed the need to define best practice for development of data standards that span multiple subject domains, jurisdictions and processing technologies. The findings of this workshop will be of use not only to organisations involved in the processing of meteorological data, but any community that requires interoperability along a data processing chain.

Further information is at the AUKEGGS1 wikipage.

1: Australia-UK collaboration on Exploitation of Grid and Geospatial Standards (ret).

by Bryan Lawrence : 2006/03/20 : Categories ndg metadata : 2 comments (permalink)

tim bray wrong for the first time

Regular readers of this blog will know that Tim Bray's blog is a reliable source of inspiration for me ... however, for the first time, I think he's got it wrong (ok, this is a bit tongue in cheek, but anyway). He said, quoting Linda Stone:

But then she claims that email is ineffective for decision-making and crisis management. The first is not true, I have been engaged in making important complex decisions mostly by email, mostly in the context of standards efforts, for about ten years now. If she were right, we could disband the IETF.

I think the first is nearly true, and the second is absolutely true. (Of course, it's not obvious from that quote that she was actually making two different points, but we'll assume she is from his response).

I would claim that you can't make decisions unless you have control over the information flow that leads to your decisions. In particular, during crises, one needs to be able to identify the relevant information quickly. Who amongst us really can control their email? No, I mean control!

Of course Tim is a self confessed insomniac, so maybe his bandwidth is still large enough to cope. Mine isn't.

by Bryan Lawrence : 2006/03/20 : 0 trackbacks : 1 comment (permalink)

Journals and Blogging

On Thursday I gave my seminar at Oxford. Of course I wrote the abstract months before the talk, so I didn't cover half the things I said I would in any detail, but for the record, the presentation itself is on my Talks page.

Of the sixty slides in the presentation, most of the discussion afterwards concentrated on six slides at the end, and it became clear that I had confused my audience about what I was saying, so this is by way of trying to clear up some misconceptions.

For many in the audience I think this was their first significant exposure to the ideas and technologies of blogging (and importantly trackback). I introduced the concept with these factoids (from the 15th of March):

  • A Google search on "climate blogs" yields 33,900,000 hits.

  • Technorati is following 30 million blogs

    • 269,404 have climate posts

    • 1,953 climate posts in "environmental" blogs

    • 131 posts about potential vorticity (mainly in weather/hurricane blogs)

  • Very few "professional" standard blogs in our field, but gazillions in others! (Notwithstanding: RealClimate and others.)

I then compared traditional scientific publishing and self publishing, and this is where I think my message got blurred. Anyway, the comparison was along these lines:

                      Traditional Publishing                         Self Publishing
  Pluses
  Review              Peer-Review; the gold standard                 What people think is visible! Trackback, Annotation
  Quality measures    Citation                                       Citation, Trackback and Annotation
  Feedback            Publish then email; slow                       Immediate Feedback, Hyperlinks
  Indexing            Web of Science etc; reliable                   Tagging, Google; just as reliable
  Readability         Paper is nice to read                          PDF can be printed
  Other                                                              You can still publish in the traditional media
  Minuses
  Review              Peer review is not all it could be             No formal peer review
  Indexing             Proprietary indexing (roll on google-scholar)  Ranking a problem: finding your way amongst garbage
  Other                Often very slow to print                       Trackback and Comment Spam
                       Libraries can't afford to buy copies
                       (limited readership)

I then concluded that the big question is really how to deal with self publishing and peer review outside the domain of traditional journals, because I think for many journals their days are numbered (possibly apart from as formal records).

Most of the ensuing discussion was predicated on the assumption that I was recommending blogging as the alternative to "real" publishing, despite the fact that earlier I had introduced the RCUK position statement on open access and I then went straight on to introduce Institutional Repositories and the CLADDIER project.

So, let me try and be very explicit about the contents of my crystal ball.

  1. The days of traditional journals are numbered, if they continue to behave the way they do, i.e.

    1. Publishers continue to aggregate, and ignore the (declining) buying power of their academic markets.

    2. They do not embrace new technologies.

    3. They maintain outdated licensing strategies.

    4. Two outstanding exemplars of journals moving with the times (for whom this is not a problem) are:

      1. Nature.

        1. Check out Connotea (via Timo Hannay).

        2. Note also their harnessing of Supplementary Online Material is a good thing. They're only one step away from formally allowing data citation!

        3. Their licensing policy is fair: authors can self-publish into their own and institutional archives six months after publication. (Nature explicitly does not require authors to sign away copyright!)

      2. Atmospheric Chemistry and Physics.

        1. Uses the same license as this blog (Creative Commons Non-Commercial Share-Alike).

        2. Peer Review is done in public, and the entire scientific community can join in. Check out the flow chart.

  2. Early results and pre-publication discussion will occur in public using blog technologies!

    1. Obviously some communities will hold back some material so as to maintain competitive advantage (actually, I think the only community that should do so are graduate students who maybe need more time from idea to fruition, the rest of us will gain from sharing)

    2. We may need to have a registration process and use identity management to manage spam on "professional blogs", but individuals will probably continue to do their own thing.

    3. Some institutions will need to evolve their policies about communication with the public (especially government institutions).

    4. There will be more "editorialising" about what we are doing and why, and this will make us all confront each other more, and hopefully increase the signal level within our community.

  3. Data Publication will happen, and then we will see bi-directional citation mechanisms (including trackback) between data and publications.

    1. By extending trackback to include bidirectional citation mechanisms and implementing this at Institutional Repositories (and journals), we will see traditional citation resources becoming less important. (There is a major unsolved problem though: there might be multiple copies of a single resource - author copy, IR copy, journal of record copy - and done properly they all need to know about a citation, which means it'll have to behave more like a tag than a trackback alone ... however, I still think the days of a business model built around a traditional citation index may be numbered.)

To sum up:

  • Many journals will die if they don't change their spots

  • Trackback linking will become very important in how we do citation.

  • Post publication annotation will become more prevalent.

  • Blogging (technologies) will add another dimension to scientific discourse.

by Bryan Lawrence : 2006/03/20 : Categories curation : 1 trackback : 2 comments (permalink)

Back end or front end searching?

Searching is one of those things that keeps bumping into my frontbrain, but not getting much attention. I have a hierarchy of things grabbing at me:

  1. I want to search my own email (~3GB) and documents (~10GB).

    • I'm hanging out for beagle or kat to get at my kmail maildir folders in the next release of kubuntu (yes, I'm afraid of the size of the indexes I'll need).

  2. I'd like to be able to search my stuff, and some selected other stuff, and then let google have a go after that ... all in one command

    • I think I'd have to build my own "search engine" using the google web service which would be called after "my own" search engine had responded. I'll never get around to this, but

    • it would help a lot if I had search inside Leonardo, and if it could give different results depending on whether or not I was logged in, I could conceivably get around to that. I might choose one of the technologies below, or I might choose pylucene.

  3. The badc needs to provide a search interface to

    1. The mailing lists we manage.

      • These are mailman lists. One of my colleagues has had a look at searching options for that and he's come down in favour of http://www.mnogosearch.org/. Why doesn't mailman come with a search facility out of the box? Part of me says we should do a pylucene search interface and contribute it to mailman, and part of me says, run with the first option, it's easier and faster ... but see below ...

      • At the same time, I'd done a bit of a nosey, and a bit of a web search on the subject led me to (of course) Tim Bray's series of essays, and an interesting thread on Ian Bicking's blog. The latter points one at (in no particular order)

    2. We have other stuff as well (metadata in flat files, databases, data files, web pages etc).

For the badc, the big question in my mind is a matter of architectural nicety though. Should we rely on front-end searching (overlay a web-search engine, e.g. mnogosearch, or even a targeted google)? Or should we do our own searching on indexes we manage at the back end (i.e. avoid indexing via a/the http interface)? For some reason I hanker for the latter, but I'm sure it's stupidity ... fortunately for this last class of searching I'll not be making a decision about this (I've delegated it), but I feel like I should have an informed opinion.

by Bryan Lawrence : 2006/03/16 : Categories badc python computing : 0 trackbacks : 3 comments (permalink)

Using the python logging module

I've been upgrading the support for trackback in leonardo (the only things left to do are to get the trackback return errors working correctly and deal with spam blocking) ... and in the process I found it useful to put to use the python logging module.

I had to do this without the standard documentation (as I was working on a train). But obviously I had python introspection on my side ... I thought it was a good result for me and for python introspection to work it all out between Banbury and Leamington Spa!

The hard bit was working out how to do it inside another class (without the docs), but once worked out, it's all very easy (I know this is in the library docs, but I feel like recording my experience :-):

import logging

class X:
    def __init__(self, other_stuff):
        # ... other initialisation ...
        f = logging.FileHandler('blah.log')
        self.logger = logging.getLogger()
        self.logger.addHandler(f)
        self.logger.setLevel(logging.DEBUG)
        self.logger.debug('Debug logging enabled')
        # ... rest of the class ...


and now everywhere I have an X class instance, I have access to the logger, which is much easier than the way I used to do things for debugging. In my leonardo, the instance is the config instance, so obviously the logfile and level are configurable too.
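
For the record, a minimal sketch of what that configurability might look like (the config attribute names here are invented, they're not the actual leonardo ones):

import logging

def build_logger(config):
    # config.logfile and config.loglevel are hypothetical attributes
    logger = logging.getLogger('leonardo')
    logger.addHandler(logging.FileHandler(config.logfile))
    logger.setLevel(getattr(logging, config.loglevel.upper(), logging.INFO))
    return logger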

by Bryan Lawrence : 2006/03/15 : Categories python (permalink)

Declining Earth Observation

Allan Doyle has highlighted a recent CNN article which reports the poor future that EO has in the American space programme. This is something that has been brewing for a while now, but seems to be getting more and more real.

The worry of course is that each cancellation could potentially set a programme back by a decade, even if someone restarts it next year. The hard part of a space programme isn't the hardware - although that's hard - nor the testing - although that's hard too! It's keeping the team together and/or documenting how one got to where one is now - reputedly why NASA can't go back to the moon right now: they've forgotten how to build Saturn V rockets. This, as Allan says, makes it a tough issue to deal with.

One can but hope that ESA picks up some of the slack ...

by Bryan Lawrence : 2006/03/15 : Categories environment (permalink)

Functional Trackback

I reckon I've got a functional trackback working now. If anyone fancies trying a trackback to this post I'd be grateful ...

This version is pretty basic: it should reject any "non-conforming" pings with an error message, but do the right thing for correctly formed pings. It doesn't yet have any anti-spam provision (which I plan to do by simply checking that citing posts exist, and have a reference to the permalink that they are tracking back to).

The url to trackback to is the permalink of the post. I'll make that more clear in the future (and add autodiscovery probably as well).

by Bryan Lawrence : 2006/03/15 : Categories python computing : 3 comments (permalink)

Meteorological Time

One of the problems with producing standard ways of encoding time is that in meteorology we have a lot of times to choose from. This leads to a lot of confusion in the meteorological community as to which time(s) to use in existing metadata standards, and even claims that existing standards cannot cope with meteorological time.

I think this is mainly about confusing storage, description and querying.

Firstly, let's introduce some time use cases and vocabulary:

  1. I run a simulation model to make some predictions about the future (or past). In either case, I have model time which is related to real world time. In doing so, I may have used an unusual calendar (for example, a 360 day year). We have three concepts and two axes to deal with here: the Simulation Time axis (T) and the Simulation Calendar. The Simulation Period runs from T0 to Te. We also have to deal with real time, which we'll denote with a lower case t.

  2. Using a numerical weather prediction model.

    1. Normally such a model will use a "real" calendar, and the intention is that T corresponds directly to t.

    2. It will have used observational data in setting up the initial conditions, and the last time for which observations were allowed into the setup is the Datum Time (td - note that datum time is a real time, I'll stop making that point now, you can tell from whether it is T or t whether it's simulation or real).

    3. The time at which the simulation is actually created is also useful metadata, so we'll give that a name: Creation Time (tc).

    4. The time at which the forecast is actually issued is also useful, call it Issue Time (ti). So:

      $t_d \le t_c \le t_i$

    5. A weather prediction might be run for longer, but it might only be intended to be valid for a specific period called the ValidUsagePeriod. This period runs from ti until the VerificationTime (tv).

    6. During the ValidUsagePeriod (and particularly at the end) the forecast data (in time axis T) may be directly compared with real world observations (in time axis t), i.e., they share the same calendar. So now we have the following: $T_0 < t_c$, but $t_d$ can be either before or after $T_0$!

      • Note also that the VerificationTime is simply a special labelled time, this doesn't imply that verification can't be done for any time or times during the ValidUsagePeriod.

      • Note that some are confused about variables which are accumulations, averages, or maxima/minima over some Interval. These times have no special relationship with such variables: when I do my comparisons, I just need to ensure that the intervals are the same on both axes.

    7. We might have an ensemble of simulations, which share the same time properties, and only differ by an EnsembleID - we treat this as a time problem because each of these is essentially running with a different instance of T, even though each instance maps directly onto t. But for now we'll ignore these.

  3. In the specific case of four-dimensional data assimilation we have:

    $T_0 < t_d \le t_c \le t_i$

    confused? You shouldn't be now. But the key point here is that there is only one time axis, and one time calendar both described by T! All the things which are on the t axis are metadata about the data (the prediction).

  4. If we consider observations (here defined as including objective analyses) as well, we might want some new time names, it might be helpful to talk about

    1. the ObservationTime (to, the time at which the observation was made, sometimes called the EventTime).

    2. the IssueTime is also relevant here, because the observation may be revised by better calibration or whatever, so we may have two identical observations for the same to, but different ti.

    3. a CollectionPeriod might be helpful for bounding a period over which observations were collected (which might start before the first observations and finish after the last one, not necessarily beginning with the first and ending with the last!)

  5. Finally, we have the hybrid series. In this case we might have observations interspersed with forecasts. However, again, there is one common axis time. We'd have to identify how the hybrid was composed in the metadata.

I would argue that this is all easily accommodated in the existing metadata standards; nearly all these times are properties of the data, they're not intrinsic to the data coverages (in the OGC sense of coverage). Where people mostly get confused is in working out how to store a sequence of, say, five day forecasts, which are initialised one day apart, and where you might want to, for example, extract a timesequence of data which is valid for a specific series of times, but is using the 48 hour forecasts for those times. This I would argue is a problem for your query schema, not your storage schema - for that you simply have a sequence of forecast instances to store, and I think that's straightforward.
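
To make the storage-versus-query distinction concrete, here's a minimal sketch (invented class and function names, nothing to do with any real schema): the archive just holds whole forecast instances, and the "48 hour forecast valid at these times" question is answered entirely by the query layer.

from datetime import timedelta

class ForecastInstance:
    """One stored forecast run: a datum time plus data on its own time axis T."""
    def __init__(self, datum_time, fields):
        self.datum_time = datum_time    # t_d for this run
        self.fields = fields            # {simulation time: data value}

def forecast_series(archive, valid_times, lead=timedelta(hours=48)):
    """For each requested valid time, pull the value from the run whose
    datum time is 'lead' earlier - purely a query-schema operation."""
    series = []
    for t in valid_times:
        run = next((f for f in archive if f.datum_time == t - lead), None)
        series.append(run.fields.get(t) if run else None)
    return series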

I guess I'll have to return to the query schema issue.

by Bryan Lawrence : 2006/03/07 : Categories curation badc (permalink)

Climate Sensitivity and Politics

James Annan has a post about his recent paper with J. C. Hargreaves1 where they combine three available climate sensitivity estimates using Bayesian probability to get a better constrained estimate of the sensitivity of global mean climate to doubling CO2.

For the record, what they've done is used estimates of sensitivity based on studies which were

  1. trying to recreate 20th century warming, which they characterise as (1,3,10) - most likely 3C, but with the 95% limits lying at 1C and 10C,

  2. evaluating the cooling response to volcanic eruptions - characterised as (1.5,3,6), and

  3. recreating the temperature and CO2 conditions associated with the last glacial maximum - (-0.6,2.7,6.1).

The functional shapes are described in the paper, and they use Bayes theorem to come up with a constrained prediction of (1.7,2.9,4.9), and go on to state that they are confident that the upper limit is probably lower too. (A later post uses even more data to drop the 95% confidence interval down to have an upper limit of 3.9C).
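
Purely to show the mechanics (this is emphatically not the functional form Annan and Hargreaves use, and the numbers below are placeholders), combining independent constraints on the sensitivity is just a pointwise product of likelihoods followed by renormalisation:

import numpy as np

S = np.linspace(0.5, 10.0, 2000)   # sensitivity grid (K)

def skewed_like(s, median, upper95):
    """A made-up lognormal-ish likelihood pinned to a median and a 95% upper bound."""
    sigma = (np.log(upper95) - np.log(median)) / 1.645
    return np.exp(-0.5 * ((np.log(s) - np.log(median)) / sigma) ** 2)

# three illustrative constraints, combined via Bayes (uniform prior assumed)
combined = skewed_like(S, 3.0, 10.0) * skewed_like(S, 3.0, 6.0) * skewed_like(S, 2.7, 6.1)
combined /= np.trapz(combined, S)          # normalise to a pdf

cdf = np.cumsum(combined) * (S[1] - S[0])
print('median ~ %.1f K, 95%% upper bound ~ %.1f K' %
      (S[np.searchsorted(cdf, 0.5)], S[np.searchsorted(cdf, 0.95)]))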

In the comments to the first post, Steve Bloom asks the question:

Here's the key question policy-wise: Can we start to ignore the consequences of exceeding 4.5C for a doubling? What percentage should we be looking for to make such a decision? And all of this begs the question of exactly what negative effects we might get at 3C or even lower. My impression is that the science in all sorts of areas seems to be tending toward more harm with smaller temp increases. Then there's the other complicating question of how likely it is we will reach doubling and if so when.

This is pretty hard to answer, because it's all bound up in risk. As I said, quoting the Met Office, when I first started reporting James' predictability posts,

as a general guide one should take action when the probability of an event exceeds the ratio of protective costs to losses (C/L) ... it's a simple betting argument.

So, rather than directly answer Steve's question, for me the key issue is what is the response to a 1.7C climate sensitivity? We're pretty confident (95% sure) that we have to deal with that, and so we already know we have to do something. What to do next boils down to the evaluating protective costs against losses.

Unfortunately it seems easier to quantify costs of doing something than it is to quantify losses, so people use that as an excuse for doing nothing. The situation is exacerbated by the fact that we're going to find it hard to evaluate both without accurate regional predictions of climate change (and concomitant uncertainty estimates). Regrettably the current state of our simulation models is that we really don't have enough confidence in our regional models. Models need better physics, higher resolution, more ensembles, and more analysis. So while it sounds like more special pleading ("give us some more money and we'll tell you more"), that's where we're at ...

.. but that's not an excuse to do nothing, it just means we need to parallelise (computer geek) doing something (adaptation and mitigation) with refining our predictions and estimates of costs and potential losses.

1: I'll replace the link with the doi to the original when a) it appears, and b) I notice. (ret).

by Bryan Lawrence : 2006/03/07 : Categories climate environment (permalink)

mapreduce and pyro

I just had an interesting visit with Jon Blower - technical director at Reading's Environmental E-science centre (RESC). He introduced me to Google's mapreduce algorithm (pdf). Google's implementation of MapReduce

runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines.

and we all know the sorts of things it's doing. Jon is interested in applying it to environmental data. I hope we can work with him on that.
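
The pattern itself is tiny - here's a single-process toy (nothing to do with Google's implementation, or with pyro) just to show the map and reduce roles, using the paper's canonical word count example:

from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    # emit (key, value) pairs - here (word, 1) for every word seen
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    # group the intermediate pairs by key and combine the values
    pairs = sorted(pairs, key=itemgetter(0))
    return dict((key, sum(v for _, v in group))
                for key, group in groupby(pairs, key=itemgetter(0)))

print(reduce_phase(map_phase(["the quick brown fox", "the lazy dog"])))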

Obviously, I'm interested in a python implementation, and lo and behold there seem to be plenty of those too: starting with this one, which utilised python remote objects (pyro home page). I'd looked at pyro some time ago, and it seems to have got a bit more mature. However, a quick look at the mailing list led to some questions that don't yet seem to have been answered.

It's not obvious that any use by our group of mapreduce would necessarily distribute python tasks (for example, we might want to distribute tasks in C, but have a python interface to that) ... but if we do go down that route, pyro may be useful. Of course mapreduce isn't the only algorithm in town either, and these are both new technologies and we've already got enough of those ... but we can't stop moving ...

by Bryan Lawrence : 2006/03/03 : Categories ndg badc computing python (permalink)

serendipity and identity management

Two seemingly unrelated threads of activity have come together: on the one hand I want to use an identity to login to the trackback protocol wiki; on the other hand, we're talking about the difficulties of single sign on for ndg.

In the latter context, Stephen has written some notes on cross domain security. The key issue here is that we have the following use case:

  1. A user logs on to site A via a browser.

    • the only way we can maintain state is by a cookie, so site A writes a cookie to the user's browser with "login" state information.

  2. User now navigates to site B not via a link from a site A! (i.e. this is not a portal type operation).

    • because site B is not in the same domain as site A, it is unlikely that site B can get at site A's cookie (the user may have set their browser up to do this, but that's probably a bad idea for them to do, so we can't rely on this).

In the former case, we want to be able to login to a site using an identity which is verifiably mine in some way. It turns out that there are a number of identity management systems out there:

  • OpenID,

  • LID - the Light-Weight Identity.

The Yadis Wiki asserts that these two identity management systems are being used by over fifteen million people worldwide, and so the Yadis project was established to bring them together. From their wiki:

YADIS is applicable to any URL-based identity system and by no means tied to OpenID, LID, or XRI ... it became clear very quickly that the resulting interoperability architecture was much more broadly applicable. In our view, it promises to be a good foundation for decentralized, bottom-up interoperability of a whole range of personal digital identity and related technologies, without requiring complex technology, such as SOAP or WS-*. Due to its simplicity and openness, we hope that it will be useful for many projects who need identification, authentication, authorization and related capabilities.

At this stage this is all I know (except that I had a look at the mylid.net hosted service and it doesn't have any nasty terms of service ... on the other hand the site I want to get in to now only recognises OpenID). It also looks like openid might be slightly more prevalent: google on "OpenID identity" has slightly over 2 million hits, "LID identity" has just under 1.5 million hits. No idea if that's meaningful!

In any case, getting back to my serendipity point, I got all excited about LID because Johannes Ernst introduced it on the trackback list, and pointed out that

LID can be run entirely without user input (in fact, that was the original goal for authenticating requests; additional conventions then make it "nice" for a browser-based user)

It may be that this will help us avoid our cross domain cookie problem, since this is exactly what these technologies are for. So at this point it looks like I have to go understand LID and YADIS (the implication of the last link from Johannes was that OpenID does require more user input).

by Bryan Lawrence : 2006/03/02 : Categories ndg computing : 4 comments (permalink)

openid part one, why I might have to roll my own

(One day I'll blog about my day job again ... sometimes it's hard to remember that I'm an atmospheric scientist first and a computer geek second).

Anyway, on joining the trackback working group, the first thing one wants to do is annotate the wiki, which means you have to have an openid identity. Hmm. This is the first I've ever heard of openid. sixapart who run the wiki recommend one of MyOpenID or videntity to get an identity if you don't have one.

Ok, so off I trot. Firstly, MyOpenID. Let's have a look at their terms of service, and in particular this bit:

You agree to indemnify and hold JanRain, and its subsidiaries, affiliates, officers, agents, co-branders or other partners, and employees, harmless from any claim or demand, including reasonable attorneys' fees, made by any third party due to or arising out of your Content, your use of the Service, your connection to the Service, your violation of the TOS, or your violation of any rights of another.

This states that if someone in the U.S. gets pissed off with my content, and can't get me in the UK, they can go after me via JanRain, and then I'm liable for JanRain's attorney's fees! So a denial of service attack on me would be to threaten to sue JanRain (not me, which would require them to come to a UK court I think). Don't like this much. Reject.

Ok, let's have a look at videntity, and their terms of service:

You agree to hold harmless and indemnify the Provider, and its subsidiaries, affiliates, officers, agents, and employees from and against any third party claim arising from or in any way related to your use of the Service, including any liability or expense arising from all claims, losses, damages (actual and consequential), suits, judgments, litigation costs and attorneys' fees, of every kind and nature. In such a case, the Provider will provide you with written notice of such claim, suit or action.

This time it's the Costa Rican courts, but it's the same deal.

While I hope I never write anything that would piss someone off so much they'd sue, the point is that they should be suing me, and these particular indemnification clauses seems to imply that if they can't get to me in my court system, they can get to me in another court system via my identity manager ...

... thanks, but no thanks! I can see why these folk want to be indemnified, but by putting in either an explicit clause covering content or loose wording that has the same effect, I think they're making things worse. If they didn't have this, there would be no way anyone would sue them, because one couldn't win: there is no method by which they can control my content, so a prospective suer would be wasting his/her money bringing a case. However, by putting this in, despite the fact that they (the identity managers) can't control my content, there is a route for a prospective suer to get at least costs out of me, so suddenly it's worth it, because they can directly affect me.

Now I reckon there is nearly zero risk that this would happen, but it's a principle thing. These sorts of clauses are bad news, and it's up to us (the users) not to accept them! We shouldn't mindlessly click through!

So, I guess if I want to use openid, I need to roll my own identity. Well, I'll probably not bother, but it might, just might, be worth investigating for NDG ... in which case, the python openid software may be of interest ... but really, I don't think I'll ever get to a part two on this subject ... there are just too many other things I should be doing.

Update (7th March, 2006): Actually these indemnity clauses are rampant. Even blogger.com - see clause 13 - has one!

by Bryan Lawrence : 2006/03/01 : Categories computing : 1 comment (permalink)

Some Trackback code

So I've joined up to the trackback working group, and I decided I'd better have a proper trackback implementation to play with ... given this is all very important for claddier the time investment was worth it.

So here is my toy standalone implementation for playing with. I have no idea whether it is actually properly compliant, but it works for me. If folk want to point out why it's wrong I'll be grateful. But note it's only toy code, it's not for real work.

There are five files:

  • NewTrackbackStuff.py - which consists of three classes which I'll document below: a TrackbackProvider (handles the trackbacks), which uses the Handler to provide a dumb but persistent store. There is also a StandardPing class to help with various manipulations ... (it's really unnecessary, but helped me with debugging and thinking etc).

  • cgi_server.py - a dumb CGI server which serves tb_server.py on the localhost at port 8001.

  • tb_server.py - the CGI script (shown below) which handles the incoming pings.

  • ElementTree.py (part of the fabulous package by Fredrik Lundh and included here for completeness)

  • test_tb_server.py which tests the trackback to the persistent store.

It's all pretty simple stuff, anyway, this is how you invoke trackback, assuming you have the tb_server running locally on port 8001:

import urllib,urllib2

payload={'title':'a new citing article title',
         'url':'url of citing article',
         'excerpt':'And so we cite blah ... carefully  ...',
         'blog_name':'Name of collection which hosts citing article'}

s=urllib.urlencode(payload)

req=urllib2.Request('http://localhost:8001/cgi/tb_server.py/link1')
req.add_header('User-Agent','bnl trackback tester v0.2')
req.add_header('Content-Type','application/x-www-form-urlencoded')
fd=urllib2.urlopen(req,s)
print fd.readlines()


There's not much to say about that. The server looks like this:

#!/usr/bin/env python
from NewTrackbackStuff import TrackBackProvider,Handler
import cgi,os
#import cgitb
#cgitb.enable()

handler=Handler()
relative_path=os.environ.get("PATH_INFO","").strip('/')
method=os.environ.get("REQUEST_METHOD")
cgifields=cgi.FieldStorage()

print "Content-type: text/html"
print

tb=TrackBackProvider(method,relative_path,cgifields,handler)
print tb.result()


So all the fun stuff is in NewTrackbackStuff, in the TrackBackProvider:

class TrackBackProvider:
    ''' This is a very simple CGI handler for incoming trackback pings; all it
    does is accept the ping if appropriate, and biff it in a dumb persistence
    store '''
    def __init__(self, method, relative_path, cgifields, handler):
        ''' Provides trackback services
            (The handler provides a persistent store)
        '''
        self.method=method
        self.relative_path=relative_path
        self.fields={}
        #lose the MiniFieldStorage syntax:
        for key in cgifields: self.fields[key]=cgifields[key].value
        self.handler = handler

        self.noid='Incorrect permalink ID'
        self.nostore='Cannot store the ping information'
        self.invalid='Invalid ping format'
        self.nourl='Invalid or nonexistent ping url'
        self.noretrieve='Unable to retrieve information'
        self.xmlhdr='<?xml version="1.0" encoding="utf-8"?>'

    def result(self):
        target=self.relative_path
        if target=='': return self.__Response(self.noid)
        if not self.handler.checkTargetExists(target): return self.__Response(self.noid)
        if self.method=='POST':
            # all incoming pings should be a POST
            try:
                ping=StandardPing(self.fields)
                if ping.noValidURL(): return self.__Response(self.nourl)
            except:
                return self.__Response(self.invalid+str(self.fields))
            try:
                r=self.handler.store(target,ping)
                return self.__Response()
            except:
                return self.__Response(self.nostore)
        elif self.method=='GET':
            # this should get the resource at the trackback target ...
            try:
                r=self.handler.retrieve(target)
                return self.xmlhdr+r
            except:
                return self.__Response(self.noretrieve)

    def __Response(self,error=''):
        ''' Format a compliant reply to an incoming ping (the standard
        TrackBack XML response envelope) '''
        e='<error>0</error>'
        if error!='': e='<error>1</error><message>%s</message>'%error
        r=''.join([self.xmlhdr,'<response>',e,'</response>'])
        return r


which is mainly about handling the error returns. The persistent store, as I say is very dumb, and simply consists of an XML file for this toy, which uses:

class Handler:
    ''' This provides a simple persistence store for incoming trackbacks.
    Makes no assumptions beyond assuming the incoming ping is
    an xml fragment. No logic for ids for the trackbacks etc ... this is
    supposed to be dumb, and essentially useless for real applications!!
    (ElementTree is imported at module level, using the copy shipped with
    these files.) '''
    def __init__(self):
        self.xmlfile='trackback-archive.xml'
        # if file doesn't exist create it on store ...
        try:
            t=ElementTree.parse(self.xmlfile)
            self.data=t.getroot()
        except:
            self.data=ElementTree.Element("trackbackArchive")
    def checkTargetExists(self,target):
        ''' check whether the target link exists '''
        for t in self.data:
            if t.attrib.get('permalink')==target: return 1
        return 0
    def urlstore(self,target,ping):
        ''' stores a ping supplied as an xml fragment (string) '''
        node=ElementTree.fromstring(ping)
        return self.store(target,node)
    def store(self,target,ping):
        ''' stores a ping and associates it with target, quite happy for the
        moment to have duplicates - accepts either a StandardPing or a bare
        ElementTree element '''
        if hasattr(ping,'element'):
            element=ping.element
        else:
            element=ping
        #regrettably I think we have to test each child for the attribute name,
        #I'd prefer to use an xpath like expression ... but don't know how.
        for t in self.data:
            if t.attrib.get('permalink')==target:
                #add to an existing target element
                t.append(element)
                break
        else:
            #create a new target element
            t=ElementTree.Element('target',permalink=target)
            t.append(element)
            self.data.append(t)
        ElementTree.ElementTree(self.data).write(self.xmlfile)
        return 0
    def retrieve(self,target):
        ''' retrieves the target and any pings associated with it '''
        for t in self.data:
            if t.attrib.get('permalink')==target: return ElementTree.tostring(t)


Finally, for handling the ping, I found this useful:

class StandardPing:
    ''' Defines a standard trackback ping payload. Use as a toy to validate
    existing standard pings and convert to XML or urlencode ...
    (ElementTree and urllib are imported at module level.) '''
    def __init__(self,argdict=None):
        ''' Instantiate with payload or empty '''
        self.allowed=('title','url','excerpt','blog_name')
        self.element=ElementTree.Element('ping')
        for key in (argdict or {}):
            self.__setitem__(key,argdict[key])
    def __setitem__(self,key,item):
        ''' set item just as if it is a dictionary, but keys are limited '''
        if key not in self.allowed: self.reject(key)
        e=ElementTree.SubElement(self.element,key)
        e.text=item
    def reject(self,key):
        raise ValueError('Invalid key in TrackBack Ping: '+key)
    def toXML(self):
        ''' take element tree instance and create xml string '''
        return ElementTree.tostring(self.element)
    def toURLdata(self):
        ''' take element tree instance and create url encoded payload string '''
        s=[]
        for item in self.element:
            s.append((item.tag,item.text))
        return urllib.urlencode(s)
    def noValidURL(self):
        ''' At some point this could check the url for validity etc '''
        e=self.element.find('url')
        if e is None: return 1
        if e.text=='': return 1
        return 0


Update: 15th March, 2006: Use of blogname replaced with blog_name (might as well get it right :-)

by Bryan Lawrence : 2006/03/01 : Categories ndg computing (permalink)

context tokens, attribute certificates and proxy certificates

My colleague Stephen Pascoe has joined the blogosphere, and his initial post is about security tokens, following my drivel about single sign on.

Stephen describes a situation where service A wants to invoke service B using some sort of credential for a user. The use of service A has been authorised by the use of an NDG attribute certificate.

This is pretty much exactly what RFC3820 proxy certificates are for, except that they are mainly about authentication, and there the assumption most certainly is that you don't invoke service A unless you trust it to only pass your proxy onto services that you have invoked via it. One of Stephen's points is that we might as well use our attribute certificate as a proxy certificate. I think he's right.

In this case I think most of the rest of the issues about sessionID go away once you accept that the attribute certificate can be used as a proxy certificate and that one has to trust service A ... and that in this case service A is acting as a portal to service B.

The point about single sign-on, and where a sessionID is necessary, is the case where a browser is talking to service A and then service B and we're trying to avoid the user signing on at A and B.

by Bryan Lawrence : 2006/02/27 : Categories ndg computing : 2 comments (permalink)

web service wars

The web-service wars have broken out again. Tim Bray has a nice summary. The bottom line is well captured by Dare Obasanjo thus:

At best WS-* means you don't have to reinvent the building blocks when building a service that has some claims around reliability and security. However the specifications and tooling aren't mature yet. In the meantime, many of us have services to build.

and that is of course why I'm interested - I have to know when this stuff is mature (if ever). People keep asking why we aren't using this or that "grid" or "web-service" technology, and the reason is always

  1. It doesn't do what we want it to do, and

  2. It's not ready for a developer who doesn't have a mate in their core team!

In the UK, OMII is trying to do something about hardening these packages, but they're not hardening the ones we want ... so we're still trying to get WS-Security interoperating between python and java installations in our own home-grown security stack. Well, strictly speaking we're not yet using WS-Security as per the spec, but our roadmap says we will later this year ... but to do that we'll almost certainly have to engineer a solution in the python web services library ...

Meanwhile, the OGC webservice stack is far more mature ... but even there, building real web feature servers depends on building real application schema, which are XML schema, and they're subject to this problem (also Dare):

After working with XSD for about three years, I came to the conclusion that XSD has held back the proliferation and advancement of XML technologies by about two or three years. The lack of adoption of web services technologies like SOAP and WSDL on the world wide web is primarily due to the complexity of XSD. The fact that XQuery has spent over 5 years in standards committees and has evolved to become a technology too complex for the average XML developer is also primarily the fault of XSD. This is because XSD is extremely complex and yet is rather inflexible with minimal functionality. This state of affairs is primarily due to its nature as a one size fits all technology with too many contradictory design objectives. In my opinion, the W3C XML Schema Definition language is a victim of premature standardization.

(Mind you, it's exactly the complexity of XSD that dragged the grid world towards WSRF, and yet that's probably got even more folk throwing bricks at it than the simpler WS-stacks).

Sometime soon I'll be blogging about some of our experiences in the area of trying to build application schema of GML, and issues with the tooling and XML schema etc.

by Bryan Lawrence : 2006/02/24 : Categories ndg computing : 1 comment (permalink)

Might trackback grow up?

I've been interested in trackback since before I started blogging. I have mused about it in the past, and still think it has significant potential. Our claddier project will exploit a modified version of trackback to explicitly make citation links between repositories.

So it's great to discover via Sam Ruby that there is a proposal to take TrackBack through the internet standards track, with the explicit aim of addressing:

  • Standardization

  • Protocol Extensibility

  • Authentication

  • Better documentation

and perhaps other aims of merging ping and pingback and removing the RDF nastiness.

So, mug like I am, with no time, I've still signed up for the mailing list ...

by Bryan Lawrence : 2006/02/24 : Categories computing : 3 comments (permalink)

Rights and Disclaimers

Some of you will have noted that I replaced my copyright statement with a disclaimer, and added a creative commons license to my menu bar. I wanted to do two things:

  • Make clear that this is my personal blog, the fact that we host it at the badc does not make any part of the academic system (including my funders or employers) responsible for what I write. Of course, they may take down anything that could bring them into disrepute, so it's up to me to try and avoid that.

  • Make clear that this material is in the public domain, with the proviso that my employers would not allow anyone other than themselves make financial gain from my work (since most of this blog is work based).

(I felt obliged to do those two things because of various corporate rumblings which I may get to blog about sometime in the future). I did think about moving my blog off to a personal website, but there are two fundamental reasons for not doing that:

  • My academic output is owned by my employer, so they could still ask me to remove material they considered their IPR. If it's their IPR, it might as well go up here (under badc.rl.ac.uk). However, I maintain most of my production is not commercially useful, so it can be public, and I can protect that assertion for the cclrc by using the particular creative commons license I have chosen.

  • Most of what I blog about is relevant to my position, and it's the public who pay for my opinions, so here they are. The very small amount of other stuff just helps give you context for my opinions. Obviously most of what gets written here is subjective, so the more you know about me, the more you can choose how much salt to use :-)

I still plan to roll out blogging technology more widely for ncas staff.

The bottom line here is that blogging is something new that can add value to the relationship between the scientific community and the public who fund them. It can also improve collaboration within the scientific community. Hence, it is right and proper for us to support weblogs for members of staff.

by Bryan Lawrence : 2006/02/23 (permalink)

Two papers published

Can't remember the last time I had two academic papers published in the same month (if ever), but meanwhile:

  • Juckes, M.N. and B.N. Lawrence, 2006: Data Assimilation for Reanalyses: potential gains from full use of post-analysis-time observations. Tellus A 58 (2), 171-178. doi:10.1111/j.1600-0870.2006.00167.x

  • G.J. Fraser, S.H. Marsh, W.J. Baggaley, R.G.T. Bennett, B.N. Lawrence, A.J. McDonald and G.E. Plank, 2006: Small-scale structures in common-volume meteor wind measurements. Journal of Atmospheric and Solar-Terrestrial Physics, 68, 317-322 doi:10.1016/j.jastp.2005.03.016

(Abstracts and links to preprints are on my publications page)

by Bryan Lawrence : 2006/02/21 (permalink)

What is single sign on? Done properly - it's a grid!

In the NDG our access control system (aka security) is designed to allow single sign-on across a number of different domains. It's important, I think, that it really is single sign-on ... see for example the torturous explanations of Trevin Chow of the MS Passport team via Dare Obasanjo.

I can think of three classes of single sign-on:

  1. Portal based: the portal holds your credentials and passes your activities to services which are displayed by the portal ... this is probably the easiest to implement.

  2. Distributed-Services: where a service is being instantiated in a number of places, but where there is one logical session which can be used to hold the credentials and verify identity etc.

  3. Multiple-Services, multiple sessions: In this case the security session needs to be logically disaggregated from sessions associated with individual service sessions. There are at least two ways of doing this too:

    1. Have one master security session and all service sessions are clients of that security session (i.e. a physical version of the logical version) - but we need to decide on the client interface.

    2. Somehow replicate credentials in all sessions (i.e. physically the security and service sessions are replicated), but then the issue is that we must somehow keep the security sessions in sync.

Within NDG, we've more or less rejected the simple version of 1 - that is, everyone uses the same portal (while we could all use a portal copy, we would then end up in a variant of 3 because that's what would result from navigation from one portal to another).

In the first instance, we could build versions of 2 for different services using the same software library, but recognising that as users migrate from one service to another they'd end up logging in again, we decided that we have no choice but to go to a version of 3.

In terms of where we are now, we'll probably have to go with a version of 3b, which is ugly, and migrate to 3a, which ought to be easier to upgrade as other people's security paradigms begin to meet our requirements (we're not aware of anything out there that does right now, hence having to roll our own).
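
To make option 3a slightly more concrete, here is a minimal sketch (in Python; the class and method names are invented for illustration, not NDG code) of a master security session with service sessions as clients of it:

import uuid

class SecuritySession:
    """Master security session (option 3a): holds the credentials once and
    hands out an opaque session id which client services pass around."""
    def __init__(self, user_id, credentials):
        self.session_id = str(uuid.uuid4())
        self.user_id = user_id
        self._credentials = credentials   # e.g. proxy and attribute certificates

    def get_credentials(self):
        # a real implementation would check the request against policy here
        return self._credentials

class ServiceSession:
    """A service session is just a client of the security session: it holds a
    pointer to it, never its own copy of the credentials (that copy is option 3b)."""
    def __init__(self, security_session):
        self.security = security_session

    def do_work(self):
        creds = self.security.get_credentials()
        return "working for %s with %d credential(s)" % (self.security.user_id, len(creds))

if __name__ == "__main__":
    sso = SecuritySession("someone", ["proxy-cert", "attribute-cert"])
    portal = ServiceSession(sso)        # the first service the user touches
    data_service = ServiceSession(sso)  # a second service reuses the same session
    print(portal.do_work())
    print(data_service.do_work())

The only point of the sketch is that the credentials live in one place, and everything else holds nothing more than a pointer to them.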

Foster defines a grid as having three main characteristics:

  • standard open protocols,

  • no centralised control, and

  • non-trivial quality of service.

Leaving aside the protocols and quality of service for the moment, a key requirement is the lack of centralised control. So we think that, in truth, it's the third version of single sign-on that is the central characteristic of a grid, provided one can ensure that the security session can be instantiated once from and at any place within the grid, and client services from a variety of locations can exploit it. (The emphasis is to make the point that multiple users of the grid may end up with instantiated security session control at multiple distributed locations, although one hopes that any one user exploits only one security-control session at any one time).

Of course what makes it interesting is when one wants to pass a pointer to the security session between various services as one carries out service chaining ...

Within NDG we can also imagine other common session requirements that all services will require (like logging and history keeping), so it makes sense to make those properties of the security-control session as well.

by Bryan Lawrence : 2006/02/20 : Categories ndg computing : 1 comment (permalink)

information overload

So I've just had a week off, following a week out of the office. Tomorrow I'm back to work, and the information overload fear is building. Of course, despite being out of the office and on leave, I've been reading my email, so I only have 169 unread email messages and approximately 600 read but needing me to do something (even if it's only to delete them). Still, that's the best part of a day's work should I ever actually go through them properly. These are not emails from mailing lists - I filter those, and they aren't in that total. While there are about another 200 of those technically unread, that rather implies that I have read the others, which I haven't ... I may have looked at the subject lines ... or I may have just block-marked some lists as read. I also have 1821 RSS/atom feed articles unread (and again, that's after block-marking some feeds as read when I haven't looked at them).

Last year I ended the year with 2059 messages left un-(actioned or filed). I have no idea how many feed and mailing-list articles I simply didn't even see.

So why am I writing about this? Because I so totally agree with Dare Obasanjo:

traditional mail readers do a poor job of enabling people to manage the amount of information they consume today... mail readers already suck at dealing with email information overload let alone when RSS feeds are added to the mix.

and

With RSS, we've had the opportunity to experiment with different models of presenting information to users from "river of news" style aggregators to personalized portal pages instead of sticking to the traditional 2 or 3 pane readers which dominate email and news readers ... Unfortunately, the major browser vendors haven't gotten in on the act. Instead of using RSS as an opportunity to explore new ways of presenting information to users we've seen rather lame attempts at RSS integration into the browser.

So nothing out there yet ...

But why not? For a long time I've hoped that the Chandler Personal Information Manager would ride to the rescue, but it seems they've got buried in the calendaring issue (of which more later), so the current release apparently won't help. But it doesn't seem like such a big ask. If I had a month (yeah, ok, I know it'd take longer to do it properly, but even improperly would be better than my current information world)... I'd design a mailing interface that

  1. allowed me to tag all incoming email with a keyword, which could be based on the contents of any of the headers, and which would be automatically associated with all future emails with that "feature".

    1. The tags could also be associated with feed articles

    2. The tags, which would be essentially virtual folders, could be arranged in multiple hierarchies, any of which could be displayed and navigated.

  2. provided proper search. Note that articles would never disappear even when deleted from the tag hierarchy; all (except those marked as junk) would be put into a Lucene-based bucket for retrieval via search.

    • and all attachments would obviously be parsed and available to the search

    • as would keywords that could be associated with image attachments

  3. would allow allocation of priorities to individuals and subjects, and let me organise my information using these priorities.

  4. allowed me to optionally respond to incoming email from people I don't know (and which my spam filter has left in) with a message saying something like "your email may be read within 15 days, but is not in a priority read queue - ring me if it's urgent".

  5. Interacted with an MS Exchange server for calendaring ... faultlessly (I don't know if Chandler is doing this, but if it isn't then all that effort on calendaring won't be much use to me; there is no chance of our corporate calendaring moving away from MS Exchange in the near to medium term).

(For the record, I currently use kmail+akgregator+outlook via crossover office to mismanage my personal information).
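
As a minimal sketch of item 1 above (this is not a real mail client; the header patterns and tag names are invented purely for illustration), header-based tagging might look something like this:

import email
from email import policy

# invented keyword rules: any header whose value contains the pattern gets the tag
RULES = [
    ("List-Id", "ndg", "ndg"),
    ("From", "@badc", "badc"),
    ("Subject", "CF", "cf"),
]

def tags_for(raw_message):
    """Return the set of tags (virtual folders) an incoming message belongs to."""
    msg = email.message_from_bytes(raw_message, policy=policy.default)
    tags = set()
    for header, pattern, tag in RULES:
        value = str(msg.get(header, ""))
        if pattern.lower() in value.lower():
            tags.add(tag)
    return tags

if __name__ == "__main__":
    raw = b"From: someone@badc.example\nSubject: CF standard names\n\nhello\n"
    print(tags_for(raw))   # {'badc', 'cf'}

The same rules could obviously be run over feed articles too (item 1.1), which is the attraction of treating the tags as virtual folders rather than physical ones.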

by Bryan Lawrence : 2006/02/19 : Categories computing (permalink)

I can't ignore javascript any longer

Well, actually I can ... Anyway, some will remember that I'm trying to ignore javascript because I don't have enough time to pay attention (and in truth, nowadays, I probably have no good reason to play with it).

However, if I want a good reason, then this page of Yahoo tricks via Simon Willison is good enough. Drag and drop that works in a browser is in my humble opinion amazing (apparently cross browser, and certainly in my konqueror browser)!

by Bryan Lawrence : 2006/02/15 : Categories computing (permalink)

Another quiet time

I'm in the middle of an intensive period of activity, ranging from (a couple of weeks ago) trying to reorganise the structure of the group I'm responsible for to a barrage of meetings, and now chairing a week's workshop in Exeter working on GML application schema for meteorological data. Despite all that, I found time to let my daughter infect me with some nasty bug, which thankfully seems to be slowly receding. Next week I'm on leave ... maybe I'll find time for blogging then ... or maybe it'll wait til I get back to work!

by Bryan Lawrence : 2006/02/07 (permalink)

schema

Sometimes you just don't want to use a word because it carries too much baggage for some audiences:

www.synonym.com reports:

Sorry, I could not find synonyms for 'schema'.
Overview of noun schema

The noun schema has 2 senses (no senses from tagged texts)
                                       
1. schema, scheme -- (an internal representation of the world; an organization of
concepts and actions that can be revised by new information about the world)
2. outline, schema, scheme -- (a schematic or preliminary plan)

I like the definition (1), I just want a synonym ... it's not much to ask for :-)

by Bryan Lawrence : 2006/01/27 : Categories curation metadata (permalink)

Merging Metadata Always Lowers Quality

I just found Stefano Mazzocchi's Blog via Dare Obasanjo. It looks like Stefano doesn't blog very frequently, but what he does write is well worth reading. His latest, entitled "On the quality of Metadata" makes the assertion that

merging two (or more) datasets with high quality metadata results in a new dataset with much lower quality metadata.

He goes on

The "measure" of this quality is just subjective and perceptual, but it's a constant thing: everytime we showed this to people that cared about the data more than the software we were writing, they could not understand why we were so excited about such a system, where clearly the data was so much poorer than what they were expecting.

Hmmm. Then he gives an example, which is probably best summarised from Dare's blog as

being able to say that two items are semantically identical (i.e. an artist field in dataset A is the same as the 'band name' field in dataset B) doesn't mean you won't have to do some syntactic mapping as well (i.e. alter artist names of the form "ArtistName, The" to "The ArtistName") if you want an accurate mapping.

and the latter is often just hard. Stefano goes on to say that to do this

you need to further link your metadata references "The Beatles" or "Beatles, The" to a common, hopefully globally unique identifier.

before you can reliably do this; and to do that, you're essentially appealing to a higher-order metadata construct (the one that maps these artists onto a global URI). In fact, Stefano covers some of this in an earlier post:

The problem is rather simple, really: words are not unique identifiers for concepts. Everybody knows this very well: synonyms exist in every language. So, all you need to start is to create unique identifiers for your tags, but if you don't do it well enough, it doesn't scale globally.

Well, that's the same as saying that as much as possible, you want to exploit controlled vocabularies within your semantic tags ... which is why we put so much effort into it.

(Actually, as an aside, the older post has a really interesting idea about how to avoid using common denominator controlled vocabularies and still end up with agreed definitions starting from different tags. He says he's implemented something based on it ... one day I think I should chase that up).
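
Just to make the two steps concrete (the band names and identifiers below are invented for illustration), the "easy" syntactic mapping plus the controlled-vocabulary lookup might look like:

VOCAB = {
    # invented identifiers, standing in for a globally unique URI per concept
    "the beatles": "urn:example:artist:beatles",
}

def normalise(name):
    """The syntactic step: turn 'Beatles, The' into 'The Beatles'."""
    name = name.strip()
    if name.lower().endswith(", the"):
        name = "The " + name[:-len(", the")]
    return name

def identify(name):
    """The semantic step: map the normalised name onto an agreed identifier."""
    return VOCAB.get(normalise(name).lower())

if __name__ == "__main__":
    for n in ("Beatles, The", "The Beatles", "Beetles"):
        print(n, "->", normalise(n), "->", identify(n))

Of course the last case is the real problem: no amount of string mangling gets you from a misspelling to the right identifier unless someone maintains the higher-order metadata (the vocabulary) itself.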

by Bryan Lawrence : 2006/01/22 : Categories ndg metadata curation (permalink)

Bibliographic References, ISO19115 and NumSim

One of the things one can imagine our NumSim discovery metadata supporting is the ability for someone to identify that a particular model component (say the atmosphere) has a bibliographic reference associated with it that describes some part of the system, and then to find all other simulation datasets for which the atmosphere is associated with the same citation. What this implies is that

  1. We need to support bibliographic references associated with particular parts of the NumSim extension (as opposed to simply listing them as part of the parent DIF), and

  2. We should describe such references in a standards compliant manner, and in such a way that we can format the reference both for automatic discovery and for humans to view on the screen (or paper).

The first of these is handled in the current schema simply by having reference elements (of type xs:string) in a reference list, itself a subelement of the model components.

Given NumSim is aimed eventually at ISO19115, I thought I'd take a look at the machinery for supporting this "properly", which I naively thought would be "obvious". Within ISO19115, it's clearly the CI_Citation datatype. Ideally then, we should have references of type CI_Citation. At the same time, within CLADDIER we know we have to be able to map between the library community's standard ways of referencing citations ... which we might assume to be based around Dublin Core and/or ISO690.

Currently in NumSim we might have:

<NS_Reference>
Pope, V. D., M. L. Gallani, P. R. Rowntree and R. A. Stratton, 2000: 
The impact of new physical parametrizations in the Hadley Centre climate model -- HadAM3. 
Climate Dynamics, 16: 123-146. 
</NS_Reference>

How would we code that in ISO19115 1? Well, it looks horrific. I think the clear intention is that authorship should be through the citedResponsibleParty element (see CI_ResponsibleParty), so let's lay this one out in pseudo XML:

<CI_Citation>
<title>The Impact of new physical parameterizations ...</title>
<citedResponsibleParty>Pope ...</citedResponsibleParty>
<citedResponsibleParty>Gallani ...</citedResponsibleParty>
<series><name>Climate Dynamics</name><issueIdentification>16</issueIdentification>
<page>123-146</page></series>
</CI_Citation>

Well, that makes some sense (though what about using collectiveTitle for the journal title?), but we need to dive into that citedResponsibleParty some more. Presumably we should have, for example:

<citedResponsibleParty>
<individualName>Pope, V.D.</individualName>
<role codeList="?" codeListValue="?"/>
</citedResponsibleParty>

which is, I think, what the version I've got of ISO19139 would have; i.e., we have to jump off to someone else's code space (Dublin Core), so this would be a DC.creator tag? That seems rather unfortunate and cumbersome. If we're going to someone else's code space anyway, then why not avoid the CI_Citation type for a bibliographic entry altogether, and go straight to Dublin Core or ISO690?

Note that if we stick with this approach there's no sensible way to format the reference either: because we can't put the initials in their own tag, when we go to XHTML we can't, for example, put the first author surname-first and the remaining authors the other way around.

Alternatively, we could do this (ignoring the XML representation for a moment):

CI_Citation.title=The Impact of new physical parameterizations ...
CI_Citation.identifier=doistring
CI_Citation.identifierType=doi
CI_Citation.otherCitationDetails=the reference string which is our "standard reference".

In which case we could use ISO690 to get the last string "right", but relying on the DOI will only work for "modern" references in some journals.
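
For what it's worth, here is what that flatter alternative looks like when naively serialised (a sketch only: I'm ignoring namespaces and the real ISO19139 encoding, and the DOI string is just the placeholder from above):

import xml.etree.ElementTree as ET

def citation(title, identifier, identifier_type, other_details):
    """Build the flat CI_Citation-style record sketched above (element names
    follow the pseudo notation in the text, not real ISO19139)."""
    ci = ET.Element("CI_Citation")
    ET.SubElement(ci, "title").text = title
    ET.SubElement(ci, "identifier").text = identifier
    ET.SubElement(ci, "identifierType").text = identifier_type
    ET.SubElement(ci, "otherCitationDetails").text = other_details
    return ci

if __name__ == "__main__":
    ref = citation(
        "The impact of new physical parametrizations in the Hadley Centre climate model -- HadAM3",
        "doistring",
        "doi",
        "Pope, V. D., M. L. Gallani, P. R. Rowntree and R. A. Stratton, 2000: ... Climate Dynamics, 16: 123-146.",
    )
    print(ET.tostring(ref, encoding="unicode"))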

Either way, it's not a great advertisement for ISO19115: reference lists are a must for any metadata standard. While I appreciate that CI_Citation supports a number of citation mechanisms throughout ISO19115, you would have thought making bibliographic citations easy would have been a fundamental requirement! I'm tempted to use an XML schema version of BibTeX like this one, which I suppose could legitimately be built in via an extension.

If anyone can give me a cleaner example of how it should be done, I'd be grateful! Even better, some public domain xml schema to go with the examples.

1: here I'm hopping between the content standard, and a draft of ISO19139 dated September 2005 (ret).

by Bryan Lawrence : 2006/01/20 : Categories curation ndg claddier metadata iso19115 : 0 comments (permalink)

Predictability and Risk

Nearly a year ago, I thought I might write something down about how predictability works in the climate sense, as opposed to, say weather forecasts. Suffice to say, I never got around to it. When I saw Roger Pielke's blog I started reading it because I agree with his basic premise about the importance of other parts of the earth system in making good climate predictions (see this post for a recent example or this older one). But over time as I read his group blog entries, the pattern of nonsense about climate predictability became worse and worse, and I stopped reading it regularly. I got really fed up when I read the guest article on predictability by Henk Tennekes as it follows Pielke's statement that predictions of regional and global climate change are not science.

I was fed up enough that I had started bookmarking some material, and was going to write something. But I don't have to: James Annan has done a fabulous job in three posts.

I liked the first post because it makes really clear the distinction between predictive uncertainty that exists because of inherent unpredictability (he gives the example of coin tosses; I've used dice), and the uncertainty that exists because of lack of knowledge about the system (which is mostly what Pielke is on about). It's much better than my efforts.

I liked the second post, because it nails the issue about misusing the observations to validate climate predictions, and the third really puts it to bed. Why oh why does Pielke not get it?

The bottom line here is that we do the best predictions we can, and we undoubtedly need to produce the best possible indication of our uncertainty in that prediction. At which point, as James and others say, we're into a betting situation. So what should we do?

In terms of weather, the Met Office recommends that, as a general guide, one should take action when the probability of an event exceeds the ratio of protective cost to potential loss (C/L). It's a simple betting argument: if the event has probability p, protecting always costs C while not protecting costs L with probability p, so protecting is the better bet whenever p > C/L. We do the thing that minimises our expected losses.

Now the reality is that something might be happening to the climate (ok, is!). There might be some costs associated with climate change. It's logical to try and estimate the probability of various possible futures, because in terms of betting, we've already made the bet (our planet is on the line), so the only real question is what to do about it. We need to estimate that C/L ratio. Sure, we can argue that we can never know enough to calculate the odds, but using that as an argument for not doing anything is the same as saying we're sure that we will suffer no losses. Now how certain can we be of that?

Update (25/01/2006): James is still writing posts on predictability; the following one is on how one should interpret statements like "tomorrow there is a 70% chance of rain".

Update(02/03/2006): James is still writing posts. The first one is yet more on the comparison between Bayesian and frequentist uncertainty, with an excellent example from his back garden. The bottom line of the second one is that the verification problem for climate simulations is somewhat harder than that of weather forecasting, because really the only way of having any confidence is using a procedure called cross-validation - which relies on holding back some data from the methodology used to construct the model - and then testing the model against that ... but it's darn hard to hold back observations from the model builders ...

by Bryan Lawrence : 2006/01/18 : Categories climate environment (permalink)

Feedback on the Future of CF

On November the 9th I publicised the first public draft of the Future of CF white paper both on my blog and the cf mailing list.

The significant community responses are listed here. They are broadly approving, so the next steps are to integrate the specific suggestions into our plans, and deal with the issues raised.

by Bryan Lawrence : 2006/01/18 : Categories cf : 1 comment (permalink)

One more person taking climate seriously

Would you believe that following a chain of thought about a computer science issue led me to David Ignatius in The Washington Post:

So many of the things that pass for news don't matter in any ultimate sense. But if people such as Lovejoy and Kolbert are right, we are all but ignoring the biggest story in the history of humankind. Kolbert concluded her series last year with this shattering thought: "It may seem impossible to imagine that a technologically advanced society could choose, in essence, to destroy itself, but that is what we are now in the process of doing." She's right. The failure of the United States to get serious about climate change is unforgivable, a human folly beyond imagining.

No idea who this bloke is, but hopefully he has some readership ...

by Bryan Lawrence : 2006/01/18 : Categories environment (permalink)

Trends in Data Analysis

I finally got around to looking at Scientific Data Management in the Coming Decade (Gray et al., 2005) 1.

Anyway, they make a number of good points, among them:

  • Data volumes are approximately doubling each year.

  • Data analysis tools have not kept pace with our ability to capture and store data.

  • I/O bandwidth has not kept pace with storage capacity: in the last decade while storage capacity has grown more than 100-fold, storage bandwidth has improved only about 10-fold (actually, I think they're underestimating the storage growth, see On Moore's Law).

  • Increasingly, the datasets are so large, and the application programs are so complex, that it is much more economical to move the end-user's programs to the data and only communicate questions and answers rather than moving the source data and its applications to the user's local system.

  • Replicating the data in science centres at different geographic locations is implied in the discussion above.

There is also an excellent discussion of the importance of metadata, and for once, there is an understanding of the breadth of the metadata required to support both the automatic analysis systems we will need to cope with vast data archives, and the individual scientists who now must exploit other people's data (and so need their wisdom encoded as metadata).

They also discuss database systems, and follow with a discussion which makes the point that

  • (the) Fortran/C/Java/Python file-at-a-time procedural data analysis is nearing the breaking point.

and that the only way to make progress now is to

  1. to use intelligent indices and data organisations to subset the search,

  2. to use parallel processing and data access to search huge datasets within seconds, and,

  3. to have powerful analysis tools that they can apply to the subset of data being analysed.

They also see NetCDF and HDF as nascent database systems, which is fair, because that's how we're using them now ... at a recent meeting I was explaining the problems I have with existing "databases" and how we use files, and someone pointed out that our entire BADC system was in fact a DBMS ...
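
The "nascent database" usage pattern is easy to illustrate: we routinely subset by index instead of reading whole files. A minimal sketch, assuming the netCDF4 Python bindings and an invented file and variable name:

from netCDF4 import Dataset   # assumes the netCDF4-python bindings are available

# open a (hypothetical) model output file and pull out a small index-based
# subset rather than reading the whole file: this is the database-like usage
with Dataset("model_output.nc") as ds:          # invented filename
    tas = ds.variables["temperature"]           # invented variable name
    window = tas[0, 100:140, 300:360]           # first time step, a lat/lon window
    print(window.shape, window.mean())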

Their take on databases is that we should be able to exploit their internal parallelisation and indexing strategies, once they become more sophisticated (by exploiting object relational technologies and embedded programming languages).

The bottom line of this paper is that they identify three technical advances that will be crucial to scientific data analysis, one of which is:

  • extensive metadata and metadata standards that will make it easy to discover what data exists, make it easy for people and programs to understand the data, and make it easy to track data lineage.

It's nice to know that our ndg activity is slap bang in the middle of an area of key technological advance. I will also be blogging soon about our "Big Data Analysis Network" ... we take delivery of the hardware soon, and I should have some interesting stories to tell about parallelisation of data analysis and I/O bandwidth ...

1: This is published in Technology Watch, which is another of these web-only journals which doesn't seem to have hitched into the DOI system ... which is OK provided they promise to keep the same URI structure ... forever. (ret).

by Bryan Lawrence : 2006/01/17 : Categories curation badc ndg (permalink)

Real Climate Science

This is what I call real climate science: Jones et al. (2005). While I haven't got access to the full text, the abstract sounds fascinating. Firstly:

From 1950 to 1999 the majority of the world's highest quality wine-producing regions experienced growing season warming trends.

which is the sort of extra information that simply confounds the global warming skeptics, some of whom are fixated on the definition of "global mean temperature".

Secondly:

Currently, many European regions appear to be at or near their optimum growing season temperatures, while the relationships are less defined in the New World viticulture regions.

They imply that as the world continues to warm, some places which are now optimal for wine growing will become less so, and obviously others will become more optimal. Me I think this is good news for Hawkes Bay (which is where I am from). It already produces great wine, but a slightly longer season should make the Cabernet Sauvignon more reliable - then watch out Bordeaux! Maybe global warming is not all bad - ok that was tongue in cheek :-).

by Bryan Lawrence : 2006/01/15 : Categories climate (permalink)

New version of NumSim available

A new version of the NumSim metadata schema, along with documentation, is now available. NumSim has been designed as a package module to add to standard metadata discovery records, to help discriminate between datasets which have resulted from simulations.

NumSim is currently available and documented on an NDG wiki here, but will move to a more permanent location when we get our new ndg website up and about. Development versions will probably move onto this website though ... but not yet.

by Bryan Lawrence : 2006/01/13 (permalink)

End-to-End System Design

I've just skimmed through Saltzer et al. (pdf) on end-to-end arguments in system design, via Dave Orchard. The paper has had an interesting history, apparently being so timeless that it was published four times in different places in the decade between 1981 and 1991, and is now rearing up in Dave's mind in the context of WS-messaging specs.

It's got some rather interesting discussion on where (in the software stack) and how reliability of (amongst other things) file transfer should be calculated and ensured, but the bottom line was:

End-to-end arguments are a kind of "Occam's razor" when it comes to choosing the functions to be provided in a communication subsystem. Because the communication subsystem is frequently specified before applications that use the subsystem are known, the designer may be tempted to "help" the users by taking on more function than necessary. Awareness of end-to-end arguments can help to reduce such temptations.

Which I think could easily be applied more generally (i.e. outside of communications) simply by removing the word communication.

The paper itself is worthy of more time than I'm going to give it now, but even so, given we are right now engineering various reliability checks into file transfers between HPCx (one of our national supercomputers) and our systems, across a network which is encrypted by default, the whole thing resonated rather a lot. I hope I find time to come back and digest it a bit more carefully.

by Bryan Lawrence : 2006/01/12 : Categories computing (permalink)

"Some" Scientists are Baffled.

There is some hot air floating around (e.g. RealClimate via Stoat) about a recent Nature paper which identifies a potentially important source of the important greenhouse gas, methane. Apparently the bottom line is that they showed that a thus far unrecognised process causes living plant material to emit methane in quantities that could be very significant globally. (I say apparently, because I can't get access to the original paper tonight - I'm not sure why not, I ought to be able to) 1.

While this isn't my area of expertise (I sometimes wonder if I have one any more), I can't help thinking it's not that big a deal. It's certainly a surprise, if the results stand up to closer scrutiny, but the reality is that the relationship of the biosphere with the atmosphere is pretty poorly constrained in both observations and models. From one of the comments in RealClimate, I gleaned that it appears that it would be a positive feedback (more methane would be emitted as temperature increases), but the fact is that the observations of global methane haven't changed ... we still know how much is in the atmosphere, and we know what it's been recently, and despite what the Guardian says, we know pretty much how it moves around once it's there. It is clear that (if true) this will affect predictions, but I'd lay a bet that there are loads of other biosphere interactions 2 in the earth system that are at least as poorly quantified, and will lead to similar scale issues with our predictions, but the bottom line is that carbon dioxide is the biggest problem ... so there are unlikely to be substantive changes in climate change predictions at the global scale.

1: This is of course very frustrating, I get so little time to keep up with "real" science ... while I'll probably be able to get access tomorrow, I probably wont have time to chase it up then ... (ret).
2: The key word is interactions, one hopes that not too many budgets have unknown sources or sinks of such scale (ret).

by Bryan Lawrence : 2006/01/12 : Categories climate environment (permalink)

On DRM and curation.

Shelly Powers (via Dare Obasanjo) has an excellent blog entry on DRM. It's all good stuff, as are the comments ... but there is rather a lot of it. So this is by way of a summary of some of the key points, leading to the crunch point on curation.

Shelley (in comments on another Weblog):

What debate, though? Those of us who have pointed out serious concerns with Creative Commons (even demonstrating problems) are ignored by the creative commons people. Doc, you don't debate. You repeat the same mantra over and over again: DRM is bad, openness is good ...

(Me, I want to chase up what those problems are with Creative Commons), but that's for another day, meanwhile, onwards ... this was all based on a quote from Lloyd Shepherd: ...

let's face it: we're going to have to have some DRM. At some level, there has to be an appropriate level of control over content to make it economically feasible for people to produce it at anything like an industrial level. And on the other side of things, it's clear that the people who make the consumer technology that ordinary people actually use - the Microsofts and Apples of the world - have already accepted and embraced this. The argument has already moved on.

and in that entry there was link back to Chris Anderson:

The real question is this: how much DRM is too much? Clearly the marketplace thinks that the protections in the iPod and iTunes are acceptable, since they're selling like mad. Likewise, the marketplace thought that the protections in Sony's digital music players (until recently, they didn't support MP3s natively) were excessive and they rejected them.

Well, I'm not so sure that the marketplace actually understands the limitations that DRM might come with 1 ... because they haven't yet met the full consequences (which is what the first comment in response to Shepherd's post states by comparing "the market knows" with saying the market accepted Thalidomide ... not that that's a good example, because the market didn't even have access to the information, but you get the picture) ...

Anyway, I digress, I wanted to summarise some of the 71(!) comments on Shelley's post which I found interesting. Charles said:

DRM, like almost every technology, is amoral. The argument, as I see it, is whether it is better to use copyrights to pursue legal remedies after infringement, or to implement a technological means to prevent it before the fact. I don't see much difference between the two positions. DRM only destroys value if you find value in copyright infringement.

Seth said:

Regarding DRM only destroys value if you find value in copyright infringement. That's absolutely right. Such value is called "fair use". Really. "Fair use" is what we call infringement where there is value in it. And DRM can forbid fair use ...

and then there was some discussion that made the point that DRM doesn't stop you going back to the original copies (not DRM'd) for some "fair use" reason, but

  • the argument boils down to the fact that this is often so difficult (time-consuming) that it's impractical (which, while not the same as being impossible, is as near as dammit)

and

  • if you have to get permission from the copyright holder, it's not fair use, it's a license.

This latter point was well made by Doug Lay (op cit):

This is why there is a difference between implementing DRM to prevent infringement before it happens, and relying on copyright enforcement to deal with infringement after the fact. With copyright enforcement, a court can review a defendant's claims of Fair Use and make a decision. Since DRM is algorithmic and Fair Use cannot be determined algorithmically, DRM looks to be incompatible with Fair Use.

Kjetil Kjernsmo said:

DRM is IMHO not amoral, in that what makes DRM DRM is who controls the keys. Removing the keys from the individual who has the player is IMHO immoral, and so, the technology that has only that intention is immoral.

The second big issue here is whether an open channel can co-exist with DRMed channels. I believe it cannot, as it will always be possible to distribute copies of works with DRM removed through those channels if they are allowed to exist.

I may of course be wrong, but the opposite, that all channels of communication is DRMed, will be so disastrous for democracy that it requires careful consideration.

Which is a pretty strong argument, and to my way of thinking, the next point makes a pretty sensible constructive suggestion:

Arj says:

What DRM would I find useful and helpful? I'm not sure. I think that watermarking PDFs is one example of DRM that I'm not opposed to. Rather than disabling functionality, it ties the original purchaser with the item, so that if someone else copies or prints it, it can be linked back to who actually paid for it. I'm not sure if that's strictly DRM. However, I think that's one way that we could use technology to help enforce copyright infringement without disabling the copyrighted material so the legitimate owner can't use it fairly.

... At which point I'm at comment 31 and I'm beginning to think I'm not learning any more, but let's persevere ... aaah, most of the argument is now about how artists get paid (the thread is fundamentally about video and music, and it's a legitimate concern, but I'm interested from a curation perspective) ...

... but I did find mcubed's comment rather interesting on the general level, especially about first sale. The point being that DRM results in

the loss of first-sale privileges, which has been a key component of copyright law almost from the beginning of the modern conception of copyright ... DRM can only be good when it is implemented in a way that doesn't interfere with whatever first-sale privileges we retain ...

but as he points out the U.S. copyright office appears to have found that the first-sale doctrine doesn't appear to apply to digital media. (Why appears? I'd like to understand that too, but can hardly claim it's of professional interest). Then he gets to the real crux of it:

... others not bothered by DRM are ignoring is that "creators" are not "owners." The public is the owner. All creative work is bound for the public domain; copyright is the mechanism we have evolved to impart temporary, limited property-like rights to creators so that they can profit from their creations, but that does not make their creations their property. It is the responsibility of the owners of that property (the public) to ensure that it is preserved and accessible for generations to come, and any proprietary DRM schemes interfere with our ability to do that, one way or another.

Which is why I'm interested 2 in all these things, because if DRM takes off, it'll eventually interfere with our ability to preserve material for the future. It's going to be hard enough just dealing with formats that are transparent, but changing with time ...

1: Not that it's that bad for iTunes; as Kevin Marks says, the only value lost by the iTunes DRM is some convenience: the media and time cost of burning your own CD copy (ret).
2: Professionally. We should all be interested in a private capacity. (ret).

by Bryan Lawrence : 2006/01/11 : Categories curation : 2 comments (permalink)

Next Steps for CF

I have yet to work on the revised version of the CF whitepaper, and summarise all the feedback received, but meanwhile, the BADC is taking a few steps forward.

Alison Pamment will be taking responsibility for the CF name management. She'll be beginning one day a week from February, working up to three days a week in April.

Her first activities will be:

  • Educating herself on CF,

  • Understanding how Jonathan has managed CF standard name evolution

and in March we will move the standard name part of CF to the BADC (unless anyone has violent objections).

by Bryan Lawrence : 2006/01/10 : Categories cf (permalink)

Python Projects I should investigate some more

Many (most?) of my python posts are about things which have caught my attention fleetingly, and which I wanted to mark for later consumption ... not that I seem to do much later consumption ... but I might!

James Reid at EDINA has pointed out a few things that he's spotted that I haven't mentioned, but could/should have. His list includes:

  • Kid - a really nice XML templating engine with XSLT-like abilities - which is something I've been paying attention to, but hadn't blogged on, because I wanted to take the time to understand it a bit better first ... oh for the time!

  • MAKI - not used this myself because of the extensive dependencies list but looks very interesting. Following James's suggestion, I looked at this for a few minutes, but noted that the developer moved over to Zope a year ago, and plans no updates. Fair enough, but that's enough to turn me off.

  • PyMeld - a poor man's Kid but nice all the same. This is something else I didn't know about. It looks like a nice idea to cleanly separate the html from the templating engine (python). There is also an interesting discussion about possible successors here.

  • CherryPy - a nice pythonic OO web framework without the hassles/peculiarities of Zope. This is another one that I have been following. Indeed one of my colleagues is actively investigating it for use in the BADC.

James had another couple of useful snippets for me, including pointing out that mapserver enterprise are using the bdbxml database at their backend and the existence of two more projects I didn't know about: AHAH, and GeoRSS.

I'm not too interested in chasing up AHAH right now, as I'm trying to pretend javascript doesn't exist (that way I don't have to understand it). However, I did chase up GeoRSS and it looks very interesting and relevant to NDG (indeed, we've already decided to use one of their conventions for bounding boxes).

by Bryan Lawrence : 2006/01/09 : Categories python ndg (permalink)

Old Media - Antibiotics and GE go green

Sometime over the Christmas holidays I contemplated a world without (print) newspapers. I get newspapers delivered on Saturday and Sunday mornings, and my weekend days usually begin by digesting the sports pages followed by the other pages. However, because the Christmas break was essentially a week of weekend days, most of which didn't have newspapers with them, I found myself on the net most mornings, just reading for the hell of it. As I say, I found myself thinking: I can live without a print newspaper ...

... but yesterday and today, The Guardian and The Observer delivered papers with articles that reminded me exactly why print newspapers are so interesting. Essentially, one turns the pages, and things catch your eye that simply wouldn't show up if you were parsing a website ... (either computationally or visually). I think I'll keep my newspapers for a few years yet ...

The two (completely unrelated) articles that caught my eye were:

  1. An article on antibiotics where it was pointed out that the last fifty or sixty years may well have been the golden age of medicine while we had efficient antibiotics, but it may be as few as five years before hospitals become too dangerous to use as there may be no effective antibiotics for new "superbugs". Quite clearly the bloke who the story was about was pushing a fearful state 1, because he rightly points out that the search to replace antibiotics with completely different forms of therapy has desultory funding. While there was nothing particularly new in the article to that point, I did find absolutely fascinating

    • the discussion of plasmids and how they are involved in antibiotic drug resistance, and

    • the concept of bacteriocin - a chemical produced by bacteria to kill competitors (something I'd never heard of) - and the idea that his research group might be onto a new way of killing bacteria

  2. A story about GE launching an ecomagination campaign. It turns out that GE launched this drive in May 2005, although the Observer only picked up on it today. Nonetheless, as Caulkin says, the fact is that a major company is setting targets that managers will have to meet. We can but hope others follow suit. The targets include, by 2010, achieving

    • real greenhouse gas emission savings

    • delivering $20 billion per annum sales on green products, and

    • doubling investment to $1.5 billion per annum on research and development for their green technologies (which range from cleaner coal and power plants to fuel cells and wind turbines).

1: In one small way, I agree with Michael Crichton - politically, to get things to happen, it appears one needs to create a state of fear before folk/governments will open their purse strings. Where I disagree with Crichton is that he implies that it's in the interests of national states to ensure the public are always in a state of fear ... hence my choice of words: a fearful state. (ret).

by Bryan Lawrence : 2006/01/08 : Categories environment (permalink)

bt billing

Today I got my BT bill. I guess I shouldn't be surprised that the "free" home highway termination move I was promised turned up as a £55 charge.

A 16 minute phone call today resulted in an agreement to refund the £55 on my next bill. I'll have to keep a watch on that, as last time BT promised to refund me something on my next bill it never happened ...

It seems that everyone I speak to at BT is really nice (which is a vast improvement on the situation a decade ago), but somehow they can't make things work properly. I'll report back on this in three months!

by Bryan Lawrence : 2006/01/07 : Categories broadband : 1 comment (permalink)

More on comment spam

Well, yesterday I modified leonardo's code so

  1. I know when I get a comment (ok, we should have had that in there before, but there's only so much time in a day), and

  2. It is easier to delete multiple comments.

I'm glad I did that: it turns out in under 12 hours I had got 34 comments - all spam. Most of these were to specific articles, so I've turned comments off on them. I guess we'll now find out whether a human is involved or not. Meanwhile, this weekend I'll get on with the captcha idea.

Update: I've half done the captcha, and implemented it, so the net effect is you won't be able to post any comments to my blog until I've finished it :-) (Now you can!)

by Bryan Lawrence : 2006/01/06 : Categories python : 1 comment (permalink)

Enough to make you cry

And I used to like Coldplay ... sigh.

by Bryan Lawrence : 2006/01/05 (permalink)

Oxford Seminar in March

I've been invited back to Oxford AOPP to give a seminar in March. The seminar coordinator tells me it'll be over ten years since I last gave a seminar there. Last time it was on gravity waves (no, not gravitational waves!). This time it won't be directly about atmospheric science research, but about very relevant technologies:

Communicating scientific thoughts and data: from weblogs to the NERC DataGrid via data citation and institutional repositories

For centuries the primary method of scientific communication has revolved around papers submitted either as oral contributions to a meeting or as written contributions to journals. Peer reviewed papers have been the gold standard of scientific quality, and the preservation of scientific knowledge has relied on preserving paper. The establishment (whatever that is) has assessed productivity and impact by counting numbers of papers produced and cited. Yet clearly much of the bedrock of scientific productivity depends on

  1. Faster and more effective communication methods (witness the growth of preprint repositories and weblogs) and

  2. Vast amounts of data, of which more and more is digital.

Digital data leads to opportunities to introduce data publication (and citation) and to exploit meta-analyses of data aggregations. Electronic journals and multimedia lead to a blurring of the distinction between "literary" outputs (i.e. papers) and "digital" outputs (i.e. data). Web publication alone leads to a blurring of the understanding of what publication itself means. Digital information is potentially far harder to preserve for posterity than paper.

This seminar will review some of these issues, and introduce two major projects led by the NCAS British Atmospheric Data Centre, and embedded in earth system science, aimed at developing technologies (and cultures) that will change how we view and do publication and data exploitation.

by Bryan Lawrence : 2006/01/05 : Categories ndg badc curation (permalink)

Whither parameters in metadata

Last year, there was considerable chatter on the metadata mailing list about the right place to record which parameters were measured in an ISO19115 metadata record.

The WMO has had the same problem with working out how to produce the WMO core ISO19115 schema. First iterations assumed the best thing to do was to put such things in feature instances themselves in the metadata documents. While this is legal (although not the way it was done by the WMO tech team initially), if one follows the appropriate extension mechanisms and produces metadata instances for all feature instances, it does lead to cumbersome metadata.

I think this is easily avoided by only using ISO19115 metadata for datasets which are aggregations of feature instances, but that still leaves the problem of where the measured parameters go ...

Various responses on the metadata list implied that parameters themselves (as opposed to keywords) should end up in feature descriptions, which some thought, like the initial WMO thoughts, ought to be included in ISO19139 documents (ISO19139 is the XML encoding of ISO19115). However, for volume reasons, and because it simply won't be right in most profiles of ISO19115 (based on ISO19139 or not), the correct thing to do is to refer to feature instance documents from within the metadata instance documents.

Regrettably, though, the most likely thing one wants to do when searching for data (i.e. parsing dataset metadata) is to search for specific types (parameters) of data, as well as to use keyword vocabularies. That implies that, whether or not the features are in the metadata instances, the parameters should be in the ISO19115 documents - so let's track down where and how the features are encoded.

Well, ISO19110 only tells us how to define feature types - not instances - and even then doesn't define a schema for them. ISO19136, the Geography Markup Language (GML), provides mechanisms for building XML application schemas, so that's the one that tells us how to define feature types in terms of XML. It's those application schemas which provide the formalism for encoding feature instances.

OK, so now we know that the parameters are encoded in data dictionaries within the instances defined by application schemas, themselves encoded using ISO19136 (GML). However, we also know that in reality we need some of these things inside our discovery documents.

My take on this is we have to use the ISO19115 extension mechanism to get the parameter names into the ISO19115 profiles we use ... and have an explicit overlap between the feature instance dictionaries and the discovery records (i.e. an overlap, not a replication of all of it). While overlaps are redundant, I think we expect to generate our ISO19115 docs from our feature instance docs anyway, so the redundancy can simply be thought of as input data rather than additional data.
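
Since the discovery records are generated from the feature instance documents anyway, the overlap is just a projection. A minimal sketch (the element names here are invented for illustration, not a real GML application schema or NDG document):

import xml.etree.ElementTree as ET

# a toy feature instance document: the element names are invented
FEATURE_DOC = """
<FeatureInstance>
  <parameterDictionary>
    <parameter name="air_temperature"/>
    <parameter name="eastward_wind"/>
  </parameterDictionary>
</FeatureInstance>
"""

def discovery_parameters(feature_xml):
    """Project the parameter names out of a feature instance document, ready to
    be pushed into the (extended) ISO19115 discovery record."""
    root = ET.fromstring(feature_xml)
    return [p.get("name") for p in root.findall(".//parameter")]

if __name__ == "__main__":
    print(discovery_parameters(FEATURE_DOC))   # ['air_temperature', 'eastward_wind']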

Maybe this is all obvious anyway (after all, it's explicitly what the NDG metadata taxonomy states) but I felt like spelling it out.

by Bryan Lawrence : 2006/01/03 : Categories ndg metadata iso19115 : 2 comments (permalink)

Lunchtime Usage

Today's task: find out whether this blog actually gets any readership (given most of my comments are spam, and there are very few inbound referrers). Mind you, I like writing my blog, because it's cathartic, and it's useful as a notekeeper, so readership's not entirely what it's all about, but readership is nice :-)

So, I revised my stats programme yet again to look at the blog entries and atom entries separately (and removed a few more bots), to get the following usage last year:

  2005    blog    atom  
  Jan    411    66  
  Feb    248    49  
  Mar    265    51  
  Apr    462    47  
  May    456    46  
  Jun    427    60  
  Jul    425    89  
  Aug    756    74  
  Sep    755    75  
  Oct    665    75  
  Nov    593    77  
  Dec    1023    114  

Now this is a bit difficult to interpret, because I've deliberately removed all the bots which load the atom script, so this doesn't include folk who read via aggregator providers only, but it's healthy enough anyway.

So my next question is how many of these people come back to blog entries on more than five days?

It turns out 109 people came back on more than five days to blog entries ... maybe there are some folk who find my musings vaguely interesting. Great. Thanks!
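
For the record, the "more than five days" number is just a count of distinct days per visitor in the access log. A minimal sketch of the calculation (assuming a common-log-format file, using the client address as a crude visitor id, with bots already filtered out, and an invented path for blog entries):

from collections import defaultdict

def repeat_readers(logfile, min_days=5):
    """Count visitors who requested blog entries on more than min_days distinct days."""
    days_by_visitor = defaultdict(set)
    with open(logfile) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 7:
                continue
            host, date_field, path = parts[0], parts[3], parts[6]
            if "/blog/" not in path:                      # invented blog entry path
                continue
            day = date_field.lstrip("[").split(":")[0]    # e.g. 03/Jan/2005
            days_by_visitor[host].add(day)
    return sum(1 for days in days_by_visitor.values() if len(days) > min_days)

if __name__ == "__main__":
    print(repeat_readers("access.log"))   # hypothetical log file name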

by Bryan Lawrence : 2006/01/03 (permalink)

What to do about comment spam?

After a bit of trolling round the web, it turns out that the low down on the issues was in a Mark Pilgrim article which was on diveintomark.org (and appears to be only viewable via a google cache). It doesn't make pleasant reading.

I've also read up about captchas, and suppose that I could add such a facility to Leonardo ... but there don't appear to be any easy-to-install, standalone Python options for doing it.

At the moment I have what almost looks like manual spamming - there are only about a dozen episodes, mind you, each of which includes forty-odd entries. If it is manual spamming, a captcha wouldn't work anyway. However, I'm minded to try a simple math captcha, as opposed to an image one. I hope I can work out how to code one up pretty quickly. If I understand the basic philosophy, we

  1. put up a prompt equation,

  2. encode the answer in the form, and prompt for it

  3. on reading the form, read the encoded version, and the user version and compare.

Obviously this can be relatively easily defeated by targeted code which parses the page for the equation, but I can't believe anyone would do that for Leonardo ... yet ... so maybe it'll do for the moment. Before I get going, has anyone got some easy code for doing this that I can hijack?
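
For what it's worth, here is a minimal sketch of those three steps (not Leonardo code; rather than putting the answer in the form in the clear, it hides it as a hash salted with an invented server-side secret):

import hashlib
import random

SECRET = "change-me"   # invented server-side secret, so the hidden field can't be trivially forged

def make_challenge():
    """Steps 1 and 2: produce the prompt equation and the encoded answer to embed in the form."""
    a, b = random.randint(1, 9), random.randint(1, 9)
    prompt = "What is %d + %d?" % (a, b)
    token = hashlib.sha1(("%s:%d" % (SECRET, a + b)).encode()).hexdigest()
    return prompt, token

def check_answer(user_answer, token):
    """Step 3: re-encode what the user typed and compare it with the hidden form field."""
    try:
        value = int(user_answer)
    except ValueError:
        return False
    return hashlib.sha1(("%s:%d" % (SECRET, value)).encode()).hexdigest() == token

if __name__ == "__main__":
    prompt, token = make_challenge()
    print(prompt)
    # brute-force the answer here just to demonstrate the round trip
    print(any(check_answer(str(v), token) for v in range(2, 19)))   # True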

by Bryan Lawrence : 2006/01/02 : Categories python (permalink)

Why turn down satellite?

In my "penultimate post" on broadband, I said I'd explain why I was turning down satellite broadband for terrestrial ADSL. The answers come down to:

  1. perceived contention ratio (aka real throughput), and

  2. reliability of service, and

  3. cost.

My satellite broadband provider has been micronet broadband, which in principle provided up to a 2 Mb/s downlink, and I'm going to BT broadband (for my first year anyway) with a 512 kb/s service.

The way I had my service configured (for "one-way" satellite broadband) I used

  • ISDN for my uplink - which itself leads to extra costs (for the privilege of a second line one doesn't ever use as a phone line), and for the connect time (at this distance from the exchange I was lucky to exceed 14 kb/s using a conventional analogue modem ).

  • an F10 router, which created a VPN over the ISDN uplink and downlink so that all traffic from my home net used the satellite downlink.

So, taking the reasons in turn:

  1. My 2 Mb/s service had a monthly limit, after which the service dropped back to 512 kb/s anyway. However, I found that I was limited to 512 fairly quickly most times, and I figure that had to be a contention issue, as I rarely hit my monthly limit (not that there was any way for me to know). Sometimes performance was abysmal, which again must have been contention (fortunately this wasn't often) ... Of course my uplink was limited to ISDN bandwidth. In general I found the latency issues associated with the satellite not to be a problem, with one big exception: I couldn't run a VPN tunnel through the existing VPN tunnel to the satellite ... so that meant if I wanted to use work services from within the LAN, I dropped back to ISDN ... and unfortunately the F10 router was too braindead to take advantage of the second line (which meant the second ISDN line was only useful to ensure that the telephone was always working; it made no difference to my ISDN-only connection speeds).

  2. In just over a year I had three significant failures of service, once because of a reconfiguration on the satellite which micronet failed to tell me about, and twice because of failures of their billing systems. I think what this indicates is that their back office systems aren't yet up to it - but when you realise that BT can't get their act together either, I think one just has to say c'est la vie.

  3. If one was only paying for the satellite service, then 20 quid per month for 512-2048 kb/s might not be too bad, but add the ISDN costs on top, and it's very expensive.

The bottom line: Given I can get 512 kb/s ADSL, then I'm better off with terrestrial broadband, but if I could only have got 256, then I wouldn't have hesitated to stay on the satellite. Can I recommend micronet? Well, I get the impression that they're getting better, and when I got to talk to Kamran (who seems to be the technical brains), I found him very helpful and relatively efficient (unlike their website, which is one of the most poorly organised I've ever seen - they're better than their website might indicate :-).

Will I consider the micronet satellite booster product to get faster than 512 (the BT guy tells me that even with MaxDSL, the best I could get here would be a 768 kb/s service, and that with reduced reliability wrt the 512 service)? Leaving aside how I might get it to work from linux, I'm not so sure, as I can't quite see that the price reflects the likely usage (surely the only reason to use such a service would be to occasionally download big things - and the limits on "gearing" would make such usage very expensive).

by Bryan Lawrence : 2006/01/01 : Categories broadband (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.