
Bryan's Blog 2006/03

konqueror, safari, and xslt

Can it really be true that neither konqueror nor safari can render xml files with xslt as the stylesheet? (Yes it can!) Can it really be true that only IE and mozilla can do this? (Not even opera. Apparently!) Does anyone know a timescale for the kde or apple folks to sort this out? Do they care? Should I care?

by Bryan Lawrence : 2006/03/28 : Categories computing : 0 trackbacks : 0 comments (permalink)

Some practicalities with ISO19139

A number of us in the met community are rushing towards implementing a number of standards that should improve interoperability across the environmental sciences. One of those is ISO19139, which (we hope) will soon be the standard xml implementation of ISO19115 (the content standard for discovery metadata).

Both ISO19139 and ISO19115 have built into them the concept that communities will want to build profiles which are targeted to their own uses. Such profiles may consist of a combination of constraints (subsets of the standards) and/or extensions (to the standards). I've introduced some of these concepts before (parameters in metadata and xml and dejavu - and the conclusions of both still stand).

It has been suggested that this ability to constrain and/or extend ISO(19115/19139) could impact on interoperability, because by definition an extended document cannot conform to the ISO19139 schema, and so one could not consume a document in an unfamiliar profile (and possibly thereby lose the benefit of a standard in the first place).

I think this is a red herring. I don't believe anyone is going to have an archive of "pre-formed" xml instance documents which conform to ISO19139 (but see below). In nearly all cases folks will have their own metadata, which they will export into XML instances as necessary (from something, a database, flatfiles, whatever).

Where those xml instances are going to be shared by a community, they will be in the community profile - and xml schema can be used to validate them against the community schema (which had better have all the extension material in a new package in a new namespace). But that package can import the gmd namespace (gmd is the official prefix of the iso19139-specific schema).

So, these will be valid instances of the schema against which they are declared, which means the xml schema machinery (love it or hate it) will be useable. Consumers of these instances can either parse the file for stuff they understand, or exporters can choose to ensure that they are transformed from the community profile into vanilla iso19139 on export - either from the source, or from the community profile instance.

The first option (consuming an instance of a schema which we don't fully understand but which imports and uses - significantly - a namespace that we do understand) is that discussed by Dave Orchard in an xml.com article.

Orchard has some useful rules, of which number 5 is relevant:

5. Document consumers must ignore any XML attributes or elements in a valid XML document that they do not recognise.
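
As a sketch of what that rule means for a consumer in our case, one might simply walk a community-profile instance and keep the elements in the namespace one does understand (the gmd namespace URI below is quoted from memory, so treat it as an assumption to check against the published schema):

import xml.etree.ElementTree as ET

# assumed ISO19139 'gmd' namespace URI - check it against the published schema
GMD = '{http://www.isotc211.org/2005/gmd}'

def elements_we_understand(instance_file):
    '''Walk a community profile instance and keep only the gmd elements,
    ignoring anything in namespaces we don't recognise (Orchard's rule 5).'''
    tree = ET.parse(instance_file)
    return [el for el in tree.iter() if el.tag.startswith(GMD)]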

The second option (export a valid iso19139 document from your profile when sharing outside your community) will be much easier if those who create community profiles always produce an accompanying xslt for producing vanilla iso19139 ... this shouldn't be overly onerous for someone who can build a valid schema in the first place :-).

This last option will be absolutely necessary when considering the role of community, national and international portals. They will have archives of ISO19139-like documents, and it will be incumbent on the data providers to ensure that they can harvest the right documents - I don't think it makes sense to rely on the portals consuming non-specific documents and then processing them to get the right records for their portals! (This paragraph added as an update in response to email from Simon Cox, who reminded me about this application.)
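
Either way, the export step itself should be cheap once the community has published its stylesheet; a minimal sketch, assuming lxml and an invented stylesheet name:

from lxml import etree

# profile_to_iso19139.xsl is the (hypothetical) stylesheet published
# alongside the community profile's schema
transform = etree.XSLT(etree.parse('profile_to_iso19139.xsl'))
vanilla = transform(etree.parse('profile_record.xml'))
open('iso19139_record.xml', 'wb').write(etree.tostring(vanilla))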

But all this technical machinery is the easy part. The hard part is defining a dataset so we can decide what needs an ISO19139 record! (Some of the issues for citing a dataset are discussed here, and most of them come down to: What is a dataset?)

by Bryan Lawrence : 2006/03/28 : Categories metadata ndg iso19115 : 1 trackback : 3 comments (permalink)

Novel Escapism

James Aaach, in an off-topic comment on my recent post about journals and blogging, introduced me to www.LabLit.com. I suspect I'll waste a lot of time there on quiet evenings.

The site is full of interesting essays, but what caught my attention immediately, on a day when Gordon Brown is trying to up spending on science and technology education in the UK, was the forum discussion on Gregory Benford's contention that the rise of fantasy fiction has

... led to a core lessening of ... the larger genre, with a lot less real thinking going on about the future. Instead, people choose to be horrified by it, or to run away from it into medieval fantasy ... all of this as a retreat from the present, or rather, from the implications of the future ... (no) accident that fantasy novels dominate a market that once was plainly that of Heinlein, Clarke, Asimov, and Phil Dick ... (led) to the detriment of the total society, because science fiction, for decades really, has been the canary in the mineshaft for the advanced nations, to tell us what to worry about up ahead.

The forum had some interesting threads in it (not least a list of science fiction that Benford considers has significant science in it ... which includes much I haven't read ... to my surprise). I was particularly struck by Benford's point:

that our culture has uplifted much of humanity with technology, but needs to think about the ever faster pace of change and bring along those who may well react against it. Genomics, climate change, biotechs, etc--all need realistic treatment in what-if? scenarios.

This got me thinking about the role of science fiction per se. Surely first and foremost it's about entertainment? I don't buy Benford's argument that people are consciously avoiding thinking about the future - I never once read a science fiction novel because I thought it was about a possible future, I read them for fun. Which brings me back to Gordon Brown ... While I think Benford's right nonetheless that science fiction can illuminate the future, I think an even more important cultural reason to bring people back to science fiction from fantasy is that for many of us, science fiction is what got us (dare I say) turned on to the fact that science is interesting. It's not much practical use to anyone if all forms of escapism are contemporary, historical or fantasy. More good quality science fiction will go a long way to enticing those young people into science that Gordon Brown and the UK need for the future. And good quality science fiction (that is readable science fiction which has digestible facts in it) might just help even the folk without science genes to appreciate what it's about.

I guess the problem is it's hard to write science fiction, and much of the potential audience hasn't the appetite for it any more. What I don't really understand is why not.

(As an aside, in response to Benford's position, further down, in the same link, Darrell Schweitzer makes some points about the dangers of the drift to fantasy amongst which he lists:

.. And of course the other real danger comes from writers like Michael Crichton who tell their readers that all scientific innovation is bad and scary, carried out by naive or corrupt people, and will likely kill you. THAT is socially harmful.

Amen!)

by Bryan Lawrence : 2006/03/22 : 1 comment (permalink)

New Word - Bliki

I've just learnt a new word. Apparently this blog (a leonardo instance) is a bliki. Having learnt it I propose to forget it. It seems like an unnecessary distinction to make and some more geek jargon ... but it does have a ring to it :-)

by Bryan Lawrence : 2006/03/22 : 2 comments (permalink)

Parameterisation of Orographic Cloud Dynamics in a GCM

Sam Dean, Jon Flowerdew, Steve Eckermann and I have just submitted a paper to Climate Dynamics:

Abstract. A new parameterisation is described that predicts the temperature perturbations due to sub-grid scale orographic gravity waves in the atmosphere of the 19 level HADAM3 version of the United Kingdom Met Office Unified Model. The explicit calculation of the wave phase allows the sign of the temperature perturbation to be predicted. The scheme is used to create orographic clouds, including cirrus, that were previously absent in model simulations. A novel approach to the validation of this parameterisation makes use of both satellite observations of a case study, and a simulation in which the Unified Model is nudged towards ERA-40 assimilated winds, temperatures and humidities. It is demonstrated that this approach offers a feasible way of introducing large scale orographic cirrus clouds into GCMs.

Anyone interested should contact Sam for a preprint. If you haven't got his details, let me know.

by Bryan Lawrence : 2006/03/21 : Categories climate : 0 trackbacks (permalink)

The Lockin results in the Exeter Communique

A few weeks ago I went quiet for a week while I attended a workshop at the Met Office. We called our workshop "the lock-in", as the original proposal was that we would be fed pizzas through a locked door until we came out with a GML application schema for operational meteorology. Well, we were allowed out, but being the Met Office we had no effective internet connection, and we were too knackered for anything in the evenings ...

Anyway, the important thing is that we had a really excellent week, and made some real progress on a number of issues. The results are summarised in the Exeter Communique (pdf):

This document summarises the discussions and recommendations of a workshop convened by the UK Met Office to examine the application of OGC Web Services and GML Modelling to Operational Meteorology. The workshop addressed the need to define best practice for development of data standards that span multiple subject domains, jurisdictions and processing technologies. The findings of this workshop will be of use not only to organisations involved in the processing of meteorological data, but any community that requires interoperability along a data processing chain.

Further information is at the AUKEGGS1 wikipage.

1: Australia-UK collaboration on Exploitation of Grid and Geospatial Standards.

by Bryan Lawrence : 2006/03/20 : Categories ndg metadata : 2 comments (permalink)

tim bray wrong for the first time

Regular readers of this blog will know that Tim Bray's blog is a reliable source of inspiration for me ... however, for the first time, I think he's got it wrong (ok, this is a bit tongue in cheek, but anyway). He said, quoting Linda Stone:

But then she claims that email is ineffective for decision-making and crisis management. The first is not true, I have been engaged in making important complex decisions mostly by email, mostly in the context of standards efforts, for about ten years now. If she were right, we could disband the IETF.

I think the first is nearly true, and the second is absolutely true. (Of course, it's not obvious from that quote that she was actually making two different points, but we'll assume she is from his response).

I would claim that you can't make decisions unless you have control over the information flow that leads to your decisions. In particular, during crises, one needs to be able to identify the relevant information quickly. Who amongst us really can control their email? No, I mean control!

Of course Tim is a self confessed insomniac, so maybe his bandwidth is still large enough to cope. Mine isn't.

by Bryan Lawrence : 2006/03/20 : 0 trackbacks : 1 comment (permalink)

Journals and Blogging

On Thursday I gave my seminar at Oxford. Of course I wrote the abstract months before the talk, so I didn't cover half the things I said I would in any detail, but for the record, the presentation itself is on my Talks page.

Of the sixty slides in the presentation, most of the discussion afterwards concentrated on six slides at the end, and it became clear that I had confused my audience about what I was saying, so this is by way of trying to clear up some misconceptions.

For many in the audience I think this was their first significant exposure to the ideas and technologies of blogging (and importantly trackback). I introduced the concept with these factoids (from the 15th of March):

  • Google search on "climate blogs" yields 33,900,000 hits.

  • Technorati is following 30 million blogs

    • 269,404 have climate posts

    • 1,953 climate posts in "environmental" blogs

    • 131 posts about potential vorticity (mainly in weather/hurricane blogs)

  • Very few "professional" standard blogs in our field, but gazillions in others! (Notwithstanding: RealClimate and others.)

I then compared traditional scientific publishing and self publishing, and this is where I think my message got blurred. Anyway, the comparison was along these lines:

                     Traditional Publishing                          Self Publishing
  Pluses
  Review             Peer-Review; the gold standard                  What people think is visible! Trackback, Annotation
  Quality measures   Citation                                        Citation, Trackback and Annotation
  Feedback           Publish then email, slow                        Immediate Feedback, Hyperlinks
  Indexing           Web of Science etc, reliable                    Tagging, Google, just as reliable
  Readability        Paper is nice to read                           PDF can be printed
  Other                                                              You can still publish in the traditional media
  Minuses
  Review             Peer review is not all it could be              No formal peer review
  Indexing           Proprietary indexing (roll on google-scholar)   Ranking a problem: finding your way amongst garbage
  Other              Often very slow to print                        Trackback and Comment Spam
                     Libraries can't afford to buy copies
                     (limited readership)

I then concluded that the big question is really how to deal with self publishing and peer review outside the domain of traditional journals, because I think for many of them their days are numbered (except possibly as formal records).

Most of the ensuing discussion was predicated on the assumption that I was recommending blogging as the alternative to "real" publishing, despite the fact that earlier I had introduced the RCUK position statement on open access, and I then went straight on to introduce Institutional Repositories and the CLADDIER project.

So, let me try and be very explicit about the contents of my crystal ball.

  1. The days of traditional journals are numbered, if they continue to behave the way they do, i.e.

    1. Publishers continue to aggregate, and ignore the (declining) buying power of their academic markets.

    2. They do not embrace new technologies.

    3. They maintain outdated licensing strategies.

    4. Two outstanding exemplars of journals moving with the times (for whom this is not a problem) are:

      1. Nature.

        1. Check out Connotea (via Timo Hannay).

        2. Note also their harnessing of Supplementary Online Material is a good thing. They're only one step away from formally allowing data citation!

        3. Their licensing policy is fair: authors can self-publish into their own and institutional archives six months after publication. (Nature explicitly does not require authors to sign away copyright!)

      2. Atmospheric Chemistry and Physics.

        1. Uses the same license as this blog (Creative Commons Non-Commercial Share-Alike).

        2. Peer Review is done in public, and the entire scientific community can join in. Check out the flow chart.

  2. Early results and pre-publication discussion will occur in public using blog technologies!

    1. Obviously some communities will hold back some material so as to maintain competitive advantage (actually, I think the only community that should do so is graduate students, who may need more time from idea to fruition; the rest of us will gain from sharing)

    2. We may need to have a registration process and use identity management to manage spam on "professional blogs", but individuals will probably continue to do their own thing.

    3. Some institutions will need to evolve their policies about communication with the public (especially government institutions).

    4. There will be more "editorialising" about what we are doing and why, and this will make us all confront each other more, and hopefully increase the signal level within our community.

  3. Data Publication will happen, and then we will see bidirectional citation mechanisms (including trackback) between data and publications.

    1. By extending trackback to include bidirectional citation mechanisms and implementing this at Institutional Repositories (and journals) we will see traditional citation resources becoming less important. (There is a major unsolved problem though: there might be multiple copies of a single resource - author copy, IR copy, journal of record copy - done properly they all need to know about a citation, which means it'll have to behave more like a tag than a trackback alone ... however, I still think the days of a business model built around a traditional citation index may be numbered.)

To sum up:

  • Many journals will die if they don't change their spots

  • Trackback linking will become very important in how we do citation.

  • Post publication annotation will become more prevalent.

  • Blogging (technologies) will add another dimension to scientific discourse.

by Bryan Lawrence : 2006/03/20 : Categories curation : 1 trackback : 2 comments (permalink)

Back end or front end searching?

Searching is one of those things that keeps bumping into my frontbrain, but not getting much attention. I have a hierarchy of things grabbing at me:

  1. I want to search my own email (~3GB) and documents (~10GB).

    • I'm hanging out for beagle or kat to get at my kmail maildir folders in the next release of kubuntu (yes, I'm afraid of the size of the indexes I'll need).

  2. I'd like to be able to search my stuff, and some selected other stuff, and then let google have a go after that ... all in one command

    • I think I'd have to build my own "search engine" using the google web service which would be called after "my own" search engine had responded. I'll never get around to this, but

    • it would help a lot if I had search inside Leonardo, and if it could give different results depending on whether or not I was logged in, I could conceivably get around to that. I might choose one of the technologies below, or I might choose pylucene.

  3. The badc needs to provide a search interface to

    1. The mailing lists we manage.

      • These are mailman lists. One of my colleagues has had a look at searching options for that and he's come down in favour of http://www.mnogosearch.org/. Why doesn't mailman come with a search facility out of the box? Part of me says we should do a pylucene search interface and contribute it to mailman, and part of me says, run with the first option, it's easier and faster ... but see below ...

      • At the same time, I'd done a bit of a nosey, and a bit of a web search on the subject led me to (of course) Tim Bray's series of essays, and an interesting thread on Ian Bicking's blog. The latter points one at (in no particular order)

    2. We have other stuff as well (metadata in flat files, databases, data files, web pages etc).

For the badc, the big question in my mind is a matter of architectural nicety: should we rely on front-end searching (overlay a web-search engine, e.g. mnogosearch, or even a targeted google)? Or should we do our own searching on indexes we manage at the back end (i.e. avoid indexing via a/the http interface)? For some reason I hanker for the latter, but I'm sure it's stupidity ... fortunately for this last class of searching I'll not be making a decision about this (I've delegated it), but I feel like I should have an informed opinion.
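
For what it's worth, this is the sort of thing I mean by back-end searching - a toy sketch (pure python, invented names) of indexing the documents we manage directly, rather than crawling our own site over http:

from collections import defaultdict

class TinyIndex:
    '''Toy inverted index over documents we hold at the back end.'''
    def __init__(self):
        self.postings = defaultdict(set)   # word -> set of document ids
    def add(self, doc_id, text):
        for word in text.lower().split():
            self.postings[word].add(doc_id)
    def search(self, query):
        '''Return the ids of documents containing every query word.'''
        sets = [self.postings.get(w.lower(), set()) for w in query.split()]
        return set.intersection(*sets) if sets else set()

index = TinyIndex()
index.add('mail-001', 'mnogosearch versus pylucene for the mailing lists')
index.add('page-042', 'badc metadata search interface')
print(index.search('pylucene mailing'))   # -> ids of documents matching both words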

by Bryan Lawrence : 2006/03/16 : Categories badc python computing : 0 trackbacks : 3 comments (permalink)

Using the python logging module

I've been upgrading the support for trackback in leonardo (the only things left to do are to get the trackback return errors working correctly and deal with spam blocking) ... and in the process I found it useful to put the python logging module to use.

I had to do this without the standard documentation (as I was working on a train). But obviously I had python introspection on my side ... I thought it was a good result for me and for python introspection to work it all out between Banbury and Leamington Spa!

The hard bit was working out how to do it inside another class (without the docs), but once worked out, it's all very easy (I know this is in the library docs, but I feel like recording my experience :-):

import logging

class X:
    def __init__(self, *other_stuff):
        # ... whatever else the class needs ...
        f = logging.FileHandler('blah.log')
        self.logger = logging.getLogger()
        self.logger.addHandler(f)
        self.logger.setLevel(logging.DEBUG)
        self.logger.debug('Debug logging enabled')
        # ...

and now everywhere I have an X class instance, I have access to the logger, which is much easier than the way I used to do things for debugging. In my leonardo, the instance is the config instance, so obviously the logfile and level are all configurable too.
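
For example (the construction below is hypothetical - however leonardo actually builds its config instance), any code that has been handed that instance can just do:

config = X()                              # hypothetical: however the config instance gets built
config.logger.info('trackback received')  # ends up in blah.log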

by Bryan Lawrence : 2006/03/15 : Categories python (permalink)

Declining Earth Observation

Allan Doyle has highlighted a recent CNN article which reports the poor future that EO has in the American space programme. This is something that has been brewing for a while now, but seems to be getting more and more real.

The worry of course is that each cancellation could potentially set a programme back by a decade, even if someone restarts it next year. The hard part of a space programme isn't the hardware - although that's hard - nor the testing - although that's hard too! It's keeping the team together and/or documenting how one got to where one is now - reputedly why NASA can't go back to the moon right now: they've forgotten how to build Saturn V rockets. This, as Allan says, makes this a tough issue to deal with.

One can but hope that ESA picks up some of the slack ...

by Bryan Lawrence : 2006/03/15 : Categories environment (permalink)

Functional Trackback

I reckon I've got a functional trackback working now. If anyone fancies trying a trackback to this post I'd be grateful ...

This version is pretty basic: it should reject any "non-conforming" pings with an error message, but do the right thing for correctly formed pings. It doesn't yet have any anti-spam provision (which I plan to do by simply checking that citing posts exist, and have a reference to the permalink that they are tracking back to).
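
The check I have in mind is simple enough to sketch (this isn't in leonardo yet, and the function name is mine):

import urllib2

def ping_looks_genuine(citing_url, permalink):
    '''A citing post should exist, and should actually contain a link to
    the permalink it claims to be tracking back to.'''
    try:
        page = urllib2.urlopen(citing_url).read()
    except Exception:
        return False
    return permalink in page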

The url to trackback to is the permalink of the post. I'll make that more clear in the future (and add autodiscovery probably as well).

by Bryan Lawrence : 2006/03/15 : Categories python computing : 3 comments (permalink)

Meteorological Time

One of the problems with producing standard ways of encoding time is that in meteorology we have a lot of times to choose from. This leads to a lot of confusion in the meteorological community as to which time(s) to use in existing metadata standards, and even claims that existing standards cannot cope with meteorological time.

I think this is mainly about confusing storage, description and querying.

Firstly, let's introduce some time use cases and vocabulary:

  1. I run a simulation model to make some predictions about the future (or past). In either case, I have model time which is related to real world time. In doing so, I may have used an unusual calendar (for example, a 360 day year). We have three concepts and two axes to deal with here: the Simulation Time axis (T) and the Simulation Calendar. The Simulation Period runs from T_0 to T_e. We also have to deal with real time, which we'll denote with a lower case t.

  2. Using a numerical weather prediction model.

    1. Normally such a model will use a "real" calendar, and the intention is that T corresponds directly to t.

    2. It will have used observational data in setting up the initial conditions, and the last time for which observations were allowed into the setup is the Datum Time (t_d - note that datum time is a real time; I'll stop making that point now, you can tell from whether it is T or t whether it's simulation or real).

    3. The time at which the simulation is actually created is also useful metadata, so we'll give that a name: Creation Time (t_c).

    4. The time at which the forecast is actually issued is also useful, call it Issue Time (t_i). So:

      t_d < t_c < t_i

    5. A weather prediction might be run for longer, but it might only be intended to be valid for a specific period called the ValidUsagePeriod. This period runs from t_i until the VerificationTime (t_v).

    6. During the ValidUsagePeriod (and particularly at the end) the forecast data (in time axis T) may be directly compared with real world observations (in time axis t), i.e., they share the same calendar. So now we have the following: T_0 < t_c, but t_d can be either before or after T_0!

      • Note also that the VerificationTime is simply a special labelled time, this doesn't imply that verification can't be done for any time or times during the ValidUsagePeriod.

      • Note that some are confused about variables which are accumulations, averages, or maxima/minima over some Interval. None of the times above has a special relationship with such variables: when I do my comparisons, I just need to ensure that the intervals are the same on both axes.

    7. We might have an ensemble of simulations, which share the same time properties, and only differ by an EnsembleID - we treat this as a time problem because each of these is essentially running with a different instance of T, even though each instance maps directly onto t. But for now we'll ignore these.

  3. In the specific case of four-dimensional data assimilation we have:

    T_0 < t_d < t_c < t_i

    Confused? You shouldn't be now. But the key point here is that there is only one time axis and one calendar, both described by T! All the things which are on the t axis are metadata about the data (the prediction).

  4. If we consider observations (here defined as including objective analyses) as well, we might want some new time names; it might be helpful to talk about:

    1. the ObservationTime (t_o, the time at which the observation was made, sometimes called the EventTime).

    2. the IssueTime is also relevant here, because the observation may be revised by better calibration or whatever, so we may have two identical observations for the same t_o, but different t_i.

    3. a CollectionPeriod might be helpful for bounding a period over which observations were collected (which might start before the first observation and finish after the last one, not necessarily beginning with the first and ending with the last!)

  5. Finally, we have the hybrid series. In this case we might have observations interspersed with forecasts. However, again, there is one common time axis. We'd have to identify how the hybrid was composed in the metadata.

I would argue that this is all easily accommodated in the existing metadata standards: nearly all these times are properties of the data, they're not intrinsic to the data coverages (in the OGC sense of coverage). Where people mostly get confused is in working out how to store a sequence of, say, five day forecasts, which are initialised one day apart, and where you might want to, for example, extract a timesequence of data which is valid for a specific series of times, but using the 48 hour forecasts for those times. This I would argue is a problem for your query schema, not your storage schema - for storage you simply have a sequence of forecast instances, and I think that's straightforward.
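
To make that concrete, here's a sketch (class and field names invented) in which the storage layer just keeps the forecast instances as they are, and the "48 hour forecast valid at time t" question is answered entirely in the query layer:

from datetime import datetime, timedelta

class ForecastRun:
    '''One stored forecast instance - the storage schema just keeps these.'''
    def __init__(self, datum_time, issue_time, length_hours):
        self.datum_time = datum_time      # t_d: last observations used
        self.issue_time = issue_time      # t_i: when the forecast was issued
        self.length_hours = length_hours  # how far the run extends

def forecast_at(runs, valid_time, lead_hours=48):
    '''Query layer: pick the run whose 48 hour step lands on valid_time.'''
    for run in runs:
        if run.issue_time + timedelta(hours=lead_hours) == valid_time:
            return run
    return None

# a sequence of five day forecasts issued one day apart
runs = [ForecastRun(datum_time=datetime(2006, 3, d, 0),
                    issue_time=datetime(2006, 3, d, 6),
                    length_hours=120) for d in range(1, 8)]
print(forecast_at(runs, datetime(2006, 3, 3, 6)).issue_time)  # the run issued on 1 March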

I guess I'll have to return to the query schema issue.

by Bryan Lawrence : 2006/03/07 : Categories curation badc (permalink)

Climate Sensitivity and Politics

James Annan has a post about his recent paper with J. C. Hargreaves1 where they combine three available climate sensitivity estimates using Bayesian probability to get a better constrained estimate of the sensitivity of global mean climate to doubling CO2.

For the record, what they've done is used estimates of sensitivity based on studies which were

  1. trying to recreate 20th century warming, which they characterise as (1,3,10) - most likely 3C, but with the 95% limits lying at 1C and 10C,

  2. evaluating the cooling response to volcanic eruptions - characterised as (1.5,3,6), and

  3. recreating the temperature and CO2 conditions associated with the last glacial maximum - (-0.6,2.7,6.1).

The functional shapes are described in the paper, and they use Bayes theorem to come up with a constrained prediction of (1.7,2.9,4.9), and go on to state that they are confident that the upper limit is probably lower too. (A later post uses even more data to drop the 95% confidence interval down to have an upper limit of 3.9C).

In the comments to the first post, Steve Bloom asks the question:

Here's the key question policy-wise: Can we start to ignore the consequences of exceeding 4.5C for a doubling? What percentage should we be looking for to make such a decision? And all of this begs the question of exactly what negative effects we might get at 3C or even lower. My impression is that the science in all sorts of areas seems to be tending toward more harm with smaller temp increases. Then there's the other complicating question of how likely it is we will reach doubling and if so when.

This is pretty hard to answer, because it's all bound up in risk. As I said, quoting the Met Office, when I first started reporting James' predictability posts:

as a general guide one should take action when the probability of an event exceeds the ratio of protective costs to losses (C/L) ... it's a simple betting argument.
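
To put some purely illustrative numbers on that: if protective measures would cost 1 unit, and the losses they would avert amount to 10 units, then C/L = 0.1, and the betting argument says act as soon as the probability of the event exceeds 10%.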

So, rather than directly answer Steve's question, for me the key issue is: what is the response to a 1.7C climate sensitivity? We're pretty confident (95% sure) that we have to deal with that, and so we already know we have to do something. What to do next boils down to evaluating protective costs against losses.

Unfortunately it seems easier to quantify costs of doing something than it is to quantify losses, so people use that as an excuse for doing nothing. The situation is exacerbated by the fact that we're going to find it hard to evaluate both without accurate regional predictions of climate change (and concomitant uncertainty estimates). Regrettably the current state of our simulation models is that we really don't have enough confidence in our regional models. Models need better physics, higher resolution, more ensembles, and more analysis. So while it sounds like more special pleading ("give us some more money and we'll tell you more"), that's where we're at ...

... but that's not an excuse to do nothing; it just means we need to parallelise (computer geek) doing something (adaptation and mitigation) with refining our predictions and estimates of costs and potential losses.

1: I'll replace the link with the doi to the original when a) it appears, and b) I notice.

by Bryan Lawrence : 2006/03/07 : Categories climate environment (permalink)

mapreduce and pyro

I just had an interesting visit with Jon Blower - technical director at Reading's Environmental E-science centre (RESC). He introduced me to Google's mapreduce algorithm (pdf). Google's implementation of MapReduce

runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines.

and we all know the sorts of things it's doing. Jon is interested in applying it to environmental data. I hope we can work with him on that.
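
To show what the programming model itself amounts to (as opposed to Google's distributed machinery, which is the hard part), here's a single-process word count sketch in python:

from collections import defaultdict

def map_phase(documents):
    # emit (key, value) pairs, here (word, 1)
    for doc in documents:
        for word in doc.split():
            yield word, 1

def reduce_phase(pairs):
    # group the values by key (the "shuffle"), then reduce each group
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict((key, sum(values)) for key, values in grouped.items())

print(reduce_phase(map_phase(['the quick fox', 'the lazy dog'])))
# {'the': 2, 'quick': 1, 'fox': 1, 'lazy': 1, 'dog': 1}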

Obviously, I'm interested in a python implementation, and lo and behold there seem to be plenty of those too: starting with this one, which utilised python remote objects (pyro home page). I'd looked at pyro some time ago, and it seems to have got a bit more mature. However, a quick look at the mailing list led to some questions that don't yet seem to have been answered.

It's not obvious that any use by our group of mapreduce would necessarily distribute python tasks (for example, we might want to distribute tasks in C, but have a python interface to that) ... but if we do go down that route, pyro may be useful. Of course mapreduce isn't the only algorithm in town either, and these are both new technologies and we've already got enough of those ... but we can't stop moving ...

by Bryan Lawrence : 2006/03/03 : Categories ndg badc computing python (permalink)

serendipity and identity management

Two seemingly unrelated threads of activity have come together: on the one hand I want to use an identity to login to the trackback protocol wiki; on the other hand, we're talking about the difficulties of single sign on for ndg.

In the latter context, Stephen has written some notes on cross domain security. The key issue here is that we have the following use case:

  1. A user logs on to site A via a browser.

    • the only way we can maintain state is by a cookie, so site A writes a cookie to the user's browser with "login" state information.

  2. User now navigates to site B not via a link from a site A! (i.e. this is not a portal type operation).

    • because site B is not in the same domain as site A, it is unlikely that site B can get at site A's cookie (the user may have set their browser up to do this, but that's probably a bad idea for them to do, so we can't rely on this).

In the former case, we want to be able to login to a site using an identity which is verifiably mine in some way. It turns out that there are a number of identity management systems out there:

  • OpenID,

  • LID - the Light-Weight Identity.

The Yadis Wiki asserts that these two identity management systems are being used by over fifteen million people worldwide, and so the Yadis project was established to bring them together. From their wiki:

YADIS is applicable to any URL-based identity system and by no means tied to OpenID, LID, or XRI ... it became clear very quickly that the resulting interoperability architecture was much more broadly applicable. In our view, it promises to be a good foundation for decentralized, bottom-up interoperability of a whole range of personal digital identity and related technologies, without requiring complex technology, such as SOAP or WS-*. Due to its simplicity and openness, we hope that it will be useful for many projects who need identification, authentication, authorization and related capabilities.

At this stage this is all I know (except that I had a look at the mylid.net hosted service and it doesn't have any nasty terms of service ... on the other hand the site I want to get in to now only recognises OpenID). It also looks like openid might be slightly more prevalent: google on "OpenID identity" has slightly over 2 million hits, "LID identity" has just under 1.5 million hits. No idea if that's meaningful!

In any case, getting back to my serendipity point, I got all excited about LID because Johannes Ernst introduced it on the trackback list, and pointed out that

LID can be run entirely without user input (in fact, that was the original goal for authenticating requests; additional conventions then make it "nice" for a browser-based user)

It may be that this will help us avoid our cross domain cookie problem, since this is exactly what these technologies are for. So at this point it looks like I have to go understand LID and YADIS (the implication of the last link from Johannes was that OpenID does require more user input).

by Bryan Lawrence : 2006/03/02 : Categories ndg computing : 4 comments (permalink)

openid part one, why I might have to roll my own

(One day I'll blog about my day job again ... sometimes it's hard to remember that I'm an atmospheric scientist first and a computer geek second).

Anyway, on joining the trackback working group, the first thing one wants to do is annotate the wiki, which means you have to have an openid identity. Hmm. This is the first I've ever heard of openid. sixapart, who run the wiki, recommend one of MyOpenID or videntity to get an identity if you don't have one.

Ok, so off I trot. Firstly, MyOpenID. Let's have a look at their terms of service, and in particular this bit:

You agree to indemnify and hold JanRain, and its subsidiaries, affiliates, officers, agents, co-branders or other partners, and employees, harmless from any claim or demand, including reasonable attorneys' fees, made by any third party due to or arising out of your Content, your use of the Service, your connection to the Service, your violation of the TOS, or your violation of any rights of another.

This states that if someone in the U.S. gets pissed off with my content, and can't get me in the UK, they can go after me via JanRain, and then I'm liable for JanRain's attorney's fees! So a denial of service attack on me would be to threaten to sue JanRain (not me, which would require them to come to a UK court I think). Don't like this much. Reject.

Ok, let's have a look at videntity, and their terms of service:

You agree to hold harmless and indemnify the Provider, and its subsidiaries, affiliates, officers, agents, and employees from and against any third party claim arising from or in any way related to your use of the Service, including any liability or expense arising from all claims, losses, damages (actual and consequential), suits, judgments, litigation costs and attorneys' fees, of every kind and nature. In such a case, the Provider will provide you with written notice of such claim, suit or action.

This time it's the Costa Rican courts, but it's the same deal.

While I hope I never write anything that would piss someone off so much they'd sue, the point is that they should be suing me, and these particular indemnification clauses seems to imply that if they can't get to me in my court system, they can get to me in another court system via my identity manager ...

... thanks, but no thanks! I can see why these folk want to be indemnified, but by putting in either an explicit clause covering content or loose wording that has the same effect, I think they're making things worse - if they didn't have this, there would be no way anyone would sue them, because one couldn't win: there is no method by which they control my content, so a prospective suer would be wasting his/her money bringing a case. However, by putting this in, despite the fact that they (the identity managers) can't control my content, there is a route for a prospective suer to get at least costs out of me, so suddenly it's worth it, because they can directly affect me.

Now I reckon there is nearly zero risk that this would happen, but it's a matter of principle. These sorts of clauses are bad news, and it's up to us (the users) not to accept them! We shouldn't mindlessly click through!

So, I guess if I want to use openid, I need to roll my own identity. Well, I'll probably not bother, but it might, just might, be worth investigating for NDG ... in which case, the python openid software may be of interest ... but really, I don't think I'll ever get to a part two on this subject ... there are just too many other things I should be doing.

Update (7th March, 2006): Actually these indemnity clauses are rampant. Even blogger.com - see clause 13 - has one!

by Bryan Lawrence : 2006/03/01 : Categories computing : 1 comment (permalink)

Some Trackback code

So I've joined up to the trackback working group, and I decided I'd better have a proper trackback implementation to play with ... given this is all very important for claddier, the time investment was worth it.

So here is my toy standalone implementation for playing with. I have no idea whether it is actually properly compliant, but it works for me. If folk want to point out why it's wrong I'll be grateful. But note it's only toy code, it's not for real work.

There are five files:

  • NewTrackbackStuff.py - which consists of three classes which I'll document below: a TrackbackProvider (handles the trackbacks), which uses the Handler to provide a dumb but persistent store. There is also a StandardPing class to help with various manipulations ... (it's really unnecessary, but helped me with debugging and thinking etc).

  • cgi_server.py - a dumb server which runs tb_server.py on the localhost at port 8001.

  • tb_server.py - the cgi script itself (listed below), which hands incoming requests to the TrackBackProvider.

  • ElementTree.py (part of the fabulous package by Fredrik Lundh and included here for completeness)

  • test_tb_server.py which tests the trackback to the persistent store.

It's all pretty simple stuff. Anyway, this is how you invoke trackback, assuming you have the tb_server running locally on port 8001:

import urllib,urllib2

payload={'title':'a new citing article title',
		'url':'url of citing article',
		'excerpt':'And so we cite blah ... carefully  ...',
		'blog_name':'Name of collection which hosts citing article'}

s=urllib.urlencode(payload)

req=urllib2.Request('http://localhost:8001/cgi/tb_server.py/link1')
req.add_header('User-Agent','bnl trackback tester v0.2')
req.add_header('Content-Type','application/x-www-form-urlencoded')
fd=urllib2.urlopen(req,s)
print fd.readlines()

There's not much to say about that. The server looks like this:

#!/usr/bin/env python
from NewTrackbackStuff import TrackBackProvider,Handler
import cgi,os
#import cgitb
#cgitb.enable()

handler=Handler()
relative_path=os.environ.get("PATH_INFO").strip('/')
method=os.environ.get("REQUEST_METHOD")
cgifields=cgi.FieldStorage()

print "Content-type: text/html"
print

tb=TrackBackProvider(method,relative_path,cgifields,handler)
print tb.result()

So all the fun stuff is in NewTrackBackStuff in the TrackBackProvider:

# NewTrackbackStuff.py needs the bundled ElementTree and the standard urllib:
import urllib
import ElementTree

class TrackBackProvider:
    ''' This is a very simple CGI handler for incoming trackback pings; all it
    does is accept the ping if appropriate, and biff it in a dumb persistence
    store '''
    def __init__(self, method, relative_path, cgifields, handler):
        ''' Provides trackback services
            (The handler provides a persistent store)
        '''
        self.method=method
        self.relative_path=relative_path
        self.fields={}
        # lose the MiniFieldStorage syntax:
        for key in cgifields: self.fields[key]=cgifields[key].value
        self.handler=handler

        self.noid='Incorrect permalink ID'
        self.nostore='Cannot store the ping information'
        self.invalid='Invalid ping format'
        self.nourl='Invalid or nonexistent ping url'
        self.noretrieve='Unable to retrieve information'
        self.xmlhdr='<?xml version="1.0" encoding="utf-8"?>'

    def result(self):
        target=self.relative_path
        if target=='': return self.__Response(self.noid)
        if not self.handler.checkTargetExists(target): return self.__Response(self.noid)
        if self.method=='POST':
            # all incoming pings should be a POST
            try:
                ping=StandardPing(self.fields)
                if ping.noValidURL(): return self.__Response(self.nourl)
            except:
                return self.__Response(self.invalid+str(self.fields))
            try:
                r=self.handler.store(target,ping)
                return self.__Response()
            except:
                return self.__Response(self.nostore)
        elif self.method=='GET':
            # this should get the resource at the trackback target ...
            try:
                r=self.handler.retrieve(target)
                return self.xmlhdr+r
            except:
                return self.__Response(self.noretrieve)

    def __Response(self,error=''):
        ''' Format a compliant reply to an incoming ping '''
        e='<error>0</error>'
        if error!='': e='<error>1</error><message>%s</message>'%error
        return ''.join([self.xmlhdr,'<response>',e,'</response>'])

which is mainly about handling the error returns. The persistent store, as I say, is very dumb, and for this toy simply consists of an XML file, which uses:

class Handler:
    ''' This provides a simple persistence store for incoming trackbacks.
    Makes no assumptions beyond assuming the incoming ping is
    an xml fragment. No logic for ids for the trackbacks etc ... this is
    supposed to be dumb, and essentially useless for real applications!!'''
    def __init__(self):
        self.xmlfile='trackback-archive.xml'
        # if the file doesn't exist, create it on store ...
        try:
            t=ElementTree.parse(self.xmlfile)
            self.data=t.getroot()
        except:
            self.data=ElementTree.Element("trackbackArchive")
    def checkTargetExists(self,target):
        ''' check whether the target link exists '''
        for t in self.data:
            if t.attrib.get('permalink')==target: return 1
        return 0
    def urlstore(self,target,ping):
        ''' stores a url encoded "standard" ping '''
        node=ElementTree.fromstring(ping)
        # rebuild a StandardPing so that store() gets the interface it expects
        return self.store(target,StandardPing(dict([(e.tag,e.text) for e in node])))
    def store(self,target,ping):
        ''' stores a ping and associates it with target, quite happy for the
        moment to have duplicates - assumes the ping is a StandardPing instance '''
        # regrettably I think we have to test each child for the attribute name,
        # I'd prefer to use an xpath like expression ... but don't know how.
        for t in self.data:
            if t.attrib.get('permalink')==target:
                # add to an existing target element
                t.append(ping.element)
                break
        else:
            # create a new target element
            t=ElementTree.Element('target',permalink=target)
            t.append(ping.element)
            self.data.append(t)
        ElementTree.ElementTree(self.data).write(self.xmlfile)
        return 0
    def retrieve(self,target):
        ''' retrieves the target and any pings associated with it '''
        for t in self.data:
            if t.attrib.get('permalink')==target: return ElementTree.tostring(t)

Finally, for handling the ping, I found this useful:

class StandardPing:
    ''' Defines a standard trackback ping payload. Use as a toy to validate
    existing standard pings and convert to XML or urlencode ...'''
    def __init__(self,argdict=None):
        ''' Instantiate with payload or empty '''
        self.allowed=('title','url','excerpt','blog_name')
        self.element=ElementTree.Element('ping')
        for key in (argdict or {}):
            self.__setitem__(key,argdict[key])
    def __setitem__(self,key,item):
        ''' set item just as if it is a dictionary, but keys are limited '''
        if key not in self.allowed: self.reject(key)
        e=ElementTree.SubElement(self.element,key)
        e.text=item
    def reject(self,key):
        raise ValueError('Invalid key in TrackBack Ping: '+key)
    def toXML(self):
        ''' take the element tree instance and create an xml string '''
        return ElementTree.tostring(self.element)
    def toURLdata(self):
        ''' take the element tree instance and create a url encoded payload string '''
        s=[]
        for item in self.element:
            s.append((item.tag,item.text))
        return urllib.urlencode(s)
    def noValidURL(self):
        ''' At some point this could check the url for validity etc '''
        e=self.element.find('url')
        if e is None: return 1
        if e.text=='': return 1
        return 0

Update: 15th March, 2006: Use of blogname replaced with blog_name (might as well get it right :-)

by Bryan Lawrence : 2006/03/01 : Categories ndg computing (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.