Bryan's Blog

Updated version of the model documentation plots

A couple of weeks ago, I outlined three figures to help describe the model documentation workflow and asked for a bit of feedback.

I've had some feedback, and some more thinking, so here are three updated versions of those plots.

In each figure, the boxes and links are not meant to correspond directly to the UML classes and associations of the (es-doc) Common Information Model (although some do) - the intention is to describe the concepts and intent of the various pieces of documentation.

Figure One

Image: static/2014/07/29/MIPprocess_esdocV2.jpg

The MIP process involves designing experiments, which are provided to modelling centres.

Modelling centres configure model code to produce a configured model, which needs InputData.

They then run one (or many) Simulations to support one or more experiments. The Simulation will conformTo the NumericalRequirements of the Experiment, will produce OutputData, and will have been run on a Platform.

The output data is uploaded to the ESGF archive, where it supports the MIP.

Each of the coloured boxes represents an es-doc "document-type". The yellow coloured relationships will be included in a given Simulation document.

Figure Two

Image: static/2014/07/29/simple_esdocV2.jpg

Neglecting the process, we can look at the various document types and what they are for in more detail.

A simulation (which can also be an aggregation of simulations, also known as an ensemble) will have conformed to the requirements of the model and the experiment via conformances. Many of these constrain the input data, so as to meet the input requirements of the model, and may also have been constrained by one of the numerical requirements of the experiment. Others affect the code (perhaps via a choice of a particular code modification, or a specific parameter choice or choices).

Configured models are described by their Science Properties.

A Simulation document will include all the information coloured in yellow, so it will define which configured model was used via the uses relationship, which will point to a configured model document, which itself describes the model.

Similarly, an experiment document will define all the various numerical requirements, and may point to some specific input data requirements.

Ideally output data objects will point back at the simulation which produced them.
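To make those relationships concrete, here is a minimal sketch of the "yellow" content of a Simulation document as a plain data structure. It is purely illustrative: these are not the es-doc CIM classes or the pyesdoc API, just the concepts and typed links described above, with hypothetical names.

    from dataclasses import dataclass
    from typing import List

    # Illustrative only: NOT the es-doc CIM classes, just the document types and
    # typed links ("uses", "conformsTo", ...) described in Figures One and Two.

    @dataclass
    class DocLink:
        relationship: str   # e.g. "uses", "conformsTo", "ranOn"
        target: str         # identifier of the referenced document

    @dataclass
    class SimulationDoc:
        """The 'yellow' content a Simulation document itself carries."""
        name: str
        uses: DocLink                # -> a ConfiguredModel document
        conformances: List[DocLink]  # -> NumericalRequirements of the Experiment
        input_data: List[str]        # InputData actually used
        output_data: List[str]       # OutputData produced
        platform: DocLink            # -> the Platform it ran on

    # A hypothetical instance, with made-up identifiers:
    sim = SimulationDoc(
        name="example-amip-simulation",
        uses=DocLink("uses", "ConfiguredModel:SomeModel-v1"),
        conformances=[DocLink("conformsTo", "Experiment:amip/SpatioTemporalConstraint")],
        input_data=["InputData:ozone-forcing"],
        output_data=["OutputData:6hrLev-fields"],
        platform=DocLink("ranOn", "Platform:some-hpc"),
    )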

Figure Three

Image: static/2014/07/29/Simulations_esdocV2.jpg

In even more detail we can see that numerical requirements come in a number of flavours, including:

  • SpatioTemporalConstraints - which might define required experimental start dates, durations, and/or the coverage (global, regional etc).

  • Forcings - which generally define how various aspects of the system might be represented, for example, providing Ozone files for a consistent static ozone representation.

  • OutputRequirements - which define what output is required for intercomparison.

Simulation conformances are generally linked to specific numerical requirements and consist of data and code-mod conformances.

Currently es-doc does not clearly distinguish between data requirements and data objects; this will be fixed in a future version.

ConfiguredModels are configured from BaseModel code (this is not currently implemented in es-doc). In principle they can be described both by their software properties and the science properties (and so can their sub-components).

The run-time performance of the simulation should also be captured by resource and performance characteristics (but this too is not yet supported by es-doc).

A model description document should include all the material in green, including the links.
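As a companion sketch (again illustrative, not the CIM itself), the flavours of numerical requirement and the two kinds of conformance described above might be represented like this; all the names and values are hypothetical.

    from dataclasses import dataclass
    from typing import Optional

    # Illustrative only: requirement flavours and the data / code-mod conformances
    # described in Figure Three, not the real es-doc schema.

    @dataclass
    class NumericalRequirement:
        kind: str          # "SpatioTemporalConstraint" | "Forcing" | "OutputRequirement"
        description: str

    @dataclass
    class Conformance:
        """How a simulation met one specific numerical requirement."""
        requirement: NumericalRequirement
        kind: str                          # "data" or "code-mod"
        input_data: Optional[str] = None   # for data conformances
        code_change: Optional[str] = None  # code modification or parameter choice(s)

    ozone = NumericalRequirement("Forcing", "consistent static ozone representation")
    conformance = Conformance(ozone, kind="data",
                              input_data="InputData:ozone-climatology")  # hypothetical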

by Bryan Lawrence : 2014/07/29 : 0 comments (permalink)

NCAS Science Conference, Bristol

In the middle of two days in Bristol for the NCAS science conference. Good to see what so many of my colleagues are up to (distributed as they are, across the UK).

My talk on the influence of Moore's Law on Atmospheric Science is linked from my talks page.

I wish the rest of the talks were publicly available, there is a lot of good stuff (much of it yet to come, today). The problem with (very slow) peer review as a gold standard is that a lot of stuff only sees the public light of day (even into the relevant science community) long after the work was done, whereas much of it is fit for exposure (and discussion) well before then - but you have to go to the right meeting/workshop/conference. However, some of what is discussed is provocative, and work in progress ... and of course our community is (for good reasons, mostly related to an uneducated public spotlight) somewhat shy of premature publication. It's a conundrum.

by Bryan Lawrence : 2014/07/18 : 0 comments (permalink)

The vocabulary of documenting models

Some time ago our European project Metafor migrated to a global project called es-doc. During the Metafor project we put a lot of effort into trying to develop materials describing what it was trying to achieve, but I think we never really got that right (despite a number of papers on the subject, e.g. Lawrence et al 2012, Guilyardi et al 2013, etc).

I'm currently trying to produce a figure on the subject for another publication, and along the way have produced these, which I think might be quite useful for understanding what es-doc is trying to document, and what the main kinds of documentation are that it produces. I'd be interested in feedback as to whether these figures are helpful or not.

In each figure, the boxes and links are not meant to correspond directly to the UML classes and associations of the (es-doc) Common Information Model (although some do) - the intention is to describe the concepts and intent of the various pieces of documentation.

Figure One

Image: static/2014/07/09/MIPprocess_esdoc.jpg

The MIP process involves designing experiments, which are provided to modelling centres.

Modelling centres configure model code to produce a configured model.

They then run one (or many) Simulations to support one or more experiments. The Simulation will conformTo the NumericalRequirements of the Experiment, will produce OutputData, and will have been run on a Platform.

The output data is uploaded to the ESGF archive, where it supports the MIP.

Each of the coloured boxes represents an es-doc "document-type".

Figure Two

Image: static/2014/07/09/simple_esdoc.jpg

Neglecting the process, we can look at the various document types and what they are for in more detail.

A simulation (which can also be an aggregation of simulations, also known as an ensemble) will have used some input data, some of which may have been defined and/or constrained by one of the numerical requirements of the experiment.

Constraints on the simulation by the numerical requirements might require conformances - which define how those constraints affect either the data or the code (maybe a choice of a particular code modification, or via a specific parameter choice or choices).

Configured models are described by their Science Properties.

A Simulation document will include all the information coloured in yellow, so it will define which configured model was used via the uses relationship, which will point to a configured model document, which itself describes the model.

Similarly, an experiment document will define all the various numerical requirements, and may point to some specific input data requirements.

Ideally output data objects will point back at the simulation which produced them.

Figure Three

Image: static/2014/07/09/Simulations_esdoc.jpg

In even more detail we can see that numerical requirements come in a number of flavours, including:

  • SpatioTemporalConstraints - which might define required experimental start dates, durations, and/or the coverage (global, regional etc).

  • Forcings - which generally define how various aspects of the system might be represented, for example, providing Ozone files for a consistent static ozone representation.

  • OutputRequirements - which define what output is required for intercomparison.

Simulation conformances are generally linked to specific numerical requirements and consist of data and code-mod conformances.

ConfiguredModels are configured from BaseModel code (this is not currently implemented in es-doc). In principle they can be described both by their software properties and the science properties (and so can their sub-components).

The run-time performance of the simulation should also be captured by resource and performance characteristics (but this too is not yet supported by es-doc).

A model description document should include all the material in green, including the links.

by Bryan Lawrence : 2014/07/09 : 1 trackback : 2 comments (permalink)

Accessing JASMIN from Android

I have a confession to make. I sometimes illicitly work using my android phone and legitimately with my android tablet. Sometimes that work involves wanting to ssh into real computers, and edit files on real computers. One real computer I want to access is JASMIN, and access to JASMIN is controlled by passphrase protected public/private key pairs.

This is how I do it. None of it's rocket science, but I'm documenting it here, since there are zillions of possible tools, and I spent ages working out which ones I could actually use ... so this is to save you that wasted time.

Before we start a couple of words of warning though: To do this you're going to have to put a copy of your private key on your Android device. It's important to realise that means someone with access to your android device is only one step away from access to your computing accounts. In my case, to protect that key I have taken three steps: I have encrypted my android device, I have a very short time (30s) before lockdown, and I have remote wipe enabled. If the bad guys still get access, my key still has a pass phrase! I hope those steps are enough that if I lose my phone/tablet I will realise and block access before anything bad could happen.

You might read elsewhere about how to use dropbear and your public key on your device. Right now, dropbear doesn't support keys with passphrases, so I am not using dropbear (since to do so would require me to remove the passphrase on my private key). This also means that some Android apps for terminal access which use dropbear under the hood (pretty much any that use busybox) can't exploit a properly protected public key. You can use them for lots of things, but don't use them for JASMIN (or similar) access.

Ok, onwards.

Step 1: Get your Key

Get a copy of your ssh private key onto your device. You can do this by any means you like, but ideally you'd do it in a secure way. This is the way I did it:

  1. Make sure you have your private key somewhere where you can get at it by password protected ssh (not key protected).

  2. Load an app onto your device which gives you an scp/ssh enabled terminal to your device. I used AirTerm (which provides a cool floating terminal, very useful on tablets especially). Note that unfortunately AirTerm is not suitable for JASMIN access because it appears to use dropbear under the hood, and so it can't handle our private/public key pairs.

    • (If you are using AirTerm, you will want to go to preferences and install kbox to follow what I've done.)

  3. Fire up AirTerm, and change working directory:

    cd /storage/sdcard0
    

  4. Make a directory for your key, and change into it, something like:

    mkdir mystuff
    cd mystuff
    

  5. Now scp your private key into that directory, something like:

    scp you@yourhost.wherever:path_to_key/your_private_key .
    

  6. You may or may not need to move the copy of your_private_key on yourhost.wherever, depending on whether it's secure there.

(You're done with AirTerm for now, but I'm sure you'll find lots of other uses for it.)

Step 2: Get and Configure JuiceSSH

I use JuiceSSH to work with remote sites which use the key infrastructure. It has lots of nice properties (especially the pop up terminal keyboard) and it can manage connections and identities, and handle multi-hop ssh connections (e.g. for JASMIN, as needed to get via the login nodes to the science nodes).

JuiceSSH is pretty straightforward. Here's what you need for JASMIN.

  1. Fire up JuiceSSH. You will need a password for juice itself. Select something memorable and safe ... but different from your private key passphrase! (If you're like me, and forget these things, you might want to exploit something like lastpass to put this password in your vault).

  2. Add your JASMIN identity:

    • Select a nickname for this identity, give it your jasmin username, and then choose the private key option. Inside there let smart search find your private key (and then tick it). Update and save.

  3. Now add a connection to jasmin-login1, the options are pretty straightforward. You can test it straight away. If it doesn't work, ask yourself if you need to use a VPN client on your phone/tablet first to put yourself in the right place for JASMIN to take an inbound connection.

  4. You can add a direct connection to jasmin-sci1 (or your science machine) by using the via option in the setup. Here's an example of how to configure that.

    Image: static/2013/11/13/juice.png

    In that example, "jasmin login" is the nickname I gave to the connection to jasmin-login1, and "bnl jasmin pk" is the nickname for my jasmin identity.
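(For comparison, the sketch below shows roughly what that via option amounts to, written as a desktop-side Python example using paramiko (a reasonably recent version). It is only an illustration of the two-hop pattern, not anything JuiceSSH itself does on the device; the username, key path and the jasmin-sci1 host name are placeholders/assumptions.)

    import paramiko

    # Hop 1: connect to the login node with the passphrase-protected private key.
    jump = paramiko.SSHClient()
    jump.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    jump.connect("jasmin-login1.ceda.ac.uk", username="yourusername",   # placeholders
                 key_filename="/path/to/your_private_key",
                 passphrase="your-passphrase")

    # Open a tunnelled TCP channel through the login node to the science node.
    channel = jump.get_transport().open_channel(
        "direct-tcpip", ("jasmin-sci1.ceda.ac.uk", 22), ("", 0))        # assumed host name

    # Hop 2: an ssh session to the science node, carried over that channel.
    sci = paramiko.SSHClient()
    sci.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    sci.connect("jasmin-sci1.ceda.ac.uk", username="yourusername",
                key_filename="/path/to/your_private_key",
                passphrase="your-passphrase", sock=channel)

    _, stdout, _ = sci.exec_command("hostname")
    print(stdout.read().decode())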

Now you can access JASMIN, but what about editing files?

Step 3: Get and Configure DroidEdit

I'm using DroidEdit as my tool of choice for editing on Android, although, to be fair, I'm not doing that much editing on Android yet, so I'd be interested in hearing if there are other, better tools. I primarily chose it because it has support for both PKI and editing remote files via SFTP.

Once you have DroidEdit, you need to go into settings, and in the SFTP/FTP actions choose "Add Remote Server". Configure it with the path to your private key, and save using the server address jasmin-login1.ceda.ac.uk. Ignore the fact that the test fails (note that it prompts you for a password, not a passphrase).

Then go back out and try to open a file on JASMIN; this time you'll be prompted for a passphrase and, voila, it should just work (as before, make sure your android device has used a VPN or whatever to be "in the right place").

by Bryan Lawrence : 2013/11/13 : 0 comments (permalink)

vertical resolution

Last week I pointed out that I wasn't at all sure the analysis by LFR89 really applied at modern horizontal grid resolutions, since the vertical scales implied for quasi-geostrophic motion didn't make sense.

I've done a wee bit more delving, and now I'm sure it's not appropriate. The analysis LFR89 did was based on the solutions of the "quasi-geostrophic pseudo-vorticity equation". This is a venerable equation, first derived by a couple of folks, but formalised by Jules Charney. It's derived by carrying out a scale analysis of the primitive equations of motion, suitable for "large scale motions where departures from hydrostatic and geostrophic equilibrium are small". I still haven't done the rederivation myself (it's a lot of bookwork, and a desultory attempt to set up sympy to do it ran out of time), but Charney himself (Charney, 1971), in an interesting paper (of some relevance here), put some bounds on the various scales of validity (see his equation 9). As a consequence, Charney points out that these equations define a band of specific horizontal and vertical scales! The fastest way to get to that band is to go back to the fuller derivation in Charney and Stern, 1961, where the constraints are laid out more easily in the scale analysis. In particular, the quasi-Boussinesq and hydrostatic approximations give us:

L²/D < g/f² (~10⁹) and D/L < 0.1

Putting L = 100 km into those equations suggests that D should lie between 1 and 10 km, which isn't quite the same as what we get from the assertion in LFR89 that all vertical scales may appear, related by the expression:

L = D * N/f

(which gives us D = 1 km, and which I think is more to do with the scales of the baroclinic wave solutions to the QG equation). Of course this scale analysis needs to be evaluated carefully in practice; in particular, at ever smaller values of L and D the QG equations may still have those kinds of wave solutions, but the equations themselves are no longer valid.
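For the record, here's that arithmetic as a quick Python sketch. The values of f, N and g are my own rough mid-latitude assumptions, not anything taken from LFR89 or Charney.

    # Quick check of the scales quoted above, with rough mid-latitude values.
    # Assumed values (not from LFR89): f ~ 1e-4 s-1, N ~ 1e-2 s-1, g ~ 9.8 m s-2.
    g = 9.8        # m s-2
    f = 1.0e-4     # s-1 (mid-latitude Coriolis parameter)
    N = 1.0e-2     # s-1 (buoyancy frequency)
    L = 100.0e3    # m  (horizontal scale, 100 km)

    print("g/f^2     ~ %.1e m" % (g / f**2))   # the ~10^9 quoted above
    print("0.1 * L   ~ %.0f m" % (0.1 * L))    # upper bound on D from D/L < 0.1
    print("L * f / N ~ %.0f m" % (L * f / N))  # the D ~ 1 km from L = D * N/f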

I'd be happy if someone could be bothered to do the derivation with constraints properly, and evaluate them completely for all scales, but until they do, I'm not going to invest too much more time in using LFR89 to give me "large" scale constraints on the vertical resolution.

Charney's paper is interesting and relevant since it also points out that the larger scales do not feed energy to the smaller scales, but it says nothing significant about other scales, such as those involved in fronts and blocking and gravity waves. As I've already argued, I think these tell us more of what we need to know. That's where we'll go next.

Update 19/09/13: I have slightly edited this post, since I didn't like the tone of a particular sentence once the dust had settled; on rereading it a day later it carried connotations that were not intended. I fixed that and added a clarifying sentence or two.

by Bryan Lawrence : 2013/09/18 (permalink)

zotero, zandy, greader, evernote, and me

Quite a while ago (i.e. years), I decided that managing my bibliographic information in a bibtex file wasn't working any longer. Back then I had a look at Mendeley and Zotero. I can't really remember why, but I chose Zotero (I played with both; I think it was a combination of how Zotero worked for me and not liking having to use Mendeley's PDF viewer. I also had some worries about Mendeley and the software and information IPR ... when Elsevier bought Mendeley I felt vindicated on the latter.)

Anyway, now Zotero is a pretty integral part of my working environment. I use zotero standalone on my linux laptop (which is also my desktop when it's in a docking station). I make heavy use of zotfile to migrate papers to and from my Android tablet for reading (I no longer print out anything). I like being able to annotate my PDFs on the tablet, and in particular, having anything I highlighted being pulled out automagically when zotfile pulls the papers back off the tablet.

However, there are two issues with that workflow that bug me. I'd like my PDF library to be completely synchronised for offline reading on my tablet, and I'd like a fully featured native zotero client on the tablet (and my Galaxy Note phone). Zandy is the only Android app for zotero, and while it has some useful functionality (it synchronises the metadata so at least one can check on the phone/tablet if something is in my library), it doesn't synchronise the attachments completely.

(I do use box.net to synchronise my attachments out from zotero standalone via webdav, which works, but one can only use it to effectively download attachments one by one to Zandy - there is no bulk download facility, and no way to annotate and upload back - it's one way sync! But you can view stuff without going to the journal which can be useful for memory jogging.)

The other thing I can't do on my Android devices, and in particular my phone, is effectively create zotero information. There are ways, I could:

  • Manually enter the information in Zandy (no thanks, the whole point of zotero is to avoid manual bibliographic entry where possible). (There is a scanner option, but I'm mostly dealing with papers found on journal websites.)

  • Use the zotero bookmarklet on chrome. Well yes, that's possible, but it fails miserably on the AMS journal websites, and requires an inordinate amount of clicking and typing. (The way you use the bookmarklet is to start typing its bookmark name into the address-bar of the page you are looking at, and if it can find a translator, it loads it into zotero.)

What I really want to do, is from a feed reader, share a journal entry straight into zotero. I can nearly do this. However, from greader, if I

  • Share to Zandy, I basically just get the paper loaded as web page, and I have to manually fix it all later. This isn't necessarily a bad option, at least I get something, but it's often not enough, unless I do that manual step. You can guess how often I do it ...

  • Share to evernote, I can at least get the abstract and most of the body out of the RSS/atom straight into evernote (again the AMS journal feeds are hopeless). But now I have my bibliographic information in two places: abstracts in evernote, and full papers with proper references in Zotero. Searching is cumbersome.

Anyone got a better solution for (zotero based) bibliographic handling from Android (or a way of encouraging Avram Lyon, the Zandy author, to get back into active development)?

I need it to work from Android, because it's in the nature of my job that I spend a lot of time before and after meetings, travelling etc, when being able to interact with the scientific literature on my phone and tablet would make me more productive. Indeed, I do most of my journal paper triage on my phone! (No, I am not going to consider becoming an apple fan-boy!)

(Of course most of this is pointless if the paper is invisible behind a paywall. Invective removed by the editor/author.)

by Bryan Lawrence : 2013/09/11 : 1 comment (permalink)

Vertical and Horizontal Resolution

I've been delving in the literature a bit this week ... considering model resolution and various issues around it. This post is by way of notes from my reading.

One of the things to consider at any time is whether we have enough resolution. Most climate scientists will tell you they need more horizontal resolution, but fewer will concede they need more vertical resolution.

It should be (but appears not to be) well known that just as one has to consider changing the time-step as horizontal resolution is increased, one needs to consider whether there is enough vertical resolution. This issue was dealt with quite a time ago in Lindzen and Fox-Rabinovitz (1989) (hereafter LFR89). There have been some recent follow-ups on the importance for chemistry (e.g. Kent et al 2012) and on model performance in general (e.g. Marques et al 2011). (It's probably worth pointing out that the latter, and references therein, point out that model convergence to reality depends as much on how the physics deals with resolution as on the dynamics, but that's a point for another day ... but if you want to go there, you could look at Pope and Stratton 2002 and Pope et al 2001, although I have to say both do a bit of special pleading to rule out extra vertical resolution.)

Anyway, I thought it might be interesting to tabulate what sorts of resolution are actually needed for various tasks. It's important to note that LFR89's analysis comes up with different resolutions for different tasks and at different latitudes. So, if we're to take LFR89 at face-value and we're interested in quasi-geostrophic scales, then we can extend their table to modern model resolutions:

  dx (deg)   dz (equator)   dz (60°)   dz (45°)   dz (22.5°)
  0.25       1 m (!!)       84 m       97 m       69 m
  0.5        3 m (!!)       170 m      190 m      140 m
  1          14 m (!)       340 m      390 m      270 m
  2          54 m (!)       670 m      780 m      550 m
  5          340 m          1700 m     1900 m     1400 m

Clearly there is a problem with this analysis in the tropics at all scales, and everywhere at 25km. Common sense suggests one can't have atmospheric phenomena with horizontal scales of over 50km with vertical scales of 1m. Pretty obviously the scaling assumptions that underly the LFR89 use of quasi-geostrophy are broken. Which brings us to a moot point in interpreting LFR89. If one starts with a QG equation, we've already rejected a bunch of small scales which LFR89 have coming out of the analysis at modern high resolution scales. We probably need to rethink the analysis! (Which is to say, here and now, I'm not going to do that rethinking :-).1

Fortunately for me (in terms of analysis), right now I'm less interested in the large-scale horizontal flows than in gravity waves. There the analysis of LFR89 is a bit more timeless. However, the analysis pretty much says that if you're interested in breaking gravity waves you need infinite resolution. They then back off and do a bit of a fudge around effective damping, to suggest that to resolve gravity wave processes one needs a vertical resolution (in km) of roughly 0.006 times the horizontal grid resolution (in degrees). For the horizontal resolutions above, that gives us something like 1.5, 3, 6, 12 and 30 m vertical resolutions.

That doesn't look likely any time soon!
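Just to show where those numbers come from, here's a small Python sketch: the 0.006 fudge (read as giving a vertical resolution in km from a horizontal resolution in degrees, which is the only reading that reproduces the 1.5 to 30 m figures above), plus a rough D = L*f/N estimate with my own assumed values of f and N (N ~ 0.02, cf. the footnote). The latter lands near the mid-latitude columns of the table, but treat it as a plausibility check rather than a reproduction of LFR89.

    import math

    # The gravity-wave "fudge": vertical resolution (km) ~ 0.006 * dx (degrees).
    # Read this way it reproduces the 1.5, 3, 6, 12 and 30 m figures quoted above.
    for dx_deg in (0.25, 0.5, 1, 2, 5):
        dz_m = 0.006 * dx_deg * 1000.0      # km -> m
        print("dx = %4.2f deg  ->  dz ~ %5.1f m" % (dx_deg, dz_m))

    # A rough quasi-geostrophic estimate D ~ L*f/N (assumptions: N ~ 0.02 s-1,
    # L = dx in metres at the given latitude); close to, but not necessarily the
    # same calculation as, the mid-latitude columns of the table above.
    omega, N = 7.292e-5, 0.02
    for lat in (60.0, 45.0, 22.5):
        f = 2 * omega * math.sin(math.radians(lat))
        L = 0.25 * 111.0e3 * math.cos(math.radians(lat))   # 0.25 degrees, in metres
        print("lat %4.1f: D ~ %.0f m for dx = 0.25 deg" % (lat, L * f / N))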

Another approach is to look at what people have thought they need (and why). One of the reasons I started all this thinking was because I was wondering how easy it would be to repeat Watanabe et al (2008)'s work with the UM. Watanabe et al used a T213L256 model with a model top at 85km, having done a lot of previous work evaluating L250 type models. This is roughly a 0.5 degree model using the table above, and has an average vertical resolution of about 300m, which is not too far from LFR89 in the table above (at least using the value of N discussed in the footnote). Most other models seem to fall well short of that. For the UM, even studies which look at resolved gravity waves in the stratosphere have relatively coarse resolutions, e.g. Shutts and Vosper (2011) use 70 levels to a model top at 80km (again with a model resolution around 0.5 degrees). However, the standard configuration of that model had a time-stepping regime which filtered out resolved gravity waves, so when it was used in a configuration which didn't filter gravity waves, the vertical resolution was still constrained to be the same as in the standard model. Similarly, Bushel et al (2010), in a study looking at tropical waves and their interaction with ozone, used a relatively low horizontal and vertical resolution (between 1 and 4 degrees horizontally) and L60 to 84km - but again, resolved gravity waves were filtered out, and parameterisations were used.

As an aside, one of the arguments in Pope et al 2001 as to why vertical resolution is less important in the tropics is a reference to Nigam et al 1986, who they assert show that non-linear processes smooth fields naturally so as to diminish vertical resolution requirements. This is one of the cases where I have some of my own opinions: see, for example, Rosier and Lawrence, 1999, discussing, amongst other things, pancake structures with small vertical scales in the tropical stratosphere. Given that there now seems to be a body of evidence suggesting that the troposphere does react dynamically to the middle atmosphere in climatically important ways, that brings me nicely back to wanting more vertical resolution ... even if we buy that it's not needed in the troposphere, and I'm a long way from buying that ... yet (particularly given recent results looking at blocking and resolution in CMIP5 models: Anstey et al, 2013).

However, for the UM, before I worry too much about the vertical resolution, I've got to get to the bottom of the time-step filtering I alluded to above.

1: That said, to repeat their table, I had to replace a 3 in their approximation for N with a 2, and in fact, I'd rather use a tropospheric average of N~0.012, in which case we get nearly a factor of 2 larger required resolution. However, the fundamental issue is still that I would prefer to work through the assumptions of the QG approximation, I think there is a problem in there ... but I don't have time now. (ret).

by Bryan Lawrence : 2013/09/09 : 1 comment (permalink)

Storing and manipulating environmental big data with JASMIN

We're pleased that our paper on the first year of JASMIN has been accepted by the IEEE BigData 2013 conference.

I'll put a copy of the paper online soon, but for now, here is the abstract:

Storing and manipulating environmental big data with JASMIN

B.N. Lawrence, V.L. Bennett, J. Churchill, M. Juckes, P. Kershaw, S. Pascoe, M. Pritchard, S. Pepler and A. Stephens

JASMIN is a super-data-cluster designed to provide a high-performance high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflow, and the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was a pent up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.

by Bryan Lawrence : 2013/08/28 : Categories abstracts jasmin : 0 comments (permalink)

Gavin's Proposal

Gavin made some proposals on my blog for how he thought an easy automatic DOI system could be set up for CMIP data in ESGF.

I promised a reply in a day or so. Oh well, small unit error: perhaps a week or so (or two) would have been a better promise. I had hoped to agree my reply with Stephen, but this is the season for not overlapping. So, here we are.

I've taken Gavin's proposal, and broken it down. We'll go through them and analyse them against what ESGF can do now, my description of what a DOI should do, and my prejudice ... (In this first section, Gavin's proposal leads each enumerated point, my responses are in the bullets). After that, I'll make a modified proposal.

  1. Each simulation (not ensemble!) (a unique model, experiment, rip number) is associated with a DOI.

    • For now, let's just ask: can this be done with another sort of identifier? The answer is yes, we can do that with a URL now, and indeed, with an unambiguous identifier (the DRS URL stripped of the host name). For example, for this dataset:

      Image: static/2013/08/23/categ.png

      (From the URL at our ESGF node we can follow the catalogue link to the dataset metadata page and find the following ids for this dataset; identifiers of this form can be unpacked into their DRS facets, as in the sketch after this list):

      cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1 or cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1.v20110803 or cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1.v20110803|cmip-dn1.badc.rl.ac.uk

    • Then, the next question is what value would a DOI add to that URL?

      • Well, it should add persistence. I've already argued that I'm not happy about that unless we're sure this data won't be superseded. More on this anon.

      • It would add publisher credit when used. Hmm, that introduces multiple sub-questions. Is this the right granularity? Do the publishers get too many citations?

      • It could be used by consumers? Again, is this the right granularity, how do we see this appearing in a reference list? How many of them would appear in your typical paper?

      • I'm going to save these granularity questions for later, but they're key to part of my answer (informed by prejudice, experience, and wider goals).

  2. This DOI is technically a collection of unique tracking IDs (one per file) AND any related DOIs (perhaps a small number of classes). The classes of related DOIs could be a previous version (with either trivial or substantive changes to one or more of the individual files) or to DOIs in the same ensemble. Think 'sibling' or 'parent' DOIs perhaps.

    • No, it's technically a link to a landing page (if it's a DOI), which then links to a real data object. That aside, yes, the digital object could be a list of tracking-ids, and a list of typed links to other objects. It's a new kind of digital object, but there is much to like about it, not least that a list of identifiers is unlikely to be supersede-able.

    • It's not obvious to me that the first part of this (the tracking-id list) wouldn't be better to be a list of DRS URLs (including versioning) ...

  3. This gets assigned for every simulation on publication and is under the control of each ESGF node.

    • This is already done in the Thredds catalogue, if this is a DRS URL.

  4. If some data gets corrected, you get a new DOI, but it may be that most of the tracking IDs are the same and so people who didn't use the corrected file will be able to see that the data they used was the same.

    • This is already done by the ESGF software using tools that Stephen Pascoe built, provided this is a DRS URL, and most importantly, provided the publisher node takes the trouble to do it. Sadly, most have not! So, a priori, we already know that not all the publishers are going to do this ... because the automagic behaviour requires configuration (and file management) that thus far, folks haven't wanted to do. Caveat: what I think is required here is for the identifiers of deleted data to be kept, and I don't think our current tools do that, so there is some work to be done.

  5. The original DOI is part of the collection, and whether people use the original or subsequent ones, you could build an easy tool to see whether your DOI was cited as either a parent or sibling DOI in other collections.

    • If you replace DOI with identifier in this sentence, I'm comfortable with it, but this is not a DOI (for granularity reasons I'll get to).

  6. So, the DOIs (which contain only tracking IDs and other DOIs) are persistent. They allow a unique mapping to the data as used, but also to any updates/corrections. Groups can use them to see who used their data, and they can be assigned automatically. The only metadata required are trivially available via the filename structure.

    • All good, all likeable, but it then requires the development of an entirely new ecosystem of citation tools (for counting, comparison etc), and it's not equal in any sense to a traditional academic citation.
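(Referring back to the identifiers in point 1: a DRS dataset id of that form is just a dot-separated list of facets, so turning one into something structured is trivial. A rough sketch follows; the facet names are my labelling of the CMIP5 DRS, not an official API.)

    # Rough sketch: unpack a CMIP5 DRS dataset id into named facets.
    # The facet names are my labelling of the CMIP5 DRS, not an official API.
    DRS_FACETS = ["activity", "product", "institute", "model", "experiment",
                  "frequency", "realm", "mip_table", "ensemble"]

    def parse_drs_id(drs_id):
        """Split a DRS dataset id (optionally with .vYYYYMMDD and |datanode)."""
        drs_id, _, data_node = drs_id.partition("|")
        parts = drs_id.split(".")
        version = None
        if parts[-1].startswith("v") and parts[-1][1:].isdigit():
            version = parts.pop()
        facets = dict(zip(DRS_FACETS, parts))
        facets["version"] = version
        facets["data_node"] = data_node or None
        return facets

    example = ("cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1"
               ".v20110803|cmip-dn1.badc.rl.ac.uk")
    print(parse_drs_id(example))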

So, I don't think there are any insurmountable technical hurdles in what is proposed, but I have three substantive objections, two of which have been hinted about above.

  • Respect: One of the major reasons for wanting to use DOIs (as opposed to any other identifiers) is that if we get the granularity and procedures right, all the existing citation tools, and the shared academic understanding of what a citation means, can be used. Do something different, and we have loads of new (counting/comparator) tools to build, and more importantly, we have to convince every administration you can think of that these new "citations" have merit (in comparison to traditional citations). Administrations include your organisation, whoever you are: will a data publication rank the same as a paper publication? I hope so, especially if it's in a data journal like one of these: Scientific Data, Geoscience Data Journal, Earth System Science Data etc. Administrations also include journals. Will they be happy for this citation to appear in a reference list?

    • The bottom line: Does the above proposal "get the granularity and procedure right"? Do we anticipate that an ESGF DOI should rank the same as one of those? If not, why not? How will a consumer differentiate? (My answers: no; not as proposed; wrong granularity, no formal procedures; they'll not use them, so it won't be a problem.)

  • Granularity: When we think about traditional publications, we have clear notions of what a publication actually is, and we have good concepts about how to refer to parts of publications. (We know about papers, books, chapters, pages etc.). Data is different. Why choose to cite simulations (as opposed to variables, bytes, ensembles etc)? The answer has to be that the right granularity is an optimisation of the needs of all parties (producers, consumers, publishers). A key a priori assumption is that we cannot expect it to be right for all of them, so whatever we choose, we need a method of referring into a publication unit, and we need a method of sensibly aggregating published units.

    • For reasons I'll get to, I think this proposal has the wrong granularity.

  • Scientific Usefulness/Appropriateness: and now we get to the one I care about most, and the one that is most subjective. In this instance the thing I like least is giving a DOI to an ensemble member. Why not give it to an ensemble?

    • PRO: There is an obvious mechanism for referring to an ensemble member within an ensemble.

    • PRO: Most times one would expect a scientific analysis to exploit an ensemble (although occasionally one can only afford to have one member).

    • PRO: It is easy to provide more complete metadata (once) for an ensemble.

      • Disclosure: I think I think that this is much more important than Gavin, but that's because I also deal with many user communities who are not modelling experts. We both agree we have yet to get this right, but that's not a reason to give it up.

    • CON: (Anticipated) Why choose the ensemble axis to be the ensemble of one model's integrations, not a multi-model ensemble? (Answer: because many use cases are not multi-model, and we can deal with multi-model ensembles as aggregations of single-model ensembles).

    • CON: Ensembles are always being added to. When is it finished? What do I do if I want to use the data before the n'th member is available?

    • CON: The metadata for describing the difference between the ensemble members may be incomplete. (Yeah, so, ...)

OK, so what are the minimal changes I'd make to Gavin's proposal to make it something I do like? (Because it's close). What issues would arise?

  1. Granularity: use ensembles, but make sure there is a well understood notation for citing into ensembles to specific ensemble members, and that there is a preprint notation for citations into ensembles which are known to be incomplete, and an edition notation for ensembles which have grown after Publication.

    • We would need to think this through for publication and Publication inside ESGF, but this ought not be hard.

    • Aggregations could be done in new publications, in the outside world, using for example, the data journals listed above.

  2. Version Control: I like this proposal a lot, in that it takes out the need to keep the data (perhaps). I'd need Stephen to work out what the consequences are for Thredds and the DRS versioning library of keeping tracking ids of deleted files etc, but it's doable, albeit work.

  3. Respect. Since this is now (nearly) automagic within ESGF, you'd want some method of ensuring that only reputable modelling groups can put their data in. I think ensembles are about the right granularity to rank with traditional publications.

  4. Content Standard: We'd need to come up with an appropriate manifest format for the tracking ids (or maybe DRS URLs instead, but we'd still be holding somewhere lists of tracking-ids mapping onto URLs); the sketch after this list shows the kind of thing I mean. We'd need to control the vocabulary for relationships. All doable, but starting to look like significant work.

  5. Persistence. ESGF data nodes come and go. We'd need the usual suspects to commit to maintain this long-term, which means the funders to commit to the work, and the maintenance. So we'd need to work this up into a business case. (Not a grant: this would have to be underpinned by long-term money, not soft-money.)

    • The existing CMIP DOI system, is underpinned primarily by the IPCC-DDC commitments, either we get the IPCC-DDC group to agree to do this, and/or we get the relevant funding agencies in Germany, the UK and US (at least) to buy into it.
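By way of illustration of the content standard point above, the manifest behind one published unit might be nothing more complicated than the following. This is a sketch only: the field names, relationship vocabulary and values are invented for the example, not a proposal for the actual content standard.

    import json

    # Invented example of a manifest for one published unit (an ensemble):
    # a list of member datasets (DRS ids and/or file tracking ids) plus typed
    # links to related objects. Field names and vocabulary are illustrative only.
    manifest = {
        "identifier": "doi:10.xxxx/example-ensemble",        # hypothetical
        "members": [
            {"drs_id": "cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1.v20110803",
             "tracking_ids": ["<tracking-id-1>", "<tracking-id-2>"]},
        ],
        "related": [
            {"relationship": "previousVersion",
             "identifier": "doi:10.xxxx/example-ensemble-v1"},
            {"relationship": "memberOf",
             "identifier": "doi:10.xxxx/a-multi-model-collection"},
        ],
    }
    print(json.dumps(manifest, indent=2))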

One last caveat. I happen to believe the existing DOI plan for CMIP5 is not broken, it's just under-resourced, and the community has not bought into it enough. I wouldn't want to do this new thing, and stop doing the existing activity. That has implications that need thinking through too!

by Bryan Lawrence : 2013/08/23 : 4 comments (permalink)

github repo for badc text file code

Some years ago, my colleagues Graham Parton and Sam Pepler developed a metadata content standard to allow the ingestion of comma separated value (csv) files of data into the BADC archive. What they wanted to do was allow folks whose primary data manipulation tool was a spreadsheet to have a viable and easy mechanism of getting their data into our managed environment.

Data in a managed environment needs to be in well known formats that can be persisted for a long time, and it has to have adequate metadata. Normal (e.g. excel) spreadsheets fail both criteria: over the years the format has changed, and has not always been well enough documented that one would have confidence that all the content could be recovered in the future (when the original software is no longer available). CSV files exported from spreadsheets can be persisted, but like their spreadsheets, don't necessarily have the right metadata.

The BADC text file "format" (it's not really a format) was designed to address those metadata requirements with some modest mandatory information and a syntax for optional information. BADC has run a checker (so that folks can check their files comply) for some years now, but never made any code for using/manipulating these files available.

A few days ago I discovered this for myself (I wanted to have some files hang around on my desktop, and I've long since discovered that I need metadata, never mind the BADC). I could have used NetCDF, but I have a policy of trying out BADC stuff whenever I can, so I tried the text files, and wasn't happy that I couldn't find any public code.

Well there is public code now: https://github.com/cedadev/badctext!

by Bryan Lawrence : 2013/08/22 : 2 comments (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.