Bryan Lawrence : Bryan's Blog

NCAS Science Conference, Bristol

In the middle of two days in Bristol for the NCAS science conference. Good to see what so many of my colleagues are up to (distributed as they are, across the UK).

My talk on the influence of Moore's Law on Atmospheric Science is linked from my talks page.

I wish the rest of the talks were publicly available, there is a lot of good stuff (much of it yet to come, today). The problem with (very slow) peer review as a gold standard is that a lot of stuff only sees the public light of day (even into the relevant science community) long after the work was done, whereas much of it is fit for exposure (and discussion) well before then - but you have to go to the right meeting/workshop/conference. However, some of what is discussed is provocative, and work in progress ... and of course our community is (for good reasons, mostly related to an uneducated public spotlight) somewhat shy of premature publication. It's a conundrum.

by Bryan Lawrence : 2014/07/18 : 0 comments (permalink)

The vocabulary of documenting models

Some time ago our European project Metafor migrated to a global project called es-doc. During the Metafor project we put a lot of effort into developing materials which describe what it is trying to achieve, but I think we never really got that right (despite a number of papers on the subject, e.g. Lawrence et al 2012, Guilyardi et al 2013 etc).

I'm currently trying to produce a figure on the subject for another publication, and along the way have produced these, which I think might be quite useful for understanding what es-doc is trying to document, and what the main kinds of documentation it produces are. I'd be interested in feedback as to whether these figures are helpful or not.

In each figure, the boxes and links are not meant to correspond directly to the UML classes and associations of the (es-doc) Common Information Model (although some do) - the intention is to describe the concepts and intent of the various pieces of documentation.

Figure One

Image: static/2014/07/09/MIPprocess_esdoc.jpg

The MIP process involves designing experiments which are provided to modelling centres.

Modelling centres configure model code to produce a configured model.

They then run one (or many) Simulations to support one or more experiments. The Simulation will conformTo (NumericalRequirements) of the Experiment, will produce OutputData, and will have been run on a Platform.

The output data is uploaded to the ESGF archive where it supports the MIP.

Each of the coloured boxes represents an es-doc "document-type".

Figure Two

Image: static/2014/07/09/simple_esdoc.jpg

Neglecting the process, we can look at the various document types and what they are for in more detail.

A simulation (which can also be an aggregation of simulations, also known as an ensemble) will have used some input data, some of which may have been defined and/or constrained by one of the numerical requirements of the experiment.

Constraints on the simulation by the numerical requirements might require conformances - which define how those constraints affect either the data or the code (maybe a choice of a particular code modification, or via a specific parameter choice or choices).

Configured models are described by their Science Properties.

A Simulation document will include all the information coloured in yellow, so it will define which configured model was used via the uses relationship, which will point to a configured model document, which itself describes the model.

Similarly, an experiment document will define all the various numerical requirements, and may point to some specific input data requirements.

Ideally output data objects will point back at the simulation which produced them.

Figure Three

Image: static/2014/07/09/Simulations_esdoc.jpg

In even more detail we can see that numerical requirements come in a number of flavours, including:

  • SpatioTemporalConstraints - which might define required experimental start dates, durations, and/or the coverage (global, regional etc).

  • Forcings - which generally define how various aspects of the system might be represented, for example, providing Ozone files for a consistent static ozone representation.

  • OutputRequirements - which define what output is required for intercomparison.

Simulation conformances are generally linked to specific numerical requirements and consist of data and code-mod conformances.

ConfiguredModels are configured from BaseModel code (this is not currently implemented in es-doc). In principle they can be described both by their software properties and the science properties (and so can their sub-components).

The run-time performance of the simulation should also be captured by resource and performance characteristics (but this too is not yet supported by es-doc).

A model description document should include all the material in green, including the links.
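The relationships described across the three figures can be sketched as a few plain data classes. This is only an illustration of the concepts (experiment, numerical requirement, conformance, configured model, simulation); it is not the es-doc Common Information Model, and every class name, field name and example value below is an assumption made for the sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NumericalRequirement:
    name: str
    kind: str  # e.g. "SpatioTemporalConstraint", "Forcing", "OutputRequirement"


@dataclass
class Experiment:
    name: str
    requirements: List[NumericalRequirement] = field(default_factory=list)


@dataclass
class ConfiguredModel:
    name: str
    science_properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Conformance:
    requirement: NumericalRequirement
    via: str  # "data" or "code-mod"
    note: str = ""


@dataclass
class Simulation:
    name: str
    uses: ConfiguredModel  # the "uses" relationship
    supports: Experiment
    platform: str
    conformances: List[Conformance] = field(default_factory=list)

    def requirements_without_conformance(self) -> List[NumericalRequirement]:
        """List experiment requirements that no conformance refers to."""
        covered = {c.requirement.name for c in self.conformances}
        return [r for r in self.supports.requirements if r.name not in covered]


# Hypothetical example values, purely for illustration.
ozone = NumericalRequirement("static-ozone", "Forcing")
outputs = NumericalRequirement("output-list", "OutputRequirement")
experiment = Experiment("example-experiment", [ozone, outputs])
model = ConfiguredModel("example-model", {"atmosphere": "example description"})
simulation = Simulation(
    name="example-run",
    uses=model,
    supports=experiment,
    platform="example-machine",
    conformances=[Conformance(ozone, via="data", note="used the supplied ozone files")],
)

print([r.name for r in simulation.requirements_without_conformance()])
```

Not every requirement needs a conformance, so the helper is a listing aid rather than a validity check.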

by Bryan Lawrence : 2014/07/09 : 2 comments (permalink)

Accessing JASMIN from Android

I have a confession to make. I sometimes illicitly work using my android phone and legitimately with my android tablet. Sometimes that work involves wanting to ssh into real computers, and edit files on real computers. One real computer I want to access is JASMIN, and access to JASMIN is controlled by passphrase protected public/private key pairs.

This is how I do it. None of it's rocket science, but I'm documenting it here, since there are zillions of possible tools, and I spent ages working out which ones I could actually use ... so this is to save you that wasted time.

Before we start a couple of words of warning though: To do this you're going to have to put a copy of your private key on your Android device. It's important to realise that means someone with access to your android device is only one step away from access to your computing accounts. In my case, to protect that key I have taken three steps: I have encrypted my android device, I have a very short time (30s) before lockdown, and I have remote wipe enabled. If the bad guys still get access, my key still has a pass phrase! I hope those steps are enough that if I lose my phone/tablet I will realise and block access before anything bad could happen.

You might read elsewhere about how to use dropbear and your public key on your device. Right now, dropbear doesn't support keys with passphrases, so I am not using dropbear (since to do so would require me to remove the passphrase on my private key). This also means that some Android apps for terminal access which use dropbear under the hood (pretty much any that use busybox) can't exploit a properly protected public key. You can use them for lots of things, but don't use them for JASMIN (or similar) access.

Ok, onwards.

Step 1: Get your Key

Get a copy of your ssh private key onto your device. You can do this by any means you like, but ideally you'd do it in a secure way. This is the way I did it:

  1. Make sure you have your private key somewhere where you can get at it by password protected ssh (not key protected).

  2. Load an app onto your device which gives you an scp/ssh enabled terminal to your device. I used AirTerm (which provides a cool floating terminal, very useful on tablets especially). Note that unfortunately AirTerm is not suitable for JASMIN access because it appears to use dropbear under the hood, and so it can't handle our private/public key pairs.

    • (If you are using AirTerm, you will want to go to preferences and install kbox to follow what I've done.)

  3. Fire up AirTerm, and change working directory:

    cd /storage/sdcard0

  4. Make a directory for your key, and change into it, something like

    mkdir mystuff
    cd mystuff

  5. Now scp your private key into that directory, something like:

    scp you@yourhost.wherever:path_to_key/your_private_key .

  6. You may or may not need to move the copy of your your_private_key on yourhost.wherever, depending on whether it's secure there.

(You're done with AirTerm for now, but I'm sure you'll find lots of other uses for it.)

Step 2: Get and Configure JuiceSSH

I use JuiceSSH to work with remote sites which use the key infrastructure. It has lots of nice properties (especially the pop up terminal keyboard) and it can manage connections and identities, and handle multi-hop ssh connections (e.g. for JASMIN, as needed to get via the login nodes to the science nodes).

JuiceSSH is pretty straightforward. Here's what you need for JASMIN.

  1. Fire up JuiceSSH. You will need a password for juice itself. Select something memorable and safe ... but different from your private key passphrase! (If you're like me, and forget these things, you might want to exploit something like lastpass to put this password in your vault).

  2. Add your JASMIN identity:

    • Select a nickname for this identity, give it your JASMIN username, and then choose the private key option. Inside there let smart search find your private key (and then tick it). Update and save.

  3. Now add a connection to jasmin-login1, the options are pretty straightforward. You can test it straight away. If it doesn't work, ask yourself if you need to use a VPN client on your phone/tablet first to put yourself in the right place for JASMIN to take an inbound connection.

  4. You can add a direct connection to jasmin-sci1 (or your science machine) by using the via option in the setup. Here's an example of how to configure that.

    Image: static/2013/11/13/juice.png

    In that example, "jasmin login" is the nickname I gave to the connection to jasmin-login1, and "bnl jasmin pk" is the nickname for my jasmin identity.
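For comparison, the same two-hop route can be expressed from a desktop OpenSSH client. This is a minimal sketch rather than anything JuiceSSH itself does: it assumes OpenSSH 7.3 or later (which provides the -J jump-host option), it assumes the key is already offered by an ssh agent or your ssh config, and the user name is a placeholder. The short host names are the ones used above and may need fully qualifying.

```python
import shlex


def multihop_ssh_command(user: str, login_host: str, target_host: str) -> list:
    """Build an OpenSSH command that reaches target_host via login_host."""
    return [
        "ssh",
        "-J", f"{user}@{login_host}",  # jump through the login node first
        f"{user}@{target_host}",       # then land on the science node
    ]


# "you" is a placeholder user name.
command = multihop_ssh_command("you", "jasmin-login1", "jasmin-sci1")
print(shlex.join(command))
```

The command is built as a list so that it can be handed to subprocess.run without any shell quoting concerns.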

Now you can access JASMIN, but what about editing files?

Step 3: Get and Configure DroidEdit

I'm using DroidEdit as my tool of choice for editing on Android, although, to be fair, I'm not doing that much editing on Android yet so I'd be interested in hearing if there are other better tools. I primarily chose it because it has support for both pki and editing remote files via sftp.

Once you have DroidEdit, you need to go into settings, and into the SFTP/FTP actions, "Add Remote Server". Configure with the path to your private key, and save using the server address. Ignore the fact that the test fails (note that it prompts you for a password, not a passphrase).

Then, go back out, and try to open a file on JASMIN. This time you'll get prompted for a passphrase, and voila, it should just work (as before, make sure your android device has used a VPN or whatever to be "in the right place").

by Bryan Lawrence : 2013/11/13 : 0 comments (permalink)

vertical resolution

Last week I pointed out that I wasn't at all sure the analysis by LFR89 really applied at modern horizontal grid resolutions, since the vertical scales implied for quasi-geostrophic motion didn't make sense.

I've done a wee bit more delving, and now I'm sure it's not appropriate. The analysis LFR89 did was based on the solutions of the "quasi-geostrophic pseudo-vorticity equation". This is a venerable equation, first derived by a couple of folks, but formalised by Jules Charney. It's derived by carrying out a scale analysis of the primitive equations of motion, suitable for "large scale motions where departures from hydrostatic and geostrophic equilibrium are small". I still haven't done the rederivation myself (it's a lot of bookwork, and a desultory attempt to set up sympy to do it ran out of time), but Charney himself (Charney, 1971) in an interesting paper (of some relevance here) put some bounds on the various scales of validity (see his equation 9). As a consequence, Charney points out that these equations define a band of specific horizontal and vertical scales! The fastest way to get to that band is to go back to the fuller derivation in Charney and Stern, 1961 where we get the constraints laid out more easily in the scale analysis. In particular, the quasi-Boussinesq and hydrostatic approximations give us:

L²/D < g/f² (~10⁹) and D/L < 0.1

Putting L=100km into those equations suggests that D should lie between 1 and 10km, which isn't quite the same as we get from the assertion in LFR89, that all vertical scales may appear and they are related to the expression:

L = D * N/f

(which gives us D=1km, which I think is more to do with the scales of the baroclinic wave solutions to the QG equation). Of course this scale analysis needs to be evaluated carefully in practice, and in particular, ever smaller values of L and D may have those kinds of wave solutions in the QG equations, but the equations themselves are no longer valid.
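As a quick check of the D=1km figure, the relation L = D * N/f can be inverted for D. A minimal sketch, assuming typical mid-latitude values of N = 0.01 per second and f = 1e-4 per second; neither value is stated in the post, so treat both as illustrative.

```python
def vertical_scale_m(horizontal_scale_m: float,
                     buoyancy_frequency: float = 0.01,   # N in 1/s (assumed)
                     coriolis_parameter: float = 1.0e-4  # f in 1/s (assumed)
                     ) -> float:
    """Invert L = D * N / f to give the vertical scale D in metres."""
    return horizontal_scale_m * coriolis_parameter / buoyancy_frequency


depth_m = vertical_scale_m(100.0e3)  # L = 100 km
print(f"D is about {depth_m:.0f} m for L = 100 km")
```

With these values N/f is 100, so the vertical scale is simply one hundredth of the horizontal scale.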

I'd be happy if someone could be bothered to do the derivation with constraints properly, and evaluate them completely for all scales, but until they do, I'm not going to invest too much more time in using LFR89 to give me "large" scale constraints on the vertical resolution.

Charney's paper is interesting and relevant since it also points out that the larger scales do not feed energy to the smaller scales, but it says nothing significant about other scales, such as those involved in fronts and blocking and gravity waves. As I've already argued, I think these tell us more of what we need to know. That's where we'll go next.

Update 19/09/13: I have slightly edited this post, since I didn't like the tone of a particular sentence once the dust had settled; on rereading it a day later it carried connotations that were not intended. I fixed that and added a clarifying sentence or two.

by Bryan Lawrence : 2013/09/18 (permalink)

zotero, zandy, greader, evernote, and me

Quite a while ago (i.e. years), I decided that managing my bibliographic information in a bibtex file wasn't working any longer. Back then I had a look at Mendeley and Zotero. I can't really remember why, but I chose Zotero (I think it was a combination of how it worked for me when I played with both, and that I didn't like having to use their PDF viewer. I also had some worries about Mendeley and the software and information IPR ... when Elsevier bought out Mendeley I felt vindicated on the latter.)

Anyway, now Zotero is a pretty integral part of my working environment. I use zotero standalone on my linux laptop (which is also my desktop when it's in a docking station). I make heavy use of zotfile to migrate papers to and from my Android tablet for reading (I no longer print out anything). I like being able to annotate my PDFs on the tablet, and in particular, having anything I highlighted being pulled out automagically when zotfile pulls the papers back off the tablet.

However, there are two issues with that workflow that bug me. I'd like my PDF library to be completely synchronised for offline reading on my tablet, and I'd like a fully featured native zotero client on the tablet (and my Galaxy Note phone). Zandy is the only Android app for zotero, and while it has some useful functionality (it synchronises the metadata so at least one can check on the phone/tablet if something is in my library), it doesn't synchronise the attachments completely.

(I do synchronise my attachments out from zotero standalone via webdav, which works, but one can only use it to effectively download attachments one by one to Zandy - there is no bulk download facility, and no way to annotate and upload back - it's one way sync! But you can view stuff without going to the journal which can be useful for memory jogging.)

The other thing I can't do on my Android devices, and in particular my phone, is effectively create zotero information. There are ways, I could:

  • Manually enter the information in Zandy (no thanks, the whole point of zotero is to avoid manual bibliographic entry where possible). (There is a scanner option, but I'm mostly dealing with papers found on journal websites.)

  • Use the zotero bookmarklet on chrome. Well yes, that's possible, but it fails miserably on the AMS journal websites, and requires an inordinate amount of clicking and typing. (The way you use the bookmarklet is to start typing its bookmark name into the address-bar of the page you are looking at, and if it can find a translator, it loads it into zotero.)

What I really want to do, is from a feed reader, share a journal entry straight into zotero. I can nearly do this. However, from greader, if I

  • Share to Zandy, I basically just get the paper loaded as a web page, and I have to manually fix it all later. This isn't necessarily a bad option, at least I get something, but it's often not enough, unless I do that manual step. You can guess how often I do it ...

  • Share to evernote, I can at least get the abstract and most of the body out of the RSS/atom straight into evernote (again the AMS journal feeds are hopeless). But now I have my bibliographic information in two places: abstracts in evernote, and full papers with proper references in Zotero. Searching is cumbersome.

Anyone got a better solution for (zotero based) bibliographic handling from Android (or a way of encouraging Avram Lyon, the Zandy author, to get back into active development)?

I need it to work from Android, because it's in the nature of my job that I spend a lot of time before and after meetings, travelling etc, when being able to interact with the scientific literature on my phone and tablet would make me more productive. Indeed, I do most of my journal paper triage on my phone! (No, I am not going to consider becoming an apple fan-boy!)

(Of course most of this is pointless if the paper is invisible behind a paywall. Invective removed by the editor/author.)

by Bryan Lawrence : 2013/09/11 : 1 comment (permalink)

Vertical and Horizontal Resolution

I've been delving in the literature a bit this week ... considering model resolution and various issues around it. This post is by way of notes from my reading.

One of the things to consider at any time is do we have enough resolution. Most climate scientists will tell you they need more horizontal resolution, but fewer will concede they need more vertical resolution.

It should be (but appears not to be) well known that just as one has to consider changing the time-step as horizontal resolution is increased, one needs to consider whether there is enough vertical resolution. This issue was dealt with quite a time ago in Lindzen and Fox-Rabinovitz (1989) (hereafter LFR89). There have been some recent follow-ups on the importance for chemistry (e.g. Kent et al 2012) and on model performance in general (e.g. Marques et al 2011). (It's probably worth pointing out that the latter, and references therein, point out that model convergence to reality depends as much on how the physics deals with resolution as on the dynamics, but that's a point for another day ... but if you want to go there, you could look at Pope and Stratton 2002 and Pope et al 2001, although I have to say both do a bit of special pleading to rule out extra vertical resolution.)

Anyway, I thought it might be interesting to tabulate what sorts of resolution are actually needed for various tasks. It's important to note that LFR89's analysis comes up with different resolutions for different tasks and at different latitudes. So, if we're to take LFR89 at face-value and we're interested in quasi-geostrophic scales, then we can extend their table to modern model resolutions:

  dx (deg)    dz equator    dz 60    dz 45    dz 22.5
  0.25        1m (!!)       84m      97m      69m
  0.5         3m (!!)       170m     190m     140m
  1           14m (!)       340m     390m     270m
  2           54m (!)       670m     780m     550m
  5           340m          1700m    1900m    1400m

Clearly there is a problem with this analysis in the tropics at all scales, and everywhere at 25km. Common sense suggests one can't have atmospheric phenomena with horizontal scales of over 50km with vertical scales of 1m. Pretty obviously the scaling assumptions that underlie the LFR89 use of quasi-geostrophy are broken. Which brings us to a moot point in interpreting LFR89. If one starts with a QG equation, we've already rejected a bunch of small scales which LFR89 have coming out of the analysis at modern high resolution scales. We probably need to rethink the analysis! (Which is to say, here and now, I'm not going to do that rethinking :-). See footnote 1.)

Fortunately for me (in terms of analysis), right now I'm less interested in the large-scale horizontal flows than in gravity waves. There the analysis of LFR89 is a bit more timeless. However, the analysis pretty much says that if you're interested in breaking gravity waves you need infinite resolution. They then back off and do a bit of a fudge around effective damping to suggest that to resolve gravity wave processes one needs vertical resolutions of roughly 0.006*the grid resolution in degrees. For the horizontal resolutions above, that gives us something like 1.5, 3, 6, 12 and 30m vertical resolutions.
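The gravity wave figures follow from a one-line scaling. A minimal sketch, assuming the 0.006 factor yields kilometres per degree of horizontal grid spacing; that unit is inferred from the quoted metre values rather than stated, and the function name is made up for the example.

```python
GRID_SPACINGS_DEG = [0.25, 0.5, 1, 2, 5]


def gravity_wave_dz_m(dx_deg: float, km_per_degree: float = 0.006) -> float:
    """Vertical resolution in metres implied by the 0.006 * dx(deg) rule of thumb."""
    return km_per_degree * dx_deg * 1000.0  # km -> m


for dx in GRID_SPACINGS_DEG:
    print(f"dx = {dx} deg -> dz of roughly {gravity_wave_dz_m(dx):.1f} m")
```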

That doesn't look likely any time soon!

Another approach is to look at what people have thought they need (and why). One of the reasons I started all this thinking was because I was wondering how easy it would be to repeat Watanabe et al (2008)'s work with the UM. Watanabe et al used a T213L256 model with a model top at 85km, having done a lot of previous work evaluating L250 type models. This is roughly a 0.5 degree model using the table above, and has an average vertical resolution of about 300m, which is not too far from LFR89 in the table above (at least using the value of N discussed in the footnote). Most other models seem to fall well short of that. For the UM, even studies which look at resolved gravity waves in the stratosphere have relatively coarse resolutions, e.g. Shutts and Vosper (2011) use 70 levels to a model top at 80km (again with a model resolution around 0.5 degrees). However, in that model the standard configuration had a time-stepping regime which filtered out resolved gravity waves, so the vertical resolution was constrained to be the same as in the standard model, even when used in a model which didn't filter gravity waves. Similarly, Bushell et al (2010), in a study looking at tropical waves and their interaction with ozone, used a relatively low horizontal and vertical resolution (between 1 and 4 degrees horizontally) and L60 to 84km - but again, resolved gravity waves were filtered out, and parameterisations were used.

As an aside, one of the arguments in Pope et al 2001 as to why vertical resolution is less important in the tropics is a reference to Nigam et al 1986, who they assert show that nonlinear processes smooth fields naturally so as to diminish vertical resolution requirements. This is one of the cases where I have some of my own opinions; see, for example, Rosier and Lawrence, 1999, discussing, amongst other things, pancake structures with small vertical scales in the tropical stratosphere. Given it seems that there is now a body of evidence suggesting that the troposphere does react dynamically to the middle-atmosphere in climatically important ways, that brings me nicely back to wanting more vertical resolution ... even if we buy that it's not needed in the troposphere, and I'm a long way from buying that ... yet (particularly given recent results looking at blocking and resolution in CMIP5 models: Anstey et al, 2013.)
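For a rough comparison of the configurations mentioned above, the mean level spacing is just the model top divided by the number of levels. This is only a sketch: real models space their levels unevenly, so these averages say little about the resolution at any particular height, and the dictionary labels are shorthand for the studies named in the text.

```python
def mean_level_spacing_m(model_top_km: float, n_levels: int) -> float:
    """Average vertical spacing in metres if levels were spread evenly to the top."""
    return model_top_km * 1000.0 / n_levels


# (model top in km, number of levels) as quoted in the text.
CONFIGURATIONS = {
    "Watanabe et al (T213L256)": (85.0, 256),
    "Shutts and Vosper (70 levels)": (80.0, 70),
    "L60 to 84km": (84.0, 60),
}

for label, (top_km, levels) in CONFIGURATIONS.items():
    print(f"{label}: about {mean_level_spacing_m(top_km, levels):.0f} m")
```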

However, for the UM, before I worry too much about the vertical resolution, I've got to get to the bottom of the time-step filtering I alluded to above.

1: That said, to repeat their table, I had to replace a 3 in their approximation for N with a 2, and in fact, I'd rather use a tropospheric average of N~0.012, in which case we get nearly a factor of 2 larger required resolution. However, the fundamental issue is still that I would prefer to work through the assumptions of the QG approximation; I think there is a problem in there ... but I don't have time now.

by Bryan Lawrence : 2013/09/09 : 1 comment (permalink)

Storing and manipulating environmental big data with JASMIN

We're pleased that our paper on the first year of JASMIN has been accepted by the IEEE BigData 2013 conference.

I'll put a copy of the paper online soon, but for now, here is the abstract:

Storing and manipulating environmental big data with JASMIN

B.N. Lawrence, V.L. Bennett, J. Churchill, M. Juckes, P. Kershaw, S. Pascoe, M. Pritchard, S. Pepler and A. Stephens

JASMIN is a super-data-cluster designed to provide a high-performance high-volume data analysis environment for the UK environmental science community. Thus far JASMIN has been used primarily by the atmospheric science and earth observation communities, both to support their direct scientific workflow, and the curation of data products in the STFC Centre for Environmental Data Archival (CEDA). Initial JASMIN configuration and first experiences are reported here. Useful improvements in scientific workflow are presented. It is clear from the explosive growth in stored data and use that there was a pent up demand for a suitable big-data analysis environment. This demand is not yet satisfied, in part because JASMIN does not yet have enough compute, the storage is fully allocated, and not all software needs are met. Plans to address these constraints are introduced.

by Bryan Lawrence : 2013/08/28 : Categories abstracts jasmin : 0 comments (permalink)

Gavin's Proposal

Gavin made some proposals on my blog for how he thought an easy automatic DOI system could be set up for CMIP data in ESGF.

I promised a reply in a day or so. Oh well, small unit error, perhaps week or so (or two) would have been a better promise. I had hoped to agree my reply with Stephen, but this is the season for not overlapping. So, here we are.

I've taken Gavin's proposal, and broken it down into points. We'll go through them and analyse them against what ESGF can do now, my description of what a DOI should do, and my prejudice ... (In this first section, Gavin's proposal leads each enumerated point, my responses are in the bullets). After that, I'll make a modified proposal.

  1. Each simulation (not ensemble!) (a unique model, experiment, rip number) is associated with a DOI.

    • For now, let's just ask, can this be done with another sort of identifier? The answer is yes, we can do that with a URL now, and indeed, with an unambiguous identifier (the DRS URL stripped of the host name). For example, for this dataset:

      Image: static/2013/08/23/categ.png

      From the url at our esgf node we can follow the catalogue link to the data metadata page and find the following ids for this dataset:

      cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1 or cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1.v20110803 or cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1.v20110803|

    • Then, the next question is what value would a DOI add to that URL?

      • Well, it should add persistence. I've already argued that I'm not happy about that unless we're sure this data won't be superseded. More on this anon.

      • It would add publisher credit when used. Hmm, that introduces multiple sub-questions. Is this the right granularity? Do the publishers get too many citations?

      • It could be used by consumers? Again, is this the right granularity, how do we see this appearing in a reference list? How many of them would appear in your typical paper?

      • I'm going to save these granularity questions for later, but they're key to part of my answer (informed by prejudice, experience, and wider goals).

  2. This DOI is technically a collection of unique tracking IDs (one per file) AND any related DOIs (perhaps a small number of classes). The classes of related DOIs could be a previous version (with either trivial or substantive changes to one or more of the individual files) or to DOIs in the same ensemble. Think 'sibling' or 'parent' DOIs perhaps.

    • No, it's technically a link to a landing page (if it's a DOI), which then links to a real data object. That aside, yes the digital object could be a list of tracking-ids, and a list of typed-links to other objects. It's a new kind of digital object, but there is much to like about it, not least that a list of identifiers is unlikely to be supersede-able.

    • It's not obvious to me that the first part of this (the tracking-id list) wouldn't be better to be a list of DRS URLs (including versioning) ...

  3. This gets assigned for every simulation on publication and is under the control of each ESGF node.

    • This is already done in the Thredds catalogue, if this is a DRS URL.

  4. If some data gets corrected, you get a new DOI, but it may be that most of the tracking IDs are the same and so people who didn't use the corrected file will be able to see that the data they used was the same.

    • This is already done by the ESGF software using tools that Stephen Pascoe built, provided this is a DRS URL, and most importantly, the publisher node takes the trouble to do it. Sadly, most have not! So, a priori, we already know that all the publishers aren't going to do this ... because the automagic behaviour requires configuration (and file management) that thus far, folks haven't wanted to do. Caveat: what I think is required here is for the identifiers of deleted data to be kept, and I don't think our current tools do that, so there is some work to be done.

  5. The original DOI is part of the collection, and whether people use the original or subsequent ones, you could build an easy tool to see whether your DOI was cited as either a parent or sibling DOI in other collections.

    • If you replace DOI with identifier in this sentence, I'm comfortable with it, but this is not a DOI (for granularity reasons I'll get to).

  6. So, the DOIs (which contain only tracking IDs and other DOIs) are persistent. They allow a unique mapping to the data as used, but also to any updates/corrections. Groups can use them to see who used their data, they can be assigned automatically. The only metadata required are trivially available via the filename structure.

    • All good, all likeable, but it then requires the development of an entirely new ecosystem of citation tools (for counting, comparison etc), and it's not equal in any sense to a traditional academic citation.
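The kind of digital object under discussion can be sketched in a few lines: a versioned DRS identifier, a set of per-file tracking ids, and typed links to related collections. The field order follows the CMIP5 DRS dataset id shown above, but the field labels, the class, the second version string and the tracking ids below are all made up for the illustration; this is not how ESGF implements anything.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Labels are illustrative names for the dot-separated DRS components.
DRS_FIELDS = ("activity", "product", "institute", "model", "experiment",
              "frequency", "realm", "mip_table", "ensemble_member", "version")


def parse_drs_id(drs_id: str) -> Dict[str, str]:
    """Split a DRS dataset id into named parts (the version part is optional)."""
    parts = drs_id.split(".")
    if len(parts) not in (len(DRS_FIELDS) - 1, len(DRS_FIELDS)):
        raise ValueError(f"unexpected number of DRS components: {drs_id}")
    return dict(zip(DRS_FIELDS, parts))


@dataclass(frozen=True)
class DatasetCollection:
    identifier: str                           # e.g. a versioned DRS id
    tracking_ids: FrozenSet[str]              # one per file
    related: Tuple[Tuple[str, str], ...] = () # (relation, identifier) pairs


def unchanged_files(old: DatasetCollection, new: DatasetCollection) -> FrozenSet[str]:
    """Tracking ids present in both collections, i.e. files left untouched."""
    return old.tracking_ids & new.tracking_ids


base = "cmip5.output1.MOHC.HadGEM2-A.amip.6hr.atmos.6hrLev.r1i1p1"
original = DatasetCollection(base + ".v20110803", frozenset({"file-a", "file-b", "file-c"}))
# Hypothetical corrected version in which one file was replaced.
corrected = DatasetCollection(
    base + ".v20120101",
    frozenset({"file-a", "file-b", "file-d"}),
    related=(("parent", original.identifier),),
)

print(parse_drs_id(original.identifier)["ensemble_member"])
print(sorted(unchanged_files(original, corrected)))
```

Someone who used only the files common to both versions could then show that a correction did not touch their data, which is the behaviour point 4 asks for.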

So, I don't think there are any insurmountable technical hurdles in what is proposed, but I have three substantive objections, two of which have been hinted about above.

  • Respect: One of the major reasons for wanting to use DOIs (as opposed to any other identifiers) is that if we get the granularity and procedures right, all the existing citation tools, and the shared academic understanding of what a citation means, can be used. Do something different, and we have loads of new (counting/comparator) tools to build, and more importantly, we have to convince every administration you can think of that these new "citations" have merit (in comparison to traditional citations). Administrations include your organisation, whoever you are: will a data publication rank the same as a paper publication? I hope so, especially if it's in a data journal like one of these: Scientific Data, Geoscience Data Journal, Earth System Science Data etc. Administrations also include journals. Will they be happy for this citation to appear in a reference list?

    • The bottom line: Does the above proposal "get the granularity and procedure right"? Do we anticipate that an ESGF DOI should rank the same as one of those? If not, why not? How will a consumer differentiate? (My answers: no; not as proposed; wrong granularity, no formal procedures; they'll not use them, so it won't be a problem.)

  • Granularity: When we think about traditional publications, we have clear notions of what a publication actually is, and we have good concepts about how to refer to parts of publications. (We know about papers, books, chapters, pages etc.). Data is different. Why choose to cite simulations (as opposed to variables, bytes, ensembles etc)? The answer has to be that the right granularity is an optimisation of the needs of all parties (producers, consumers, publishers). A key a priori assumption is that we cannot expect it to be right for all of them, so whatever we choose, we need a method of referring into a publication unit, and we need a method of sensibly aggregating published units.

    • For reasons I'll get to, I think this proposal has the wrong granularity.

  • Scientific Usefulness/Appropriateness: and now we get to the one I care about most, and the one that is most subjective. In this instance the thing I like least is giving a DOI to an ensemble member. Why not give it to an ensemble?

    • PRO: There is an obvious mechanism for referring to an ensemble member within an ensemble.

    • PRO: Most times one would expect a scientific analysis to exploit an ensemble (although occasionally one can only afford to have one member).

    • PRO: It is easy to provide more complete metadata (once) for an ensemble.

      • Disclosure: I think that this is much more important than Gavin does, but that's because I also deal with many user communities who are not modelling experts. We both agree we have yet to get this right, but that's not a reason to give it up.

    • CON: (Anticipated) Why choose the ensemble axis to be the ensemble of one model's integrations, not a multi-model ensemble? (Answer: because many use cases are not multi-model, and we can deal with multi-model ensembles as aggregations of single-model ensembles.)

    • CON: Ensembles are always being added to. When is it finished? What do I do if I want to use the data before the n'th member is available?

    • CON: The metadata for describing the difference between the ensemble members may be incomplete. (Yeah, so, ...)

OK, so what are the minimal changes I'd make to Gavin's proposal to make it something I do like? (Because it's close). What issues would arise?

  1. Granularity: use ensembles, but make sure there is a well understood notation for citing into ensembles to specific ensemble members, and that there is a preprint notation for citations into ensembles which are known to be incomplete, and an edition notation for ensembles which have grown after Publication.

    • We would need to think this through for publication and Publication inside ESGF, but this ought not be hard.

    • Aggregations could be done in new publications, in the outside world, using for example, the data journals listed above.

  2. Version Control: I like this proposal a lot, in that it takes out the need to keep the data (perhaps). I'd need Stephen to work out what the consequences are for Thredds and the DRS versioning library of keeping the tracking ids of deleted files etc, but it's doable, albeit work.

  3. Respect. Since this is now (nearly) automagic within ESGF, you'd want some method of ensuring that only reputable modelling groups can put their data in. I think ensembles are about the right granularity to rank with traditional publications.

  4. Content Standard: We'd need to come up with an appropriate manifest format for the tracking ids (or maybe DRS URLs instead, but we'd still be holding somewhere lists of tracking ids mapping onto URLs). We'd need to control the vocabulary for relationships. All doable, but starting to look like significant work.

  5. Persistence. ESGF data nodes come and go. We'd need the usual suspects to commit to maintain this long-term, which means the funders to commit to the work, and the maintenance. So we'd need to work this up into a business case. (Not a grant: this would have to be underpinned by long-term money, not soft-money.)

    • The existing CMIP DOI system is underpinned primarily by the IPCC-DDC commitments. Either we get the IPCC-DDC group to agree to do this, and/or we get the relevant funding agencies in Germany, the UK and US (at least) to buy into it.
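To make the notation asked for in item 1 a bit more concrete, here is a minimal Python sketch. Everything in it is hypothetical: the DOI, the "@preprint" and "@edN" markers, the "#" separator for a member, and the names EnsembleCitation and parse are made up for illustration, and are not an agreed ESGF or DOI convention. The member label follows the CMIP5-style r<N>i<M>p<L> naming. A real notation would also need to cope with DOIs that themselves contain "@" or "#", which this sketch does not.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EnsembleCitation:
    """A citation to an ensemble, optionally narrowed to one member."""
    doi: str                      # identifier of the whole ensemble
    edition: int = 1              # bumped when members are added after Publication
    preprint: bool = False        # True while the ensemble is known to be incomplete
    member: Optional[str] = None  # e.g. "r1i1p1"; None means the whole ensemble

    def render(self) -> str:
        """Render the citation as one string (hypothetical syntax)."""
        status = "preprint" if self.preprint else f"ed{self.edition}"
        text = f"doi:{self.doi}@{status}"
        if self.member is not None:
            text += f"#{self.member}"
        return text


def parse(text: str) -> EnsembleCitation:
    """Inverse of render(): recover the citation from its string form."""
    body, _, member = text.partition("#")
    doi_part, _, status = body.partition("@")
    if not doi_part.startswith("doi:"):
        raise ValueError(f"not a recognised citation: {text!r}")
    if status == "preprint":
        preprint, edition = True, 1
    elif status.startswith("ed") and status[2:].isdigit():
        preprint, edition = False, int(status[2:])
    else:
        raise ValueError(f"unknown status in citation: {text!r}")
    return EnsembleCitation(
        doi=doi_part[len("doi:"):],
        edition=edition,
        preprint=preprint,
        member=member or None,
    )
```

The point of the sketch is only that one identifier (the ensemble DOI) can carry the credit, while the member, preprint and edition markers sit alongside it and do not need identifiers of their own.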

One last caveat. I happen to believe the existing DOI plan for CMIP5 is not broken, it's just under-resourced, and the community has not bought into it enough. I wouldn't want to do this new thing, and stop doing the existing activity. That has implications that need thinking through too!

by Bryan Lawrence : 2013/08/23 : 4 comments (permalink)

github repo for badc text file code

Some years ago, my colleagues Graham Parton and Sam Pepler developed a metadata content standard to allow the ingestion of comma separated value (csv) files of data into the BADC archive. What they wanted to do was allow folks whose primary data manipulation tool was a spreadsheet to have a viable and easy mechanism of getting their data into our managed environment.

Data in a managed environment needs to be in well known formats that can be persisted for a long time, and it has to have adequate metadata. Normal (e.g. Excel) spreadsheets fail both criteria: over the years the format has changed, and has not always been well enough documented that one would have confidence that all the content could be recovered in the future (when the original software is no longer available). CSV files exported from spreadsheets can be persisted, but like their spreadsheets, don't necessarily have the right metadata.

The BADC text file "format" (it's not really a format) was designed to address those metadata requirements with some modest mandatory information and a syntax for optional information. BADC has run a checker (so that folks can check their files comply) for some years now, but never made any code for using/manipulating these files available.

A few days ago I discovered that gap (since I wanted to have some files hang around on my desktop, and I've long since discovered that I need metadata, never mind the BADC). I could have used NetCDF, but I have a policy of trying out BADC stuff whenever I can, so I tried the text files, and wasn't happy that I couldn't find any public code.

Well, there is public code now, in a github repo!

by Bryan Lawrence : 2013/08/22 : 2 comments (permalink)

confusion on dois

The discussion on twitter and my blog about DOIs and CMIP5 and what did and didn't work reflects a bunch of different perspectives.

These include not only:

  • the modeller who wants a metric of their data usage (the producers)

  • the scientist who wants to accurately describe the data they used in some study (the consumers),

but also (at least):

  • the organisation/group/individual responsible for data publication (as opposed to Publication),

  • those of us who distinguish between publication and Publication (e.g. see Lawrence et al 2011),

  • journals in which one might want to use DOIs,

  • service providers for DOI machinery, and

  • those of us trying to persist the AR5 and CMIP5 archives.

As usual, all these folk are using the same words to describe different objectives (this behaviour has been constant since the dawn of time, and a thread on this blog since I started blogging, as has citation). It would help a lot if folk defined what they meant by "we want DOIs".

When done, which I'll try to do a bit below, we find that from some of those perspectives, the problem seems tractable and easy (e.g. Gavin's "it can't be that hard"), but if we take them all on board, it gets hard (e.g. Stephen's "These problems are being worked on but not on the scale we'd need for CMIP5").

Apologies in advance. I didn't have time to write anything shorter.

Taking these in turn. I think

  • The modeller ("producer") wants identifiers for their data, which others cite, and wants those citations to be countable (and perhaps, where those citations occur, to be identifiable). They want this to work over time, and they know that a URL won't work, since a) their data appears at more than one URL, and b) URLs aren't persistent, so where their data appears in a URL changes with time. Everyone wants the granularity of these objects to be "appropriate" for credit.

  • The scientist ("consumer") wants an identifier-like tool which unambiguously defines what data they used, and some scientists would like this to show when similar data was used. (That is, I used some data, and you used some data, and most of our data was the same, so it'd be nice that the fact that most of our data was the same was somehow recognisable easily via the identifiers). Many consumers also respect the necessity for giving producers credit. If we unpick this requirement even further, we find sub-requirements:

    • The urge to have an appropriate citation to the data used in a particular work, and

    • The urge to have a record of what data was used in a particular work, and maybe even

    • The urge to have a manifest that could take part in some future workflow.

  • The publisher wants to be able to handle getting the data "out there" efficiently, and with the minimum of human intervention. Sadly, dealing with version control is low down on their requirements ("what, you changed one minor thing, and I have to republish tens of thousands of files, and you want the old ones to remain?").

  • We already have a de facto concept of digital publication (here defined as "being available on the internet") in citation: journals allow you to use URLs, and everyone knows those are "weak" citations, and no one counts them (properly). We also have Publication (here defined as "available after going through a review process and persisted"). Published items appear in bibliographies and are counted by citation services.

  • Journals are picky about what they allow in reference lists, for good reasons which are about trying to make sure that "references" are "obtainable". If we want data p/Publications in reference lists, then we need to ensure that they meet appropriate standards - otherwise they can (IMHO should) languish in footnotes and parentheses like URLs.

  • Digital Object Identifiers are expected to be persistent, so that if the location of the object changes, you can still dereference the object. That's pretty much the entire purpose of them ... so if we are going to stand up a DOI, we have to have some machinery behind it that takes the DOI reference to the actual digital objects. That machinery has to be reliable and predictable, which in software terms means it has to have well defined characteristics and maintainable software.

  • In the context of CMIP5 we already have the Data Reference Syntax (DRS, pdf) that provides a consistent URL naming scheme, which if the hostname is left off, unambiguously defines the data files, and some pre-defined aggregations. In the context of the IPCC AR5 archive (the CMIP5 data available on the cut-off date), we want to have DOIs for the simulations. We think simulations (which include ensembles) are about the right granularity. To make this manageable, in both cases we have some versioning rules, and for DOIs, we expect the data to have passed enough quality control that it is unlikely to need multiple versions; that really matters if we want to persist at petascale.
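The "leave the hostname off" idea in that last point can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the host names and the directory layout below are made up (real data node URLs may carry extra server-specific prefixes that would also need stripping), and drs_identity is a hypothetical helper, not part of any ESGF tool.

```python
from urllib.parse import urlsplit


def drs_identity(url: str) -> str:
    """Return the host-independent part of a data URL: its path."""
    return urlsplit(url).path.lstrip("/")


# Two hypothetical replicas of the same file on different data nodes.
path = "cmip5/output1/SOME-INSTITUTE/SOME-MODEL/rcp45/mon/atmos/r1i1p1/v1/tas/tas_example.nc"
replica_a = "http://node-a.example.org/" + path
replica_b = "https://node-b.example.net/" + path

# With the scheme and hostname dropped, both replicas name the same thing.
same_file = drs_identity(replica_a) == drs_identity(replica_b)
```

So the DRS path identifies the file wherever it is replicated, but it says nothing about persistence: that still depends on someone keeping the data at some host.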

OK, now to revisit the recent discussion. This from Gavin:

(1) The problem with doi as a standard for data was recognised years ago - it is good for static objects that don't have relationships with other objects. Data is not like that - it gets updated, corrected, rescaled, redated etc.

and earlier, when prompted to distinguish between a URL and a DOI:

(2) doi can be static through time, URLs not; neither deal with versioning; URL is at file level, not dataset

and these:

(3) The 'defined quality' issue for DOIs is a red herring. There is no way to judge this objectively, and so no system to do it will be satisfactory. As we have seen with the DKRZ project, it takes too long and makes the process useless.

(4) Access to the data is a definite goal, tracking what was done is a definite goal, and automatic tracking of where data was used is a definite goal.

We also have these from Aslak Grinsted:

(5) I would like to see the DOI being used purely to point at what data files we are talking about. I don't think it should contain any other info at all. Nothing about quality or relationships to other data series.

(6) i would sacrifice persistence if it fails QC and storage is an issue. I know this makes science less reproducible, but there is nothing preventing people from using the data at the moment. They just struggle with citing it. When data is removed because of a QC issue then we would still have a records of its relationship to new versions. Thus the system can inform the user "sorry this file is no longer available because of some issue. A corrected version of the file can be found here."

I think you can immediately see that Aslak and Gavin both have consumer perspectives, and legitimate ones at that, but they need to be discussed in the larger context ...

So let's take those points above:

1) The problem with doi as a standard for data was recognised years ago - it is good for static objects that don't have relationships with other objects. Data is not like that - it gets updated, corrected, rescaled, redated etc.

Sorry Gavin, you're plain wrong. Data doesn't change. But yes, it might get superseded, processed, etc, leading to new data objects. What you all want to happen for a digital object identifier is that when you have used data, we know what data you have used. You often want someone discovering that data to find it's been superseded, but if science is to be reproducible, then you want to (always be able to) find the original data. It is an additional requirement to find relationships between data objects, something to layer on top of a reliable identifier system. What you've asked for here is precisely what the linked data paradigm was invented to address, but linked data is all about making typed links between digital objects, and yes, they need identifiers. So, DOIs are not the solution to the problem you outline, but they (can) be a key part of the solution (as can other types of identifiers).

2) doi can be static through time, URLs not; neither deal with versioning; URL is at file level, not dataset

Gavin wrote this in answer to a question of mine that was prompting him to say something like this. It was intended to be a bit of a rhetorical trap, and given the 140 character limit, it worked in that Gavin's answer is a wee bit wrong. DOIs are meant to be static in time, but they should point to a landing page which itself points to one or more URLs to the object in question. One expects to change the landing page if the URL where the objects can be found changes ... (and indeed, one can, by convention, add other stuff to the landing page, like links to related objects, such as a new version). It's important to note that URLs can point to anything you can dereference on the web, and that could include a file system directory (as opposed to a file), so neither URLs nor DOIs know the difference between a file and a collection of files. Those who allocate the DOIs get to choose the granularity (or level of aggregation) by convention.

The point being that all a DOI is, is a mechanism for making a link between some notion of a resource (the object with the DOI identifier as described by the landing page), and its current location (URL). Thus it turns out that if you don't want persistence and a resource description, there is NO DIFFERENCE BETWEEN A URL AND A DOI. (However, there is one extra thing usually involved with a DOI, and that's an individual or organisation that allocates the DOIs and manages the landing pages, we might call them the publisher! A publisher can add value to a DOI, but the notion of publication can be maintained with URLs as well, so it's not that big of a discriminator ... except that the publishing world generally recognises that there is a publisher for a DOI, whereas there only might be one for a URL, and so not just anyone can allocate a DOI).
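A toy model of that indirection, in Python, might look like the following. It is a sketch only: the classes LandingPage and Publisher, the "superseded_by" relationship name, and the example DOI and URLs are all hypothetical, and none of this is the API of any real DOI registration or resolution service.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class LandingPage:
    """What a DOI resolves to: a description plus current locations."""
    description: str
    urls: List[str]  # where the object can be found today
    related: Dict[str, str] = field(default_factory=dict)  # e.g. {"superseded_by": another DOI}


class Publisher:
    """Allocates DOIs and maintains their landing pages."""

    def __init__(self) -> None:
        self._pages: Dict[str, LandingPage] = {}

    def allocate(self, doi: str, page: LandingPage) -> None:
        """Register a new DOI; an existing DOI is never reassigned."""
        if doi in self._pages:
            raise ValueError(f"{doi} is already allocated")
        self._pages[doi] = page

    def move(self, doi: str, new_urls: List[str]) -> None:
        """The data moved: update the landing page, leave the DOI alone."""
        self._pages[doi].urls = list(new_urls)

    def resolve(self, doi: str) -> LandingPage:
        """Dereference a DOI to its current landing page."""
        return self._pages[doi]


publisher = Publisher()
publisher.allocate(
    "10.9999/example",
    LandingPage("An example ensemble", ["http://old-host.example.org/data/"]),
)
publisher.move("10.9999/example", ["http://new-host.example.org/data/"])
current = publisher.resolve("10.9999/example").urls
```

A citation made before the move still resolves afterwards, which is the whole of what the DOI adds over the bare URL; take away the commitment to keep calling move() and you are back to a URL.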

3) The "defined quality" issue for DOIs is a red herring. There is no way to judge this objectively, and so no system to do it will be satisfactory. As we have seen with the DKRZ project, it takes too long and makes the process useless.

Room for disagreement here. My definition of quality control for the AR5 archive is that "the data is not likely to be superseded because of easily identifiable errors in the format and/or science". Experience tells us that MIP data is often superseded, and simple QC catches most of those errors. Most is good enough here. The QC is not about science quality per se, and we can define the sort of QC we want. However, Gavin is right, the process for CMIP5 has been too slow, and that's partly because no one is paying for this (so no one has a day job to do the QC), and partly because the modelling groups themselves have never bought into it properly (so necessary metadata isn't always finished).

  • I'm sure Gavin could respond and say that the metadata we asked for was not what was needed, and to some extent he'd be right, but the bottom line here is we need consistent metadata, and right now this is what we asked for ... and it's not possible to keep chopping and changing requirements and deliver a service ...

4) Access to the data is a definite goal, tracking what was done is a definite goal, and automatic tracking of where data was used is a definite goal.

We agree on those goals. The question is how to achieve them. I hope it's clear that we can't keep every version of the data (and for some data, there have been lots of versions). So, ideally we don't have citations to superseded data in the literature. The problem is how to do it automatically. True automatic tracking would require a truly persistent data architecture, and a truly persistent workflow toolset. We don't have either ... and frankly, I don't think complete automation is the answer ... Stephen's comment addressed some of the reasons why: if an arbitrary bag of objects gets a DOI every time, then the DOI has no use except in workflow, since you can't compare them, it's not easy to count them (for provider credit, in a way that doesn't reward providers in strange ways), and it's a huge overhead on the machinery of the service provider. The bottom line is that if you can completely automate the assignation of an identifier pointing to a bag of files, this is obviously a good thing, but it's not a DOI! (Where is the role of Publishing?) So, yes, let's consider what machinery we need to do this, but let's not confuse that with a DOI ...

5) I would like to see the DOI being used purely to point at what data files we are talking about. I don't think it should contain any other info at all. Nothing about quality or relationships to other data series.

Yes, the DOI job is just to point at the data. The argument about QC is just to avoid carrying too many of them, and yes, the relationships between data should lie above the DOIs per se ... however, when we think of a DOI as representing a Published dataset, we can at least include the concept of allowing later versions to be signalled on the landing page.

6) I would sacrifice persistence if it fails QC and storage is an issue. I know this makes science less reproducible, but there is nothing preventing people from using the data at the moment. They just struggle with citing it. When data is removed because of a QC issue then we would still have a records of its relationship to new versions. Thus the system can inform the user "sorry this file is no longer available because of some issue. A corrected version of the file can be found here."

I don't see why you would bother with a DOI if you take away persistence. What value does it offer over a URL?

So the bottom line in my responses to both: if you are willing to throw away persistence (Aslak) and you think the identifier has to be automatic and hide versioning (Gavin), then neither of you wants a DOI per se, you want something else ...

I think both of you might be happier with some sort of automatic manifest wrapping up DRS URLs, itself available at a URL. I think that would satisfy your provenance citation, but I don't think it would hit many of the other requirements. I think we'd want DOIs as well!

In summary, the identifiers you need for workflow provenance are probably not the same ones you need to support the scientific need to cite. I don't think there is any value for citation in completely opaque identifiers to arbitrary grab bags of files (as I've argued above). I can see the value of them in workflow, but not citation.
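For what it's worth, such an automatic manifest is not much code. The Python sketch below is one way it might look, under assumptions of my own: the manifest is just a sorted list of host-independent DRS paths, the identifier is a SHA-256 hash of that list, and the function names and example paths are hypothetical. No such format has been agreed.

```python
import hashlib
from typing import Iterable


def build_manifest(drs_paths: Iterable[str]) -> dict:
    """Wrap a bag of DRS paths in a manifest with a content-derived id."""
    files = sorted(set(drs_paths))  # order and duplicates must not matter
    digest = hashlib.sha256("\n".join(files).encode("utf-8")).hexdigest()
    return {"id": digest, "files": files}


def overlap(a: dict, b: dict) -> float:
    """Fraction of files shared by two manifests (Jaccard index)."""
    files_a, files_b = set(a["files"]), set(b["files"])
    union = files_a | files_b
    return len(files_a & files_b) / len(union) if union else 1.0


# Hypothetical file lists for two studies that mostly used the same data.
mine = build_manifest(["model-x/r1i1p1/tas.nc", "model-x/r2i1p1/tas.nc", "model-x/r3i1p1/tas.nc"])
yours = build_manifest(["model-x/r1i1p1/tas.nc", "model-x/r2i1p1/tas.nc", "model-x/r4i1p1/tas.nc"])
shared = overlap(mine, yours)
```

Note what this does and doesn't give you: the two ids are completely different even though most of the data is the same, so the similarity is only visible to someone who fetches both manifests and compares the file lists. That is fine for workflow provenance, and it is exactly why the opaque id on its own is not much use for citation.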

Finally, one last point, for Aslak, and everyone else. It's not silly to ask questions, or promulgate solutions, I'm really glad you are, please keep doing it. The only silly questions are the ones we don't ask (since it's silly not to ask a question if you don't know the answer and want to know). It's also not silly to propose answers to questions, again, that might be the only way to find there is a well known better answer etc ... (or to bring enlightenment to those of us stuck in the trenches, we sometimes need to lift up our eyes).

by Bryan Lawrence : 2013/08/13 : 10 comments (permalink)

DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.