Bryan Lawrence

... personal wiki, blog and notes

Bryan's Blog

(Only the last ten entries are here; earlier entries can be found via the Summary, Archive or Categories pages, or by using the calendar to go to previous months and/or years.)

A citation and provenance system for climate modelling

What would a modelling citation and provenance system need to do?

I've thought about this before, more than once, but this is a first principles use case description.

We start from the assumption that I will be accessing a local "copy" of some data files, and that I have a subset of those files that I've used for a particular problem.

So, I have to describe that compendium of data, which means I need a tool to identify the data I used ... It needs to be able to do something notionally like:

makecite "list of files" > provenance.list

What's actually in provenance.list should be a list of permanent identifiers to data actually used, not the data itself.
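
To make this concrete, here is a minimal sketch of what makecite might do. It assumes (and this is an assumption, not a design) CMIP-style netCDF files which each carry a tracking_id global attribute that can stand in for a permanent identifier; the tool itself, remember, is notional:

    # makecite (sketch): emit a permanent identifier for each data file.
    # ASSUMPTION: CMIP-style netCDF files carrying a "tracking_id" global
    # attribute; a real tool would need a more robust identifier scheme.
    import sys
    from netCDF4 import Dataset

    def makecite(filenames):
        """Yield the permanent identifier carried by each file."""
        for name in filenames:
            with Dataset(name) as nc:
                yield nc.getncattr('tracking_id')

    if __name__ == '__main__':
        # usage: python makecite.py file1.nc file2.nc ... > provenance.list
        for identifier in makecite(sys.argv[1:]):
            print(identifier)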

I expect I will want to cite this provenance.list in my publication, so the provenance list itself should be a (published) dataset, with an identifier. So, there needs to be a way of describing and publishing my provenance.list.

Now you, reading my paper, need to be able to obtain and use that provenance list. Assuming my provenance.list has a DOI, getting it should be straightforward (it should be small).

Now you need a tool which allows you to use the provenance list to get the relevant data or check that you already have it, something like:

citeget provenance.list

which should result in a set of files, or

citecheck provenance.list

might confirm that you have those files. Alternatively (or additionally)

citeupdate provenance.list

might give you (or me) an updated set of versions for the same datasets ...
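
As a sketch of the simplest of these, citecheck might do little more than compare identifier sets, assuming (as above, an assumption) one identifier per line in provenance.list and local netCDF files carrying a matching tracking_id attribute:

    # citecheck (sketch): confirm local files cover a provenance list.
    # ASSUMPTION: one identifier per line in the list, and local netCDF
    # files which carry a matching "tracking_id" global attribute.
    import glob
    from netCDF4 import Dataset

    def citecheck(provenance_file, local_pattern='*.nc'):
        """Report missing identifiers; return True if all are present."""
        with open(provenance_file) as f:
            wanted = {line.strip() for line in f if line.strip()}
        have = set()
        for name in glob.glob(local_pattern):
            with Dataset(name) as nc:
                have.add(getattr(nc, 'tracking_id', None))
        missing = wanted - have
        for identifier in sorted(missing):
            print('missing:', identifier)
        return not missing

citeget would be the harder tool: it would have to resolve each identifier to a copy of the data it is allowed to fetch, and that resolution service is part of what doesn't yet exist.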

That user story is very file-centric. We could probably make it more "data-centric" by, for example, including OPeNDAP URLs to bounding boxes, but as it stands it's very simple, and hopefully doable (none of these tools actually exist!).

This story doesn't address credit, but it does address scientific repeatability!

So what to do about credit? We could of course pull out of the list of permanent identifiers a list of contributing simulations.
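
If the permanent identifiers encoded (or could be resolved to) their source simulation, that extraction would be trivial. Here's a toy version which simply assumes an invented identifier layout of the form prefix/simulation/dataset - real identifiers would need a resolution service instead:

    # Toy extraction of contributing simulations from a provenance list.
    # ASSUMPTION: identifiers laid out as prefix/simulation/dataset,
    # invented purely for illustration.
    def contributing_simulations(provenance_file):
        simulations = set()
        with open(provenance_file) as f:
            for line in f:
                identifier = line.strip()
                if identifier:
                    simulations.add(identifier.split('/')[1])
        return sorted(simulations)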

What to do with it? Do we believe it will be possible to go from those simulation identifiers to appropriate "traditional papers"? In principle yes, in practice no. We can expect to do this exercise before the appropriate formal scientific model and simulation description papers have even been written!

So, can one use "data" DOIs? It rather depends on whether we believe an appropriate data publication system is in place, and operating at an appropriate granularity. However, that too may not be the case when the citation is necessary.

However, that's a very traditional way of thinking: that we have to give the modelling group credit by putting a traditional citation to them in my paper. If one has a more altmetrics focus, we should be happy that the metrics can be calculated; we don't have to have the right way of doing it a priori!

by Bryan Lawrence : 2015/03/02 : 1 comment (permalink)

Pagico Experience at week one.

Ok, I promised to report my experience with Pagico at week one.

I really like the dashboard view, and the must-do and might-do lists ... I found it a really good way of thinking about what I need to do next. However, the bottom line is that it's harder than I'd like to get information into Pagico, and to move between Pagico's view and other views of information (in particular, Evernote).

Email integration: I did go ahead and put my gmail through Apple Mail to investigate. I can drag and drop an email from Apple Mail into Pagico, but it ends up as an item in a collection of items for the Project. So, if I understand it correctly, if I want to postpone answering an email because it'll take more than a couple of minutes, the expected workflow is: drag and drop it into a project collection, then create a task, then give it a due date. Not really frictionless; I wanted emails to become tasks with minimal intervention.

However, it looks like there is a better option for (Apple/Outlook) mail integration if you have MailTags installed. I don't, but I suspect I would if I lived in Apple Mail land. (Actually, if you live in Apple Mail land, and email is the main issue for you in terms of GTD, MailTags might be of interest in its own right.)

Evernote integration is simply via the ability to drag and drop links, which point to the browser version of a note and always open in a tab which requires logging in. Compared to competitors, this isn't really integration.

Drag and drop feels incomplete. In some views I can easily drag files onto tasks, but in other places, come what may, I can't avoid dropping them onto the parent project instead. I found that frustrating ...

So, my final feeling with Pagico - at the moment - is that it has a really good interface for task management, but until they fix email and Evernote integration properly, and deal with the drag and drop issues, the friction of getting information into Pagico is just too high. I'm going to look elsewhere (and hope I can get all my tasks out of my trial version of Pagico). I could easily be persuaded to come back if they sorted the Evernote/email integration.

Still, I'm inspired enough to keep on with GTD tools, so I expect there will be more to report anon.

by Bryan Lawrence : 2015/02/01 : 0 comments (permalink)

Pagico and getting things done

Getting Things Done. GTD.

My colleague Eric Guilyardi showed me Pagico on Friday. I was sufficiently impressed that I spent a few hours on Saturday morning playing with it.

Why? Well, I'm continually feeling hassled by the number of things that I'm trying to keep track of, the size of my inbox etc, so a good GTD tool has always been something I've been looking for.

Years ago I used Remember the Milk (successfully, for some months, but in the end it couldn't deal with the complexity of information I wanted to store in it). I've used a range of notes tools, and I'm currently using Evernote. I use it for everything of course, but for GTD I just use a weekly to-do list with check boxes; it doesn't organise things ... and I find I ignore the reminders ... so is there something better out there?

Well, Pagico on Eric's screen certainly looked like it. I particularly liked the someday tasks that show up for tomorrow. I liked the "dashboard" (pseudo-Gantt) ... and when I read about it I liked the idea of Evernote and email integration. So, as I said, I spent some hours with it. Of course, in doing so, I did a bit of googling ... and started wondering whether Pagico was really what I want.

However, meanwhile, in just trying to work out how to use Pagico, I did some really useful thinking about how to organise my workflow into task lists, tasks, projects and collections. I (manually) moved a bunch of emails into Pagico (and archived them in Gmail). I archived everything else. I achieved inbox zero for the first time in, well, it seems like forever (certainly at least a year). The lesson I take from that particular exercise is that the "organise" part of GTD is incredibly important, and probably independent of the tool (provided it has at least three levels of hierarchy).

Experience with Pagico itself? Well, I had some little glitches I didn't like, so on Saturday I wrote to the developer. On Sunday I had a reply; I replied; he replied. Blimey, that's responsive (and I made it clear I was only using the trial version and might not buy). Sounds like he'll fix some of the things I didn't like or wanted changed. Blimey again.

At this point I have a lot of actions in a few projects. We'll see how the week goes, but I'm already a bit disappointed in that the Evernote integration is weak - one only gets to drag a web-link in, so it doesn't work with the native (Mac1) application I use. Also, my email is in Gmail. I haven't found a way of marking an email as a task, or of dragging an email into Pagico. For me that's absolutely crucial. I suppose I could load Gmail into the Mac email app (from where drag and drop apparently works), but I am rather partial to Google's filtering into priority inbox etc ... (but maybe I don't need that with a good GTD tool).

So, just these few hours of playing with Pagico have made me realise that I do need a really good GTD tool, and that even the thinking this GTD tool made me do was useful, but it needs Evernote, email and calendar integration. Oh, and it has to have a good Android interface on phone and tablet. Will it be Pagico, or will it be something else?

As I said above, I did some googling, and in doing so I discovered that the whole GTD world has moved on a lot since I last paid attention. purplezengoat, for example, has a very interesting list of tools to think about. I intend to think about them, possibly concentrating on iqtell and zendone. But I'm going to give Pagico a week. Stay tuned.

1: Yes, I sold my soul and moved to a MacBook about six months ago, but never fear, I only do management gubbins on it; my real work is done on a Linux virtual machine. But I have to confess, for management gubbins, I do like the Mac. Damn it. Worse, nowadays management gubbins is most of my day job, which is why the blogging has dried up. Damn it again. Oh, and I have to say, the MacBook hardware is really good too!

by Bryan Lawrence : 2015/01/25 : 1 trackback : 2 comments (permalink)

Leptoukh Lecture

I was honoured by the informatics section of the American Geophysical Union this year by being awarded the Leptoukh Lecture. The abstract and talk itself are on my talks page.

by Bryan Lawrence : 2014/12/18 (permalink)

Building your own JASMIN Virtual Machine

I make a good deal of use of the JASMIN science virtual machines, but sometimes I want to just do something locally for testing. Fortunately you can build your own virtual machine using the "JASMIN Analysis Platform" (JAP) to get the same base files.

Here's my experience building a JAP instance in a VMware Fusion virtual machine (I have a MacBook, but I have thus far done all the heavy lifting inside a Linux Mint virtual machine ... but the JAP needs a CentOS or Red Hat base machine, hence this).

Step One: Base Virtual Machine

We want a base Linux virtual machine on which we can build the JAP.

  1. Start by downloading a suitable base Linux installation (CentOS or Red Hat). Here is one I got some time ago: CentOS-6.5-x86_64-bin-DVD1.iso

  2. From VMware Fusion, choose the File > New option, double click on the "Install from Disc or Image" option, and find your .iso from the previous step.

  3. Inside the Linux easy install, configure your startup account.

  4. You might want to configure the settings. I chose to give mine 2 cores, 4 GB of memory, and access to some shared folders with the host.

  5. Start your virtual machine.

  6. (Ignore the message about unsupported hardware by clicking OK)

  7. Wait ... do something else ...

  8. Login.

  9. (This is a good place to take a snapshot of the bare machine if you have the available disk space. Snapshots take up as much disk as you asked for memory.)

Step Two: Install the JAP

Following the instructions from here, there are effectively three steps plus two wrinkles. The three steps are: get the Extra Packages for Enterprise Linux (EPEL) repository into your config path; get the CEDA JAP repository into your config path; and build. Then the wrinkles: the build currently fails! However, the fixes to make it build are pretty trivial.

  1. Open up a terminal window and su to root.

  2. Follow the three steps on the installation page, then you'll see something like this:

    --> Finished Dependency Resolution
    Error: Package: gdal-ruby-1.9.2-1.ceda.el6.x86_64 (ceda)
               Requires: libarmadillo.so.3()(64bit)
    ... 
    Error: Package: grib_api-1.12.1-1.el6.x86_64 (epel)
               Requires: libnetcdf.so.6()(64bit)
    ...
                   Not found
     You could try using --skip-broken to work around the problem
     You could try running: rpm -Va --nofiles --nodigest
    

    But never fear, two easy fixes are documented here. You need to:

  3. Force the install to use the CEDA grib_api, not the EPEL version. You do that by putting

    exclude=grib_api*
    

    at the end of the first (EPEL) section in the /etc/yum.repos.d/epel.repo file, and

  4. Add the missing (older version of the) armadillo library by downloading the binary rpm from the ticket and installing it locally. Then you can redo the final step:

  5. yum install jasmin-sci-vm

And stand back and wait. You'll soon have a jasmin-sci-vm.

by Bryan Lawrence : 2014/08/04 : Categories jasmin : 0 comments (permalink)

simulation documents

In my last post I was discussing the relationship between the various elements of documentation necessary to describe the simulation workflow.

It turns out the key linking information is held in the simulation documents - or should be, but for CMIP5 we didn't do a good job of making clear to the community the importance of the simulation as the linchpin of the process, so many were not completed, or not completed well. Its importance should be clear from the previous figure one, but we never really promulgated that sort of information, and it certainly wasn't clear from the questionnaire interface - where the balance of effort was massively tilted towards describing the configured model.

Looking forward, it would seem sensible to separate the collection of information about the simulation from the collection of information about the configured model (and everything else). If we did that, the folks running the simulations could get on and document those in parallel with those documenting the models etc. It would also make the entire thing a bit less daunting.

To that end, I've tried to summarise the key simulation information in one diagram:

Image: static/2014/08/01/simulations.jpg

(Like last time, this is not meant to be formal UML, but something more pleasing to the scientist's eye.)

The key information for a simulation appears in the top left box, and most of it could be completed using a text editor if we gave folks the right tools. At the very least it could be done in a far easier manner, to allow cut and paste. The hard part was, and still would be, the conformances.
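
To illustrate - and this is a purely notional sketch, with field names of my own invention rather than the actual es-doc/CIM schema - the hand-editable part of a simulation document might be no more than this:

    # A notional simulation document as a plain Python dict; the field
    # names are illustrative only, not the actual es-doc/CIM schema.
    simulation = {
        'name': 'historical-r1i1p1',
        'experiment': 'historical',        # link to an experiment document
        'configured_model': 'MyGCM-v1.2',  # link to a configured model document
        'platform': 'my-cluster',          # link to a platform document
        'input_data': ['ozone-v2', 'solar-v1'],
        # the hard part: conformances linking numerical requirements to
        # the data or code mods which satisfy them
        'conformances': [
            {'requirement': 'ozone-forcing', 'via': 'input_data',
             'uses': 'ozone-v2'},
            {'requirement': 'start-date', 'via': 'code_mod',
             'uses': 'namelist-change-42'},
        ],
    }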

(For now we don't have a specification for how to describe performance and resources, but these are not expected to be hard, and the groundwork has already been done by Balaji at GFDL.)

One tool we need to provide folks, to make the best use of this information, would be a way of parsing the code mods and the configured models to create a sort of key-value list which described any given configured model in a way that could be compared, in terms of mathematical distance, with another model. Such a tool would enable the creation of model genealogies (a la Masson and Knutti, 2011) in a completely objective way.
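
The comparison step itself would be straightforward once that key-value extraction existed. A toy version might score the fraction of properties on which two configured models disagree (all the property names and values here are invented):

    # Toy model distance: given key-value descriptions of two configured
    # models (however obtained), score the fraction of properties on
    # which they differ. ASSUMPTION: all names and values are invented.
    def model_distance(props_a, props_b):
        keys = set(props_a) | set(props_b)
        differing = sum(1 for k in keys if props_a.get(k) != props_b.get(k))
        return differing / len(keys) if keys else 0.0

    m1 = {'dynamics': 'spectral', 'ozone': 'static', 'resolution': 'N96'}
    m2 = {'dynamics': 'spectral', 'ozone': 'interactive', 'resolution': 'N96'}
    print(model_distance(m1, m2))  # -> 0.333..., one property in three differs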

One thing to note is that the simulation collection documents allow one to collect older simulations into new simulation collections, which means we ought to be able to develop simulation and data collections which exploit old data and old models within the ensembles for new experiments.

(I should say that these blog posts have been the result of conversations with a number of folk; this last one is based on a scribble on the back of an envelope from a chat with Eric Guilyardi.)

by Bryan Lawrence : 2014/08/01 : Categories esdoc metafor cmip6 : 0 comments (permalink)

Updated version of the model documentation plots

A couple of weeks ago, I outlined three figures to help describe the model documentation workflow and asked for a bit of feedback.

I've had some feedback, and some more thinking, so here are three updated versions of those plots.

In each figure, the boxes and links are not meant to correspond directly to the UML classes and associations of the (es-doc) Common Information Model (although some do) - the intention is to describe the concepts and intent of the various pieces of documentation.

Figure One

Image: static/2014/07/29/MIPprocess_esdocV2.jpg

The MIP process involves designing experiments which are provided to modelling centres.

Modelling centres configure model code to produce a configured model, which needs InputData.

They then run one (or many) Simulations to support one or more experiments. The Simulation will conformTo (NumericalRequirements) of the Experiment, and will produce OutputData, and was run on a Platform.

The output data is uploaded to the ESGF archive, where it supports the MIP.

Each of the coloured boxes represents an es-doc "document-type". The yellow coloured relationships will be included in a given Simulation document.

Figure Two

Image: static/2014/07/29/simple_esdocV2.jpg

Neglecting the process, we can look at the various document types and what they are for in more detail.

A simulation (which can also be an aggregation of simulations, also known as an ensemble) will have conformed to the requirements of the model and the experiment via conformances. Many of these constrain the input data, so as to meet the input requirements of the model, and that data may also have been constrained by one of the numerical requirements of the experiment. Others may affect the code (maybe via a choice of a particular code modification, or via a specific parameter choice or choices).

Configured models are described by their Science Properties.

A Simulation document will include all the information coloured in yellow, so it will define which configured model was used via the uses relationship, which will point to the configured model document, which itself describes the model.

Similarly, an experiment document will define all the various numerical requirements, and may point to some specific input data requirements.

Ideally output data objects will point back at the simulation which produced them.

Figure Three

Image: static/2014/07/29/Simulations_esdocV2.jpg

In even more detail we can see that numerical requirements come in a number of flavours, including:

  • SpatioTemporalConstraints - which might define required experimental start dates, durations, and/or the coverage (global, regional etc).

  • Forcings - which generally define how various aspects of the system might be represented, for example, providing Ozone files for a consistent static ozone representation.

  • OutputRequirements - which define what output is required for intercomparison.

Simulation conformances are generally linked to specific numerical requirements and consist of data and code-mod conformances.

Currently es-doc does not clearly distinguish between data requirements and data objects; this will be fixed in a future version.

ConfiguredModels are configured from BaseModel code (this is not currently implemented in es-doc). In principle they can be described both by their software properties and the science properties (and so can their sub-components).

The run-time performance of the simulation should also be captured by resource and performance characteristics (but this too is not yet supported by es-doc).

A model description document should include all the material in green, including the links.

by Bryan Lawrence : 2014/07/29 : Categories esdoc : 1 trackback : 0 comments (permalink)

NCAS Science Conference, Bristol

In the middle of two days in Bristol for the NCAS science conference. Good to see what so many of my colleagues are up to (distributed as they are, across the UK).

My talk on the influence of Moore's Law on Atmospheric Science is linked from my talks page.

I wish the rest of the talks were publicly available; there is a lot of good stuff (much of it yet to come, today). The problem with (very slow) peer review as a gold standard is that a lot of stuff only sees the public light of day (even within the relevant science community) long after the work was done, whereas much of it is fit for exposure (and discussion) well before then - but you have to go to the right meeting/workshop/conference. However, some of what is discussed is provocative, and work in progress ... and of course our community is (for good reasons, mostly related to an uneducated public spotlight) somewhat shy of premature publication. It's a conundrum.

by Bryan Lawrence : 2014/07/18 : 0 comments (permalink)

The Influence of Moore's Law

NCAS Science Meeting, Bristol, July 2014

I gave a talk on how Moore's Law and friends are influencing atmospheric science, the infrastructure we need, and how we are trying to deliver services to the community.

Presentation: pdf (19 MB!)

by Bryan Lawrence : 2014/07/17 : Categories talks (permalink)

The vocabulary of documenting models

Some time ago our European project Metafor migrated to a global project called es-doc. During the Metafor project we put a lot of effort into trying to develop materials which describe what it was trying to achieve, but I think we never really got that right (despite a number of papers on the subject - e.g. Lawrence et al 2012, Guilyardi et al 2013, etc).

I'm currently trying to produce a figure on the subject for another publication, and in the process have produced these, which I think might be quite useful for understanding what es-doc is trying to document, and what the main kinds of documentation are that it produces. I'd be interested in feedback as to whether these figures are helpful or not.

In each figure, the boxes and links are not meant to correspond directly to the UML classes and associations of the (es-doc) Common Information Model (although some do) - the intention is to describe the concepts and intent of the various pieces of documentation.

Figure One

Image: static/2014/07/09/MIPprocess_esdoc.jpg

The MIP process involves designing experiments which are provided to modelling centres.

Modelling centres configure model code to produce a configured model.

They then run one (or many) Simulations to support one or more experiments. The Simulation will conformTo (NumericalRequirements) of the Experiment, and will produce OutputData, and was run on a Platform.

The output data is uploaded to the ESGF archive, where it supports the MIP.

Each of the coloured boxes represents an es-doc "document-type".

Figure Two

Image: static/2014/07/09/simple_esdoc.jpg

Neglecting the process, we can look at the various document types and what they are for in more detail.

A simulation (which can also be an aggregation of simulations, also known as an ensemble) will have used some input data, some of which may have been defined and/or constrained by one of the numerical requirements of the experiment.

Constraints on the simulation by the numerical requirements might require conformances - which define how those constraints affect either the data or the code (maybe via a choice of a particular code modification, or via a specific parameter choice or choices).

Configured models are described by their Science Properties.

A Simulation document will include all the information coloured in yellow, so it will define which configured model was used via the uses relationship, which will point to the configured model document, which itself describes the model.

Similarly, an experiment document will define all the various numerical requirements, and may point to some specific input data requirements.

Ideally output data objects will point back at the simulation which produced them.

Figure Three

Image: static/2014/07/09/Simulations_esdoc.jpg

In even more detail we can see that numerical requirements come in a number of flavours, including:

  • SpatioTemporalConstraints - which might define required experimental start dates, durations, and/or the coverage (global, regional etc).

  • Forcings - which generally define how various aspects of the system might be represented, for example, providing Ozone files for a consistent static ozone representation.

  • OutputRequirements - which define what output is required for intercomparison.

Simulation conformances are generally linked to specific numerical requirements and consist of data and code-mod conformances.

ConfiguredModels are configured from BaseModel code (this is not currently implemented in es-doc). In principle they can be described both by their software properties and the science properties (and so can their sub-components).

The run-time performance of the simulation should also be captured by resource and performance characteristics (but this too is not yet supported by es-doc).

A model description document should include all the material in green, including the links.

by Bryan Lawrence : 2014/07/09 : Categories esdoc metafor : 1 trackback : 2 comments (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.