... personal wiki, blog and notes
Weeks six and seven
Two weeks to report this time, primarily because I had a day off on sick leave, and two days on real leave, so there are only seven weekday workdays to talk about (although I have just spent big chunks of both yesterday and today, that is, Saturday and Sunday, on work as well).
Early on, the main thing I was doing was trying to catch up on the ever-increasing email mountain, but in week six, by the time I had taken out a day in London for a NERC Information Strategy Group meeting (mostly about the future of the NERC data centres), a half day on technical futures for CEDA, another half day on my final CEDA monthly meeting and a follow-up meeting on CEDA support for SPARC, and a day off on sick leave, that left only a few hours here and there to get much done. The net effect, of course, was that the email mountain grew.
To be fair, it wasn't so much that the email mountain grew as that the Nozbe task list grew. I did manage to process a lot of email, but quite a lot of things got thrown onto the pile for "later".
Then this last week (week seven), we had a few days on holiday down Dorset/Devon way. Just a couple of nights, but real recharge territory. However, it was back to work on Wednesday, and back to London, this time for an NCAS Science Strategy Board meeting, so only two real work days were available.
So, in the work time, I did a wee bit of work on the ENES infrastructure foresight that I've talked about before, and quite a lot of work on the first deliverable we've got for ESIWACE, which is on requirements and business modelling for (weather and climate) data centres. Given the constraints on my time during the week, it's been a big effort on that this weekend too. I suspect quite a few things from that will turn up on my blog ...
... but anyway, that's the weeks that were.
by Bryan Lawrence : 2017/02/19 : 0 comments (permalink)
Space and Open Plan Offices
I was paying a bit more attention to twitter this morning than usual (I'm hoping I'll get some feedback on my analysis of citations that I posted yesterday). One thing that blew by was this headline in the Washington Post:
"Google got it wrong. The open-office trend is destroying the workplace."
which took me back to something that I wrote in 2005, reporting on work done long before.
It seems that Google was ignoring history as well. Unlike them.
There's been lots more work done since the work I cited in that blog post, for example:
Workers in open-plan offices are more distracted, unfriendly and uncollaborative than those in traditional workplaces, according to the latest industry survey.
Employees who have to share their office with more than two people experience high levels of colleague distrust and form fewer co-worker friendships than those working in single-occupancy offices ...
"...the open-plan proponents' argument that open-plan improves morale and productivity appears to have no basis in the research literature."
Why so much data? Part I: The rise of direct numerical simulation
Over the last decade or so, many in the scientific community (especially in the environmental sciences) have been surprised by the increased cost of data handling - not only in absolute terms, but in terms of the percentage of the cost of doing "normal science".
This increase has been problematic on a number of fronts: not only do individuals not always plan appropriately for managing their data storage and handling, but even where they have, institutions and funding agencies have themselves been surprised, and not always too keen to pay up. After all, in a world where there is very little new money in real terms (or even none), an increase in one part of the budget needs to be offset elsewhere. So, the real consequence of these increases in storage cost has to be a decrease in the amount of science done (fewer staff/instruments/computers), and that's unpalatable even if it's unavoidable. At the moment many choose to think it is avoidable, which is quite a feasible position if you're an ostrich, but not so good if you're responsible for delivering science!
In this post, and maybe a couple to follow, I want to address why data handling is becoming a bigger deal in environmental science, and why we can't avoid spending more money on it (but also how we can avoid spending more than we need to).
So, what are the factors in play? Well, as well as the background economics, there are four:
The direct influence of Moore's Law on instrumentation and simulation (finer resolution in space and time means more numbers),
The indirect influence of Moore's Law on what can be simulated (more compute means more things are computable),
The growth of interdisciplinarity (more things need to be compared and contrasted) and more people are doing it, and
The relationship between Moore's Law and Kryder's Law (is the cost of storage falling as rapidly as the cost of creating the numbers to be stored?).
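Since that last factor turns on a race between two exponentials, a toy calculation may help. This is only a sketch: the doubling and halving times below are invented for illustration, as is the assumption that data volume tracks compute capacity; none of these numbers come from real measurements.

```python
# Toy model of the Moore's-Law / Kryder's-Law tension. If the data we
# produce scales with compute capacity, the storage bill grows whenever
# compute doubles faster than storage cost halves.

def relative_storage_bill(years, compute_doubling=1.5, storage_halving=2.0):
    """Relative spend on storage after `years`, normalised to 1.0 at year 0."""
    data_produced = 2.0 ** (years / compute_doubling)   # grows with compute
    cost_per_byte = 2.0 ** (-years / storage_halving)   # falls per Kryder
    return data_produced * cost_per_byte

# A decade on, under these (invented) assumptions, the storage bill has
# roughly tripled relative to everything else in the budget:
bill = relative_storage_bill(10)
```

The point is not the particular numbers, but that any sustained gap between the two exponents compounds into a budget problem.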
Before delving into the technical details, I want to look at one of the underlying scientific trends that arises, partially as a consequence of the "indirect influence of Moore's Law": the rise and rise of direct numerical simulation, especially in the environmental sciences.
As noted above, more computability means more things are computable, and when we couple that to increased mathematical sophistication, more and more of the real world is amenable to direct numerical simulation: that is, it can be numerically simulated from fundamental equations rather than approximated by heuristics. The importance of this from a scientific point of view is that if one believes one is simulating the underlying processes properly, one can use the "simulation system" to predict how the system will behave under circumstances different from those that have been observed (either by coupling it into more complex systems or by using it to predict past or future behaviour).
To some extent this is the holy grail of science: when one can simulate a system so well that one can't tell whether one is observing a simulation or the real world, one can believe one understands that part of the real world.
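As a toy illustration of the distinction (Newtonian cooling, chosen purely for brevity; nothing here is specific to environmental models), stepping the governing equation forward directly tracks the true behaviour, and can be re-run under new circumstances without re-fitting, which a heuristic rule of thumb cannot:

```python
import math

def simulate(T0, T_env, k, dt, steps):
    """Direct numerical simulation of dT/dt = -k (T - T_env),
    stepped forward with an explicit Euler scheme."""
    T = T0
    for _ in range(steps):
        T += -k * (T - T_env) * dt
    return T

T0, T_env, k, t = 90.0, 20.0, 0.5, 4.0
n = 4000
numerical = simulate(T0, T_env, k, t / n, n)
analytic = T_env + (T0 - T_env) * math.exp(-k * t)
# With a fine enough timestep the simulation tracks the analytic
# solution; change k, T_env, or the forcing, and it still works.
```

Real environmental DNS resolves far richer equations on far finer grids, which is exactly why the data volumes explode.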
For some time I have been asserting that an ever greater part of environmental science is engaging in direct numerical simulation (DNS), year on year. At the same time, I've been asserting that larger and wider communities were interacting around data, and especially around the data from model intercomparison projects, again, year on year. Of course, these assertions were not unrelated!
These assertions were based primarily on my interactions with the scientific community (one of the things about running an environmental data centre for much more than a decade is that my day job has involved interacting with individuals from across the scientific spectrum), and so one might dispute them. However, a year or so ago, I realised I might be able to get some quantitative information to support them by a bit of careful text mining. Unfortunately, I also realised I was never going to get the time to do it properly, so what follows is very amateur, but I hope still interesting. (If you have the skills and access to the data to do this properly, get in touch!)
The following figure was generated by spending a lot of time doing searches on Google Scholar (it would have been rather less time if Google didn't actively stop one doing this sort of work programmatically - I did try, and it worked fine until their "no robot" code stopped me in my tracks). Each point reflects the number of hits from a search on a specific set of terms, restricted to material from a two-year window. The terms were chosen to try and reflect five specific categories of interaction:
direct numerical simulation (across any discipline) - "dns" in the figure,
direct numerical simulation in the environmental sciences - "dns+env" in the figure,
model intercomparison projects - "mips",
use of satellite data in environmental science - "sats", and
regular observations - "sondes".
(The table at the end of this post gives the details of the exact searches carried out.)
The numbers in the legend are firstly the ratio of the last couple of points over the first couple of points - a measure of proportional growth, and secondly, the gradient from a fit to the number of hits per annum - a measure of absolute growth.
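For concreteness, those two legend numbers can be computed like this. This is a sketch using invented hit counts, not the real Scholar numbers:

```python
def growth_measures(years, hits):
    """Return (proportional growth, absolute growth) for a hit series:
    the ratio of the sum of the last two points to the sum of the first
    two, and the slope of an ordinary least-squares fit of hits vs year."""
    ratio = (hits[-1] + hits[-2]) / (hits[0] + hits[1])
    n = len(years)
    my, mh = sum(years) / n, sum(hits) / n
    slope = (sum((y - my) * (h - mh) for y, h in zip(years, hits))
             / sum((y - my) ** 2 for y in years))
    return ratio, slope

# Invented example series (two-year windows, as in the figure):
years = [2004, 2006, 2008, 2010, 2012, 2014]
hits = [100, 150, 230, 340, 520, 780]
ratio, slope = growth_measures(years, hits)
# ratio = 5.2 (proportional growth), slope = 66.0 extra hits per year
```

A high ratio with a modest slope flags a small-but-fast-growing category, which is why both measures are worth reporting.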
The figures back my observations rather nicely:
The number of papers on any facet of environmental science is growing. (No news there).
The number of papers using direct numerical simulation is growing rapidly, but the use of DNS in environmental sciences is growing even more rapidly, at least using the proportional measure.
The increase in papers which use MIP data is explosive, and one can see the direct influence of CMIP5 in the numbers.
Growth in observational science is slower than in numerical science, although the effect of increased availability of satellite data is apparent.
Obviously these conclusions could be heavily affected by the search terms I used, so if your mileage varies, let me know!
For the record, these are the exact (full text) searches used:
dns: search for the exact phrase "direct numerical simulation"
"dns+env": as above, but require one of the following as well: cloud, rain, weather, climate, ocean, atmosphere, land, river, biogeochemistry, aerosol.
"sats": search for an exact match on "nadir sounder" (as a proxy for atmospheric satellite data only, and using that as a proxy for environmental science use of satellites in general).
"mips": exact match for "model intercomparison project" AND the use of the word simulation AND either atmosphere OR ocean.
"sondes": at least one of "radiosonde" or "dropsonde" appears (being a proxy for any sort of "traditional" observation - words like Lidar and Radar being too difficult to limit to environmental science, at least in this first cut at the problem).
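The five searches above might be expressed as query strings roughly like these. The `any_of` helper and the exact formatting are my own illustration (Scholar treats quoted phrases as exact matches and OR as alternation), not the literal queries used:

```python
# Hypothetical reconstruction of the five search categories as
# Scholar-style query strings.

ENV_TERMS = ["cloud", "rain", "weather", "climate", "ocean", "atmosphere",
             "land", "river", "biogeochemistry", "aerosol"]

def any_of(words):
    """Join alternatives with OR, as in Scholar's advanced search."""
    return "(" + " OR ".join(words) + ")"

QUERIES = {
    "dns": '"direct numerical simulation"',
    "dns+env": '"direct numerical simulation" ' + any_of(ENV_TERMS),
    "sats": '"nadir sounder"',
    "mips": '"model intercomparison project" simulation (atmosphere OR ocean)',
    "sondes": any_of(["radiosonde", "dropsonde"]),
}
```

Each query would then be run once per two-year window, with the hit counts recorded by hand.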
by Bryan Lawrence : 2017/02/05 : 0 comments (permalink)
Nothing exciting to report. I spent nearly the entire week processing email, interacting with my team (both directly and on slack), and producing short things (e.g. a management-level one-pager on why NCAS should continue to support the CF conventions, and how). However, I did spend a wee bit of time trying to reinforce some of my observations about why storage is becoming a much bigger deal in environmental science. More on that next ...
by Bryan Lawrence : 2017/02/05 (permalink)
Not much to report for week four, since the work I did this week was mostly "processing email" - although we did submit the outline EPSRC bid, so one thing actually done (for now). The lack of anything else substantial was down to still being in the States in the early part of the week, coupled with a virulent bout of (probably) food poisoning which knocked me out for a couple of days.
by Bryan Lawrence : 2017/01/29 (permalink)
Quite a different week, for different reasons.
Firstly, the work. Up before the larks on Monday to take my daughter to school and then on by Eurostar to Paris. Monday through Wednesday this week was the final general assembly for the IS-ENES2 project: the second "Infrastructure (to support) the European Network for Earth System Modelling".
IS-ENES2 is a pretty important project for European climate science, although for a lot of folks it's invisible. It has two important facets:
data-wise, for us, it's underwritten our support for the CF NetCDF conventions, the support for constructing the data request for CMIP6, and the entire es-doc initiative to document CMIP climate models and their simulations. IS-ENES2 has been the major supporter of the ESGF in Europe.
hpc-modelling-wise, it's underwritten work on devising plans for future model infrastructures, from workflows and couplers to the model codes themselves. Looking forward, much of that part is morphing into ESIWACE, but significant elements of support for current "production" climate science have been supported by IS-ENES2, and are not included within ESIWACE.
I had a couple of key roles in this meeting, which was held in a "room with a view" at the top of a tower in UPMC in Paris:
I've been helping coordinate a mid-term update to an ENES infrastructure strategy, and I've also been working on coding issues for future models, so I gave talks on both aspects.
Unfortunately I had to leave before the end of Tuesday, and couldn't be there for the Wednesday. I had to head back to Blighty to work on the EPSRC bid, before flying to the States on Thursday. So Wednesday was mostly about the EPSRC bid, although I fitted in a bunch of other small things around that.
The US trip is a family thing, so I won't say much about that, but because I had a long day flight, and because although I need to be in the States I don't need to be "off work" all the time, I've got a lot of other work done ...
... catching up on a lot of reading around data analytic futures suitable for JASMIN, as well as learning about some new software to use for information management. Sometime I'll blog about that too, but not now. I've also caught up on a lot of other bits and bobs.
I also managed a couple of long telecons on Friday, one on HPC futures around storage for the NERC community (tied up in the JASMIN funding I've been talking about), and one on making some measurements to help plan a migration away from parallel file systems to object-store disk. That's a big story for another day too!
by Bryan Lawrence : 2017/01/23 : 0 comments (permalink)
Another week, another load of paperwork written ... another week that didn't feel much like science except for the Intel bit which at least was interesting ...
Pretty much the same topics as last week on my mind:
(Foresight) Last week I didn't spend any time on "Foresight", this week I did! This is the mid-term update to the European Network for Earth System Simulation (ENES) infrastructure strategy from 2012 (pdf). We held a meeting last year in October, and I'm coordinating the update, but it's been on the back burner because of other commitments. However, we're talking about it on Tuesday (i.e. in a couple of days), so I had to push on with it this week. I got a skeleton structure for a document done and discussed it with some colleagues.
(C3S Magic Lot2) I didn't spend any time on this last week either, but this week spent a couple of hours on it in the context of our (CEDA) CP4CDS contract with ECMWF to supply ESGF data to the Copernicus Climate Services project (Lot 1). This activity deserves its own blog post and will get one when I come up for air ... but meanwhile, just to say that this week was about the interaction between Lot2 (where I have a UoR involvement), which is about providing code to run in the climate services system which will be delivered by Lot1, and Lot1 itself.
(Chasm) I spent a couple of hours today preparing a summary presentation, for tomorrow, of the outputs from our chasm workshop, also held in October last year. This is about the future of how we programme climate models and their infrastructure. It's not going to be easy!
(EPSRC Data Science Bid) A lot more time on that this week, impacts, objectives, updates to the outline, and some iterations around effort and finances.
(JASMIN Funding) Updated the brief for NERC with more details about the science programme consequences associated with the various financial and technical scenarios, and dealt with some of the consequential questions.
One new thing this week. Spent a day in Hamburg getting a restricted secret (!) briefing from Intel about their future plans. Very interesting stuff, none of which I can talk about, suffice to say I worry about programmability of next generation architectures (that's no secret, I worry about how we programme current architectures such as KNL, and our entire Chasm activity is about this issue ...). I think this is an oncoming train which much of the environmental science community is treating in the best traditions of ostrich escapism (collective heads in the sand).
As always on the Hamburg metro I notice how everyone seems just a bit more relaxed than they would be on the equivalent journey in the UK. It might be just that there are fewer people in the carriage, but then I visit DKRZ and again people just seem less hassled, so it's more than that. I get the impression that they still actually fund new things in Germany, rather than just asking people to carry on doing the old things and do new things with no new money; and as a consequence people have sensible workloads, unlike here (he says at the end of a 63-hour working week). Things seem to get done, and maintained ...
Of the other things I talked about last week, I progressed most of them in some small way as part of my normal email flow - I spent hours more on email this week ...
by Bryan Lawrence : 2017/01/15 : 0 comments (permalink)
So this is the end of week one of 2017. What did I spend the week doing?
Well, to answer that, I'm going to start by going back a couple of years: back then, in Jan and Feb 2015 I was looking at Pagico as a "getting things done" (GTD) tool. I ended up not choosing Pagico, but I did within a few weeks settle on using nozbe ...
... and I'm still using it, nearly every day! The picture above is a screenshot of my active projects this afternoon, most of which arose from deadlines from last year.
Without (today) getting into the details of how I use Nozbe, I'll just say that this was the list of projects which I expected at the beginning of the week that I might want, or need, to be working on during the week. (There are loads of other projects I'm working on, or need to work on, that are managed in my Nozbe, but I didn't expect to have to put a lot of work on them this week, although the odd task did come up and get done). Of course, in the best traditions of what Harold Macmillan might have said, and Helmuth von Moltke did say, my plans don't always survive the inbound email and the demands therein, but you have to start somewhere ...
As it happened, I didn't need to work on all of those things this week (LTS), or didn't find the time (C3S Magic Lot2, chasm and foresight), so I'll explain those a bit more when I do have something to say - which in the case of chasm and foresight will of necessity be after next week, because things have to be done on those next week.
So this week, I did spend time on
(Reorganising Bryan) Early Monday morning I put the finishing touches on a proposal for how my job description should be rewritten (I'll have a lot more to say about this another day, but this is part of the "consequences" I alluded to in my last post).
(NC Commissioning of CEDA) We (CEDA) had been asked to produce a five page vision statement on the strategic need for NERC's national capability spending on data management and it was due on Friday. My colleagues had produced some bullet points I might want to consider in this document, and on Monday (yes, that public holiday Monday) and Tuesday I produced the document in time to circulate it to colleagues late Tuesday for feedback on Wednesday. I finished it and submitted it on Thursday. This is part of a large body of work which has been carried out over the last couple of years to allow NERC to change the way it commissions its data management support. There will be more to do on this before the new commissioning is complete ... but for the moment I have no active tasks in this project. Bliss.
(HPC Replacement) I am responsible for a grab bag of activities associated with HPC replacement for NERC. This week the major task was producing a draft summary for NERC of options to share a new HPC platform with the Met Office to replace the existing MONSooN machine, which is due to be turned off at the end of March. I sent the new draft off to Met Office colleagues for fact checking ... (MONSooN is a shared HPC platform to allow model development under the auspices of the Joint Weather and Climate Research Programme.)
(CF) NCAS invests considerably in supporting the Climate and Forecast conventions for NetCDF. As part of preparing for commissioning that activity within Long Term Science (LTS, so now you know what that acronym means), key NCAS staff had a three hour meeting on Wednesday morning. It was a wide ranging discussion, covering many things, and leaving us all with some actions to help support CMIP6, as well as deal with the commissioning activity.
(EPSRC Data Science Bid) I am leading a bid from the University of Reading, STFC, and NCAS, for EPSRC funding for data science. At this stage we need an outline bid. I finished that off and circulated it to colleagues as well as had several physical and virtual conversations about it.
(PhD supervision) I try to meet at least weekly with my PhD student. Sometimes I have to do things, sometimes I have to chase up on things. This week was really easy, I just had to sit and listen to the cool things he'd done since I saw him before Christmas.
(CF Data Model Paper) We have a paper trying to clarify a common understanding of some aspects of CF nearing completion. I didn't have anything I needed to do at the beginning of the week, and the outstanding tasks might not ever need doing, but I still spent some time talking about it with my co-authors and contributing to some new diagrams - it had to be a priority because as one gets really near submission it's important that we can discuss things together while we all have the details at our fingertips (i.e. at the top of our mental stacks).
(JASMIN Funding) In many ways the most significant thing I have done in recent years is instigate and lead on the delivery of the JASMIN supercomputer. The big issue at the moment is hardware replacement and the necessary upgrades to storage which follow from everyone's pesky habit of creating more data. This week I've been producing technical and financial scenarios associated with a range of possible funding futures. This is tedious but necessary stuff, and took most of Thursday, Friday, and a few hours today (Sunday).
I also processed a lot of incoming email. "Processed" in the sense that I did one of the following things with all of it (thanks to Nozbe I am at nearly email inbox zero):
I deleted it.
I filed it (in evernote, or in gmail archive).
I responded to it (which may or may not have required some work).
I put it in Nozbe to do at some future time.
Not a very exciting week, but a lot of important (I think) stuff done. I hope that some of the things I have to say in future blog posts will be more interesting ...
by Bryan Lawrence : 2017/01/08 : 0 comments (permalink)
Back to the Future - I think therefore I WILL blog
(with apologies to Descartes :-)
It looks like I haven't written a blog posting for more than eighteen months. That's really sad on a number of levels; I think it reflects a combination of my workload in terms of both volume and content.
I want to get back to blogging. I am a scientist (still, just), I should be communicating about what I'm doing because that'll help me do it better (and help society get value from its investment in me - that might be a bit of a pompous statement, but it's true I think).
Much of what I have done in the last year has been writing various management documents about data, HPC, and finances. I have done some other bits, but they've either been as part of things that have actually turned up as papers or talks (if you look at my publication list and talks page you'll see that there has been activity, even though my blog has been a bit empty), or as things that I haven't felt I could publicise because of the interests of other people (my student, my colleagues in a couple of projects etc).
On top of those content issues, there have been volume issues. Much of what I have done has been against deadlines, often with little warning, and coming from multiple directions at once. I kept thinking I would get a breather, but it hasn't turned out that way, and so the entire year I was under pressure, feeling knackered, and lots of things had to give, and the blog was one of them. If nothing else I could and should have blogged about the papers and talks, but even that seemed too much.
Clearly the workload thing is problematic (and there have been consequences which I hope to discuss here anon) but not blogging is problematic in its own right. I think the lack of blogging has been detrimental to the delivery of my job itself. I believe that when I was blogging it was good for communication, it helped me learn, and it helped me organise my thinking about things that have actually ended up in papers and production services (at CEDA etc). I have missed out on all those things by not blogging, although some of the management work has been surprisingly good for helping me organise thinking, just as some of it has been completely nugatory. (I should blog about the good and bad of my recent management experiences sometime, there were some lessons worth sharing!)
It used to be that I also used this place for notes, but it's clear that that role has been supplanted by evernote, project wikis (closed and public), and ipython notebooks (I can't get used to saying "jupyter", is that an age thing?). In the future I'd like to host notebooks alongside my blog, and include some blog articles as notebooks, but that'll require a technology change ...
So, in the best traditions of new year's resolutions, I have committed to myself to try and get at least one blog article out a week (except when on holiday) - if nothing else I'll try and give a bland summary of the week that was. Look out for week one coming here soon!
by Bryan Lawrence : 2017/01/07 : 1 comment (permalink)
playing with docker
From time to time I get a very short opportunity to try and do some science, and I find the context switching harder and harder. To that end, I want to make more use of ipython-notebook.
Nowadays my compute environment is a macbook pro running mavericks, and I have two VMs built: a JASMIN analysis platform (JAP) image (based on CentOS, for science) and a Linux Mint image (primarily to give me a route to reliable LibreOffice - unlike the version running on the Mac). Both can run ipython notebook, but I couldn't work out how to get that visible to browsers running in my Mac environment (which is what I wanted this time).
I could probably have worked that out, but I thought: it's about time I got some hands-on experience with docker, since everyone is raving about it, and folks in my team are also starting to use it ... so why not try that route?
Herewith a couple of hours on a Saturday afternoon and another hour or so on a Sunday morning:
I got boot2docker working, and then I thought I'd try out the continuumio anaconda image ... but I immediately discovered that it didn't have netcdf4 and basemap by default (and that matplotlib was broken), so herewith my first dockerfile:
# aims to run basemap, and eventually, cf-python
# the anaconda base image is itself based on debian
FROM continuumio/anaconda
MAINTAINER Bryan Lawrence <firstname.lastname@example.org>
# as of May 10, 2015 the base image needs this to work with matplotlib:
RUN apt-get -y install libglib2.0-0
# now the stuff we want
RUN conda install netcdf4
RUN conda install basemap
I was able to build that using
docker build -t bnlawrence/cfconda .
and run it using:
docker run -it -v $(pwd):/usr/data -w /usr/data -p 8888:8888 bnlawrence/cfconda
(from within a directory on the mac where I wanted my notebooks to reside.) I then have to run
ipython notebook --ip=0.0.0.0 --no-browser
inside the container, whereupon, as if by magic, I can access my notebooks on the mac via the boot2docker VM's IP address on port 8888.
(I would have liked to have run the notebook directly on the end of the docker run statement, but when I do that, the notebook kernel seems to be really unstable and repeatedly crashes. I don't know why.)
Now, hopefully I can start using ipython notebook during my working week ...
by Bryan Lawrence : 2015/05/10 : 0 comments (permalink)