Bryan Lawrence : Bryan's Blog 2017/02

Bryan Lawrence

... personal wiki, blog and notes

Bryan's Blog 2017/02

Week Eight

Spent nearly the entire week (and certainly more than a "European maximum 48-hours") on things to do with ESIWACE (on the deliverable I mentioned last time, reporting for the EC, etc). I did have a fair chunk of one day on other things, as I also chaired a meeting of the advisory panel for our climate predictions for the Copernicus Climate Data Store project (CP4CDS, a project about deploying and maintaining software for a special ESGF data node to support climate services) - that took up the best part of a day. However, that's pretty much it for the week. Hard to believe we're two months into the year - and I still haven't come up for air and managed to create any significant blogging time.

by Bryan Lawrence : 2017/02/27 (permalink)

Weeks six and seven

Two weeks to report this time, primarily because I had a day off on sick leave, and two days on real leave, so there are only seven weekday workdays to talk about (although I have just spent big chunks of both yesterday and today, that is, Saturday and Sunday, on work as well).

Early on, the main thing I was doing was trying to catch up on the ever-increasing email mountain, but in week six, by the time I took out a day in London for a NERC Information Strategy Group meeting (mostly about the future of NERC data centres), a half day on technical futures for CEDA, another half day on my final CEDA monthly meeting and a follow-up meeting on CEDA support for SPARC, and a day off on sick leave, that only left a few hours here and there to get anything done. The net effect of course was that the email mountain grew.

To be fair, it wasn't so much that the email mountain grew as that the Nozbe task list grew. I did manage to process a lot of email, but quite a lot of things got thrown onto the pile for "later".

Then this last week (week seven), we had a few days on holiday down Dorset/Devon way. Just the couple of nights, but real recharge territory. However, back to work on Wednesday, and back to London - this time for an NCAS Science Strategy Board meeting, so only two real work days available.

So, in the work time, I did a wee bit of work on the ENES infrastructure foresight that I've talked about before, and quite a lot of work on the first deliverable we've got for ESIWACE: which is on requirements and business modelling for (weather and climate) data centres. Given the constraints on my time during the week, it's been a big effort on that this weekend too. I suspect quite a few things from that will turn up on my blog ...

... but anyway, that's the weeks that were.

by Bryan Lawrence : 2017/02/19 : 0 comments (permalink)

Space and Open Plan Offices

I was paying a bit more attention to Twitter this morning than usual (I'm hoping I'll get some feedback on my analysis of citations that I posted yesterday). One thing that blew by was this headline from the Washington Post:

"Google got it wrong. The open-office trend is destroying the workplace."

which took me back to something that I wrote in 2005, reporting on work done long before.

It seems that Google was ignoring history as well. Unlike them.

There's been lots more work done since the work I cited in that blog post, for example:

Workers in open-plan offices are more distracted, unfriendly and uncollaborative than those in traditional workplaces, according to the latest industry survey.

Employees who have to share their office with more than two people experience high levels of colleague distrust and form fewer co-worker friendships than those working in single-occupancy offices ...

or this, reporting published research:

"...the open-plan proponents' argument that open-plan improves morale and productivity appears to have no basis in the research literature."

by Bryan Lawrence : 2017/02/06 : Categories management : 0 comments (permalink)

Why so much data? Part I: The rise of direct numerical simulation

Over the last decade or so, many in the scientific community (especially in the environmental sciences) have been surprised by the increased cost of data handling - not only in absolute terms, but in terms of the percentage of the cost of doing "normal science".

This increase has been problematic on a number of fronts: not only do individuals not always plan appropriately for managing their data storage and handling, but even where they have, institutions and funding agencies have themselves been surprised and not always too keen to pay up. After all, in a world where there is very little new money in real terms (or even none), an increase in one part of the budget needs to be offset elsewhere. So, the real consequence of these increases in storage cost has to be decreases in the amount of science done (fewer staff/instruments/computers), and that's unpalatable even if it's unavoidable. At the moment many choose to think it is avoidable, which is quite a feasible position if you're an ostrich, but not so good if you're responsible for delivering science!

In this post, and maybe a couple to follow, I want to address why data handling is becoming a bigger deal in environmental science, and why we can't avoid spending more money on it (but also how we can avoid spending more than we need to).

So, what are the factors in play? Well, as well as background economics, there are four:

  1. The direct influence of Moore's Law on instrumentation and simulation (finer resolution in space and time means more numbers),

  2. The indirect influence of Moore's Law on what can be simulated (more compute means more things are computable),

  3. The growth of interdisciplinarity (more things need to be compared and contrasted, and more people are doing it), and

  4. The relationship between Moore's Law and Kryder's Law (is the cost of storage falling as rapidly as the cost of creating the numbers to be stored?).

Before delving into the technical details, I want to look at one of the underlying scientific trends that arises, partially as a consequence of the "indirect influence of Moore's Law": the rise and rise of direct numerical simulation, especially in the environmental sciences.

As noted above, more compute means more things are computable, and when we couple that to increased mathematical sophistication, more and more of the real world is amenable to direct numerical simulation: that is, it can be numerically simulated from fundamental equations rather than approximated by heuristics. The importance of this from a scientific point of view is that if one believes one is simulating the underlying processes properly, one can use the "simulation system" to predict how the system will behave under different circumstances than have been observed (either by coupling it into more complex systems or by using the system to predict past or future behaviour).

To some extent this is the holy grail of science: when one can simulate a system so well that one can't tell[1] whether one is observing a simulation or the real world, one can believe we understand that part of the real world.

For some time I have been asserting that an ever greater part of environmental science is engaging in direct numerical simulation (DNS), year on year. At the same time, I've been asserting that larger and wider communities were interacting around data, and especially around the data from model intercomparison projects, again, year on year. Of course, these assertions were not unrelated!

These assertions were based primarily on my interactions with the scientific community (one of the things about running an environmental data centre for much more than a decade is that my day job has involved interacting with individuals from across the scientific spectrum), and so one might dispute them. However, a year or so ago, I realised I might be able to get some quantitative information to support them by a bit of careful text mining. Unfortunately, I also realised I was never going to get the time to do it properly, so what follows is very amateur, but I hope still interesting. (If you have the skills and access to the data to do this properly, get in touch!)

The following figure was generated by spending a lot of time doing searches on Google Scholar (it would have been rather less time if Google didn't actively stop one doing this sort of work programmatically - I did try, and it worked fine until their "no robot code" stopped me in my tracks). Each point reflects the number of hits from a search on a specific set of terms, from material restricted to a two-year window. The terms were chosen to try and reflect five specific categories of interaction:

  1. direct numerical simulation (across any discipline) - "dns" in the figure,

  2. direct numerical simulation in the environmental sciences - "dns+env" in the figure,

  3. model intercomparison projects - "mips",

  4. use of satellite data in environmental science - "sats", and

  5. regular observations - "sondes".

(The table at the end of this post gives the details of the exact searches carried out.)

Image: static/2017/02/05/google_searches.png

The numbers in the legend are firstly the ratio of the last couple of points over the first couple of points - a measure of proportional growth, and secondly, the gradient from a fit to the number of hits per annum - a measure of absolute growth.
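The two legend measures can be sketched in a few lines of Python. This is only an illustration of the arithmetic as described above, not the code used for the figure: the function name `growth_measures`, the choice of averaging two points at each end, the division by the two-year window to get hits per annum, and the example numbers are all assumptions.

```python
from statistics import mean


def growth_measures(years, hits, window_years=2, n_ends=2):
    """Return (proportional_growth, absolute_growth) for a series of search hits.

    years: first year of each search window (assumed layout)
    hits:  number of hits returned for each window
    """
    if len(years) != len(hits) or len(hits) < 2 * n_ends:
        raise ValueError("need matching series with at least 2 * n_ends points")

    # Proportional growth: mean of the last few points over mean of the first few.
    ratio = mean(hits[-n_ends:]) / mean(hits[:n_ends])

    # Absolute growth: least-squares slope of hits per annum against year.
    per_annum = [h / window_years for h in hits]
    x_bar, y_bar = mean(years), mean(per_annum)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(years, per_annum))
    sxx = sum((x - x_bar) ** 2 for x in years)
    return ratio, sxy / sxx


# Made-up numbers purely for illustration; these are NOT the values in the figure.
example_years = [2000, 2002, 2004, 2006, 2008, 2010]
example_hits = [100, 140, 180, 220, 260, 300]
ratio, gradient = growth_measures(example_years, example_hits)
```

The ratio is insensitive to how big a category is, which is why a small but fast-growing category can beat a large one on the proportional measure while losing on the gradient.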

The figures back my observations rather nicely:

  • The number of papers on any facet of environmental science is growing. (No news there.)

  • The number of papers using direct numerical simulation is growing rapidly, but the use of DNS in environmental sciences is growing even more rapidly, at least using the proportional measure.

  • The increase in papers which use MIP data is explosive, and one can see the direct influence of CMIP5 in the numbers.

  • Growth in observational science is slower than in numerical science, although the effect of increased availability of satellite data is apparent.

Obviously these conclusions could be heavily affected by the search terms I used, so if your mileage varies, let me know!

For the record, these are the exact (full text) searches used:

  • "dns": search for the exact phrase "direct numerical simulation"

  • "dns+env": as above, but require one of the following as well: cloud, rain, weather, climate, ocean, atmosphere, land, river, biogeochemistry, aerosol.

  • "sats": search for an exact match on "nadir sounder" (as a proxy for atmospheric satellite data only, and using that as a proxy for environmental science use of satellites in general).

  • "mips": exact match for model intercomparison project AND the use of the word simulation AND either atmosphere OR ocean.

  • "sondes": at least one of "radiosonde" or "dropsonde" appears (being a proxy for any sort of "traditional" observation - words like Lidar and Radar being too difficult to limit to environmental science, at least in this first cut at the problem).
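For anyone wanting to repeat the exercise, the searches above can be written down as query strings plus a list of two-year windows. The sketch below is a hedged reconstruction only: the quoted-phrase and OR syntax is my assumption about how such searches would be typed, the names `QUERIES` and `two_year_windows` are hypothetical, and the year range shown is illustrative rather than the range used in the figure. It deliberately builds strings only and does not attempt to query Google Scholar.

```python
# Terms required (any one of them) for the "dns+env" category, as listed above.
ENV_TERMS = [
    "cloud", "rain", "weather", "climate", "ocean",
    "atmosphere", "land", "river", "biogeochemistry", "aerosol",
]

DNS_PHRASE = '"direct numerical simulation"'

# One query string per category; the syntax is an assumption, not a record
# of what was actually typed into the search box.
QUERIES = {
    "dns": DNS_PHRASE,
    "dns+env": f"{DNS_PHRASE} ({' OR '.join(ENV_TERMS)})",
    "sats": '"nadir sounder"',
    "mips": '"model intercomparison project" simulation (atmosphere OR ocean)',
    "sondes": "radiosonde OR dropsonde",
}


def two_year_windows(first_year, last_year):
    """Return consecutive (start, end) year pairs covering first_year..last_year."""
    return [(y, y + 1) for y in range(first_year, last_year, 2)]


# Illustrative range only; the real date range is not stated in the post.
windows = two_year_windows(2000, 2005)
```

Each (category, window) pair then corresponds to one point in the figure, with the hit count read off by hand.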

1: We can have a fun discussion, in another post, about what "one can't tell" might mean.

by Bryan Lawrence : 2017/02/05 : 0 comments (permalink)

Week five

Nothing exciting to report. I spent nearly the entire week processing email, interacting with my team (both directly and on Slack), and producing short things (e.g. a management-level one-pager on why NCAS should continue to support the CF conventions, and how). However, I did spend a wee bit of time trying to reinforce some of my observations around why storage is becoming a much bigger deal in environmental science. More on that next ...

by Bryan Lawrence : 2017/02/05 (permalink)

DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.