... personal wiki, blog and notes
AGU Fall Meeting 2012
So clearly I'm in San Francisco, at the biggest conference in the geophysical year. I have a talk - you can download it from the talks page linked here - but that's not what this post is about.
This is a big conference, stupidly big. While I'm enjoying it more than I thought I would, there is more than a tinge of guilt in being here. Because this conference is so big, you miss most of the talks you are interested in, since inevitably the sessions clash. So, in comparison to not coming, in which case I miss all the talks, coming means I miss most of the talks. Of course others get to hear me, but is that what it's all about?
(The AGU Fall meeting talk attendance has a lot in common with winning the lottery - buy one ticket, or even many tickets, and you've got a very small, non-zero probability of winning, but mostly you lose. If you come to AGU, you have a small, non-zero probability of seeing everything you want, but mostly you lose.)
So apart from the ego boost of an invitation, I didn't really come to hear the talks, I came for the conversations in and around the meeting. But even they clash horrendously when all the tribes are in one place.
For all my problems with clashes, I did hear some outstanding talks, and I have learned (and will yet learn) from listening. So there is real value ... but ...
Then there are the posters ... which are a horrendously inefficient way of communicating information, but a great way of starting conversations. I can't help thinking we can do that better using social media now ... do we really have to fly thousands of people thousands of miles for a few short conversations in front of something that could be better presented in a different way, exploiting a combination of social media and remote attendance?
I really wonder whether this sort of conference should become a dinosaur of the past. I totally respect the opportunities which physical co-location brings, both for serendipitous science, and for interdisciplinarity, but surely we can come up with a way of doing this sort of thing that requires fewer carbon miles? Perhaps we should have the big interdisciplinary overview meetings a bit less frequently, and keep the domain specific stuff where it belongs - in smaller more focused meetings? I appreciate that approach might cause problems for folks nearer the start of their careers, so I don't have all the answers, but it can't be beyond the wit of our community to resolve these sorts of issues ...
I guess I won't get another invitation ...
(p.s. EGU, the same applies to you :-)
by Bryan Lawrence : 2012/12/06 : 0 comments (permalink)
A tale of two ethics
A couple of years ago I took part in a lengthy discussion on scientific ethics. The context of that discussion was "the ethical risks" of any given piece of research, but I well remember the bottom line: in matters of ethics, there is no real black and white, "it all depends". It's possible to find a standpoint from which most "thinking people" (whatever that means) will believe you behaved ethically, and a standpoint from which most will not; often for the same action, albeit generally in different contexts!
So I find the sense of outrage about Peter Gleick's actions over the Heartland documents a bit much. That said:
I find much to commend in Dan Moutal's analysis, and
I'm pretty relaxed about James Garvey's article. (Unlike Richard Betts I didn't read it as a call to misbehave - indeed as Garvey says "It depends on how this plays out". Perhaps Richard would have been less offended if the subtitle "...perhaps more climate scientists should play dirty" had an explicit rather than an implicit question mark?)
While I rather think Bob Ward has got it right (as reported by Scott Mandia here, I can't find the original tweet):
"Peter was wrong but no comparison to Heartland's tactics. Not even close. Perspective in order."
... I also have a lot of sympathy for Steve Easterbrook's position:
" ...If AGW kills 300,000 p/y, exposing Heartland is ethical!"
However, I don't think it's about whether he was right or wrong, or even ethical (particularly the latter). What it's really about is what he might have done to trust. Some of my colleagues think we (scientists) have to somehow have the moral high ground to keep trust, and are worried that we're off down a slippery slope.
(We'll come back to the moral high ground.)
Misrepresenting yourself is not in and of itself unethical; I do it every Christmas, pretending to be Santa. Do you trust what I say scientifically any less? (If you do, then frankly you're beyond help.) So now we've established that misrepresentation is "allowed", it comes down to the context.
If I were to pretend some knowledge which I don't have, then you should absolutely stop trusting me. If I pretend to be someone I'm not (perhaps I dress up in women's clothing ... or maybe I don't) it shouldn't necessarily affect your level of trust in my science. Context matters!
OK, now if I pretend to be a conservative voter long enough to get my MP to pay attention to my moans about the NHS? What now? If I get him to inadvertently send me some document admitting the conservative party has never believed in being the greenest party ever? What lines will you let me cross? Why those particular lines? Context?
And so we come to Peter Gleick - who I've never met. He's not happy about what he did. Very few people are! But as far as I can see he's done a bit of mild misrepresentation. Not good, but not terrible either. It's who he is, and who he did it to that provides the context - and eventually, the results!
What are the consequences of his actions? The context?
We know a lot more about Heartland!
Those in an echo chamber (any chamber) won't change their minds on anything.
Those who have minds that listen, will continue to weigh the evidence.
Scientists will continue to respect his refereed work on science - as they did before, nothing has changed.
And thus we come to the nub of it. Even in science, especially in science, we trust no one - without evidence, and peer review. So we're happy to keep trusting his work in his domain of expertise (because we don't have to trust him). However we don't trust others (whoever they might be) to exercise trust in evidence in quite the same disinterested (?) and objective (?) way! Unlike "us" (being superhuman of course1 :-) ).
Of course, it isn't easy to weigh up evidence. Particularly now. In a completely different context, we have this: Weinberger's equivalent of Newton's first law: "For every fact on the internet, there is an equal and opposite fact."
So, to make decisions, we need a bit of expertise, but most of all we need provenance for our "facts" ... but is our moral high ground the only important part of our provenance?! Hell no, it's the method! Just because someone has never lied before, you shouldn't trust their science, you trust their science because of the method, and peer review (warts and all).
So I don't think what he's done will matter a jot in terms of influencing opinion in general, given my enumeration above, but what I'm really scared of is that it gives an excuse, to a certain kind of (cowardly) politician, for sticking his or her head back in the sand. Now that really is a pity - but it has nothing to do with ethics, or right or wrong, it has to do with a certain spinelessness of the political class to make tough decisions! And that really is something to get wound up about. Not Peter Gleick.
by Bryan Lawrence : 2012/02/27 : 0 comments (permalink)
Massive Spam Problem
I've always had problems with spam on this blog, not quite sure why, culminating in over thirty thousand spam comments over the recent Christmas period - all on old articles!
As a consequence, I've gone back through all my posts (I hope), removed all the spurious comments (and hopefully only the spurious ones) ... and made comments readonly throughout. If I've accidentally deleted your (real) comments, then I'm sorry, but for obvious reasons I had to do this in a semi-automated way ...
Until I move over to a new system, comments will only remain open on new posts for a few days ...
by Bryan Lawrence : 2012/01/17 : 0 trackbacks : 0 comments (permalink)
Reviewing Parsons and Fox on the data publication metaphor
Mark Parsons and Peter Fox have a draft paper out for public review and are inviting comments there ...
... but I found myself hamstrung by the "comment" paradigm, so I'm responding here.
Firstly, the paper itself: Like Chris Rusbridge, commenting on the blog, I found myself in equal parts agreeing and disagreeing ... but the bottom line is that it is a good question to ask, so I hope a version of this paper appears! The more this gets discussed the better, either I'm right (and data Publication - as I define it - should be a dominant paradigm), or I'm wrong - but if right, it'll never happen until folks discuss and buy into it, and if I'm wrong, well best I find out :-)
Ok, some specific comments:
There is an underlying assumption in this paper, which bubbles up to plain view in spots, and on the blog, that Publication and sharing are in conflict. I flat out disagree with this ... and don't know of any evidence that supports such an assertion.
The Publication metaphor already explicitly supports both via preprints (sharing or publication) and the formal paper of record (Publication).
Metaphors are obviously useful, but as the authors agree, flat out dangerous as well. I've blogged on the importance of "verifiable statements" before (and folks should look at the Tauber link from that!). So do I agree with these particular metaphorical paradigms?
Well, no not really, we at CEDA would claim we were playing in all four paradigms as they are defined, and I don't think we're that unusual. It is true that the examples they have chosen can be roughly characterised by these paradigms, but I think there's a bit of selection bias, sadly. Partly, I think that's engendered by using these activities to try and stratify data management as opposed to data activity. For example, we are a big archive (Big Iron1) who have some datasets which we wish to publish, and we are involved in map making and linked data - but I would argue that the second two are activities that depend on data management, they are not data management per se! Other groups will have very different data management standards and methods, but still be involved in map making and linked data. I think this is recognised to some extent in the paper (via the table where the focus line makes these key points, the qualification that they're not mutually exclusive, and the first paragraph of section 4).
Because I think these paradigms mix management and activity, they don't contribute easily to the question about publication, so I don't find them helpful in this context.
All that said, the discussion of how misunderstood definitions limit understanding is totally on the money (up to the point where there is an implicit assertion that "traditional peer review" is homogeneous enough to be differentiated from what one could do with data Publication). I think PLoS is an interesting analogy here ...
Why do the authors think concepts like "registration, persistence" etc are not relevant (see the reference to Penev et al)? The answer appears to come from the next sentence where there is a conflation between Publication and some (closed) implementations thereof.
I think the dynamic, annotation infested ( :-) ), world of Publications that I foresee is not inconsistent with open access, clear provenance, yet the authors assert that worrying about provenance and definition is undue? (I defy you to find a URL to a data object that is useful without you having some implicit or explicit knowledge about the medium that will be returned if you dereference a URL - obviously definition matters in the data world, and just as obviously some level of containerisation is necessary - even if it's only to say "stream of type X starts here, terminated by this binary string").
Of course annotation, federation, transformation are important, but if you want to use objects from that complex world, in a way that is scientifically useful, you need to know whether you can repeat or replicate your workflow!
In particular, one often doesn't need to see all that "ecosystem" in the workflow. The annotations are logically distinct from the data (and may themselves be different publications). If they do need to be together, then the anthology is a useful metaphor - publish poems on their own, and in collections!
What is linked data, but simply linked data? ... the objects that are linked, may, or may not be Published ... fair enough ... no conflict there?
So I simply don't buy that Publication is in conflict with unlocking the deep web! It's orthogonal in some sense!
All those picky disagreements aside, I find myself agreeing, the various paradigms available all lack in some way - but everything is a compromise. The analysis of their paradigms is fair enough ... in general ... but the conclusion lacks some rigour that they might be able to take from our paper.
I really liked the discussion of infrastructure and ecosystems ... particularly when it got to recognising the different roles that exist in Publication and how they can be decoupled. (Again, our paper helps in this regard, although I think it could be usefully extended using the arguments that Parsons and Fox have espoused).
When we get to releases, we start to see the concept of versions (editions anyone?) of data ... right on!
I think there is real scope to consider how the infrastructure and ecosystem around data Publication will be (should be) very different from those around literary publication. I'd really like that section to be expanded ...
... it is important not to be hidebound by ... any one metaphor. Yes, Yes, Yes.
All of which is me saying my definition of Data Publication is clearly different from Parsons and Fox. Which is utterly fine ... neither of us is right or wrong, we just need to define what we are talking about!
Ok, now to consider the blog traffic :-) (which at the time of writing was up to comment 10).
The one point that I think is explicitly different from any of the points above is the importance of social science and philosophy in establishing both our expectations and our practice ... absolutely true!!!
The other point (to reiterate) is that data Publication is not in conflict with sharing, but there is a time for sharing and a time for not sharing - and whatever system(s) we come up with have to leave the decision as to how much sharing is wanted or desired to individuals (or their funders) - the "infrastructure" cannot make that decision (positively or negatively). It needs to facilitate it!
(In particular, it is just fine for scientists to hold some data close to their virtual chests ... consider, if nothing else, health records, or the geographical location of the last members of particular species etc).
Finally, I too am a big advocate of citation: it is not the answer, but it's certainly an answer! Any metaphor is both useful and limiting, and even the best solutions are only useful 80% of the time. That means this data "publication" paradigm will involve multiple solutions. Bring on the separation of concerns :-)
by Bryan Lawrence : 2011/12/16 : 0 trackbacks : 2 comments (permalink)
Communicating Science (to the media)
Interesting seminar here this morning on the subject.
Some memes that came out of this (and a previous staff meeting):
Most academic staff don't see public/media communication as their job, and that's a combination of being over-worked in the first place, not thinking we're particularly good at it (we tend to use overlong, over-qualified sentences :-), and not wanting to get into a fight with some nutter in honour of some spurious notion of balance.
We (academics) need to remember the reality of the news cycle: most stories have a four-hour life expectancy, and most of them are sprung from events, and prepared in a matter of hours, if not minutes. For television and radio particularly, waiting to do a properly researched story would, in many cases, mean the story would never appear - bad quick information wins over good slow information!
It's important to remember that "show business" matters; an accurate story which is boring won't be listened to (or read)! So, like it or not, if you want to communicate, you have to play by the "make it brief, make it interesting, and do it with panache" rules. (Oh no, see the second caveat in 1. above :-).
(However, in the 24-hour news cycle, you might be able to negotiate a "serialisation" of your message into byte (pun!) size chunks - giving a reporter several stories - but you'd have to be lucky :-)
Some survey of stories in the "quality US newspapers" (editor: no comment :-) suggested that 85% of the stories were founded on only one source, and often that source was a press release! (Gosh!)
Prepare for interviews, but be available. Saying "Not now, I'm available in an hour" might mean you're not "used" ... but if you're going to be interviewed, spend at least ten minutes trying to work out how to get your message across. If it's a soundbite interview, your message had better be in 45 words (or fewer if you have to deflect a spurious question "on-message"). If it's a three-minute "radio-4-type" interview, it'd better be 200-300 words max (and be prepared to deflect several spurious questions "on-message").
Often a story/interview will go better if you take five minutes to send someone specific facts/figures in a few bullets in an email - it might well get them "on-message" and asking about the right things!
And the bottom line, for us climate scientists: As Edmund Burke said
All that's necessary for the forces of evil to win in the world is for enough good men to do nothing.
And we are, most of us, funded by the public (one way or another), to do the right thing for the public ... a bit like open data really I suppose!
(Update: I should have noted the seminar speaker was Jon Barton of Clarify Communications.)
data scientists are anally retentive too!
Well, that title ought to make this blog article inaccessible to those of us with anally retentive firewalls, but because it's only a minor misquote of an article from the Times Higher Education, it is legitimate!
The original quote is: Professor Nunberg said
"like most high-tech companies, Google puts a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work librarians are used to."
Well yes: but I think Professor Nunberg said the right thing for the wrong reasons. Firstly, the context: Google Books has a lot of metadata, and a lot of it is wrong, and Nunberg was effectively saying that librarians do metadata better.
Well no: they clearly have procedures in place for checking metadata, but they don't necessarily have expertise - and even if they do, that might not be enough. IMHO, in the long run, crowd-sourced metadata (where one has potentially multiple experts on tap) will vastly exceed (individual) expert-sourced metadata. Because when metadata is more than just (title, author, year), metadata really is just another sort of content, and when content matters for description, Wikipedia v Britannica (e.g. here and here) shows us the direction of travel. Crowd sourcing wins, in the long run.
(Caveat: generating information is still an individual or at best a team game, you can't crowd source information ab initio, even if you can crowd source the evaluation and description of it. Caveat-squared: you can of course crowd source data - crowds can generate data. Anyway, enough of this digression.)
However, where librarians do matter, and data scientists too - for whom the same argument applies - is in setting the initial conditions! Since, after all, we're never "in the long run", we're always getting there, and we get there much faster if we know where we started from. Attention to detail is a great quality for librarians and data scientists alike (actually all scientists); such anal retention is an integral requirement for many of us ...
Taking us back to the THE article, clearly if the index doesn't point you to the vicinity of the thing you want, it will take a long time to get there. Similarly, if you don't have good scientific metadata, it takes a long time to choose the tree from the wood, and having done so, you might not have an initial evaluation for how good that tree was ...
What Nunberg might have said (who knows, maybe he did say, I only had the THE article to go on) was: why on earth didn't Google start with an algorithm to utilise the professionally curated metadata rather than their own? And why not put a proper crowd source front end, so it can be improved?
All of which is rather ho hum, but what really got me thinking was the parallel between Google, and the attitude of a great many of the scientific community to managing their data, in which case I'd say:
Like Google, most scientists put a much higher premium on innovation than maintenance. They aren't good at the punctilious, anal-retentive sort of work that is necessary for society to get the best out of the data it has paid to collect!
Which is to say, it's not just being anally retentive that matters, it's what you're anally retentive about.
OK, well that's the last time this subject (a-r) comes up on this blog ... even if it was enough to provoke me out of my workload-induced blog silence (although in truth, what was really needed was the sudden cancellation of a scheduled meeting).
European Exascale Software Initiative
Last week I spent two days at the lovely Domaine de Tremblay (an hour from central Paris):
This was the venue of a workshop organised by the European Exascale Software Initiative (EESI). The project has an interesting blend of activities; four domain working groups
industrial and engineering applications,
weather, climatology and earth sciences,
life sciences and health,
cut across four technical working groups
hardware roadmaps (and vendor links),
numerical libraries, solvers and algorithms, and
scientific software engineering.
to try and work out what Europe will need to do to exploit future exascale computing systems (which is of course a software problem).
I didn't keep proper notes, but I did tweet a bit, here's my tweet stream from the meeting:
Sitting in a long narrow room ready for two days on the European Exascale Software Initiative. Not expecting a hashtag here #eesi
Some real ambition in this room. Complete simulation of aircraft and airflow, would require zetaflop computing! #eesi
I so do NOT believe in codesign: if the s/w life cycle is much longer than the h/w life cycle, it's a non starter! #eesi
Other domains (genomics, seismology etc) as well as climate stress importance of exaSCALE cf exaFLOP! Need more balanced HPC! #eesi
Hardware review: interesting to note that the only thing all vendors are actively researching is hw fault tolerance! Frightened? #eesi
Worries about parallel file system futures: scalability and reliability, file number, metadata performance. Few vendors in the game! #eesi
Programming models needed which better allow incremental parallelisation, more decoupling between h/w and s/w, and easier portability #eesi
Not all faults expected to be masked by h/w, or tolerated easily by application (h/w correction will generate noise, global restart) #eesi
Quadrant diagram: divide tasks into ((production, testing) versus (slower, faster)). Helps decide between capability versus capacity #eesi
Not acceptable to try and control 3 levels of parallelisation by hand: system, nodes, cores. Improve abstraction in programming model #eesi
Future algorithms might have to start with single precision, because SP uses less power, and then only use double precision later #eesi
Crickey. Someone asked about libraries for symplectic integration. I can remember the noun Hamiltonian. Can't remember what it means! #eesi
Definition of co-design: h/w perspective, we'll work with s/w community so you can change your apps to use our (slightly changed) h/w. #eesi
Definition of co-design: s/w perspective, we'll work with h/w community so you can change your systems. Will they ever meet? #eesi.
exascale hybrid core future: we might need to spend half the money we used to spend on computers on s/w engineers so we can use them! #eesi
Bloke in audience: why does data management matter for exascale? Everyone: because the flops are useless if you can't feed & use them! #eesi
Speaker guess: 3 h/w generations between us & exascale? Me sotto voce: If they get more heterogeneous each time we'll never get there! #eesi
One breakout grp arguing for s/w isolation layer to hide h/w. Next asking for co-design between h/w and s/w. Draw your own conclusion #eesi
Programmability: Petascale systems are not easy to programme. Do we really believe it will get easier at exascale? Yet it must! #eesi
Programmability: An assertion: Parallelising serial codes will almost certainly fail. Decompose algorithm from 1st principles instead. #eesi
"If you want an exascale initiative of relevance: you will need to invest big time in REdeveloping applications (new algorithms)" #eesi
... "you will also need to invest heavily in training, and a proper career path for scientific computing/computation ..." #eesi
I also chaired and reported on a breakout group on cross-cutting data issues. The breakout report is available as a pdf. Although the turnout to this breakout session was relatively low, a couple of folks came up to me afterwards, and bemoaned having to be in two places at once, and having to choose programmability as a (slightly) higher priority. This meeting confirmed to me, once and for all, that nearly every scientific domain is struggling with big data handling, particularly adjacent to HPC!
There was also considerable discussion in our breakout group about software sustainability, a theme that was picked up the next day in the EPSRC strategic advisory team meeting on research infrastructure.
(Yes, I did get home from a lovely venue in Île-de-France and go straight to Polaris House in Swindon the next day; you can imagine how the comparison felt.)
Nearly every meeting I go to makes the following point: given that software is such a big and crucial part of research infrastructure nowadays, the system is going to have to change:
More professional software development and maintenance in academia, coupled with
Better careers for those in software support (and software leadership) roles, and
Sustained funding from research funders for codes which underpin the science community.
Thought provoking talk on climate policy
Today at the ICTP workshop on e-infrastructure for climate science, George Philander gave a really thought provoking talk on global warming and aspects of policy consequences - particularly in an African context.
His basic meme followed the following line of argument:
We know that in Pliocene times the Earth appeared to be in a permanent state of El Nino (no evidence for cold water intrusions in equatorial regions).
No current climate model can replicate that state, so
We can only trust models for small perturbations from the current state
He alluded to a paper by Carl Wunsch pointing out that models can't be tested in paleo times ... and to a paper by Paul Valdes. I'm going to chase up both.
So, he worries about global warming, not because we know what is going to happen, but because we don't, which implies that this is a time for circumspection. He likens us (the planet) to a ship in fog: it's time to slow down and sound our way!
He also had a lot to say about the policy paralysis which, amongst most serious analysts, comes down to an argument about the discount rate based on our current predictions for the future. Should we act now or later? But this is a rich person's argument! For those in poverty there is no argument: the only choice is later! However, the argument over discount rate is predicated on (relatively) small perturbations from the current state, and if our models can't do serious perturbations, that's a matter of concern. As he puts it, one day we will know these things, but we don't yet, so let's put the argument in terms of the planet being a special place (habitable) at a special time (warm interglacial). Let's treat that planet with some circumspection until the future is clearer!
What exactly to do, if your perspective is that of trying to lift millions from poverty, even as you want to do it without burning coal, is a moot question. I don't think any of us know the answer - Philander certainly didn't claim an answer - but it's a question that I hadn't really given any thought to before today.
Most of these ideas appear in a paper called "Where are you from? Why are you here? An African perspective on global warming." which appears in the Annual Review of Earth and Planetary Sciences, pdf. It's well worth a read, since amongst other reasons, you won't often see a reproduction of a Titian in the same article as figures showing millennial temperature changes etc. There is also a call to arms for an African centre to study earth sciences. I hope it succeeds!
by Bryan Lawrence : 2011/05/18 : 0 trackbacks : 0 comments (permalink)
hpc futures - part two
What is this exascale wall that I've been tweeting about?
The last decade of advances in climate computing has been underpinned by the same advances in computing that every one of us saw on our laptop and desktop until a couple of years ago: chips got faster and smaller. The next decade started a couple of years ago: processors are not getting faster, but they are still getting smaller, so we get to have more processors - there are even dual-core smartphones out there now! What's not so obvious though is that the memory per core is generally falling, and while the power consumption is falling, it's not falling fast enough ... pack a lot of these (cheap) cores together, and they use a ludicrous amount of power. The upshot of all this is great for smartphones, but not so good for climate computing: the era of easy incremental performance (in terms of faster time to solution, or more complexity/resolution at the same time to solution) in our climate codes is over. Future performance increase is going to have to come from exploiting massive parallelisation (tens of millions to a billion threads) with very little memory per thread - and it'll come with energy cost as a big deal. I first wrote about exascale on my blog back in August last year. (I promised to talk about data then, and I will here ...)
What all this means is that our current generation of climate model codes probably has at best a few years of evolution in its current incarnation, before some sort of radical reshaping of the algorithms and infrastructures becomes necessary. At the US meeting I'm at now, two different breakout groups on this subject came to the same conclusion: if we want to have competitive models in five and ten years' time, we need to a) continue to evolve our current codes, but b) right now, start work on a completely new generation of codes to deploy for use in the next-but-one generation of supercomputers. That's a big ask for a community starved of folks with real hard-core computing skills. Clearly there are a lot of clever climate modellers, but the breakout group on workforce summarised the reality of the issue: the climate modelling community consists of three kinds of folks - diagnosticians, perturbers (who tinker with codes), and developers. Universities mainly turn out the diagnosticians, some perturbers, but very very few developers with the skillset and interest to do climate code work.
That's a big problem, but the data side of things is a pretty big problem too. Yes, the exascale future with machines with tens to hundreds of millions of cores is a big problem, but even now we can come up with some scientifically sensible, and computationally feasible, methods of filling such a machine. Colin Jones from the SMHI has proposed a sensible grand ensemble based on a tractable extension of how EC-Earth is being used now (running, in an extreme experimental mode, at an effective 1.25 degrees resolution). An extrapolation of that model to 0.1 degrees resolution (roughly 10 km) would probably effectively use 5000 cores or so. If one ran an ensemble of 50 members for a given start date, at the same time, it could use 250,000 cores. Ideally one would have a few different variants of this or similar models available, capturing some element of model uncertainty - let's say 4. Now we can use 1 million cores. To really understand this modelling "system", we might want to run 25-year simulations, but sample a range of initial states - let's say 40. Now we can use 40 million cores. This is an utterly realistic application. If a 40 million core machine was available, and we could use it all, this would be an excellent use of it (there are other uses too, and for those we need the new codes discussed above). But let's consider a little further.
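The core counts are just multiplication, but it's worth laying the arithmetic out explicitly (a minimal sketch; the numbers are the ones quoted above):

```python
# Grand-ensemble core counts, using the figures quoted above.
cores_per_instance = 5_000   # one 0.1-degree model instance
members = 50                 # ensemble members per start date
variants = 4                 # model variants capturing model uncertainty
start_dates = 40             # sampled initial states

cores_one_ensemble = cores_per_instance * members           # 250,000
cores_with_variants = cores_one_ensemble * variants         # 1,000,000
cores_full_experiment = cores_with_variants * start_dates   # 40,000,000

print(cores_one_ensemble, cores_with_variants, cores_full_experiment)
```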
Colin tells me that the 1.25 degree (actually T159L62) model produces roughly 9 GB of data per simulation month, writing out onto an N80 reduced Gaussian grid (which means you can double the following numbers if you want "full resolution"). Scaling up to the 0.1 degree version would result in 1.4 TB/month, and the grand ensemble described above would result in a total output of around 3 exabytes! For a reasonable time to solution (150 hours, or a week all told; that is, 6 hrs per model-instance year), it would require sustained I/O from the machine to storage of around 50 Tbit/s.
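The scaling arithmetic behind those numbers can be sketched as follows (decimal GB/TB/EB throughout; the 9 GB/month figure and the 150-hour time to solution come from the text, and the roughly 156x factor is simply the square of the horizontal refinement from 1.25 to 0.1 degrees):

```python
# Rough reproduction of the data-volume and I/O numbers quoted above.
# All inputs are the post's assumptions, not measurements.

gb = 1e9
monthly_low_res = 9 * gb                  # T159 (~1.25 deg) output per model month
scale = (1.25 / 0.1) ** 2                 # horizontal refinement factor, ~156x
monthly_high_res = monthly_low_res * scale
print(f"{monthly_high_res / 1e12:.1f} TB/month")   # ~1.4 TB/month

members = 50 * 4 * 40                     # ensemble x variants x start dates
months = 25 * 12                          # 25-year simulations
total_bytes = monthly_high_res * members * months
print(f"{total_bytes / 1e18:.1f} EB total")        # ~3.4 EB

seconds = 150 * 3600                      # ~a week of wall-clock time
tbit_per_s = total_bytes * 8 / seconds / 1e12
print(f"{tbit_per_s:.0f} Tbit/s sustained I/O")    # ~50 Tbit/s
```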
Archive and analysis may be a problem! Remember this 3 exabytes of data could be produced in one week!
At this point it's probably worth considering an exascale computer not as a "computer" but as a "data source" ... it's a bit of a paradigm shift, isn't it? Even without heaps of work on exascale software, our exascale computer can produce an outrageous data problem. We need to start thinking about our computer in terms of its analysis and archive capability first; then we can think about its computational ability, and indeed how to get our codes to address other important problems (such as faster time to solution, so we can have better high-res paleo runs etc). This ought to be affecting our purchasing decisions.
Hang on a moment though. The obvious immediate rejoinder is "we don't need to write all this data out". So what can we do to reduce the data output? We can calculate ensemble statistics in the machine, and we can choose to write out only some representative ensemble members. That might gain us a factor of 10 or so. We could simply say we'll write out only certain statistics of interest, and not all the output. That's certainly feasible for a large class of experiments where one is pretty sure the data will get no re-use, because the models are being deliberately put in some mode which is not suitable for projection analysis or extensive comparison with obs. But many of these ensemble experiments are very likely to produce re-usable data. Should it be re-used?
Well, consider that our 40 million core machine will probably cost around 50 million USD when we get hold of one. If we depreciate that over five years, say, then it's about 10 million per year (in capital cost alone), or roughly 200,000 USD per week to use. Double that figure for power costs and round up to 500,000 USD for that grand ensemble run. I have no idea what storage will cost when we can do this run, but my guess is that the storage costs will be of the order of 3 million USD. To be generous, let's imagine the storage costs will exceed the run time costs by a factor of 10.
(Where did that number come from? Well, today's tier-1 machine might have 50,000 cores, and it costs o(1) million USD per PB. To go to 50 million cores, we scale CPU by 1000, let's imagine we scale data costs accordingly. So when I have a 50 million core machine, I'll be able to get an EB of storage for the same price as today's PB.)
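For what it's worth, the depreciation and storage arithmetic can be run the same way (every input below is one of the rough guesses above, nothing more authoritative than that):

```python
# Sketch of the run-cost vs storage-cost comparison, using the post's guesses:
# a 50M USD machine depreciated over 5 years, power doubling the cost, and
# future storage at ~1M USD per EB (today's ~1M USD/PB scaled by 1000x).

capital = 50e6                        # machine cost, USD
per_week = capital / 5 / 52           # capital cost per week of use
run_cost = 2 * per_week               # doubled to allow for power
print(f"~{run_cost / 1e3:.0f}k USD per week")          # ~385k USD per week

storage_per_eb = 1e6                  # guessed future storage price, USD/EB
storage_cost = 3.4 * storage_per_eb   # ~3.4 EB of grand-ensemble output
print(f"~{storage_cost / 1e6:.1f}M USD for storage")   # ~3.4M USD

print(f"storage/run ratio ~{storage_cost / run_cost:.0f}")  # ~9, call it 10
```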
So the big question then is: how many application groups are likely to be able to exploit my run, and can I anticipate their specific data output needs? Well, my guess is that for an ensemble experiment as discussed above, there would be value in that data for considerably more than ten different groups, be they parameterisation developers, observational analysts, or impacts advisors!
So, we probably should store as much of it as we feasibly can! We can talk about the analysis effort and the network issues at a later date!
(Actually, when I run the numbers for other possible permutations of output and model configuration, the 9 GB/month we started with seems small: it's entirely feasible to suggest that a realistic IPCC-style output requirement, when scaled out, would result in around 50 times more output. But for an IPCC experiment, hundreds of applications are feasible.)
Pushing Water Analogies too far
Recently I heard a talk by Kevin Trenberth, who showed a slide with the following cartoon:
It was perfect in context, but I also found it amusing in a serendipitous sort of a way: just before he started speaking, I had started (during a coffee break, I hope) googling for images of reservoirs, hoses and sprinklers, because I had a similar idea in mind, and I wanted an easy way of communicating it.
However, my idea was, and is, rather different from Kevin's. So why is it different? Firstly, mine was born out of Ian Jackson's repeated mantra (on the NERC information strategy group) that a water-pipe system is no good without water, and my repeated response that without a pipe system, the water just gets wasted. We were both right of course: a water delivery system is useless without water, and water gets wasted without a delivery system.
To some extent Kevin's image is redolent of wasted water, even though there are clearly components of a system ... but I'd put it slightly differently:
I don't think the sensor is analogous to the hydrant, I think it should be analogous to the rain pouring water into a reservoir via runoff, rivers, whatever ...
There are plenty of folk who will tell you it's raining data.
My point being that there are a plethora of sensors (water delivery mechanisms), each delivering data (water) into a bunch of archives (lakes, reservoirs, whatever).
The hydrant is simply a defined interface to which I can couple a hose! It depends on being connected to a local reservoir, which itself may be coupled by a canal or a pipe to another reservoir, or lake or whatever. So the lessons I draw are that:
- In the case of data, we need local caches, to which we can connect delivery systems which understand a standard interface.
- But these caches depend on the existence of connected (and large) managed bodies of water. In my world view, that might be a European archive (distributed or otherwise), and a national archive (which may be independent of, or part of, the European system).
I think the idea of the scientist cowering under a deluge of data is correct, but more and more we see folks building complicated data delivery systems so that the data can be more easily consumed. I see that as analogous to the design and construction of a range of sprinklers on the end of the hose, targeted at particular problems.
So here is my graphic:
And the context?
We are being encouraged to put more and more effort into the front end (the portal, the visualisation system) to the detriment of the back end (the managed archive: well connected, with standard interfaces).
It clearly won't end well if we have a lovely sprinkler system, but the reservoir looks like this:
Nor will it end well if we ignore the importance of metadata, and the fact that the pipes can carry more than one type of fluid:
(I'm well aware I need some picture credits on this post, since I made serious use of other people's pictures ... please let me know if your picture is here and you would like a credit, or for me to remove it.)