Talks and Seminars
From 2006, I'll try and put the content of any significant talks I give on this page.
Data Centre Technology to Support Environmental Science
This was a talk given at a NERC Town Hall meeting on the future of data centres in London, on the 13th of October 2016. My brief was to talk about underlying infrastructure, which I did here by discussing the relationship between scientific data workflows and the sort of things we do with JASMIN. (pdf)
Computer Science Issues in Environmental Infrastructure
This was a talk at an internal University of Reading symposium held with Tony Hey, Geoffrey Fox and Jeff Dozier as guest speakers. The main aim was to get across the wide range of interesting generic science and engineering challenges we face in delivering the infrastructure needed for environmental science (pdf).
The science case for JASMIN, RAL, June 2016
Keynote scene setter for the inaugural JASMIN user conference: how the rise of simulation leads to a data deluge and the necessity for JASMIN, and a programme to improve our data analysis techniques and infrastructure: pdf.
Internal Vocabulary Meeting at CEDA
A brief introduction to some of the basic tools being used to define ES-DOC CIM2 and the CMIP6 extensions: pdf
IS-ENES2 2nd General Assembly, Hamburg, February 2016
ESDOC for CMIP6
I presented an introduction to how ES-DOC is planning on supporting CMIP6: pdf
International Computing in Atmospheric Science, Annecy, September 2015
UK academic infrastructure to support (big) environmental science
Abstract: Modern environmental science requires the fusion of ever growing volumes of data from multiple simulation and observational platforms. In the UK we are investing in the infrastructure necessary to provide the generation, management, and analysis of the relevant datasets. This talk discusses the existing and planned hardware and software infrastructure required to support the (primarily) UK academic community in this endeavour, and relates it to key international endeavours at the European and global scale - including earth observation programmes such as the Copernicus Sentinel missions, the European Network for Earth Simulation, and the Earth System Grid Federation.
Cloud Workshop, Warwick, June 2015
Why Cloud? Earth Systems Science
or Data Driven Science: Bringing Computation to the Data (pdf)
EGU, April 2015
Beating the tyranny of scale with a private cloud configured for Big Data
BDEC, Barcelona, January 2015
There were two back-to-back meetings organised as part of the 2015 Big Data and Extreme Computing meeting (website). In the first, organised as part of the European Exascale Software Initiative (EESI), I gave a full talk; in the second, I provided a four page position paper with a four page exposition.
It starts and Ends with Data: Towards exascale from an earth system science perspective
Six sections: the big picture, background trends, hardware issues, software issues, workflow, and a summary: pdf.
Bringing Compute to the Data
This was my main BDEC contribution:
Leptoukh Lecture, AGU Fall Meeting, San Francisco, December 2014
I was honoured to be the third recipient of the AGU Leptoukh Lecture awarded for significant contributions to informatics, computational, or data sciences.
Trends in Computing for Climate Research
The grand challenges of climate science will stress our informatics infrastructure severely in the next decade. Our drive for ever greater simulation resolution/complexity/length/repetition, coupled with new remote and in-situ sensing platforms, presents us with problems in computation, data handling, and information management, to name but three. These problems are compounded by the background trends: Moore's Law is no longer doing us any favours: computing is getting harder to exploit as we have to bite the parallelism bullet, and Kryder's Law (if it ever existed) isn't going to help us store the data volumes we can see ahead. The variety of data, the rate it arrives, and the complexity of the tools we need and use, all strain our ability to cope. The solutions, as ever, will revolve around more and better software, but "more" and "better" will require some attention.
In this talk we discuss how these issues have played out in the context of CMIP5, and might be expected to play out in CMIP6 and successors. Although the CMIPs will provide the thread, we will digress into modelling per se, regional climate modelling (CORDEX), observations from space (Obs4MIPs and friends), climate services (as they might play out in Europe), and the dependency of progress on how we manage people in our institutions. It will be seen that most of the issues we discuss apply to the wider environmental sciences, if not science in general. They all have implications for the need for both sustained infrastructure and ongoing research into environmental informatics.
Presentation: pdf (30 MB).
Symposium on HPC and Data-Intensive Apps, Trieste, November 2014
Or to give it its full name: Symposium on HPC and Data-Intensive Applications in Earth Sciences: Challenges and Opportunities @ ICTP, Trieste, Italy.
I gave two talks at this meeting, the first in the HPC regular session, on behalf of my colleague Pier Luigi Vidale, on UPSCALE, the second a data keynote on day two.
Weather and Climate modelling at the Petascale: achievements and perspectives. The roadmap to PRIMAVERA
Abstract: Recent results and plans from the Joint Met Office/NERC High Resolution Climate Modelling programme are presented, along with a summary of recent and planned model developments. We show the influence of high resolution on a number of important atmospheric phenomena, highlighting both the roles of multiple groups in the work and the need for further resolution and complexity improvements in multiple models. We introduce plans for a project to do just that. A final point is that this work is highly demanding of both the supercomputing and subsequent analysis environments.
Presentation: (pdf, 37MB!)
Infrastructure for Environmental Supercomputing: beyond the HPC!
Abstract: We begin by motivating the problems facing us in environmental simulations across scales: complex community interactions, and complex infrastructure. Looking forward we see the drive to increased resolution and complexity leading not only to compute issues, but even more severe data storage and handling issues. We worry about the software consequences before moving to the only possible solution, more and better collaboration, with shared infrastructure. To make progress requires moving past consideration of software interfaces alone to consider also the "collaboration" interfaces. We spend considerable time describing the JASMIN HPC data collaboration environment in the UK, before reaching the final conclusion: Getting our models to run on (new) supercomputers is hard. Getting them to run performantly is hard. Analysing, exploiting and archiving the data is (probably) now even harder!
Presentation: (pdf, 22MB)
NERC ICT Current Awareness, Warwick, October 2014
I gave a talk on the JASMIN super data analysis environment to a gathering of the NERC IT community.
Presentation: pdf (18 MB)
NCAS Science Meeting, Bristol, July 2014
I gave a talk on how Moore's Law and friends are influencing atmospheric science, the infrastructure we need, and how we are trying to deliver services to the community.
Presentation: pdf (19 MB!)
e-research NZ 2014, Hamilton, June/July 2014
I gave three talks at this meeting:
Environmental Modelling at both large and small scales: How simulating complexity leads to a range of computing challenges
On Monday, in the HPC workshop, despite using the same title I had for the Auckland seminar, I primarily talked about the importance of software supporting collaboration, using coupling as the exemplar (reprising some of the material I presented in Boulder in early 2013):
The road to exascale for climate science: crossing borders or crossing disciplines, can one do both at the same time?
On Tuesday I gave the keynote address:
Abstract: The grand challenges of climate science have significant infrastructural implications, which lead to requirements for integrated e-infrastructure - integrated at national and international scales, but serving users from a variety of disciplines. We begin by introducing the challenges, then discuss the implications for computing, data, networks, software, and people, beginning from existing activities, and looking out as far as we can see (spoiler alert: not far!)
JASMIN: the Joint Analysis System for big data
On Wednesday I gave a short talk on JASMIN:
Abstract: JASMIN is designed to deliver a shared data infrastructure for the UK environmental science community. We describe the hybrid batch/cloud environment and some of the compromises we have made to provide a curated archive inside and alongside various levels of managed and unmanaged cloud ... touching on the difference between backup and archive at scale. Some examples of JASMIN usage are provided, and the speed up on workflows we have achieved. JASMIN has just recently been upgraded, having originally been designed for atmospheric and earth observation science, but now being required to support a wider community. We discuss what was upgraded, and why.
Seminar, University of Auckland, NZ, June 2014
Abstract: Global earth system models simulate a range of processes from atmospheres and oceans, to clouds and carbon cycling, but while such models are (arguably) well suited for projecting long-term global futures, they aren't yet much use for making long-term predictions at regional and local scales. Understanding our back yard depends on higher resolution models and more locally integrated models, but we can't yet have all three in one, so we have to make progress in all three directions. However, to make progress with all three we have interesting computing and scientific challenges, some of which are discussed in this talk.
IS-ENES2 Workshop on ESM Evaluation and Infrastructure, KNMI De Bilt Netherlands, May 2014
Title: esdoc - why, where we've been and where we are going
ECMWF Scalability Workshop, Reading, April 2014
Title: Data handling on the path to exascale
AGU Data Stewardship in Theory and Practice, San Francisco, December 2013
Title: Managing Data and Facilitating Science: A spectrum of activities in the Centre for Environmental Data Archival. (IN51D-01, Invited) (pdf)
The UK Centre for Environmental Data Archival (CEDA) hosts a number of formal data centres, including the British Atmospheric Data Centre (BADC), and is a partner in a range of national and international data federations, including the InfraStructure for the European Network for Earth system Simulation, the Earth System Grid Federation, and the distributed IPCC Data Distribution Centres. The mission of CEDA is to formally curate data from, and facilitate the doing of, environmental science.
The twin aims are symbiotic: data curation helps facilitate science, and facilitating science helps with data curation. Here we cover how CEDA delivers this strategy by established internal processes supplemented by short-term projects, supported by staff with a range of roles. We show how CEDA adds value to data in the curated archive, and how it supports science, and show examples of the aforementioned symbiosis.
We begin by discussing curation: CEDA has the formal responsibility for curating the data products of atmospheric science and earth observation research funded by the UK Natural Environment Research Council (NERC). However, curation is not just about the provider community; the consumer communities matter too, and the consumers of these data cross the boundaries of science, including engineers and medics, as well as the gamut of the environmental sciences. There is a small, and growing, cohort of non-science users. For both producers and consumers of data, information about data is crucial, and a range of CEDA staff have long worked on tools and techniques for creating, managing, and delivering metadata (as well as data). CEDA "science support" staff work with scientists to help them prepare and document data for curation.
As one of a spectrum of activities, CEDA has worked on data Publication as a method of both adding value to some data, and rewarding the effort put into the production of quality datasets. As such, we see this activity as both a curation and a facilitation activity.
A range of more focused facilitation activities are carried out, from providing a computing platform suitable for big-data analytics (the Joint Analysis System, JASMIN), to working on distributed data analysis (EXARCH), and the acquisition of third party data to support science and impact (e.g. in the context of the facility for Climate and Environmental Monitoring from Space, CEMS).
We conclude by confronting the view of Parsons and Fox (2013) that metaphors such as Data Publication, Big Iron, Science Support etc are limiting, and suggest the CEDA experience is that these sorts of activities can and do co-exist, much as they conclude they should. However, we also believe that within co-existing metaphors, production systems need to be limited in their scope, even if they are on a road to a more joined up infrastructure. We shouldn't confuse what we can do now with what we might want to do in the future.
AGU Big Data in the Earth and Space Sciences, San Francisco, December 2013
Title: From petascale to exascale, the future of simulated climate data. (U41A-04, Invited) (pdf)
Coleridge ought to have said:
data, data, everywhere, and all the data centres groan, data data everywhere, nor any I should clone.
Except of course, he didn't say it, and we do clone data!
While we've been dealing with terabytes of simulated datasets, downloading ("cloning") and analysing, has been a plausible way forward. In doing so, we have set up systems that support four broad classes of activities: personal and institutional data analysis, federated data systems, and data portals. We use metadata to manage the migration of data between these (and their communities) and we have built software systems. However, our metadata and software solutions are fragile, often based on soft money, and loose governance arrangements. We often download data with minimal provenance, and often many of us download the same data. In the not too distant future we can imagine exabytes of data being produced, and all these problems will get worse. Arguably we have no plausible methods of effectively exploiting such data - particularly if the analysis requires intercomparison. Yet of course, we know full well that intercomparison is at the heart of climate science.
In this talk, we review the current status of simulation data management, with special emphasis on accessibility and usability. We talk about file formats, bundles of files, real and virtual, and simulation metadata. We introduce the InfraStructure for the European Network for Earth Simulation (IS-ENES) and its relationship with the Earth System Grid Federation (ESGF) as well as JASMIN, the UK Joint Analysis System. There will be a small digression on parallel data analysis - locally and distributed. We then progress to the near term problems (and solutions) for climate data before scoping out the problems of the future, both for data handling, and the models that produce the data. The way we think about data, computing, models, even ensemble design, may need to change.
IS-ENES2 Kickoff Meeting, Paris, France, May 2013
Title: The Future of ESGF in the context of ENES and IS-ENES2. (pdf)
(I probably tried to do too much in this talk. There were three subtexts:
We as a community have too much data to handle, and I mentioned the apocryphal estimate that only 2/3 of data written is read ... but I confused folks ... that figure applies to institutional data, not data in ESGF ...
That the migration of data and information between domains (see the talk) requires a lot of effort, and that (nearly) no one recognises or funds that effort (kudos to KNMI :-),
That portals are easy to build, but hard to build right, and maybe we need fewer, or maybe we need more, but either way, they need to both meet requirements in technical functionality, and information (as opposed to data) content. )
Coupling Workshop 2013 (CW2013), Boulder, Colorado 2013
Title: Bridging Communities: Technical Concerns for Building Integrated Environmental Models
2nd IS-ENES Workshop on "High-performance computing for Climate Models", Toulouse, France, January 2013
Title: Data, the elephant in the room. JASMIN one step along the road to dealing with the elephant.
AGU San Francisco, December 2012
Title: Issues to address before we can have an open climate modelling ecosystem
Authors: Lawrence, Balaji, DeLuca, Guilyardi, Taylor
Abstract: Earth system and climate models are complex assemblages of code which are an optimisation of what is known about the real world, and what we can afford to simulate of that knowledge. Modellers are generally experts in one part of the earth system, or in modelling itself, but very few are experts across the piste. As a consequence, developing and using models (and their output) requires expert teams which in most cases are the holders of the "institutional wisdom" about their model, what it does well, and what it doesn't. Many of us have an aspiration for an open modelling ecosystem, not only to provide transparency and provenance for results, but also to expedite the modelling itself. However an open modelling ecosystem will depend on opening access to code, to inputs, to outputs, and most of all, on opening the access to that institutional wisdom (in such a way that the holders of such wisdom are protected from providing undue support for third parties). Here we present some of the lessons learned from how the metafor and curator projects (continuing forward as the es-doc consortium) have attempted to encode such wisdom as documentation. We will concentrate on both technical and social issues that we have uncovered, including a discussion of the place of peer review and citation in this ecosystem.
(This is a modified version of the abstract submitted to AGU, to more fairly reflect the content given the necessity to cut material to fit into the 15 minute slot available.)
3rd ESA CCI colocation meeting, Frascati, September 2012
Exploiting high volume simulations and observations of the climate - pdf. (An introduction to ENES and ESGF with some scientific motivation.)
HPC and Big Data, ICT Competitiveness Week, Brussels, September 2012
Weather and Climate Computing Futures in the context of European Competitiveness - pdf
e-Infrastructures for Climate Science, Trieste, May 2011
Talk Title: Information Infrastructure to Support Climate Science (Contribution of Metafor)
This talk began by motivating some of the reasons why we need metadata about both the data produced by climate simulations and the models which produce that data. I then introduced the Metafor project, and its deployment in support of CMIP5. Some thoughts on the future of Metafor preceded a quick description of the IPCC data distribution centre, and its relationship with ESGF and PCMDI.
Kick-off Plenary, Data To Knowledge, Alderley, March 2011
This presentation was given as the kick-off plenary at a meeting of the Royal Society of Chemistry's Molecular Spectroscopy Group - who wanted a speaker from outside their discipline. I guess I fitted that bill :-)
(I must say AstraZeneca's Alderley Park is an impressive complex.)
Talk Title: Managing complex datasets and accompanying information for reuse and repurpose
Abstract (Modified to describe what I said, not what I thought I'd say): For centuries the main method of scientific communication has been the academic paper, itself developed as a reaction to the non-scalability of the "personal communication" (then known as a "letter"). In the 21st century, we now find that the academic paper is not always sufficient to communicate all that needs to be known about some scientific event, whether it's the development of a theory, an observation, or a simulation, or some combination thereof. As a consequence, nearly all scientific communities are producing methods of defining and documenting datasets of importance, and building systems to augment (annotate) their data resources or to amalgamate, reprocess, and reuse their data - often in ways unforeseen by the originator of the data. Such systems range from heavily bespoke, tightly architected systems, such as that developed in the climate community to support global climate model inter-comparison; via systems of intermediate complexity developed, for example, using the "linked data" principles; to loose assemblies of web pages, using vanilla web technologies. Concepts of publication are becoming blurred, with publication meaning anything from "I put it on twitter" to "I published in Nature".
In this talk, we'll present a taxonomy of the information that (nearly) all these systems try to address and discuss the nature of publication in the 21st Century. We'll describe how information is built up during the life-cycle of datasets, and the importance of data provenance in the production of knowledge. We'll present our concept of the "value proposition" for maintaining digital data. The material will be mainly illustrated with examples from the environmental sciences, but we believe the concepts discussed, and the conclusions drawn, are generic.
GEOSS Support for IPCC assessment, Geneva, February 2011
This presentation was given as part of the "Managing Data to Support the Assessment Process" session, where I was requested to provide a talk on quality control and documentation. pdf
This talk, like most I give, represents the work of a lot of people, and in particular the work of the WDCC group at DKRZ (Michael Lautenschlager et al), and the folks in the U.S. Earth System Grid team (a DoE funded activity led by Dean Williams, Don Middleton and others). CMIP5 is of course a WCRP activity, in practice led by Karl Taylor.
Other meeting talks are here.
Short Presentation, NERC SISB, Birmingham, January 2011
Cut down material from the seminar (below), concentrating on data exascale futures: pdf
Seminar, Reading University, January 2011
e-Research in support of climate science
The fifth coupled model intercomparison project (CMIP5) will get underway for real during 2011, after a phony war of some years duration as modelling and archiving centres have been gearing up for the challenge. CMIP5 will involve the production of millions of datasets amounting to petabytes of data, available in a globally distributed database. The importance of that, and other data, in shaping our future decisions about reacting to climate change is obvious. Less obvious, but just as important, will be the role of the provenance of the data: who produced it/them, how, using what technique? Will the difficulty of the interpretation be in any way consistent with the skills of the interpreter? Are there any quality metrics? In this talk I'll introduce some key parts of the metadata spectrum underlying our efforts to document climate data, for use now and into the future, concentrating on the information modelling and metadata pipeline being constructed to support CMIP5. In doing so we'll touch on the metadata activities of the European Metafor project, the software developments being sponsored by the US Earth System Grid and European IS-ENES projects, and how all these activities are being integrated into a global federated e-infrastructure. I'll conclude with the frightening prospects for the future in terms of the production and management of the next generation of climate simulations. (pdf)
An Aussie Triumvirate
Keynote: Information Network Workshop, Canberra, November, 2010: British experience with building standards based networks for climate and environmental research
A talk covering organisational and technological drivers to network interworking, with some experience from the UK and European context, and some comments for the future. pdf
All the talks from the meeting are available on the osdm website.
Metadata Workshop, Gold Coast, November 2010: Rethinking metadata to realise the full potential of linked scientific data
This talk begins with an introduction to our metafor taxonomy, and why metadata, and metadata tooling, are important. There is an extensive discussion of the importance of model driven architectures, and plans for extending our existing formalism to support both RDF and XML serialisations. We consider how the observations and measurements paradigm needs extension to support climate science, and discuss quality control annotation.
All the talks from this meeting are available on an ANDS website.
Keynote: Australasian e-Research 2010, November 2010: Provenance, metadata and e-infrastructure to support climate science
The importance of data in shaping our day to day decisions is understood by the person on the street. Less obviously, metadata is important to our decision making: how up to date is my account balance? How does the cost of my broadband supply compare with the offer I just read in the newspaper? We just don't think of those things as metadata (one person's data is another person's metadata). Similarly, the importance of data in shaping our future decisions about reacting to climate change is obvious. Less obvious, but just as important, is the provenance of the data: who produced it/them, how, using what technique, is the difficulty of the interpretation in any way consistent with the skills of the interpreter? In this talk I'll introduce some key parts of the metadata spectrum underlying our efforts to document climate data, for use now and into the future. In particular, we'll discuss the information modelling and metadata pipeline being constructed to support the currently active global climate model inter-comparison project known as CMIP5. In doing so we'll touch on the metadata activities of the European Metafor project, the software developments being sponsored by the US Earth System Grid and European IS-ENES projects, and how all these activities are being integrated into a global federated e-infrastructure. (pdf, all conference talks)
Cyberinfrastructure Data Challenges, Redmond WA, September 2010
I gave a very short presentation at this NSF sponsored workshop. See presentation and commentary.
Data Standards for the ESA Climate Change Initiative, Frascati, September 2010
I gave this talk (pdf) as an invited expert on what they should be considering doing.
Introduction to the DDC, Boulder, July 2010
IS-ENES Strategy Scoping Meeting, near Paris, March, 2010
This meeting was targeted as being the first step in a foresight process for establishing a European earth system modelling strategy. This is the talk on software and data infrastructure prepared for the meeting (authored with Eric Guilyardi and Sophie Valcke): pdf
Big Data Meeting, London, February, 2010
(A workshop to hear from the users of large data to see what lessons can be learnt from the different disciplines; both academic and users of informatics in industry, and what technologies are envisaged for the future. See the agenda.)
The Earth System Simulation Challenge: from climate models to satellite observations and back again (Presentation pdf).
Data-Intensive Research meeting, Leeds, January, 2010
Turning Petabytes of Globally Distributed Climate Model Data into Policy: In late 2013 the Intergovernmental Panel on Climate Change (IPCC) will prepare a new assessment of what is known about the physical processes of anthropogenic climate change along with adaptation strategies and impacts. That new assessment will exploit (amongst other things) petabytes of simulation data produced under the auspices of the "Fifth Coupled Model Intercomparison Project, CMIP5". In this presentation, we will briefly discuss the IPCC process, describe what a coupled model actually is, and then introduce the plans for assembling the key simulation outputs into globally replicated petascale databases, and providing appropriate metadata and analysis tools. The main thrust of the presentation will be on the data federation issues, and the major international projects which exist to make this entire activity possible. (Presentation: odp)
NERC Executive Board, Henley, December, 2009
NCAS Annual Staff Meeting, Oxford, November 2009
Scientific opportunities for NCAS in the biggest show on earth: Coming soon, with more than 80 experiments and 90,000 years of simulation. (odp)
GO-ESSP 2009, Hamburg, October 2009
My talks appear on the site, including:
NetCDF climate forecast conventions status, and
Model Metadata for CMIP5
Metafor Year One Review, Brussels, May 2009
I presented the status and progress of the metafor workpackages on service and tools:
EGU General Assembly, Vienna, April 2009
Metrics for Success in the Preservation of Scientific Data at the Centre for Environmental Data Archival (CEDA)
Metadata Objects for Linking the Environmental Sciences (MOLES)
Models, Metadata and Metafor
The European contribution to a global solution for access to climate model simulations
Geophysical Fluid Dynamics Laboratory, March 2009
The British Atmospheric Data Centre (BADC) is the designated data centre for atmospheric science for the UK Natural Environment Research Council - which means it has two major roles: 1) facilitating the doing of atmospheric science, and 2) acquiring and curating all appropriate products of NERC funded atmospheric science research. In these roles it finds itself in the middle of the scientific communication ecosystem, actively helping the delivery of science - as in, for example, taking part in the next Coupled Model Intercomparison Project (CMIP5) as one of the core archive providers - and in the vanguard of initiatives to establish data journals.
In this presentation, we'll cover a brief introduction to the BADC, and its relationship to the planned CMIP5 model archive, and some of the issues it raises. Handling the storage should be relatively trivial (in entirety a piffling few PB globally distributed). Providing useful services and information will be less trivial: we can store PB, but can users use them? The next IPCC process (the so called "AR5") is going to require much better communication between components of the community - are we ready for that? Finally, can we, should we consider "publishing" the data or is it enough to "make it/them available for now"? (Answering the last question will require a definition of publishing.) In all: a tour through some technical problems associated with the production and use of data from CMIP5 in AR5, finishing up with some social issues at the heart of modern science!
4th International Digital Curation Conference, Edinburgh, December 2008
I gave a talk on costing metadata, covering the key aspects of the CEDA cost model. (Presentation)
Workshop on the use of GIS/OGC standards in meteorology, ECMWF, November 2008
Deploying secure OGC services in front of a heterogeneous data archive.
Bryan Lawrence, Dominic Lowe, Phil Kershaw and Stephen Pascoe.
As part of the NERC DataGrid project, the British Atmospheric Data Centre has been developing a software stack which includes bespoke catalogue and information services, and both a Web Map Service and a Web Coverage Service. The server-side software stack is currently being extended to include a Web Feature Service and a more standards compliant interface to the catalogue services. The services which access data are being modified to sit behind a security filter which respects authentication and authorisation on a per-URI basis. In this presentation we will present key features of the services, and their relationship to the underlying archive - mediated via the Climate Sciences Modelling Language (CSML, an application schema of GML). We will dwell on some issues of scalability and metadata ingestion and maintenance.
Metafor Workshop, September 2008
Seminar, Oxford University, 2008
This presentation covered much the same ground as the two EGU talks and my Royal Society presentation.
EGU General Assembly, Vienna, April 2008
I should have been involved in two talks, one solicited and one as a co-author with Dominic Lowe:
Lowe, D.; Woolf, A.; Lawrence, B.N.; Pascoe, S. Integrating the Climate Science Modelling Language with geospatial software and services (ppt), and
Lawrence, B.N.; Woolf, A.; Lowe, D.; Pascoe, S. Beyond simple features: Do complex feature types need complex service descriptions? (ppt)
(but note the latter didn't exactly cover the material I had planned to give, for which we were obviously sorry, but my inability to travel was out of my control).
Royal Society e-Science Discussion Meeting, April 2008
I gave one talk (ppt, doesn't include the three movies!) at this meeting which was based on these two papers:
Information in environmental data grids
Providing homogeneous access ("services") to heterogeneous environmental data distributed across heterogeneous computing systems on a wide area network requires a robust information paradigm which can mediate between differing storage and information formats. While there are a number of ISO standards which provide some guidance on how to do this, the information landscape within domains is not well described. In this paper we present an information taxonomy and two information components which have been built for a specific application. These two components, one to aid data understanding, and one to aid data manipulation, are both deployed in the UK NERC DataGrid as described elsewhere. (pdf)
The NERC DataGrid Services
This short paper outlines the key components of the NERC DataGrid (NDG): a discovery service, a vocabulary service, and a software stack deployed both centrally to provide a data discovery portal, and at data providers, to provide local portals and data and metadata services. (pdf)
Seminar, Aston University, 2007
The NERC DataGrid: Interoperability, web services and the role of XML
The talk will start with an introduction to the NERC data centres and a description of the issues we face with interoperability - both between ourselves, and between ourselves and the wider community. The problems resolve to issues of both data and information interoperability. The main body of the seminar will concentrate on our solutions (such as they are), and our experiences with XML-based tooling, web services and standards. There will be a fearsome number of acronyms, most of which will be explained, along with an inappropriately high level of respect for standards coupled with a hearty disrespect for their implementations. If the audience can stand it, we can even delve into the arena of distributed access control.
e-Science AHM, Nottingham, 2007
Practical Access Control using NDG-security
Access control in the NERC DataGrid (NDG) is accomplished using a combination of WS-Security to ensure message level integrity, X509 proxy certificates to assert identity, and bespoke XML tokens to handle authorization. Access control decisions are handled by Gatekeepers and mediated by Attribute Authorities. The design of the NDG-security reflects the reality of building a deployable access control system which respects pre-existing user databases of thousands of individuals who could not be asked to re-register using a new system, and pre-existing services that need to be modified to take advantage of the new security tooling. NDG-security has been built in such a way that it should be able to evolve towards the use of community standards (such as SAML and Shibboleth) as they become more prevalent and best practice becomes clearer. This paper describes NDG-security in some detail, and provides details of experiences deploying NDG-security both in the e-Science funded NDG and the DTI funded Delivering Environmental Web Services (DEWS) projects. Issues to do with securing large data transfers are discussed. Plans for the future of NDG-security are outlined, both in terms of application modification and the evolution of NDG-security itself.
(Full paper: pdf)
GO-ESSP June 2007
Technical and social requirements for a putative AR5 distributed database of simulation and other data
It is highly unlikely that future large multi-model intercomparison projects involving multiple initial-condition and/or parameter ensembles from multiple institutions will be solved by centralised database solutions. Such centralised databases would need to have very high bandwidth to all possible data consumers, and require significant resources which would not be easy to obtain within existing national budgets. Fortunately solutions to the problem of intercomparison which involve distributed data holdings with common metadata structures and interfaces are possible. There are already a number of possible components of such a solution deployed in a variety of institutions. However, there are a number of issues that would need to be addressed before such solutions could be joined together to provide seamless access to data on an international scale. The issues range from agreeing on which technical solutions (the plural is important) should be used, to establishing trust relationships which could be supported not only by the scientists in the institutes involved, but by their network and computer security administrators. Given that technical developments will most likely continue to be driven by individual funding programs, not by an overall project with internationally generated and agreed requirements, it will be important to understand that success will most likely depend on agreeing common interfaces and information models, not on deploying the same technology throughout.
NERC e-science Meeting, Cambridge, May 2007
I'm giving a talk entitled: Components of a DataGrid: from bits to secure features.
Claddier Workshop, May 2007
SEEGRID III, Canberra, November 2006
Presentation (6 MB ppt)
TECO-WIS, Seoul, November 2006
The NERC Metadata Gateway
The Natural Environment Research Council's NERC DataGrid (NDG) brings together a range of data archives in institutions responsible for research in atmospheric and oceanographic science. This activity, part of the UK national e-science programme, has both delivered an operational data discovery service, and built and deployed a number of new metadata components based on the ISO TC211 standards aimed at allowing the construction of standards-compliant data services. In addition, because each of the partners has existing large user databases which cannot be shared because of privacy law, the NDG has developed a completely decentralized access control structure based on simple web services and standard security tools.
One of the applications has been the redeployment of an earlier data discovery portal, the NERC Metadata Gateway, using the Open Archives Initiative Protocol for Metadata Harvesting (OAI/PMH). In this presentation we concentrate on the practicalities of building that discovery portal, report on early experiences involved with interacting with, and harvesting, ISO19139 documents, and discuss issues associated with deploying services from within that portal.
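For readers unfamiliar with OAI-PMH harvesting, the following is a minimal Python sketch of the two steps a harvester repeats: building a `ListRecords` request and reading the `resumptionToken` from the response to fetch the next page. It is not the NERC Metadata Gateway code. The base URL and the `iso19139` metadata prefix are assumptions made for illustration (a real repository advertises its prefixes via `ListMetadataFormats`), and no network access is performed here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespace of OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix, set_spec=None, resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH repository."""
    if resumption_token is not None:
        # resumptionToken is an exclusive argument: it replaces all the others.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken in a response, or None when the list is complete."""
    root = ET.fromstring(response_xml)
    token = root.find(f".//{OAI_NS}resumptionToken")
    if token is None or not (token.text or "").strip():
        return None
    return token.text.strip()
```

A harvester would fetch the first URL, pass each response to `next_token`, and keep requesting with the returned token until it gets `None`.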
Presentation (4 MB ppt)
UKOLN, Bath, July 2006
Presentation (9 MB ppt)
GO-ESSP, Livermore, June 2006
Presentation (3 MB ppt)
RDF, Ontologies and Metadata Workshop, Edinburgh, June 2006
Presentation (6 MB ppt)
NERC e-science AHM, April, 2006
NERC DataGrid Status
In this presentation, I present some of the motivation for the NERC DataGrid development (the key points being that we want semantic access to distributed data with no centralised user management), link it to the ISO TC211 standards work, and take the listener through a tour of some of the NDG products as they are now. There is a slightly more detailed look at the Climate Science Modelling Language, and I conclude with an overview of the NDG roadmap.
Presentation (5 MB ppt)
Oxford, March, 2006
Communicating scientific thoughts and data: from weblogs to the NERC DataGrid via data citation and institutional repositories
For centuries the primary method of scientific communication has revolved around papers submitted either as oral contributions to a meeting or as written contributions to journals. Peer reviewed papers have been the gold standard of scientific quality, and the preservation of scientific knowledge has relied on preserving paper. The establishment (whatever that is) has assessed productivity and impact by counting numbers of papers produced and cited. Yet clearly much of the bedrock of scientific productivity depends on
Faster and more effective communication methods (witness the growth of preprint repositories and weblogs) and
Vast amounts of data, of which more and more is digital.
Digital data leads to opportunities to introduce data publication (and citation) and to exploit meta-analyses of data aggregations. Electronic journals and multimedia lead to a blurring of the distinction between "literary" outputs (i.e. papers) and "digital" outputs (i.e. data). Web publication alone leads to a blurring of the understanding of what publication itself means. Digital information is potentially far harder to preserve for posterity than paper.
This seminar will review some of these issues, and introduce two major projects led by the NCAS British Atmospheric Data Centre, and embedded in earth system science, aimed at developing technologies (and cultures) that will change how we view and do publication and data exploitation.