... personal wiki, blog and notes
Bryan's Blog 2004
I have just had my attention drawn to hacknot, where I found an excellent article on "Basic Critical Thinking for Software Developers". In the article "Mr Ed" makes clear the importance of making verifiable statements if you want to contribute any information to a discussion about software engineering practices. I think, however, that the article has much wider applicability, and applies to just about any discourse. If one wants to minimise wasted time, being very clear about exactly what you are talking about is absolutely crucial. James Tauber made much the same point in his blog on Successful Technical Discussions.
In real estate it's "Location, Location, Location", in technical discussions, it should be "Definition, Definition, Definition" ...
Meteorology versus Money
It's a known problem that "big-money" doesn't always care about the environment. We all understand about physical pollution (oil in the sea, PM10 in the atmosphere etc), but this one is a little different: The BBC has reported the problems that occur with "frequency pollution".
Essentially the issue is that the spectrum is finite, and commercial applications are getting priority over the use of the spectrum for remote sensing of the environment. For example, the 23.6-24 GHz band, which has the unique property of being sensitive to water vapour but not to liquid water, is now under threat from "car radars"! The article also describes other bands already lost ...
Service Orientated Architecture
The NERC DataGrid concept is built around scalability and a service orientated architecture. We've gone very much for the minimum central control of anything, and very much a bunch of services that are not all integral to getting things done. As you might expect, the more of the NDG services one deploys, the less you have to do yourself, but we hope that we are imposing the minimum on the user of the system (and data). In fact, I think at the minimum, we are imposing only:
the use of a browser to find data (if they don't already know where it is).
the user to login somewhere which is "ndg-enabled". In practice this requires the user to have their own digital certificate and/or their ndg "login-supplier" to generate a proxy on their behalf.
From then on, the user will be able to deploy as much or as little of the infrastructure we are going to build as they want.
I'm prompted to write these things because I found this via Savas. In it, Steve Maine discusses four tenets of service orientated architecture design (from Benjamin Mitchell, I think). My summary is:
Build in explicit boundaries between services (with, by definition, explicit interfaces)
Autonomous Services avoid hidden implicit assumptions about interactions.
Policy based Negotiation between services and clients
Share schema and contract, not type (avoid the baggage of explicit type, allow the things on either end to implement them how they like).
I think NDG has been designed in a way that is conformant with these ideas, which are not dependent on any sort of specific web service technology. That is a good thing, because web service technologies are shifting sands from an implementation point of view.
Having been conformant with these ideas, we do put minimal constraints on the users of the system, and that should make it more useful as a data provider toolkit, which was our aim.
Elements and Attributes
Reminder: <element has-attribute="something"> stuff </element>
I say: Use attributes unless you truly need elements. You need elements for a thing if the thing can be repeated, or is itself structured, or has semantics based on its order among its peers.
This was possibly more of a summary than was warranted, but the essential point was fair. I think the bit that got lost in the summary was that in designing schema, by putting something in as an attribute you are constraining things in a way that is obvious (each attribute-type can have at most one value). You can use a schema to constrain something to have only one child of a specific element (I presume), but that's not obvious! I think being obvious is a good thing, after all, many (most?) schema will be used by people who haven't thought about the design decisions, and are reacting to them.
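The "at most one value" point can be seen directly in a parser. The following is a minimal sketch using Python's standard library; the element and attribute names (dataset, keyword) are made up for illustration. A repeated child element parses happily, whereas a repeated attribute is not even well-formed XML, so it is rejected before any schema gets a look in.

```python
import xml.etree.ElementTree as ET

# Repeated child elements are perfectly legal XML.
# ("dataset" and "keyword" are hypothetical names used only for illustration.)
doc = ET.fromstring(
    "<dataset><keyword>ozone</keyword><keyword>mesosphere</keyword></dataset>"
)
keywords = [k.text for k in doc.findall("keyword")]

# A repeated attribute breaks well-formedness, so the parser refuses it
# outright; no schema is needed to enforce "at most one value".
try:
    ET.fromstring('<dataset keyword="ozone" keyword="mesosphere"/>')
    duplicate_attribute_accepted = True
except ET.ParseError:
    duplicate_attribute_accepted = False
```

So the attribute form carries its cardinality constraint in the syntax itself, which is the "obvious" part.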
Nice review of RSS and scientific publishing
NDG is currently being developed around eXist, but we've had problems with scalability and reliability. It's not yet obvious why ... but as a consequence we've had to have some backup plans for what would happen if eXist failed to deliver for us. One of our backup plans was to keep an eye on the Berkeley DB XML project. We didn't use it initially because we wanted to exploit xquery. Well, now, with version 2.07 announced (see previous link), it supports xquery, so no doubt we'll be testing it in the coming months. It's very pleasing that there are already suse rpms.
I have spent most of today writing a response to the various reviewers about our extension to the NDG activities.
One of the reviewers asked: "Why python? Outside the computer science community, how well known is the language?". These are actually common questions for me, as python isn't really well established in the environmental sciences.
My answers are simple:
Python is ridiculously easy to learn,
the code is easily maintainable
the code can be written very quickly ...
we can self-train staff (rather than use expensive external courses, and/or hire expensive programmers without environmental science backgrounds)
has an enormous range of additional libraries,
is easy to extend using Pyfort/swig etc, to use our favourite high level codes.
has great cross-platform support
there are already excellent toolkits out there like
Numarray and Numeric provide excellent support for doing real calculations
Not to mention the "good programming aspects" that plenty of others can comment on better than me.
Is it used outside of Computer Science?
Astronomy (e.g. the space telescope science institute and PYRAF)
High Energy Physics (e.g. SLAC)
Quantum Chemistry (e.g. pyquante)
Computational Biochemistry (e.g. biopython)
For all of these reasons, within NCAS, we at BADC have decided to try and support the wider community with free python tools, so that they can make the best use of data products from our scientific research.
Not sure one should confess to reading slashdot, but they pointed me to this in the Houston Chronicle, basically, the bottom line is that a company ("Green Mountain Energy") is providing Texan electricity consumers with electric power from a mix of hydroelectric and wind generation at prices competitive with fossil fuel suppliers!! In Texas of all places. There is hope yet.
More on Trackbacks
Yesterday I spent some time implementing the trackback spec in python. I know there are other implementations in python, but I wanted to really understand this stuff. I'll put my implementation up here soon (it's intended for use in Leonardo and other places), but for now I want to record some of the things that it got me thinking about. The key things I discovered were some inconsistencies in the spec:
why does the trackback have (uri,excerpt) but the rss response have (link,description)?
how extensible is it? I want to put some semantics in my trackback ping ... to tell the target what sort of trackback is coming in (it might not be a blog, it might be an analysis programme, or a formal citation). Thinking this must be a bit of old hat for the semantic web folks, I went off on a bit of a google on trackback and semantics.
At a recent tech discussion Mark Nottingham pointed out that the real difference between RSS and RDF (the cornerstone of the semantic web initiative) was that RSS was about lists. On the one hand this is true, however, the term list understates a crucial point about weblogging. Weblogging is designed to deal with nuggets of information that an author creates instead of a page that a publisher publishes. A permalink refers to a unique item, and in terms of the semantic web, indicates a component from which meaning can be extracted.
Ok, this is fine, but I'm not so interested in web logging per se, I'm thinking about links between nuggets that carry information about what sort of link they are (sounds like a candidate for rdf to me already) ...
(As an aside, I found a two year old post about pingback which compared it with trackback. It seems there aren't many active implementations of pingback, but if the concept of trackback without semantics is what you want it would seem simpler to me. Did it catch on? Doesn't seem so!)
I found useful discussions of trackback in 2003 here and here. But better, I found this describing exactly what I mean, and a comment stated "How in particular does RDF not work for you as a linking technology"? Which is what I started thinking as I read the trackback spec anyway - it uses RDF for the autodiscovery, why not in the ping itself?
That's a very good question, and is a good place to stop for today. My simple answer is, where is the (extendable) controlled vocabulary which defines the types of triples that would be allowed? I can easily imagine "remotelink cites permalink", "remotelink incorporates permalink" and "remotelink headlines permalink" for textual things, but for data, I might want the verb to be something which is more akin to a pointer to workflow ... hmmm ... but these would be meaningless unless a finite group of folk understood and implemented the words cites, incorporates, headlines.
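To make the controlled-vocabulary idea concrete, here is a minimal sketch in python. Nothing here comes from the trackback spec or from RDF tooling: the class, the function and the example URLs are all hypothetical, and the vocabulary is just the three words floated above.

```python
from typing import NamedTuple

# Hypothetical controlled vocabulary: only the three verbs suggested in the text.
LINK_TYPES = frozenset({"cites", "incorporates", "headlines"})


class Link(NamedTuple):
    remotelink: str  # the thing doing the linking
    predicate: str   # what sort of link it is
    permalink: str   # the thing being linked to


def make_link(remotelink, predicate, permalink):
    """Build a triple, rejecting predicates outside the agreed vocabulary."""
    if predicate not in LINK_TYPES:
        raise ValueError(f"unknown link type: {predicate!r}")
    return Link(remotelink, predicate, permalink)


# Example URLs are placeholders, not real resources.
link = make_link(
    "http://example.org/analysis/42",
    "cites",
    "http://example.org/blog/2004/entry",
)
```

The interesting part is not the code but the frozenset: the triples only mean something if both ends agree on what is in it, and on how it gets extended.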
Clearly, at this point there are some more things I need to read up on, including, but not limited to:
Naomi Oreskes reports an analysis of all 928 papers indexed by ISI with keywords climate change between 1993 and 2003 (indexed by ISI means they had to be peer reviewed!).
The 928 papers were divided into six categories: explicit endorsement of the consensus position, evaluation of impacts, mitigation proposals, methods, paleoclimate analysis, and rejection of the consensus position. Of all the papers, 75% fell into the first three categories, either explicitly or implicitly accepting the consensus view; 25% dealt with methods or paleoclimate, taking no position on current anthropogenic climate change. Remarkably, none of the papers disagreed with the consensus position.
The final statement was pretty unambiguous:
Many details about climate interactions are not well understood, and there are ample grounds for continued research to provide a better basis for understanding climate dynamics. The question of what to do about climate change is also still open. But there is a scientific consensus on the reality of anthropogenic climate change. Climate scientists have repeatedly tried to make this clear. It is time for the rest of us to listen.
As an aside, I've had to link to this paper via the doi redirect (dx.doi.org), rather than via the doi:blah syntax because browser support for doi is still limited ...
I recently had to make a statement on the impact of publishing in a particular journal. I did it, but didn't like doing it. Most scientists probably have a subliminal hierarchy of journals, something like (for me):
Impressive (Tellus, J.Atmos.Sci., Q. J.Roy.Met.Soc, J. Climate. Climate Dynamics)
Very Good (J. Geophys. Res., Geophys. Res. Letters., Annales Geophysicae)
Good (J. Atmos. Terr. Phys.)
Worth Doing (Weather, Bull. Am. Met. Soc)
Not worth it (censored :-) ).
The order has no real meaning to anyone but me, and reflects my interests at the moment (when I concentrate on mesospheric work, I would rate JASTP higher and climate journals lower, for example). But these sorts of lists seem to have rather a lot of resonance with bean counters; one such list is the in cites journal impact factors (my local copy). These impacts are critically important to the evaluation of University research (e.g. the UK RAE), but I wonder how sensible they really are. Of course it's easy to produce such a list, but given my own priorities change, surely so too do the community's - are we really a normal distribution? What about people working in areas where there are simply fewer people working? Should academic research rankings depend on the herd? I'm sure the RAE normalises these things somehow, but it's still rather discomforting.
And what about Open Access publishing? Those in the scholarly publications game know that the world is moving towards Open Publishing. There is evidence that open access publishing increases impact, but most of us don't quite reconcile that with our subliminal ranking of journals, or with our actions. It is certainly true now that I am far more likely to read a journal article if I can get to it online, and that is starting to sway my decision about where to publish (actually, I don't get time to think about publishing science at the moment, so this is a bit theoretical).
Of course, wearing my data hat, the really big changes we can expect in ranking scientific output will come when we have effective methods of citing datasets! One very interesting project doing work in that area is introduced here. Of course, there is formal citation, and there is google ... and now http://scholar.google.com. Hopefully we can move towards the same "easy" form of data citation too.
Every day I learn something more about blogging, and every day I can see how utterly useful it will be in a real atmospheric science application, not just as a means of communication amongst computer geeks (myself included I suppose).
In a nutshell, TrackBack was designed to provide a method of notification between websites: it is a method of person A saying to person B, "This is something you may be interested in." To do that, person A sends a TrackBack ping to person B.
and goes on to define:
TrackBack ping: a ping in this context means a small message sent from one webserver to another.
The explanation for why is all in the context of blogging:
Person A has written a post on his own weblog that comments on a post in Person B's weblog. This is a form of remote comments--rather than posting the comment directly on Person B's weblog, Person A posts it on his own weblog, then sends a TrackBack ping to notify Person B.
Person A has written a post on a topic that a group of people are interested in. This is a form of content aggregation--by sending a TrackBack ping to a central server, visitors can read all posts about that topic. For example, imagine a site which collects weblog posts about Justin Timberlake. Anyone interested in reading about JT could look at this site to keep updated on what other webloggers were saying about his new album, a photo shoot in a magazine, etc.
Why am I suddenly so interested? Well, the NDG metadata taxonomy includes a class C for Character annotation. We envisaged wanting to be able to comment on a dataset which would allow the data provider to harvest the comments and display them to the data users. This is exactly what trackback is all about.
What I love about Trackback is that it turns links from being one-way creatures into being truly bidirectional ... What this means is that you can only really read conversations on the Web the same way you read blogs: by going backwards in time. You find something interesting, and you chase links backwards from it ... (and only backwards)... Trackback completely cuts this Gordian knot. There are ordinary links in blog entries: these ones take you to the thing being commented on, backwards in time. But if you Trackback-ping when you make these links, you've created Trackback links in the stories you're commenting on. These Trackback links take readers from the commented-on to the comments, forwards in time. You can run down an entire conversational tree, from start to finish, just by reading Trackback links after you read blog entries.
From a data point of view, this means we can publish a dataset, and then users of the data can publish calibration data and experience, and someone finding the data after discovery can go forward from there to user experience ...
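For the record, a ping really is a small message: an HTTP POST of form-encoded fields (url, title, excerpt, blog_name) to the target's trackback URL, answered by a tiny XML document carrying an error code. The following is a minimal sketch using only the python standard library; it builds the request without sending it, the function names are my own, and the dataset and notes URLs are made-up placeholders.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET


def build_ping(trackback_url, url, title="", excerpt="", blog_name=""):
    """Return (but do not send) the HTTP POST request for a TrackBack ping."""
    fields = {"url": url, "title": title, "excerpt": excerpt, "blog_name": blog_name}
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(
        trackback_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded; charset=utf-8"},
        method="POST",
    )


def parse_response(xml_text):
    """Return (ok, message) from a TrackBack response document."""
    root = ET.fromstring(xml_text)
    ok = root.findtext("error", default="1").strip() == "0"
    return ok, root.findtext("message")


# Hypothetical example: a user's calibration notes pinging a dataset's trackback URL.
ping = build_ping(
    "http://example.org/trackback/dataset-123",
    url="http://example.net/notes/calibration",
    title="Calibration notes",
    excerpt="Offsets we found when using this dataset.",
)
```

Sending it would be a matter of passing ping to urllib.request.urlopen and handing the body of the reply to parse_response.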
The U.S. National Weather service provide forecast data (apparently just for the continental US) via a SOAP service. It looks very simple and very easy to use, with a simple syntax. I only found out about it today, even though from the look of the site it has been around a while (the docs date from 2003, and the "trial" was supposed to end on the 1st of August).
Parameters supported are:
Maximum Temperature, Minimum Temperature, 3 hourly Temperature, Dewpoint Temperature, 12 hour Probability of Precipitation, Liquid Precipitation Amounts (Quantitative Precipitation Forecast), Snowfall Amounts, Cloud Cover Amounts, Wind Direction, Wind Speed, Sensible Weather, and Wave Heights.
The NOAA XML page is an amusing demonstration of how difficult it is to maintain a web site unless one is careful. There are a number of broken links, and the text at the bottom claims it was last updated on the 4th of November, but the text in the body has clearly been updated a number of times since then.
Python and Java
There have been a number of posts in the blogs that I follow about Python is Not Java by Phillip Eby.
It was an interesting read for me too, although for different reasons which I won't go into now (basically about difficulties for code maintenance by scientists). Anyway, Ned Batchelder picked up one nice point about how java trained programmers writing python often don't exploit the power of python. Simon Willison picks up on the point of people writing getters and setters instead of relying on python properties (frankly I'd always wondered why people did that). Me, the quote I like is
Some people, when confronted with a problem, think "I know, I'll use XML". Now they have two problems. This is a different situation than in Java, because compared to Java code, XML is agile and flexible. Compared to Python code, XML is a boat anchor, a ball and chain. In Python, XML is something you use for interoperability, not your core functionality, because you simply don't need it for that. In Java, XML can be your savior ... But in Python, more often than not, code is easier to write than XML.
As an aside, following the link in Phillip Eby's page led me to Charles Miller on Regular Expressions. Given I'm thinking about wiki parsers at the moment (link), made me realise yet another benefit of blogging ... it's just like when one goes to the library to find a scientific paper that is referenced in another ... often you find a few others in the same issue/volume which you didn't know about which are really useful ...
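On the getters and setters point mentioned above, a minimal sketch of what a python property buys you. The class is a made-up example: callers use plain attribute syntax throughout, and the validation can be added later without changing any of them.

```python
class Thermometer:
    """Callers use plain attribute access; validation lives in the property."""

    def __init__(self, celsius):
        self.celsius = celsius  # goes through the setter below

    @property
    def celsius(self):
        return self._celsius

    @celsius.setter
    def celsius(self, value):
        if value < -273.15:
            raise ValueError("temperature below absolute zero")
        self._celsius = value


t = Thermometer(12.5)
t.celsius = 15.0  # looks like attribute assignment, but runs the check
```

In Java one would write getCelsius() and setCelsius() up front just in case; here there is no need until the day the check is actually wanted.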
Digital Signatures and Digital Preservation
At some point we will be moving to a situation where a user will digitally sign a request for access to licensed data for the situation where we currently require a physical signature. Currently we store those pieces of paper (for perpetuity) so that we can prove that the user involved had the rights to access the data should we ever be subject to a legal challenge.
When we move to digital signatures we will have the situation where a digitally signed document will look like (again in pseudo-xml):
<xml-root>
  <payload - the thing being signed>
  <signature>
    <the actual signature>
    <x509 cert of the signer>
  </signature>
</xml-root>
We need the x509 cert of the signer to be sure that the document is signed by the person we think signed it. We then need to use the public key of the root authority which signed their x509 certificate to validate it.
This means the preservation problem is now that we need to save both this (digitally signed) document and the public key of the root authority (as this may change with time).
The NDG b-schema is the heart of how we plan to support data browsing (as opposed to searching).
The key concept is that we have a simple relationship between observation stations (ObsStn), Data Production Tools (DPT), and Activities, which are linked together by Deployments as depicted below:
Examples of how it could work would be
describe a bunch of tools with SensorML.
describe a bunch of places where it could be deployed with the ObsStn schema (we'll have to roll our own)
describe the activities with free text
link them together in a deployment, which includes:
start date, end date
activityID, ObsStnID, DPT ID
any deployment specific settings (e.g. calibration coefficients as actually used) ... we'd have to make sure that the DPT schema supported this because the attributes of the deployment should only be selectable from schemas defined at the higher level.
describe the model capabilities with the EarleySuite
describe a bunch of computational environments with ObsStn (ok, we need to change the name of this thing)
describe the activities
link them together as above except the deployment specific settings would be from the schema of possible model settings implicit in the Earley Suite.
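The relationships above can be sketched as a handful of python dataclasses. This is only an illustration of the linking idea, not the NDG schema: every field name, the allowed_settings mechanism, and the example identifiers are assumptions of mine.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Activity:
    id: str
    description: str  # free text


@dataclass
class ObsStn:
    id: str
    description: str


@dataclass
class DPT:
    id: str
    description: str
    # Settings a deployment is allowed to record (hypothetical mechanism).
    allowed_settings: frozenset = frozenset()


@dataclass
class Deployment:
    activity_id: str
    obs_stn_id: str
    dpt_id: str
    start_date: date
    end_date: date
    settings: dict = field(default_factory=dict)


def check_settings(deployment, dpt):
    """Deployment settings may only use names the DPT description allows."""
    unknown = set(deployment.settings) - set(dpt.allowed_settings)
    if unknown:
        raise ValueError(f"settings not defined by the DPT: {sorted(unknown)}")


# Made-up example records.
radar = DPT("dpt-1", "MST radar", frozenset({"calibration_coefficient"}))
deployment = Deployment(
    "act-1", "stn-1", radar.id,
    date(2004, 1, 1), date(2004, 12, 31),
    settings={"calibration_coefficient": 1.02},
)
check_settings(deployment, radar)
```

The model case would look the same, with the allowed settings drawn from the model description rather than from an instrument description.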
Yesterday and today I got to spend some substantial time on trains (a trip to Liverpool actually) ... and spent the time learning about pyxmlsec (and by implication xmlsec). I was most interested in the application of digitally signing xml documents (and the subsequent verification). We need to do this for NDG authorisation.
Because xmlsec implements the W3C xml-signature standard, the whole thing is trivial. I had expected that I would have to do work parsing the signature element to find out what algorithm to use ... and I was worried about how to find the public key of the signer.
As I say, it all turned out to be relatively trivial, especially in python. In pseudo xml, we go from something like: <Document> <children> ... </children> </Document> to <Document> <children> ... </children> <signature> ... </signature> </Document>
But the beauty of it is that everything in the signature element is standardised, and one can even load the X509 public certificate of the signer into the signature. Having done that, the public key is travelling with the document. Of course to reliably verify one needs the public key of the signer of that person's (server?'s) certificate, but often that's going to be the root certificate, so we're likely to have that in a repository anyway.
So, it all becomes transparent and trivial in python ... I'm in the process of building a light weight sign and verify class that uses pyxmlsec, and then we won't have to worry about this any more ...
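To show what "everything in the signature element is standardised" looks like, here is a minimal sketch that builds an empty enveloped-signature template with the element names and algorithm identifiers from the W3C xml-signature standard. It uses only the python standard library and does no cryptography at all: the digest, the signature value and the certificate are left blank for a signing library to fill in. It is not the pyxmlsec API, and the helper names are my own.

```python
import xml.etree.ElementTree as ET

DSIG = "http://www.w3.org/2000/09/xmldsig#"
ET.register_namespace("ds", DSIG)


def q(name):
    """Qualify a tag name with the xml-signature namespace."""
    return f"{{{DSIG}}}{name}"


def add_signature_template(document):
    """Append an unsigned enveloped-signature skeleton to the document root."""
    sig = ET.SubElement(document, q("Signature"))
    info = ET.SubElement(sig, q("SignedInfo"))
    ET.SubElement(info, q("CanonicalizationMethod"),
                  Algorithm="http://www.w3.org/TR/2001/REC-xml-c14n-20010315")
    ET.SubElement(info, q("SignatureMethod"), Algorithm=DSIG + "rsa-sha1")
    ref = ET.SubElement(info, q("Reference"), URI="")  # "" means the whole document
    transforms = ET.SubElement(ref, q("Transforms"))
    ET.SubElement(transforms, q("Transform"), Algorithm=DSIG + "enveloped-signature")
    ET.SubElement(ref, q("DigestMethod"), Algorithm=DSIG + "sha1")
    ET.SubElement(ref, q("DigestValue"))      # filled in at signing time
    ET.SubElement(sig, q("SignatureValue"))   # filled in at signing time
    x509 = ET.SubElement(ET.SubElement(sig, q("KeyInfo")), q("X509Data"))
    ET.SubElement(x509, q("X509Certificate"))  # the signer's certificate travels here
    return sig


document = ET.fromstring("<Document><children/></Document>")
signature = add_signature_template(document)
```

A verifier reads the algorithms out of SignedInfo and the signer's certificate out of KeyInfo, which is why there is nothing left to parse by hand.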
Is Access Grid worth it?
At a meeting I attended yesterday, there was some discussion as to whether expenditure on supporting access grid was cost effective.
My personal experience of video conferencing consists of many AG sessions and many H323 based videocons. For video conferences with multiple sites involved, H323 is just plain hopeless (at least the way we do it). AG beats it hands down. So, if we need more than three sites, we have to have a physical meeting or use AG. So there is a case for AG over H323 (maybe in addition).
Let's do a worst case analysis - each site provides only one attendee to a meeting, then let's say each time there is a saving for that site of (about a hundred quid each in travel, plus 100 quid each in productive time spent not travelling). Let's guess at about 200 meetings a year (a vast underestimate at the CCLRC), which implies a saving of about 20,000 a year, which is about what the kit was in the first year.
With 40 sites in the country, we're saving about 800K per annum (this must be a very lower bound, the real number must be much more than that, because most meetings average several attendees at each site).
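The worst case sum is easy to redo with the inputs written out. The sketch below is just that arithmetic in python, with my own variable names. Taking the figures exactly as stated (100 travel plus 100 time per attendee, 200 meetings a year), it comes out at twice the numbers quoted above; the quoted 20,000 and 800K correspond to counting about 100 quid per meeting in total. Either way they are lower bounds.

```python
def annual_saving(meetings_per_year, travel_cost, time_cost, attendees_per_site=1):
    """Pounds saved per site per year by not travelling to meetings."""
    return meetings_per_year * attendees_per_site * (travel_cost + time_cost)


# Inputs as stated in the text: one attendee per site, 100 travel + 100 time.
per_site = annual_saving(200, travel_cost=100, time_cost=100)
national = 40 * per_site

# The quoted figures, reproduced by counting only 100 per meeting in total.
per_site_quoted = annual_saving(200, travel_cost=100, time_cost=0)
national_quoted = 40 * per_site_quoted
```

Here per_site is 40,000 and national is 1,600,000, against 20,000 and 800,000 for the quoted version.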
Such an analysis doesn't even address the carbon savings in not travelling, and to my mind these are even more important - nor does it address the international aspect.
However, the AG experience isn't as good as it could be, and I haven't included the (significant) cost of having an operator in these figures. For more sites to get involved and use AG, and to drop the operator cost, we need to improve it. On those figures alone, there is a prima facie case for us to spend very significant sums and effort on AG support.
Maths and Blogging
Inevitably I'm going to want to put maths in this blog, and if not here, on scientific wikis to support the CF (and other) projects.
I've just done a bit of a wander around the internet looking for wiki solutions, figuring that I'm going to want to extend the Leonardo wiki parser to do this. I started at http://c2.com/cgi/wiki?MathWiki, which as of today is a sequence of thoughts from a number of folk, which as the comment at the top says "needs refactoring". Nonetheless, there are a lot of good links off to math wiki projects.
The general consensus is that one needs somehow to support inline latex commands, at least until MathML support is widespread amongst common browsers (hopefully in my lifetime?).
I had a look at two solutions in particular, firstly, the Ian Hutchinson Tex-to-HTML translator (TtH), and secondly, Bob McElrath's LatexWiki. After spending a few minutes looking at each of them, it seemed that the former probably required the wiki viewer to do too much hard work (my KDE 3.3 konqueror didn't render things right first time ...), whereas the latter just worked (via presenting in-line images). The latter has the advantage of being python based too ... (and released under the GPL, the former has a "commercial version with additional functionality", which is a bit of a turn-off). Both have problems with IE, and what you see if you choose to print the page will be very disappointing.
It seems clear to me that the state of play in this area is far more immature than one might have expected, and might repay some sort of investment in time (although regrettably not by me).
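For what it's worth, the in-line image approach boils down to something like the following minimal sketch. It is not how LatexWiki or the Leonardo parser actually work: the dollar-sign delimiter, the /math image directory and the function name are all assumptions of mine, and the step that actually runs latex to produce each image is left out entirely.

```python
import hashlib
import html
import re

# Assumed wiki syntax: inline maths is written between single dollar signs.
INLINE_MATH = re.compile(r"\$(.+?)\$")


def render_math(wiki_text, image_dir="/math"):
    """Replace each $...$ span with an <img> tag pointing at a rendered image."""

    def to_img(match):
        latex = match.group(1)
        # Name the image after a hash of its source so repeated formulae share a file.
        name = hashlib.md5(latex.encode("utf-8")).hexdigest()
        alt = html.escape(latex, quote=True)
        return f'<img class="math" src="{image_dir}/{name}.png" alt="{alt}">'

    return INLINE_MATH.sub(to_img, wiki_text)


page = render_math(r"Euler: $e^{i\pi}+1=0$ holds")
```

Keeping the latex source in the alt text at least gives something readable when the image is missing, which matters given the printing and IE problems noted above.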
Knowledge Management Horizon
Today I attended a workshop on "Towards Integrated Knowledge Management", organised as part of a DEFRA horizons scanning project. These horizon projects are meant to give DEFRA a chance (!) of being ready for the next environmental crisis, instead of always being in reactive mode. A totally laudable aim.
I was taken with both the number of people there, and the evident belief of some of them that it was possible to guess what questions could be answered (let alone asked of) by a knowledge management system in twenty years time.
While some things should be taken as read (the systems will be more complex, incorporate more provenance, and provide far more context for data), others are not nearly so predictable. In fact, I would argue that the only way to get from now to then (ten+ years) is to improve our existing systems so they are capable of effectively answering today's questions. Arguably, the questions of tomorrow will be the same as today (which have been the same for a thousand years): how can we make our homes safer, improve our own lives without exploiting others, and give our children a better future? However, no one can predict what technology we'll have in ten years' time, so the very best we can do is take incremental steps from where we are now and we'll get there ... so we need to concentrate on maximising the amount of contextual data we store with our data and on interoperability. If we do that, and avoid Intellectual Property Rights (IPR) and ownership issues, then we'll be on the right path.
I hope this leads to modular service orientated architectures, with published standards compliant data structures underpinning them, and populated with as much data and metadata as possible. Such systems should be built around the communities of existing users, and not central monoliths. However, I fear an over engineered centralised approach ... If only governments would spend as much on acquiring and managing data as they do on poorly defined software projects.
I've been interested in blogs, and blogging software, for a long while. I run the KDE akregator on my laptop, and find that I have learnt a lot from other folks' ruminations (as well as having been appalled and amused to various degrees).
For a long time I've also tried various ways of keeping notes, many of which I want to make public to my friends and colleagues. I've finally found a piece of software that meets most of my requirements, and because it's written in python, I can add the functionality that I need as I require it.