Bryan's Blog 2004/12
I have just had my attention drawn to hacknot, where I found an excellent article on "Basic Critical Thinking for Software Developers". In the article, "Mr Ed" makes clear the importance of making verifiable statements if you want to contribute any information to a discussion about software engineering practices. I think, however, that the article has much wider applicability, and applies to just about any discourse. If one wants to minimise wasted time, being very clear about exactly what you are talking about is absolutely crucial. James Tauber made much the same point in his blog on Successful Technical Discussions.
In real estate it's "Location, Location, Location", in technical discussions, it should be "Definition, Definition, Definition" ...
Meteorology versus Money
It's a known problem that "big money" doesn't always care about the environment. We all understand physical pollution (oil in the sea, PM10 in the atmosphere, etc.), but this one is a little different: the BBC has reported the problems that occur with "frequency pollution".
Essentially the issue is that the spectrum is finite, and commercial applications are getting priority over the use of the spectrum for remote sensing of the environment. For example, the 23.6-24 GHz band, which has the unique property of being sensitive to water vapour but not to liquid water, is now under threat from "car radars"! The article also describes other bands already lost ...
Service Orientated Architecture
The NERC DataGrid concept is built around scalability and a service orientated architecture. We've gone for minimal central control of anything, and a loosely coupled set of services, not all of which are integral to getting things done. As you might expect, the more of the NDG services you deploy, the less you have to do yourself, but we hope that we are imposing the minimum on the user of the system (and data). In fact, I think at the minimum, we are imposing only:
the use of a browser to find data (if they don't already know where it is).
the user to login somewhere which is "ndg-enabled". In practice this requires the user to have their own digital certificate and/or their ndg "login-supplier" to generate a proxy on their behalf.
From then on, the user will be able to deploy as much or as little of the infrastructure we are going to build as they want.
I'm prompted to write these things because I found this via Savas. In it, Steve Maine discusses four tenets of service orientated architecture design (from Benjamin Mitchell I think); my summary is:
Build in explicit boundaries between services (with, by definition, explicit interfaces)
Services are autonomous, avoiding hidden implicit assumptions about interactions
Policy-based negotiation between services and clients
Share schema and contract, not type (avoid the baggage of explicit type; allow the things on either end to implement them how they like)
I think NDG has been designed in a way that is conformant with these ideas, which are not dependent on any specific web service technology. That is a good thing, because web service technologies are shifting sands from an implementation point of view.
Being conformant with these ideas, we put minimal constraints on the users of the system, and that should make it more useful as a data provider toolkit, which was our aim.
Elements and Attributes
Reminder: <element has-attribute="something"> stuff </element>
I say: Use attributes unless you truly need elements. You need elements for a thing if the thing can be repeated, or is itself structured, or has semantics based on its order among its peers.
This was possibly more of a summary than was warranted, but the essential point was fair. I think the bit that got lost in the summary was that in designing a schema, by putting something in as an attribute you are constraining things in a way that is obvious (each attribute can have at most one value). You can use a schema to constrain something to have only one child of a specific element (I presume), but that's not obvious! I think being obvious is a good thing; after all, many (most?) schemas will be used by people who haven't thought about the design decisions, and are reacting to them.
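The point about obviousness can be made concrete with a small sketch (the station record here is entirely hypothetical):

```python
import xml.etree.ElementTree as ET

# A hypothetical station record, written both ways. The attribute form
# makes the "at most one value" constraint visible in the syntax itself;
# the element form will happily parse a repeated child.
attr_form = ET.fromstring('<station id="cambourne" lat="50.2" lon="-5.3"/>')

elem_form = ET.fromstring(
    '<station><id>cambourne</id><lat>50.2</lat><lat>50.3</lat></station>'
)

print(attr_form.get('lat'))                        # → 50.2 (attributes cannot repeat)
print([e.text for e in elem_form.findall('lat')])  # → ['50.2', '50.3'] (elements can)
```

Nothing short of a schema (and a validating parser) stops the second `<lat>` in the element form, which is exactly the non-obviousness I mean.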
Nice review of RSS and scientific publishing
NDG is currently being developed around eXist, but we've had problems with scalability and reliability. It's not yet obvious why ... but as a consequence we've had to have some backup plans for what would happen if eXist failed to deliver for us. One of our backup plans was to keep an eye on the Berkeley DB XML project. We didn't use it initially because we wanted to exploit XQuery. Well, now, with version 2.07 announced (see previous link), it supports XQuery, so no doubt we'll be testing it in the coming months. It's very pleasing that there are already SUSE RPMs.
I have spent most of today writing a response to the various reviewers about our extension to the NDG activities.
One of the reviewers asked: "Why Python? Outside the computer science community, how well known is the language?". These are actually common questions for me, as Python isn't really well established in the environmental sciences.
My answers are simple:
Python is ridiculously easy to learn,
the code is easily maintainable,
the code can be written very quickly,
we can self-train staff (rather than use expensive external courses and/or hire expensive programmers without environmental science backgrounds),
it has an enormous range of additional libraries,
it is easy to extend using Pyfort/SWIG etc. to use our favourite high-level codes,
it has great cross-platform support, and
there are already excellent toolkits out there: Numarray and Numeric provide excellent support for doing real calculations.
Not to mention the "good programming aspects" that plenty of others can comment on better than me.
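A deliberately trivial sketch of the sort of terse, readable code I mean (the records are invented): daily temperatures reduced to a mean and anomalies, with essentially no boilerplate to explain to a trainee.

```python
# Invented (day, temperature) records; the point is the readability,
# not the science.
records = [(1, 3.2), (2, 4.1), (3, 2.8), (4, 5.0)]

temps = [t for _, t in records]
mean = sum(temps) / len(temps)
anomalies = [t - mean for t in temps]

print(mean)
print(anomalies)
```

With Numeric or Numarray the same operations apply across whole gridded arrays, which is where the real calculations happen.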
Is it used outside of Computer Science?
Astronomy (e.g. the space telescope science institute and PYRAF)
High Energy Physics (e.g. SLAC)
Quantum Chemistry (e.g. pyquante)
Computational Biochemistry (e.g. biopython)
For all of these reasons, within NCAS, we at BADC have decided to try and support the wider community with free python tools, so that they can make the best use of data products from our scientific research.
Not sure one should confess to reading slashdot, but they pointed me to this in the Houston Chronicle, basically, the bottom line is that a company ("Green Mountain Energy") is providing Texan electricity consumers with electric power from a mix of hydroelectric and wind generation at prices competitive with fossil fuel suppliers!! In Texas of all places. There is hope yet.
More on Trackbacks
Yesterday I spent some time implementing the trackback spec in Python. I know there are other implementations in Python, but I wanted to really understand this stuff. I'll put my implementation up here soon (it's intended for use in Leonardo and other places), but for now I want to record some of the things that it got me thinking about. The key things I discovered were some inconsistencies in the spec:
why does the trackback have (uri,excerpt) but the rss response have (link,description)?
how extensible is it? I want to put some semantics in my trackback ping ... to tell the target what sort of trackback is coming in (it might not be a blog; it might be an analysis programme, or a formal citation). Thinking this must be a bit old hat for the semantic web folks, I went off on a bit of a google on trackback and semantics.
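For reference, the ping itself is just a form-encoded HTTP POST to the target's TrackBack URL; a minimal sketch of building one (the function and variable names are my own):

```python
from urllib.parse import urlencode

def build_trackback_ping(target_url, url, title=None, excerpt=None, blog_name=None):
    """Build a TrackBack ping: per the spec, an HTTP POST of
    application/x-www-form-urlencoded parameters to the target's
    TrackBack URL. Only 'url' is mandatory."""
    params = {'url': url}
    if title:
        params['title'] = title
    if excerpt:
        params['excerpt'] = excerpt
    if blog_name:
        params['blog_name'] = blog_name
    headers = {'Content-Type': 'application/x-www-form-urlencoded; charset=utf-8'}
    return target_url, headers, urlencode(params)
```

The target replies with a small XML document containing an error code (0 for success, 1 with a message otherwise); note there is no obvious place in those parameters to hang any extra semantics.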
At a recent tech discussion Mark Nottingham pointed out that the real difference between RSS and RDF (the cornerstone of the semantic web initiative) was that RSS was about lists. On the one hand this is true, however, the term list understates a crucial point about weblogging. Weblogging is designed to deal with nuggets of information that an author creates instead of a page that a publisher publishes. A permalink refers to a unique item, and in terms of the semantic web, indicates a component from which meaning can be extracted.
Ok, this is fine, but I'm not so interested in web logging per se, I'm thinking about links between nuggets that carry information about what sort of link they are (sounds like a candidate for rdf to me already) ...
(As an aside, I found a two year old post about pingback which compared it with trackback. It seems there aren't many active implementations of pingback, but if the concept of trackback without semantics is what you want it would seem simpler to me. Did it catch on? Doesn't seem so!)
I found some useful discussions of trackback in 2003 here and here. But better, I found this describing exactly what I mean, and a comment asked: "How in particular does RDF not work for you as a linking technology?" Which is what I started thinking as I read the trackback spec anyway - it uses RDF for the autodiscovery, why not in the ping itself?
That's a very good question, and is a good place to stop for today. My simple answer is: where is the (extendable) controlled vocabulary which defines the types of triples that would be allowed? I can easily imagine remotelink cites permalink, remotelink incorporates permalink, or remotelink headlines permalink for textual things, but for data I might want the predicate to be something more akin to a pointer to workflow ... hmmm ... but these would be meaningless unless a finite group of folk understood and implemented the words cites, incorporates, and headlines.
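As a sketch of what I mean, typed links are just (subject, predicate, object) triples constrained to a small controlled vocabulary; the predicate names below are the hypothetical ones from the text.

```python
# Typed trackback links as RDF-style triples, constrained to a small,
# extendable controlled vocabulary. Everything here is illustrative.
VOCABULARY = {'cites', 'incorporates', 'headlines'}
links = []

def add_link(remotelink, predicate, permalink):
    # Refuse predicates the community hasn't agreed on: without a shared
    # vocabulary the triple is meaningless to the receiver.
    if predicate not in VOCABULARY:
        raise ValueError('unknown predicate: %r' % predicate)
    links.append((remotelink, predicate, permalink))

def citing(permalink):
    # Everything that cites a given permalink.
    return [s for s, p, o in links if p == 'cites' and o == permalink]

add_link('http://example.org/blog/42', 'cites', 'http://example.org/data/xyz')
```

The hard part is not the data structure, of course; it is agreeing and governing the vocabulary.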
Clearly, at this point there are some more things I need to read up on, including, but not limited to:
Naomi Oreskes reports an analysis of all 928 papers indexed by ISI with the keywords "climate change" between 1993 and 2003 (indexed by ISI means they had to be peer reviewed!).
The 928 papers were divided into six categories: explicit endorsement of the consensus position, evaluation of impacts, mitigation proposals, methods, paleoclimate analysis, and rejection of the consensus position. Of all the papers, 75% fell into the first three categories, either explicitly or implicitly accepting the consensus view; 25% dealt with methods or paleoclimate, taking no position on current anthropogenic climate change. Remarkably, none of the papers disagreed with the consensus position.
The final statement was pretty unambiguous:
Many details about climate interactions are not well understood, and there are ample grounds for continued research to provide a better basis for understanding climate dynamics. The question of what to do about climate change is also still open. But there is a scientific consensus on the reality of anthropogenic climate change. Climate scientists have repeatedly tried to make this clear. It is time for the rest of us to listen.
As an aside, I've had to link to this paper via the doi redirect (dx.doi.org), rather than via the doi:blah syntax because browser support for doi is still limited ...
I recently had to make a statement on the impact of publishing in a particular journal. I did it, but didn't like doing it. Most scientists probably have a subliminal hierarchy of journals, something like (for me):
Impressive (Tellus, J.Atmos.Sci., Q. J.Roy.Met.Soc, J. Climate. Climate Dynamics)
Very Good (J. Geophys. Res., Geophys. Res. Letters., Annales Geophysicae)
Good (J. Atmos. Terr. Phys.)
Worth Doing (Weather, Bull. Am. Met. Soc)
Not worth it (censored :-) ).
The order has no real meaning to anyone but me, and reflects my interests at the moment (when I concentrate on mesospheric work, I would rate JASTP higher and climate journals lower, for example). But these sorts of lists seem to have rather a lot of resonance with bean counters; one such list is the in-cites journal impact factors (my local copy). These impacts are critically important to the evaluation of University research (e.g. the UK RAE), but I wonder how sensible they really are. Of course it's easy to produce such a list, but given my own priorities change, surely so too do the community's - are we really a normal distribution? What about people working in areas where there are simply fewer people working? Should academic research rankings depend on the herd? I'm sure the RAE normalises these things somehow, but it's still rather discomforting.
And what about Open Access publishing? Those in the scholarly publications game know that the world is moving towards Open Publishing. There is evidence that open access publishing increases impact, but most of us don't yet match our subliminal ranking of journals with our actions. It is certainly true now that I am far more likely to read a journal article if I can get to it online, and that is starting to sway my decision about where to publish (actually, I don't get time to think about publishing science at the moment, so this is a bit theoretical).
Of course, wearing my data hat, the really big changes we can expect in ranking scientific output will come when we have effective methods of citing datasets! One very interesting project doing work in that area is introduced here. Of course, there is formal citation, and there is google ... and now http://scholar.google.com. Hopefully we can move towards the same "easy" form of data citation too.
The U.S. National Weather Service provides forecast data (apparently just for the continental US) via a SOAP service. It looks very simple and very easy to use, with a simple syntax. I only found out about it today, even though from the look of the site it has been around a while (the docs date from 2003, and the "trial" was supposed to end on the 1st of August).
Parameters supported are:
Maximum Temperature, Minimum Temperature, 3 hourly Temperature, Dewpoint Temperature, 12 hour Probability of Precipitation, Liquid Precipitation Amounts (Quantitative Precipitation Forecast), Snowfall Amounts, Cloud Cover Amounts, Wind Direction, Wind Speed, Sensible Weather, and Wave Heights.
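To give a flavour of what calling such a service involves: you build a SOAP envelope and POST it to the endpoint. The operation and element names below are invented for illustration, not those of the actual NWS service.

```python
# Sketch of a SOAP request body; the operation and element names are
# hypothetical, standing in for whatever the service's WSDL defines.
SOAP_TEMPLATE = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetForecast>
      <latitude>{lat}</latitude>
      <longitude>{lon}</longitude>
      <parameter>{param}</parameter>
    </GetForecast>
  </soap:Body>
</soap:Envelope>"""

def forecast_request(lat, lon, param):
    """Fill in the envelope for one forecast parameter at one point,
    e.g. maximum temperature somewhere in Texas."""
    return SOAP_TEMPLATE.format(lat=lat, lon=lon, param=param)
```

In practice a SOAP toolkit generates all this from the WSDL, which is why the service can claim to be simple to use.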
The NOAA XML page is an amusing demonstration of how difficult it is to maintain a web site unless one is careful. There are a number of broken links; the text at the bottom claims it was last updated on the 4th of November, but the text in the body has clearly been updated a number of times since then.
Every day I learn something more about blogging, and every day I can see how utterly useful it will be in a real atmospheric science application, not just as a means of communication amongst computer geeks (myself included I suppose).
In a nutshell, TrackBack was designed to provide a method of notification between websites: it is a method of person A saying to person B, "This is something you may be interested in." To do that, person A sends a TrackBack ping to person B.
and goes on to define:
TrackBack ping: a ping in this context means a small message sent from one webserver to another.
The explanation for why is all in the context of blogging:
Person A has written a post on his own weblog that comments on a post in Person B's weblog. This is a form of remote comments--rather than posting the comment directly on Person B's weblog, Person A posts it on his own weblog, then sends a TrackBack ping to notify Person B.
Person A has written a post on a topic that a group of people are interested in. This is a form of content aggregation--by sending a TrackBack ping to a central server, visitors can read all posts about that topic. For example, imagine a site which collects weblog posts about Justin Timberlake. Anyone interested in reading about JT could look at this site to keep updated on what other webloggers were saying about his new album, a photo shoot in a magazine, etc.
Why am I suddenly so interested? Well, the NDG metadata taxonomy includes a class C for Character annotation. We envisaged wanting to be able to comment on a dataset which would allow the data provider to harvest the comments and display them to the data users. This is exactly what trackback is all about.
What I love about Trackback is that it turns links from being one-way creatures into being truly bidirectional ... What this means is that you can only really read conversations on the Web the same way you read blogs: by going backwards in time. You find something interesting, and you chase links backwards from it ... (and only backwards)... Trackback completely cuts this Gordian knot. There are ordinary links in blog entries: these ones take you to the thing being commented on, backwards in time. But if you Trackback-ping when you make these links, you've created Trackback links in the stories you're commenting on. These Trackback links take readers from the commented-on to the comments, forwards in time. You can run down an entire conversational tree, from start to finish, just by reading Trackback links after you read blog entries.
From a data point of view, this means we can publish a dataset, and then users of the data can publish calibration data and experience, and someone finding the data after discovery can go forward from there to user experience ...
Python and Java
There have been a number of posts in the blogs that I follow about Python is Not Java by Phillip Eby.
It was an interesting read for me too, although for different reasons which I won't go into now (basically about difficulties for code maintenance by scientists). Anyway, Ned Batchelder picked up one nice point about how Java-trained programmers writing Python often don't exploit the power of Python. Simon Willison picks up on the point of people writing getters and setters instead of relying on Python properties (frankly, I'd always wondered why people did that). Me, the quote I like is:
Some people, when confronted with a problem, think "I know, I'll use XML". Now they have two problems. This is a different situation than in Java, because compared to Java code, XML is agile and flexible. Compared to Python code, XML is a boat anchor, a ball and chain. In Python, XML is something you use for interoperability, not your core functionality, because you simply don't need it for that. In Java, XML can be your savior ... But in Python, more often than not, code is easier to write than XML.
As an aside, following the link in Phillip Eby's page led me to Charles Miller on Regular Expressions. Given I'm thinking about wiki parsers at the moment (link), this made me realise yet another benefit of blogging ... it's just like when one goes to the library to find a scientific paper that is referenced in another ... often you find a few others in the same issue/volume which you didn't know about which are really useful ...
Digital Signatures and Digital Preservation
At some point we will be moving to a situation where a user will digitally sign a request for access to licensed data, where we currently require a physical signature. Currently we store those pieces of paper (in perpetuity) so that we can prove that the user involved had the rights to access the data, should we ever be subject to a legal challenge.
When we move to digital signatures we will have the situation where a digitally signed document will look like (again in pseudo-xml):
<xml-root>
  <payload - the thing being signed>
  <signature>
    <the actual signature>
    <x509 cert of the signer>
  </signature>
</xml-root>
We need the x509 cert of the signer to be sure that the document is signed by the person we think signed it. We then need to use the public key of the root authority which signed their x509 certificate to validate it.
This means the preservation problem now requires us to save both this (digitally signed) document and the public key of the root authority (as this may change with time).
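A sketch of the kind of preservation record this implies (the field names and the digest are my own invention, standing in for a proper archival format):

```python
import hashlib
import json
import time

def preservation_record(signed_document, root_public_key_pem):
    """Archive the signed document together with the root authority's
    public key that was current at signing time, plus a digest of the
    whole record so later corruption is detectable."""
    record = {
        'archived_at': time.strftime('%Y-%m-%dT%H:%M:%SZ', time.gmtime()),
        'signed_document': signed_document,
        'root_public_key': root_public_key_pem,
    }
    # Digest over a canonical serialisation of the record itself.
    record['digest'] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode('utf-8')
    ).hexdigest()
    return record
```

The key design point is that the root public key is stored with the document rather than looked up later, precisely because the authority's keys will change over the archive's lifetime.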