Bryan's Blog 2005
The Penultimate Broadband Post
Finally, just over three years after I started trying to get ADSL broadband at home, it has happened ... this last saga has taken six weeks from inception, including two actual engineer visits and two that didn't happen. I can see why they want a one year contract!
Anyway, it turns out that despite living about 7km from the exchange we have a nice clean piece of wire and an SNR of 12 dB for the 512 kb/s service. I can live with this (I'll summarise the pros and cons wrt satellite and why I made this change in a later posting).
While the systems that interface BT broadband to BT wholesale are clearly appalling and let the people who front them down, the people themselves have been great. The bloke who came yesterday was very helpful and professional (and made sure that what he left in terms of cabling was what we wanted, not what our legacy was).
by Bryan Lawrence : 2005/12/31 : Categories broadband : 0 trackbacks : 0 comments (permalink)
From Software Patents to Silly Trademarks
Many will know that I'm against software patents - I still like the claim (can't remember the source) that every few lines of Java probably infringes someone's patent. The thing that most annoys me is that patent armouries would appear to be acquired for use by the big players to out-bluff the little players, who simply can't afford to fight (why else would Microsoft be patenting things which are patently not patentable?)
It appears that this is a silliness that has wider application: Stuff is reporting the strange case of a kiwi winemaker not being able to use the name kiwi on their label in Europe, because, wait for it - a French winemaker has rights to the trademark! Similarly, some British company has rights to the name Aotearoa.
In the former case it turns out that the NZ winemaker has used the word kiwi on a label in Europe before the French one, so would probably win in court but can't afford to fight. Sound eerily familiar to the software patent world? In the latter case, the Maori word for New Zealand (that's right, the indigenous people's name for their country) has been trademarked by a company in the old world. That doesn't seem fair either ...
by Bryan Lawrence : 2005/12/31 : 0 comments (permalink)
Data Citation
One of the aims of CLADDIER is to establish methodologies for data citation. There are a lot of issues, some of which have been addressed before by projects such as a German pilot experiment. The way I see it, things we have to address include:
How do we deal with dataset transience? My first reaction to this question was to say that one has to make dataset publication final - if one wants to change the dataset, one should publish a new version. However, it doesn't take long to realise that while that might make sense for datasets that are in some way canonical, it isn't a general solution. Many scientific datasets are subject to more or less continual revision as more data is collected (for example, covering a wider spatial or temporal domain), or data is revised as quality control is improved. In the case of most of our data, the latter situation should result in revised datasets - which was the situation I had envisaged - but in a gene database for example, that doesn't make nearly as much sense if one plans to cite the database, which one expects to change as new sequences are added and old ones revised. This leads us to
How do we cite something within a database/dataset? Here I think the answer is that database contents should be thought of as consisting of features in the OGC/ISO sense; that is, of objects which are instances of some named class which is well defined (with the class definition being well described and lodged in some registry). In that case, all such instance objects should be described by a unique identifier and thus citeable in some way. However, it is important that as such instances are revised/improved/replaced, not only do the new instances carry new identifiers, but the old instances must carry information that they are obsolete.
What metadata is needed to publish a dataset (as opposed to, say, a book)? In the ndg project, we have been investigating metadata, and we think there are at least four categories of metadata one needs to deal with, which we have briefly summarised as:
A (for Archive): what you need to understand the format and direct content (i.e. what quantities are actually stored).
B (for Browse): what you need to understand the context of the data, and to allow you to choose between otherwise similar datasets (hence Browse).
C (for Character or Citation): what you need to support annotation of the data.
D (for Discovery): what you need to find the datasets - broadly similar to the catalogue records in a library, but enhanced because we're dealing with data.
If substantial amounts of metadata are needed, how do we deal with the definition of authorship for datasets? In the four categories above, one might expect the data originator to supply A, and some of the B metadata, but in reality it is rare for a dataset collector to provide enough B metadata ... they have too much institutional wisdom which they fail to encode. Hence, to my mind, dataset publication will involve significant efforts from third parties (not the collector, nor necessarily, the publisher), and that effort will be in the form of material which, were it in a standalone document, would be reflected with coauthorship ... I think however, for data, we need recognised categories of authorship which clearly delineate between the collector/compiler, and the metadata creators, but which reflect the academic aspirations of both.
If we want reputable data citation systems, how do we deal with refereeing? Clearly we (the BADC) could publish datasets ourselves, and indeed, we intend to do so; however, what guarantee of quality will our users get? Obviously we can do internal quality control, but that can only deal with compliance of the data to some schema, e.g. it has the right metadata, the numbers are within bounds, etc. What we can't do, and what we need refereeing for, is to comment authoritatively on, for example:
the suitability of using a particular instrument to make a specific measurement, or
the suitability of a particular algorithm for combining data.
How do we deal with persistence of the digital objects that we cite? This is a whole topic in itself, but all objects which we expect to be citable, have to have identifiers which are live as long as the objects, and in particular, are independent of the storage mechanism, or the interfaces.
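To make the point above about citing things within a dataset a little more concrete, here is a minimal Python sketch of the bookkeeping it implies: each feature instance gets its own identifier, and a revision creates a new instance while the old one records that it is obsolete. The names (`FeatureInstance`, `superseded_by`, `revise`) and the use of random UUIDs as identifiers are assumptions made purely for illustration; they are not a CLADDIER or NDG design.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FeatureInstance:
    """One citable object in a dataset (hypothetical record layout)."""
    feature_class: str   # name of the registered class definition
    content: str         # stand-in for the actual data
    # Unique identifier, independent of where or how the object is stored.
    identifier: str = field(default_factory=lambda: uuid.uuid4().hex)
    # Filled in once a newer instance replaces this one.
    superseded_by: Optional[str] = None

    @property
    def obsolete(self) -> bool:
        return self.superseded_by is not None


def revise(old: FeatureInstance, new_content: str) -> FeatureInstance:
    """Publish a revision: new identifier, and mark the old instance obsolete."""
    new = FeatureInstance(old.feature_class, new_content)
    old.superseded_by = new.identifier
    return new


v1 = FeatureInstance("TemperatureProfile", "raw values")
v2 = revise(v1, "quality-controlled values")
```

A citation of `v1.identifier` still resolves after the revision, but whoever follows it can see that the object is obsolete and where its replacement lives.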
by Bryan Lawrence : 2005/12/22 : Categories badc curation claddier metadata : 4 comments (permalink)
Life in blog world
The ups: William Connolley has found my blog - it's always nice when someone notices what you've been writing and likes it.
The downs: some of my blog entries have suffered comment spam ...
Dammit. This means I have to spend some time upgrading leonardo to try and avoid comment spam. I was planning on doing some useful things, like more thinking about a decent search provider in the backend, or version control. Oh well. That's life.
by Bryan Lawrence : 2005/12/22 : 0 comments (permalink)
BT comedy continues
Well you have to laugh. Just over a month ago, I ordered BT broadband. Finally, last week, they rang up and booked the home highway conversion for today.
Just towards the end of the engineer's target time, he rang up and said he couldn't deliver a 1Mb service, so he wasn't coming. "What?" I said. We told BT we're far enough from the exchange we'd only be able to get 512 kb at best. But no, apparently the order was for 1 Mb, so he couldn't deliver a 512 kb service ... so I'd need a new order for 512, then he'd be able to come.
You have to laugh. Customer service? Common sense? Apparently neither.
Update 21/12/05: I spoke to someone on Monday night (the 19th) who said she would get back to me. She didn't. Today I managed to make another engineer's appointment (apparently nothing had been done to make that happen meanwhile). So, we'll see what I report after the 30th (which is when they're due).
by Bryan Lawrence : 2005/12/19 : Categories broadband : 0 comments (permalink)
carbon tax
Stuff is reporting a Dominion-Post article which reports an economist as in part debunking a self-serving analysis a bunch of kiwi companies have done on the impacts of a carbon tax on their bottom lines.
The article was depressing for a number of reasons, even though ostensibly it shouldn't have been: the headline point was that an economist was arguing that the carbon tax might not have negative implications. But somehow, the reporter involved managed to finish with this as a concluding paragraph:
The case studies should be a warning that the carbon tax, like any unnecessary tax, was unhelpful to New Zealand companies.
As a conclusion, it didn't reflect the message the headline suggested, nor reality. Who decided (the reporter, or the Business New Zealand chief executive, Phil O'Reilly who was quoted in the previous paragraph) that the tax is unnecessary?
Actually, O'Reilly, like his predecessors, seems like a bit of a plonker; he was quoted as saying:
a carbon tax would make business less competitive and less able to afford energy efficient technology.
Surely the way to avoid the carbon tax is to use energy efficient technology?
Someone (who knows whether it was O'Reilly or the reporter?) made the point also that carbon taxes in Kyoto-respecting-economies could damage company effectiveness against Australian, US and Chinese goods. Well, that's certainly true if one is paying for emitting carbon, but when one changes to more energy efficient tooling, there ought to be a competitive advantage against them ...
Like I say, a depressing article, both because of the content, and the quality of the reporting itself ...
by Bryan Lawrence : 2005/12/17 : Categories environment : 0 comments (permalink)
Not enough time for blogging either ...
The problem is that the creative flow is really uneven. First half of the week, I was on the road and didn't post much, now for some reason I have five big pieces squirming around in the back of my head wanting to be written; and the nasty bit is that if I don't write them, they don't save up, they evaporate. I suspect there isn't a solution.
Oh how true! Most things don't even make it to draft. I have some big pieces (as in I want to write a lot) lurking in my drafts folder ... but it seems that if I don't work them up within a fortnight ... they just stay there ...
by Bryan Lawrence : 2005/12/17 : 0 comments (permalink)
Too many meetings. Not enough time.
One of my colleagues (thanks Ag) has drawn my attention to this meeting (the 5th ECSN Data Management Workshop). It looks like there were some excellent and interesting presentations.
I can't for the life of me remember whether I knew about it and couldn't go, or didn't know about it (but still couldn't have gone). Ever since the advent of the e-science programme I feel like one has to run to stand still just to keep up with what everyone else is doing, let alone find time to do anything oneself.
by Bryan Lawrence : 2005/12/15 : Categories badc ndg : 0 comments (permalink)
elementtree is in python2.5
See Fredrik Lundh's effbot here and here for very good news for us in terms of being able to more easily deploy xml technologies ...
by Bryan Lawrence : 2005/12/14 : Categories xml python ndg : 0 comments (permalink)
Understanding NERC Funding
The group I am responsible for includes both the British Atmospheric Data Centre and the NERC Earth Observation Data Centre. These are both NERC Designated Data Centres, which means they are responsible for archiving the data products of all NERC atmospheric and EO research. One of our problems is that we don't really know about all the projects which NERC funds - we (and NERC central office) are working on ways to improve our information flow. Meanwhile, one of my colleagues (thanks Victoria!) has done an analysis of NERC funding by programme mode, using NERC grants on the web. As of now:
NERC has 1017 awards worth a total of £195 million.
From a list of science areas (Atmospheric, Earth, Freshwater, Marine, Terrestrial), there are 340 awards which are marked as atmospheric, totalling £79 million.
Of the grand total (1017), 203 awards, totalling £35 million, are classified as Earth Observation.
There is of course a deal of overlap in those classifications - 120 of the 203 are also classified as atmospheric.
(These are the totals of grants; NERC funds activities in a variety of other ways too, particularly by contributing strategic funds to underpin support for their research and collaborative centres - the latter include us!)
by Bryan Lawrence : 2005/12/14 : Categories badc strategy : 0 comments (permalink)
leonardo 0.7 beta and categories
This afternoon I upgraded to leonardo-0.7beta, and for fun, went back and categorised most of my posts back to the beginning of May (see categories). Next time I feel too slow to do any real work, I'll go back and categorise the first six months of my blogging history.
by Bryan Lawrence : 2005/12/14 (permalink)
Hemel Hempstead Fire
Here is a low resolution copy of an AATSR image of southern England for Sunday morning ...
[Image: low resolution AATSR image of southern England, Sunday morning]
Of course, I'm just interested in the two cloud decks that I've circled as well :-)
by Bryan Lawrence : 2005/12/13 : Categories environment (permalink)
Stratosphere to Troposphere Coupling and the Solar Cycle
I've just seen Hameed and Lee's paper on a mechanism for sun climate connections. Their results
show that the circulation anomalies caused by stratospheric warmings propagate down to the surface much more frequently under solar maximum conditions than under solar minimum conditions. This suggests that solar perturbation of the stratosphere by ultraviolet radiation variations followed by downward propagation of resulting circulation anomalies to the surface is the principal sun-climate mechanism.
This work depends on an argument that a colder stronger vortex is less easily disturbed, and so signals are less likely to propagate downwards. While I have to believe this argument (I was involved in developing the train of logic in its incipient stages), I can't bring myself to believe it's the dominant mode of solar-climate connection. I think that's far more likely to involve direct modification of the major tropospheric circulation systems (although I can see how they might be influenced by changes in Rossby wave propagation in the stratosphere coupled with this signal).
I also find the statistics of these arguments difficult to deal with. They start with apparently quite robust statistics, but when one looks into the details a bit of disquiet pops out. For example, it was noted that these results are consistent with earlier work by Labitzke et al
who noted as early as 1982 that major midwinter stratospheric warmings do not occur during the QBO westerly phase except near solar maxima.
Sure, but there aren't a hell of a lot of case examples to build that on. Hameed and Lee, and Baldwin, and Gray and Labitzke and the whole sequence of folk who have worked on this problem haven't done anything wrong, and I'm sure the stats are officially ok, but at the end of the day, it's just statistics, without the mechanism ... and like I say, while I can understand the mechanism, I just can't believe it's the principal mode of solar climate coupling. Given my background, I'd like it to be, but ...
2005/12/13 : Categories climate : 0 comments (permalink)
UN economic development and child mortality
As I scoffed my sandwich this lunch time, I was following up some clues about GIS, python and OGC protocols, and happened on Allan Doyle's blog. It's full of interesting snippets, but this one led me to an absolutely fascinating site at the UN where there is a flash animation which shows the evolution of GDP globally, regionally and country by country and the implications for health and child mortality. Fascinating and sobering. Well worth the time you'll take to watch it!
2005/12/12 : 2 comments (permalink)
Metaclasses in Python
This blog entry is about ignorance. My profound ignorance of some computing fundamentals. I keep reading about metaprogramming and metaclasses, and I keep thinking it has relevance for our GML parser (perhaps not the current version, but perhaps a future one). What we want to be able to do in a module is:
class A:
    x = B
class B:
    x = A
...
Which I can't work out how to do in a meaningful way. But Ian Bicking says
With a metaclass you can trigger some code to be run everytime a class with that metaclass is created (and subclasses inherit the superclass's metaclass).
So how does that work? Here is a definition of a metaclass from Mertz and Simionato in an IBM developerWorks article:
In Python (and other languages), classes are themselves objects that can be passed around and introspected. Since objects, as stated, are produced using classes as templates, what acts as a template for producing classes? The answer, of course, is metaclasses.
Mertz also quotes Tim Peters wrt metaclasses:
If you wonder whether you need them, you don't (the people who actually need them know with certainty that they need them, and don't need an explanation about why).
Well, I'm pretty sure we do ... in which case how and why? What use are these things? Firstly, you can create class factories, a concept I hadn't really cottoned on to, but again, from the Mertz article, here is an example:
def class_with_method(func):
    class klass: pass
    setattr(klass, func.__name__, func)
    return klass
which you might use, like:
>>> def say_foo(self): print 'foo'
...
>>> Foo = class_with_method(say_foo)
>>> foo = Foo()
>>> foo.say_foo()
foo
This is getting close to what we want to do.
Of course, I'm not yet sure what A and B need to do ... and that'll depend on something which happens at runtime as we parse an XML file conforming to a GML application schema. But reacting to things at runtime appears to be exactly what metaclasses are about. Must be some value here somewhere, particularly if we combine it with the getattr method applied to classes.
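To see Ian Bicking's point in action, here is a minimal sketch of a metaclass whose code runs every time a class is created, used to get something like the mutually referring A and B above. It is written in Python 3 syntax (unlike the Python 2 snippets quoted above), and the `Registry` metaclass, its `resolve` method and the trick of referring to the other class by name are all hypothetical, made up for illustration; it is one possible approach, not something taken from our GML parser.

```python
class Registry(type):
    """Metaclass that records every class it creates, keyed by name."""
    classes = {}

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        # This runs once for each class defined with this metaclass.
        Registry.classes[name] = cls

    def resolve(cls):
        """Look up the class named by the string attribute `x`."""
        return Registry.classes[cls.x]


class A(metaclass=Registry):
    x = "B"  # refer to B by name, since B does not exist yet


class B(metaclass=Registry):
    x = "A"


# Both lookups succeed because they happen after A and B are registered.
partner_of_a = A.resolve()
partner_of_b = B.resolve()
```

The lookup is deferred until `resolve` is called, which sidesteps the problem that `B` is not yet defined when the body of `A` is executed.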
Well, that's enough for my Sunday night python tutorial ... I think I'm slightly less ignorant although I haven't applied my new learning ... one day soon I'll have to get past reading about the doing of things, and get back into the doing of things ...
2005/12/11 : Categories python : 2 comments (permalink)
Solar Irradiance and Sunspot Cycles
For once I'm reading some real scientific literature. I've been pretty appalling lately, with my RSS aggregator telling me there are about 1400 items from Science and Nature alone that I should have a quick look at (As well as telling me that I'm not spending enough time keeping abreast of developments, I think this also tells me that I need to use some different software - something that does keyword parsing).
Anyway, today, I felt obliged to catch up on some key details of the solar climate link, and in particular, the solar side of things. I'm reading: Solar variability and climate change: is there a link?, by Sami Solanki. This is all stuff I should have known, and indeed have wittered on about before, albeit not in detail.
Key points from this paper are that
There are two major causes of solar variability: solar evolution (related to the solar core, and with timescales of billions of years) and magnetic field variability near the solar surface (driven by a dynamo at the bottom of the convection zone in the solar interior). It's the latter that's probably of interest for solar climate discussion on the scale of decades and centuries.
The major ingredients determining the strength and structure of the magnetic field are the (differential) rotation of the Sun and the (turbulent) convection at and below the solar surface. The interaction between these factors leads to bundling of field lines into features which we can see on earth: sunspots.
The solar magnetic field, and hence the sunspot activity, is strongly time-dependent, and the most obvious example of this is the 11-year solar cycle, visible in the number of sunspots. This cycle itself varies though, with stronger and weaker amplitudes (so weak in the late 1600s that sunspots effectively disappeared during the Maunder Minimum), and some variability in cycle length.
The solar irradiance is a maximum when the number of sunspots is at a maximum, which is initially counter-intuitive (high resolution timeseries of irradiance and sunspots show that the solar output darkens as sunspots transit across the visible solar disk). However, current theory suggests that smaller bunches of magnetic field lines (faculae) which are not visible as sunspots are associated with solar brightening, that the number of these is correlated with the number of sunspots (and the underlying magnetic activity), and that the solar brightening associated with a maximum of faculae dominates over the simultaneous darkening associated with visible sunspot maximum.
Support for the important role of the magnetic field at the solar surface is provided by the fact that the irradiance variability can be reproduced quantitatively by a simple three-component model, with the individual components representing the quiet Sun, faculae and sunspots. This reproduces the solar cycle changes quite well.
However, it is a step further to go to longer (century) timescales. BUT, if one does assume that the relationship between solar activity and irradiance found over the solar cycle also acts over longer times, it is then possible to work out that the Sun was between 2 and 4 W m⁻² less bright during the Maunder minimum than today.
Some evidence to support this assumption (and the conclusion about the irradiance during the Maunder minimum) comes from noting that the underlying magnetic flux from one solar cycle persists into the next cycle, even though the manifestation in terms of sunspots does not correlate between cycles. Solanki has built a model which includes this factor, which reproduces both observed production rates (as measured in ice cores from large glaciers) of an isotope (10Be) believed to roughly follow the open flux of magnetic activity from the Sun, and a reconstruction of geomagnetic activity, itself believed to follow the open flux from the Sun.
This is the first physical mechanism that could contribute to understanding the well known correlation between solar cycle length and northern hemisphere mean temperature published by Friis-Christensen and Lassen (1991, Science, 254, pp698-700):

(figure from public website).
The previous figure is sometimes argued as justifying the assumption that observed climate changes are entirely solar induced. As well as pointing out that the previous figure is only a correlation, that could be explained by a number of factors (or indeed be a statistical fluke), Solanki concludes with a reconstruction of the last 150 years of irradiance, which he then compares with climate records, to show that the most recent two or three decades of warming are not correlated with the solar irradiance ... and thus could be explained by anthropogenic warming:

The above figure is a version of the last figure from Solanki's paper, taken from his institute's public website. The key lines are the blue ones, which are the irradiance reconstructions, and the red lines which are the climate records.
Having discovered their public website, I also had a look at Solar total and spectral irradiance: Modelling and a possible impact on climate (N.A. Krivova & S.K. Solanki 2003, ESA SP-535, 275-284). An interesting figure, and accompanying discussion from that paper was the one which showed that the recent flurry of interest in cosmic rays and clouds cannot explain the last thirty years of temperature increase either:
[Image: cosmic ray flux compared with climate records, from Krivova & Solanki 2003]
(the y-axis information has been removed to comply with my fairuse criteria, but the solid curve is the cosmic ray flux, and the other curves are the climate record).
by Bryan Lawrence : 2005/12/08 : Categories climate (permalink)
Self Contradiction
Here are two five day forecasts for Oxford issued by the Met Office - both were downloaded this afternoon. The first is via the BBC, the second from the Met Office site itself:
[Image: five day forecast for Oxford, via the BBC]
[Image: five day forecast for Oxford, from the Met Office site]
Is it going to rain on Friday or not? Ostensibly these are forecasts issued around the same time (although one can't see obviously when the forecast was issued for the BBC one) ... and as far as I know, both are issued automagically from the same input data. No wonder weather forecasts sometimes have a bad rep.
2005/12/07 : 0 comments (permalink)
Fighting off sleep
A friend of mine bought me An autobiography of a one year old by Rohan Candappa (ISBN 0 0918 8069 6). It's not great literature, but for those of us with circa one-year-olds, it's very amusing to read another take on the day's events:
I sucked contemplatively on my dummy and reflected. By now sleep had put in an appearance and was edging its way over to me. I threw Teddy Bear at it, but it kept on coming. I picked up my blanket and tipped that over the side of the cot hoping to trip sleep up and delay it for a while. All to no avail. Now sleep had slipped into the cot itself. I backed away and squashed myself against the bars. I even managed to squidge half my body into the narrow crack between the edge of the mattress and the side of the cot. But sleep kept coming. In desperation I shouted at sleep as loud as I could. ...
... seems like a scene repeated in my daughter's bedroom several times a day!
2005/12/05 : 0 comments (permalink)
emacs and kubuntu
Emacs. For some reason my apt-get install couldn't find emacs, even though it turned out it was in main. It seems that doing an
apt-get clean
apt-get update
apt-get install emacs
couldn't find it, but when following some forum advice I then did
apt-cache policy emacs21
I got
emacs21:
  Installed: (none)
  Candidate: 21.4a-1ubuntu1
  Version table:
     21.4a-1ubuntu1 0
        500 http://gb.archive.ubuntu.com breezy/main Packages
which makes the point that we need the version (why?), so now
apt-get install emacs21
worked.
I then went on following this forum entry to get the bits and bobs for latex support from emacs.
apt-get install emacs-extra emacs-goodies-el auctex preview-latex xfonts-jmk
at which point I got some errors about auctex not being available while setting up emacs-extra. Hopefully they got fixed.
apt-get install tetex-base tetex-extra
sudo apt-get install kdvi
xrdb -merge .Xresources
Which would seem to have been a problem, because now:
emacs blah.py
gives
No fonts match `-jmk-neep alt-medium-r-*-*-11-*-*-*-c-*-iso8859-1'
Aaargh. Well, it turns out that was in the .Xresources file. But I thought I'd loaded that ... of course ... need to reload the xserver so it can find the file.
Yes. It works.
by Bryan Lawrence : 2005/12/02 : Categories kubuntu (permalink)
Home Wireless Interference
Some frustration at home. Having moved our ISDN access point, we now have the 802.11g access point further from our home computer. Annoyingly, the network is remarkably unstable, working fine sometimes, and not others. The obvious candidate for interference is cordless phones, but our cordless phone is an NTL DECT phone. According to a DECT website the existing DECT spectrum is 1.88-1.9 GHz (the same article talks about opening up the spectrum, so it would be nice to confirm that my cordless phone actually uses 1.9ish GHz). Given the phone is at 1.9 GHz, and our 802.11g is in the vicinity of 2.4 GHz, it would seem to be an unlikely candidate for the problem.
Annoyingly, it looks like the BT voyager wireless modem router that we have - but can't use until BT deliver us broadband - doesn't have the problem. Mind you, we used a different channel for that, so perhaps there is just some problem on the channel our DI-624 is using (for some reason, after a firmware upgrade to 1.31, it has been stuck on channel 6).
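For what it's worth, the frequency argument can be checked with a few lines of Python. This is only a back-of-envelope sketch: it assumes the usual 2.4 GHz channel plan, where channels 1 to 13 are centred at 2407 + 5 x channel MHz, and it takes the DECT band edges from the figures quoted above. The function name is made up for the example.

```python
# DECT band as quoted above, in MHz.
DECT_BAND_MHZ = (1880, 1900)


def wifi_24ghz_centre_mhz(channel: int) -> int:
    """Centre frequency of a 2.4 GHz 802.11b/g channel (channels 1-13 only)."""
    if not 1 <= channel <= 13:
        raise ValueError("only channels 1-13 follow the 5 MHz spacing")
    return 2407 + 5 * channel


centre = wifi_24ghz_centre_mhz(6)   # the channel the DI-624 is stuck on
gap = centre - DECT_BAND_MHZ[1]     # distance above the top of the DECT band
print(centre, gap)                  # prints: 2437 537
```

So channel 6 sits more than 500 MHz above the top of the DECT band, which supports the hunch that the phone is not the culprit.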
I guess my options are to try and unstick the DI-624 by reloading the firmware (it is the latest version available on the UK site, although the US site appears to have later firmware), or try a repeater, or hassle BT again about our broadband delivery time. Meanwhile:
This article (http://www.wi-fiplanet.com/tutorials/article.php/1571601) on wireless repeaters implies that most access points can operate as repeaters for specific wireless frequencies.
This apparently obsolete repeater seems to have been the only cheap repeater option.
Update: 31/12/05: Well, apathy ruled. I waited until BT installed broadband and the new router seems not to have the same problems!
by Bryan Lawrence : 2005/11/29 : Categories broadband : 1 comment (permalink)
Starting to consider python eggs for real.
Back in March I discovered the python eggs project. Back then, it was a vision. Today, I went back to see the status of the project, and see if eggs are now actually practically usable.
It looks like they are, but it also looks like I'm well behind on tools for installing python applications, and understanding those is a prerequisite for understanding eggs. (So, it's not yet obvious to me whether understanding them is a prerequisite for using them ... as a package user ... it seems it certainly is for an egg layer).
Firstly: setuptools is:
a collection of enhancements to the python distutils that allow you to more easily build and distribute python packages, especially ones that have dependencies on other packages.
It would appear I have to understand setuptools before I can proceed, and if I want to play with them, I might as well install the setuptools using a python script called ez_setup.py. Hmm, ok, well I don't have enough time to play with this now, but I'll report when I do.
2005/11/29 : Categories python : 0 comments (permalink)
First matplotlib, now the GNU Data Language
A colleague (thanks Kevin) has drawn my attention to the GNU Data Language. This is a GPL application that essentially provides complete compatibility with code written for RSI's Interactive Data Language (IDL). What's especially exciting about this is that GDL has an interface to python (and it can be built and used as a python module as well). Both ways mean that my data analysis world is coming together ...
2005/11/22 : Categories python : 0 comments (permalink)
BT still have dreadful customer service
Well, it was time for my regular attempt to get my ISDN line converted to broadband (I try every 18 months or so as BT always seem to think they can supply the service).
So, I waited at home this morning for an engineer visit. Did one come? No. Did they ring to tell me why not? No.
So I rang them, and discovered the visit had been cancelled because "my exchange has no broadband capacity". What marvellous customer service! They cancelled it, but didn't bother to tell me. I'm still considering what compensation to ask for!
Meanwhile they can't tell me where I am in the queue (in case someone drops broadband service). No, they can't tell me if the capacity is due to be upgraded. I'm just supposed to wait. The guy I spoke to claimed they didn't know if it would be days, weeks or months before they'd have capacity.
I see that this has happened in a number of places, one of which was Guildford, a year or so ago. Back then apparently BT claimed
The situation was "not ideal" ... but it was "not typical of what is going on ... Guildford was hit by a "sudden surge in demand."
Hmm, once might be sudden and unusual, but it's not hard to find other examples on the web - perhaps typical is exactly what it is!
Update (21/11/05): Today BT informed me they don't know when they'll have capacity, but they have offered to move my home highway termination from one room to another free by way of compensation. As it happens this is good compensation for us, as the current digital box (and all the routers) are in my daughter's room ... and she's about to learn to crawl. Given we have satellite broadband, this means we can live without landline broadband for quite a while! So, assuming the engineer arrives when predicted, I'll be happy to report that while BT's customer handling systems are still appalling, the people behind them can still do the right thing!
Update (25/11/05): Today they moved the ISDN termination exactly when promised. The cheerful bloke who turned up also said he'd spent yesterday adding capacity at the Wallingford exchange ... I wonder what happens next :-)
by Bryan Lawrence : 2005/11/17 : Categories broadband (permalink)
Accuracy, Curation and the Newspaper
The Guardian has an article on curation today: "Digital curators wage war on terabytes". They interviewed Chris Rusbridge (of the national digital curation centre, the DCC), Kevin Schurer (of the UK Data Archive) and myself. I can't speak for the others, but my interview consisted of a phone call ...
Most of the article is a fair statement of the issues we face, but one paragraph puzzled me. Apparently I said (amongst other things they repeated accurately):
What's needed ... is a management tool that acts as a framework, listing every aspect of data analysis that the researcher should be considering and suggesting possible processes that ensure the data collected is accurate.
Well, I don't remember saying that. I did talk about needing management tools that help us ensure that we are doing the process of curation properly (and that I hope the DCC might contribute by providing some certification of our processes). I probably also said some things about data accuracy, but I'm pretty sure I didn't conflate the concepts in such a way that I implied a management tool would help a researcher do better at accurate data collection. (No, I'm more than pretty sure, I know I didn't say that. If I had, I would expect my scientific colleagues to tell me where to go ... there is no substitute for scientific excellence in the collection of quality data, and there is no place for a "management tool" in that process!)
Actually, what I should have said is that giving a metadata tool to scientists might help create better information about the data, which might make it more useful. One might argue that such a tool is a "management tool", but, there is no way I would call it a "management tool" when I gave it to a scientist if I expected them to use it :-) - there are some phrases which are just repugnant :-)
Notwithstanding my little moan about one paragraph, I have to say I thought it was a fair article, and it's good to see the national press recognising what will become a problem we all will recognise in the years to come - how to ensure our digital heritage is as available to ourselves in the decades to come as our paper heritage is!
by Bryan Lawrence : 2005/11/15 (permalink)
t-i-db
I spent today at (one day of the) 10th ECMWF workshop on operational meteorological systems. I gave a talk on our CSML work ... but the thing I want to blog about is a presentation by J. Simões from Portugal on his work developing the t-i-db temporal extensions to the relational database model.
Unfortunately I was having one of those multi-tasking days where I was trying to handle a bunch of email events at the same time as taking in the talks, so I didn't get the details as much as I should have, but it appears that his group have developed an objectacious extension to RDBs (MySQL and Oracle) that supports the ability to do queries based on the object schema. The work they've done so far includes support for grib objects, stored as blobs and metadata.
Of course, lots of folk have done that, but there also appears to be a kde based browse tool that can parse the contents of the database, and allow the user to see data objects and metadata. It looked well worth further investigation.
(As an aside, chasing this up reminded me of the existence of various versions of scientific linux (e.g. scientific linux itself, and Pai-pix). I need to remember these in the context of providing a "release" of NDG - it may be far easier for us, and for data providers, to release an OS with everything already packed in, than to expect them to install everything themselves. Of course we'd have to piggy back on someone's distribution, not necessarily either of the above, but it's a thought).
by Bryan Lawrence : 2005/11/15 : Categories computing : 0 comments (permalink)
On Moore's Law
Last week I was involved in a conversation where I was explaining how the BADC deals with storage costs. It's not a complicated scheme, we basically work on the assumption that as long as we can store the next four years' data, we can probably store the entire previous archive as a marginal activity.
Given a doubling time of thirteen months in storage capacity (associated with a halving in cost), we figure that in four years' time we'll have about 10-13 times the storage for the same cost as our current storage (why thirteen months? See below). That also means our current storage will be about 5-10% of the total volume. So we front load the cost of putting data into the archive (by a tiny bit), and then our business model doesn't need to deal with the cost of long term storage.
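The arithmetic behind those numbers can be sketched in a few lines of Python. This is only an illustration of the doubling-time assumption stated above; the function name and the 48-month horizon are choices made for the example, not part of the BADC's actual costing.

```python
def growth_factor(months: float, doubling_months: float = 13.0) -> float:
    """Capacity multiplier after `months`, assuming a fixed doubling time."""
    return 2.0 ** (months / doubling_months)


# Four years at a thirteen month doubling time (the assumption in the text).
factor = growth_factor(48)

# Today's archive as a fraction of what the same money buys in four years.
share = 1.0 / factor

print(f"capacity multiplier: {factor:.1f}")
print(f"current archive as share of future capacity: {share:.1%}")
```

With a thirteen month doubling time the multiplier comes out near the top of the 10-13 range quoted above; a slightly longer doubling time (say fourteen or fifteen months) moves it towards the bottom of that range.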
A digression: At some point or another one might start wondering when Moore's Law might break. Yesterday, I found this from Greg Matter - a nice reminder that technically Moore's Law has nothing to say about storage efficiency advances. It's a nice blog entry, because it makes the point that the continual improvement in price performance is not just about smaller and smaller transistors, but also about improving architectures etc.
As he said:
Continuing to throw transistors at making single processors run faster is a bad idea. It's kinda like building bigger and bigger SUVs in order to solve our transportation problems.
Then he went on to say:
Just as the '80's discrete processors were killed by microprocessors, today's discrete systems motherboards full of supporting chip sets and PCI slots with sockets for microprocessors will be killed by microsystems: my word for the just-starting revolution of server-on-a-chip ... Where does end up? Well, we are now dying to get to 65nm (Niagara is 90nm) so we can get even more transistors on a chip in order to integrate more and bigger systems. Just as the microprocessor, harvested the pipeline inventions of 60's and 70's, microsystems are going to integrate the system innovations of the 80's and 90's.
and
Moore's Law is VERY much alive
OK, back to my storage doubling time. If it's not Moore's law, what is it? Wikipedia calls it Kryder's law. However, we also remember it's not just about storage capacity. This figure (grabbed from Oracle) makes a good point about another reality: I may be able to store more data, but getting it back is starting to be a big problem:
[Figure from Oracle: image not available]
And I have to worry about the software I'm using: another amusing variant of Moore's law is Wirth's law:
Software gets slower faster than hardware gets faster.
All of this reminds us that our complete business model has much more than just the storage cost in it, so as well as dealing with speed of access, and software, we have one more major cost: there is no doubling in the speed of human thought, and we still have to manage all this data!
by Bryan Lawrence : 2005/11/14 : Categories computing : 1 trackback (permalink)
Trajectory Support for campaigns
The BADC provides support for field campaigns in a number of ways. One of those is to provide information about the history of air parcels, and information about their future trajectories. We do this using a code we developed ourselves based on code from John Methven at the University of Reading. We are in the process of considering updating how we provide support, so are developing a use case.
[Figure: trajectory support use case diagram, image not available]
Any feedback we can get is welcome, either to me directly, or via this web page. (You can find a pdf of the above jpg here).
Update (24/11/05): The figures have been updated following feedback. (The originals are available by replacing the 24 in the links with 10). Two changes:
Added the trigger concept rather than the run script being created
Added the requirement to check availability of NWP data (likely only to work with the BADC version, we probably just check the dates for the ECMWF version).
by Bryan Lawrence : 2005/11/10 : Categories badc : 0 comments (permalink)
Archival of data from numerical simulations.
The BADC (and other NERC designated data centres) need to archive simulated data, but there is a real question as to what data should be archived (and what archival should mean for simulation data).
Accordingly, we've come up with a document (pdf) which we believe describes a reasonable policy for choosing whether to handle simulated (numerical model) data.
We'd be interested in any feedback folk might want to give us, via email to me, or via comments to this page.
by Bryan Lawrence : 2005/11/10 : Categories curation badc : 0 comments (permalink)
The Future of CF
The Climate Forecast conventions for netCDF have significant impact on a number of activities at the BADC, and in the wider academic community. Earlier this year we held a meeting where we got as many of the original CF authors together as was practical. The main issue was how we move forward from where we are now.
The outcome of this meeting was a white paper (pdf) outlining some possible futures.
If you are interested in this issue, please read it and do two things:
Note the timescales that we wish to adhere to in moving forward, which start by getting feedback on the plans outlined in the paper. We need that feedback by the beginning of December.
Note also the implied request for folk to start lobbying any organisations (including international organisations and/or your home institutions, if appropriate) to find out whether they would be able to contribute financially to the maintenance of CF, and which mechanisms are preferable. Feedback on this too is sought!
I'm looking forward to responses. Feel free to send them to the mailing list, or to me personally, or as comments here. I'll try and synthesize the responses into something coherent in early December.
by Bryan Lawrence : 2005/11/09 : Categories curation cf : 0 comments (permalink)
bdbxml and xquery
Currently NDG uses the eXist database for our native XML database. There are three main reasons for this:
It's relatively easy to install and get going
It supports xquery
It has an effective webservice interface.
We are however worried about how it will perform with hundreds of thousands of documents. At various times we've considered the bdbxml native xml database. Also at various times, my team have rejected it, for reasons which we need to readdress.
For example, the xquery issue; this is verbatim from an October 2005 email on the mailing list:
I am wondering whether there is any efficient way to query large XML document with Berkeley DB XML (Java API) using XQuery. I am trying to query an large XML document (about 120M) with Berkeley DB XML using XQuery. I created appropriate indexes on the document and tried a very simple query ...
Response:
In BDB XML 2.1.8, indexes do not help for queries within documents. They are only used to narrow the search to a smaller set of candidate documents. In a few weeks, we will release BDB XML 2.2. It supports node indexes within large, single XML documents ...
From the documentation it looks like this is xquery 1.0 support.
That leaves us wondering about ease of installation and web service interfaces. From what I can see we'd have to roll our own for the latter, and the former involves a number of third party libraries ...
by Bryan Lawrence : 2005/11/07 : Categories computing xml ndg : 0 comments (permalink)
The Big Picture on Climate
I've been wittering on about scientific consensus on the big picture on climate change for a long time. There is major consensus. We have a problem. William Connolley has it exactly right: From the point of view of climate change, the top level is "The world is getting warmer, we're causing it, and it will continue to get warmer in the future". This is pretty well universally agreed on now.
by Bryan Lawrence : 2005/11/04 : Categories climate : 3 comments (permalink)
The Status of Jython
I recently attended a meeting where a colleague stated that his group were building a lot around jython. He also stated that jython development had taken off since IBM had started to use it internally. That got me wondering about what I could find out about jython status as the NDG project is about to do a technology review. Herewith my half hour of chasing up on the subject.
Firstly, the following from the faq (dated Feb 2005) from http://www.jython.org is interesting to me:
Current status is that CPython 2.3 on Windows2000 is about twice as fast as Jython 2.1 on JDK1.4 on Windows2000. However, because of Java's slow startup time, Jython starts much more slowly (2.4 s) than CPython (80 ms). This means you don't want to do standard CGI in Jython, but long-running processes are fine.
... that's not good news for us. Let alone how we might exploit web services packages (unless via java libraries, which is I guess why one uses jython).
Now what about the language features? Well, the wiki says:
The final release of Jython-2.1 occurred on 31-dec-2001. Current work include improvements to java integration and implementing the lastest features from CPython 2.2/3.
Hmm, so no releases since 2001? I see from the jython-users mailing list that this really is the case, someone in October this year said:
I've down-loaded jython_Release_2_2alpha1.jar from sourceforge.net and I've downloaded the jython source with CVS.
which makes me wonder what development has actually occurred for my colleague to get so excited.
A bit more looking at the jython-dev mailing list tells us that the wiki might be out of date, but that the development is still trying to repeat the cpython 2.2 features ... let alone cpython 2.4.
One last thing I need to know: how about Numeric and jython? Well, it appears that something exists in this space, jnumerical:
JNumeric provides the same functionality as the core of Numeric module and aims to provide all of the standard extensions to Numeric module (FFT, LinearAlgebra, RandomArray).
Well I guess that's good news. I ought to find out how quick it is though!
Meanwhile, I've run out of time in my investigation. What do I think?
Well, it's difficult to get excited for now, particularly since jython development seems well behind cpython, although I guess since one gets simplified access to java libraries, there is maybe real functionality more easily accessible there.
If anyone wants to update/correct me, then please do!
by Bryan Lawrence : 2005/11/04 : Categories python : 0 comments (permalink)
Quiet October
Well haven't I been quiet? I guess that's what happens when you spend two weeks on leave in a month, one week away in China, and one day in each of Liverpool and Leeds ... plus sundry meetings away from the lab. The bottom line was that I have spent six days in the office in October, most of which was in timetabled meetings ... no wonder I feel like I haven't achieved anything this month! (But I did have a great week in Wales in the third week of October, basking in unseasonable sunshine - and avoiding the odd heavy shower. I can recommend Llangrannog if you want a quiet - we were out of season - little village with a lovely beach, good food, and good coastal walks).
2005/10/31 (permalink)
Disgusted
Among the many things I've caught up on was reading a disgusting article on Real Climate. You can guess my feelings when I read the article which begins with:
Today we witnessed a rather curious event in the US Senate. Possibly for the first time ever, a chair of a Senate committee, one Senator James Inhofe (R-Oklahoma), invited a science fiction writer to advise the committee (Environment and Public Works), on science facts --in this case, the facts behind climate change. The author in question? None other than our old friend, Michael Crichton ...
I hate coming back from leave and having my blood pressure go up within the first day (but I suppose that had already happened as I went through an inbox full of drivel) ...
by Bryan Lawrence : 2005/10/10 : Categories climate crichton : 1 comment (permalink)
Beijing Beauty
I've had a week in Beijing attending a WMO metadata workshop, and a week off, during which my only use of a computer was to play with photographs!
The meeting was very good, but I have to say I really enjoyed getting to see a couple of famous places. What I didn't enjoy was coming home with a case of Beijing Belly and spending a fair bit of my week off less than healthy ...
The Summer Palace
The amazing long gallery:
[Photo: the long gallery, image not available]
and various temples, including:
[Photo: temple, image not available]
The Great Wall
A bit of a shame that it was so foggy, but still "great":
[Photo: the Great Wall, image not available]
2005/10/10 : 0 comments (permalink)
python xml frustration
While I really love python as a programming language, and I'm committed to using XML, every now and again I come across something that doesn't work as easily as one might want.
What I want to do is find all elements in an xml document which have a specific attribute. This is the same thing that Bernard Lebel wanted to do. He'd moved to elementtree from Beautiful Soup (BS) because of performance reasons. It turns out that there was a bug in BS which has now been fixed, which means it now works fast enough for him, but I'm still stuffed. I want to use just one xml parsing code in my projects, not three or four (I seem to have used beautiful soup and elementtree and libxml2 recently, and it seems something called pdis might help). But I want one package ... is that too much to ask for? (Nelson Minar seems to think the same way). And that one package had better be easy to install and use (unlike Amara and 4suite which I've never had the energy to make work).
Elementtree is easily the most pythonic, and can be used as one additional file, but it has one or two severe failings, and this is one of them. Please, Please, Fredrik Lundh, can we have attribute support in the Elementtree xpath support?
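For what it's worth, the ElementTree that now ships in the Python standard library (`xml.etree.ElementTree`) does accept attribute predicates in its limited XPath, so the task can be sketched as below. This is a minimal illustration on a made-up document, not a statement about the 2005 release, which is exactly what the complaint above is about; the plain tree walk is shown as a fallback that does not rely on XPath at all.

```python
import xml.etree.ElementTree as ET

# A made-up document for illustration only.
doc = ET.fromstring(
    "<root><a id='1'><b/></a><c id='2'/><d name='x'/></root>"
)

# Limited XPath: every descendant element that carries an 'id' attribute.
with_id = doc.findall(".//*[@id]")

# Equivalent plain-Python walk. Note that iter() also visits the root
# element itself, whereas './/*' only matches descendants.
walked = [el for el in doc.iter() if "id" in el.attrib]

print([el.tag for el in with_id])
print([el.tag for el in walked])
```

Both approaches return elements in document order, so on a document whose root lacks the attribute they give the same list.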
2005/09/22 : Categories python : 0 comments (permalink)
Understanding networks
Now that we're trying to take data from the Earth Simulator, we're starting to have to understand our international network connectivity a bit better.
Here is a snippet from the traceroute from us to the Earth Simulator:
po2-0.geant-gw1.ja.net 2.009 ms 1.987 ms 1.979 ms
janet.uk1.uk.geant.net 2.355 ms 2.278 ms 2.169 ms
uk.ny1.ny.geant.net 70.912 ms 70.916 ms 70.799 ms
sinet-gw.ny1.ny.geant.net 73.785 ms 73.695 ms 73.767 ms
nii-IX1-P0-0.sinet.ad.jp 244.306 ms 244.133 ms 244.139 ms
from which you can see that we get into the SuperJanet->Geant gateway in about 2 ms (i.e. from the UK research network to the European research network). At first glance it seems bizarre that it takes nearly 70 ms to get to the Europe to Japan gateway, until we spot the ny in the hostname and think about the routing. The timing suggests it's about 7000 km away ...
How did I work that out? Basically we think the fibre speed is about 2/3 of the speed of light in a vacuum, so that's about 2x10^8 m/s, so 10 ms should be about 2x10^6 m or 2000 km. Remembering that the times are round trips, we can simply say that 10 ms round trip time corresponds to 1000 km distance in the fibre. So we have about 7000 km to the ny.geant.net router, which is plausibly New York, and another 170 ms or about 17,000 km on to Japan. This is plausible as the fibre across the Pacific is of order 12,500 km long and there must be a few thousand km of perambulations across the U.S. What's amazing is that there is no router in the way, it's all switched.
(These timing distance things can be off by 10-20%, but it's good enough for this sort of calculation.)
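The same back-of-envelope conversion, written out as a short Python sketch. The fibre speed is the 2/3 of c assumed above, the round-trip times are the first column of the traceroute snippet, and the function name is just for illustration; the result carries the same 10-20% uncertainty.

```python
FIBRE_SPEED_M_PER_S = 2.0e8  # roughly 2/3 of the speed of light in a vacuum


def one_way_km(rtt_ms: float) -> float:
    """Fibre distance (km) implied by a round-trip time in milliseconds."""
    one_way_seconds = (rtt_ms / 1000.0) / 2.0
    return FIBRE_SPEED_M_PER_S * one_way_seconds / 1000.0


# Round-trip times taken from the traceroute snippet (first sample per hop).
hops = {
    "janet.uk1.uk.geant.net": 2.355,
    "uk.ny1.ny.geant.net": 70.912,
    "sinet-gw.ny1.ny.geant.net": 73.785,
    "nii-IX1-P0-0.sinet.ad.jp": 244.306,
}

for host, rtt in hops.items():
    print(f"{host:28s} {rtt:8.3f} ms  ~{one_way_km(rtt):6.0f} km from us")

# Distance of the last leg, from the New York gateway on to Japan.
last_leg_km = one_way_km(hops["nii-IX1-P0-0.sinet.ad.jp"]
                         - hops["sinet-gw.ny1.ny.geant.net"])
print(f"New York to Japan leg: ~{last_leg_km:.0f} km")
```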
by Bryan Lawrence : 2005/09/16 : Categories computing : 0 comments (permalink)
Certification of Repositories
The RLG and (US) National Archives and Records Administration (NARA) have come up with a draft guide (pdf) for determining whether a digital repository can be certified as a trusted location for digital collections. Comments are requested by January 2006, more details here.
by Bryan Lawrence : 2005/09/16 : Categories curation : 0 comments (permalink)
The Microsoft XML saga.
A colleague of mine has asked me to explain more fully my concerns with Microsoft XML, which currently have me forbidding MS docs to be in the long term BADC archive. So here goes.
Let me begin however, with a disclaimer. I use Microsoft Office (in particular I positively like powerpoint, excel and word), but I run them using Crossover Office on a linux laptop (but all my Microsoft products are properly licensed!). I have no problems with paying for software, what I do have a problem with is ensuring that
My own IPR is not inextricably bound up in the IPR of the software I use to encode my IPR ... i.e. my wordprocessor or whatever, and
Any documents I acquire, or create, are readable into the indefinite future. This "curation" or "preservation" requirement requires confidence that I can deal both technically and legally with the format of the documents (and that I can afford to do so).
In practice this last requirement (affording to do so), means in the medium term, I strongly dislike annual license agreements (happy to pay annual maintenance), I want to own my copy of the software. I think the brave new world of DRM is going to come a cropper with people like me, who are in the majority on this (see NZ government opinion and this from Tim Bray and links therein for similar opinions).
Anyway, if I take a sufficiently long view, I can assert with a deal of confidence that whatever software I use to create documents will not be the software I use to read the documents. There are all sorts of reasons why this may be so: I may no longer have a license for the software (if it was licensed); The company producing the software may no longer exist; etc.
What I need to be sure of then is that
I (or someone else whom I can afford to consult) understand the format the data is in, and
I am legally allowed to use the data.
Note how I've segued from document to data. I've done that, because this is a generic problem, it's not just about Microsoft, but they've got their heads above the parapet at the moment.
These are in fact exactly the same issues that the State of Massachusetts are dealing with in a very public manner. I've been writing about this for a long time (3rd June, 1st of June, 24th of March, 31st of Jan, 18th of Jan). Likewise, the NZ government is tackling it (op.cit.), the Norwegian government and many others including the EU (via Tim Bray again):
Transparency and accessibility requirements dictate that public information and government transactions avoid depending on technologies that imply or impose a specific product or platform on businesses or citizens.
As I say, Massachusetts have been exemplary in the open manner in which they have taken the issue on (starting with a clear policy - pdf - and public discussion) and they've come to some conclusions:
All electronic documents "created and saved" by state employees would have to be based on open standard formats, and
Only two document types can be used in the future - OpenDocument and PDF.
NB: this again is the State's definition of open standard formats:
Specifications for systems that are publicly available and are developed by an open community and affirmed by a standards body. Hypertext Markup Language (HTML) is an example of an open standard. Open standards imply that multiple vendors can compete directly based on the features and performance of their products. It also implies that the existing information technology solution is portable and that it can be removed and replaced with that of another vendor with minimal effort and without major interruption.
There are a lot of people at Microsoft who don't understand why we're making such a fuss over this, and indeed there are a lot of positive policy statements from Microsoft, e.g. Jean Paoli back in 2003 said:
... customers now have the option of saving any Microsoft Word document, or Microsoft Excel spreadsheet in XML, which allows those documents to be read or written through XML Web Services by any application, on any platform, through any device ... To ensure broad availability and access, Microsoft is providing, under a royalty-free license, complete documentation and a full description of the Office 2003 XML Reference Schemas using XSDs (XML Schema Definitions), the cross-industry standard developed by the W3C. Designed for ease of use and broad adoption, the royalty-free license provides access to the schemas and full documentation to all interested parties.
But none of these are legal opinions. And, from a corporate point of view, if I was interested in broad availability, I'd use an open standards process (which is all Massachusetts and other governments want), not wrap up the storage format in something for which the legal status is very murky. In particular, I see no legally binding statement that my documents will be legally accessible from third party software (including that under the GPL) in perpetuity. There are a lot of thoughtful comments similar to this in the comments section of Brian Jones' Sep 5th blog entry, in particular those of Craig Ringer (here and here).
So, in summary, where am I on this, and what would change my mind to allow the BADC to archive MS docs:
Either the MS-XML format becomes an open standard with an unencumbered legal position that allows all software and all usage in perpetuity, or
MS supports the OpenDocument format, and I store that ...
The bottom line here is that we are enforcing open standards on our data, and all the same reasons why we do that apply to the accompanying documents.
Meanwhile, we will happily store pdf versions of MS docs ...
by Bryan Lawrence : 2005/09/14 : Categories curation msxml : 0 comments (permalink)
Katrina and global warming: who knows?
People have been asking me whether I think hurricane Katrina is linked to global warming. My response is and was: nobody knows!
However, there is a rather good article at RealClimate on the subject. The bottom line is that we will never know about Katrina, but we can assert that in recent years there appears to be an increase in hurricane destructive potential (intensity) that coincides with an increase in sea surface temperature. Further, that greenhouse induced warming may be responsible for about half the sea surface temperature warming (and hence half the increase in hurricane intensity). That says nothing about an individual event of course ...
by Bryan Lawrence : 2005/09/06 : Categories climate : 0 comments (permalink)
Summer is not for blogging
I've been aware that my blog has been getting little attention of late. One might ask why it should get any? If so, the answer is that in some cases it helps me think about something long enough to write something about it, and in other cases, I simply use my blog to record some thoughts for later consumption (by me or anyone else). On even rarer occasions I use my blog to have a rant about something that has annoyed me ...
Anyway, in general, I have found keeping a blog a good thing in that it has made me do some of the things that I wanted to make more time for (reading Science and Nature, and keeping up with some computing trends). It hasn't (yet) helped me keep up with specific atmospheric science things, but that's in part because I haven't yet chased up all the appropriate journal feeds, and by and large, the atmos community don't blog. Times will change on that front.
But it's summer and now I'm a parent, so I've found that even the small bit of time I used to make for blogging has been hard to find ... so I read this from Julie Leung with a great deal of sympathy and agreement:
If blogging were an Olympic sport, it would be in the Winter Games. Summer competes too much for me with blogging time. Flowers need water. Plants need pruning. Gardens need weeding. The deck needs to get done. And kids need fun...trips to the beach, visits to the aquarium, quiet time with the ice cream truck.
Saturday I hoped to catch up with email and blogposts. But the girls asked me if we could go to a playground. And they also wanted to run in the sprinkler. One day, eventually, I'll have time for my online life. But I won't always have the weather to have fun with the girls outdoors. And I won't always have girls at home to enjoy either. So the decision was easy.
In my case, my daughter is just under five months old. I want to spend as much time with her now as I possibly can, I can't believe how fast she's growing ... so when I get home at night, and in the weekends, I spend as much time as I can outside ... time will tell as to whether my blogging picks up in Winter ...
by Bryan Lawrence : 2005/09/02 : Categories blogging : 0 comments (permalink)
Iraq
In general I plan to stay out of politics on my blog, or at least politics that are unrelated to the environment ...
However, once in a while I read something that seems so sensible I want other people to read it. Here (via, yes, you guessed it, Tim Bray) is simply the most sensible exit strategy I've seen for Iraq. Yes, it's written for a US audience, but it applies as much to UK involvement too. Read it.
2005/09/01 : 0 comments (permalink)
op cit
For years I've been using the term "op cit" to do citations, and wondering what it was actually an abbreviation for (some of us got no Latin at school). It turns out that it is
short for the Latin phrase opera citato, meaning "in the work already cited."
which is of course how I use it, but it's nice to know the actual latin, and has the bonus of helping me understand the word opera as well :-)
2005/08/25 : 0 comments (permalink)
Windpower: Local Solutions to avoiding burning carbon
I've mentioned in the past that I'm interested in small scale wind power generation. It strikes me as crazy that many of those advocating the importance of windpower seem to think that it can only be done with massive wind farms. There are many of us who live in windy locations who could generate a significant amount of local power with small wind turbines. Done right (i.e. as part of the house structure) it ought to be relatively inoffensive to look at (we're used to chimneys already) and harmless to birds (or not any worse than our large glass panels which take out the odd bird from time to time ... certainly there are no large raptors flying around our house).
So I'm glad to see the advent of just such a house-scale wind turbine. There are some obvious questions first:
What will it cost?
How noisy will it be?
Will I be able to feed back energy into the national grid?
How windy is my property?
(Update) Do I need, and if so, will I get planning permission?
The windsave site is rather coy about the first question, but it's not quite an issue yet as I wouldn't be planning to do anything about it til next year (there are only so many house based projects one bloke can do with a new child and a full time job). They do point out that there are some quite good subsidies available though ...
The answer to the second question is quite comforting:
Free spinning (loudest noise potential)
5 metres behind blades gusting to 5m/s /12miles per hour, 33.0 dB
5 metres behind blades gusting to 7m/s / 16 miles per hour, LAeq 52.0 dB
3 metres behind blades, height 1.5m background noise LAeq 36.0 dB
Compare these with typical fridge/freezer when running of between 40 and 50 dB. But I guess it would be interesting to know what it would sound like with winds at 30 mph (maximum generation) and 50 mph+.
Apparently the energy can feed back into the national grid, although it's likely the average house "on idle" will use as much as it can get from these systems (given it's rated at about 1 kW at not far short of full speed). It's also not obvious there will be any easy way of metering that ...
The second to last question is something I can maybe do something about ... it might be interesting to see a map of mean and standard deviation wind speed for the UK. The obvious problem is that one needs the wind info at the microscale, and we only have the data at a much larger scale, but it'll still be interesting. Now to find the time ...
(Update continued) The planning permission issue will be interesting too ... given one needs permission for a second satellite dish, it would seem likely that councils will try and get involved ... but I hope they won't be obstructive. If house scale wind turbines become common, they'll be just as much part of our normal housescape as chimneys.
The bottom line though, from their web site is this:
There are approximately 23 million households in the UK today. If just 2 million of them - roughly 10% - were to have a micro-wind generator installed on their roof, that would take a potential 1,000,000 tonnes of CO2 out of the environment each year!
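A quick check of the arithmetic in that quote, using only the three figures it gives (the variable names are mine):

```python
households_uk = 23_000_000      # from the quote
households_fitted = 2_000_000   # from the quote
co2_saved_tonnes = 1_000_000    # from the quote, per year

# Share of households that would have a turbine.
fraction = households_fitted / households_uk
# Implied saving for each turbine.
per_household = co2_saved_tonnes / households_fitted

print(f"{fraction:.1%} of households")
print(f"{per_household} tonnes of CO2 per household per year")
```

So 2 million is a bit under 9% of households rather than 10%, and the claim amounts to half a tonne of CO2 per turbine per year.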
by Bryan Lawrence : 2005/08/23 : Categories environment : 2 comments (permalink)
Service Chaining
Having spent the last four years as part of the UK e-science community, and buying into the wonders of what "the grid" can do for me, I've just spent a bit of time looking into exactly what is in the ISO19119 standard for geographic information services. Despite my colleague Andrew Woolf telling me how important this was, I've never really looked into this standard (despite running round banging on about how important standards are, and going on about how good they are in the metadata world, I hadn't had time to read this one ...). Anyway, it turns out to describe what most of the e-science world has been reinventing, and redocumenting in an ad hoc manner, for the last few years. Oh why oh why do standards organisations hide their wonderful products so successfully that no one uses them?
I thought the following was an insightful analysis of patterns of workflow management. Essentially, they describe three types of workflow management:
Transparent: user sees all of the services
Translucent: workflow aids the user
Opaque: aggregate service hides services
These three modes are described in three UML diagrams in the standard, only one of which is necessary to get the flavour, e.g. the translucent case:
[Figure: UML diagram of the translucent workflow case; image not available]
The transparent case differs in that the workflow is set up by the user after looking at catalogue descriptions, and in the opaque case the catalogue describes the result of a series of workflow activities which are, as the name suggests, opaque to the user.
While this isn't a great leap forward, I claim it's insightful because it discriminates simply between the various things we as a community have been building, and clearly identifies how workflow will be useful to us. Now, if only someone was building workflow engines based on the ISO service metadata descriptions in this standard ...
by Bryan Lawrence : 2005/08/21 : Categories computing ndg : 0 comments (permalink)
When will the oil run out?
A few days ago I reported articles in Science on the rate at which oil might run out. Today I found this, which states that the Saudis are saying that OPEC won't be able to support demand in a mere ten to fifteen years. (I haven't read the entire article as I don't have access to an FT subscription.)
This at the same time as
Combined production of crude oil and liquids by some of the world's largest non-OPEC oil companies declined 0.2% in the first half of 2005 compared to the same period in 2004.
(source: Green Car Congress)
Maybe this is going to be a problem before I get to retirement ...
by Bryan Lawrence : 2005/08/15 : Categories environment : 0 comments (permalink)
Nature DOI Failure
I don't know what it is with Nature, but yet again, when I use their RSS feed to browse the table of contents of the current issue, not one of the doi links to the complete story resolves!
It's not very professional for a journal with such a great reputation.
2005/08/11 : 0 comments (permalink)
Metadata, XML and Deja Vu
It's funny how some concepts and issues are repeated in multiple communities. Recently I attended a meeting called "Activating Metadata" (agenda,talks) held at NIEeS. The sharp eyed amongst you will have noted that the link for the talks is .../metadata2 ... talks from an earlier event are here. There have been other meetings on a similar theme which don't seem to be archived.
These metadata events are stimulating and boring in equal measure. Regrettably we go over much the same ground every time (boring), but I learn new things when I meet new communities (stimulating). Fortunately this second meeting involved a very different community, being primarily geographers and users of geographic data (in a loose sense of the term). Of course they have their own vocabularies, and needs for specific metadata.
Anyway, there seemed to be a feeling amongst some that metadata standards were a hindrance to academic use of data, and that more "of the right sort of metadata" was needed. While I would obviously argue that more metadata is needed in just about any context, the usual argument that "the standards didn't do it for me, therefore I wouldn't use them" was frustrating. Since the meeting, I've written to one of the participants, and amongst other things I wrote:
There are solution frameworks out there, and if one doesn't want to use them, one simply contributes to the proliferation of options and consequent user confusion. Most importantly, one needs to understand that standards exist not as an effort to constrain all possible metadata, but to constrain those elements where there is a chance of commonality (and interoperation between groups). You're free to produce whatever else you like that is relevant to your own user community.
In quite a different context Dare Obasanjo of Microsoft has been discussing the new buzzword microformats of XML. This discussion is about whether or not folk should be introducing their own tags into XML documents (with or without namespaces). There seems to have been much blognoise on the topic, but I think Derek (Only This and Nothing More) got it right:
Ever sit down at a table with a number of experts in a field that you do not know? They may be speaking English, but that doesn't mean you understand what they are talking about. If you try and force them to speak in laymen's terms, the efficiency of the information exchange drops dramatically. Specific languages are sometimes necessary. Individual specialties within Math and Computer Science all have customized definitions of terms, that sometimes conflict. Each specialty evolved it's terminology to enable efficient, unambiguous communication between specialists in that field. Custom grammars are a necessity for efficient communication. Language reduces to the least common denominator of the intended listenership. If an application expects generic tools to process it's data, then it should use a well known standard. If local efficiency (or development or data) is more important, then use custom formats.
Which reads like just the same thing I was saying in the metadata context, hence the Deja Vu.
Metadata standards such as ISO19115 are being designed with just this structure in mind. Indeed, ISO19115 has the following view of how it should be used:
[Figure: the ISO19115 view of how the standard should be used; image not available]
Similarly, XML documents are built with XML namespaces in mind, and so the XML syntactical rendering of ISO19115 (the almost mythical ISO19139) will be built to allow communities to build such application profiles. I think that's what my geographer friends and the XML microformatters need to do: build application profiles of existing standards, with all their own information built in as extensions over the core. Sure, not every application will know what it's about, but all applications conforming to the core standards ought to be able to recognise elements in common, and specific applications can exploit the extra information.
How one stores this information is a moot point (flat files, databases, GIS systems, whatever), but when we exchange the stuff it's going to have to be with XML. And with XML, we'll use namespaces. We fully expect the application profiles of standards to pull in stuff from other namespaces ...
... which in the microformatter argument leads to questions about whether or not entries should have the same names in all application profiles. Of course not! We just don't all use the same names for everything, but where we do have communities intersecting we can build ontologies (which are just sophisticated mappings between terms that we understand). We can then use tools to do conversion between documents, exactly as argued in, for example, Dare's first article linked above. But I think the microformat principles linked in the definition above hold well in the scientific metadata world too:
solve a specific problem
start as simple as possible
design for humans first, machines second
reuse building blocks from widely adopted standards
modularity / embeddability
enable and encourage decentralized development, content, services
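To make the "core plus extensions" idea concrete, here is a minimal sketch of a tool that only understands a core namespace but copes happily with a document carrying a community's extras. Both namespace URIs and every element name are invented for illustration; they are not the real ISO19115/ISO19139 schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespaces: one for the shared core, one for a community profile.
CORE = "http://example.org/core-metadata"
PROFILE = "http://example.org/my-community-profile"

DOC = f"""
<core:record xmlns:core="{CORE}" xmlns:prof="{PROFILE}">
  <core:title>Surface wind observations</core:title>
  <core:abstract>Hourly wind speed at one site</core:abstract>
  <prof:instrument>cup anemometer</prof:instrument>
</core:record>
"""

def split_by_namespace(xml_text, core_ns=CORE):
    """Return (core, extensions) dicts of child element text keyed by local name."""
    core, extensions = {}, {}
    for child in ET.fromstring(xml_text):
        # ElementTree spells qualified names as "{namespace}localname".
        namespace, _, local = child.tag[1:].partition("}")
        target = core if namespace == core_ns else extensions
        target[local] = child.text
    return core, extensions

core, extensions = split_by_namespace(DOC)
print("understood by any core-aware tool:", core)
print("only meaningful to the profile's community:", extensions)
```

The generic tool never needs to know what `prof:instrument` means; it just has to leave it alone.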
by Bryan Lawrence : 2005/08/11 : Categories curation badc xml ndg metadata iso19115 metafor : 0 comments (permalink)
More on the Plextor
Some more things I've learned. Firstly, that Suse 9.2 and Suse 9.3 use exactly the same versions of growisofs and mkisofs. So the difference in the errors reported earlier under Suse 9.2 and Suse 9.3 is probably about permissions, given cdrecord showed something rather different between root and normal user access under 9.3. So, under 9.2:
Run k3b from root terminal, burn an existing image. Yes it works. Both DVD-R and DVD+R.
Run k3b as root from user process, and try and write on the fly. Fails.
Run k3b as user, and create the image and then write as part of one session, but with growisofs and mkisofs suid root. Fails, but with the Suse 9.3 error, not the permission error.
Write the iso image and then use k3b to write it, as a normal user (but with all that stuff setuid still there). Works.
Turn off the setuid, fails.
Turn on setuid only for growisofs, now works from iso image as a user, but not on the fly.
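Since so much of the above hinges on which tools are setuid at any given moment, a small check is handy when repeating the experiments. This is only a sketch: /usr/bin/growisofs is the path that appears in the k3b output, but the mkisofs path is my guess and may differ on your system.

```python
import os
import stat

def is_setuid(path):
    """True if the file at path has the set-user-ID bit set."""
    return bool(os.stat(path).st_mode & stat.S_ISUID)

def report(paths):
    """Map each existing path to its setuid status; missing files are skipped."""
    return {p: is_setuid(p) for p in paths if os.path.exists(p)}

# The mkisofs location is an assumption; adjust to wherever it lives.
print(report(["/usr/bin/growisofs", "/usr/bin/mkisofs"]))
```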
Now, I know that my colleague is running k3b 0.12.2, with the same growisofs, but perhaps different tools (and kernel) underneath. More to investigate.
by Bryan Lawrence : 2005/08/11 : Categories computing (permalink)
Broadband from Satellite
I'm still a satellite broadband user, and things have been better since I last reported: I regularly get 512 kb/s and sometimes 2 Mb/s, although nearly as often the system appears overloaded and completely unresponsive.
It would appear that I haven't much to look forward to. Too far from the exchange for anything meaningful (i.e. 512, I see no reason to have 256), and nothing technological on the horizon.
If I lived in Japan, things would be different. Space Daily is reporting that a new satellite is planned that will
make it possible to send and receive data at a maximum speed of 100 megabits per second in mountainous areas and remote islands, as well as aboard Shinkansen bullet trains, airplanes and ships...
The satellite is due to be in service by 2015, and in terms of signal strength, will even allow mobile phones to communicate at 10 Mb/s!
by Bryan Lawrence : 2005/08/11 : Categories broadband computing (permalink)
Plextor PX-716UF Woes Under Suse 9.2 and 9.3
I do all my computing on my laptop, and have about 10 GB of user files on board (including about 2.3 GB of mail files). Obviously I care very much about backup, and I usually back up via rsync to a badc server. My backup server broke last week and is still not up. For that, and other reasons (I want physical copies and to be able to back up my laptop at home), I purchased a usb/firewire dvd writer. Herewith my experience.
This device supports both firewire and USB2.0, but the firewire connection isn't hot mounted, and it's not obvious what to do with it. So, moving to USB2.0, and plugging it in, we have from dmesg:
Vendor: PLEXTOR Model: DVDR PX-716A Rev: 1.03
Type: CD-ROM ANSI SCSI revision: 02
sr0: scsi3-mmc drive: 40x/40x writer cd/rw xa/form2 cdda tray
Attached scsi CD-ROM sr0 at scsi1, channel 0, id 0, lun 0
Attached scsi generic sg1 at scsi1, channel 0, id 0, lun 0, type 5
USB Mass Storage device found at 8
which I interpret to mean that the raw beastie is at /dev/sr0, although I'm a bit concerned about all this scsi emulation stuff since I'm told that we don't use the scsi emulation with 2.6 kernels ...
Under Suse 9.2, k3b recognises the device, and the dvd-r media, but fails at write time with
growisofs
-----------------------
:-( unable to open64("/dev/sr0",O_RDONLY): Permission denied
growisofs comand:
-----------------------
/usr/bin/growisofs -Z /dev/sr0 -use-the-force-luke=notray
-use-the-force-luke=tty -use-the-force-luke=dao -dvd-compat -speed=1 -gui
-graft-points -volid Backup Mail+CEDAR -volset -appid K3B THE CD KREATOR
VERSION 0.11.15cvs (C) 2003 SEBASTIAN TRUEG AND THE K3B TEAM -publisher Bryan
Lawrence -preparer K3b - Version 0.11.15cvs -sysid LINUX -volset-size 1
-volset-seqno 1 -sort /tmp/kde-lawrence/k3bmHQT3a.tmp -rational-rock
-hide-list /tmp/kde-lawrence/k3boB3OIb.tmp -full-iso9660-filenames
-disable-deep-relocation -iso-level 2
-path-list /tmp/kde-lawrence/k3bctpRtc.tmp
(K3b Version:0.11.15cvs,KDE Version: 3.3.2 Level "a",QT Version: 3.3.3)
Under Suse 9.3, k3b recognises the device, and the dvd-r media, but fails at write time with
OPC failed. Please try writing speed 1x. Fatal Error at startup: Input/Output error
(but I had it set to 1x speed!!!). Show details gives me
...
/dev/sr0: engaging DVD-R DAO upon user request...
:-[ PERFORM OPC failed with SK=5h/ASC=2Ch/ACQ=00h]: Input/output error
growisofs comand:
-----------------------
/usr/bin/growisofs -Z /dev/sr0 -use-the-force-luke=notray
-use-the-force-luke=tty -use-the-force-luke=dao -dvd-compat -speed=1 -gui
-graft-points -volid K3b data project -volset -appid K3B THE CD KREATOR
VERSION 0.11.22cvs (C) 2003 SEBASTIAN TRUEG AND THE K3B TEAM -publisher
-preparer K3b - Version 0.11.22cvs -sysid LINUX -volset-size 1
-volset-seqno 1 -sort /tmp/kde-bnl/k3bAxKrpb.tmp -rational-rock
-hide-list /tmp/kde-bnl/k3b2Hb2fa.tmp -full-iso9660-filenames
-disable-deep-relocation -iso-level 2
-path-list /tmp/kde-bnl/k3bPRTK8b.tmp
So I tried burning a CD (from Suse9.3), and that works (very quickly)! So it's not connectivity of any sort. It's something dvd-acious ...
I then took the physical device to a colleague running redhat, using xcdroast (version 0.98 with some patches apparently), which on his system passed the device argument (dev="/dev/scd1") to Cdrecord-ProDVD-Clone 2.01b31 with the "Unlocked features: ProDVD Clone". It worked.
So for the moment, on my system I have an expensive (but fast) CD writer, and some more investigating to do.
Update (later on same day): Tried this at home on a Suse 9.3 system with the latest updates and the latest k3b ... failed with the same error. But this was interesting: If I issue the command cdrecord -scanbus as a normal user I get
cdrecord -scanbus
Cdrecord-Clone 2.01 (i686-suse-linux) Copyright (C) 1995-2004 Jörg Schilling
Note: This version is an unofficial (modified) version
...
Linux sg driver version: 3.5.27
Using libscg version 'schily-0.8'.
cdrecord: Warning: using inofficial libscg transport code version
(okir@suse.de-scsi-linux-sg.c-1.83-resmgr-patch '@(#)scsi-linux-sg.c 1.83
04/05/20 Copyright 1997 J. Schilling').
scsibus0:
0,0,0 0) 'SAMSUNG ' 'CDRW/DVD SM-332B' 'T403' Removable CD-ROM
0,1,0 1) *
...
but if I issue the same command as root, I get
cdrecord -scanbus
Cdrecord-Clone 2.01 (i686-suse-linux) Copyright (C) 1995-2004 Jörg Schilling
Note: This version is an unofficial (modified) version
...
scsibus0:
0,0,0 0) 'PLEXTOR ' 'DVDR PX-716A ' '1.03' Removable CD-ROM
0,1,0 1) *
...
by Bryan Lawrence : 2005/08/09 : Categories computing : 5 comments (permalink)
Catching up on Real Climate
I've been remiss in my reading of RealClimate, which is simply one of the best "blogs" around ... although I have to say it's less a web log than an interactive nearly peer reviewed journal full of great articles and comments. (Why do I say it's peer reviewed? Because the articles get solid review in the comments. Why nearly? Because no editor comes down and makes decisions.) In fact, I find the comments are usually as interesting as the articles...
by Bryan Lawrence : 2005/08/02 : Categories climate (permalink)
Apache release WS-Security Implementation.
Davanum Srinivas has pointed out that Apache have released their WS-Security implementation (thanks Marta).
A couple of weeks is a long time in the "is there a patent problem or not" world. Two weeks ago I reported problems with the patent status of this activity (actually via Davanum's blog).
There is a long email conversation about this here (which covers IBM's position). There is a short email here covering Microsoft's position. The longer email conversation is interesting in that it covers a bunch of hypothetical situations, and exposes some fragility in relying on the Apache license (which protects the user from code contributor misbehaviour, but not third party patent encumbrance). However, Apache do have this statement:
Any known encumberance, such as Patent claims/required patent license, is an IP issue and covered by the general Board directive; Circumvent or Terminate.
(from this email). They didn't terminate. Which means of course that Apache believe WS-Security is safe (all three parties who were obvious candidates to have had patent/license claims have apparently stated that they don't have such patents, i.e. there are no known encumbrances). However, of course, the quote from Joseph Reagle I used last time is still valid:
Unfortunately, it's difficult for the patent status of anything to be very clear ...
This makes our legal folk nervous. They somehow think that if we write code ourselves we'll be safe, but of course we won't. I liked the comment somewhere in the long email conversation that any two lines of Java could probably be patent encumbered if you could find the patent. I suspect that's the reality of the software world. If no one is waving a patent around, and you don't know of one, you should just get on with it. So we will, and can use WS-Security after all, which means our NDG roadmap is safe.
As an aside, I note that the Apache wss4j distribution supports SAML tokens, so that must mean the known SAML issues have been resolved too.
(Readers might wonder why we are so worried about patents when such patents aren't enforceable in the UK. The problem is that our legal people tell us that if we distribute software on a website where Americans could download, we could still get sued in American courts ... and might be obliged to defend. I think this is a rather small risk ... but if we can avoid such risks life is easier.)
by Bryan Lawrence : 2005/08/02 : Categories computing ndg : 1 comment (permalink)
Solar Influence on Recent Climate
There is a tiny war of words in Nature this week about reconstructions of solar influences on climate. Muscheler et al. (2005) claim that current solar activity is not particularly unusual in a criticism of Solanki et al. (2004) (who themselves claim that the last few decades are unusual with respect to the previous 11,000 years). There is a reply by Solanki et al. which essentially states that Muscheler et al. have used an inappropriate normalisation to get large amounts of historical activity. It's only a tiny war though, because all parties agree with Solanki and Krivova (2003), who demonstrate that even if the last few decades are unusual, they can't explain the recent warming.
by Bryan Lawrence : 2005/08/01 : Categories climate (permalink)
El Nino or La Nina in the Pliocene?
Apparently (Wara et al., 2005) the Pliocene should give us a good idea of what our climate could be like in a few decades, as many of the boundary conditions forcing the climate were similar then to what we see (or will see) today:
including first-order ocean circulation patterns, the Earth's continental configuration, small Northern Hemisphere ice coverage, and atmospheric carbon dioxide concentrations (about 30% higher than pre-anthropogenic values).
It would appear (from Wara et al., although not without controversy) that during the Pliocene conditions more like El Niño existed:
the eastern Pacific thermocline was deep and the average west-to-east sea surface temperature difference across the equatorial Pacific was only 1.5 ± 0.9°C, much like it is during a modern El Niño event.
This means that the existing atmospheric circulation is potentially not stable to significant warming, and there could be significant redistributions in the oceanic currents and major atmospheric circulations associated with greenhouse gas induced warming. Indeed the final sentence of Wara et al. makes reference to the possibility that such redistributions might already be beginning.
by Bryan Lawrence : 2005/08/01 : Categories climate (permalink)
The Linux Desktop and the Network
I've just been up to Edinburgh for a couple of days of meetings, but while up there, I tried to drop Linux onto a machine for a mate who is sick of spyware and other Windows-acious problems.
I installed Suse 9.3 on his system (keeping Windows), and noting the possibility of him wanting to run software suspend to disk I made sure there was a little boot partition running ext3.
The major thing from his point of view was that he had to have good internet access, and a stable system. Bear in mind that this bloke is very familiar with the various misbehaviours of "that other OS", and not at all with Linux.
He uses tiscali broadband, and that turned out to involve a USB-modem (like most broadband users I guess). The particular usb-modem involved was a fast 800 type modem ...
Of course Suse 9.3 didn't discover that, or do anything sensible with it. However, it turns out that linux drivers are available, and easily installed. The sagem web site has drivers and manuals.
The installation requires access to the kernel source, but seems otherwise straightforward for me ... but then I'm not afraid of command line access ...
Further problems arise in using it ... in practice there are three steps, which are encapsulated in the following script ...
#
/usr/local/sbin/eaglectrl -d
sleep 20
/usr/local/sbin/startadsl
#
(there is a stop adsl command available too).
One can automate the up/down for a user ("user") by the following steps:
Place the above script in /home/user/bin as mystartadsl
Edit /etc/sudoers (using visudo as root) and add the following line:
user ALL = NOPASSWD: /home/user/bin/mystartadsl, /usr/local/sbin/stopadsl
Use the kde menu editor (or whatever else you use) to add in the internet menu "start tiscali" and "stop tiscali" menu entries which invoke (respectively):
sudo /home/user/bin/mystartadsl
sudo /usr/local/sbin/stopadsl
but the user doesn't know automagically whether it is up or down. We need to be able to interact with the kde internet management, and I didn't have time to do that. There is no way my mate has the necessary expertise (now).
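As a stopgap for the "is it up or down?" question, something like the following could sit behind a third menu entry. It is only a sketch under two assumptions I haven't verified on his machine: that the connection shows up as a network interface called ppp0 (the eagle driver may well use a different name), and that the interface being present is a good enough proxy for "connected".

```python
import os

def link_is_up(interface="ppp0", sys_net="/sys/class/net"):
    """Guess whether the link is up by looking for its network interface.

    The interface name is an assumption, and presence of the interface is
    only a rough proxy for a working connection.
    """
    return os.path.exists(os.path.join(sys_net, interface))

print("tiscali is", "up" if link_is_up() else "down")
```

It doesn't integrate with the kde internet management at all, which is what is really needed.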
Software suspend also breaks (with a usb key disk inserted). The bottom line at the moment is that my mate claimed "the OS is buggy". It's not much chop for me to claim it's not the OS, it's the distribution ... particularly with a "state-of-the-art" commercial distro ... So he's gone back to Windows (hopefully only for a while).
It would appear that mandrake claims this usb modem works out of the box. I'll be interested to know if that means kde recognises it properly, and the mandrake (or should I say mandriva) firewall behaves properly. I also saw on the ubuntu forum that they hope to have it in a future ubuntu distribution.
I guess my point (apart from documenting what I did should I ever get back up to try and finish this), is that broadband access by usb modem is what nearly every home linux desktop user will want to do (at least until they get a wireless router). Until the desktop distributions support that easily, it's going to be very hard for folk to use these distros.
by Bryan Lawrence : 2005/07/28 : Categories computing : 2 comments (permalink)
Business Models and Curation
While up in Edinburgh, I visited the National Digital Curation Centre. Amongst the many interesting things we talked about was the pesky difficulty of applying business models to curation. Chris Rusbridge at the DCC differentiates between curation and preservation as:
Preservation: something you do for the future, and
Curation: something you do for both now and the future.
This is a useful distinction, although I would perhaps modify curation to be both preservation and facilitation, to make clear what the curator is doing for the current users (facilitating their doing something else). In any case, this definition of curation works for us.
It's rather easy to provide a business model for facilitation, but not for preservation. If what you do is only preservation, then one has a difficult road to follow in establishing a business model. You have no users, and the future value of what you preserve is probably unknown. You only have costs associated with ingestion and ongoing storage (+migration etc). What are the metrics associated with successful preservation?
If you do both, then the risk you have is that facilitation dominates over preservation, because the business model for facilitation is rather easier to determine, and metrics for measuring success are much easier to determine.
It turns out that the DCC ran (with the Digital Preservation Coalition) a workshop on such issues, at the same time as a subcommittee of the NERC Data Management Advisory Group (DMAG) met to discuss output performance measures (or OPMs). Clearly OPMs have to relate to the objectives of the organisation, which to some extent come down to the business model. I attended neither meeting, but Chris was kind enough to send me some key presentations (they should be available on the web at some point, I'll update this page when they are). Sam Pepler from the BADC attended the DMAG meeting, so I've got some feedback from both.
The material Chris sent introduced me to the Balanced Scorecard approach (see here for an introduction to the balanced score card, but the basic idea is to apply more than just short term finances to evaluation of success, particularly when developing intangible capital). The espida group are applying this to digital curation.
One of the things James Currall (from espida) appears to have talked about (at the curation cost model meeting) is the value-time behaviour of items, which he depicted in the following diagram (I will give a proper reference to where this comes from when I have one):
[Figure: value against time for items, from James Currall's talk; image not available]
and he talked about a number of asset classes that one might look at preserving, which included research data. What struck me though was the implicit assumption in the above figure that all things decrease in value with time. Leaving aside how historians feel about that, I felt obliged to redraw his figure as follows:
[Figure: the same diagram redrawn by me; image not available]
The point I'm making is that for us, and for our science community, data usually has an immediate value, associated with (and preceding) paper writing, and then, because it's generally something about the real world, the value increases with time, as it becomes part of our timeseries of world observations (although, until it's used, which might be when the timeseries is "long enough" - whatever that means - we might have difficulty in justifying holding it).
This has implications for our cost model (and perhaps our balanced score card when/if we get to that). In this case, facilitation is generally about helping that first bump, and then preservation is about ensuring that the slope of the value time graph for environmental data is positive!
The difficulty with our output performance measures is how to capture the latter. One of the things that the DMAG subcommittee discussed was measuring the number of datasets published. Leaving aside the definition of publication, I would argue we need to measure the amount of ingestion work that is done (both of new datasets, which is generally hard, and of additional data for old datasets, which although it ought to be easier, is something we generally do badly in terms of updating metadata). Perhaps we can add to our list of criteria the age of the datasets we hold - the older they are, and the more complete the timeseries, the more important they are - even if they have no current users. Worse still, if the timeseries is not being added to, and we have no (current) users, then how do we evaluate how well we are preserving it, and what resources should be devoted to doing so?
Update, August 2nd: James Currall tells me that when he actually gave his talk at the meeting the diagram had another curve on it: an upward increasing line representing the value of malt whisky with time ... so he was already thinking about the class of things with values that increase with time (another example he suggested was artworks).
Me: I think wine would be a better example than malt whisky (which apparently always increases in value with time :-) ... wine might or might not, but generally you don't know til you open the bottle!
by Bryan Lawrence : 2005/07/28 : Categories curation badc : 1 trackback : 2 comments (permalink)
flpsed
Quite often I have to make tiny modifications to existing pdf files.
I've just found flpsed and it works!
I downloaded it and the Fast Light Toolkit, and a few minutes later I could do it. Installing flpsed was a wee bit annoying on my Suse (still 9.2 dammit) system:
Configuring, making and installing fltk was a doddle.
The Configure for flpsed couldn't find the fltk though, I had to do this:
./configure FLTKCONFIG=/usr/local/src/fltk-1.1.6/fltk-config
Then it's straightforward to use. Take a pdf document. Use pdf2ps to produce a postscript file. Annotate (not edit) it with flpsed, and then export the result as pdf. Done.
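If you do this often, the first step is easy to wrap. A minimal sketch follows: pdf2ps is the Ghostscript script and takes an input and an output file; the file names here are placeholders. The annotation step itself is still done by hand in flpsed.

```python
import subprocess

def pdf_to_ps_command(pdf_path, ps_path):
    """Command line for Ghostscript's pdf2ps: pdf2ps input.pdf output.ps"""
    return ["pdf2ps", pdf_path, ps_path]

def convert(pdf_path, ps_path):
    """Run pdf2ps; raises CalledProcessError if the conversion fails."""
    subprocess.run(pdf_to_ps_command(pdf_path, ps_path), check=True)

# Placeholder file names. After converting, open the .ps file in flpsed,
# annotate it, and export the result as pdf from there.
print(" ".join(pdf_to_ps_command("form.pdf", "form.ps")))
```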
by Bryan Lawrence : 2005/07/21 : Categories computing : 0 comments (permalink)
Icehouse and Greenhouse Worlds
I've been ignoring paleoclimatology for a long time as being yet another interesting field that I haven't time to pay attention to. However, today's lunchtime reading was Kump's letter to Nature, which introduced me to the concept that in the Eocene (55-34 million years ago) the earth was thought to be essentially warm and ice free most of the time, associated with higher levels of CO2. However, apparently that all changed at the Eocene-Oligocene boundary, when the earth moved into its current glaciated state, i.e. masses of permanent ice over Antarctica and elsewhere.
It turns out that during the Eocene there probably were glaciations but they didn't persist. The hypothesis advanced for the transition into the Oligocene with permanent glaciation (or why earlier "minor" glaciations didn't persist) seems to be related to a rapid drawdown in atmospheric CO2 resulting from increased weatherability of the continents associated with Himalayan uplift.
There were two things I took from this:
I had no idea that weatherability of rocks could be so important for the atmospheric CO2 loading (ok, I should have, I've heard colleagues witter on about this in the past, but hadn't paid enough attention), and
Kump's final conclusion:
If decreasing atmospheric CO2 stabilized the glacial state in the Oligocene, might increasing atmospheric CO2 from fossil-fuel burning destabilize it in the future? The lesson to be learned here is that we should watch for subtle signs that we are moving from the icehouse world in which Earth has remained for 34 million years into a new, greenhouse world.
by Bryan Lawrence : 2005/07/21 : Categories climate (permalink)
Oil Supplies
Science also has a couple of letters (Grant and Ehrenfeld) on the reality of how much oil is left to exploit.
As I said in January, this is something we should all give a bit of thought to from time to time, not least because these letters give us a timescale of, say, two or three decades, so it's going to affect our retirements!
(I especially liked the quote from Bartlett on the upper limit of this period: assume all the earth is oil, and current growth rates in usage, and it'll all be gone in 342 years!)
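The reason the upper limit is so short is that exponential growth eats any finite resource in a time that depends only logarithmically on its size. A back-of-envelope sketch follows; the three input numbers are my own round-number assumptions, not Bartlett's, so it won't reproduce his 342 exactly.

```python
import math

def years_to_exhaustion(resource, rate_now, growth):
    """Years until cumulative use equals `resource`, if use grows exponentially.

    Use at time t is rate_now * exp(growth * t); integrating and solving for
    the exhaustion time T gives T = ln(1 + growth * resource / rate_now) / growth.
    """
    return math.log(1.0 + growth * resource / rate_now) / growth

# All three inputs are assumptions for illustration only.
EARTH_VOLUME_BARRELS = 6.8e21   # Earth's volume (about 1.08e21 m^3) in oil barrels
USE_BARRELS_PER_YEAR = 3.0e10   # very roughly 80 million barrels a day
GROWTH_PER_YEAR = 0.07          # 7% a year, i.e. doubling about every ten years

years = years_to_exhaustion(EARTH_VOLUME_BARRELS, USE_BARRELS_PER_YEAR, GROWTH_PER_YEAR)
print(round(years))
```

With these inputs the answer comes out at a little over three centuries, the same ballpark as the quoted figure, and doubling the size of the planet would only buy about another ten years.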
by Bryan Lawrence : 2005/07/20 : Categories environment : 2 comments (permalink)
Atlantic Ocean Oscillations and European Climate
Rowan Sutton and Daniel Hodson have an intriguing paper in Science on the influence of the Atlantic Ocean on summer climate in Europe (and North America).
The bottom line of this paper is that the "Atlantic Multidecadal Oscillation" (AMO) has an influence, lasting for tens of years, on whether summers are wetter and/or warmer than average. They conclude with:
In the absence of anthropogenic effects and assuming a period of 65 to 80 years, we should now be entering a warm phase of the AMO. Our results would then suggest a forecast of decreased (relative to 1961 to 1990) summer precipitation (increasing drought frequency) and warmer temperatures in the United States together, possibly, with increased summer precipitation and temperatures in western Europe.
They go on to point out a possible non-linear coupling between the AMO and possible anthropogenic effects on the thermohaline circulation (TC) which could ameliorate (presumably for a while) anthropogenic climate changes by moving the AMO (presumably more rapidly than normal) into a negative phase (with presumably cooler and wetter conditions) as the TC changed. (All the presumablies are mine not theirs).
All this is good stuff, but as Rowan is a mate, I have to take issue with a sentence earlier in the paper:
A simple significance test suggests that the major observed anomalies shown in Fig. 2 are unlikely to have arisen from internal fluctuations of the atmosphere.
I looked at the supporting material (pdf) for all of ten minutes and couldn't fathom it directly. I can think of no way that a significance test on the observational data alone can tell me anything about whether any patterns are due to internal or external fluctuations. If, as I assume from the supporting material (it doesn't state it directly with reference to that sentence), it's based on comparing the variability with the atmosphere-only model (driven by constrained sea surface temperatures), then it stands or falls on whether that model has realistic variability on those scales (not on the significance test). The paper goes on to justify the argument based on the coupled atmosphere/ocean experiments, but the way it reads implies there is an a priori belief that it is external variability based only on the observed statistics without having done all the model experiments. However, that's just a quibble ... I'd like the paper to have made that more clear, but it in no way detracts from the results.
by Bryan Lawrence : 2005/07/20 : Categories climate (permalink)
RSS and Atom
Now that Atom 1.0 is pretty much out, it's useful to point to the comparison between RSS and Atom. I'm doing that here so I can easily find it again.
(As an aside, Sam Ruby, who hosts that wiki page, was slashdotted and coped; his description is here).
leonardo (the software that runs this site) is being upgraded to atom-1.0 even as I type, but I'm sad to say that I'm not contributing ... perhaps when Elizabeth sleeps through the night I'll regain some extra time for extra activities.
Update July 22: and Niels Leenheer compares Atom 0.3 and Atom 1.0. Via Sam Ruby again.
by Bryan Lawrence : 2005/07/20 : Categories computing (permalink)
elementtree
Joseph Reagle has been using elementtree, and that link points to a useful set of notes on how to use it.
I've had a play with a number of ways of processing XML in python, and I like elementtree the best of all ... but I find the original docs slightly disappointing (i.e. not complete enough). Given I don't get to use it day to day, I need an easy to use crib sheet ... if I had the time I'd link all the bits and bobs I've seen together ... but one day is probably a long way away regrettably.
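As a start on that crib sheet, here is a minimal sketch of the handful of calls I keep needing. It assumes the ElementTree API as shipped in the Python standard library (xml.etree.ElementTree) rather than the separately installed elementtree package, and the document and element names are made up purely for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up document, purely for illustration.
doc = """<catalogue>
  <dataset id="d1"><title>ERA40</title></dataset>
  <dataset id="d2"><title>HadCM3 control</title></dataset>
</catalogue>"""

root = ET.fromstring(doc)  # parse a string into an Element

# Iterate over child elements, reading attributes and text.
titles = {}
for ds in root.findall("dataset"):
    titles[ds.get("id")] = ds.findtext("title")

# Build a new element and attach it to the tree.
new = ET.SubElement(root, "dataset", id="d3")
ET.SubElement(new, "title").text = "ECMWF operational"

# Serialise back to XML text.
xml_out = ET.tostring(root, encoding="unicode")
print(titles)
print(xml_out)
```

That covers parse, find, read, build and write, which is most of what I ever do with it.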
by Bryan Lawrence : 2005/07/20 : Categories xml : 0 comments (permalink)
Patent Status of XML-Signature etc
I found out yesterday that WS-Security has patent/license problems which make it difficult to use in a GPL environment. That got me worried about NDG security. We depend on (or will depend on) three pieces of technology:
xml-signature,
PKI X509 certificate handling (and signatures), and
our Attribute Certificates.
Taking these one at a time.
What is the patent/license status of XML-signature? It's a W3C standard which is a Good Thing (TM), but that doesn't guarantee much. What the W3C knows is summarised here, but probably the best summary of the status appears as a comment on the patent status of xml-signature by Joseph Reagle (the W3C co-chair), which, because it's so relevant, I'll repeat in its entirety:
Unfortunately, it's difficult for the patent status of anything to be very clear. (It's like proving a negative: God doesn't exist.) The only clear patent status IMHO is one that has been upheld in court or otherwise considered uncontestable, and it's license has been publically excercised by many implementors.
Regardless, there are a few ambigous statements from a few years back that folks should be aware of, but I'm not personally aware of any specific claims of infringement or licenses with respect to the 12+ implementations.
PKI. Well, ideally we'll concentrate on using OpenSSL, which has a useful FAQ on the topic of GPL and patents and OpenSSL. The key points are that
OpenSSL itself is not a problem, but the various algorithms it uses are patent encumbered (as described in the README). In principle however, we can always change the actual algorithm we use.
The GPL issue is ok on linux systems, but in case of other O/S it is summarised with this:
If you develop open source software that uses OpenSSL, you may find it useful to choose an other license than the GPL, or state explicitly that "This program is released under the GPL with the additional exemption that compiling, linking, and/or using OpenSSL is allowed." If you are using GPL software developed by others, you may want to ask the copyright holder for permission to use their software with OpenSSL.
Finally, our attribute certificates are just XML documents which describe our own security policies. I don't think anyone else's patent could affect that. However, since we talked about migrating to SAML at one point, an interesting question would be what would happen if we migrated to SAML to encode our attribute certificates (why we should do this is another question that needs an answer that I can't give - because I haven't one, but people keep telling me we should) ...
The situation for OpenSAML seems somewhat more unclear, and I'll probably need to follow it up at some point, but meanwhile it seems like apache binned OpenSAML because of patent issues, but on the other hand the Shibboleth team seem happy with the patent license that would be granted for them (see the attachment on the previous link), and presumably we could get the same (perpetual terms that appear to allow the delegation of authority to use). Fortunately, SAML isn't on the agenda yet, so we don't need to go there ...
by Bryan Lawrence : 2005/07/19 : Categories xml computing : 0 comments (permalink)
Microsoft Please Save Paper
I refuse to believe that the world needs default margins of one inch (or 2.54 cm) ... imagine how much paper could be saved if the default margin was 1.5 cm? The first thing I do with a new Office installation is change the default template ... but couldn't it be better from the beginning (those who really want the white space could change their own defaults).
(Yes, I do use MS-Office!)
by Bryan Lawrence : 2005/07/19 : Categories computing : 0 comments (permalink)
ECMWF to increase their operational model resolution
I have had my attention drawn to the planned increase in the resolution of the European Centre for Medium Range Weather Forecasting operational model from (in the case of the atmosphere) T511N256L60 to T799N400L91.
This corresponds to improving the horizontal resolution for the dynamics from about 40 km to about 25 km and the physics from about 80 km to about 50 km (where here I'm quoting the shortest resolved wave at the equator). Of course the true resolution is less than this because the numerics can't support two-grid waves (or anything near that). A rule of thumb might be to say four grid points, in which case the resolution is still about 100 km in reality, which is amazing!
If you're anything like me you can't remember - or easily calculate - what these represent in "real" surface resolution terms. So, for my own benefit, I reminded myself by perusing Laprise, 1992 and writing some simple python to generate the numbers.
    # This code follows Laprise, 1992, in BAMS,
    # See http://blue.atmos.colostate.edu/publications/pdf/NT-27a.pdf
    #
    from math import pi, sqrt

    def gridres(N=None, T=None):
        """Return four estimates (in km) of the resolution for a truncation."""
        a = 6371.0  # radius of the earth in km
        if N is None and T is None:
            return 'Unknown Resolution (need N and/or T)'
        if N is None:
            N = T  # the same formulae are applied to either number
        L1 = 2.*pi*a/(3*N+1)
        L2 = pi*a/N  # note ECMWF seem to quote this.
        L3 = sqrt(4*pi)*a/(N+1)
        L4 = pi/sqrt(N*(N+1)/(2*a*a))
        return int(L1), int(L2), int(L3), int(L4)

    if __name__ == "__main__":
        # print(gridres(T=31))  # checking code against the paper
        # ECMWF September Upgrade:
        #
        print("Best to quote the 2nd of these numbers on each line:")
        print(gridres(T=799))
        print(gridres(N=400))
by Bryan Lawrence : 2005/07/18 : Categories badc climate : 0 comments (permalink)
WS-Security Licensing Problems
Oh how I hate Intellectual Property wars ...
Infoworld are carrying an article which states that the WS-Security licensing terms are causing the Apache foundation problems. It turns out that
Although WS-Security, along with the other so-called WS-* specifications such as BPEL (Business Process Execution Language), is under the jurisdiction of OASIS, users still must sign license agreements with IBM and Microsoft.
And oh dear, the terms of that license are probably incompatible with the GPL, so we probably can't build something based on WS-Security and distribute it with GPL-based software. See David Berlind blogging at ZDnet, for a more detailed discussion of this issue. This article also points out the problems with the OASIS "standard"s in general ...
Frightening stuff ... but some hope for the future in another Berlind blog. Meanwhile I begin to understand why W3C standards are far more useful (I don't have to investigate what the licensing status of their standards is; it is clear).
by Bryan Lawrence : 2005/07/18 : Categories computing : 0 comments (permalink)
More Good Ideas about Blogging
I've blogged in the past about good reasons for blogging (well, strictly, I didn't blog, I linked to Tim Bray as usual). Here from Randy Hollaway is another one:
So here's a what if- what if you could do 20% time projects ONLY if you blogged about the effort with customers? Even if you couldn't share all of the details for competitive reasons, would this make 20% time more valuable to the organization? Something to ponder.
This is via Sam Ruby and refers back to an original blog by Stephen O'Grady, which stated:
we don't find the time for blogging, we make time for it. I've commented in the past that blogging isn't an addition to our day job, it's part of our day job. In recent weeks I've come to think of it as something akin - though different - to Google's 20% time. We have nothing so formalized, but we probably spend something like 20% of our time (ok, more) researching and writing and pursuing what we consider to be new and interesting avenues of interest. Some of these bear fruit for RedMonk, some don't. But it only takes a couple of hits to make the whole thing worthwhile.
In our context (an atmospheric data centre), one of the difficulties I have is that our staff need to remain research active (in the sense that the computer people need to keep their skills up, and the atmospheric scientists need to keep interested and up with the play). I've traditionally said this should be a twenty percent activity, but it's been nearly impossible to find ways to make this
meaningful, and
measurable.
Perhaps getting them to actually blog about it would achieve both (since most of what we do won't be publishable in the refereed journal sense).
2005/07/18 : 0 comments (permalink)
Linux drivers for HP Color LaserJet 2820
I'm considering buying a color (sic) laserjet 2820 (an all-in-one printer) to replace my aging deskjet G55 ...
Anyone out there running one? The hpinkjet site doesn't yet mention it, but I've seen that before ... and things have worked just fine ...
by Bryan Lawrence : 2005/07/14 : Categories computing : 2 comments (permalink)
Norwegian Government Says Yes to Open Standards
This says it all (via Ongoing). The bottom line is that:
by the end of 2006 every body of the public sector in Norway must have in place a plan for the use of open source code and open standards.
by Bryan Lawrence : 2005/07/05 : Categories curation msxml (permalink)
Aerosol Climate Forcing
There is an excellent paper in Nature by Andreae, Jones and Cox on what the future may hold as greenhouse gas forcing increases global temperature, just as the protective effect of polluting aerosols decreases. In fact, it's worse than that; as the authors say:
The twentyfirst-century climate will therefore suffer the treble hit of an increasing warming from greenhouse gases, a decreasing cooling from aerosols, and positive feedbacks from the carbon cycle, whereby increased temperatures cause accelerated release of soil carbon by decomposition.
However, we don't really know how bad it will be because
we don't really understand how much the aerosol forcing is protecting us now. Again, as they say:
Do we live in a world with weak aerosol cooling and thus low climate sensitivity, in which case future climate change may be expected to be relatively benign? Or do we live in a highly forced, highly sensitive world with a very uncertain and worrying future that may bring a much faster temperature rise than is generally anticipated?
it would appear that the parameters constraining the carbon release by decomposition are also ambiguous.
While this paper admits a wide range of possible futures (and describes very well why), the bottom line is that
there is a possibility that climate change in the twentyfirst century will follow the upper extremes of current IPCC estimates, and may even exceed them.
(Actually, if one reads the paper in detail, they imply this isn't just possible, but likely, or at least that's my reading of what happens using best estimates of the parameters and likely emissions scenarios.) They go on to say:
Such a degree of climate change is so far outside the range covered by our experience and scientific understanding that we cannot with any confidence predict the consequences for the Earth system.
However, these predictions are cloaked in an enormous range of possibilities, and they rightly point out that a number of approaches (including improving parameterisation of cloud processes) are needed to improve confidence in such predictions.
by Bryan Lawrence : 2005/07/05 : Categories climate : 0 comments (permalink)
ERA40 Precipitation and Antarctic Ice
There is a fascinating article by Davis et al. in Science on Antarctic Ice thickness. There is also a perspective piece by David Vaughan of the British Antarctic Survey. The gist of the substantive article is that some parts of the Antarctic land mass are thickening while others are thinning (the results are based on radar altimetry measurements over eleven years). Not surprisingly, they link the thickening versus thinning to precipitation changes. Leaving aside the implications (evidence for climate change and concomitant sea level rise, which are the main thrust of the paper), one of the things I find most interesting is how good the ERA40 precipitation measurements are! The following image shows the ERA40 snow precipitation (left) and ERS elevation changes (reds: more precip, snowmass increasing; blues: less precip, less snowmass):
[Figure: ERA40 snow precipitation (left) and ERS elevation changes]
Given that there are nearly no precip measurements going into ERA40 at these latitudes, it shows that the precipitation physics in ERA40 are rather better than I thought they were, although, as the paper points out, the magnitudes are not as good as the spatial patterns (and some of the differences in some areas are due to ice dynamics).
(As usual the figures have been degraded in resolution and have had the colour scale removed to comply with my copyright fair use criteria).
by Bryan Lawrence : 2005/07/01 : Categories climate : 2 comments (permalink)
Upgrade
I'm pleased to say that after a silence due to
a significant ankle injury that meant that there was no comfortable way to approach my computer for significant lengths of time, and
upgrading the site to leonardo-0.6.1
... I'm back!!
Upgrade notes here.
2005/06/30 (permalink)
Terms and Conditions
eliterate librarian has drawn my attention to the LexisNexis website and their terms and conditions, which are a joke.
The implication that I can be bound by terms and conditions which I have not read (although actually I have in this case, but am ignoring) is, as I say, a joke. Our legal experts tell us that if you want a disclaimer to hold, it has to be obvious. Here is what my browser shows me of the LexisNexis web site:
[Figure: screenshot of the LexisNexis web site as shown in my browser]
the point being that unless I scroll down, I won't see the terms and conditions link, and even if I do, would I assume I have to read it before I could link to any part of their website? Very unlikely. The terms and conditions that eliterate librarian highlights, and you can go laugh at, are useless ... (at least in UK law ... while it is a dangerous assumption, I assume US law is just as sensible about holding people to contracts they haven't seen)!
by Bryan Lawrence : 2005/06/30 : Categories badc curation : 0 comments (permalink)
Hurricanes and Climate Change
It has been an interesting six months for followers of the issue of whether hurricane intensities and frequencies are changing in response to changes in the background climate.
This has been a bit of a cause celebre, with some claiming that trying to make the link is evidence of duplicitous intent by "the climate scientists". Of course, the key problem is that not every part of the climate system is exhibiting signals of climate change, nor may they necessarily do so. However, it's obviously appropriate to ask the question, but regrettably having done so, the media charges on and marks such links as explicit and authoritative well in advance of scientific opinion. Crichton in his book at least had that part right - although it's a bit much to imply that governments are provoking it ...
Anyway, as I say, an interesting six months. It began with an impassioned open letter from a respected hurricane scientist, Chris Landsea, who thought that Kevin Trenberth had gone too far in linking climate change to the intensity and frequency of hurricane landfall in the U.S. (particularly in later press conferences).
Roger Pielke wrote a measured followup note explaining the necessity of a peer reviewed paper explaining Trenberth's position.
A paper along those lines has now appeared. Trenberth's point, which I think is fair, is that a) there is a plausible physical mechanism which could link global warming to more intense and frequent thunderstorms, but b) at the moment the observational evidence is still not unequivocal (especially in terms of statistical tests).
With respect to his mechanism, it's really rather simple:
Both higher sea surface temperatures and increased water vapor tend to increase the energy available for atmospheric convection, such as thunderstorms, and for the development of tropical cyclones.
... and there is observational evidence for both of those. There is also considerable simulation work that demonstrates that this mechanism could work in practice, but not yet unambiguous evidence of hurricane landfall and intensity changes.
by Bryan Lawrence : 2005/06/17 : Categories climate (permalink)
Laptop OS Woes
Well, my dicky stomach has lasted pretty much all week. While one major side effect has been the inability to concentrate on anything, that has allowed me to set the computer off doing things in the corner while I watched the clouds (or yesterday, the patches of blue on the clouds) ...
So, I did some real testing on kubuntu suspend, which broke! Broke so badly that it caused a kernel panic and corrupted the root partition ... so that's the end of kubuntu for me for now! (Despite what I said on Sunday).
Meanwhile, back in the Suse corner, I did get it to install finally. All it took in the end was to install with the safe settings. I suspect the dma between the dvd and the hard disk ... Couldn't Suse have suggested that in the dozen useless emails we exchanged? Apparently not.
I'm now back in the horrible land of Yast. I hate not getting any (useful) feedback on software installation progress, and in particular, what to do when it hangs, and abort fails! (Which seems to be most times).
The astute reader will recall that back in April I suggested I could be a candidate for a Mac. But even that option seems silly now, given that Mac laptops are due to move to intel next year. (I appreciate the Osborne Effect isn't a necessary way to behave, but given I might actually consider dual booting mac and linux, at least during a transition phase, it makes sense to wait).
by Bryan Lawrence : 2005/06/16 : Categories computing (permalink)
Subversion and Roundup
A while ago I was investigating choices for issue tracking and code maintenance, and was bemoaning the difficulties of trac (which sounded good, but was just too hard to get going).
Some interesting things popped up in today's reading (I'm banished to the spare room - the one with the computer in it - because I've got a dicky tummy, and so I can't do anything "useful") ... this means I'm reading deeper into my feed list than I have for a while.
Anyway, two things of interest:
Richard Jones has produced an integration of subversion and roundup.
The other is what looks like a useful code to produce an rss feed of subversion commits.
by Bryan Lawrence : 2005/06/12 : Categories computing ndg (permalink)
kubuntu 5.04 wins over Suse 9.3
Well, I think I'll give up on Suse. I tried to install it in another partition on my laptop today (so I could follow the advice from the support guy), but this time it didn't even install the software onto disk properly ... (but it installs fine onto my desktop at home). It got 20 minutes into the software install, informed me that a component had failed, gave me the option to retry or ignore. Retry did nothing. Ignore hung the system into an infinite loop.
So, it appears that kubuntu has won for now. I've started to put notes on kubuntu here!
Update (16th of June): Oh no it hasn't (won). See on why not.
by Bryan Lawrence : 2005/06/12 : Categories computing (permalink)
More on Software Patents
A couple of key happenings:
On the one side, we have Tim Bray summarising a legal argument by Greg Aharonian to show that the Microsoft XML (office) patent will never be asserted because of copious prior art.
On the other, Groklaw are attempting to draft suitable wording for a definition of "technical contribution" for purposes of EU patent law.
These represent respectively, the silliness that exists, and the silliness that could exist ...
by Bryan Lawrence : 2005/06/05 : Categories msxml curation (permalink)
MS Office Licensing
I think it's great to see the new blog from Brian Jones at Microsoft (via Sam Ruby). Brian has a specific entry on the licensing question for Office documents, but as he says, he's not an expert on that area. Some of the comments to that entry hit the button though:
Microsoft has lots of high-powered and smart lawyers. They know very well that this license is incompatible with GPL'ed software, like, for example, OpenOffice.
Again, I think the analysis on Groklaw is the best I've seen on this, and again, until this is satisfactorily resolved, I have a big problem with relying on ms documents to store my own IPR for posterity (i.e. beyond the next couple of years).
by Bryan Lawrence : 2005/06/03 : Categories msxml curation (permalink)
Playing with Kubuntu and Korganiser
As regular readers will know, I'm a disappointed Suse Linux bloke at the moment. I've had a support ticket running for a fortnight now, but haven't made much progress (to be fair to Suse, I've been slow in replying to a couple of emails, but to be fair to me, they never read properly what I write; we haven't actually moved past the information I provided in my first email yet).
Anyway, in a fit of frustration because I want a few things now, I installed kubuntu while I was having a bath last weekend (it took me all of a few minutes to set up the software download, a few more to burn a CD, a few to start off the installation, then I had my bath, and when I got out, I had a few minutes config to do). Well done kubuntu for that.
Kubuntu is a bit rough around the edges ... there are a lot of things that don't work quite right ... it's not quite as good an experience out of the box as Suse normally is, but at least it installed!! I've found playing with apt-get much better than yast ...
This afternoon I had a spare hour, and instead of doing something I should do, I've had a play with kontact, korganiser, and the calendaring functionality. I'm starting from being a regular kmail user who uses outlook via cross-over office for calendaring from our corporate exchange server. I have two choices from here. I move over to evolution, or I hope that korganiser can support exchange properly ...
Starting from the latter because I like kmail, I've had a play with korganiser and the exchange 2000 plugin (in kde 3.4.1). It's very nearly functional :-). I have my appointments visible, and can create new ones, but I can't yet from this interface see who is coming to the meetings, and/or their free/busy status for new meetings. Still it's certainly progress from the last time I looked.
by Bryan Lawrence : 2005/06/03 : Categories computing (permalink)
Adobe slightly better than Microsoft
I'm on record with my complaints about how Microsoft license their document formats. Imagine how aghast I was when I read this - apparently the Adobe Acrobat license conditions are stupid. Go read it for yourself ...
... and then step back and realise these are the conditions to use their software! The data in the documents are still safe (or are they?).
It's fine for companies to protect their software IPR, but not the products that we create with their software. That's our IPR. How can it be that even smart Microsoft employees (i.e. see Dare Obasanjo in this and this) can't understand why this makes those of us with responsibility for preserving information for posterity very worried? Tim Bray has it right again.
by Bryan Lawrence : 2005/06/01 : Categories msxml curation (permalink)
Search Engine Ranking
Why is it that reading Tim Bray's blog seems to generate more of my own blog entries than anything else?
Anyway, some time ago I wrote a blog entry on searching; browsing versus discovery in NDG language. Today Tim pointed me to this article on search engine rankings. What I found interesting was this:
... adoption of shopping search and other vertical search tools as alternatives to general search; local search; the measurement of offline and latent purchases caused by search listing ... certain vertical search categories were holding their own, but that they also depended heavily on general search to maintain visitor levels.
Which I think reinforces my ideas about browse versus search ... arguably we can just offer up our browse metadata for free text searching, but I think we need more than that (so, yes, we'll do that). People in our game do want to do things like: "Find the simulation datasets which have preindustrial greenhouse gas forcings". To do that, I want them to be able to do a free text search on something to do with model forcings, not just on all information. This means we need some more sophistication in discovery metadata for numerical models, and surprise, surprise, that's what we're working on.
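To make the distinction concrete, here is a toy sketch of free text search over everything versus free text search restricted to one piece of the metadata. It is an illustration only, not the NDG design: the records, the field names and the search function are all hypothetical.

```python
# Hypothetical discovery records; the field names are invented for illustration.
records = [
    {"id": "run-a", "description": "Coupled model control run",
     "forcings": "preindustrial greenhouse gas concentrations"},
    {"id": "run-b", "description": "Preindustrial validation notes",
     "forcings": "observed greenhouse gas and aerosol forcings"},
]

def search(records, text, field=None):
    """Return ids of records containing every word of `text`,
    looking either in one named field or across all fields."""
    words = text.lower().split()
    hits = []
    for rec in records:
        if field is None:
            haystack = " ".join(str(v) for v in rec.values()).lower()
        else:
            haystack = str(rec.get(field, "")).lower()
        if all(w in haystack for w in words):
            hits.append(rec["id"])
    return hits

everything = search(records, "preindustrial greenhouse")
forcings_only = search(records, "preindustrial greenhouse", field="forcings")
print(everything, forcings_only)
```

Searching everything matches both records, because the words turn up somewhere in each; restricting the search to the forcings field leaves only the run that actually has preindustrial forcings.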
by Bryan Lawrence : 2005/05/31 : Categories ndg (permalink)
Project Management, Part One
As NDG1 nears completion (September), we are moving to try and deliver the first products of three years research on data grids in the environmental sciences. While NDG2 is funded to start immediately after, and provide "proper" deliverables, it is an interesting time to think about how "deliverables" work in a research project like NDG, and how we do the actual scheduling to achieve them.
Most of my career I have been an atmospheric scientist, and the research methodology I have used has not been what we have needed to deploy for NDG. However, there have been many similarities:
throughout we have treated conferences as milestones (just like "real" science). For conferences we have aimed to have something new to show (in the case of NDG, documents and/or prototypes), and
we have striven to write some papers.
Publishing for the NDG is difficult though: the material is generally not interesting to environmental scientists ("just make it work") or to computer scientists - here for example is a snippet from one review:
It was very disappointing not to see any allusion to related work or alternative approaches. Given that the submission reads like a report of work done and given that the work done is not particularly technically challenging (above the low-level issues of negotiating different languages, representations and environments) ...
which we found a bit harsh given
this was supposed to be a two-page extended abstract for what could have become a full conference paper ... just how much related work or alternative approach can you report in such a document? And,
"low-level issues negotiating representations" are what the whole project is actually about. Handling data interoperability is a hard problem! Not just technically, but organisationally (there have been semantic issues with which we have to grapple, and we believe we have new solutions). However no, we didn't invent any new algorithms ... which seems to be his/her problem ...
All the e-science programmes have had this problem: computer scientists dumping on applications of computer science as being intellectually beneath them ... you can tell I've been pissed off about this :-) Still, we've got a lot of stuff out there.
Anyway, in addition to publications and communications, NDG1 is about delivering something we can believe will be rolled out for real during NDG2, which means before NDG2 we have to have:
realistic prototypes
comprehensive plans and a roadmap for the future.
Of course "realistic prototypes" belong to the "how long is a piece of string" category ... just how many do we need? Well my take on that is we need to have demonstrated a priori each piece of what we think we can deliver as solid software in NDG2.
So how do we project manage that approach? Up until recently we haven't had anything to project manage, because while we've had a comprehensive software architecture, and some real prototypes, we've not had many of the other information objects well enough defined to fully specify any of the components which manipulate them. But we do now, so all of a sudden we have project meetings with a schedule, and milestones, and deliverables ...
When next I have some time, I'll cover some of the issues in actually dealing with schedules, milestones and deliverables in this context.
by Bryan Lawrence : 2005/05/25 : Categories ndg (permalink)
Cosmological Artists In Residence
I've just found out about Jem Finer who is "Artist in Residence" in the Astrophysics department of Oxford University.
What a fine idea!
The blog linked above has some fab photos of one of his projects: building a large-scale spiral tower supporting a radio dish in the Oxford University Parks which is representing the "Centre of the Universe". The press release (pdf) states that:
It is both a sculptural object set in the landscape and a working radio telescope. It's about ten metres high ... and will be available for public viewing from the 4th of June.
I'll be there at some point.
2005/05/25 (permalink)
Nodding off in Lectures
Some colleagues in atmospheric physics at Oxford found and circulated references to seminal work on the "Incidence of and risk factors for nodding off at scientific sessions".
The abstract identifies the methodology:
We conducted a surreptitious, prospective, cohort study to explore how often physicians nod off during scientific meetings and to examine risk factors for nodding off. After counting the number of heads falling forward during 2 days of lectures, we calculated the incidence density curves for nodding-off episodes per lecture (NOELs) ...
and towards the end they noted that:
The questionnaire administered to the nodders-off was revealing. Most were reassured to know that it wasn't their fault.
I guess with a middle name of Noel, I've got an even better excuse than just tedium-overload, but I too am glad it's not my fault :-)
2005/05/24 (permalink)
Dual Core Opterons
In general I try and avoid being a hardware geek, but it appears that I'm going to have to understand the relative performance issues between dual-core opterons and normal ones and xeons ...
I found this review, dated April 2005, which has the following snippets which are perhaps worth remembering:
From a core perspective, there's really nothing new to report about the Opteron chips. They consist of the same 64KB of data cache, 64KB of instruction cache, and 1MB of L2 cache. Only now, there are two of them manufactured at 90nm ... both cores attach directly to a System Request Queue and crossbar, over which they communicate with the package's three HyperTransport links and integrated memory controller. It still supports dual-channel DDR memory at up to 400MHz and those HT pathways still purr along at 1GHz. The only difference is now there are two cores utilizing them.
If you're worried about bottlenecking, don't be. AMD is claiming resource conflicts are rare and the impact of shared memory is a roughly 10 percent reduction in bandwidth. In cranking up the HyperTransport frequency last year, AMD helped circumvent any limitations there.
Of course that's all very well, but what about price performance? Well this review had a very interesting price table, which I've repeated here:
[Image: price table]
which demonstrates that the extra cores are on average three times more expensive.
The same article has an extensive list of benchmarks, not all of which are particularly relevant to us, but
the sql order test seems to demonstrate that quad xeon processors can't exploit their memory because:
the memory bandwidth limitation of the Intel FSB architecture does not allow the quads to really stretch their legs. On the other hand, the Integrated Memory Controller of the Opterons allow them to pull ahead.
the data warehouse test shows a clear advantage to the opteron, which is good given that it
is all about the number of instructions that can be completed in a given period. The ability to move data quickly in and out of the CPU are the characteristics of a winner ...
Linux hardware also has a review with a number of benchmarks that are possibly a bit more relevant. Here for example is the result from a 5000 element matrix multiply running on dual processors (I think the AMD875 is a dual processor where each processor is dual core):
[Image: 5000 element matrix multiply benchmark results]
What we see is that once we exploit all the power of the processors, the Opteron shines (but up to then it is perhaps the better maths library on the intel machines which is shining?)
Finally, how well does it do scaling, and what about relative integer/floating point performance? Infoworld has an article which includes those figures and other information. The key table is this one:
[Image: table of scaling and integer/floating point results]
which speaks for itself. The second core does remarkably well.
by Bryan Lawrence : 2005/05/24 : Categories computing (permalink)
Readership
Given that I have no comments and no trackback enabled, I wondered how many readers I actually have ... so for the first time, I had a wee look at my server log.
These are the figures (remembering I started blogging in November last year, and this month is unfinished):
| | Original | Update 1 | Update 2 |
| Nov | 125 | 100 | 9 |
| Dec | 283 | 240 | 60 |
| Jan | 730 | 574 | 347 |
| Feb | 1008 | 761 | 267 |
| Mar | 1085 | 742 | 283 |
| Apr | 1274 | 898 | 475 |
| May | 815 | 576 | 466 |
| Jun | x | x | 439 |
| Jul | x | x | 439 |
I found this out with the following quick script:
#!/bin/env python
# blog.log created by grepping for lawrence in the apache log.
file=open('blog.log','r')
lines=file.readlines()
months={}
exclude=('sstdlbnl','bot','207.46.98.70','crawler','search')
#
for line in lines:
    w=line.split(' ')
    ip=w[0]
    month=w[3].split('/')[1]
    if month not in months.keys():
        months[month]=[]
    if ip not in months[month]:
        # parse for bots and me at work:
        ignore=0
        for word in exclude:
            if ip.find(word)!=-1: ignore=1
        if not ignore: months[month].append(ip)
for month in months.keys():
    print(month,len(months[month]))
I'm sure there are more elegant ways of doing this, and one day it might be interesting to find out what people read ... but for now I'm comforted that my words aren't disappearing into the ether ... and more importantly, that my plan to roll out blogging wider into NCAS makes good sense.
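One slightly tidier way to do the same count is sketched below, under the same assumptions as the script above (Apache log lines with the client in the first field and the date in the fourth). The function name `unique_visitors` and the sample lines are made up for illustration.

```python
from collections import defaultdict

# Substrings that mark a client as a bot, a crawler, or me at work.
EXCLUDE = ('sstdlbnl', 'bot', '207.46.98.70', 'crawler', 'search')


def unique_visitors(lines, exclude=EXCLUDE):
    """Return {month: number of distinct clients} for Apache log lines."""
    months = defaultdict(set)
    for line in lines:
        fields = line.split(' ')
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        client = fields[0]
        # fields[3] looks like '[23/May/2005:10:15:32'
        month = fields[3].split('/')[1]
        if not any(word in client for word in exclude):
            months[month].add(client)
    return {month: len(clients) for month, clients in months.items()}


# Hypothetical sample lines, only to show the shape of the input.
sample = [
    '10.0.0.1 - - [23/May/2005:10:15:32 +0100] "GET /lawrence HTTP/1.1" 200 512',
    '10.0.0.1 - - [23/May/2005:10:16:02 +0100] "GET /lawrence HTTP/1.1" 200 512',
    'crawler.example.com - - [23/May/2005:11:00:00 +0100] "GET /lawrence HTTP/1.1" 200 512',
    '10.0.0.2 - - [01/Jun/2005:09:00:00 +0100] "GET /lawrence HTTP/1.1" 200 512',
]
counts = unique_visitors(sample)
```

Using a set per month removes the need for the explicit "have I seen this client already" test.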
2005/05/23 (permalink)
EU Patent Directive
There are many problems with the proposed EU patent directive, and I'm no expert on them. Groklaw has a short article on the necessity for better wording to avoid what I call patent drift to frivolous patents (example).
The EU is I believe trying to protect software that accomplishes something physical, which seems fair, but what does physical mean? The problem is summed up exactly in the first (regrettably anonymous) comment to the above groklaw article:
An anti-lock braking sysem for a car is (in principle) patentable. It has a useful physical effect.
The same software, the exact same bit string, used as an anti-lock braking system in a physically-realistic computer game, is not patentable.
The child who writes the computer game and spreads it round the world is to be thanked for educating us. Not taken to court and asked to hand over 90 million dollars.
Figure out how to say that in legalese, and we will get it right.
by Bryan Lawrence : 2005/05/23 : Categories curation (permalink)
Cherry Picking
There is a nice article on "Cherry Picking" at prometheus. I especially liked this:
Every time anyone makes an argument and invokes facts or information they have some agenda for doing so (except Michael Crichton, that is).
I liked it because a) it's true, and b) it "respects" the position that Michael Crichton is an environmental saint :-). Regular readers will know my position on that.
And again, the bottom line:
Anytime someone uses facts or information to make an argument, that use is selective. Cherry picking is inevitable. But it is important to recognize that how one uses information can either foster or damage legitimacy and authority ...
by Bryan Lawrence : 2005/05/23 : Categories climate crichton (permalink)
There is no such thing as an XML editor
Oh how I agree with this (via) on the sad state of play of XML editors.
... and the moment we have one, a proper configurable one, just imagine how much easier it'll be to get metadata!
by Bryan Lawrence : 2005/05/18 : Categories xml (permalink)
On NCAS Blogging Policy
If we do start hosting blogs for NCAS staff, then we'll need policy. I've mentioned this before. It's nice to see via Tim Bray (as usual!) that IBM have published some policy guidelines too. I like Tim's analysis of their guidelines, but the thing I picked out as relevant for our scientific community to think about is this:
Whether or not an IBMer chooses to create or participate in a blog or a wiki or other form of online publishing or discussion is his or her own decision. However, it is very much in IBM's interest -- and, we believe, in each IBMer's own -- to be aware of this sphere of information, interaction and idea exchange:
I would argue the same is true of scientists, particularly those of us in the environmental sciences. What we do is of significant public interest, and blogging gives us a way to publicly discuss things that are of public interest. Of course one needs to be careful about prejudicing material appearing in refereed publications because of "prior publication" (on the web in your blog), but I don't think that will be a show stopper. (More on that at a later date).
The IBM policy goes on to say (following on from "It is in IBM's and IBMer's interest ... to be aware"):
To learn: As an innovation-based company, we believe in the importance of open exchange and learning -- between IBM and its clients, and among the many constituents of our emerging business and societal ecosystem. The rapidly growing phenomenon of blogging and online dialogue are emerging important arenas for that kind of engagement and learning.
Same arguments for our community.
And on:
To contribute: IBM -- as a business, as an innovator and as a corporate citizen -- makes important contributions to the world, to the future of business and technology, and to public dialogue on a broad range of societal issues.
Same arguments for us (we make important contributions etc). And on:
In 1997, IBM recommended that its employees get out onto the Net -- at a time when many companies were seeking to restrict their employees' Internet access. We continue to advocate IBMers' responsible involvement today in this new, rapidly growing space of relationship, learning and collaboration
So, for my part, I recommend to my scientific colleagues: get out there. However, if you want us to host your blogs, then you're going to have to wait just a wee bit longer. Leonardo development is progressing, and when I'm satisfied it's stable1 enough for us to offer a service, then we'll roll it out for badc staff and then the wider community.
Before then I'll need to get onto the policy ... but not yet :-)
by Bryan Lawrence : 2005/05/17 : Categories badc (permalink)
Scientific Consensus and Climate Change
There is a nice short piece on consensus and climate change in Science by Roger Pielke Jr. This was following up the Oreskes paper I highlighted back in November.
The bottom line is that he takes this:
I have seen one claim made that there are more than 11,000 articles on "climate change" in the ISI database and suggestions that about 10% somehow contradict the IPCC consensus position.
and rightly points out that that doesn't change the consensus position one little bit. As he says, a consensus is simply "A measure of central tendency". I like the bottom line though, that one shouldn't make policy only on the central tendency, policy needs to admit the diversity of perspectives.
by Bryan Lawrence : 2005/05/16 : Categories climate (permalink)
Grid Architecture
In the NDG we're grappling with exactly how we deploy the components we've designed and started building. There is a nice simplified view of the way these things have to work in the SAKAI EVALUATION EXERCISE report by Crouchley et al. (pdf):
[Image: simplified architecture view from the report]
by Bryan Lawrence : 2005/05/16 : Categories ndg (permalink)
Disappointment with Suse 9.3
The bottom line is that several attempts to install it have failed, which has meant that I've been essentially computerless for the past 72 hours. Fortunately, I was attempting to install it in an empty partition, so I haven't broken my data (or old systems) ... mind you it was invisible because Suse 9.3 has been breaking my boot loader. I tried:
a vanilla install, ended up with garbage in /boot/grub/menu.lst ... so nothing would boot (fixed that with rescue system from the install dvd), but then we can't finish the install properly and at least one other file was corrupt (/etc/sysconfig/clock) so don't know how many others were too ...
tried again with the cds, not the dvd ... exactly the same files corrupt in exactly the same way.
tried again by duplicating my old suse 9.2 system, and doing an update. This time it completely blew away my mbr ... so was reduced to reminding myself how this works. Fortunately there is an excellent grub tutorial by Nerderello at linux forum, so have recovered myself back to where I was at Thursday lunch time. Given how little spare time I have at the moment this has been a big problem for me ...
(Although I purchased my copy of Suse 9.3, and could have asked for support, it's not been possible to go down that avenue ... yet). So now I have my 9.2 system back ... I'll think about accessing whatever is left in that upgrade ... but not for a few days ...
by Bryan Lawrence : 2005/05/15 : Categories computing (permalink)
Citation Decay Rate
I found out via the Digital-Preservation Mailing List about this article about the decay rate for online citations.
I can't read the actual article because we don't subscribe, but a quick scroll of reports about it states that in a study of five journals (Human Communication Research, the Journal of Broadcasting & Electronic Media, the Journal of Communication, Journalism & Mass Communication Quarterly, and New Media & Society) and 1126 articles in the years 2000 to 2003 that:
373 of the citations now did not work at all, a decay rate of 33 percent; of those that worked, only 424 took users to information relevant to the citation. In one of the journals in the study, 167 of 265 citations did not work.
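The percentages implied by those counts can be checked with a few lines of Python. This is only arithmetic on the numbers quoted above; it assumes that 1126 is the total number of online citations examined, which is what makes 373 come out at 33 percent.

```python
total = 1126          # assumed to be the total number of citations checked
broken = 373          # citations that did not work at all
relevant = 424        # working citations that still led to relevant material
worst_broken, worst_total = 167, 265   # the worst journal in the study

working = total - broken
decay_rate = 100.0 * broken / total
relevant_of_working = 100.0 * relevant / working
worst_rate = 100.0 * worst_broken / worst_total

print(f"decay rate: {decay_rate:.0f}%")
print(f"working citations: {working}")
print(f"still relevant among working: {relevant_of_working:.0f}%")
print(f"worst journal decay rate: {worst_rate:.0f}%")
```

So on that reading only a little over half of the links that still resolve actually lead somewhere useful.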
Amusingly, the article describing this problem appears in a journal (Scholarly Communications Report) which does not appear to use the DOI system. (SCR does carry an ISSN number but I have yet to find anything useful one can do with an ISSN number). I wonder whether SCR will move or reorganise their website in the future? In case you wonder whether they might, the link to that paper was:
www.extenza-eps.com/extenza/loadHTML?objectIDValue=64138&type=abstract
Even the pdf file itself had this link:
http://www.extenza-eps.com/extenza/loadPDFInit?objectIDValue=64138
which at least implies they ought to be able to without breaking the link.
On a more local scale, I wonder how well blog "permalinks" will survive. I realise that much of the blogosphere is truly ephemeral, but there is much good stuff out there too which ought to be preserved. In my case, I'm starting to treat this blog like a list of notes of "public record", and I imagine I'll want them for decades to come.
I'm pretty confident that many of the links will decay pretty quickly. But I feel obliged to ensure that mine won't (after all, I run a centre dedicated to preserving data).
How will we deal with permanence and migration? Firstly, for this to survive, the URL home.badc.rl.ac.uk/lawrence has to survive and point to the right stuff. Secondly, either leonardo has to survive, or the xhtml content has to survive. Obviously taking the long term, it's the xhtml that might survive (even if it's in a new format in the future).
So to make this happen we need to
guarantee to preserve home.badc.rl.ac.uk/lawrence in perpetuity, and
ensure that leonardo can export easily the entire contents as static pages.
The latter is a preservation issue by which all content management systems need to be judged. The former is ok for the small numbers of NCAS/BADC staff, but if we roll leonardo out more widely for NCAS we might want to think about a more suitable URL for long term preservation.
(Of course none of this precludes the writer of the blog moving onto a new URL and new software, but the content needs to survive).
by Bryan Lawrence : 2005/05/11 : Categories curation : 1 trackback (permalink)
Prefetching etc
There is considerable discussion on Google's Web Accelerator and how it has highlighted problems with folks' applications. See for example:
The bottom line is that it works by grabbing all the links off a page and caching them ... which is fine except for when they might be links which do something like delete x ... Simon's article and the comments in it discuss this in some detail, and it makes scary reading. This appears to be mainly a problem with HTTP GET though, so maybe it won't be a big problem for our services ... but we need to make sure.
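A minimal sketch of the defensive pattern, assuming a made-up in-memory store and a hypothetical `handle` dispatcher rather than any of our actual services: anything that changes state is refused on GET, so a prefetcher that follows every link can read but never delete.

```python
SAFE_METHODS = {'GET', 'HEAD'}


def handle(method, path, store):
    """Dispatch a request against a dict-backed store (illustrative only).

    Returns an (HTTP status code, body) tuple.
    """
    action, _, key = path.strip('/').partition('/')
    if action == 'view':
        if key in store:
            return 200, store[key]
        return 404, 'not found'
    if action == 'delete':
        # Destructive actions must never be reachable through a safe method.
        if method in SAFE_METHODS:
            return 405, 'use POST to delete'
        if store.pop(key, None) is None:
            return 404, 'not found'
        return 200, 'deleted'
    return 404, 'not found'


store = {'x': 'some record'}
prefetch = handle('GET', '/delete/x', store)    # what a link prefetcher would send
explicit = handle('POST', '/delete/x', store)   # a deliberate user action
```

Here the prefetched request gets a 405 and leaves the record alone, while the explicit POST removes it.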
by Bryan Lawrence : 2005/05/10 : Categories computing (permalink)
Internationalisation in Metadata
Clearly scientific metadata is international in context. Even if some folk think that English is the language of science, the reality is that much metadata will not start in English - why should it?
Anyway, there are two threads pushing this issue right now that I need to keep an eye on (how many eyes can one person have?):
The HDF/NetCDF amalgamation team wishes to support UTF-8 unicode character strings. See the draft RFC. There is also a useful email by Russ Rew on the hdf-netcdf mailing list (but I can't find an archive for this).
The CF community seems to have decided (when, how?) that American English is the spelling of choice for standard names. This has major implications for atmospheric chemistry I think. I need to find out what the international standards practice on chemical nomenclature is. We also need to work out how the standards community addresses this issue.
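As a small illustration of what UTF-8 support means for metadata strings (the title below is a made-up example, not taken from any real dataset), note that the number of characters and the number of stored bytes are no longer the same thing:

```python
# Hypothetical non-English metadata value; \u00e9 is e-acute.
title = 'Temp\u00e9rature de surface'

encoded = title.encode('utf-8')      # the bytes that would be written to a file
decoded = encoded.decode('utf-8')    # what a reader gets back

n_chars = len(title)
n_bytes = len(encoded)
print(n_chars, n_bytes, decoded == title)
```

Any code that sizes fixed-length string fields by character count needs to allow for that difference.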
by Bryan Lawrence : 2005/05/06 : Categories badc curation cf metadata (permalink)
WSDL or not
In a meeting yesterday I declared that I was becoming less suspicious of WSDL, although I still didn't think it was any use for "what service composition" or "service discovery" (via any sort of registry). It turns out that most folk I know are using WSDL to write code: they write the WSDL, and then use some code-generator to produce the web service and/or client stubs ... (or at least I think that's what they do).
Further, it certainly does seem that without agreed WSDL getting web-services to work across language divides seems fraught with difficulty (for example, our NDG discovery web service is written in Java, but getting python clients to work has so far been beyond us ... but at this stage we're not yet using a WSDL intermediary ... we'd better get onto it quickly).
Or do I really mean that? As in, do I really mean WSDL? Today, while our calendar and email service was down (and I therefore had no idea what I was supposed to be doing) I found Tim Bray yet again weighing in with some good thoughts, this time on the WSDL issue. He describes SMEX-D:
SMEX stands for Simple Message Exchange, and SMEX-D for SMEX Descriptor, an XML language designed to provide simple descriptions of a wide range of Web-Service message exchanges, both REST-based and SOAP-based.
In Replacing WSDL Tim also mentioned Norms Service Description Language. The latter started out by stating:
it must be possible to describe services so that compilers can build interfaces to them, that's the only way to make them accessible to "ordinary programmers" who don't care about web services for web services sake.
He went on to say:
There seem to be two main requirements: (1) Make it possible for ordinary programmers to use web services as transparently as they use other code libraries. (2) Make it possible for ordinary service providers to describe their interfaces in a standard way so that some level of interoperability can be achieved.
No arguments with that ... but perhaps ordinary programmers need to invest the time in WSDL to get the benefits? Are we getting too used to easy-to-learn technologies? Are some powerful technologies worth the investment in learning to use them?
There appear to be other games in town. Tim also refers to RSWS the Really Simple Web Service descriptions ... and there is SSDL which is soapcentric.
The bottom line however is that a lot of smart folk seem to be giving up on WSDL ... and offering competing options which are certainly much simpler. The problem for me though is that to evaluate them, someone has to spend time doing it (who?), and we know a priori, that we will have very complicated object structures with which our web services will need to deal ... it's not obvious that these simpler description languages can cope with complicated object-types that are not in the standard XML namespace. Further, here and now, no one is writing code generators for them. (Or are they?) For better or worse, right now, we can write WSDL and get code generated from it (buggy or not).
Well, yesterday we took a decision to get down and dirty with WSDL in the NDG discovery interface. It's not a top priority right now though, but when we do, I'll report our experiences.
Update: Tim Bray has pointed to this useful summary of WSD Languages ...
Update (17th May, 2005): And Dion Hinchcliffe has more ... and it looks like there will be a lot more, but I won't keep linking to him from this entry.
by Bryan Lawrence : 2005/05/04 : Categories ndg computing (permalink)
Glaciers Indicate Temperature Increase
Back when I started criticising Michael Crichton, I commented on glacial length from what I freely admitted was a state of ignorance.
However, for some real expert opinion we can cite Oerlemans, 2005, in Science who states that:
Mass-balance modeling for a large number of glaciers has shown that a 25% increase in annual precipitation is typically needed to compensate for the mass loss due to a uniform 1 K warming. These results, combined with evidence that precipitation anomalies normally have smaller spatial and temporal scales than those of temperature anomalies, indicate that glacier fluctuations over decades to centuries on a continental scale are primarily driven by temperature.
Which means that the situation is not simple on the Greenland scale. But as usual, the issue with Crichton is he picks his data to tell his tale. The same Oerlemans article uses glacial records from 169 glaciers world-wide (although he admits a bias to European alpine observations) to construct a global temperature proxy based only on glacial length. The reconstruction is split into regions to avoid the European bias.
He finds that:
Moderate global warming started in the middle of the 19th century. The reconstructed warming in the first half of the 20th century is 0.5 kelvin. This warming was notably coherent over the globe.
Although looking at the figures, the coherency recently is certainly not global, with significant advances in many places. However, I'd be very intrigued to see what would happen if we looked at the precipitation records for the areas of significant advances (e.g. New Zealand). It's also difficult to argue that NZ is "continental scale" (neither indeed is Greenland, despite the size it gets on Mercator projection maps).
What I found particularly impressive is the figure with the reconstructed temperature series! I've snipped a bit of it out so you can get the flavour, although you should look at the original article for the real thing and the axis information:
[Image: excerpt from the reconstructed temperature series figure]
(I've deliberately degraded the image by removing the axis information and the line legend, so as to meet my own fair use criteria).
by Bryan Lawrence : 2005/04/29 : Categories climate (permalink)
Copyright, Blogs and Fair Use
One of the things I want to do with my blog is often cite research articles, and I often want to include a figure or text from those articles. There is obviously then an issue of copyright that I need to address.
There are no real guidelines1 on what to do, but I plan to do the following unless I find that I'm doing wrong:
I will always cite the original article.
I will never quote a substantial part of any article.
When reusing figures from any article protected by copyright I'll degrade2 them in some way (e.g. lower the resolution or remove legend and/or axis information) to encourage readers of my blog to go to the original article if they want to refer to it or use it themselves.
I hope that in this way I will remain within the fair use criteria ... as usual with these things I won't really know if this is fair use unless someone objects and claims it isn't. It does seem fair to me.
(Updated May 24th and again July 1st, 2005)
2005/04/29 (permalink)
US Likely to cut Earth Observation
Science is reporting that the U.S. are seriously considering cutting back on their global observation programme. It would appear that NASA and NOAA are in what I would call an inverse turf war: neither wants to take responsibility for "non-operational" earth observation missions.
Missions at risk include ones covering:
global precipitation
ocean vector winds
land cover
optical properties of aerosols
sea level altimetry.
by Bryan Lawrence : 2005/04/29 : Categories environment (permalink)
XML Databases
Ronald Bourret has a nice article in xml.com on xml databases, with lots of examples of real live systems using native xml databases, and some of the power of using xpath type queries on semi-unstructured documents (where you know something about some elements of the structure).
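To give a flavour of what an xpath-style query over a loosely structured document looks like, here is a sketch using Python's standard `xml.etree.ElementTree`, which supports a limited XPath subset. The document and element names are invented for illustration; a native XML database would run the same sort of query over a whole collection.

```python
import xml.etree.ElementTree as ET

# Two records with different structure; we only know that somewhere
# inside each one there may be a <title> and an <author> element.
doc = ET.fromstring("""
<collection>
  <record><meta><title>Ozone profiles</title><author>Smith</author></meta></record>
  <record><title>Sea ice extent</title><notes>no author recorded</notes></record>
</collection>
""")

# './/title' finds title elements at any depth below the root.
titles = [t.text for t in doc.findall('.//title')]

# Records that contain an author somewhere beneath them.
with_author = [r.findtext('.//title') for r in doc.findall('record')
               if r.find('.//author') is not None]
```

The query does not care that the two records nest their titles differently, which is the point of the "know something about some elements" approach.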
by Bryan Lawrence : 2005/04/28 : Categories xml (permalink)
Sparklines
Joe Gregorio in his bitworking blog has some python code for producing sparklines (RFC2397 inline images as opposed to URI referenced images). He makes the point that it doesn't work on IE browsers. Regrettably it doesn't on my kde3.3 konqueror 3.3.2 browser either.
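For reference, an RFC 2397 `data:` URI just carries the image bytes inline, base64 encoded, after a media type. The sketch below is not Joe's code: it draws a tiny SVG polyline instead of a PNG so that it needs nothing beyond the standard library, and the function names and sample values are my own.

```python
import base64


def sparkline_svg(values, width=100, height=20):
    """Return a minimal SVG document drawing values as a polyline."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1                     # avoid dividing by zero
    step = width / max(len(values) - 1, 1)
    points = ' '.join(
        f'{i * step:.1f},{height - (v - lo) / span * height:.1f}'
        for i, v in enumerate(values)
    )
    return (f'<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width}" height="{height}">'
            f'<polyline fill="none" stroke="black" points="{points}"/></svg>')


def data_uri(payload, media_type):
    """Wrap raw bytes as an RFC 2397 data: URI."""
    encoded = base64.b64encode(payload).decode('ascii')
    return f'data:{media_type};base64,{encoded}'


svg = sparkline_svg([125, 283, 730, 1008, 1085, 1274])
uri = data_uri(svg.encode('utf-8'), 'image/svg+xml')
img_tag = f'<img src="{uri}" alt="sparkline"/>'
```

Whether a given browser renders the result is exactly the compatibility question raised above.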
by Bryan Lawrence : 2005/04/28 : Categories python (permalink)
Managing Action Items
43 folders (via dirtsimple) has a nice discussion of how to manage one's action items, particularly in the context of categorising things into "next actions". I liked the following key points about action item lists (paraphrased):
Actions need to be atomic (if not, put them on a project to-do list)
Make them physical actions (not "think about", but "write notes" etc)
Make sure that any dependencies are resolved
If it's not something one is committed to (i.e. learn about something), then put on a different sort of list (e.g. on-hold, learning list, etc).
Make them well defined (begin with a physical verb, e.g. email, call, recode, visit etc).
Make sure the items are complete enough that you know what needs doing and why tomorrow, or next week, when you actually get to doing it.
by Bryan Lawrence : 2005/04/28 : Categories badc management (permalink)
Sleepless in Oxfordshire
Like all new parents, I'm not getting enough sleep, but I'm back at work, sort of. Anyone wanting gratuitous baby photos can look here ...
2005/04/26 (permalink)
Fine Excuse for Digital Silence
This blog will go silent for a couple of weeks now, as our first child arrived on Tuesday night, and will be home from the hospital shortly. Everyone tells me I'll be too tired to do anything. I believe them.
2005/04/14 (permalink)
To trac or not to trac?
... that's the question!
There are a host of blog entries about trac out there: (1, 2, 3, 4 ... and more).
The bottom line appears that everyone loves trac, but installing it is a nightmare. It runs on subversion, which is a good thing, but a raft of packages are needed ...
Right now we are desperate to find a better way of managing our disparate activities and coding for the NDG ... and this looks good. If anyone has a better idea, I'd like to hear it!
by Bryan Lawrence : 2005/04/11 : Categories management ndg badc (permalink)
Satellite Temperature Trends
There has been some contention over exactly what the satellite data tells us about temperature trends. In particular, analysis of data from the MSU instruments aboard the NOAA satellites has been interpreted in a number of ways.
Nathan Gillett gave us a seminar today where he reported work done on repeating the Fu et al 2004 analysis (in a model). Nathan's work (coauthored with Ben Santer and Andrew Weaver) showed that the Fu analysis gave a good estimate of actual trends (comparing pseudo obs in the model with actual model values). That being so, the methodology should work in the real atmosphere, which means that the satellite obs are consistent with tropospheric warming and stratospheric cooling associated with greenhouse gas induced climate change.
So, he's supporting the Fu et al analysis. The bottom line of that analysis is that if you take just the T2 channel of the MSU, the "tropospheric" channel, you don't see much trend. However, T2 extends1 into the stratosphere, and so it's "contaminated" by the stratospheric cooling. However, Fu et al (verified by Gillett et al) have shown that you can untangle the stratospheric signal, and when you do that, the satellites show clear tropospheric warming signals. Here is the key result from the Fu et al paper:
[Image: key result figure from Fu et al]
What you see is that in case a) the analyses (two different methods) from satellite don't give as large a warming trend as the surface observations (i.e. this is the result that the greenhouse skeptics keep reporting). However, if one corrects for the stratospheric influence on the T2 channel, one gets the results in b). That is, the satellite data supports the surface analyses (albeit weakly in the SH for the UAH analysis).
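The idea of untangling the two signals can be sketched with a toy calculation. The weights and trends below are made-up round numbers chosen only to show the mechanism; they are not the values used by Fu et al, and the sketch simply assumes that an independent estimate of the stratospheric trend is available.

```python
# Toy numbers, purely illustrative (K per decade and fractional weights).
w_strat = 0.15            # assumed fraction of the T2 signal from the stratosphere
w_trop = 1.0 - w_strat    # the rest comes from the troposphere
trop_trend = 0.20         # assumed "true" tropospheric warming
strat_trend = -0.50       # assumed stratospheric cooling

# What a channel that sees both layers would report.
t2_trend = w_trop * trop_trend + w_strat * strat_trend

# Given an independent estimate of the stratospheric trend,
# remove its contribution and rescale.
recovered = (t2_trend - w_strat * strat_trend) / w_trop

print(f"raw T2 trend:       {t2_trend:+.3f} K/decade")
print(f"corrected estimate: {recovered:+.3f} K/decade")
```

Even a modest stratospheric weighting is enough to roughly halve the apparent warming in this toy case, which is the sense in which the raw channel is "contaminated".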
by Bryan Lawrence : 2005/04/11 : Categories climate (permalink)
Suse 9.3 Looming
My personal computational environment consists of my work laptop and my home computer. Both run Suse 9.2, and on the whole I'm pretty happy with Suse. I've been upgrading regularly on my laptop every six months (on the Suse release cycle) for years now ... and every time it's got easier (the first few times I had to do kernel rebuilds and all sorts of hacks to support my laptop).
9.3 is coming now, and I've preordered my copy. I more or less have to upgrade on my laptop because I've broken Yast (no I don't know how, and haven't bothered to find out how), and I want to be able to do security upgrades automagically ... I need that, because my laptop gets connected everywhere. There seem to be some other good reasons too:
xen seems to be an excellent virtualisation tool, and we need something like that in the BADC. We're about to deploy a development system to support multiple O/S instances, but they're all Linux based, so xen may save us the vmware cost and overheads.
OpenOffice 2.0 (ok, a pre release) ... just last week I had (yet another) nightmare with a Word Document ... neither I (running Word-2000 on Cross-Over Office) nor my colleagues (running Word-2003 on their XP systems) could remove a footnote that had spuriously appeared ... until I found the problem using CrossOver office (which reads more and more of the docs I get by email that Word-2000 does not) ... maybe with Suse 9.3 now will be the time for me to change to OO for good ... (I'll keep CrossOver office on until
I can get korganiser to communicate to our exchange server (and lose the Outlook calendaring support I currently use crossover office and outlook for), and
I find something on linux as good as visio ...)
Adobe pdf reader version 7 ...
I have to confess once upon a time I would have been downloading these things and playing with them already ... but I seem to have less and less time for mucking with my system ... (or desire to do so) ... I think this means that the systems are better, and I'm becoming more and more a candidate for a Macintosh.
It's not so obvious that I want to upgrade my home computer. After much mucking around, I've got the multimedia situation where I want it, but it sounds like multimedia support in 9.3 is broken without some package downloads ... and as I say, my mucking with the system threshold is that much higher now :-)
by Bryan Lawrence : 2005/04/10 : Categories computing (permalink)
Arctic Climate Change
I've just found the Arctic change site via Overland and Wang, GRL, 2005. The website has some excellent graphs of climate change indicators for the arctic region. Two of the most interesting (my definition, there are lots of others that you might find interesting, go look ...) are
[Image: tundra area graph]
which is the Tundra area based on raw satellite data and on a classification scheme using two alternative input temperature datasets (I have a local copy of the image in case that goes missing).
Also interesting is the sea ice extent:
[Image: sea ice extent]
(Similarly, here is a local copy).
by Bryan Lawrence : 2005/04/10 : Categories climate (permalink)
Glacier Retreat
The NERC quarterly mag reports that eighty-seven percent of glaciers in Antarctica have retreated since 1940, based on an analysis of maps of the continental ice-sheet margin.
Mind you, the 244 glaciers analysed are all in the vicinity of the Antarctic peninsula. As the article states, the glacial retreat pattern has been different from that of the ice shelves, which suggests that warming may not be the sole driver of retreat in the Antarctic Peninsula, but that other boundary conditions may be playing a significant role. Nothing new in that, but the skeptics always need to be reminded that we don't think it's all about climate change all the time ...
(Update, April 29th: These results are published properly in Cook et al., 2005).
by Bryan Lawrence : 2005/04/08 : Categories climate (permalink)
Another excellent summary of climate modelling
The Institute of Physics commissioned Alan Thorpe to explain how predictions of future climate change are made using climate models. They did so hoping
that the paper will increase believability in these models and be persuasive that anthropogenic activity is likely to be causing global warming. It aims to convince policy-makers, the general public and the scientific community that the threats posed by global climate change are real.
The paper (pdf) begins:
Like the weather, everyone has a view on climate change but, as will be discussed, not all such views, ... are equally defensible on scientific grounds.
Climate change is a fundamental problem involving basic science including physics ... There is little doubt that a lack of knowledge about how climate change is predicted and the associated uncertainties are amongst the main reasons for ill-informed comment on climate change.
and I think that's the bottom line. People need to understand how it's done, and the real limitations, and lack thereof ...
Later on we have
Scientists are appalled that they could be suspected of distorting the evidence to enhance their reputations or funding opportunities. Of course scientific hypotheses and analysis can be refuted by later discoveries but this is not the same as complicity. The fact that everyone experiences weather and climate may explain why nonscientists feel confident in attempting to refute the scientific evidence. The complexity of the climate system and its many interacting and compensating physical processes means that simple arguments that gloss over this complexity have to be approached with a significant degree of scepticism.
Hear, hear!
A common method of arguing starts by identifying a single cause or physical process that either has not been included or has been included in an imperfect way, into climate models. But the climate changes because of a multiplicity of interacting processes and any one process alone cannot be the whole story. ... Climate modellers attempt to include in the models all the processes that are even remotely likely to have a detectable effect, any newly discovered process will quickly find itself incorporated into the models!
As an aside, RealClimate has a nice discussion of one such single process issue that causes a few contrarians grief: the water vapour feedback/forcing issue. Well worth a read.
by Bryan Lawrence : 2005/04/08 : Categories climate crichton (permalink)
Loaded Dice
Suppose a weight is added to a dice, and detailed physical calculation predicts that this will increase the number of 6s at the expense of 1s, but a sceptic refuses to believe the calculation: how many times do we need to throw the dice in order to convince the sceptic?
We all know that if I throw a (fair) dice, I have a one-sixth chance of any given number. If I throw it enough times, the expectation is that the average of all my throws should be
$$E_1 = \sum_{i=1}^{6} p_i x_i = \frac{1}{6}(1+2+3+4+5+6) = 3.5$$
Imagine now that we have a dice loaded so that it is more likely to throw a six (3 times in 12 throws) and less likely to throw a one (1 time in 12), so that my expectation is that
$$E_2 = \frac{1}{12}(1) + \frac{1}{6}(2+3+4+5) + \frac{3}{12}(6) \approx 3.9$$
Now, let's ask the question, if I gave you one of those two dice, but didn't tell you anything about the loaded one apart from the fact that
it is loaded with an expectation that if you threw it many times, the average would be 3.9
Could you tell the difference, i.e. which one I had given you?
Well, simple probability suggests the error in your estimate of the expectation would be, for the first dice:
$$\sigma_1 = \sqrt{\sum_{i=1}^{6} p_i (x_i - E_1)^2} \approx 1.71$$
Similarly, allowing for the altered weights, and altered mean,
$$\sigma_2 = \sqrt{\sum_{i=1}^{6} p_i (x_i - E_2)^2} \approx 1.66$$
(not much difference then). For n throws of the dice, the error in my estimate of the mean value of the dice is
$$\epsilon_n = \frac{\sigma}{\sqrt{n}}$$
So, to distinguish between them, we need to throw the dice enough times that the error is less than 3.9-3.5=0.4, that is we need
$$n > \left(\frac{\sigma}{E_2 - E_1}\right)^2 \approx 17$$
throws (using the largest estimate for σ).
What this tells us is that, on average, seventeen throws of the dice should be enough to tell which dice it is (at the one sigma level). Of course, any given sequence of throws may, or may not, conform to the probability. It is entirely possible that one throws a six, every time, with both dice ... all this is saying is that after seventeen throws of the dice, most such sequences will end up with an average value that will indicate which dice is used.
Of course, to be properly sure, we would recast this in terms of a null hypothesis, and depending on whether you can argue that this is a one or two tailed problem you might end up arguing you need twice or four times as many throws to reach 97% confidence ... Even better, we could do a proper chi-squared test, and if we knew that the loading didn't alter the 2 to 5 values, we could test only on sixes etc ...
... but the argument is much simpler without the statistical test. The key point is that to know whether the dice is loaded, you do have to appeal to statistics!
For the record, I used the following piece of python to do the calculations:
## Loaded Dice
## BNL, April, 2005
## (numpy stands in here for the now obsolete Numeric module)
import numpy as N
import math as M
def sigma(x,p):
    # return the expectation and standard deviation of x under probabilities p
    e=sum(x*p)
    s=M.sqrt(sum(p*(x-e)**2))
    return e,s
x=N.arange(6.)+1 # dice has sides one thru six.
p1=N.ones(6)/6. # normal dice probabilities
p2=p1.copy() #loaded dice probabilities, I needed to use copy,
p2[0],p2[5]=1./12.,3./12. #else changing part of p2 would modify p1
e1,s1=sigma(x,p1)
e2,s2=sigma(x,p2)
print('Expectation values for dice one and two are: ',e1,e2)
print('Sigma values for dice one and two are: ',s1,s2)
# four times the one sigma count, as per the confidence argument above
print('Number of tries required:',4*(int((max(s1,s2)/(e2-e1))**2)+1))
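As a rough cross-check on the argument above, here is a small simulation sketch using only the standard library. It is not the calculation I did at the time: the midpoint decision rule, the trial count and the seed are all assumptions chosen purely for illustration. It throws each dice n times, guesses which dice it was from which side of the midpoint the average falls, and counts how often the guess is right.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
FAIR = [1 / 6] * 6
LOADED = [1 / 12, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 3 / 12]
# Assumed decision rule: split at the midpoint of the two expectations.
THRESHOLD = (3.5 + 47 / 12) / 2


def mean_of_throws(weights, n, rng):
    """Average of n throws of a dice with the given face probabilities."""
    return sum(rng.choices(FACES, weights=weights, k=n)) / n


def fraction_identified(n, trials=5000, seed=2005):
    """Fraction of n-throw experiments whose average falls on the right side."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        if mean_of_throws(FAIR, n, rng) < THRESHOLD:
            correct += 1
        if mean_of_throws(LOADED, n, rng) >= THRESHOLD:
            correct += 1
    return correct / (2 * trials)


if __name__ == "__main__":
    for n in (17, 68):
        print(n, "throws:", fraction_identified(n))
```

The fraction identified correctly goes up with the number of throws, but a one sigma separation is well short of certainty, which is the point of the remark about needing rather more throws for a proper test.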
by Bryan Lawrence : 2005/04/07 : Categories climate (permalink)
Predictability and Crichton
Well, back to that book :-)
On page 284 we find Crichton (or his characters, it's fiction damn it), totally misunderstanding the nature of predictability. He reinforces it absolutely on page 570 in his author's message (where we can't blame it as fiction). He then goes on to make his own prediction for the amount of warming over the next century: 0.812436 degrees C, and then states (my emphasis):
There is no evidence that my guess about the state of the world one hundred years from now is better or worse than anyone else's ... we can't ... predict it. ... we can only guess ... an informed guess is just a guess.
And there we have it. The last word from Crichton. I think this blog entry on Crichton will be my last on the book per se, although I will take up one more issue from it later (that of over-attribution of weather events to climate change, but that's for another day). But, it's a good one to finish with, because these snippets highlight how little he's understood.
So, what is climate predictability? Let's start with what it's not. It is not deterministic predictability. What is deterministic predictability? If I launch a missile at the moon, with a specific velocity, then I know to an enormous degree of accuracy where that missile will go. This is an example of a system where the state of the system is (nearly) totally described by deterministic equations and entirely predictable from the initial condition (launch position and velocity).
If we examine our missile example a bit further, we find that our equations consist of equations of motion and coefficients that depend on the planet mass and mass of the moon. You can think of the mass of the planetary bodies as being boundary conditions. Clearly, if I launch my missile in another planetary system, I'll have the same equations, but different coefficients, but you can rely on my predictions of what will happen there because I've tested my system rather accurately here with earth/moon boundary conditions. The key thing here is that the system is totally constrained by the initial conditions (given the boundary conditions) ...
Now let's make the situation just a lot more complicated. Instead of just having the equations of motion involved, let's add a thermodynamic equation, and recast the equations for a fluid medium not a particle. At this point we have a system of equations called the Navier-Stokes equations. These are the foundations of everything we know about aerodynamics and hydrodynamics. People design aircraft wings with these things ... but they describe a system that is not always deterministic, that is, it is possible for there to be states where the initial condition does not allow predictability of the system state some time ahead (but sometimes it does, more of this below).
If we do some scale analysis of those equations (i.e. deal with the fact we want to solve these equations on our rotating planet and we have only finite computing), we end up with a system of equations we call the primitive equations. If we add an equation of state for water in various guises, we then have a system where the fundamental equations are as well understood (and as reliable) as our missile equation, but the nature of the things that they can describe has changed significantly. If we add some equations for radiation, clouds, and to simulate the effect of scales not resolved, we have a weather prediction model.
The predictability of weather is well described in a Met Office page on the concept of ensemble prediction. Go read that, and come back ... :-)
The key points to understand from the Met Office article are that, for the weather system,
initial condition predictability is poor beyond a week or two, but
it does support some elements of predictability, and
you can make probabilistic forecasts based on averages which are better when the system is in some states than others.
But weather isn't climate. Climate is about averages. We inevitably want to know about climate at times a long way past that initial condition predictability of the system. In the same way as the planetary boundary conditions affected the missile, they affect weather and climate, and in particular, the boundary conditions dominate over the initial conditions for climate.
Why is that? Well, consider another simple system. A dice. At the risk of pushing my analogies, consider the average value of a series of dice throws to be a proxy for the climate state. Clearly the initial condition of how I throw the dice is pretty irrelevant to what I get on any throw, that's dominated by the physical nature of the dice. A fair dice ought to give me a sequence of throws which average to 3.5. But if I load the dice then I can get a different average value.
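To make the analogy a bit more concrete, here is a minimal sketch (an illustration only: the random seed is a loose stand-in for the "initial condition" of how the dice is thrown, and the loading of 1/12 for a one and 3/12 for a six is an assumed example). Each run starts from a different seed, yet the long run average is set by the weights on the faces, not by the seed.

```python
import random

FACES = [1, 2, 3, 4, 5, 6]
FAIR = [1 / 6] * 6
# Assumed loading for illustration: fewer ones, more sixes.
LOADED = [1 / 12, 1 / 6, 1 / 6, 1 / 6, 1 / 6, 3 / 12]


def long_run_average(weights, seed, throws=100000):
    """Average of many throws; the seed plays the role of the initial condition."""
    rng = random.Random(seed)
    return sum(rng.choices(FACES, weights=weights, k=throws)) / throws


# Five different "initial conditions" for each dice.
fair_runs = [long_run_average(FAIR, seed) for seed in range(5)]
loaded_runs = [long_run_average(LOADED, seed) for seed in range(5)]

if __name__ == "__main__":
    print("fair dice:  ", [round(a, 3) for a in fair_runs])
    print("loaded dice:", [round(a, 3) for a in loaded_runs])
```

The individual throws differ completely from run to run, but the averages cluster around one value for the fair dice and a different one for the loaded dice.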
Back to the climate. The boundary conditions on climate depend on the timescale you are interested in and on the things whose climate you want to know. For example, if I want to know the global average temperature, then CO2 is a boundary condition (although it may not be a boundary condition if I choose to ask what is the amount of CO2 in the atmosphere on average ... that too is a climate question).
So, I can ask the meaningful question: what will the global average mean temperature be if I double CO2, and it's a boundary condition problem. I can make a prediction. But to evaluate whether my prediction is true, I have to look at an awful lot of cases (throws of the dice), and it'll be a statistical analysis in the end that will validate my prediction (to a certain level of confidence).
This requires me to understand the nature of the errors in my prediction, so I can do the test. That's why projects like http://climateprediction.net are so important.
However, the bottom line here is that no such analysis of the climate can ever give Mr Crichton the level of precision in prediction he wants. It's not a simple engineering problem (like firing a missile), and so on page 284 and at the end of his book, Crichton shows that he just doesn't understand the nature of predictability.
by Bryan Lawrence : 2005/04/07 : Categories climate crichton (permalink)
PDF Heads down hill?
Can this really be true?
It implies that PDFs can be contaminated (tagged) with what I would call a virus: a piece of digital rights management code that is fired up when a pdf is read ... and potentially stops it being read ...
update: ok, so virus was a bit over the top, but you know what I mean
by Bryan Lawrence : 2005/04/07 : Categories computing msxml (permalink)
xmlbase, xinclude and schemas
I found Norman Walsh's excellent discussion of issues with xmlbase, xinclude and validating schemas via Tim Bray.
xmlbase is the standard which supports the anchoring of relative xlinks from a document: using one of the Walsh examples, it is the mechanism by which the full URI of the following image is found
<imagedata fileref="picture.png"/>
Walsh is making the argument that the schemas are broken, although one of the comments states:
Personally, I don't consider this all that big a deal. Schema-validity is vastly overrated. I routinely add markup to my documents that is not accounted for by the schemas, and as long as you don't blindly throw away all invalid documents, everything pretty much works ... If you really need validity and XInclude, then you need to update your schemas/DTDs to support xml:base everywhere. It's probably a good idea to do that anyway.
The bottom line appears to be that
Your schema needs to support xmlbase explicitly ...
Regardless of all these things, the Walsh blog entry is a nice summary of how it all works ... and links to a good discussion of pipelining from xml documents to bring them together and process them to output. I guess I hadn't appreciated that the schema validation needs to be done after xincludes are completed.
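As a rough illustration of the xmlbase resolution discussed above, here is a sketch using the standard library ElementTree and urljoin. The document, the example.org URIs and the resolve_filerefs helper are all made up for the example (they are not from Walsh's article), and a real processor would also have to deal with xinclude and validation. Each element inherits the base URI of its parent, an xml:base attribute overrides or extends it, and the relative fileref is resolved against the result.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# xml:base lives in the reserved XML namespace.
XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"

# Hypothetical document: two nested xml:base attributes and a relative fileref.
DOC = """<book xml:base="http://example.org/docs/">
  <chapter xml:base="chapter1/">
    <imagedata fileref="picture.png"/>
  </chapter>
</book>"""


def resolve_filerefs(element, base=""):
    """Return the absolute URI of every fileref at or below element."""
    # An xml:base on this element is itself resolved against the inherited base.
    base = urljoin(base, element.get(XML_BASE, ""))
    found = []
    if "fileref" in element.attrib:
        found.append(urljoin(base, element.attrib["fileref"]))
    for child in element:
        found.extend(resolve_filerefs(child, base))
    return found


uris = resolve_filerefs(ET.fromstring(DOC))

if __name__ == "__main__":
    print(uris)
```

The schema problem arises because the xml:base attributes have to be allowed on every element where they might turn up once the xincludes have been expanded.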
by Bryan Lawrence : 2005/04/04 : Categories xml (permalink)
Function Creep and Institutional Repositories
The case for institutional repositories (IRs) is well made in Crow et al.: The Case for Institutional Repositories: A SPARC Position Paper (link)... as long as the definition of "digital collections capturing and preserving the intellectual output of a single or multi-university community" is appropriate.
I think many of those who are considering establishing, funding, and running IRs are visualising IRs which encapsulate documents (whether published or not), but the natural tendency of many is to be (dangerously) all inclusive. Statements like these abound in the IR literature:
The vast global corpus of heterogeneous data that the repositories represent can be curated by the local content managers best prepared to accommodate each data set's specific detail and particularities (for example with detailed metadata appropriate to the content).
... the disaggregated model includes not only pre-prints and research papers, but also extends to research data sets ...
Access to disparate content types will drive retrieval interfaces to evolve from text search models to search engines capable of handling complex data types (for example, biological information, cell structures, and genome structures).
This material might include student electronic portfolios, classroom teaching materials, the institution's annual reports, video recordings, computer programs, data sets, photographs, and art works-virtually any digital material that the institution wishes to preserve.
(all these from Crow et al., but they're hardly unique). The problem of course is that once one asks what a digital repository is for, one gets answers in terms of open access and preservation. Readers of my blog will know that I'm totally in favour of open access, but I don't believe too many of the folk who are advocating IRs including non-document data have thought too much about the preservation problems they will be entertaining with data as opposed to documents.
It is precisely the issue of preserving data that has motivated the establishment of the National Digital Curation Centre here in the UK; not to do digital curation, but to advise on it. Nationally, it is recognised that coping with the data deluge is a problem, yet most of these IRs seem to think that they can be set up with relatively modest investment.
The idea that
institutional repository systems must be able to accommodate thousands of submissions per year, and eventually must be able to preserve millions of digital objects and many terabytes of data.
is fine. However, it is technically feasible for an institution at modest cost if, and only if, one limits the format and variety of digital objects in the repository. This is because, while
libraries can most effectively provide much of the expertise in terms of metadata tagging, authority controls, and the other content management requirements that increase access to, and the usability of, the data itself.
is true for document-style metadata, the metadata required to access, understand, and manipulate scientific datasets is always going to be the preserve of domain-experts. At any given time, no institution is likely to have the appropriate mix of individuals to maintain and migrate for the future all the data and metadata it has produced in the last year, let alone over the institution's digital lifetime.
It's not just me that thinks like this,
One thing that has been learned from existing scientific data collection management is that these archives must be initiated by respected scientists working with people trained and skilled in data management (CODATA, 2002). Domain knowledge is needed in order for collections to be managed, documented, and kept usable by scientists and researchers.
(from Anderson, 2004). Nonetheless, it is true that as Crow et al. stated
... overlay journals pointing to distributed content, high-value information portals-centered around large, sophisticated data sets specific to a particular research community-will spawn new types of digital overlay publications based on the shared data.
which is to say that raw data (whatever raw means), can be held in repositories, and made available to institutions. It's also true that some data should be stored in IRs, not least because there may not be anywhere else suitable for such data to be archived. However, I have no truck with the logic that preserving the bits and bytes for the future without worrying about how to migrate the accompanying metadata (by which I mean more than Dublin Core type metadata) is a sensible attitude. I think it's a way for people to think "oh, I don't need to worry about preserving this data (and metadata) because my IR will do that", which, if nothing else is done, will result ten years later in incomprehensible binary (junk).
So, yes, I'm very much in favour of institutional repositories, but they need to be established with a very clear understanding of what they will host and they need to reject material that they can't hope to preserve. Then they need to hold a hard line against function creep, and only accept material in "well known formats" (whatever that means) with "well structured metadata" (whatever that means). This is exactly how the discipline specific repositories function: we can't take just anything because we could never preserve it, even though we have a dozen domain experts to work on it ... we demand specific formats, and work hard to get structured metadata.
That's not to say that IRs can't provide a backup service for their institutions: somewhere to put other material for temporary storage but they ought not to imply that such data has longevity when it doesn't and can't.
And, having said all that, the situation may well change (if for example a lot of complexity becomes absorbed into widely understood standard descriptions, like, for example the Geography Markup Language, GML). However, when that nirvana arrives, it'll still be domain experts that will be needed to get the stuff out of whatever binary format it's in and into the new one, and all the accompanying usage information will need migration too ...
(There is a good collection of material on scientific data repositories here.)
by Bryan Lawrence : 2005/03/31 : Categories curation : 2 trackbacks : 0 comments (permalink)
Snippets from Lunchtime
Today's lunchtime reading was from newsforge via the osdir newsfeed and included two tasty morsels:
Firstly, apparently ESRI is now recommending python as the extension language for customization and automation of ArcGIS!
Secondly, apparently OpenOffice 2.0 is even more dependent on Java. The article is about the issues this brings to the FOSS community, because Java is not FOSS, while OO is (or purports to be if it really does depend on Java).
This last doesn't affect the curation issues (the document format is completely open). I thought this was worth stating because my problems with Microsoft are not about them charging for software, it's about open standards and access. (It's always worth pointing out that in FOSS, as in Free and Open Source Software, these are two things, and professionally the Open Source aspect is more important to me than the Free Software aspect, although obviously privately I like that too.)
by Bryan Lawrence : 2005/03/30 : Categories python computing (permalink)
Attention Deficit Trait
eliterate librarian has an article describing this "disorder", defined as
you've become so busy attending to so many inputs and outputs that you become increasingly distracted, irritable, impulsive, restless and, over the long term, underachieving. In other words, it costs you efficiency because you're doing so much or trying to do so much, it's as if you're juggling one more ball than you possibly can.
Hmmm. Sounds familiar. Her blog entry, quoting another article I didn't get to the bottom of, goes on to say:
How you allocate your time and your attention is crucial. What you pay attention to and for how long really makes a difference. If you're just paying attention to trivial e-mails for the majority of your time, you're wasting time and mental energy. It's the great seduction of the information age. You can create the illusion of doing work and of being productive and creative when you're not. You're just treading water.
This so reminds me of the best thing I got out of our lab's standards of management excellence course ... it was about the quadrant: for every activity, allocate whether it is urgent or not, and whether it is important or not. Now you have four quadrants:
Things that are urgent and important
Things that are not urgent but important
Things that are urgent but not important
Things that are neither urgent nor important.
Never, never do anything in the last quadrant, and stop wasting your time doing things in the third! Spend as much time as you can doing things in the second, but obviously you have to attend to the things in the first quadrant first.
I try to obey these principles, which explains a) why I sometimes never reply to email and b) why I spend time reading material on the internet (carefully focussed of course).
by Bryan Lawrence : 2005/03/30 : Categories management (permalink)
More Legality
The Digital Curation Centre is starting to provide useful information for those of us who are professional curators of scientific data. I've just skimmed through "Public Domain; Public Interest; Public Funding: focussing on the 'three Ps' in scientific research" by Waelde and McGinley (main link, local pdf).
This is an excellent, albeit lengthy, discussion of the key issues surrounding IPR in scientific databases (and in some other contexts too).
On the importance of data:
Data and other information is generated on an exponential basis, and held within vast databases. The progress of science depends on the re-use of that data for a variety of purposes. It can also be hard for the non-scientist to appreciate the size and importance of these databases to the scientific community ...
In Europe, a key piece of legislation has been the 1996 Database Directive, which could, as will be seen, apply to the whole or part of any NERC data centre. An obvious concern is that either NERC or CCLRC might try and use the database directive to ringfence such data from the research community. While that's not possible now, the mere presence of the legislation, in the context of the UK Baker Report, could be threatening. It's thus important for me to understand this stuff:
The definition of a database:
... in the Directive refers to "a collection of independent works, data or other materials arranged in a systematic or methodical way and individually accessible by electronic or other means."
The definition of extracting from a database:
The Database Directive defines "extraction" as the permanent or temporary transfer of all or a substantial part of the contents of a database to another medium by any means or in any form, and "re-utilisation" to mean any form of making available to the public all or a substantial part of the contents of a database by the distribution of copies, by renting, by on-line or other forms of transmission.
What is substantial?
One third of the contents? One half? In other words, how much falls into the public domain? But any answer is not as simple as a fixed figure. The test for determining what is substantial is both quantitative and qualitative. The ECJ has said that a substantial part evaluated quantitatively refers to the volume of data extracted from the database and must be assessed in relation to the volume of the contents of the whole of that database ... On the matter of a qualitative part of the database, this refers to "the scale of the investment in the obtaining, verification or presentation of the contents ... regardless of whether that represents a quantitatively substantial part of the general contents of the protected database." A quantitatively negligible part of the contents of a database may in fact represent, in terms of obtaining, verification or presentation, significant human, technical or financial investment. This test would appear to require analysis of the investment that has been made in that part of the database that has been extracted.
Which leaves us where?
So what advice may be given to the scientist seeking to carry out research within the boundaries of the legal public domain concerned to avoid proprietary and contractual claims by the database maker? As can be seen, any answer is far from simple and may often lead to the comment "it depends", which is hardly useful for a scientist whose concern is to progress science unfettered by legal niceties.
In the paper, they refer to a horse riding case. I heard about this at the JISC digital rights meeting too. If I've understood the arguments applying to that, then some of the NERC data activities may find it hard to appeal to the database directive to protect their data (because the database wasn't collected for the purpose for which commercial return is anticipated), whereas some might find it easy (bizarrely, those with research data may be able to charge their communities). However, my take on this is that those who can legally protect their databases are exactly those for whom there is no market and vice versa ... so that implies for the moment I shouldn't have to be defending open access to our data against legal challenges ... long may that continue!
by Bryan Lawrence : 2005/03/29 : Categories curation (permalink)
Blog Summaries
In order to provide limited support for feed aggregation in Leonardo, I've developed a piece of code which exploits Mark Pilgrim's feedparser and Fredrik Lundh's elementtree. The code can be found here. It has currently only been used to provide a summary of this blog itself, see summary.
To use it in Leonardo, you need to
cd your-leonardo-distribution/lib
tar xzvf feedlook.tgz
and then modify your core.py to include it as a provider. You need to have
import providers.feedlook
resource_manager.register_provider(providers.feedlook.feedlookProvider(config))
in the right places ...
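For a flavour of what a feed summary involves, here is a minimal sketch. It is not the feedlook code: it uses only ElementTree (nowadays in the standard library as xml.etree.ElementTree) rather than feedparser, and the sample feed, the summarise function and its limit parameter are all made up for illustration. It pulls the title and link of the first few items out of an RSS 2.0 document.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed, inlined so the sketch needs no network access.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example Blog</title>
<item><title>First post</title><link>http://example.org/1</link></item>
<item><title>Second post</title><link>http://example.org/2</link></item>
<item><title>Third post</title><link>http://example.org/3</link></item>
</channel></rss>"""


def summarise(rss_text, limit=2):
    """Return (title, link) pairs for the first `limit` items of an RSS feed."""
    channel = ET.fromstring(rss_text).find("channel")
    if channel is None:
        return []
    items = channel.findall("item")[:limit]
    return [(i.findtext("title", ""), i.findtext("link", "")) for i in items]


if __name__ == "__main__":
    for title, link in summarise(SAMPLE_RSS):
        print(title, "->", link)
```

A real aggregator has to cope with the many flavours of RSS and Atom, which is exactly the job feedparser does.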
2005/03/29 (permalink)
Debunking Crichton Done for Me
I haven't finished with my comments on Crichton's book. I have finished reading it, and it sits on my desk with a few bits of paper marking things I want to pick up on ... but other things intrude :-)
Meanwhile, I can recommend this by William Connolley, which is a splendid summary of myths about climate change ... as I say, this more or less does my job for me. Some of the comments are pretty apposite too ...
by Bryan Lawrence : 2005/03/28 : Categories crichton (permalink)
Wiki Parser Update
The wiki that provides the xhtml for this site has been updated to support footnotes, and a couple of other modifications that have been requested by the leonardo mailing list. Details of the parser's current syntax are at WikiFormat. (The wikiformat page with the new parser is valid xhtml).
2005/03/27 (permalink)
Maybe I spoke too soon about Massachusetts
In January I was pleased to find that Massachusetts had effectively banned Microsoft XML (presumably Office 200X) documents from being archivable documents. I was pleased, not because of any particular issues with Microsoft (although I have them), but because it recognised the importance of being able to rely on interpreting archived material in the future when one might not have licenses to the software that wrote the material.
However, it seems things have changed. Groklaw is now reporting that MA has approved Microsoft XML format as complying to the following definition of an open format:
Open Formats, as we're thinking about them, and we're trying to be precise with the language, because people use different English words for different technical terms, in our definition, 'Open Formats' are specifications for data file formats that are based on an underlying open standard, developed by an open community and affirmed by a standards body; or, de facto format standards controlled by other entities that are fully documented and available for public use under perpetual, royalty-free, and nondiscriminatory terms.
As Groklaw states, this is a bizarre decision given Microsoft are patenting their XML format all over the place. (It seems rather unlikely that these patent applications will survive, but it's the intent that counts ... they obviously intend the documents to be far from open.)
by Bryan Lawrence : 2005/03/24 : Categories msxml (permalink)
JISC Digital Rights Meeting
On Tuesday I attended a JISC Consultation Workshop on Rights in Digital Environments. While the emphasis of this meeting was on rights management issues in a learning environment, and on documents etc, there was some useful discussion of data rights issues. Two of the presentations stood out for me: one by Charles Oppenheim on IPR issues in general, and one by Prodromos Tsiavos on Commons UK.
I especially liked the following :
A simple way of deciding whether copyright and database rights are relevant. Essentially: "Copyright is about creativity, and database rights are about investment of time and labour." A digital object might enjoy both, one, or neither.
Also: "Copyright is less to do with the law than it is to do with the management of risk". Oppenheim mentioned his formula for the financial risk of violating copyright, which should be compared with the effort and cost associated with getting copyright clearance. Essentially you need to multiply together the chances of various things happening and the possible financial implications of copyright violation (which may not be large: depending on what loss of income was involved, the legal costs might be greater than the cost of the actual violation). The formula terms include something like:
chance of what you are doing actually offending copyright
chance of being noticed by the copyright holder
chance of the copyright holder actually objecting
chance of them actually taking you to court
value of potential liability
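The terms above can be sketched as a small calculation. This is only my reading of "multiply the chances together with the liability", not Oppenheim's actual formula; the function name and all the numbers are hypothetical.

```python
from math import prod


def expected_copyright_cost(p_infringing, p_noticed, p_objects, p_court, liability):
    """Expected financial exposure: the chained chances times the potential liability."""
    chances = [p_infringing, p_noticed, p_objects, p_court]
    for p in chances:
        if not 0.0 <= p <= 1.0:
            raise ValueError("each chance must lie between 0 and 1")
    return prod(chances) * liability


# Hypothetical figures, purely for illustration.
exposure = expected_copyright_cost(0.5, 0.2, 0.5, 0.1, 10_000)
clearance_cost = 200  # assumed cost of seeking clearance
print(f"expected exposure: {exposure:.2f}")
print("seek clearance" if exposure > clearance_cost else "the risk may be tolerable")
```

The point of the comparison is the one made above: the expected exposure should be weighed against the effort and cost of getting clearance.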
Additionally,
In the presentation on Creative Commons, I finally understood that the Creative Commons license isn't one license, it's several (this is obvious actually, but only when you know :-).
There was general agreement that nearly no one actually reads licenses, and the Creative Commons icons were a very useful concept. Here, for my record are the icons.
The science commons project will explicitly look at the database rights issue, not just copyright for scientific activities.
The idea of using a (unique) identifier to link to a rights database was discussed, but the usual issues for globally unique identifiers intrude on actually doing it ...
There are lots of IPR issues associated with metadata
Who owns it as we add layers?
Relationship to the data ownership and accompanying rights?
Note that auto-generated metadata becomes a "computer-generated work"
etc
by Bryan Lawrence : 2005/03/24 : Categories curation (permalink)
GPL Tested in Case Law
Groklaw has an article on a case where the GPL has actually been tested in (a US) court, and won. This is important because some argue that the GPL hasn't been properly tested. Well it has now.
This case is also interesting in that it has regard to standards compliance and copyright. The bottom line with regard to the latter appears to be that when it becomes law to comply with a standard, the standard itself can no longer be protected by copyright.
by Bryan Lawrence : 2005/03/24 : Categories curation computing (permalink)
P2PGIS
I've just had my attention drawn (by the good folks at EDINA) to the Peer-to-peer GIS project and the OPen Use Server (OPUS). The link refers to their implementation being based on Java, Apache, PHP and MapServer. It also says that it is based on RoMap.net (Rapid Online Mapping Network), and they say that
RoMap is a set of peer-to-peer protocols and network infrastructure to allow compliant GIS applications to work together over a network.
I can't yet find any real information on RoMap beyond what the FGDC says about the project. Annoyingly, it appears there ought to be a website at http://romap.net, but at the moment that goes nowhere ...
Anyway, the FGDC says that they funded it with OGC, and given it's based on MapServer, I'm guessing that OGC Web Mapping Server protocols underlie how it works.
When I get a moment (ha, famous last words), I'll contact the authors and find out a little more about it ...
by Bryan Lawrence : 2005/03/23 : Categories ndg (permalink)
Linus spoof letter to Bill Gates
This is absolutely brilliant!
by Bryan Lawrence : 2005/03/23 : Categories computing (permalink)
ICSU Recommendations
Yesterday, while on a train to a JISC meeting in Bristol on Rights in a Digital Environment (of which more later), I found time to read through the ICSU Report of the CSPR Assessment Panel on Scientific Data and Information (pdf).
It's an excellent report. I've extracted the key points from my perspective here ...
(I did the extraction to a file, knowing I would be putting it on the web, while listening to a talk on copyright. The verbatim extraction is probably a violation of copyright, but I'm also sure that ICSU would grant me the right to have done this. It's a case where ICSU really ought to publish the document with an appropriate license, e.g. a creative commons license, rather than put the onus on the user to seek permission to use it in this way).
by Bryan Lawrence : 2005/03/23 : Categories curation badc (permalink)
Python Eggs
One of the main complaints new users of python give is that the library dependency issue is not as clearly dealt with as it is in the perl and java communities. Today, I found out about the python eggs project via this. In the latter, eggs were described as
An egg is to a Python as a jar is to Java
and the planned usage would seem to be
The resulting .egg file can be added to PYTHONPATH or sys.path, as long as it contains only pure Python modules, or if you have the "pkg_resources" runtime installed ... But the idea is that .egg files, unlike normal Python zipimport files, can include C extensions as well as pure Python.
It sounds like some work is yet to be done (pkg_resources doesn't yet work), but if this is successful, it will make a huge difference to the ease of building and using python applications for the non-expert.
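As a rough illustration of the "add it to sys.path" part of that description, here is a minimal sketch using only the standard library. It builds a zip archive with an .egg name holding one pure Python module and imports from it. The module name hello_egg is hypothetical, and a real egg would also carry metadata and possibly C extensions, which this sketch does not attempt.

```python
import importlib
import os
import sys
import tempfile
import zipfile

# Build a tiny archive holding one pure-Python module (hypothetical name "hello_egg").
archive_dir = tempfile.mkdtemp()
archive_path = os.path.join(archive_dir, "hello_egg-0.1.egg")
with zipfile.ZipFile(archive_path, "w") as archive:
    archive.writestr("hello_egg.py", "def greet():\n    return 'hello from the egg'\n")

# Putting the archive itself on sys.path lets the standard zipimport machinery find the module.
sys.path.insert(0, archive_path)
importlib.invalidate_caches()
hello_egg = importlib.import_module("hello_egg")
print(hello_egg.greet())
```

This only covers the pure Python case mentioned in the quote; anything beyond that is where the pkg_resources runtime is meant to come in.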
by Bryan Lawrence : 2005/03/22 : Categories python (permalink)
Sea Level Rise Predictions
Well, I'm clearly not an oceanographer, but given my recent interest in assertions by Michael Crichton (there is more to come on that subject), I thought this might be interesting for occasional readers.
Meehl et al. in Science show, using a pair of climate models, that even if greenhouse gases were to be stabilised at 2000 levels, the thermal inertia of the climate system would result in a further roughly half a degree (K) rise in global mean temperature (compared with about the same over the last hundred years). A bunch of more realistic scenarios with increasing greenhouse gases result (obviously) in much larger global mean temperature increases.
This paper also addresses what sea level rises might be expected. They assert that the sea level rise over the last one hundred years was about 15-20 cm. While their models do a good job at producing hindcast predictions of the last hundred years of temperature increase, they significantly underestimate the observed increase in sea level (not least perhaps because they don't include melt water from glaciers etc). Their predictions for the next hundred years or so are probably therefore also underestimates, but if we take the ratio of increase, they predict a three-fold increase over what has already occurred - even in the best case of stabilising at 2000 levels.
I've not really been following the sea-level rise increase issues (there is only so much I can keep track of), but these are suggestive figures ...
by Bryan Lawrence : 2005/03/20 : Categories climate (permalink)
SOA Design Principles
John Crupi (via Tim Bray) has this to say about the design of Service Orientated Architectures:
...for SOA to be successful, it must be a "top-down" approach. And top-down, means problem to architecture to solution. It does not mean, working from what we have and just wrapping it with new technologies just because we can. This bottom-up approach is quite natural and easy and is the perfect recipe for a SOA failure.
While I agree with this in general, I'm not so sure it is axiomatic. Firstly, my take on this is that using web services to wrap things up and calling it a SOA is not so much a recipe for a failure as a recipe for spending lots of money and potentially adding little value. Adding little value could be argued to be a failure (but it's not always transparently a failure), but the process could also be argued to have educational value. If one designs a new architecture from the top-down, and builds it with a new technology, and then you get a failure, what then was the problem? The technology or the architecture? Whose implementation failed? The architects or the implementors?
From an NDG point of view, we've played a little at wrapping up web services, but we've done the big architectural design thing. Time will tell as to whether we're in any of the failure modes, but this much I know, we've learned a lot about what is possible, and that has value!
by Bryan Lawrence : 2005/03/20 : Categories ndg (permalink)
More thoughts on Blogging
Apparently the original Tim Bray article on why blogging is good for you is Tim's most linked-to article. In his latest post he links to an excellent critique from a European perspective by Rui Carmo. It's worth reading.
Also worth reading on that blog is Rui Carmo's personal disclaimer. If and when I get round to writing some guidelines for blogging on servers funded by academia there are some nuggets there ... albeit ones that will need translating from a commercial setting to an academic one.
2005/03/17 (permalink)
DDT and Malaria
I started my series of responses to State of Fear (SOF) thinking that I was going to be commenting on climate change issues alone. However, Crichton goes off on a lot of tangents, and I now understand why, having understood that State Of Fear is essentially a novel written around a speech where he makes a fairly strong pitch that environmentalism is a new religion. Much of what he says is very fair, not least:
Because in the end, science offers us the only way out of politics. And if we allow science to become politicized, then we are lost.
but then he proceeds to write a book which is undisguised "lobbying" in the worst political sense ... a book which finishes with what I hope is a tongue in cheek claim that he is the only person without an agenda (oh yeah, really ...).
The trouble is that in every example I'm an expert on where he claims the environmental science is bad science, I find that he's only managed to read (or see) some of the relevant material, he's probably not understood it all, and he's certainly drawn what I regard as erroneous conclusions. So how then can I trust him on issues where I'm not an expert?
Take for example, the issue of DDT and Malaria, which was introduced both in the speech and the book. The bottom line is that he concludes that the banning of DDT has led to millions of avoidable Malaria deaths. Let's deconstruct his relevant arguments which I think I can summarise as:
The link between DDT use and bird deaths is based only on a correlation (which led to mass hysteria following Rachel Carson's "Silent Spring") and there is no physical link to justify the correlation (aka "We know DDT doesn't cause cancer")
There is a (worldwide) ban on the use of DDT.
There is a correlation between the DDT ban and increase in malaria deaths
Therefore banning DDT was a bad thing.
So, I spent a little bit of time chasing these things down. Eventually, I found this excellent series of blog articles which are a highly recommended read, but I'd already assembled the following:
Let's go back to the points:
DDT is dangerous ...
It's not even about cancer (although it might once have been a worry) ... even Ron Bailey agrees that DDT is dangerous for many birds. He references this from the International Programme on Chemical Safety. (However, the same arguments I apply above about whether to trust MC apply to Ron Bailey).
There is a world-wide ban on DDT use.
Oh no there's not: The twentieth report of the WHO expert committee on Malaria (1998) states:
It is anticipated that for some time to come there will continue to be a role for DDT in combating malaria, particularly in the poorest endemic countries ...
Banning DDT led to a huge rise in Malaria deaths.
Oh really? Leaving aside the ban issue, we have Crichton on the one hand dumping on the use of correlation-only arguments (bird deaths without mechanism), but using the same to justify the increase in malaria. It seems that the increase in malaria was going to happen anyway. See for example: this, by Shiva and Shiva, which makes two key points:
Major ecological changes have contributed to malaria resurgence. The spread of irrigation projects has been recognised as a major cause for the spread and increase of malaria epidemics. In addition, the expansion of water-intensive crops has created conditions conducive to the spread of malaria.
The morbidity and mortality related to malaria is undoubtedly increasing. Increased dependence on spraying rather than on environmental sanitation (and that too irregular spraying, improper spraying, inappropriate spraying) has resulted not only in increases in the number of vectors but also the emergence of pesticide resistance.
More specifically, Shiva and Shiva say:
Under the continued assault of insecticides, vector mosquitoes have developed resistance, thus undermining the 'insecticidal approach'. During the 1960s, no research was done on the evolution of resistance because of the euphoria of malaria control programmes. This gap in research was filled only after resurgence had occurred. In the case of Anopheles culcifacies resistance to DDT is now common in 18 states and 286 districts, to HCH in 16 states and 233 districts and to malathion in eight states and 71 districts. An stephensi has developed resistance to DDT in 34 districts in seven states and to HCH in 27 districts in six states. Resistance to malathion has been detected in eight districts in three states.
See also Sharma (pdf), which
examines the role of DDT in malaria vector control and argues that DDT-spraying produces diminishing returns and eventually becomes counterproductive.
The ban on DDT was a bad thing.
No ban worldwide. What about in the States? Well, it seems clear that malaria is not a problem there anymore, and the bird populations have recovered. Doesn't seem so bad to me ...
Now obviously I can't speak as to the veracity of any of these references on DDT etc (but I believe the WHO), but it seems that the story is not nearly as simple as Crichton would have it. It is also true that the story is not ever as simple as some evangelistic environmentalists would have it ...
... for my completely unsubstantiated take on this, it's clear that using DDT has worked to eradicate malarial mosquitoes in some places, but it didn't and never could have in some others ... it also happens to be very nasty for some other parts of the ecosystem. The bottom line appears then to be to try and avoid it, but it makes sense to continue to use it where it can make a difference ... that's a much more complicated opinion than "DDT Good" so the science involved in the (non-existent) ban was "Bad".
by Bryan Lawrence : 2005/03/16 : Categories crichton environment (permalink)
Peopleware
A few weeks ago, various blogs I read started praising "Pragmatic Version Control using Subversion" by Mike Mason. One thing led to another, and I found this with a list of three seminal books, only one of which I had read (The Mythical Man-Month). So, I bit the bullet and bought the other two and the Pragmatic Version Control book as well. One of the ones I bought was "Peopleware - Productive Projects and Teams" by Tom DeMarco and Timothy Lister. It's fascinating ...
I especially liked the analysis of the coding performance of individuals in different work environments:
| Environmental Factor | 1st Quartile Performance | 4th Quartile Performance |
| Dedicated Workspace | 7.4 sq m | 4.3 sq m |
| Acceptably Quiet | 57% yes | 29% yes |
| Acceptably Private | 62% yes | 19% yes |
| Unnecessary Interruptions | 38% yes | 76% yes |
They finished up with this statement:
The top quartile, those who did the exercise most rapidly and effectively, work in space that is substantially different from the bottom quartile. The top performers' space is quieter, more private, better protected from interruption and there is more of it.
As it happens we've currently got reasonable accommodation, but it's always being squeezed. It's nice to have ammunition for the next squeeze.
The analysis also included the effect of phone calls as an interrupting factor, but I left that out, because for me the problem is email. In the book they imply email is a force for good (less interrupting), however, I think once one gets enough of it, it becomes either something you have to handle when it arrives (or drown), or something you basically ignore. Neither situation is good ...
Another really nice result in the book was the analysis of the impact of music on creativity in a Cornell experiment. It's a bit complex to summarise here, suffice to say that in an analysis of left-brain right-brain function, the results indicated that while having music on didn't affect efficient arithmetic and logical activity, it did affect the ability to make creative steps in that activity. Again, their summary is excellent:
The creativity penalty exacted by the environment is insidious. Since creativity is a sometime thing anyway, we don't often notice when there is less of it ... the effect of reduced creativity is cumulative over a long period. The organisation is less effective, people grind out the work without a spark of excitement, and the best people leave.
Two organisations I care about (the Met Office and the British Oceanographic Data Centre) have just moved to open plan offices ... I hope they can remain creative ...
by Bryan Lawrence : 2005/03/15 : Categories management (permalink)
Cloud Feedbacks
In my ongoing criticism of Michael Crichton's book, I made the point that parameterisation links small and large scales, and I emphasised the IPCC point that clouds and humidity remain sources of significant uncertainty.
I've just started reading a gem of a paper by Graeme Stephens on the current state (circa end of 2003) of knowledge about cloud climate feedbacks. I suspect I will be making lots of notes from this, and excerpting lots of bits. However, with respect to Crichton and the "no-one knows" issue, Figure One in Stephens' paper is a comparison of future climate predictions from a number of models. In each of these models, the climate heats up with a 1% CO2 increase year on year. He shows the low cloudiness in two versions of models that lie at either end of the range of the global mean warming responses (i.e. just under 2C and just over 4C). He says:
The reduced warming predicted by one model is a consequence of increased low cloudiness in that model whereas the enhanced warming of the other model can be traced to decreased low cloudiness.
Why do I bring this up? It shows that while the cloudiness is a key part of the response, it isn't changing the sign within our current understanding... we clearly have a lot to do to get it right ... but it doesn't change the result: increased carbon dioxide implies a warmer climate, how much warmer is a fair question though!
by Bryan Lawrence : 2005/03/15 : Categories climate crichton (permalink)
When is a webservice a grid service?
I often get asked what is this thing called a grid. Soon after, I get asked, so what is the difference between a web service and a grid service. The short answer is nothing ... the slightly longer answer is a grid service is a web service packaged up to deal with state, notification, and pointers to objects with state. The following schematic outlines a model of a possible transaction I could have with some data.
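[Schematic: a transaction involving data, processing and visualisation services, with the instruction route labelled 1,2,3 and the data route labelled A,B,C]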
The concept is that I obtain some data, do some processing on it, produce a picture, and keep the picture. I could do this with instructions following the 1,2,3 route in the schematic ... with the data following the same route. If the data was voluminous, then this is a bad idea.
However, if the data follows the A,B,C route, then things are far more efficient. For that to happen pointers to data have to be passed along the A,B,C route. Often this will be asynchronous - the things happening in Processing may take a while. The Viz service may need to be notified when something is available to be visualised ... so we need state, notification and pointers. This needs to be done in a secure manner.
Now, I could do all this with webservices, passing context tokens around ... but better yet, I can use grid services which have many of the interactions implicit in the toolkit of services. As an aside, the information flow may still be using 1,2,3, but the data flow is going A,B,C ... this is one of the reasons why we avoid attachments as a method of achieving data flow.
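To make "state, notification and pointers" a little more concrete, here is a toy in-memory sketch. It is not grid or NDG code: every class and method name is hypothetical, the notification is a plain synchronous callback rather than anything asynchronous, and security is left out entirely. The only thing it is meant to show is that the client handles short references while the bulky data moves directly between services.

```python
import uuid


class Store:
    """Stands in for a service that holds state and hands out pointers to it."""

    def __init__(self):
        self._objects = {}

    def put(self, obj):
        ref = uuid.uuid4().hex  # the "pointer": a short opaque token
        self._objects[ref] = obj
        return ref

    def get(self, ref):
        return self._objects[ref]


class ProcessingService:
    def __init__(self, data_store):
        self.data_store = data_store
        self.results = Store()
        self.listeners = []

    def subscribe(self, callback):
        self.listeners.append(callback)

    def process(self, data_ref):
        data = self.data_store.get(data_ref)  # data fetched directly from the data service
        result_ref = self.results.put(sum(data) / len(data))
        for notify in self.listeners:  # tell interested services a result is ready
            notify(result_ref)
        return result_ref


class VizService:
    def __init__(self, processing):
        self.processing = processing
        self.pictures = Store()
        self.latest = None
        processing.subscribe(self.on_result)

    def on_result(self, result_ref):
        value = self.processing.results.get(result_ref)
        self.latest = self.pictures.put(f"plot of mean={value}")


data_service = Store()
processing = ProcessingService(data_service)
viz = VizService(processing)

# The client only ever holds references; the large list never passes through it again.
data_ref = data_service.put(list(range(100_000)))
result_ref = processing.process(data_ref)
picture = viz.pictures.get(viz.latest)
print(data_ref, result_ref, picture)
```

In a real deployment the references would need to be resolvable across the network and protected, which is exactly the machinery a grid toolkit is supposed to supply.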
by Bryan Lawrence : 2005/03/10 : Categories computing ndg (permalink)
Security in Python for the NDG
The NDG secure access infrastructure is being engineered in both python and java, because we believe there are two different groups that will want to implement it.
Way back in one of my first blog entries I waxed enthusiastic about xml-signature and pyxmlsec. We did eventually get a little piece of code which interoperated very nicely with a java implementation of the same thing. (Don't let anyone tell you these things are trivial).
Obviously as we build our python infrastructure we start to notice the number of third party dependencies grow. This is not a Good Thing (TM). In terms of third party secure infrastructure, we want three things:
xml-signature
x509 authentication (not just for the xml-signature), and
x509 proxies (RFC3820).
The question is how to get this with the minimum of external dependencies. One key part of the problem is the OpenSSL infrastructure which currently gives us the x509 handling (and will give us x509 proxies in an upcoming branch according to Richard Levitte in private email).
The question then arises how to utilise that from python. There are two candidates:
M2Crypto, which depends on SWIG, and which was the subject of Guido's rant, and
pyopenssl which appears to implement direct python extensions for some of the openssl components.
All things being equal, one might expect the second approach to yield "better quality" interfaces, for example, the pyopenssl docs include this:
The M2Crypto.SSL module does implement a lot of OpenSSL's functionality but unfortunately its error handling system does not seem to be finished, especially for non-blocking I/O.
which is fair enough, given that the M2Crypto author, in this, motivates the SWIG-type activities this way:
The more I work in Common Lisp and Smalltalk, the less partial I am to "solutions" involving FFIs to C libraries. Although, as a counter balance to this thought, the main reason to use Lisp or Smalltalk is because these languages are great at modeling the problem domain; oft times, interfacing a C library is the quickest way to get "low-level" infrastructure that supports the modeling.
He also has some notes on extending M2Crypto.
So that would imply that it makes sense for us to initially engineer extensions based on M2Crypto (which is what we are doing) but then proceed at some future time to refactor to pyopenssl.
Well, that sorts out the first two items (although I think it would be better to remove the pyxmlsec dependency on xmlsec and libxml2 since we could handcode our xml-signature). The third item, proxy certificates, is a bit more complicated right now. Really the only "working" implementation of proxy certificates seems to be in the globus security infrastructure, but that's a lot of overhead just for RFC3820.
Fortunately, there is a third party to all of this, and that's the good folks from LBL working on the pygridware project, which is implementing the WS-Resource framework. They too have implemented XML-signature, and are working on the proxy certificate issue via an as yet unobtainable version of pyopenssl. As usual the question arises as to timescales. For proxy certificates we would seem to need to await either a future pygridware release (but it too has a long list of dependencies) or we engineer from the upcoming openssl branch (which we believe to be not far away)...
by Bryan Lawrence : 2005/03/10 : Categories python ndg (permalink)
More Satellite Broadband Woes
For the last couple of weeks I've had no satellite signal on my home broadband, and I've been wondering what was wrong ... given my previous frustration, the fact that Micronet don't answer email, and my current workload, I'd done nothing about it and simply gone back to 128 kbit ISDN until I could find time to hassle Micronet on the phone.
Today I got email from someone else with the same F10+Micronet problems, and in passing he mentioned transponder problems. It turns out that on the 28th of February Opensky changed their transponder on my satellite. The Opensky web page had instructions on the changes:
| Parameter | New | Changed? | Old |
| Frequency | 11.513250 GHz | Yes | 11.471750 GHz |
| Polarisation | Vertical | Yes | Horizontal |
| Symbol rate | 30 MSym/s | No | 30 MSym/s |
| FEC | 2/3 FEC | Yes | 5/6 FEC |
On the F10 it appeared that I could only change the frequency and polarisation, so I did, and now I have signal again. How hard would it have been for Micronet to have emailed everyone to tell them about this? Don't they have a list of their subscribers? (I understand they don't sell the F10 anymore ... but as I say, they've not responded to any email on my problems).
Of course, I have the signal now, but will the VPN stay up, and can I get back in? You guessed it. I'm still hanging out for ADSL ...
by Bryan Lawrence : 2005/03/10 : Categories broadband (permalink)
Video Conferencing
As part of my JCSR duties, I spent yesterday learning about what is needed to support academic video conferencing in the UK. There are three key technologies hanging around:
H323 (and H320) which are the commercial video conferencing protocols over IP and ISDN respectively. This is what most institutions have if they have anything.
Access Grid (AG) in the UK comes in two flavours: what I might call "Real AccessGrid", from Argonne, and the INSORS grid (essentially a commercial packaging of access grid which is simpler to install and has support).
The Virtual Room Videoconferencing System (VRVS).
The key distinction between them seems to be that
H323 etc are very widely distributed, but support (effectively) only a limited number of video streams, and don't really support the data sharing options that sophisticated collaborators want nowadays. However, you can have open source based H323 on just about any platform.
VRVS has one (or two) sites which serve as hosts, whereas
with AG, any site can serve as venue host (within limitations which have to do with networks and resources and other boring issues).
The basic concept of how these (AG) things work is that you have a sequence from one site to the other
Analogue video/audio signals from cameras etc produce something digital
Something like VIC and RAT (see UCL multimedia) encodes this in packets using a specific codec
The packets are transmitted and received using RTP (the Real-time Transport Protocol).
Something like VIC/RAT decodes
and you see/hear something on your hardware.
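The packetising step in that sequence can be sketched in a few lines. This is not VIC or RAT code, just an illustration of the 12-byte fixed RTP header as I understand it from RFC 3550; the function names are hypothetical, and the sketch assumes no padding, no header extension and no contributing-source entries.

```python
import struct

RTP_VERSION = 2
HEADER = struct.Struct("!BBHII")  # 12-byte fixed RTP header, network byte order


def pack_rtp(payload, seq, timestamp, ssrc, payload_type=0, marker=False):
    """Prefix already-encoded media bytes with a minimal RTP header."""
    byte0 = RTP_VERSION << 6  # version 2; padding, extension and CSRC count all zero
    byte1 = (int(marker) << 7) | (payload_type & 0x7F)
    header = HEADER.pack(byte0, byte1, seq & 0xFFFF,
                         timestamp & 0xFFFFFFFF, ssrc & 0xFFFFFFFF)
    return header + payload


def unpack_rtp(packet):
    """Split a packet built by pack_rtp back into header fields and payload."""
    byte0, byte1, seq, timestamp, ssrc = HEADER.unpack_from(packet)
    if byte0 >> 6 != RTP_VERSION:
        raise ValueError("not an RTP version 2 packet")
    return {
        "payload_type": byte1 & 0x7F,
        "marker": bool(byte1 >> 7),
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "payload": packet[HEADER.size:],  # valid only because CSRC count is assumed zero
    }


# Hypothetical values: the payload stands in for whatever a codec produced.
packet = pack_rtp(b"encoded-audio-frame", seq=1, timestamp=160, ssrc=0x1234ABCD)
fields = unpack_rtp(packet)
print(fields["seq"], fields["timestamp"], fields["payload"])
```

Because the header is the same whatever produced the payload, any tool that speaks RTP and understands the codec can in principle consume the stream.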
It turns out that both VRVS and AG depend on versions of VIC and RAT. It also turns out that provided RTP is used it doesn't matter what does the wrapping into the packets ... and so INSORS based video streams and even H323 based streams can communicate with each other. The big difficulties are the session management of the streams and that's where they are all so different.
The big issue for JCSR is how to make sure all these systems can interwork easily, and to make sure that we are not building on shaky ground in terms of unsupported software.
In the UK, the Access Grid Support Centre has a fund of useful information on how to use AG etc. There is also a report on the H323 interoperation issues which is very comprehensive.
As well as providing much technical information, the above mentioned report makes the point that by and large the AG community think of their technology as providing "virtual meeting rooms" (with all that entails, including whiteboards, presentations etc), whereas the H323 community are thinking about a "sophisticated telephone call". It goes on to say:
It is also clear that the metaphors of the virtual meeting room and the sophisticated telephone call are very different and so encourage quite different behaviour in users. It is not clear how these two metaphors can be reconciled ...
In other words, the expectations of the two communities may not be met for all players by a technical fix ... (although my personal view is that there is a lot to be gained for a subset of interactions which could be mediated by a bridge which functions rather better than the existing one is reported to do).
In my opinion these technologies can yield simultaneously savings in travel time and cost (both financial and environmental) and improved collaboration. They work best when all parties know each other well enough anyway, so physical meetings are still necessary, but they can and do obviate the necessity for many physical meetings. However, this will only work when the limits and advantages of the technologies are well understood by all concerned ... which is more than a technical problem.
by Bryan Lawrence : 2005/03/10 : Categories computing ndg management (permalink)
Your website really does need to be current
A lesson from NZ on why an out of date website can be a legal and financial problem.
Operators who put up a website and then forget about minding it have been given a sharp reminder by the court that, in some cases, those old prices and specials could constitute violations of the Fair Trading Act ...
by Bryan Lawrence : 2005/03/09 : Categories curation (permalink)
Why blogging is good for you (and yes we need a policy)
Tim Bray and Sam Ruby on why blogging is a "Good Thing (TM)"
Tim Bray starts with
Let's assume that you're reasonably competent, reasonably coherent, and reasonably mature. Cynicism aside, a substantial majority of the people in the workplace qualify ...
and gives ten reasons why blogging is a good idea.
Sam points out that it's not always a good idea, but gives an example of how, even when it isn't, people can cope ...
One of the things Tim recommends is some sort of policy. Given that I want my community to get into this, and given that the dividing line between professional blogging and personal blogging is non-existent, we may have to think carefully about how blogging fits into the JANET acceptable use rules. While blogging ought not to fall foul of
the creation or transmission of material which is designed or likely to cause annoyance, inconvenience or needless anxiety;
one could imagine that some might argue that the minutiae which can appear on personal blogs might have a problem with
deliberate activities with any of the following characteristics: wasting staff effort or networked resources, including time on end systems accessible via JANET and the effort of staff involved in the support of those systems ...
I note that the Imperial College information systems policy has this too:
Whilst the Defamation Act 1996 appeared to significantly limit the liability of universities and colleges acting as ISPs for the publication of potentially libellous statements by staff and students on their websites, a recent court case makes it clear that such institutions are obliged to take some steps to monitor information content in order to get that liability protection for such acts, and that, once they are apprised of third party defamatory content on their servers, they must take all reasonable steps to remove or deny access to it. Failure to do so will, under the current law, open them to liability.
All this means that before we roll out blogging as a service for NCAS (which we are considering), we'll need some sort of clear guidance (i.e. a policy), and some mechanism of ensuring that we can shut down anything defamatory. While shutting down won't be a problem, monitoring it in a meaningful manner might be.
(The only sad thing here is that I might be the person who has to write the policy ...)
by Bryan Lawrence : 2005/03/09 : Categories badc (permalink)
Meteorological RSS Feeds
I've just spent a frustrating hour trying to find RSS feeds for the table of contents of atmospheric and climate journals. I've found plenty of websites, and even email notification pages, but all I want is RSS. Given our discipline really stretches computational ability at the top end, why can't I find such a service, when even lawyers do it?
Will I really have to do it myself? It looks like Ingenta do actually produce RSS feeds for some journals, but it's not obvious where the list of feeds can be found, and what journals are included. (It appears to be mainly Kluwer and Elsevier ...).
Meanwhile, I've started compiling a list of atmospheric feeds from the ingenta table. I'll add more as I find them (or write my own scrapers if I get desperate).
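For what it's worth, reading a table of contents out of a feed is the easy part. The following is a minimal sketch using only the standard library; the sample feed is entirely made up, and it assumes a plain RSS 2.0 layout (a feed in the RDF-based RSS 1.0 format would need namespace handling that is not shown here).

```python
import xml.etree.ElementTree as ET

# A made-up feed, standing in for a journal's table-of-contents feed.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hypothetical Journal of Atmospheric Examples</title>
    <item><title>Paper one</title><link>http://example.org/1</link></item>
    <item><title>Paper two</title><link>http://example.org/2</link></item>
  </channel>
</rss>"""


def list_articles(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]


for title, link in list_articles(SAMPLE_FEED):
    print(f"{title}: {link}")
```

The hard part remains finding out which journals publish feeds at all.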
by Bryan Lawrence : 2005/03/08 : Categories badc (permalink)
More State of Fear
Another installment in my (seems likely weekly) discussion of Michael Crichton's work of fiction. Before I get started, I can highly recommend the essay entitled "Dangerous Fiction" by Jeremy Leggett in New Scientist (5th March Edition).
I especially liked this excerpt:
The place to start in trying to understand this horrific assault on sanity is this bibliography. It is indeed extensive, and shows that the author has read the basic texts on the human-enhanced greenhouse effect, apparently without being impressed. What does impress him is every un-peer-reviewed neoconservative funded pamphlet and book you can think of. And he doesn't hesitate to say so. As for the 12 volumes of Intergovernmental Panel on Climate Change (IPCC): on this painstaking compilation of the work of hundreds of climate scientists around the world, we find not a word of editorial comment.
Leggett's whole review is worth reading!
Last time I picked up on a number of points in a conversation the characters had. I promised to come back on those issues, which basically had to do with Crichton not understanding the scientific method, the complexity of the atmosphere, or scientific reserve (when we refuse to be absolute about what we know). When I wrote that blog entry, I intended to go through the enumerated points, but I've slept on that a few times, and think it better to come back to what the IPCC says plus a few words on parameterisation.
So Crichton says "No one knows ..." well, the IPCC really is the sum of human knowledge on climate change, so what did they say? On page one of the executive summary, we have the evidence for current climate change. What's really interesting is this figure:
![]()
The key thing this shows is how very unusual the last one hundred years of this millennium have been - and it is more evidence that we shouldn't be too perturbed by the implications that Michael Crichton tries to draw from his temperature data from carefully selected locations.
Another thing the IPCC report states is:
A few areas of the globe have not warmed in recent decades, mainly over some parts of the Southern Hemisphere oceans and parts of Antarctica.
OK, well that deals with the conspiracy that Crichton tries to imply on page 194, where he has his characters citing some (real) literature to make the point that the Antarctic as a whole has not increased in temperature. He seemed to be trying to build a case that the climate change science community is trying to hide evidence. Far from it, it's all grist to the mill. The climate is a complex system, but still, the IPCC final conclusions include:
Emissions of greenhouse gases and aerosols due to human activities continue to alter the atmosphere in ways that are expected to affect the climate.
There is new (since the last report) and stronger evidence that most of the warming observed over the last 50 years is attributable to human activities.
There is much much more that we know ... and you should read about it. The IPCC summary is worth reading and is quite accessible (accessible in terms of finding it, in readability and in authority).
Now let's get back to some other things that I said I'd talk about, parameterisations, and why they most definitely are not guesswork! More things the IPCC say include:
In general, they (Coupled Atmosphere/Ocean models) provide credible simulations of climate, at least down to sub-continental scales and over temporal scales from seasonal to decadal. The varying sets of strengths and weaknesses that models display lead us to conclude that no single model can be considered "best" and it is important to utilise results from a range of coupled models. We consider coupled models, as a class, to be suitable tools to provide useful projections of future climates. ... Clouds and humidity remain sources of significant uncertainty but there have been incremental improvements in simulations of these quantities.
Now, let's translate this last sentence. We do know the physics of what happens to clouds when we increase the temperature, and/or add more water into a small volume (which Crichton implies we don't). However, when we take these complex models, and compare them to the real world, we find that the real world and the model diverge in a number of ways! Why is that? Well, it's true that it is in part to do with parameterisation uncertainty, so let's define what we mean by parameterisation and indeed uncertainty.
Firstly uncertainty: as scientists we are never certain, but we can put bounds on our level of uncertainty. Often times these bounds admit of very little uncertainty, but we refuse to say we know something ... that's just the way it is. If you did science at high school, you will remember being forced to measure things with "errors". So you should never say that you know that an object is 10cm long, it is 10cm plus or minus 0.5 cm (or whatever). That's an example of uncertainty. You'd never find a (good) scientist making any statement without qualification.
What about parameterisation? To understand that, we need to come back to what a coupled model is. In short, it's a complex system of hundreds of mathematical equations used to predict climate. These models encapsulate nearly everything we know about the physics of the atmosphere: starting from Newton's laws of motion, the gas laws, and working up to detailed thermodynamics of water ice transitions and much much more. These equations represent the processes going on between and in "grid boxes". Each grid box represents a finite amount of the atmosphere in three dimensions (hence "box"), and they are distributed throughout the volume of the atmosphere. Generally, we describe the processes which are dominated by the physics of the interaction between boxes as "fully resolved", and those that are dominated by processes which have scales smaller than a grid box as "sub-grid" scale.
Typical climate model grid boxes are of the order of a couple of hundred km square and with varying vertical resolution (a few hundred m near the surface to 10 km or so in the stratosphere). Problems arise because of these large scales. What we call a parameterisation is where we take the properties of the grid box (large-scale) mean and attempt to calculate how the average behaviour of the sub-grid scale processes affects the grid box mean.
Why do we do that? Because many processes are clearly much smaller in scale. Obvious examples of such processes include cloud processes. Clearly within a grid box there could be a wide range of clouds of different types. We might expect that there is a relationship between the large scale (temperature, humidity etc) and the cloud. We certainly know there is on the small scale: for example, all other things being equal, if we increase the temperature of a volume of air then there could be more water vapour as opposed to water condensate - hence less cloud. The temperature itself is generally dominated by large scale processes. So, a parameterisation is a set of equations which link what we know about the physics of the small scale with the large scale, usually based on some statistical relationship describing how the small scale physics might scale to the grid box mean. They are complicated physical equations, and they are tested rigorously by comparison with observations.
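To make the idea concrete, here is a toy sketch of one well known style of cloud parameterisation, a relative-humidity scheme of the kind usually credited to Sundqvist. It is an illustration only, not the scheme used in any model mentioned here; the function name and the critical humidity `rh_crit` are assumptions picked for the example.

```python
import math

def cloud_fraction(rh_mean, rh_crit=0.8):
    """Diagnose grid-box cloud fraction from grid-box mean relative humidity.

    rh_mean and rh_crit are fractions between 0 and 1. rh_crit is an
    assumed threshold below which the whole box is treated as cloud free.
    """
    if rh_mean <= rh_crit:
        return 0.0
    if rh_mean >= 1.0:
        return 1.0
    # Sub-grid variability: part of the box saturates before the mean does.
    return 1.0 - math.sqrt((1.0 - rh_mean) / (1.0 - rh_crit))

for rh in (0.70, 0.85, 0.95, 1.00):
    print(f"mean RH {rh:.2f} -> cloud fraction {cloud_fraction(rh):.2f}")
```

The large-scale input is the box-mean humidity; the output is the average effect of clouds too small to resolve. Warming a box at fixed water content lowers its relative humidity, so the diagnosed cloud fraction falls, which is the small scale behaviour described above.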
Any good GCM has dozens, if not hundreds, of parameterisations, most of which have been tested regularly in numerical weather prediction as well as in evaluation of the climate models. Most parameterisations are the result of years of work comparing physics with observations! Improving model predictions is a trade off between improving resolution, improving parameterisations, and improving our confidence in our predictions by running ensembles. See my blog on model resolution.
But, in the final analysis, parameterisations are approximations for the large scale of small scale physics. However, it is absolute rubbish that they result from guesswork. Quite the opposite: most parameterisations are simply approximations of complex physics. The validity of the approximations is tested by comparison with observations.
So back to Michael Crichton (who believes that in a few months' light research he has understood more about the climate than the hundreds of scientists who have dedicated their lives to understanding it). When he calls parameterisations guesswork, he is displaying ignorance. When he says "no one knows", he is displaying more ignorance. When he shows a few timeseries of temperature and draws conclusions based on his special selections, he is being more than ignorant, he is being duplicitous. Oh well, he has that in common with many others ...
At this stage I'm on page 218. Frankly, this book is so frustrating that meanwhile I've read two others which have been far more interesting (actually novels about Roman times ...). I'll try and carry on with State of Fear ...
by Bryan Lawrence : 2005/03/05 : Categories climate crichton (permalink)
Content Management, Wikis, Blogging etc
A number of threads are banging around in my mind. Here at the BADC we are running a number of bits of software that all do a bit of the same task. We run
Leonardo for blogging.
BSCW (Basic Support for Cooperative Work), for collaboratory activities, including private workspaces and file sharing.
A number of kwiki based wikis,
and of course we have our comprehensive (and complicated) web site.
The boundaries between these pieces of software and their target communities are rather blurred. From time to time we set up project web pages. If the project is at all active, we end up creating BSCW pages for it, wiki pages for it, and static html. Before long no one in the project knows where to put material.
We're about to add a new project site, and update an older one. At the same time, our existing major web site (the badc) is beginning to creak at the seams: the number of pages is multiplying, we have duplicates, maintenance problems, no history management. All classic stuff for a proper content management system.
Obviously the question arises as to whether we can simplify our environment?
In the case of the badc web site, I think we need a proper content management system, and there is no getting around it. We've just kicked off a project to look at replacing it with plone. That project will run for quite a while, because our existing content is complex; driven in part by a number of databases and scripts. The relationships between what we do and what plone could do need to be well understood before we would be able to start developing the new website.
Our BSCW workspaces are very popular with the research community, who primarily use them for sharing documents and data for (especially) field campaigns. It's not obvious to me that BSCW is used for more than a private file repository (for each community), so it's not clear that we couldn't replace it with something which has the same functionality (although personally, I do like the ability to know if someone has downloaded a file that I have uploaded).
I've invested a good deal of time in Leonardo, because I thought (and still think) that blogging could be of significant importance for the academic community who could use it to improve collaboration. We've tried in the past to encourage them to use products like BSCW (and simpler bulletin board products) to engender discussion and questioning, but it's not really worked. One of the reasons why I don't think academics much like bulletin board type collaborations is that they are ok for question/answer type situations, but not so good for the gradual development and sharing of ideas. It's the latter process that blogging is so good for, and when coupled with categories, comments and trackback, it can allow the efficient exchange of information. Of course, one of the limitations is that sometimes one wants to keep one's thoughts private to a small community, and that's not so easy with standard blogging software. So, before Leonardo can really take off in this context we need to add support for
trackback, comments and categories, and
the ability to limit access to some pages to some parts of the community.
As many have said before, the difference between a community blog and a wiki is nearly impossible to identify. Again, Leonardo fills this space in a nice way, with a relatively seamless transition between content which is (perhaps) slowly varying (the wiki material), and diary content which is time-stamped and evolves by addition, not correction (the blog material). If Leonardo were to replace our existing wikis though, we'd have to add two crucial pieces:
Version Control, and
Multiple Users.
Of course, we need multiple user support to limit blogging conversations to a small community (I imagine that being done with some categories being limited to some sort of access control list). This implies that beyond what we need for academic (and other) blogging, the only additional thing Leonardo would need to move into our wiki space would be good support for version control.
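As a rough sketch of what category-level access lists might look like (this is not Leonardo code; the category names, user names and `can_read` helper are all hypothetical), one could map each restricted category to the set of users allowed to see it and treat everything else as public:

```python
# Hypothetical access lists: a category with no entry here is public.
CATEGORY_ACL = {
    "fieldcampaign": {"alice", "bob"},
    "project-private": {"alice"},
}

def can_read(user, categories, acl=CATEGORY_ACL):
    """Return True if `user` may read an entry filed under `categories`.

    The entry is readable only if the user appears on the list of every
    restricted category it carries; unrestricted categories are open to all.
    """
    for category in categories:
        allowed = acl.get(category)
        if allowed is not None and user not in allowed:
            return False
    return True

print(can_read("bob", ["climate"]))                           # public category
print(can_read("bob", ["fieldcampaign"]))                     # bob is listed
print(can_read("bob", ["fieldcampaign", "project-private"]))  # not on second list
print(can_read(None, ["fieldcampaign"]))                      # anonymous reader
```

Requiring membership of every restricted category is the cautious choice: tagging an entry with an extra private category can only narrow its audience, never widen it.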
Given we're moving to plone and Leonardo, can we remove some of the other pieces from the mix? The most obvious candidate for improvement would be to replace all the ad hoc websites with instances of a complex wiki that would allow one to have immutable pages, and mutable pages, and access control. If this wiki software allowed the uploading of documents as well as wikipages, and could hold some documents in a private space, it could certainly replace those situations where we have three different products in the same space.
It's even possible with the above spec, that it could replace the BSCW software. However, for those applications where people simply want to share documents and collaborate, BSCW still seems to have all the functionality we need, and it has a happy user community. Now is not the time to think about shutting it down.
kwiki doesn't quite meet all those requirements, so I've spent some time looking at wikis, and while I wouldn't claim to have done a comprehensive job looking, two good candidates (in terms of their feature support) would be:
However, when push comes to shove (and I spent much of this afternoon shoving MoinMoin) there do seem to be enough little limitations that we'd have to do a little engineering to get around them. Of course being in python, MoinMoin has a significant advantage (we have relevant expertise, and it will probably be easier to engineer it to fit in our access control environment).
We have to start a new website in the near future, and upgrade another, and we have clear choices: If we look at what is required for Leonardo to fill this space, we can see that it's all on the roadmap for Leonardo, but it's not going to be here soon. MoinMoin is much closer to what we want now, so we'll probably go with that.
Annoyingly then, it looks like we'll still have four things to maintain, but at least they'll be targeted more clearly:
Sophisticated Web Site (plone)
Less Complicated Web Sites and Community Wikis (MoinMoin?)
Blogging (Leonardo)
Out and Out file sharing (BSCW)
These new project based websites should be much easier to set up, maintain, and use. That'd be a result!
by Bryan Lawrence : 2005/03/02 : Categories badc (permalink)
HinesParameterisation
Alison has just pointed out that Colin Hines has a new paper on his Doppler Spread Parameterisation. The abstract states that:
An embryo Lagrangian DST is introduced and employed to assess the original DST. Earlier results near the Eulerian spectral peak are found to be reasonably valid, whereas those at greater vertical wavenumber are confirmed to have produced too much spreading. The earlier DSP is found to need little if any change, though specific values are suggested for its two most important "fudge factors".
The two numbers he gives are Φ1 = 1.9 and Φ2 = 0.4. These are suggested in order to get suitable values of the Lagrangian Richardson number, to compare with observational values. As it happens, these are precisely the values which I have used in my previous papers (e.g. height of model lid influences on stratospheric climate, forcing planetary waves in the mesosphere, and simulating the quasi-biennial oscillation). It's also obviously good news that his parameterisation is standing up to more rigorous Lagrangian analysis.
by Bryan Lawrence : 2005/03/02 : Categories climate (permalink)
What is a Web Service
Last week I gave a kick off talk at the National Institute for Environmental e-Science (niees) on "What is a Web Service". This was in a meeting on "Developing applications for real-time environmental data" (DARTED).
In that (short) talk, I spent a few minutes on the SOAP v REST wars, because I think the environmental community could be sold a dummy if they put too much work into mis-applying technologies. In particular, I do believe that while there is a place for SOAP-ful web services in some applications (we are using them in NDG), in many cases it's simply not worth the effort. The trick is going to be deciding at project conception whether one needs the full WS machinery or not.
Meanwhile, the war rumbles on ... but here from soap-is-dead via Ryan Tomayko is a key point for us application developers:
If everything is sent over the wire as a XML document that is described by an XSD then it all boils down to how easy you can work with these documents. That is working with XML api's like DOM and XPath. The enclosing envelope should be irrelevant to the concerns of the average developer; it should be treated like just any other transport protocol.
The same article goes on to say:
All that extra machinery provided to support the SOAP envelope is precisely that, extra machinery and has never been shown to improve interoperability. Therefore, in terms of effort, interoperability via SOAP is not any easier than doing it in REST. In fact, its actually more insidious because a developer is all too easily lulled in the fallacy that an object is the same as the XML document.
Which goes to show that this battle at least is about interoperability. I don't know where the truth is here yet, so I'm happy to keep on reading about it. But it's not all about interoperability, we need to add security to the mix, and if I have the same code base at each end of the wire (or not), using WS-whatever and/or SOAP may be easier than doing it all (over again) RESTfully. As I say, I really don't know how this war will pan out, but I do know that NDG needs X509 (proxy) certificates, and XML-signature, and probably a fair hunk of WSRF (or something like it) ... so if we can have it in a SOAP toolkit, we'll use it!
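The point in the first quote, that what matters to the application developer is working with the XML payload rather than with the envelope, can be sketched in a few lines of python. This is a minimal illustration using the standard library's ElementTree and its limited XPath support; the payload namespace, element names and station name are invented for the example, and nothing here is NDG code.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
OBS_NS = "http://example.org/obs"  # hypothetical payload namespace

# The same (made up) observation, once wrapped in SOAP and once bare.
payload = f"""<obs:observation xmlns:obs="{OBS_NS}">
  <obs:station>STATION-A</obs:station>
  <obs:temperature units="degC">11.5</obs:temperature>
</obs:observation>"""

soap_message = f"""<soap:Envelope xmlns:soap="{SOAP_NS}">
  <soap:Body>{payload}</soap:Body>
</soap:Envelope>"""

rest_message = payload

def read_observation(xml_text):
    """Extract (station, temperature, units) whatever the wrapping."""
    ns = {"obs": OBS_NS}
    root = ET.fromstring(xml_text)
    # The payload is either the root itself or nested inside an envelope.
    if root.tag == f"{{{OBS_NS}}}observation":
        obs = root
    else:
        obs = root.find(".//obs:observation", ns)
    station = obs.findtext("obs:station", namespaces=ns)
    temp = obs.find("obs:temperature", ns)
    return station, float(temp.text), temp.get("units")

print(read_observation(soap_message))
print(read_observation(rest_message))
```

Both calls return the same tuple, which is the sense in which the envelope is "just another transport". What the sketch leaves out is exactly what the envelope is for: security headers, signatures and the like, which is where the WS-whatever toolkits earn their keep.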
by Bryan Lawrence : 2005/02/28 : Categories computing ndg (permalink)
Internet Explorer REALLY Sucks
I've been aware that my blog didn't render properly on Internet Explorer. I thought it was probably because my CSS and/or my XHTML weren't standards compliant. So, I've done some testing. My new wikiBNL code and my style sheet have been tested with
It passes both. Now, all newer pages should render correctly, but perhaps I can understand it if the older (non standards compliant) ones don't. And what was the problem? IE didn't like comments in the CSS file (which stopped it being valid CSS, but didn't stop other browsers from getting it right).
I'm obviously not the only person with major problems with Internet Explorer. I think this open letter to Bill Gates says it all.
by Bryan Lawrence : 2005/02/28 : Categories computing (permalink)
The Temperature is Increasing
In my last polemic on the State of Fear, I said I'd put up some station data from elsewhere. Recalling that Michael Crichton has his characters imply that one can only trust U.S. data, I've plotted some data from the Central England Time Series. I don't think there is any doubt about the quality of this data (although of course it is from a very small region of the globe).
I've taken seasonal means, and fitted some trend lines for the last 100, 50 and 20 years. The actual figure is here, rather than included directly, as it's rather large. You can see that temperature increases over all these periods, but the rate of increase is itself increasing. There is also a consistent signal in all seasons.
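For anyone wanting to try the same sort of thing, fitting a trend over the last N years is just ordinary least squares on the tail of the series. The sketch below is a minimal pure python version; the helper names are my own for illustration, and the data is synthetic, made up to show the mechanics, and is NOT the Central England record.

```python
def linear_trend(years, values):
    """Ordinary least-squares slope and intercept of values against years."""
    n = len(years)
    mean_x = sum(years) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in years)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(years, values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def trend_over_last(years, values, span):
    """Trend in units per decade, fitted to the final `span` years only."""
    slope, _ = linear_trend(years[-span:], values[-span:])
    return 10.0 * slope

# Synthetic series for illustration only: flat until 1950, then a
# warming that accelerates. Real station data would replace this.
years = list(range(1905, 2005))
temps = [9.0 + 0.0004 * max(0, y - 1950) ** 2 for y in years]

for span in (100, 50, 20):
    trend = trend_over_last(years, temps, span)
    print(f"last {span:3d} years: {trend:+.2f} per decade")
```

With a series shaped like this, the shorter the window the steeper the fitted line, which is what "the rate of increase is itself increasing" looks like in numbers. Real data is noisy, so short windows also carry much larger uncertainty.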
A word of caution. While such graphs are interesting, we wouldn't treat these as evidence of anthropogenic (human induced) climate change without lots of other information ...
by Bryan Lawrence : 2005/02/27 : Categories climate crichton (permalink)
Matplotlib
In my last post, I produced a figure of temperature data. As an exercise in learning something new, I produced it with python and matplotlib. I have to say I was pretty impressed with matplotlib. Although I only played with the tkinter gui and the postscript and png backends, it looks like there are lots of other interesting output types including SVG. I'm certainly going to play with it some more.
by Bryan Lawrence : 2005/02/27 : Categories python (permalink)
State of Fear, pick my timeseries
Following on from this.
Well, with 170+ references, there is an implication that his facts are good, but his attention to detail is not so good, or perhaps we just need to remember it's fiction. Anyway, the Hadley Centre isn't in East Anglia (perhaps he's confused the Climate Research Unit at the University of East Anglia, with the Hadley Centre for Climate Prediction and Research, based mainly in Exeter).
More importantly, from page 84 onwards, Crichton starts showing some carefully selected graphs to make some points. In particular, he shows
Global Temperature, 1880-2003, using GISS data.
Same again, but with carbon dioxide overlaid
Same again, but only looking at 1940-1970.
The timeseries of US temperatures 1880-2000.
He appears to be using these to attempt to make the following points:
A: carbon dioxide and temperature don't always follow each other.
B: the U.S. temperature data is obviously the best in the world, the U.S. is big enough to represent the rest of the world anyway, and since that doesn't show a significant warming until post 1970, then it isn't real.
Firstly A. Let there be no misunderstanding. All other things being equal, increasing global mean carbon dioxide must increase the global mean temperature. If this were not true, we would live on a frozen planet! We know this (No serious scientist denies this as far as I know). However, all things are not always equal, and as we will see there are issues with, particularly, clouds. Also, let's not forget natural variability. Natural variability occurs on a range of timescales, and is bound to result in periods (even decades) where the carbon dioxide signal and the temperature signal will not align. In the long term though, they just have to! The real questions should be:
C: Is the observed carbon dioxide increase manmade?
D: Is it large enough in magnitude for the atmospheric response to swamp natural variability?
E: Are there other parts of the system that might respond before temperature, and produce feedbacks which might stop a significant global temperature increase?
All the available evidence suggests: C: yes, D: yes, E: no. That's why all the refereed literature agrees on this point.
Let's deal with his assertion (B) that the U.S. data is the only data that matters. Leaving aside the breathtaking arrogance of this position (no one else can maintain quality recordings!), remember that the land area of the U.S. is about nine million square km: roughly six percent of the land area of the globe, or a little under two percent of the global surface area. So, actually, his assertion that because of its size it has relevance for the globe is nonsense. So now, let's look at some pictures he chose not to show, taken from the same site he got his data from:
![]()
Hmmm ...
It looks like when you use all the data, you learn that the warming is not the same at all places and all latitudes, and so the American figures are not the whole story - but even they show a significant increase in temperature in the last three decades. Note that GISS explain in some detail what data they use and how trustworthy it is!
Sometime, if I have the energy, I'll put up some time series from some other locations ...
Appropriate Use of Facts: 0, Misuse: already too big to count.
On page 188, we have the following exchange; I've added some parenthetical clause numbering so I can rebut the conversation:
... No one can say for sure (1) if global warming will result in more clouds, or fewer clouds (2) ... global warming will raise the temperature, so more moisture will evaporate from the ocean, and more moisture means more clouds (3) ... but, higher temperature also means more water vapor in the air, and therefore fewer clouds (4) ... so which is it? Nobody knows (5) ... so how do they make computer models of climate ... As far as cloud cover is concerned they guess (6) ... well they don't call it a guess. They call it an estimate, or parameterisation, or approximation (7). But if you don't understand something, you can't approximate it (8). You're just guessing (9).
Oh dear, oh dear. In my disclaimer, I should have said, I'm a dynamicist who specialises in parameterisation. Now this is something I do know something about. But I haven't time for this now. Expect me to come back and comment on all these points too ...
by Bryan Lawrence : 2005/02/26 : Categories crichton (permalink)
XHTML Table Alignment
Leonardo purports to produce xhtml1.1, so I decided to do some work on validating my pages (with the new underlying wiki format). It turns out that table alignment was one of two big problems (and a few minor ones) for me to resolve (the other one is that images can't exist in paragraphs, I have yet to solve that one - it requires real work in my wiki parser).
I used to use <center> but that didn't validate (even though it worked). So, given it took some finding (thanks), I've repeated it here: To center a table in xhtml1.1, you need to put it in a div of some sort, and use the margin and text-align properties. Here, for example, is the piece of my css that works for the "main" part of the Leonardo layout:
div#main table {
    margin: auto;
    border: thin solid black;
    background-color: #FFFFFF;
    text-align: center;
}
I had hoped this would fix the Internet Explorer problems I've had (well, strictly, the Internet Explorer problems others have had!) - but it didn't. I'm sorry IE folks, don't blame me, blame Microsoft ...
(The latest versions of my wiki and embedhandler, which are closer to xhtml1.1 compliance, are at corewiki.tgz)
by Bryan Lawrence : 2005/02/25 : Categories computing (permalink)
M2Crypto Woes
Guido (he of python fame) has a rant about M2Crypto. This is a worry from an NDG point of view. In it he also makes the following comment about SWIG:
... I've yet to see an extension module using SWIG that doesn't make me think it was a mistake to use SWIG instead of manually written wrappers. The extra time paid upfront to create hand-crafted wrappers is gained back hundredfold by time saved debugging the SWIG-generated code later.
We've only been exposed to SWIG in a big way once, and it was a nightmare. The code involved worked on one platform, not on another, on some days, and not on others. Repeatable problems they weren't ... but to be fair, in that case it might not have been SWIG's fault, but then, we couldn't tell!
by Bryan Lawrence : 2005/02/25 : Categories python computing (permalink)
Orographic cloud in a GCM: the missing cirrus
At last a chance to talk about real science: Sam Dean, myself, Don Grainger, and Darlene Heuff (Dean et al.) have had a paper on orographic cirrus accepted by Climate Dynamics. The abstract is:
Observations from the International Satellite Cloud Climatology Project (ISCCP) are used to demonstrate that the 19-level HadAM3 version of the UK Met Office Unified Model does not simulate sufficient high cloud over land. By using low-altitude winds from the European Centre for Medium Range Weather Forecasting (ECMWF) Re-Analysis from 1979-1994 (ERA-15) to predict the areas of maximum likelihood of orographic wave generation, it is shown that much of the deficiency is likely to be due to the lack of a representation of the orographic cirrus generated by sub-grid scale orography. It is probable that this is a problem in most GCMs.
Update (May 3rd, 2005): The complete reference is doi:10.1007/s00382-005-0020-9. A link to a personal copy of a pdf will eventually appear somewhere on this website.
by Bryan Lawrence : 2005/02/24 : Categories climate (permalink)
Broadband Frustration
"Broadband Britain" is a wonderful but empty phrase. I work at home a lot, but it isn't always easy ...
I live a long way from my exchange. Currently I have ISDN. I might be within ADSL range, I didn't use to be, but now, who knows? Multiple phone calls have yet to get BT to say anything sensible. I try every few weeks to order broadband from BT, and each time I get lost in a tortuous maze of transfers and inaccurate information. I tried again this morning, and may try again this afternoon. (Why order from BT? Because if I can't get 512 kB/s, I want to stay with ISDN ...)
I currently have satellite broadband provided by Micronet Broadband and Opensky/Eutelsat (the satellite is ebird at 33E) and I use BT Midband ISDN for uplink. Because I have a home network, I use an F10 router.
The F10 provides a VPN using ISDN uplink and satellite downlink via osda.eutelsat.net, and all my local IP traffic going externally should go via that VPN. Should, I say. The VPN is as unstable as hell, and it appears that eutelsat have a small bank of VPN circuits. I get good service some of the time, but a lot of the time, far too much of the time, I see this:
![]()
Far, far too often, this follows:
![]()
Note added later in the day, and for once, I caught this message, which must have been to do with the satellite, because it occurred in the middle of some file transfers:
![]()
The bottom line is that I have to support failover to ISDN, but guess what? The F10 doesn't support this without manually changing the configuration! The F10 software is poor all round: for example, you'd expect it to have an option to automatically establish the connection for outbound packets (my other ISDN modem can do this). It should be able to support 128 kbit ISDN when the satellite is unavailable (or more usually, when the VPN is unavailable). It should also support optional 128 kbit uplink. However, it does neither, so I have to support two routers, and much mucking around.
When satellite broadband is good, it's very good, but when it's bad, it's useless. And it's bad far far too much of the time.
by Bryan Lawrence : 2005/02/24 : Categories broadband (permalink)
State of Fear, the beginning
Well, I threatened to post my thoughts about Michael Crichton's State of Fear. I finally got to start the book last night, and it's immediately apparent it's going to annoy me as I read it. This book has 170+ references in a bibliography, so he's trying to assert some sort of pseudo-scientific credibility. I have a choice now: just read the book and give an overall opinion, or try and do a piece by piece critique. The former has been done, so I'll attempt the latter, although I doubt that I will have the time or patience to be very complete ...
OK: a disclaimer. My speciality is atmospheric dynamics, although I am becoming more and more informed on clouds. I suspect, with a high degree of probability, that I am better informed across the issues than Michael Crichton. So I'll bring my own misconceptions to bear. Please email me if you disagree (on scientific grounds) with anything I opine on this topic.
What baggage do I bring to this? As a doctoral student I was quite anti-climate-change. "Where is the evidence?" I used to say ... well, fifteen years on, there's plenty. So, yes, I start from being a believer. But I'm a physicist. I'm evidence driven, and yes I am UK government funded, but I run a data centre, so there is no impetus for me to be on any side of the debate apart from the evidence.
So far I've made it to page 58, so this is going to be a piecemeal critique; that way I may actually do it ... (The version I am reading is the "Special Overseas Edition", so apologies if the page numbering differs from yours).
The first time my sensibilities were offended occurred when he referred to Chylek et al., 2004 (Global Warming and the Greenland Ice Sheet). I haven't read the paper, but I have read the abstract (we don't get Climatic Change; the topic of financial limits to academic journal access is something for another day). The abstract makes the point that there is a high natural variability in the regional climate, and links the results to a couple of well known oscillations, but reports
Since 1940, however, the Greenland coastal stations data have undergone predominantly a cooling trend. At the summit of the Greenland ice sheet the summer average temperature has decreased at the rate of 2.2 °C per decade since the beginning of the measurements in 1987. This suggests that the Greenland ice sheet and coastal regions are not following the current global warming trend
Crichton used this in a context where he clearly wants to imply that there is some evidence of a conspiracy by scientists to hide evidence of global cooling that contradicts received wisdom.
Well, firstly, the paper clearly discussed the regional context. Secondly, I'm not a glaciologist, but glacial advance and retreat are governed by the balance between accumulation (snow and ice etc falling at the top) and ablation (melting, calving etc). If there is general warming, we expect more precipitation. In some places this will lead to more accumulation than is balanced by ablation due to higher temperatures. In such regions, glaciers will advance (and the regions near the glaciers will cool). What this means is that glacial advance (or retreat) in any one region is not of itself evidence of warming or cooling locally, let alone globally. I've no idea about the details of this paper, but the way Crichton has used it is duplicitous.
From now on, I intend to mark this book by a little score card. At this point we have
Appropriate Use of Facts 0, MisUse 1
(NB: The link above to the Chylek et al. reference uses a DOI published at this link. It should resolve properly to the abstract directly, but today it goes somewhere dead at www.springerlink.com where it returns "Bad Request". Shame on you springerlink! Meanwhile, you should be able to get it at the first link in this note.)
by Bryan Lawrence : 2005/02/19 : Categories crichton (permalink)
DataDeluge
I have just got a hard copy of the JISC briefing paper on the Data Deluge: Preparing for the explosion in Data.
There are some good thoughts in here, and some key questions:
... e-research data are likely to be annotated automatically and stored in a digital archive or library? What will the role of libraries be in this context?
Indeed. As we all become responsible for our own information environment, my take on this is that the library function will become an advisory function because:
... Could some institutions act as repositories for scientific or technical data on behalf of a number of institutions?
Is likely to be more and more true ... the cost of having local copies of all relevant materials (be they physical or digital) will become prohibitive. However, the cost of not having local expertise in how to find and utilise digital (and physical) objects will be too great for institutions. The library function will remain, it'll just be different. Librarians like to call themselves Information Specialists, and it's going to be true!
By the way, I also liked these useful quantifications of thousand fold increases:
| 1 megabyte | A large novel |
| 1 gigabyte | Information in the human genome |
| 1 terabyte | Annual world literature production |
| 1 petabyte | All US Academic Research Libraries |
| 1 exabyte | 2/3 of annual production of information |
by Bryan Lawrence : 2005/02/14 : Categories curation (permalink)
Testing, Extreme Programming and Demos
Patrick Logan has some really good thoughts on how to design a demo so that it doesn't break ...
I liked his comparison with extreme programming. I liked this so much, I'm repeating it here, because I figure I'm going to want to find this in ten years' time, and who knows how long blogspot.com will last? (But please read his copy not mine).
by Bryan Lawrence : 2005/02/13 : Categories computing (permalink)
Sunday Morning Procrastination
It's Sunday morning, and outside can't make up its mind whether it is a hailstorm, a gale, rain, or sunshine ... clearly time to do something useful. So I'm procrastinating and catching up on blogs and such.
Patrick Logan's good value this morning. He also pointed me at Sam Ruby's 2004 presentation on encodings and difficulties with URIs. Patrick's quote from Sam that grabbed my attention was:
The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata.
but the entire presentation is food for thought if you care about standardisation on the web in terms of encoding and why it matters. (Of course, initially I just liked the quote for its wider connotations).
He (Sam Ruby) concludes with
Comparing characters and uris is surprisingly more difficult and important than you might otherwise imagine (think: security holes).
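A minimal sketch of why this is harder than it looks, assuming percent-encoded paths are UTF-8 and that only a handful of normalisation rules matter. The helper name normalise_uri and the example.com addresses are hypothetical, and this is nowhere near a complete implementation of URI comparison:

```python
import unicodedata
from urllib.parse import quote, unquote, urlsplit, urlunsplit

DEFAULT_PORTS = {"http": 80, "https": 443}


def normalise_uri(uri: str) -> str:
    """Return a rough canonical form of a URI (illustrative helper only).

    Simplifications: userinfo is dropped, query and fragment are left
    untouched, and an encoded slash (%2F) in the path is treated as a
    plain slash, which a careful implementation would not do.
    """
    parts = urlsplit(uri)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    # Drop the port when it is just the default for the scheme.
    if parts.port in (None, DEFAULT_PORTS.get(scheme)):
        netloc = host
    else:
        netloc = f"{host}:{parts.port}"
    # Decode the path, apply Unicode NFC normalisation, then re-encode,
    # so that composed and decomposed characters compare equal.
    decoded = unicodedata.normalize("NFC", unquote(parts.path))
    path = quote(decoded, safe="/") or "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


a = "HTTP://Example.COM:80/caf%C3%A9"   # precomposed e-acute
b = "http://example.com/cafe%CC%81"     # e followed by a combining accent

print(a == b)                                # the raw strings differ
print(normalise_uri(a) == normalise_uri(b))  # the normalised forms agree
```

Two addresses that a person would read as the same page differ in case, port, and Unicode form, so any check that relies on raw string equality (an access control list, say) can be sidestepped.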
Having found this source of presentations, I also liked the 2003 presentation on some elements and details of atom, and in particular, this statement on required versus optional elements:
The number of optional features in XML is to be kept to the absolute minimum, ideally zero. As a result of this, any XML document has a high probability of being handled successfully by any XML processor.
... good advice for us in some of our NDG schema, many of which are very complicated, but they don't all need to be!
by Bryan Lawrence : 2005/02/13 : Categories curation xml (permalink)
Upgraded to v166
This web site is now running a very slightly modified rev 166 version of leonardo. I only had to make a few changes:
modify my menu page to use the new calendar insert option
copy my modified wiki04 over top of James ...
add one line to the page template to get an image
copied my version of static.py (which handles a few more mime types and slightly different default file locations) over top of the svn version.
fix a whole heap of bugs in my wiki code. (A revised version is at corewiki.tgz).
At this stage I haven't yet had a chance to test all the functionality of my new latex wiki code, as I have yet to install dvi2bitmap or source-highlight ...
2005/02/12 (permalink)
yet more on licensing
The licensing saga isn't over yet. I so don't want to be thinking about this ... As Steven Vaughan-Nichols writes in e-week.com, commenting on the plethora of open source licenses:
You shouldn't need to be an IP attorney to be a developer, and there are times that I wonder if that's where we're going.
Anyway, my musings on sundry GPL issues are obviously timely. I should have been aware of the article on a GPL3 in eweek from November last year, but yesterday's eweek article Rewriting the GPL no easy task brought it to my attention. Another useful thing from yesterday:
It (GPL3) is designed to be a copyright license that works all around the globe, so whoever rewrites it had better put on their global copyright harmonizing hat in order to do the task productively.
Wonderful. Meanwhile, for me, the software saga continues, because as well as my curation requirements, we want to release some code into the wild, now. You may recall that in this context my major issue about licensing came down to avoiding liability. Fortunately, better trained minds than mine have been on the problem and crafting a CCLRC license. Hopefully in the longer term we can use a license better understood by the community. Meanwhile, the key parts of it are:
All warranties, conditions, terms, undertakings and obligations on the part of CCLRC, implied by statute, common law, custom, trade usage, course of dealing or in any other way are excluded to the fullest extent permitted by law.
Subject to condition 4, CCLRC will not be liable for:
any loss of profits, loss of revenue, loss or corruption of data, loss of contracts or opportunity, loss of savings or third party claims (in each case whether direct or indirect);
any indirect loss or damage arising out of or in connection with the Software;
any direct loss or damage arising out of, or in connection with, the Software
in each case, whether that loss arises as a result of CCLRC's negligence, or in any other way, even if CCLRC has been advised of the possibility of that loss arising, or if it was within CCLRC's contemplation.
None of these conditions limits or excludes CCLRC's liability for death or personal injury caused by its negligence or for any fraud, or for any sort of liability that, by law, cannot be limited or excluded.
(Note that because of limitations of formatting in this wiki, the clauses are not strictly as they will appear in the license, but you get the drift). I wish I understood how this was better than the GPL ...
In terms of the wider issue, of derived works, and how that concept is embodied in the GPL, we have more work to do :-).
by Bryan Lawrence : 2005/02/03 : Categories badc curation (permalink)
non rigid metadata classification
It is an act of faith in the data management community that we need controlled vocabularies that interact using thesauri and ontologies. It's an act of faith at the moment because we don't think we have any other options ...
This is impacting on me right now:
In the NERC DataGrid development, we are developing metadata schema, and we want them populated with terms that scientists can search on. We know we have to be very careful here though: one person's aerosol content is another person's sulphate (fine), and both may use different units. We propose to address this in general by the use of standard names as well as long names (e.g. the CF convention and the BODC parameter dictionary). There are many other places where controlling the vocabulary and/or categorising the entries will help.
If we take CF as a specific example, we already have a problem with the fact that the CF convention, and the inherent standard name table are a moving target. Without version control, one doesn't know for sure that the item you point to is the one you want. Even worse, at the moment we have a (perceived) problem that it takes time to get a standard name accepted ... perhaps leading to problems in use of the CF standard names more widely.
However, we have the problem that it is damn hard to get anyone to conform to any standard (who has time to read these things?). Of course, the data centre professionals do ... but we are dependent on the scientists in the first place, and mostly they don't.
i d e a n t has an article entitled Bookmark, Classify and Share: A mini-ethnography of social practices in a distributed classification community. It's about tagging information in the (primarily) blogging community. As he puts it:
... allow users to collaboratively organize a shared set of resources by assigning classifiers, or tags, to each item. The practice is coming to be known as free tagging, open tagging, ethnoclassification, folksonomy, or faceted hierarchy (henceforth referred to in this study as distributed classification), and is associated with popular online services such as furl, del.icio.us, or flickr ... One important feature of systems such as these is that they do not impose a rigid taxonomy. Instead, they allow users to assign whatever classifiers they choose.
There are obvious problems with this approach (can any of us agree on a common system of organising anything?). Jon Udell wrote that:
Conventional wisdom holds that people will never assign metadata tags to content. It just isn't on the path of least resistance, the story goes ...
So, no difference from the science community there ...
He goes on to say:
yet somehow, users ... routinely tag content, and those tags open new dimensions of navigation and search. It's worth pondering how and why this works ... Abandoning taxonomy is the first ingredient of success. These systems just use bags of keywords that draw from and extend a flat namespace. In other words, you tag an item with a list of existing and/or new keywords.
The reason why this works is because
Feedback is immediate. As soon as you assign a tag to an item, you see the cluster of items carrying the same tag. If that's not what you expected, you're given incentive to change the tag or add another. If your items aren't confidential and online-only access is sufficient, this can be a great way to manage personal information. But the real power emerges when you expand the scope to include all items, from all users, that match your tag. Again, that view might not be what you expected. In that case, you can adapt to the group norm, keep your tag in a bid to influence the group norm, or both.
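The mechanism Jon describes (a flat bag of keywords, with the matching cluster shown back to you at once) can be sketched in a few lines. This is an illustrative toy, not how del.icio.us or any real service is built; the class name TagStore and the user and item names are made up:

```python
from collections import defaultdict


class TagStore:
    """A flat tag namespace shared by all users (illustrative only)."""

    def __init__(self):
        # tag -> set of (user, item) pairs; no hierarchy, no taxonomy.
        self._pairs_by_tag = defaultdict(set)

    def tag(self, user, item, *tags):
        """Tag an item and return, per tag, every item now in that cluster."""
        feedback = {}
        for raw in tags:
            key = raw.strip().lower()  # the only normalisation we impose
            self._pairs_by_tag[key].add((user, item))
            # Immediate feedback: all items from all users under this tag.
            feedback[key] = sorted({i for _, i in self._pairs_by_tag[key]})
        return feedback


store = TagStore()
first = store.tag("alice", "paper-1", "aerosol")
second = store.tag("bob", "dataset-7", "Aerosol", "sulphate")
print(first)
print(second)
```

The return value of tag() is the feedback loop: bob sees straight away that his "Aerosol" lands in the same cluster as alice's paper, while "sulphate" is so far a cluster of one.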
By contrast, both James Tauber and David Megginson are suggesting the use of Wikipedia as a way of providing tagging. In brief, what they suggest is that by referring to a Wikipedia article, one defines what one is talking about, or at least defines it in terms of the definition that existed when one made the link. While I can see that the simplicity of the idea is attractive, my worries about impermanent links apply here too. Mind you, now I've introduced two concepts:
The concept of categorisation, and
The concept of definition.
While the examples that David and James use for categorisation are probably going to be reliable enough in the long term, it may be that for more controversial topics, the actual categories will drift in time, linking things together in unforeseen ways, and introducing new misunderstandings as the definitions may no longer agree. In my CF example above, I spoke of the importance of version control, and I think this applies to serious application of categorisation ....
Tim Bray has discussed the issue of hierarchies in the use of tags (categories). He introduces us (well me) to the Atom category concept, and I can see the direct cross over to the GML dictionary syntax which we are already building into NDG. The important point is that one uses a URI, either directly, or indirectly in a namespace sort of way, to ensure that the vocabulary origin is understood.
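To make the combination of a vocabulary URI and version control concrete, here is a minimal sketch of a versioned term reference. It is neither the Atom category syntax nor the GML dictionary syntax; the class TermRef, the example.org vocabulary address, and the version numbers are all invented for illustration:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermRef:
    """A term pinned to one version of one vocabulary (hypothetical)."""

    vocabulary: str  # URI identifying where the vocabulary comes from
    version: str     # version of the vocabulary the author actually used
    term: str

    def uri(self) -> str:
        return f"{self.vocabulary}/{self.version}#{self.term}"


def same_definition(a: TermRef, b: TermRef) -> bool:
    # Same vocabulary, same version, same term: the definitions must agree.
    return a == b


def same_term_any_version(a: TermRef, b: TermRef) -> bool:
    # Same name in the same vocabulary, but the definition may have drifted.
    return (a.vocabulary, a.term) == (b.vocabulary, b.term)


VOCAB = "http://example.org/vocab/standard-names"  # made-up address
old = TermRef(VOCAB, "1", "air_temperature")
new = TermRef(VOCAB, "2", "air_temperature")

print(old.uri())
print(same_definition(old, new), same_term_any_version(old, new))
```

Keeping the version inside the reference means a link made today still points at the definition that existed today, even if the vocabulary later moves on.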
Nearly all the bloggers on this subject speak for simple approaches, perhaps like using wikipedia, or using pings, or del.icio.us or ... The requirement for simplicity applies to scientists, but I think thus far the blogosphere has two further things for us (the data management community) to learn:
the importance of instant feedback
the attractiveness of a reference source that can be manipulated if it doesn't have what you want in it
I began by introducing some areas where categorisation is really important to me (and my community). For us, we have no choice but to have complex semantic markup (we want machine understandable metadata), but we do have a choice about how we get it. To that end we need
A vocabulary server, so you can find definitions (this is under way at the BODC as part of the NDG project),
A way of finding what has linked to particular vocabularies. (We'll be able to do this using our NDG search tools),
A timely (i.e. practically instantaneous) way of users proposing new definitions, and a responsive governance system for deciding on them (yes, we do need that, otherwise we have conflicting definitions, and a number of folk - e.g. see links from here - have correctly identified how important that is).
The third of these is the only thing that we haven't yet tackled decisively, but it's on the menu ...
by Bryan Lawrence : 2005/02/03 : Categories badc curation ndg metadata (permalink)
there is no such thing as a foolish question
Generations of young children have been admonished with variations of a quote typically attributed to Mark Twain:
Better to keep your mouth closed and be thought a fool than to open it and remove all doubt
I say generations, because the same link points out that it probably arises from Proverbs 17:28:
Even a fool, when he holdeth his peace, is counted wise: and he that shutteth his lips is esteemed a man of understanding.
In my experience this is a piece of poor advice that has stood in the way of understanding and progress! I would say that the only foolish question is the one you are too afraid to ask, and the only thing foolish about it is that you are afraid of being held to be foolish! In every educational situation I can think of (and I can think of many), there is almost no time when it makes sense to sit wondering what someone means when they are talking (OK, there is one, if you've been too lazy to read some preparatory material), or having asked a question, be afraid of following up because you didn't understand the answer. How foolish is it to keep your mouth (keyboard) shut and never know the answer to your question?
2005/02/03 (permalink)
Copyright and Blogs
Oh well, while I'm on the legal thing, it turns out that my thoughts on copyright and blogging preceded (coincidentally) a rather good article on some of the issues in informationweek.
2005/02/03 (permalink)
Impermanent Links and Icebergs
The wonderful thing about the Net is that you can link to things. What's not so wonderful is when the link disappears. Even less wonderful is when the content at the link target changes after you made your link. A couple of weeks ago I linked to a NASA article about iceberg/glacier collisions. At the time of writing, the article predicted a collision. As I type this, if you follow the link, it describes the iceberg going aground on a shoal a couple of miles short of the glacier. Who knows what it will describe when you follow the link ...
To be fair to NASA, under "related links" they do provide what I suppose was the original article. However, in general I think this is the wrong way around. If you are going to call something an article, then you should not change it ... except in ways that preserve an information audit trail with a clear concept of time's arrow (in this case the older file is called ice_berg_ram2, compared with the newer ice_berg_ram.html). Far better, to leave the link as it is, and have an "updates link" added to the page ... i.e. to have what would I suppose be a manual trackback ... and have a new article there. The NASA behaviour compares poorly with the blogging concept of permalink. With a defined concept of a permalink, one can have a site like mine: a mixture of permanent things (ok, I admit to fixing gross errors of spelling and grammar), and things that change (e.g. the personal wiki pages and/or external links). It should be obvious to the casual reader which is which and that's fine. What's not fine is to have something which appears permanent ... i.e. an article ... and have it changing ...
Back to the iceberg. It looks like it's gone into reverse, and may be cracking up a bit ... not nearly as exciting a grand crash as seemed likely before, but the images from the NASA site are still a great advertisement for earth observation.
by Bryan Lawrence : 2005/02/01 : Categories environment curation (permalink)
Open Document
I have previously commented on the importance of open document formats for long term archival. One interesting possibility coming over the horizon is OpenDocument. At the moment this is based around OpenOffice.org software, but should KOffice gather more momentum there is a prospect that it will gain in importance. At the very least it will be a public standard.
As an aside, it has some relevance too to discussions we've been having in NDG about the right way to deal with images in metadata. One of the discussion points between the EU and Microsoft, was about binary content embedded in the documents (e.g. images). OpenDocument embeds these in separate directories, unlike MS XML which embeds throughout. As far as both long term curation and NDG goes, it is obviously preferable to run with the separate directory concept.
by Bryan Lawrence : 2005/01/31 : Categories msxml (permalink)
Derivative Works
The GPL talks about derivative works, and we've already seen that UK law doesn't know what derivative works mean. What does it mean? (Otherwise unattributed quotes are from Rosen 2005, which I need to read properly ...)
One definition, used in US case law (albeit eventually settled) is:
linking to GPL'd software turns the linked software into a derivative work
(where this is presumably a quote from Stallman). More fundamentally, and matching what one imagines it should mean:
under (US) copyright law a derivative work is a work based upon one or more preexisting works.
Alternatively, Bruce Perens points out that
a programmer can make a server of a software library and export all of its functionality to another program without creating something that would be considered a derivative work under copyright law or the definitions in Open Source licenses. In Unix parlance, servers are referred to as daemons, and thus the practice of embedding software in a daemon in order to avoid creating a derivative work is called daemonization. It is possible that a future Open Source license could restrict this practice.
It appears that there is a long running sore on the difference between collective works, and derivative works, and the role of linking. In (US) copyright law
the fundamental difference in copyright law between a collective work and a derivative work (is that) generally that the former is a collection of independent works and the latter is a work based upon one or more preexisting works. A work containing another work is a collective work. A work based on another work is a derivative work.
but it seems the GPL merges these concepts in an unhelpful way, leading to uncertainty about what constitutes a derivative work (direct linking: yes, hence the LGPL; daemonization: no). As usual, Wikipedia is a source of useful information on this.
by Bryan Lawrence : 2005/01/31 : Categories badc curation (permalink)
FreeAccess
From JISC:
The University of Southampton is to make all its academic and scientific research output freely available. A decision by the University to provide core funding for its Institutional Repository establishes it as a central part of its research infrastructure, marking a new era for Open Access to academic research in the UK. The repository provides a publications database with full text, multimedia and research data.
Of course, they aren't really putting all their data out there ... as usual the university has no idea how much data they really have ...
by Bryan Lawrence : 2005/01/31 : Categories curation (permalink)
State of Fear, Part 1
A colleague has sent me a copy of State of Fear by Michael Crichton for my comment. I should hasten to add that this colleague is not one of my atmospheric science colleagues, who I suspect would all agree with Myles Allen's review in Nature.
I'll report my thoughts here .. eventually ... but it arrived just after a fat parcel of rather lighter reading from Amazon ...
by Bryan Lawrence : 2005/01/27 : Categories crichton (permalink)
The Economist isn't always 100% right
The Economist has an article which amongst other things is about the ClimatePrediction.net early results recently in the Nature news column.
Regrettably, they got the wrong end of the stick. The article was titled How to model the climate on the cheap and it finished with the sentences:
...And its description of the interaction between atmosphere and ocean is far too simple. But it does point the way towards a better way of doing the modelling business. And a cheaper one, too.
So sad that the author couldn't see the contradiction in those closing sentences. To fit on a PC, the model can never be as sophisticated as high resolution models. As I was trying to imply in my blog on model resolutions etc, there are three things we need to push ahead with on climate modelling:
higher resolution
more complete ensembles
better parameterisations (and sub-models and sub-model interactions).
Sometimes these are orthogonal aims and we can't have them all: the ensemble sizes for the most sophisticated models have to be small (they're expensive to run). The bottom line is that there is a place, and a necessity, in climate science for both big iron in the supercomputer centres and massively distributed computing on the desktop. However, I guess it's hardly surprising that The Economist bottom line is the cost, and not the quality.
by Bryan Lawrence : 2005/01/27 : Categories climate (permalink)
Oil Again
I don't know much about energy issues, but it's one of those things that any citizen of a modern democracy should think about from time to time. I am trying to do a bit of thinking about it from time to time (e.g. here), and track relevant info (e.g. here).
Dave Orchard has an interesting post in which he has some figures on world reserves of Oil. I think this sort of thing should be better known, but will let you read his article rather than mine!
by Bryan Lawrence : 2005/01/26 : Categories environment (permalink)
Atomic content in NCAS
The NERC Centres of Atmospheric Science, NCAS, consists of a number of centres and facilities that have their own life and web presences. Today some of us had a brief discussion about whether it would be feasible to utilise RSS and/or Atom to improve the connections between the centres, the ncas website itself, and the wider world.
Ideally, all the information parcels on the websites would be individual atom parcels that can be aggregated in sensible ways: for example, it would be good if an atmospheric aerosol research group could aggregate all the ncas website articles on aerosols onto one page, while at the same time, the individual website articles could also appear (where appropriate) as NCAS news items. Clearly one prerequisite would be the existence of appropriate feeds!
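As a rough sketch of the kind of aggregation meant here, the following pulls the entries carrying a given category out of several feeds. It assumes Atom feeds using the http://www.w3.org/2005/Atom namespace (the later Atom 1.0 format) with a category element on each entry; the feed contents, centre names, and function names are invented for illustration:

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # assumed Atom 1.0 namespace

# Two made-up feeds standing in for individual centre web sites.
FEED_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Centre A</title>
  <entry><title>Aerosol campaign results</title><category term="aerosols"/></entry>
  <entry><title>New radar installed</title><category term="facilities"/></entry>
</feed>"""

FEED_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Centre B</title>
  <entry><title>Sulphate aerosol modelling</title><category term="aerosols"/></entry>
</feed>"""


def entries_with_category(feed_xml, term):
    """Return the titles of entries in one feed that carry the given term."""
    root = ET.fromstring(feed_xml)
    titles = []
    for entry in root.iter(f"{ATOM}entry"):
        terms = {c.get("term") for c in entry.findall(f"{ATOM}category")}
        if term in terms:
            titles.append(entry.findtext(f"{ATOM}title"))
    return titles


def aggregate(feeds, term):
    """Collect matching entry titles across all feeds, in feed order."""
    titles = []
    for feed_xml in feeds:
        titles.extend(entries_with_category(feed_xml, term))
    return titles


print(aggregate([FEED_A, FEED_B], "aerosols"))
```

The same entries could just as easily be filtered on a different term to build a news page, which is the attraction of having each item as its own parcel.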
We discussed the steps we would need to take:
We'd need to know how to exploit these feeds from the NCAS asp based web site.
We'd need to decide on the emphasis between headlines, abstracts, and full text.
We'd need to begin by scraping information from the existing sites, and producing feeds (at the BADC). We'd need buy in by the individual centres before progressing to anything more magic.
We'd need to provide good tools for content creation
Ideally we should identify some key folk to create some quality content, perhaps organised around key themes (the NCAS2 cross-cutting themes might be a good start).
In a more general sense, blogging tools customised to scientists could allow much better collaborative working. One of the problems with wikis is that a standard wiki doesn't allow clear authorship distinctions in a discussion. Using blogs, together with trackback and/or comments, could be one way of doing this; another is to modify a wiki to allow individual page ownership. Actually, the second option is just a special case of well constructed blog software, and would be easily supportable by Leonardo. We discussed what would need to happen for us to roll out Leonardo to NCAS scientists:
Stand-alone editing tools (this is underway)
Better math and code support (this is underway)
Implementing trackback and comments (planned)
Automatic creation of new blogs from the BADC user page (not difficult)
Caching of the wiki pages rather than on demand generation (not difficult)
Version control (can be supported by Leonardo with lfs extensions)
Searching of websites (needs substantial work for Leonardo support, but not really necessary a priori)
Support for external atom/rss2 content to be inserted within Leonardo (probably not hard)
Support for individual pages to have individual owners (probably the most difficult extension, but not necessary for individual blogs only for community blogs).
Of course blogging is something that may not come naturally to many of our community, so it's more likely that it'll take off amongst the NCAS students where, for example, we could establish an NCAS wide journal club. We could survey interest at this year's Royal Met Soc Meeting.
What next? We're going to build a wee prototype news service, and consider doing the seminars between a couple of candidate sites, and test out the feed constructs in our environment. A few more folk are going to play with Leonardo. We'll revisit this all again in Mid-March.
At the same time (well, actually, over a longer time scale), BADC is going to be moving to a comprehensive content management system (possibly Plone). Plone has some (if not all) of these features already, so we'll need to consider whether we should simply support this activity with Plone rather than extend Leonardo. (Personally I'll stick with Leonardo; here I'm talking about a solution we may have to roll out to hundreds of users who have complex roles and relationships, and Plone already has this sort of thing built in).
So, in summary, two new activities:
Investigating how to use atomic content better in NCAS, and
Exploring the possibilities of wide-scale collaborative blogging in our community.
by Bryan Lawrence : 2005/01/25 : Categories badc (permalink)
Legal Stuff
I've already been talking about the legal implications of open source software on our activities here at the BADC (here and here). Some initial feedback is that apparently the GPL is just as liable as the MIT license under UK law ...
We've also been told to be worried about legal liability of the use of our data, and we need to put a legal disclaimer on the web site about data use (never mind software use). Given that all our data is currently only for research, it is difficult to see what we could be liable for, but times are a changing ...
Of course, even this blog is a grey area (a colour the Law hates). In principle, I have no intellectual property rights to my work, they are all owned by the CCLRC, and so therefore is the copyright to this website. However, having the copyright to this blog owned by the CCLRC is simply unenforceable - Imagine I put (C) CCLRC below, and so did one thousand other employees (unlikely, but possible). What are you going to do in the event you want to copy some material from this site? Ask the CCLRC? Why would you, it's obviously designed to be public (else why would it be on the web)? Well, maybe you like what I write so much, you want to aggregate it all, I do after all offer an Atom feed, so I'm implying you can. There is a bit of a controversy raging about that (see Tim Bray and Scobleizer both of whom are responding to Martin Schwimmer). Tim's summary sounds good to me:
It seems to me the right solution is obvious; rewrite things to apply a Creative Commons Attribution-ShareAlike license to the syndication feeds, while retaining Attribution-NonCommercial coverage for the full text ...
Of course, you may want to do something not covered by that license (or whatever license I get around to putting here). In that case, perhaps you do want to ask the copyright holder whether you can do something with material on this website (stick it in a book for example - but see below). It would be an unfeasible concept for the CCLRC to have oversight on all the websites here, let alone the blogs we might have, but it is feasible for me to have oversight of my work. It makes sense therefore for me to indicate that I can allow you to make a copy for some further purpose, and I do so by having a personal copyright statement (I would argue that morally, if not legally - see below - I'm doing so on behalf of the CCLRC, exactly as I do when I sign a copyright statement to publish an academic paper).
I know that the last paragraph would be risible to a lawyer, but then so is common sense ... Amusingly, I have had dealings with two different lawyers from the one law firm: one says that I can give you permission to use data on behalf of the CCLRC, one says I can't give you software because I'm not the IPR manager or the director of the CCLRC or whatever ... Both data and documents/software are forms of IPR ... what gives? (From the first solicitor's point of view, the argument was that you can't know who at the CCLRC has legal authority to sign on behalf of the CCLRC, so if I do something I'm not allowed to do, that's CCLRC's problem not yours). The bottom line here is what is practical: it's simply not practical for the IPR management team at CCLRC to a) vet all our web sites, b) choose the appropriate license terms and/or c) decide on specific rights cases. We need to discriminate between things that are worth corporate oversight, and things that are not. One way is for me to assert that I can give you rights to copy this information, and I do so by saying the site is copyright to me on behalf of the CCLRC. Until someone comes up with a practical alternative (or someone makes me change it), this site will remain copyright to me.
Anthologies, i.e. books, versus aggregation, i.e. the web: at our legal workshop we talked about books of poetry: the copyright to each poem is owned by the author and there is copyright for the anthology also. This implies to me that an aggregation would be in the same category except for one key difference: for rss/atom aggregation you've obtained the poem (data/software) from the web subject to whatever license I provided; if it doesn't allow you to create a book, then you can't. In the case of a book (or anything else not covered by the license by which you read this site or the text aggregated from it), you need me (technically the CCLRC) to give you those rights.
However, even for the automatic case, the concept that my output can be controlled by any part of central management is exposed for the silliness it is. Regardless of who I claim has copyright of my thinking, by putting a feed on this, I'm allowing a remote computer to aggregate information without human intervention at either end. I'm sure that legally, the person who runs the server doing the aggregation, could be in violation of copyright law if I'm not allowed to publish a feed ... the question is then, should the CCLRC allow me to have a feed? (Actually whether it's from a corporate web site or not doesn't matter, given what I want to blog about, it wouldn't matter if I did it on an external provider, the CCLRC still owns my output).
Coming back full circle, if we look at Tim's statement suggesting having a different license on his feed, there is a real issue for the legal system and commercial (automatic) aggregators to deal with: the computer that is doing the aggregating doesn't read the license statements on individual blog sites ... so how can it respect commercial versus non-commercial aggregation unless there is a person in the loop, or we have machine readable licenses?
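To make "machine readable licenses" a bit more concrete, here is a minimal sketch in python of what an aggregator could do if a feed carried its license as a link element. Everything specific here is an assumption for illustration: the sample feed, its URLs, the use of an Atom link with rel="license", and the crude rule that a Creative Commons license code containing "nc" means non-commercial. It is not what this site's feed (or Tim's) actually carries.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical feed: the title, URLs and licence are made up for illustration.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example blog</title>
  <link rel="alternate" href="http://example.org/blog/"/>
  <link rel="license" href="http://creativecommons.org/licenses/by-nc-sa/2.0/"/>
</feed>"""


def feed_license(feed_xml):
    """Return the href of the feed-level licence link, or None if absent."""
    root = ET.fromstring(feed_xml)
    for link in root.findall("{%s}link" % ATOM_NS):
        if link.get("rel") == "license":
            return link.get("href")
    return None


def allows_commercial_use(license_url):
    """Crude, assumed rule: only recognised Creative Commons licences
    without the 'nc' (non-commercial) term permit commercial aggregation."""
    if not license_url or "creativecommons.org/licenses/" not in license_url:
        return False  # no machine readable statement: assume nothing is granted
    code = license_url.split("/licenses/")[1].split("/")[0]  # e.g. "by-nc-sa"
    return "nc" not in code.split("-")


if __name__ == "__main__":
    url = feed_license(SAMPLE_FEED)
    print(url, "commercial aggregation allowed:", allows_commercial_use(url))
```

The design choice worth noting is the default: when the aggregator finds no license it can parse, it assumes it has no rights, which is exactly the case where a person would otherwise have to be in the loop.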
by Bryan Lawrence : 2005/01/25 : Categories curation badc (permalink)
Internet Explorer Sucks
I've just discovered that Internet Explorer has been rendering the menu for this site with nothing readable in the menu ... for three months! (I checked it out on firefox, konqueror, and safari ...). The menu used to look like this:
[Image: screenshot of the menu]
The culprit? My css used to have

    div#menu p {
        margin-top : 0px;
        margin-bottom : 0px;
        font : 0.9em "trebuchet ms", verdana, arial, helvetica, sans-serif;
        line-height : 1.1em;
    }

Now I've removed the 0.9 em it seems to work fine. Of course it looks ever so slightly less attractive on the other browsers. My, I hate IE!
Note added later: actually, it turns out that even this hasn't worked properly on Internet Explorer. I suspect those of you with IE6 will be seeing the image in the bottom right, overlaying some of the text. I might not blame IE too harshly though - it's possible that the html soup on this page has confused it. Maybe my new wiki code will clean this up ... but we'll have to wait for that.
by Bryan Lawrence : 2005/01/25 : Categories computing (permalink)
Model Resolution, Ensembles and Physics
Last Wednesday I attended a meeting of the Royal Met Society on Perfecting Imperfect Models.
During the panel discussion we spent time discussing the relative importance of spending effort on
improving the physical parameterisations in climate models
increasing the number and breadth of ensembles, and
improving model resolution
In a somewhat provocative manner, we were charged to think about whether, if we could achieve 1 km horizontal resolution (in climate models for climate length integrations) in 10-15 years time, we would put less effort into improving the physical parameterisations in our models?
Leaving aside the obvious response that some physical parameterisation will always be needed, there was little discussion about the reality of the proposition. Let's consider the facts:
Recent state of the art climate models (HadCM3 in 2000, HadGEM in 2005) have moved resolutions from order 300 to order 150 km (implying a factor of two was achievable in those five years).
The next generation of models (e.g. HiGEM) are aiming for a doubling again, but doing the work required to double the resolution is a three year project (this is not the computing work: it's about producing ancillary datasets, understanding the coupling of the components at higher resolution, and indeed, addressing changes to the physical parameterisations).
So, in fifteen years, from a scientific point of view, we might be able to sustain at best five doublings (although I suspect it ought to be fewer than that to get better mileage out of the better models at each step). That implies that scientifically the absolute best we might be able to get to for climate is around $150/2^5 = 150/32$, i.e. about 5 km. More realistically it will be necessary to do some science along the way too, so we can imagine a process of resolution enhancement, scientific consolidation, and further enhancement. This process of model evolution could be called punctuated equilibrium (with obvious apologies for appropriating the name). So, with punctuated equilibrium, scientifically the best we could do is probably going to be more like 10 km ... (for climate; obviously NWP etc will be at higher resolution).
What about from a computing point of view?
Remembering that a factor n increase in horizontal resolution requires $n^2$ times as many calculations, and that in practice we would also need to increase the vertical resolution and decrease the time step, we are talking about somewhere between $n^3$ and $n^4$ times more calculations (anything less than $n^4$ implies some smart improvements in numerics). In reality, as we increase the resolution we will undoubtedly find that we need more parameters to be advected on a global scale, and/or additional complexity (e.g. new chemistry/aerosol schemes), so there is an additional factor p to be included which will also scale in the same way.
Moore's Law implies a doubling in computing capacity every 2 years, which needs to be compared with an increase in computations of (at best) $(pn)^3$ and, without smart numerics, $(pn)^4$. If m is the number of years required to support p and n, then for the $(pn)^4$ case we have $2^{m/2} = (pn)^4$, or $m = (8/\ln 2)\ln(pn) \approx 11.5 \ln(pn)$. For p = 1.2 and n = 10 (a guess, and from the example above), we would have m of about 28 years from Moore's Law (the best case, $(pn)^3$, gives $m \approx 8.7 \ln(pn)$, or about 21 years) ... so it seems that computing will limit development (and allow time for punctuated equilibrium in model resolution improvement).
Another way of thinking about it would be: what if we spent all our computing capacity improvement on resolution enhancement for the next ten years? In that case, we would have $2^5 = 32$ times more capacity available (from Moore's Law) which, neglecting p and assuming $n^4$ scaling, would support a resolution increase of just over two ($32^{1/4} \approx 2.4$)! This seems to be evidence that the 2000-2005 increase was only possible because there had been a rest period in the punctuated equilibrium of model development.
Of course, Moore's Law isn't the whole story because as the computing capacity has gone up, we have also been able to apply massive parallelisation as well ... this is an example of a technical improvement in the way we do things. However, we've done that, so without a new way of doing our modelling (or a new cost model for the hardware), the bottom line here is that a resolution increase of much more than 2-4 for climate models in the next decade is very unlikely.
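For anyone who wants to play with the numbers, the Moore's Law arithmetic above can be sketched in a few lines of python. This is a back-of-envelope sketch, not a model of anything: the function names are mine, the two year doubling time is the assumption stated above, and the exponent is left as a parameter so that both the cubic (best) case and the fourth power case can be compared.

```python
import math


def years_needed(n, p=1.0, exponent=4, doubling_years=2.0):
    """Years of Moore's Law growth needed to pay for a cost increase of
    (p * n) ** exponent, i.e. solve 2 ** (m / doubling_years) == (p * n) ** exponent."""
    return doubling_years * exponent * math.log2(p * n)


def resolution_factor(years, exponent=4, doubling_years=2.0):
    """Resolution increase n affordable after `years`, neglecting p,
    i.e. solve n ** exponent == 2 ** (years / doubling_years)."""
    return 2.0 ** (years / (doubling_years * exponent))


if __name__ == "__main__":
    for exponent in (3, 4):
        m = years_needed(n=10, p=1.2, exponent=exponent)
        n = resolution_factor(10, exponent=exponent)
        print(f"exponent {exponent}: {m:.1f} years for n=10, p=1.2; "
              f"resolution factor {n:.2f} after 10 years")
```

With n = 10 and p = 1.2 this gives roughly 28.7 years for fourth power scaling and roughly 21.5 years for the cubic case, and a ten year resolution factor of about 2.4 or 3.2 respectively, which is where the "2-4" range comes from.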
One of the factors that impacts on all this will be how much effort we put into increasing our ensemble sizes. We carry out ensembles in two main ways:
initial condition ensembles and
parameter ensembles.
In general these are targeted at understanding the uncertainties in predictions on shorter and longer timescales respectively. In general the larger the ensemble, the more likely the predictions are to sample a wider range of possible futures, and the better the accompanying prediction of the various likelihoods. However, one thing that ensembles can never do is sample possible futures that are not predictable by the models involved. That means the probability distribution and accompanying likelihoods must be biased by all the things that we didn't or couldn't include in our models. While that seems obvious, what it means is that spending our increasing computer resources on increasing ensemble sample size will not necessarily result in better or more accurate climate predictions. However, without ensembles, predictions are not accompanied by any uncertainty at all, and as someone said, a prediction without a quantification of uncertainty is no better than no prediction at all (tarot cards anyone?). The question is, how big is an adequate ensemble size for most work? We don't yet know!
What about the physics improvements? I think there are three classes of physics parameterisations in our models:
parameterisations which represent scales which are a long way from ever being resolved (e.g. radiation, cloud nucleation)
parameterisations of the ensemble effects of complete systems which we may be able to resolve to a "suitable" scale (e.g. mid-latitude storm systems, some classes of clouds).
parameterisations which represent the sub-grid scale effects of processes which occur on a range of scales, some of which are resolved (e.g. the gravity wave spectrum and associated effects of flow over orography).
I believe we need to think about each of these rather differently, and weigh the effort required against predicted resolution at the time a particular round of parameterisation improvements may be complete.
Things that we didn't spend time thinking about include:
What about the effort improving the complete systems which we need to address as separate component models (ice, the land surface etc)?
What about the problems handling the datasets that are produced by enormous ensembles and high resolution? (Indeed, IO limits may in fact be a bigger problem than CPU limits in the near future).
Another thing we never spend enough time on in climate science is thinking about how best to exploit adaptive grid techniques, e.g. the material discussed at Cambridge in December 2004.
by Bryan Lawrence : 2005/01/24 : Categories climate : 1 trackback (permalink)
More on Open Source Licensing
In my blog entry on open source licensing I reported a pre-workshop meeting on software licensing. The actual workshop happened on Friday. As it happens, it was more targeted at the commercial exploitation of software by the CCLRC than at the scientific exploitation of software (the Baker report has led to some very bizarre distortions in academia). However, there were some useful tips and pointers about risks in using software.
Key things to look for in the license of the software we deploy include: Do we have rights to:
use (obvious, else why have it)
copy (probably obvious)
modify (didn't realise I could get software without the right to modify, but you can)
distribute (often important for BADC)
sell (not really relevant to us, Baker report notwithstanding)
and are these rights exclusive or non-exclusive? (Again, not of great interest as long as we can do it.) It's also important to ensure as far as is practicable that the person supplying the software is legally able to do so. Something I hadn't really thought enough about is that where we get "educational" or "academic" licenses, we need to be sure we can provide services to third parties with that software; many deals do not allow that.
Most importantly, however, I left this workshop with two things on my mind. Firstly: I need to do a software risk analysis (of which more below), and secondly: if we provide software with some Open Source licenses, under UK law we would have unlimited liability ... (think about the situation where we provided some open source software which resulted in someone deleting their entire database of expensively obtained scientific results ... imagine if that was a drug company ...). This arises because, taking the MIT license for example:
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
There are two sentences in this ... both of which would be struck out in a UK court. I can't remember why the first one would be struck out, but the second would go because in UK law one must not attempt to avoid liability for death and/or injury (neither of which can be excluded by a contractual clause). Any clause which attempts to do so is history ... So, we would have unlimited liability for anything arising from the use of software we passed on under the MIT license.
We didn't cover the LGPL and GPL in the workshop, which is a shame, because they have rather different clauses:
BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
I think, but I'm not sure, that some of the first of these would be lost for the same reason as the first sentence of the MIT license. However, I suspect it would be far harder to argue, and therefore unlikely to happen. In the second, at least there is the phrase "unless required by applicable law" which I suspect could be used to imply that the license was not attempting to avoid something unavoidable ...
We also didn't cover Creative Commons type licenses, which again is a shame, because they have been ported to different legal jurisdictions. Here, for example, is a UK draft of the liability exclusion (full link):
With the exception of (i) liability which may not be excluded or limited by applicable law (including without limitation death or personal injury caused by breach of duty); and (ii) liability that You may incur to a third party resulting from breach of the warranties in Section 5(a), in no event will Licensor be liable to You for any incidental, indirect or consequential loss or damage arising out of this Licence or the use of the Work, whether such losses were foreseeable, known or otherwise.
Risk Analysis
Consider the situations where software licensing impacts on our activities at the BADC:
We buy software and deploy systems. The chief risks here are that we use the software in ways that invalidate our license agreements, and the consequences could be a) financial, and b) loss of service to our scientific customers. There is also the risk that the supplier didn't have the rights to supply it, and/or cannot maintain the software. Not really relevant to the Open Source issues ...
We develop and deploy open source software for our web systems and data manipulation packages. Here the only risk is that the open source software we use infringes some third party patent or copyright. In this case we'd have to re-engineer. I really think it's unlikely anyone would go after us for loss of profits, and equally unlikely that we would have to cease and desist with things we have built. This is so unlikely that we can ignore it.
Some of the software we develop we make available as packages for third parties (e.g. the NERC DataGrid project), and some we contribute back to other projects. The CCLRC worries about these two activities for two reasons: potential loss of income from our IPR, and liability arising from our contributions. Given that almost all our work on software development is about making it easy to exploit the data for scientific goals, I don't believe we should be attempting to exploit commercially the software (and indeed, I think there is no business case to do so). In terms of liability, clearly we have to avoid having unlimited liability (so no use of the MIT license), but it looks (to me) like variants of the GPL are far less likely to cause problems in this regard. In fact, using something which is not a variant of the GPL is likely to give us more trouble, because we can't then exploit the GPL, and the cost of not being able to exploit GPL software would make our activities unsustainable.
There is an issue of aggregation that needs thinking through as well: where software is aggregated, as again we may do with the NDG, we need to ensure we don't violate any of the sub-licenses.
We provide software with our datasets, so the data can be available (now and in the future). As I said in my earlier entry, we have a policy that such software must be covered by an appropriate open source license. There are a number of risks associated with this one too:
Has the copyright holder of that software provided us with an appropriate license? (This is a particular issue for bespoke software: we had thought that because we had code without a copyright statement, we had the right to do what we like. Apparently not ... we need an explicit license for this stuff. To some extent we can alleviate this by having common data formats, not bespoke software, and then we have less software to keep track of ...)
Can we maintain this software into the future? Leaving aside the technical issues, does our license allow us to modify the software? (And obviously, importantly, does it allow us to distribute?) Any Open Source license would give us that!
In the unlikely event we did want to commercially exploit software we write and distribute based on the GPL, then the risks to consider would be:
have we used any GPL software, and
have we used contributions from 3rd parties to our software.
(It would seem unlikely that the answers to either of those questions would allow us to commercially exploit!)
The bottom line here is that we have very low risks in using GPL based licenses, and substantial risks in being unable to deliver our curation responsibilities using any other licenses (except possibly something based on creative commons).
by Bryan Lawrence : 2005/01/24 : Categories badc curation (permalink)
generateDS
I've just discovered generateDS (via Uche Ogbuji's column).
generateDS.py generates Python data structures (for example, class definitions) from an XML Schema document. These data structures represent the elements in an XML document described by the XML Schema. It also generates parsers that load an XML document into those data structures. In addition, a separate file containing subclasses (stubs) is optionally generated. The user can add methods to the subclasses in order to process the contents of an XML document.
In the limitations, we find that
it supports the following XML schema constructs (which are described as a small subset of XML schema):
Attributes of types xs:string, xs:integer, xs:float, and xs:boolean.
Repeated sub-elements specified with maxOccurs="unbounded".
Sub-elements of simple types xs:string, xs:integer, and xs:float.
Sub-elements of complex types defined separately in the XML Schema document.
generateDS.py generates two kinds of parsers: one kind is based on SAX and the other is built on minidom.
the SAX parser is noted to be pretty broken and it's advised not to use it.
both styles of parsers construct instances of the data structures generated by generateDS.py. This means that, even when the SAX parser is used, generateDS.py may not be well-suited for applications that read large XML documents, although what "large" means depends on the hardware involved.
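To give a feel for the idea (a class per element, plus a minidom based loader that builds instances of it), here is a small hand-written sketch. To be clear, this is not generateDS output and doesn't use generateDS at all: the element names, the Person class and the parse_people function are all invented for illustration, and I've only covered the sorts of constructs listed above (an integer attribute, a string sub-element, and a repeated sub-element).

```python
from xml.dom import minidom

# Hypothetical document: element and attribute names are invented for illustration.
SAMPLE = """<people>
  <person id="1"><name>Alice</name><interest>xml</interest><interest>python</interest></person>
  <person id="2"><name>Bob</name></person>
</people>"""


class Person:
    """One <person> element: an integer attribute, a string sub-element,
    and a repeated (maxOccurs="unbounded") string sub-element."""

    def __init__(self, id=None, name="", interests=None):
        self.id = id
        self.name = name
        self.interests = interests if interests is not None else []

    @classmethod
    def build(cls, node):
        """Construct a Person from a minidom element node."""
        obj = cls(id=int(node.getAttribute("id")))
        for child in node.childNodes:
            if child.nodeType != child.ELEMENT_NODE:
                continue  # skip whitespace text nodes between elements
            text = "".join(t.data for t in child.childNodes
                           if t.nodeType == t.TEXT_NODE)
            if child.tagName == "name":
                obj.name = text
            elif child.tagName == "interest":
                obj.interests.append(text)
        return obj


def parse_people(xml_text):
    """Load the whole document with minidom and return a list of Person objects."""
    doc = minidom.parseString(xml_text)
    return [Person.build(n) for n in doc.getElementsByTagName("person")]


if __name__ == "__main__":
    for person in parse_people(SAMPLE):
        print(person.id, person.name, person.interests)
```

Note that both the DOM and the list of objects are held in memory at once, which is the reason for the caveat above about large documents.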
by Bryan Lawrence : 2005/01/21 : Categories python xml (permalink)
xml handling in python
Much of NDG will depend on code for handling xml documents. This will probably need to be done in two architectures: python and java (to support two communities: scientific application programmers and web development of access tools).
Up to now there have been two packages in the frame from the python perspective:
libxml2, and
I have been using libxml2 in my certificate checking and signing code for NDG (e.g. xmlsec) but even for such a simple task as signing a document and checking the signature using libxml2, we had a nightmare getting a java and a python implementation to agree ... it was also apparent that using libxml2 in python sucks. However, we're going to stick with libxml2 for the next six months anyway ...
There is some good news on the horizon: until I read Ryan Tomayko's blog entry I was not aware of the lxml project linking the elementtree api to libxml2 ...
Even more interesting in some ways is the sequence of discussion between Nelson Minar and Uche Ogbuji on the plethora of un-pythonic ways of doing xml handling in python. Nelson has code snippets from PyXML, libxml2 and elementtree. Uche initially followed this up in two blog entries, the first of which is summarised with
If you're coming more from a Python background, and XML is just something that's getting in your way, try Amara. If you're coming from an XML background, and you think in DOM, XSLT and all that, try 4Suite.
In Uche's second blog entry he added some snippets from 4Suite and Amara.
This is followed up by Nelson again who concludes:
There are too many XML choices in Python. And the obvious ones aren't right!
Nelson also goes on to comment that even the Amara syntax isn't particularly easy at startup. Uche's comeback (number three) is interesting, and would appear to demonstrate that Amara is incredibly efficient and easy to use.
Buried in the comments to Uche number two is the following on lxml (which is where I started on this from Ryan's blog entry):

    from lxml import etree
    tree = etree.parse('ot.xml')
    tree.xpath('(//v)[5]/text()')

which returns

    [u'And God called the light Day, and the darkness he called Night. And the evening and the morning were the first day.\n']

The point being that this is a much more pythonesque interface to libxml2 than the existing binding, and is compatible with elementtree. This might allow us some real flexibility for NDG.
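To show what the elementtree style of interface looks like without needing lxml or the ot.xml file to hand, here is a minimal sketch using the ElementTree module from the python standard library. The sample document is a made-up stand-in (only the v element name comes from the snippet above), and because the standard library supports only a subset of XPath, the fifth verse is picked out by indexing the result list rather than with a positional predicate.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for 'ot.xml': every name except <v> is invented,
# and the verse text is placeholder content.
SAMPLE = """<tstmt>
  <chapter>
    <v>verse one</v>
    <v>verse two</v>
    <v>verse three</v>
  </chapter>
  <chapter>
    <v>verse four</v>
    <v>verse five</v>
  </chapter>
</tstmt>"""


def nth_verse(xml_text, n):
    """Return the text of the n-th <v> element (1-based) in document order."""
    root = ET.fromstring(xml_text)
    verses = root.findall(".//v")  # all <v> descendants, in document order
    return verses[n - 1].text


if __name__ == "__main__":
    print(nth_verse(SAMPLE, 5))
```

The attraction for NDG is that code written against this style of api should need little change whichever elementtree-compatible implementation sits underneath.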
However, on libxml2 the final word may come from Uche number one:
... libxml2 is a miracle of function, but alas in a form that doesn't suit Python one bit. I know that folks are working on better libxml2 wrappers, but familiar as I am with the C code, I honestly don't believe they can produce anything truly Pythonesque without losing all the performance gains.
Well, what are we going to do? We're committed to libxml2 in NDG phase one, but that doesn't commit us to how we interact with it, we can either use dedicated wrappers (current approach) or we can investigate lxml some more.
Now that celementtree is out, and it also seems to be blindingly quick, we'll investigate that for NDG2, which means lxml (with its elementtree-compatible interface) has significant advantages. Obviously we'll also investigate Amara, but our previous experience in trying to get 4Suite to work on our Suse boxes was not good: we failed (to be fair, we put virtually no effort into it, but the whole point is that these tools need to be trivial to install and use).
by Bryan Lawrence : 2005/01/19 : Categories python xml ndg (permalink)
Massachusetts Contributes to easier Document Curation
Groklaw reports that the (US) State of Massachusetts is going to require all agencies to store public documents in nonproprietary formats such as HTML or PDF.
Acceptable formats, according to the state's Information Technology Division, are now Rich Text Format v. 1.7 (.rtf); Plain Text Format (.txt); Hypertext Document Format (.htm); Portable Document Format (.pdf) - Reference version 1.5; Extensible Markup Language (XML) v. 1.0 (Third Edition) or v 1.1 "when necessary".
This article has some meeting notes which encapsulate the issue nicely:
It should be reasonably obvious for a lay person who looks at the concept of Public Documents that we've got to keep them independent and free forever because it is an overriding imperative of the American democratic system. That we cannot have our public documents locked up in some kind of proprietary format or locked up in a format that you need to get a proprietary system to use some time in the future. So, one of the things that we're incredibly focused on is insuring that the public records remain independent of underlying systems and applications insuring their accessibility over very long periods of time. In the IT business a long period of time is about 18 months, in government it's about 300 years, so we have slightly different perspective.
There is even some comment in the article that Microsoft may come to the party and alter the license terms associated with their (Word 2003) XML doc format. The issue here is that while the doc is an open format, the semantic interpretation of it is wrapped up in a very restrictive license. (There is considerable discussion of this licensing issue in the groklaw article). The bottom line is that if Microsoft chose to implement those license conditions it is feasible that an open source piece of software which read Word 2003 docs could be found to be illegal ...
... meanwhile, until the MS Office schema are in the public domain with an appropriate license (or, of course, MS moves to a proper semantic markup of their documents), there will be no Microsoft documents held in our long term archive without pdf copies ...
In the meantime, well done Massachusetts!
by Bryan Lawrence : 2005/01/18 : Categories msxml (permalink)
Python Gui Kits
jkx@home has a brief review of his experience with python gui toolkits. I've been flirting briefly with all of these except glade based kits, so it's interesting he recommended it.
My experience thus far had been pyqt was the best, and I too had been worried about the license ... and never managed to get wxpython to do much good for me (too many of the demos simply didn't function correctly on my suse distributions). Tkinter on the other hand was very good, but limited in some key ways (primarily to do with drag-n-drop).
One day I'll investigate the pyGTK tutorial he describes.
by Bryan Lawrence : 2005/01/13 : Categories python (permalink)
A view of an iceberg/glacier collision from Space
NASA have released a news article showing images from space of a large iceberg (about the size of Long Island, NY) moving towards the Drygalski Ice Tongue. The article speculates the collision is due within a few days, but even if it doesn't happen, the mpg movie that shows the comparative sizes of Long Island and the iceberg and then the movement of the iceberg is well worth a look.
by Bryan Lawrence : 2005/01/13 : Categories environment (permalink)
Open Source Licenses
We're in the curation business. That means we're in the business of keeping our data for a very long time ... the reality of data is that it's always accompanied by software to use it. If we're serious about longevity of the data we're forced to be serious about longevity of the accompanying software. In practice, given the way companies can be bought and sold, and rights to software might change, we are required to be very careful about whether our software has legal longevity as well as technical longevity.
Up until yesterday, I happily thought that using software protected by open source licenses would protect the legal longevity of our software.
(A digression:) Yes, I know some large companies argue the opposite, but their warranty that the software they supply me is theirs to give is no more valid, and may be less valid than that which accompanies open source software. As has been argued many times, the reality is that closed source software is just as amenable to, for example, patent challenge, as open source. In fact, if software security is anything to go by, it's probably more at risk of having transgressed, even if (unlike security) it is (possibly) less at risk of being caught. Of course there is a difference in whose pockets may be used to defend a legal challenge on the software ... (but if I take the long view, it will always be my organisation's pockets). (Yes, I'm still worried about software patents even though I'm in the UK, because while they don't exist in Europe, yet ... it's not clear that they'll never exist).
(Back to the main story:) What happened yesterday? Members of my parent organisation (the CCLRC) met with a lawyer to discuss an upcoming workshop on software licensing. I learnt that the GPL is a US legal instrument. I guess I knew that, but I never understood the ramifications. It turns out that this means
The license does not state a legal jurisdiction or under which legal system it should be interpreted. (Compare it with the QPL).
Given we don't know whose law applies, it's a problem that not all the terminology has meaning outside the US, in particular phrases like "derivative-work" and "as-is" have no legal meaning in the UK (and probably Europe).
I'm not sure I understood the next point properly, but I'm repeating it here in the hope it will get clarified:
Arguably, the attempt to provide no warranty of fitness of purpose could be construed as an unfair contract in UK law.
(I would argue that given no money changes hands over the software, it's hard to see what is unfair about it, but perhaps difficulties could occur if one tried to provide services based on using such code?)
Anyway, it turns out that these difficulties could make the GPL difficult to work with here in the UK ... what would happen if one UK company and another UK company had an argument based on the GPL? What about European companies? What about cross-Atlantic arguments? Whose law? Whose interpretation?
Yesterday's discussion wasn't formal legal opinion but it was informed opinion and it has me nervous ...
However, the reality is that we produce software and services based on GPL'd code, and so our code has to be GPL as well ... so we need to understand these issues but not fret.
On the flipside are the risks associated with cheaper versions of commercial licenses: In a world where we obtain software under "academic" or "non-commercial" licenses we need to be careful about the conditions of use. Can we offer services based on academic licensed software to other academics (and charge them)? What about other research institutes? This has me nervous as well, because we (the BADC) exist because the CCLRC offer services to the Natural Environment Research Council (NERC) primarily via a Service Level Agreement. Is this a commercial relationship? Does it invalidate our licenses?
Me, I still feel safer with services based on GPL code.
Then there was the issue about the software we produce. We spent some time talking about other licenses obviously, and things to look for. It turns out that the words exclusive and non-exclusive are very important for software we produce. In particular, we only want to give folk non-exclusive rights to our software regardless of whether it's open source or not. That way we can do something else (like sell other copies of the software). Or at least it would be that simple until we understand that people feed back corrections to open source software, in the understanding that it is open source. Can I then do something different with the "modified externally" software? I don't yet know.
Again, I feel safer with a "traditional" open source license; that way we all know the score, even if the legal ramifications are not yet clarified. In terms of risk analysis, should we ever get clobbered on a GPL violation based on EU (or UK) law, then I'd go public, and figure that it wouldn't just be us (BADC, CCLRC) that would have to defend it, it'd be the UK (plc) or the EU (plc).
by Bryan Lawrence : 2005/01/08 : Categories badc curation (permalink)
Searching Techniques
Serendipity is a strange thing. Two days ago we started a discussion on the internal NDG mailing list about what should or shouldn't be in our B metadata elements (the schema is nearly fixed, but would we allow XHTML markup within some elements?). The discussion broadened into what B metadata is for (B is browse metadata in our NDG lexicon). We were discussing what belonged in the browse metadata and what belonged in extra (E) metadata held alongside the data. This morning, I came across Derek's blog on Search is not Search, which prompted me to a) read the paper it refers to (of which more below), and b) organise my own thinking a bit more.
My own thinking is, of course, about searching, browsing, locating and utilising environmental data. In the NDG, we have distinguished between searching (utilising harvested D-discovery metadata), and browsing (utilising local to the data B-browse metadata). The B metadata I envisage will help users understand what the data is, and how it relates to other datasets. As such it includes both contextual data about the data (I say data about data to avoid the IR guys writing it off as simply indexing material) and linking data to other similar datasets. (Some more info about B can be found in my blog and in various documents on the NDG site.)
I have to confess that I've not read much of the searching literature before, and have mainly relied on a number of exercises we've undertaken asking users about what they want and how they think, as well as useful conversations with our peers (for example, the NASA Global Change Master Directory (GCMD) folks reported at a meeting I attended that roughly half their web-site users find datasets via keyword and half via free text search).
Derek refers to a paper by Teevan et al., 2004, in which they discuss how people perform personally motivated searches. To some extent, that's what I mean by serendipity: here we are in NDG discussing searching issues, and a (not quite random) blog puts me onto something really relevant. Anyway, Teevan et al. provide a number of interesting snippets:
... participants used keyword searches in only 39% of their searches, despite almost always knowing their information needs up front.
(As an aside, this is comforting confirmation of the rough statistics from the GCMD folks, and made me feel better about not having read much literature on this.) Teevan et al. go on to discuss how the other 61% seem to go, which is in a pattern they call orienteering. Orienteering is a process of moving by small steps towards something the user (searcher) knows exists.
The process involves using both prior and contextual information to narrow in on the actual information target ...
They contrast this with something they call teleporting:
When a person attempts to teleport, they try to jump directly to their information target. Teleporting represents the behaviour many search engines try to support ...
In the NDG lingo, I think teleporting corresponds to discovery and orienteering to browsing. One of the advantages of browsing that Teevan et al found was
... that it gave people a context for their results. We saw our participants use the context of the information they found to understand the results and to get a sense of how trustworthy those results were. Context was often essential in helping the participant understand that they had found what they were looking for ...
They went on to observe that
orienteering had an added advantage over simply presenting keyword searches with some surrounding context: it allowed participants to arrive at their result along a path they could understand. This process enabled them to understand exactly how the search was performed, and consequently accept negative results.
I think this is very important. When looking for environmental data, much of the difficulty is knowing what was in the mind of the person who archived the data. Often you can find things by browsing in the vicinity of similar datasets, which explains why some searches don't help: not only have you not thought of the right search term, but worse, your search term hasn't even picked up anything vaguely relevant. It is very useful to get into the vaguely relevant regime, and then move to where you want to be. (However, this is just the information searching equivalent of least squares minimisation, with all the attendant risks of local minima/maxima etc.)
In the implications section of their paper, they comment that
Orienteering's prevalence could be due to the fact that search engines do not permit effective teleporting. For example, keyword search engines often fail when confronted with overly specific queries ... while search engines are expanding to include meta-data ... a better way of incorporating metadata is to use metadata for browsing.
One of their suggestions was:
... next generation source tools could learn users habitually used or trusted sources, and make them accessible ...
So, it has been a useful exercise reading through this, and it has confirmed my impression of the importance of the browse phase in obtaining data. It has also emphasised to me the importance of being able to make incremental steps on the way to finding data. The state of the art at the moment for data is Google or the GCMD, followed by the Live Access Server or an OGC web service, so we have much yet to do ...
by Bryan Lawrence : 2005/01/07 : Categories ndg (permalink)
Latex Wiki
Back in November I wrote an entry on maths and blogging where I stated that I would inevitably want maths on this wiki. In the intervening time, it became clear that we needed an easy way of producing XHTML markup including mathematical expressions for a number of programmes (CF support and NDG at least) as well as for blogging.
The code in the compressed tar file (corewiki.tgz) does much of what is required. There are five files:
wikiBNL.py provides WikiFormatter, which parses a wiki text file and produces xhtml including references to images which are generated from latex in that text file by
embedhandler(.py), which utilises latex and dvi2bitmap to produce images
so2wiki2xhtml.py which is a command line utility to utilise WikiFormatter and embedhandler to produce the output files, and
MultiSplit(.py), a simple utility I found useful in the wiki parsing.
wikipage which has some simple examples of how it works.
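To give a feel for the division of labour described above (a formatter that finds the maths and emits image references, and a handler that renders those images), here is a minimal sketch. It is not the code in corewiki.tgz: the `$...$` marker syntax, the function names and the file naming convention are all assumptions made for illustration, and the rendering step (latex plus dvi2bitmap) is deliberately left out.

```python
import hashlib
import html
import re

# Assumed marker syntax: inline maths written as $...$ in the wiki text.
MATH_PATTERN = re.compile(r"\$(.+?)\$")


def image_name(latex_source):
    """Derive a stable image file name from the LaTeX source (assumed convention)."""
    digest = hashlib.md5(latex_source.encode("utf-8")).hexdigest()
    return "eq_" + digest[:12] + ".png"


def replace_maths(wiki_text):
    """Return (fragment, jobs): the text with maths replaced by <img> tags,
    and a dict mapping each image name to the LaTeX that should produce it.
    Turning the jobs into bitmaps is a separate step, not shown here."""
    jobs = {}

    def substitute(match):
        source = match.group(1)
        name = image_name(source)
        jobs[name] = source
        # Escape the LaTeX so it is safe inside an attribute value.
        return '<img src="%s" alt="%s" />' % (name, html.escape(source, quote=True))

    return MATH_PATTERN.sub(substitute, wiki_text), jobs


fragment, jobs = replace_maths("Energy is $E = mc^2$ here.")
print(fragment)
print(jobs)
```

Keeping the "find and replace" step separate from the "render" step means the same LaTeX snippet is only rendered once, however often it appears.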
The basic wiki formatting began with the original leonardo code, and by bouncing ideas off James Tauber I eventually got something we could put into leonardo, which is why a close examination of the code will show some "leonardo"ness. In fact, together James and I have got a prototype of the next version of leonardo using this code already, and this site will go to that version at the earliest opportunity.
The standalone wrapper is intended for use outside of leonardo.
All the code is very alpha as yet, and is pretty inefficient in places, but it's a useful first step.
Update (Jan 11): I've realised that the output of this version is very poor xhtml (actually it's not xhtml at all). I've also found a few bugs and a need for the inline maths to be escaped. A new version has been linked above, but has yet to be fully tested.
Update (Jan 20): More bug fixes and support for code highlighting via the source-highlight code.
Update (Jan 28): Support for doi via name, another tiny bug fix.
2005/01/06 (permalink)
On Energy Supplies
Despite the content of most of my blog entries, I am fundamentally an atmospheric physicist, and only a computer geek by proxy. I've resolved as a New Year's resolution (yet again) to spend more time actually thinking about atmospheric physics. I'm resolved to live up to this by having at least one blog entry a week on atmospheric science (setting little targets does seem to work for me). However, in an act of pseudo-procrastination, I'm avoiding "atmospheric" physics for "environmental" physics first up.
I was reading two excellent articles in Physics Today: One on the hydrogen economy and one on basic choices and constraints on energy supplies (hereafter Weisz). The latter was full of really good quotes:
Energy demand by humanity continues to rise ... While total demand is, of course, influenced by personal demand, even unusually large (20%, say) conservation efforts would be nullified by population growth in less than 20 years.
This argument is often used by those arguing against the Kyoto agreement as a justification for doing nothing about greenhouse gases. Of course, it doesn't reflect the economic pressures that obeying the Kyoto agreement would bring. Once it becomes economically viable to look at alternatives, then some of that energy consumption could be achieved without releasing carbon dioxide pollution. So, in and of itself it's not an argument against Kyoto.
The Weisz article, while not being about greenhouse issues per se (it's more about the impending energy crisis), addresses some of the realities associated with "alternatives":
many of the ideas researchers propose cannot significantly impact the real magnitude of the energy problem or may provide only short-term relief.
As he says, our basic choices of energy supply are limited:
using stored energy, and
harnessing incident solar energy.
The former category includes
nuclear energy,
fossil fuels, and
geothermal.
The latter includes:
hydroelectricity (yes it does: how else did the water get up there, so that we can extract energy bringing it down?)
biomass
wind
solar cells etc
The article analyses most of these in real terms of what they are likely to do to address the energy requirements. It's a very accessible article, so I'll not repeat much of it here, but I did like the following:
More than ever since the beginning of the energy revolution, knowledge of the basic nature and limits of energy is needed to realistically determine and carry out effective policy designed to guarantee reliable energies in the future. That could well help ensure the survival of civilization. As H. G. Wells once remarked, "Human history more and more becomes a race between education and catastrophe."
Another important point was his quotation of some Masters work from the 1970s where Nicholas Georgescu-Roegen observed that:
most economists believe that "the economic process can go on, even grow, without being continuously fed low entropy," which in a thermodynamics context means "without receiving new energy."
As a consequence
As we approach the limits of our easy access to energy, the defining economic currency will be dominated by availability of energy units rather than by an artificial currency, be that gold or dollars.
Weisz goes on to state two realities:
The economic value of an alternative energy technology depends on the net rate of energy QNE it will deliver after the rate of energy production QPR is debited by the energy consumed for its operation QOP and the energy invested in its creation E during its lifetime T:
QNE = QPR - (QOP + E/T).
Economic value is affected by policy. Things with negative QNE can become profitable with appropriate government subsidies (ie E and/or QOP become smaller).
This is my point about Kyoto.
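To make the arithmetic concrete, here is a small sketch of the net energy relation quoted above. The numbers are entirely made up, and modelling a subsidy as a straightforward reduction in the invested energy E is my own simplification, not something taken from Weisz.

```python
def net_energy_rate(q_pr, q_op, e_invested, lifetime):
    """QNE = QPR - (QOP + E/T); all rates must be in the same units (e.g. kWh/year)."""
    return q_pr - (q_op + e_invested / lifetime)


# Made-up figures for a hypothetical installation with a 20 year lifetime.
unsubsidised = net_energy_rate(q_pr=1000.0, q_op=200.0, e_invested=20000.0, lifetime=20.0)

# Assumption: a subsidy halves the investment borne by the owner.
subsidised = net_energy_rate(q_pr=1000.0, q_op=200.0, e_invested=10000.0, lifetime=20.0)

print(unsubsidised)  # 1000 - (200 + 1000) = -200.0
print(subsidised)    # 1000 - (200 + 500)  = 300.0
```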
Weisz concludes with
In particular, an urgent commitment to solar and nuclear energy technologies appears to be mandatory for the long term.
even though fission can only be a stop-gap solution.
Speaking personally, right now I see a negative QNE for putting solar heating in my home, but I know that'll change (even here in the UK). Similarly, I have heard that chimney-size rooftop wind-power systems will be released later this year for less than 1000 pounds ... at that cost, it's likely individuals can start to make a difference ...
by Bryan Lawrence : 2005/01/05 : Categories environment (permalink)
On Binary XML
NDG reviewers the first time around suggested we would need binary XML. I think they thought we would put all our data in XML, not just the metadata.
However, there is (was?) an initiative to build binary XML. (Not to be confused with BinX, which is an XML descriptive format for binary data, or its cousin DFDL.)
What we're interested in is handling XML data itself as XML streams.
Now we could just compress it. As Kendall Grant Clark sensibly put it:
If you absolutely must have some kind of binary variant, gzip seems hard to beat since it allows you to pick any three from "decent compression factor", "decent (de)compression performance", and "already implemented everywhere".
This is a fair point for many applications. However, I think Clark didn't really cotton on to why we think we need it. In some of our applications, the size of XML documents is a problem. OK, so use SAX instead of DOM, I hear you say? Well maybe, but not always ...
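For readers who haven't met the distinction: a DOM parser builds the whole document tree in memory, while a SAX parser hands you events as it reads and keeps nothing unless you ask it to. A minimal sketch using Python's standard `xml.sax` module follows; the element names in the sample document are invented for illustration.

```python
import xml.sax


class ElementCounter(xml.sax.ContentHandler):
    """Count elements by name without ever building a document tree."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        # Called once per opening tag as the parser streams through the input.
        self.counts[name] = self.counts.get(name, 0) + 1


# Hypothetical metadata fragment; the element names are made up.
document = b"<dataset><axis name='time'/><axis name='lat'/><value>1.5</value></dataset>"

handler = ElementCounter()
xml.sax.parseString(document, handler)
print(handler.counts)
```

This keeps memory use flat however large the document is, but it doesn't make the document itself any smaller, which is the problem when the payload is numerical data.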
The issue is what does one do with encoding environmental datastreams in XML? We currently wrap the big stuff in binary formats like NetCDF and HDF, and that isn't going to change soon. However, we markup the metadata in XML (using our CSML, or CDML, or NCML or ESML or ...) and sometimes our metadata needs to encode some of the data (axes for example). While we can encode formulae economically, we can't handle large wodges of binary data. The same problem exists in the energy industry.
With large datasets, we've already got something that gzip doesn't do well with, so wrapping ASCII representations of that data (at some resolution which may or may not be machine precision) in XML is going to be bloat. Bloating something, then gzipping it does better than you may think, but it's still unnecessary data handling.
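A rough way to see the bloat for yourself is to compare the same numbers stored as raw doubles, as ASCII wrapped in an XML element, and as that XML gzipped. This is only a sketch: the synthetic sine-wave data and the `<values>` element name are assumptions, and real environmental data will compress differently.

```python
import gzip
import math
import struct

# Synthetic stand-in for a data axis or field: 10000 double precision values.
values = [300.0 * math.sin(0.001 * i) for i in range(10000)]

# Raw binary: 8 bytes per value, little-endian doubles.
binary = struct.pack("<%dd" % len(values), *values)

# The same values written out as full-precision ASCII inside an XML element.
xml_text = "<values>" + " ".join(repr(v) for v in values) + "</values>"
xml_bytes = xml_text.encode("ascii")

# And the XML gzipped.
xml_gz = gzip.compress(xml_bytes)

print("raw binary :", len(binary), "bytes")
print("ASCII XML  :", len(xml_bytes), "bytes")
print("gzipped XML:", len(xml_gz), "bytes")
```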
To my mind, XML is about data transfer, not about data storage. But this is an area where data volume is a significant problem. What we need to avoid is the situation where we have:
Binary Data -> Convert to XML -> Process/Move it -> Convert back -> Binary Data
In this case the conversion overheads involved are rather huge, and the risk of precision loss is also significant.
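The precision risk is easy to demonstrate. In the sketch below (arbitrary sample values, and a three decimal place text format chosen purely for illustration), a binary round trip returns exactly what went in, while a round trip through fixed-resolution text does not.

```python
import struct

# Arbitrary sample values.
original = [0.1 + 0.2, 1.0 / 3.0, 101325.123456789]

# Binary -> binary: pack as doubles and unpack again.
packed = struct.pack("<3d", *original)
binary_round_trip = list(struct.unpack("<3d", packed))

# Binary -> text at a fixed resolution -> binary.
as_text = " ".join("%.3f" % v for v in original)
text_round_trip = [float(token) for token in as_text.split()]

print(binary_round_trip == original)  # True
print(text_round_trip == original)    # False

# Largest error introduced by the text representation.
print(max(abs(a - b) for a, b in zip(original, text_round_trip)))
```

Writing the text at full precision avoids the loss, but at the cost of even more bytes per value.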
Diversion on binary format conversion
Conversion overheads can also be a problem with binary formats themselves. Unnecessary format conversion of any sort should be discouraged. One of my two major beefs with OPeNDAP is unnecessary format conversion (the other is to do with access control).
I have no problem with necessary format conversion, for example if you end up with something like
Binary Data (format m) -> Process/Move -> Binary Data (format n)
then the last thing you want to do is produce M times N format converters. What you want to do is produce (at worst) M plus N format converters to an intermediate format. In part, one of the successes of OPeNDAP is that it does address this issue (I think, however, the main reason for its success is that they produced a NetCDF binding for it, thus allowing easy development of remote access to NetCDF data).
What would really help OPeNDAP is the ability to avoid the protocol conversion if the user required the same output format as the original data ...
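The M plus N idea, together with the "don't convert if you don't have to" shortcut, can be sketched in a few lines. This is a toy: the three "formats" are just different text layouts of a list of numbers, the intermediate form is a Python list, and none of it reflects how OPeNDAP is actually implemented.

```python
# Toy formats: different text layouts of the same list of floats (all hypothetical).
READERS = {
    "csv": lambda blob: [float(x) for x in blob.split(",")],
    "lines": lambda blob: [float(x) for x in blob.splitlines()],
    "spaces": lambda blob: [float(x) for x in blob.split()],
}
WRITERS = {
    "csv": lambda values: ",".join(repr(v) for v in values),
    "lines": lambda values: "\n".join(repr(v) for v in values),
    "spaces": lambda values: " ".join(repr(v) for v in values),
}


def convert(blob, source_format, target_format):
    """Convert via the intermediate form, unless no conversion is needed."""
    if source_format == target_format:
        return blob  # pass the original bytes straight through
    values = READERS[source_format](blob)  # source format -> intermediate
    return WRITERS[target_format](values)  # intermediate -> target format


def direct_converters(m, n):
    """Converters needed if every source format maps directly to every target."""
    return m * n


def hub_converters(m, n):
    """Converters needed with one reader per source and one writer per target."""
    return m + n


print(convert("1.5,2.0", "csv", "lines"))
print(direct_converters(10, 10), "versus", hub_converters(10, 10))
```

With ten formats on each side that is 100 converters against 20, and the pass-through branch means data requested in its native format is never touched at all.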
Back to Binary XML
While we have engineered around storing massive amounts of binary data in CSML (for now), I can see significant advantages in being able to do so:
avoiding unnecessary format conversions
known binary precision
resulting in
less memory
shorter execution times
Unlike many XML applications, the last point matters! For very big datasets, it makes such a difference that something untenably slow can suddenly become usable.
Probably the area where this could make the biggest difference in the near future for us is in the handling of observational meteorology data. More on this at a later date ...
by Bryan Lawrence : 2005/01/04 : Categories xml ndg (permalink)