Bryan Lawrence : Bryan's Blog 2007/05

Bryan Lawrence

... personal wiki, blog and notes

Bryan's Blog 2007/05

building python on feisty

So now I have to build myself a new python on feisty kubuntu since /usr/local isn't safe.

Things to note:

  • It can't be done without installing libc6-dev

    • apt-get install libc6-dev

  • If you want the python command line to be functional you need readline

    • apt-get install libncurses5-dev libreadline5-dev

  • Update 1st June: At this point, despite the fact that it appeared to find the system zlib, an attempt to install gives:

    from setuptools.command.easy_install import main
    zipimport.ZipImportError: can't decompress data; zlib not available
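
One quick way to see which optional modules a freshly built interpreter ended up without is simply to try importing them. The sketch below assumes a reasonably recent Python (importlib.import_module postdates the Python of this post), and the helper name missing_modules is made up for illustration.

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that this interpreter cannot import."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            # The module was not built (or not installed) for this interpreter.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # zlib and readline are the two modules this build tripped over.
    print(missing_modules(["zlib", "readline"]))
```

An empty list means both modules were built; anything listed usually points at a missing -dev package at configure time.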

I'm now considering a virtual python ...

by Bryan Lawrence : 2007/05/31 : Categories ubuntu : 0 trackbacks : 0 comments (permalink)

I still believe in Fortran

I've never believed in religious wars over programming languages, but the latest O'Reilly survey on the state of programming languages makes interesting reading, if only for the assumption that book purchase measures the health of a programming language.

Mostly I don't care, but I couldn't really believe Fortran is so irrelevant (0 book sales in the first quarter of 1997!). I know it's commercially irrelevant ... so maybe all the relevant material is available online now? Well, a quick look at Google hits, compared with the O'Reilly classification and the TIOBE index (a more sophisticated ranking based on hits), gives:

  Language    Hits    O'Reilly Classification    TIOBE rank
  java        306M    Major                      1
  perl        107M    Mid-Major                  6
  python      87M     Mid-Major                  7
  delphi      59M     Irrelevant                 11
  latex       50M     Minor                      n/a
  tcl         29M     Minor                      26
  fortran     17M     Irrelevant                 19
  haskell     12M     Minor                      39

which is more consistent with what I'd have guessed. Maybe the truth is that most Fortran programmers are (relatively) old folk like me (although I haven't written a line of fortran for five years), who don't need new books. Further, looking at the position of delphi (which I've never even looked at) it seems Fortran isn't the only exception: the case for irrelevance is far from proved.

Perhaps O'Reilly need a new category name. Irrelevant these languages are not!
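
As a rough cross-check of the table above, a few lines of Python can order the languages by Google hits and by TIOBE rank and show where the two orderings disagree. The figures are copied from the table; the names are purely illustrative, and latex is left out of the comparison because it has no TIOBE rank.

```python
# (language, Google hits in millions, TIOBE rank or None), copied from the table.
LANGUAGES = [
    ("java", 306, 1),
    ("perl", 107, 6),
    ("python", 87, 7),
    ("delphi", 59, 11),
    ("latex", 50, None),
    ("tcl", 29, 26),
    ("fortran", 17, 19),
    ("haskell", 12, 39),
]


def ordering_by_hits(rows):
    """Languages with a TIOBE rank, most Google hits first."""
    ranked = [row for row in rows if row[2] is not None]
    return [name for name, _, _ in sorted(ranked, key=lambda r: r[1], reverse=True)]


def ordering_by_tiobe(rows):
    """Languages with a TIOBE rank, best (lowest) rank first."""
    ranked = [row for row in rows if row[2] is not None]
    return [name for name, _, _ in sorted(ranked, key=lambda r: r[2])]


by_hits = ordering_by_hits(LANGUAGES)
by_tiobe = ordering_by_tiobe(LANGUAGES)
# Positions at which the two orderings name a different language.
differing = [(h, t) for h, t in zip(by_hits, by_tiobe) if h != t]
print(differing)
```

With these numbers the two orderings differ only in swapping tcl and fortran, so hit counts and TIOBE tell much the same story.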

by Bryan Lawrence : 2007/05/29 : Categories computing : 0 trackbacks : 0 comments (permalink)

Citation and Claddier

For some time now, we've been narrowing down how best to do scientific data citations. Last week we had a workshop where we concentrated on a number of issues associated with data publication.

I introduced the workshop with some philosophical bumf (0.75 MB ppt), and I'll probably say a lot more about it later, but meanwhile, here I want to concentrate on one aspect: we got some real feedback on my proposals for data citation (explanation and ISO19139 version).

The key criticisms were:

  • Too many things that looked like URLs.

  • Too much stuff

  • The order of material could be reconsidered.

  • (and the particular example I've been using could be easier to comprehend if we used examples that weren't quite so pathological).

Recall that we had something that looked like this (no longer a "real" dataset, but simpler as an example):

Lawrence, B.N. My Radar Data, [Internet], British Atmospheric Data Centre (BADC), 1990, urn, feature anotherID, [http://featuretype.registry/verticalProfile] [downloaded Sep 21 2006, available from]

where this was essentially

Author, title, [Internet], Publisher, Date, URN, feature ID, [Feature Type (from a controlled vocabulary)], [downloaded date, available from Distributor website].

(It's important to remember that we believe the feature ID (anotherID) is important because we accept that with data we do expect folk to cite into them on a regular basis e.g. a record in a database etc.)

After some mucking around, the breakout group working on this came to something like this:

Lawrence, B.N. (1990): My Radar Data, [http://featuretype.registry/verticalProfile anotherID]. British Atmospheric Data Centre [Available from]

The URN could be a DOI, and in some cases it could be simplified to:

Lawrence, B.N. (1990): My Radar Data, [http://featuretype.registry/verticalProfile anotherID]. British Atmospheric Data Centre DOI:doiaddress.
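
To make the ordering of the pieces explicit, here is a minimal Python sketch that assembles a citation in the revised form. The class and field names are hypothetical (they are not part of any Claddier specification), and filling the "[Available from ...]" slot with a URL is an assumption about how the non-DOI form would be completed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataCitation:
    author: str
    year: int
    title: str
    feature_type: str  # drawn from a controlled vocabulary
    feature_id: str
    publisher: str
    doi: Optional[str] = None
    url: Optional[str] = None  # assumption: used only when there is no DOI

    def render(self) -> str:
        """Author (Year): Title, [FeatureType FeatureID]. Publisher, then locator."""
        head = (
            f"{self.author} ({self.year}): {self.title}, "
            f"[{self.feature_type} {self.feature_id}]. {self.publisher}"
        )
        if self.doi:
            return f"{head} DOI:{self.doi}."
        if self.url:
            return f"{head} [Available from {self.url}]"
        return head


example = DataCitation(
    author="Lawrence, B.N.",
    year=1990,
    title="My Radar Data",
    feature_type="http://featuretype.registry/verticalProfile",
    feature_id="anotherID",
    publisher="British Atmospheric Data Centre",
    doi="doiaddress",
)
print(example.render())
```

With the placeholder values above, this reproduces the DOI form of the example citation.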

We have made the following assumptions and simplifications from the previous version:

  • We lost [Internet] because we thought it was redundant once the citation has a URL or DOI.

  • In this case we are dealing with "formally published" data, and so there is an expectation that the data won't change, so the download date is redundant.

    • We thought that a formally published data set should not be allowed to grow; later "editions" could provide snapshots. We appreciate that this has implications for numbers of publications etc, but the importance of citing something as it was is preeminent.

    • This is not to preclude folk referencing material on the internet which is changing, but if it is going to be "published" data, then we think we can and should handle it differently.

  • We expect the target of the URL or DOI to be a metadata document; it should not be a binary target. There is human readable content there which provides more context and the URLs of the actual data. We have left the feature type in there though (as well as the feature ID), because it provides the human parser of a reference list a key hint about the target type [1].

  • There could be a mismatch between a dataset, which could have a DOI, and the URI of the feature (we don't expect all features to have DOIs), and so having two forms of the citation does make sense: the first form above allows a URL which points directly to the feature to be shown (not that it does in this example); even if the URL isn't persistent, the URN will be, and the data object should always be accessible via the publisher, URN, featureID combination. I suppose the DOI version is cleaner [2], provided the feature can be easily obtained from the target of the DOI.

1: I'll probably come back to the "manifest" concept that is coming out of the work of Raj Bose and Guy McGarva in Edinburgh another time, but suffice to say "manifest" could itself be a member of the controlled vocabulary: a featurecollection is itself a feature!
2: despite the fact I hate the way most folk create the things: anyone done the stats for how many DOIs are opaque unmemorable strings that are used in the literature and result in mistranscribed versions which point nowhere or to the wrong place?

by Bryan Lawrence : 2007/05/21 : Categories claddier curation : 1 trackback : 3 comments (permalink)

debian python and easy_install aren't a perfect match

It turns out that on a debian system, if you

  1. create your own python in /usr/local, and

  2. use easy_install (python eggs)

you'll get into trouble. Phillip Eby has a solution, but it's not very tidy.

I can't think of any good reason why Debian has done this, and of course it affects ubuntu badly as well.

by Bryan Lawrence : 2007/05/18 : Categories ubuntu python : 0 trackbacks : 1 comment (permalink)

overheard on the email

Overheard on the email lately (I think both Andrew and Stefano will forgive me for publicising excerpts from recent emails, especially since they're relevant to the whole "what place do OGC and GML have in the world" chat):

Andrew Woolf:

Too many people forget the case of a feature with (possibly multiple) coverage-valued properties. When scientific people complain that this ISO/OGC stuff "is just GIS" I robustly respond that actually the concepts are as revolutionary to traditional GIS as to us scientific users. Let's please leave behind these old notions of "raster" vs "vector", and realise that actually we can model the world in whatever complex way is necessary.

Stefano Nativi:

We should avoid "mental barriers" like Raster Vs Vector, as well as Coverage Vs Feature (the new version of the same contraposition).

In an interoperable Geospatial Information framework, Observation&Measurement, Feature and Coverage are different ways to see the same stuff (i.e. they are views). Different use cases may need to present users with a Feature view and access and process data using a Coverage view, or vice versa.

It helps to have the same toolkit to describe these views :-)

by Bryan Lawrence : 2007/05/09 : Categories ndg : 0 trackbacks : 0 comments (permalink)

It's not how big the tool is, it's what you do with it

I really ought not get involved in long discussions when I don't have time to finish what I start ... but anyway. Charlie didn't start the conversation by writing this, I did that by responding :-), but now it is a conversation :-). So this is an open letter to Charlie.

Taking things one point at a time:

  1. In his (my) view, GML is not meant for data exchange but instead provides a common language that various communities can use to develop interoperable solutions.

    • Umm, not quite. I think GML is meant for data exchange, but using it requires some understanding by the client; GML alone is not a solution (below we'll define data in this context).

  2. If I want to share data with my business partners, I can invent and implement a proprietary exchange format in less time then using GML. Or take an even simpler route - just exchange shape files and be done with it.

    • Each time you invent your proprietary exchange format, you probably can do it faster ... but remember both ends of that piece of wire have to be involved in the conversation. With GML (and a UML description as documentation), I've got a fighting chance of interpreting your data without you, and then writing my client or service.

    • As for shape files: care to explain how I can use a shape file to exchange a trajectory of dropsondes from an aircraft?

  3. Second, with this approach you end up with thousands of separate communities that cannot exchange data between them. Whether this is good or bad depends on your goals - if I want to exchange data with a few other like minded organizations then this is ok. But if your goals are loftier, to create a world-wide geoweb, then this is bad.

    • Umm. Nope, they can exchange data, but they have to do some work to do so, and absolutely I want a world-wide geoweb, but I don't want that to be proprietary or limited by the commercial imperatives of the GIS vendors. Actually, I think we both agree on this point, we're just trying to get there via different routes.

  4. Third, if the goal really was to create a common language to describe geographic information, as opposed to exchanging it, why not reuse UML (Universal Modeling Language) by creating a UML profile?

    • Agreed, so we start with our UML profile, which describes something, now what do we do? We have to find a way of implementing our description ... more of this below.

  5. (wrt GML) ... Is it meant to make it easier to implement thousands of one-off data integrations, or is it meant to enable geographic data exchange on a world-wide scale? ... If its the former, then I think there are simpler, faster approaches than GML. And if its the latter, then GML fails to provide a simple format that every system can use in the same way that Atom does.

    • Seriously: enabling thousands of one-off implementations a different way each time would be easier than GML?

    • Atom allows you and me to exchange a document that we can each interpret in terms of some simple concepts. Add GeoRSS to it, and you can tell me the location that this document applies to. If I want to do anything more complicated then we're into a GML extension anyway ...

John Caron, in the comments to my missive, also picked on OGC specs in general:

It could be that the process of creating OGC specs itself is flawed, perhaps because implementations come after the fact, perhaps because industry consortia are simply the wrong "governance" structure to produce clean technical specs with just the right level of abstraction.

I don't think all OGC specs are flawed, but do agree that GML is not at the right level of abstraction. It makes too much use of arcane XML technology (for example, in a slightly different context, who really cares about substitution groups?), and certainly could be much leaner. However, that's actually a problem with all standards that try to be all things to all people. There is no doubt that lean mean standards like Atom are easier to construct well (maybe the Atom team won't call what they did "easier", but all things are relative :-). The question then is could a profile of GML do an 80-20 job better than the whole thing? Could Atom (alone) do it all?

The clear answers are yes and no. In the first case, that is what Application Schemas of GML are all about. Step 2 in my diagram is about using GML to build an application schema, not GML in and of itself. That's the bit that communities should build to be lean and mean. Then only extend (or unify) a little as you add a new community.

To be fair, even building a profile is a pain, and that's because we all agree that XML schema is difficult to handle, and (again, in a different context) difficult to constrain.

Actually, perhaps part of the issue is what we mean by data. I think GML allows me to describe lots of attributes of my data in a way that's quite easy for someone else to consume: you can read my axes, understand the parameters (what dictionary did I use to describe them) etc. The final mile, "the real data", is going to be hard for anyone else to consume without knowing exactly what that data object is. But GML also allows me to define the coverages (albeit in a restrictive way, roll on a full implementation of ISO19123), and that's the take home point for my trajectory of dropsondes example above.

If I give you a coverage, conforming to a GML application schema of my trajectory of dropsondes, you've got a fighting chance, with a GML parser [1], of writing some code to grab the trajectory and put it on a map, or create a contour map of the height versus trajectory ... and all that without knowing about NetCDF itself. Yes, you might need software that can read NetCDF, but you don't need to interpret the NetCDF yourself, you only need to interpret the GML description (so I've just saved you the NetCDF manual, and maybe, in the CSML case, the HDF manual, the NASA Ames manual, and the PP manual, and yes, they add up to a lot more pages than GML alone). You can take my libraries though for reading the data and "just use them", but you can "just use them" in your application only if you parse the GML ... and do the thinking about what my data objects mean to you in your application (if anything :-).
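
To give a flavour of what "grab the trajectory" might look like, here is a minimal sketch using only the Python standard library. The gml:LineString and gml:posList elements and the GML namespace are real GML; the ex: namespace, the DropsondeTrajectory element and the coordinate values are invented for illustration and are not CSML or any real application schema.

```python
import xml.etree.ElementTree as ET

GML_NS = {"gml": "http://www.opengis.net/gml"}

# Hypothetical application-schema instance wrapping a standard GML geometry.
SAMPLE = """<ex:DropsondeTrajectory xmlns:ex="http://example.org/dropsonde"
    xmlns:gml="http://www.opengis.net/gml">
  <ex:path>
    <gml:LineString>
      <gml:posList>51.5 -1.3 51.6 -1.2 51.7 -1.1</gml:posList>
    </gml:LineString>
  </ex:path>
</ex:DropsondeTrajectory>"""


def trajectory_points(xml_text):
    """Return the first LineString's positions as a list of 2D coordinate pairs."""
    root = ET.fromstring(xml_text)
    pos_list = root.find(".//gml:LineString/gml:posList", GML_NS)
    if pos_list is None or not pos_list.text:
        return []
    values = [float(v) for v in pos_list.text.split()]
    # Assumes two values per position; a real client would check the CRS.
    return list(zip(values[0::2], values[1::2]))


print(trajectory_points(SAMPLE))
```

The point is that the code only needs to know about the GML geometry, not about the community-specific wrapper or the file format behind it.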

Could Atom (alone) do that? No! Atom would allow my software to consume your document easily, but to make sense of the content in my application, the "standard" has to describe the semantics of use to me. Of course Atom is relatively easy to use, and that's because it has limited semantics. Great, let's use it for what it is, but not pretend I can give you an arbitrary geospatial object in Atom and you can consume it in any meaningful way without having a conversation with me about what it is, and then writing some code :-)

Both ways we need new code. Your point is that it would be easier to use Atom to give me the object, and have a conversation with me every time to write the code. My point is that you can give me the object any way we like, but it helps if we have a toolkit that helps us describe the object without having to write (every time) yet another calendar handling tool, and dictionary handling tool, and ...

... and by way of conclusion: This isn't really an open letter to Charlie, it's really to my subconscious (so Charlie: thanks for the excuse: who knows, I too may change my mind; there's a lot more thinking and detail sorting out before this conversation is finished - whether or not anyone else bothers to join in :-) [2]

1: Oh yes, I know exactly what that means :-), no one has a complete GML parser, more's the pity.
2: Sorry this last paragraph didn't get submitted with the original, I somehow managed to submit an earlier version.

by Bryan Lawrence : 2007/05/08 : 2 trackbacks : 2 comments (permalink)

feisty kubuntu

On Friday I upgraded from dapper ubuntu to feisty kubuntu on my laptop. I needed to do it because:

  1. I got sick of Evolution hanging with some image email, and requiring a restart a couple of times a day.

  2. I wanted beagle to index my .doc files properly. (Actually, I needed this, the amount of time I spend trying to find files is unbelievable).

  3. I needed to deal with OpenOffice misbehaving with some spreadsheet inserts on a document (for once I was working with someone who wanted odt rather than doc ... roll on the revolution). An upgrade was required; this was the clincher as to why I did it then and there.

  4. Ever since I got my laptop it would never reliably find a physical ethernet at boot time, I often had to do an ifup eth0 afterwards (I think this was a bit of misconfigured networking by Emperor Linux, but I never got to the bottom of it).

  5. konqueror was crashing on Eli's website. Always.

  6. I was hoping that akregator might behave better with atom feeds.

All but the last of these got fixed. I'm pretty happy, but there were some wrinkles:

  1. Evolution email didn't get indexed by beagle. I've fixed this by importing my mail back to kmail. By the way, that isn't as trivial as it ought to be. Evolution mail directories include .cmeta (etc.) files, which the kmail importer hangs on. I had to follow advice on how to fix that. Basically:

    • You need to run this code. Then do the import. But beware, it imports them all as unread, so if you did have some genuinely unread stuff, then you won't be able to identify it afterwards from kmail (although it's still there in Evolution).

  2. I have my own python in /usr/local, and it interacts badly with the system python when /usr/local is mounted. See this bug report.

  3. I'm not really convinced beagle is getting everything, especially in the mail.

  4. The hotkeys on my Lenovo Thinkpad T60P used to work (thanks to Emperor Linux plus a wee piece of my own bespoke python). I'll have to get around to them.
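
On the Evolution-to-kmail import in the first wrinkle: the code referred to there is not reproduced here. Purely as an illustration of the general idea, and not the script mentioned above, the sketch below copies a mail directory tree while leaving out metadata files, so that an importer only sees the mail files themselves. The function name and the default suffix list are assumptions; only .cmeta is named in the text, so extend the list to suit.

```python
import shutil
from pathlib import Path

# Assumption: only .cmeta is known from the text; add other suffixes as needed.
SKIP_SUFFIXES = (".cmeta",)


def copy_mail_without_metadata(src, dst, skip_suffixes=SKIP_SUFFIXES):
    """Copy every file under src to dst, skipping names with a metadata suffix."""
    src, dst = Path(src), Path(dst)
    copied = []
    for path in sorted(src.rglob("*")):
        if not path.is_file() or path.name.endswith(tuple(skip_suffixes)):
            continue
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)  # copy2 also preserves timestamps
        copied.append(target)
    return copied
```

Working on a copy rather than in place means the original Evolution directories stay untouched if the import goes wrong.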

I'm very impressed with

  • The new network manager

    • But, it really ought to come with the PPTP code by default, and it ought to work without getting both network-manager-pptp and network-manager-gnome, and when you do have the pptp stuff it ought to be integrated with the kde wallet not the gnome keyring.

  • Hibernate and suspend, which work much better with my laptop, and it looks like the power management does too, with around half an hour longer battery life I think ...

The upgrade did take an hour or two after the install to get things nearly back the way I want them, so that was lost time, but I suspect I will make it up this week on file finding alone.

As usual I did it on a separate partition, so I can always go back ... I only wish I could work out how to backup from one partition to another, and upgrade that other partition, rather than install on it.

by Bryan Lawrence : 2007/05/08 : Categories ubuntu : 0 trackbacks : 0 comments (permalink)

Interoperability is just over the horizon ... always

I'm not a GIS person, yet I've invested quite a lot of my own time, and quite a lot of public money, into building tools based around the Geography Markup Language (GML). GML is essentially a toolkit designed to improve interoperability, but it's getting a bit of bad press right now, both in blogs (e.g.) and mailing lists (e.g. this thread).

I probably wouldn't care, but Sean Gillies, who seems to have quite a few clues (and who provided me with the pointer to the blog link above), seems to agree. So I want to engage in this discussion, but before I do so, I want to digress.

I spend a lot of time arguing about how difficult it is to do data citation and publication. One of the key points that one needs to keep on restating is that we have a shared understanding of what books, articles, chapters and pages are, and what those terms mean. We have no such understanding for data. Data is about the real world.

Ok: back to the main point. Interoperability, in the GIS sense, is actually the same problem. Actually, it's the same point in every sense, not just the GIS sense, but we'll stay focussed here :-). One of the best diagrams I have seen to make this point is in ISO19109 (Geographic information - Rules for Application Schema), and it's encapsulated in one figure, which looks something like this:

Image: static/2007/05/02/ISO19109-Methodology.jpg

The key point is that everyone, doing any coding, does something like that. We start off by modelling (on paper, in head, in UML ... whatever) some of the things about the real world that we care about.

We then move from that abstract model of the world to building descriptions of the key features of the real world in some "descriptive language". We give those descriptions names, like "schema" or "standards" (or even "RFCs"), and we use a variety of technologies to do that.

Then we take data and we populate instances of those "schema".

Really good architects/programmers find really simple ways of doing the process, so the entire effort is streamlined, and the resulting objects and instances are easily understood by the "Community of Interest".

Now let time pass.

Two communities want to talk to each other, and exchange data. All that simplification is lost. They have to work their way back up the tree (probably to the real world level), and come back down until they share the same "descriptive language", and then data objects described using the descriptive language can be shared.

On the way we can write a new descriptive language every time (for every pair of new communities), or we can try and design an abstract descriptive language that allows one to avoid that step every time someone wants to interoperate. Doing the latter introduces considerable complexity, and that makes the job of solving problems for one's own little community harder ... every time. But, and here is the big BUT, unless you know you will never want to share your data, then you've just moved that complexity until later, you haven't undone it. If you are in a business, that's just fine, this year's profit is all that matters, but if you have longer time horizons, then solving the tiny problem isn't necessarily optimal.
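
The trade-off can be made concrete with a little counting. As a minimal sketch, assume every pair of communities that wants to exchange data needs its own bespoke descriptive language, whereas with a shared language each community needs just one mapping onto it. Both function names are hypothetical, and the model deliberately ignores the extra complexity of the shared language itself.

```python
def pairwise_agreements(n_communities: int) -> int:
    """Number of distinct pairs, each needing its own descriptive language."""
    return n_communities * (n_communities - 1) // 2


def shared_language_mappings(n_communities: int) -> int:
    """One mapping per community onto a single shared descriptive language."""
    return n_communities


for n in (2, 5, 20):
    print(n, pairwise_agreements(n), shared_language_mappings(n))
```

For two communities the bespoke route is cheaper (one agreement against two mappings); by twenty communities it is 190 against 20, which is the "over the horizon" cost.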

Which brings me back to my data citation example. We can't share anything, until we have a shared understanding of what it is (1), and a shared way of describing it (2). Right now, flaws and all, UML and GML are the best thing going for that. They're absolutely not the best thing going for solving any specific problem, and not even close to the best thing for most of the use cases the GeoRSS community want to address, but, if you want to do interoperability, it's not only today's problem you need to think about: what's over the horizon matters!

So, with that in mind, let's go back to Charlie Savage's argument. There's a lot of good stuff there, but I think he draws the wrong conclusions.

Thus the real problem GML tries to solve is how can your computer system and my computer system exchange data about the world in a meaningful way? In my opinion that's an unsolvable problem, because the way your database models the world is different than mine.

So I agree except for the unsolvable bit ... which we'll come back to.

He ends up with

In my view, the fundamental premise of GML is wrong. The ability to create custom data models is an anti-feature that makes integration between different computer systems impossible because it assumes that those systems can actually understand the data. Computer systems have no such intelligence - they only understand what someone has programmed them to understand.

Which is so nearly right, except for the impossible and anti bits.

I think the heart of the problem is in the expectation of what GML gives you. What it absolutely doesn't give you automatically is code to manipulate someone else's data objects. What it does give you is a descriptive language you can both use to describe them, and you absolutely have to spend real programming time exploiting the fact you have a common language (so now it's solvable and possible!) No, my WFS client may not understand your Feature Types ... yet ... but I could make it do so, and I can do that without inventing or learning a new paradigm. That's interoperability, but it's a "strong-typing/loose-coupling" sort of interoperability.

Of course it's not the only sort of interoperability that matters. Web 2.0 and REST and GeoRSS and all that stuff is good, it's really good, but it's not the whole story. I wish folk wouldn't keep on arguing that just because some technology doesn't solve their use case it's flawed!

Of course I can give you chapter and verse on why GML sucks, but that's another story; it sucks less than some other options :-)

by Bryan Lawrence : 2007/05/02 : Categories ndg : 2 trackbacks : 6 comments (permalink)

DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.