Bryan's Blog 2010/10
Tales from the EA linux coal front
Infrequently I get to do some real work (i.e. interesting work), but it's so infrequent that every time I do, my "environment" has moved on. In this case, I wanted to produce a simple model extending a couple of ISO classes to provide a different view of how we might describe a climate model. A few minutes' work, I thought ... but it took me half a day, on and off, to get myself into a position to do that few minutes' work. So this is by way of saving someone else the bother of working out how (and one of my colleagues at today's metafor meeting needs this info too).
If you use Enterprise Architect with subversion on a Linux system based on Ubuntu 10.04 or later, then this might help:
Cross Over Office 9.1 (and subsequently 9.2)
(NB: if your menus disappear during the upgrade, log out and log in again, and they might reappear - mine did.)
Linux Mint 9 (based on Ubuntu 10.04)
Enterprise Architect 8.0
Trying to use HollowWorld
(Remember to use both cmd.exe.so and (still) svngate.exe.so; and note that when EA asks you for the path to your svn exe, you won't see it as a choice in your directory because the file type is hardwired: just type in svngate.exe.so yourself! If you don't use cmd.exe.so I think it still works, but no useful error messages come back from svn.)
When you try and add the package version control on a model, you get the following error:
~/cxoffice/lib/libxml2.so.2: no version information available (required by /usr/lib/libneon-gnutls.so.27)
In ~/cxoffice/lib, link your libxml2.so.2 to the version in /usr/lib!
(It appears that the libneon stuff that handles secure access to https is newer in the latest Ubuntu releases and depends on stuff in a version of libxml2 which is newer than that shipped with CrossOver.)
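If you would rather script the relinking than do it by hand, here is a minimal sketch in Python. The function name relink_bundled_library is my own invention for illustration, and the paths in the trailing comment are assumptions taken from the error message above; check where your system copy of libxml2 actually lives before running anything like this.

```python
from pathlib import Path


def relink_bundled_library(bundled: Path, system: Path) -> Path:
    """Replace a bundled library with a symlink to the system copy.

    The bundled file is renamed with a ".orig" suffix rather than deleted,
    so the change can be undone by hand. Returns the path of the backup.
    """
    if not system.exists():
        raise FileNotFoundError(f"system library not found: {system}")
    backup = bundled.with_name(bundled.name + ".orig")
    # Move the old file (or old symlink) out of the way before linking.
    if bundled.exists() or bundled.is_symlink():
        bundled.rename(backup)
    bundled.symlink_to(system)
    return backup


# Assumed usage (paths are guesses based on the error message, adjust to suit):
# relink_bundled_library(
#     Path.home() / "cxoffice" / "lib" / "libxml2.so.2",
#     Path("/usr/lib/libxml2.so.2"),
# )
```

Keeping the original as libxml2.so.2.orig means you can put things back if a later CrossOver upgrade ships a new enough library.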
Instances and Properties
Someone else will probably find a better way to express this, and give me the proper language, but meanwhile, I wanted to expose something that I think is important about my meta-model of using domain modelling to produce descriptions of things, and indexes of those descriptions. It came up over lunch time at a metafor coding sprint, and someone asked me to write it down, so here it is.
These concepts should be, I think, independent of the actual modelling tool you use to encode these concepts (UML, OWL or whatever), although you would want to encode the concepts using those tools.
Most readers will recognize this as essentially repeating some of the General Feature Model of ISO19109 ... with, I think, a few additions, and possibly exposing the contradiction between that and ISO19110 (the feature catalog description ... and thanks to Andrew Woolf for bringing the latter to my attention.)
There are a few simple things in this world:
Features which are instances of Feature Types, and which have identity:
e.g. I am an instance of the Feature Type Humans. You can use my openid address to unambiguously point to me.
Feature Types are classes of objects in the real world.
Feature Types have properties, which characterize those Feature Types.
e.g. I have brown eyes.
Properties are attributes of Feature Types which are not interesting in their own right, although they are well defined, and are very useful.
e.g. if you do a search on humans with brown eyes, you are interested in returning instances of human, not instances of brown. (You might be interested in doing a search on colours of eyes in humans and get brown back, but in that case, you are interested in getting members of a vocabulary back, not instances of real world things.)
(NB: In ISO19109 such properties are composed into their parent feature descriptions, which is often not very helpful, since it implies that if I use colour as an attribute of human, it might not carry the same connotation as colour when used as an attribute of a dog. This "error" causes problems with serialization to RDF from an ISO19109 compliant domain model, since one can't necessarily assume that they are the same concept without some extra baggage, which essentially associates the properties with other entities. This extra baggage is implicit in ISO19110, which allows one to store properties independent of feature types.)
Features have associations with other features, and may be composed or aggregated out of other features.
e.g. My family is a specific aggregation of humans.
We also have to deal with a specific class of features: those with well defined life cycles
e.g. I was born, and will die. Someone is responsible for maintaining the description of me which carries my openid. (That might be me, but for inanimate objects, it'll be a specific person.)
The implication of this is that there is an authoritative description of me, there are authoritative descriptions of my family, and both are bounded by a specific composition of features and set of properties. (In metafor, these compositions are called "documents", because they themselves are important entities, with life cycles.)
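To make the distinctions above concrete, here is a minimal sketch in Python. It is only an illustration of the ideas, not an encoding of ISO19109 or of the metafor model: the class and function names, the example.org identifiers and the second person ("alex") are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FeatureType:
    """A class of real-world objects, e.g. Human."""
    name: str


@dataclass
class Feature:
    """An instance of a FeatureType, with its own identity."""
    identifier: str                                  # e.g. an OpenID URL
    feature_type: FeatureType
    properties: dict = field(default_factory=dict)   # property name -> vocabulary term
    members: list = field(default_factory=list)      # features aggregated into this one


def find_features(features, feature_type, **wanted):
    """Return the features of one type whose properties match `wanted`."""
    return [
        f for f in features
        if f.feature_type == feature_type
        and all(f.properties.get(k) == v for k, v in wanted.items())
    ]


def property_values(features, feature_type, name):
    """Return the vocabulary terms in use for one property (terms, not features)."""
    return sorted({
        f.properties[name] for f in features
        if f.feature_type == feature_type and name in f.properties
    })


HUMAN = FeatureType("Human")
FAMILY = FeatureType("Family")

# Hypothetical instances; the identifiers are placeholders.
bryan = Feature("https://example.org/openid/bryan", HUMAN, {"eye_colour": "brown"})
alex = Feature("https://example.org/openid/alex", HUMAN, {"eye_colour": "blue"})
family = Feature("https://example.org/family/1", FAMILY, members=[bryan, alex])

population = [bryan, alex, family]
brown_eyed = find_features(population, HUMAN, eye_colour="brown")   # instances of Human
colours = property_values(population, HUMAN, "eye_colour")          # vocabulary members
```

The two query functions mirror the two kinds of search described above: the first hands back feature instances, the second hands back members of a vocabulary.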
Why is this important? Well, in the context of building software tools which themselves describe software tools (e.g. climate models), we need to distinguish between these concepts (e.g. the models, and the language they are written in).
This matters, because we want to serialize these descriptions in both RDF triples and XML documents for good reasons: the former yields excellent methods of indexing documents, the latter, excellent technology for document exchange (and differencing).
(Yes, yes, I know one can compose RDF into things with the properties of documents via OWL, but XML is more useful for handing to a GUI etc, and we can talk about OWL another day.)
When we've got this in place, we want to use our RDF to index things, but in general we are interested (from a software perspective) in creating, managing, displaying, and differencing our feature instances, not their properties (humans consume those). So, our domain model - however serialized - needs to distinguish between all these things.
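As a rough illustration of the two serializations, here is a sketch using only the Python standard library. The triples are plain tuples rather than anything from a real RDF toolkit, and the function names, the "rdf:type" predicate string and the urn:example identifier are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def to_triples(identifier, type_name, properties):
    """Flatten one feature description into (subject, predicate, object) tuples."""
    triples = [(identifier, "rdf:type", type_name)]
    triples += [(identifier, name, value) for name, value in sorted(properties.items())]
    return triples


def to_xml(identifier, type_name, properties):
    """Render the same description as a small XML document (as a string)."""
    root = ET.Element(type_name, {"id": identifier})
    for name, value in sorted(properties.items()):
        ET.SubElement(root, name).text = value
    return ET.tostring(root, encoding="unicode")


def subjects_with(triples, predicate, obj):
    """Use the triples as an index: which features have this property value?"""
    return sorted({s for s, p, o in triples if p == predicate and o == obj})


description = ("urn:example:bryan", "Human", {"eye_colour": "brown"})
triples = to_triples(*description)
document = to_xml(*description)
matches = subjects_with(triples, "eye_colour", "brown")
```

The index lookup returns feature identifiers, which is the point: the triples help us find things, while the XML document is what gets created, managed, displayed and differenced.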
The choice is python
This longer piece summarises my thinking as to what language folks like ourselves should use to develop new data processing (including manipulation and visualisation) tools. The context is clearly that we have data producers and data consumers - who are not the same communities - both of whom ideally would use the same toolset. As scientists they need to be able to unpick the internals and be sure they trust them, but they'll also be lazy; once trusted, tools need to be simultaneously easy and extensible. Ideally, of course, one wants to develop toolsets that the community will start to own themselves, so that the ongoing maintenance and development doesn't become an unwelcome burden (even as we might invest ourselves in ongoing support, we want that support to be manageable, and even better, we might want collaborators to take some of that on too)! The bottom line is that I think there are two players, Python and Matlab, with R and IDL as also-rans, and that for me, Python is the clear winner - especially since, with the right kind of library structure, users can mix and match between R, Python and IDL.
For nearly a decade now, the BADC has been mainly a Python shop, even as much, but not all, of the NERC climate community has been exploiting IDL. The motivation for that has been my contention that:
Python is easy to learn (particularly on one's own using a book supplemented by the web) - and that's important when we are mostly hiring scientists who we want to code, not software engineers to do science,
The Python syntax is conducive to writing "easier to maintain" code (although obviously it's possible to write obscure code in Python, the syntax, at least, promotes easier-to-read code).
Python can be deployed at all levels: from interaction with the system, building workflow, scientific processing and visualisation, and for web services (both backend services and front end GUIs via tools like Django and Pylons). In principle that means staff should be more flexible in what they can do (both in terms of their day jobs and in backing up others) without learning a plethora of languages.
Of course, one might make arguments like those about other languages, and folks do, but mostly I get arguments about two particular languages:
IDL - which is obviously familiar to many (but far from all) of both our data suppliers and consumers, and
Java - particularly given the Unidata toolsets, and because some of my software engineers complain about various (arcane) aspects of Python.
We'll get to the IDL arguments below, but w.r.t. Java: it's not really a contender, it's simply not suitable as a general purpose language in our environment. It's too verbose, it requires too much "expertise", and it's a nightmare to maintain. Some supporting arguments for that position are here (10 minute video) and here (interesting blog article).
In the remainder of this piece, I introduce some context: some results from a recent user survey at the BADC, and a quick (and incomplete) survey of what is taught in a few UK university physics departments - with a few ad hoc and non-attributable comments from someone involved with a much wider group of UK physics departments.
I'll then report on a few experiences in the BADC, before summarising with my conclusions - which of course are both overtly subjective and come with considerable input bias.
Context: User Surveys
(This section is based on material collected and analysed by my colleague: Graham Parton.)
We surveyed our users and asked them about their proficiency with a variety of programming languages/packages: the basic results are depicted in this bar chart:
The results are from around 280 responses (red means: geek level; orange: happy to use it; yellow: use it on and off; green: aware of, but not used lately; and blues: complete mystery or no response).
If we look at this, we see that
The common scripting languages (Perl and Python) are not that commonly used by our community (but active Python usage is more prevalent than Perl and we can ignore TCL/Tk).
Of the high level programming languages (Fortran, Java, C and friends), Fortran is the team leader (as you might expect for our community).
The big packages (Matlab, IDL, R) rank in that order (but note that R is more commonly used than Python).
GrADS has usage comparable to R and Python, but Ferret isn't much in use in our community.
Excel and MS friends are common (but so is the influenza, and neither can do big data processing tasks).
If we split all the responses into those from our "target" community (users who claimed to be atmospheric or climate related - roughly half of the total responses):
we find broadly similar results, except that IDL is marginally stronger than Matlab (at least as far as the usage goes - even if there are still more folk who are aware of Matlab). However, IDL still only hits half the audience!!!
Context: University Undergraduate teaching
Obviously most of the folks who use our data do so in postgraduate or other environments, and at least for NCAS, most of those will have IDL in the vicinity, if not on their desktop. However, what skills do they enter with?
As a proxy for entry level into our community, we (ok, Graham Parton again) did a quick survey as to what programming is taught in physics departments at Russell group universities (why physics, why Russell group? Physics: graduates who are more likely to go under the hood ... we'll get back to that ... and Russell: a small number of identified universities which we might a priori assume to have high quality courses).
The results that we could get in an afternoon are here:
(Key: Red: integrated courses. Green: taught, Orange: accepted but not taught, P: project work, 1/2/3: year in which it is taught, if known). (We asked about some other languages too, but these are the main responses.)
What we find is that most of them offer programming courses to some level as an introduction to computational physics. There has been a move away from FORTRAN as the language of choice to other languages such as C++ and Python. Southampton, Cardiff and Nottingham have concentrated on one language that is integrated into wider course material (Matlab in Nottingham, and Python in Cardiff and Sheffield). These three universities have chosen a single language to avoid confusion with others, aiming for fluency in programming that can later be translated to other languages, as opposed to exposure to many languages. Oxford, on the other hand, is a notable exception where a wide number of languages are introduced in years 1 and 2. Imperial is reviewing programming provision and there is a strong lobby for Matlab within their department.
Most departments reported using a wide range of programming languages/packages (e.g FORTRAN, C++, IDL, Matlab) depending on what was the predominant processing package in the research group/field, e.g. IDL for astronomy, C++ for particle physics.
Overall, it appears that a ranking of programming language provision would be:
Off-the-cuff comments from a member of the Institute of Physics, who was asked whether they had any insight into the provision of programming languages in a wider group of physics departments, suggest these results aren't unique to the Russell group departments (but also that Python, having been off the radar, is increasing rapidly). That person had not heard of IDL (which is mostly used in research departments, and then mainly in astrophysics/solar-terrestrial/astronomy and atmospheric physics).
(Common feedback on why Matlab was chosen indicated that one of the drivers was the relatively pain-free path from programming to getting decent graphics at the other end.)
At this point we need to focus down to some contenders. What should an organisation like ourselves, or even the Met Office for example, consider for their main "toolset" development language? Clearly on the table we have Matlab and Python (given the results above).
Given the importance of statistics to our field, and the fact that R is in relatively common usage and has an outlet for publishing code we should also keep it in the mix. However, if using R libraries is important, we can do that from Python ... and it's not a natural language for complex workflow development, so we'll park R in the "useful addendum to python" corner ... (that said, for a class of problems, we have used, and continue to use, R in production services at the BADC.)
What about IDL then? Well, clearly it's useful, and clearly folks will use it for a long time to come. However, most ordinary IDL users are likely to be able to read Python very easily - even if they have never seen Python before. For a time we used to give candidates for jobs at the BADC a bit of Python code and ask them to explain what it did, and we only did that to folk who hadn't seen Python before. We had intended it as a discriminator of folks' ability to interpret something they hadn't seen before, but in most cases they just "got it right". We obviously needed something a bit more complicated (in which case the more obscure Python syntax might have got in the way), but as it was, what we learned from that exercise was mostly that "Python is easy to read"!
What about writing IDL? Well, yes, it's relatively straightforward, but it's not a great language for maintaining code in, and it's commercial (and not cheap!). The IDL community of use is rather limited in comparison to Python's - and you can call Python from IDL anyway. So if you really wanted to stay with IDL, but also wanted "my new toolset", then (if we wrote it properly) you could call it from IDL. (In this context, it's worth noting that calling C and Fortran from Python is apparently much easier than doing so from IDL.)
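For a flavour of what calling C from Python looks like, here is a minimal sketch using the standard library's ctypes module to call strlen from the C library. It assumes a POSIX system (Linux or similar); it says nothing about how the equivalent would look in IDL, and Fortran routines would typically go through a wrapper tool instead.

```python
import ctypes
import ctypes.util

# Locate the C standard library. find_library may return None on some systems,
# in which case CDLL(None) exposes the symbols already loaded into the
# running process (this fallback is POSIX only).
libc = ctypes.CDLL(ctypes.util.find_library("c"))

# Declare the C signature: size_t strlen(const char *s);
libc.strlen.argtypes = [ctypes.c_char_p]
libc.strlen.restype = ctypes.c_size_t

# Call the C function with a bytes object, which ctypes passes as a char*.
length = libc.strlen(b"climate")
```

Declaring argtypes and restype is optional for simple cases, but it lets ctypes check the arguments rather than guessing.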
There is clearly a lot of momentum:
Folk are moving from IDL to Python, and there are some pretty coherent analyses of why one might use Python in comparison to IDL (e.g. here).
There are also lots of web pages which provide information for folk migrating to Python from IDL (example).
We've seen that I believe Python is easy to learn, and that at least two UK departments have built their courses around it. But what about the wider community?
A number of computer science departments are now teaching Python as their first programming language as well (S. Easterbrook in private conversation).
Clear Climate Code are using Python of course!
Which leaves us with Matlab. In truth, I don't know that much about Matlab. My feeling is that the big advantage of Python over Matlab is the integration with all the other bits and pieces one wants as soon as a workflow gets sufficiently interesting (GUIs, databases, XML parsers, other people's libraries etc), and the easy extensibility. You can use R from Python. You can even use the NCAR graphics library from Python (via PyNGL, even if some are curmudgeonly about the interface).
The other thing that I believe to be a killer reason for using Python: proper support for unit testing: if we could inculcate testing into the scientific development workflow, I, for one, believe a lot of time would be saved in scientific coding. I might even rest happier about many of the results in the literature.
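By way of illustration, here is a minimal sketch of what testing a small piece of scientific code looks like with the standard library's unittest module. The anomalies function and its tests are hypothetical, made up for this example rather than taken from any BADC code.

```python
import unittest


def anomalies(values):
    """Return each value minus the mean of the series."""
    if not values:
        raise ValueError("need at least one value")
    mean = sum(values) / len(values)
    return [v - mean for v in values]


class AnomaliesTest(unittest.TestCase):
    def test_anomalies_sum_to_zero(self):
        result = anomalies([14.1, 14.3, 14.8])
        # Floating point arithmetic: compare approximately, not exactly.
        self.assertAlmostEqual(sum(result), 0.0)

    def test_constant_series_has_no_anomaly(self):
        self.assertEqual(anomalies([2.0, 2.0, 2.0]), [0.0, 0.0, 0.0])

    def test_empty_series_is_rejected(self):
        with self.assertRaises(ValueError):
            anomalies([])


def run_tests():
    """Run the test case programmatically and return the result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AnomaliesTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Saved to a file, the same tests can also be run from the command line with python -m unittest, which is about as low a barrier to entry as one could ask for.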
The Bottom Line
So, I'm still convinced that the community should migrate away from IDL to Python, and the way to do that is to build a library that can be called from IDL, but is in native Python.
I appreciate that there may be some resistance to this, particularly from those scientists who like to look under the hood and understand and extend library functions. Some of those scientists are very familiar with IDL - but my gut feeling is that those are also the very same ones who, if they spent an afternoon familiarising themselves with Python, would find they can go faster and further with Python. (Many of those folks are going to have been physicists, which was why I started by looking at what Physics courses have been up to.) My suspicion is that those who don't look under the hood won't care, provided it's easy to use and well documented. Python helps with the latter too, with documentation utilities vastly superior to anything available in the IDL (and, I suspect, Matlab) space.
So, after all that: the choice is (still) Python!
NB: I will update this entry over time if folk give me useful feedback.
Integrating data handling libraries
Earlier this week I was talking with one of my colleagues about the difference between NetCDF3, NetCDF4, HDF etc ... and also about my hope that there is a future where projects stop inventing bespoke data formats AND exploit existing conventions as to how to use existing formats.
In the conversation, I realised I wanted some diagrams, and that I'd probably want them again. So here they are.
In this first schematic, I'm
Explaining the relationship between the CF conventions for NetCDF, the NetCDF3 classic interface, NetCDF4 and HDF (which is essentially about layers, exploiting and constraining how they use lower layers).
Pointing out that right now, CF is expanding (hopefully) into the aircraft, satellite, and station arenas (but admitting that we have a problem in that it is nobody's day job to work on this integration, to make proposals, write examples etc).
Pointing out that from what I understand about EUFAR, they may be bypassing the opportunity to work through "the CF stack". While this might make sense for those who directly work with the instruments, it makes life difficult for other communities. (This argument applies to others as well ...)
In the second schematic, I'm trying to show that the potential overlap between these (currently) competing aircraft formats and the existing CF initiatives could be quite considerable, and that I hope folk will work on convergence rather than divergence.
Which brings me to the last schematic: I'd really like CF in the future to consist of the core conventions that exploit NetCDF in a common way, with clearly defined toolboxes that data producers and consumers can rely upon to create and consume the key variants of data of interest to a bunch of overlapping communities. It may be that in that world CF can't be completely implementable in NetCDF3 ...