Bryan Lawrence : Bryan's Blog 2008/01

Bryan Lawrence

... personal wiki, blog and notes

Bryan's Blog 2008/01

Moving Modelling Forward ... in small steps

I'm in the midst of a series of "interesting" meetings about technology, modelling, computing, and collaboration ... Confucian times indeed.

Last week, we had a meeting to try and elaborate on the short and medium-term NERC strategy for informatics and data. For some reason, NERC uses the phrase "informatics" to mean "model development" (it ought to be more inclusive of other activities, and perhaps it is, but it's not obvious that all involved think that way). As it happens, we didn't spend much time discussing data, in part because from the point of view of the research programme in technology, the main issue at the moment is to improve the national capability in that area (i.e. through improvements and extensions to the NERC DataGrid and other similar programmes).

Anyway, in terms of "informatics" strategy we came up with three goals:

  • In terms of general informatics, to avoid losing the impetus given to environmental informatics by the e-Science programme,

  • To try and increase the number of smart folk in our community who are capable of both leading and carrying out "numerically-rich" research programmes (i.e. more people who can carry our model development forward). We thought an initial approach of more graduate students in this area followed by a targeted programme might make a big difference.

  • To try and identify some criteria by which we could evaluate improvement in model codes (in particular, if we want adaptive meshes etc, which ones, and how should we decide?). (Michael you ought to like that one :-)

This was in the context of trying to ensure that NERC improves the flexibility and agility (and performance) of its modelling framework so it can start to answer interesting questions about regional climate change. Doing so will undoubtedly stretch our existing modelling paradigms, particularly as we try and take advantage of new computer hardware.

During the meeting we all had our list of issues contributing to the discussion. This was my list of things to concentrate on:

  • Improving our high resolution modelling (learning from and exploiting HIGEM).

  • Improving our (the UK research community outside the Met Office) ability to contribute to AR5 simulations.

  • Improving our ability to work with international projects like Earth System Grid (data handling) and PRISM (model coupling). (We - the UK - are involved with both, but not enough).

  • Data handling for irregular grids.

  • Model metadata (a la NumSim, PRISM, METAFOR).

  • Future Computing Issues in general, but in particular:

    • Massive parallelism on chip ... where we might expect memory issues: "Shared memory systems simply won't survive the exponential rise in core counts." (steve dekorte via Patrick Logan.)

    • Better dynamic cores

    • Better use of cluster grids and university supercomputing (not just the national services, will require much more portable code than we have now, and not a little validation of the models on each and every new architecture).

      • i.e. better coding standards ...

    • Better ensemble management and error reporting (Michael's bad experience is not dissimilar to folk here with the Unified Model).

    • Learning the lessons of the GENIE project(s).

    • Handling massive increases in data volumes.

      • With consequential issues for transport and archival

      • and the requirement to better exploit server-side data services

    • Much better model componentisation and coupler(s).

And like everyone else, I wanted to know where are the smart folk to do all this?

Then today, we had an initial discussion about procuring a new computing resource with the Met Office (which, by the way, doesn't preclude our involvement in other national computing services, far from it). There isn't much I can say about this discussion, as much of it was in confidence, but suffice to say, it was all about how we can exploit a shared system on which we would be running the met office models for joint programmes ... of course it's that very same model which most certainly needs a technology refresh :-)

On Friday, we'll be discussing the new NERC-Met Office joint climate research programme ... (which will be one of the programmes exploiting the new system).

by Bryan Lawrence : 2008/01/29 : Categories climate : 0 trackbacks : 0 comments (permalink)

Using more computer power, revisited.

In the comments to my post on why climate modelling is so hard, Michael Tobis made a few points that need a more elaborate response (in time and text) than was appropriate for the comments section, so this is my attempt to deal with them. But before I do, let me reiterate that I don't disagree that there are substantial things that could and should be done to improve the way we do climate modelling. Where the contention lies may be in our expectations of what improvements we might reasonably expect, and hence perhaps in our differing definitions of what might be impressive further progress.

Before I get into the details of my response, I'm going to ask you to read an old post of mine. Way back in January 2005, I tried to summarise the issues associated with where best to put the effort on improving models: into resolution, ensembles or physics?

Ok, now you've read that. Three years on, would I update that blog entry? Well, I don't think so. I don't think changing the modelling paradigm (coding methods etc.) would change the fundamentals of the time taken to do the integrations, although it might well change our ability to assess changes and improve them; but I've already said I think that's a few percent advantage. So, in practice, we can change the paradigm, but the questions still remain: ensembles, resolution or physics? Where to put the effort?
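The resolution-versus-ensembles trade can be put in rough numbers (this is the standard scaling argument, not figures from that old post): halving the horizontal grid spacing doubles the points in each horizontal direction and, via the timestep stability (CFL) constraint, roughly halves the usable timestep.

```python
# Rough cost of one doubling of horizontal resolution: twice the points
# in each horizontal direction, and (via the CFL limit) roughly half the
# usable timestep -- illustrative scaling only, ignoring vertical levels
# and any change in parameterisation cost.
cost_factor = 2 * 2 * 2   # x-points * y-points * timesteps
print('one resolution doubling ~ %dx the compute' % cost_factor)
print('... or, for the same cost, an %d-member ensemble at the old resolution' % cost_factor)
```

Which is exactly why "where to put the effort" is a real choice: each doubling of resolution eats roughly an order of magnitude of machine that could otherwise have gone on ensembles or physics.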

Ok, now to Michael's points:

Do you think existing codes are validated? In what sense and by what method?

In the models with which I am familiar, I would expect that every code module which can be physically tested against inputs and outputs has been so tested for a reasonable range of inputs. That is to say, someone has used some test cases (not complete; in some cases the complete set of inputs may be a large proportion of the entire domain of all possible model states, i.e. it can't be formally validated!), and tested the output for physical consistency and maybe even conservation of some relevant properties. There is no doubt in my mind that this procedure can be improved by better use of unit testing (Why is that if statement there? What do you expect it to do? Can we produce a unit test?), but in the final analysis, most code modules are physically validated, not computationally or mathematically validated. In most physical parameterisations, I suspect that's simply going to remain the case ...
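To make the kind of unit testing I mean concrete, here is a toy sketch: the scheme, its names, and its numbers are invented for illustration, not taken from any real model, but the tests are of the physical-consistency and conservation kind described above.

```python
def large_scale_rain(qv, qsat, dt, tau=3600.0):
    """Toy condensation scheme: rain out supersaturated vapour over timescale tau."""
    excess = max(qv - qsat, 0.0)
    rain = excess * min(dt / tau, 1.0)
    return qv - rain, rain

def test_conserves_water():
    # whatever leaves the vapour field must reappear as rain
    qv0, qsat, dt = 0.020, 0.015, 1800.0
    qv1, rain = large_scale_rain(qv0, qsat, dt)
    assert abs((qv1 + rain) - qv0) < 1e-12

def test_no_negative_vapour():
    # physical consistency: no negative vapour or rain, for a range of inputs
    for qv0 in (0.0, 0.001, 0.05):
        qv1, rain = large_scale_rain(qv0, 0.015, 1800.0)
        assert qv1 >= 0.0 and rain >= 0.0

test_conserves_water()
test_no_negative_vapour()
```

Note that passing these says nothing about whether the scheme is scientifically any good; it only pins down the computational behaviour, which is precisely the distinction between physical and formal validation being made here.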

Then, the parameterisation has been tested against real cases. Ideally in the same parameter space in which it should have been used. For an example of how I think this should be done, you can see Dean et al, 2007, where we have nudged a climate model to follow real events so we can test a new parameterisation. This example shows the good and bad: the right thing to do, and the limits of how well the parameterisation performed. It's obviously better, but not yet good enough ... there is much opportunity for Kaizen available in climate models, and this sort of procedure is where hard yards need to be won ... (but it clearly isn't a formal validation, and we will find cases where it's broken and needs fixing, but we'll only find those when the model explores that parameter space for us ... we'll come back to that).

(For the record, I think this sort of nudging is really important, which is why I recently had a doctoral student at Oxford working on this. With more time, I'd return to it).

It might be possible to write terser code (maybe by two orders of magnitude, i.e. 10K lines of code instead of 1M lines of code).

While I think this is desirable, I think the parameterisation development and evaluation wouldn't have been much improved (although there is no doubt it would have helped Jonathan, the doctoral student, if the nudging code could have gone into a tidier model).

The value of generalisation and abstraction is unappreciated, and the potential value of systematic explorations of model space is somehow almost invisible, or occasionally pursued in a naive and unsophisticated way.

I don't think that the value is unappreciated. There are two classes of problem: exploring the (input and knob-type) parameters within a parameterisation, and exploring the interaction of the parameterisations (and those knobs). The former we do as well as is practicable, and I certainly don't think the latter is invisible (e.g. Stainforth et al, 2004 from ClimatePrediction.net and Murphy et al, 2004 from the Met Office Hadley Centre QUMP project). You might argue that one or both of those are naive and unsophisticated. I would ask for a concrete example of how else we would do this. Leaving aside the issue of code per se, we are stuck with core plus parameterisations - plural - aren't we?
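For the avoidance of doubt about what a (deliberately naive) perturbed-physics sweep over knobs looks like, here is a sketch; the knob names and the toy "model" are invented for illustration, and a real ensemble replaces `toy_model` with a full integration per member.

```python
# A naive sweep over parameterisation "knobs": every combination of a few
# plausible values per knob, one model run per combination.
import itertools

knobs = {
    'entrainment_rate': [0.5, 1.0, 2.0],   # hypothetical multipliers on defaults
    'ice_fall_speed':   [0.5, 1.0, 2.0],
}

def toy_model(entrainment_rate, ice_fall_speed):
    # stand-in for a full integration: returns a single scalar "skill",
    # best when both knobs sit at their default of 1.0
    return -abs(entrainment_rate - 1.0) - abs(ice_fall_speed - 1.0)

names = sorted(knobs)
runs = [dict(zip(names, values))
        for values in itertools.product(*(knobs[n] for n in names))]
scores = [(toy_model(**run), run) for run in runs]
best = max(scores, key=lambda sr: sr[0])

print('%d-member perturbed-physics ensemble' % len(runs))
print('best knob settings: %s' % best[1])
```

The combinatorics are the point: members grow as (values per knob)^(number of knobs), which is why ClimatePrediction.net needed volunteered CPUs and why "as well as is practicable" is the honest description.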

(if) there is no way forward that is substantially better than what we have ... I think the earth system modeling enterprise has reached a point of diminishing returns where further progress is likely to be unimpressive and expensive ...

I'm not convinced that what we have is so bad. We need to cast the question in terms of: what goals are we going to miss that another approach would allow us to hit?

Which brings us to your point

... If regional predictions cannot be improved, global projections will remain messy,

True.

... time to fold up the tent and move on to doing something else... the existing software base can be cleaned up and better documented, and then the climate modeling enterprise should then be shut down in favor of more productive pursuits.

I think we're a long way from having to do this! There is much that can and will be done from where we are now.

I have very serious doubts about the utility of ESMs built on the principles of CGCMs. We are looking at platforms five or six orders of magnitude more powerful than today's in the foreseeable future. If we simply throw a mess of code that wastes those orders of magnitude on unconstrained degrees of freedom, we will have nothing but a waste of electricity to show for our efforts.

I don't think anyone is planning on wasting the extra computational power, and I think my original blog entry shows at least one community was thinking, and I know (since I'm off to yet another procurement meeting next week) continues to think, very seriously about how to exploit improving computer power.

On what grounds do you think improving the models, and their coupling, will not result in utility?

by Bryan Lawrence : 2008/01/23 : Categories climate : 0 trackbacks : 6 comments (permalink)

Whither service descriptions

(Warning, this is long ...)

Last week I submitted an abstract to the EGU meeting in April in the The Service Oriented Architecture approach for Earth and Space Sciences (ESSI10) session. I'd been asked to submit something, but I fear I may be a bit of a cuckoo in the SOA nest ... (if by SOA, we take a traditional definition of SOA=SOAP+WS-*).

The abstract can be summarised even more briefly in two sentences:

  • there is a long tail of activities for which the ability of web services to open up interoperability is being hindered by the difficulty of both service and data description, so

  • there is a requirement for both more sophisticated and more easily understandable data and service descriptions.

Some might argue that the latter is a heresy. Amusingly, I wrote that before I opened up my feed aggregator this week to find another raft of postings about service description languages. Probably the best, and easily most relevant, from my point of view, was Mark Nottingham, who wrote by far the most sage stuff. I'll quote some of it below. It looks like there was an ongoing discussion that hit the big time when Ryan Tomayko wrote an amusing summary which was picked up by Sam Ruby and Tim Bray.

It's hard to provide a summary that is more succinct than the material in those links and adds value (to me, remember these are my notes, I don't care about you1) so I won't! But I will write this, because I don't particularly want to wade through them all again.

On Resources

The first point to note is that most of the proponents of service description languages (particularly those from a RESTful heritage) are finally realising that it's not just about the verbs, the nouns matter too! It's fine to argue that you don't need a service description language because we should all use REST, but the resources themselves can be far more complicated beasts than standard mime-types, and so they need description too.

Mark Nottingham said it best in his summary:

Coming from the other direction, another RESTful constraint is to have a limited set of media types. This worked really well for the browser Web, and perhaps in time we'll come up with a few document formats that describe the bulk of the data on the planet, as well as what you can do with it.

However, I don't mean "XML" or even "RDF" by "format" in that sentence; those are the easy parts, because they're just meta-formats. The hard part is agreeing upon the semantics of their contents, and judging by the amount of effort it's taken for things like UBL, I'd say we're in for a long wait.

I've wittered on about the importance of this before, again and again. However, there are fundamental problems with using XML to describe resources. I've alluded to this issue too, but along with the summary by James Clark, I liked the way that it was put here:

One of the big problems with XML is that it is a horrid match with modern data structures. You see, it is not that it isn't trivial to figure a way to serialize your data to XML; it is just that left to their own devices, everyone would end up doing it slightly differently. There is no one-true-serialization. So, eventually, you end up having to write code to build your data structures from the XML directly. The problem there is that virtually all XML APIs are horrible for this kind of code. They are all designed from the XML perspective, not from the data serialization perspective.

It gets worse. XML is one of those things that looks really easy, but is actually full of nasty surprises that don't show up until either the week before you ship (or worse ... a few weeks after). Things like character encoding issues, XML Namespaces, XSD Wildcards. It is really hard for your average developer (who makes no pretenses at XML guru-hood) to write good XML serialization/hydration code. Everything is stacked against him: XML APIs, XML-Lang itself, XSD.

At one time I think I understood what "Share schema, not type" meant, but now I don't ...
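The "no one-true-serialization" complaint above fits in a few lines. A minimal sketch (invented data, of course): the same trivial data structure has at least two equally sensible XML encodings, so every producer/consumer pair has to agree out of band.

```python
# One data structure, two perfectly reasonable XML serializations.
# Neither is "the" encoding; a client written against one breaks on the other.
point = {'lat': 51.57, 'lon': -1.32}

as_elements = '<point><lat>%(lat)s</lat><lon>%(lon)s</lon></point>' % point
as_attributes = '<point lat="%(lat)s" lon="%(lon)s"/>' % point

print(as_elements)
print(as_attributes)
```

Both documents are valid, well-formed XML carrying identical information; which is precisely why the hard part is the shared semantics, not the meta-format.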

On the Service Description Language itself

Well, I've tried to review this sort of thing before. Since then, WADL has hit the big time. From a semantic point of view, I can't say I understand the big differences between WSDL and WADL, although I can appreciate that the WADL syntax is much simpler (and so it's a good thing).

Some folk, sadly including Joe Gregorio (whose work I mostly admire), have made a big deal out of the fact that there is no point generating code from WSDL (or WADL or any service description language), because if you do, when the service changes the WSDL (should) change, and so your code will need regeneration or otherwise your client will break. I think that's tosh (it's true, but still tosh; best put by Robert Sayre in a comment):

HTML forms are service declarations.

No one is arguing that we don't need HTML forms! The fact is that clients will break when services change! Sure, some changes won't break the clients, but some will! The issue really comes down to "How well coupled are your services and clients?" and, if they are strongly coupled, "Will you know when the service changes, and can you fix your client if it does?" From experience of using WSDL and SOAP (yuck), I know I'd MUCH rather simply get the new WSDL and regenerate the interface types ... than muck around at a deep level. (That said, I'm not arguing in favour of SOAP per se! Today's war story about SOAP and WSDL is one set of new discovery client developers complaining about our "inconsistent use of camelcase" in our WSDL ... it seems that they're hand-crafting to the WSDL, and they want us to break all the other clients to fit their coding standards.)

Of course, me wanting to use a service description language presupposes I've used my human ability to read the documentation (if it exists, or the WSDL if I really have to), to decide whether such a solution is the "right thing to do".

What does this mean to me?

At the moment we use WSDL and SOAP in our discovery service. I'd much rather we didn't (see above). It could be RESTful POX, which is how we've implemented our vocabulary service (but inconsistent camel case would still break things). It probably will change one day. More importantly, for the data handling services, we're currently using OGC services, where the "service description language" is the GetCapabilities document. One thing I do know (and here I violently disagree with Joe G) is that it would be much easier to use a generic service description language than the hodgepodge of GetCapabilities documents we deal with. I think OGC GetCapabilities is an existence proof that a generic service description language would be a Good Thing (TM)! In the final analysis, that's probably what I'll say in April (as well as "SOAP sucks" and "You need both GML and data modelling").

1: I do really :-) (ret).

by Bryan Lawrence : 2008/01/22 : Categories ndg computing xml : 1 trackback : 6 comments (permalink)

from every direction it's mitigation and acknowledgement

Most mornings now I get between half an hour and an hour of time to myself: between feeding my baby boy, who wakes up around 5.30 to 6 am, and getting my daughter up around 7 to 7.30 am ... I mostly spend the time reading, coding (for pleasure ... I have made some significant progress on the new leonardo), and just cogitating.

This morning it was reading; and it seemed like from every direction we have climate change adaptation and mitigation issues:

  • Paul Ramsey on climate change, peak oil and the deep ocean (I read Paul for his commentary on GIS and Postgres ...)

  • New Scientist noting that we may be near peak coal (full article at author's website) (who needs to explain why they read the new scientist?)

  • The Observer reporting that the Severn Tidal power scheme takes another step towards actuality ... (when in the UK you do have to read a Sunday newspaper, they are the best in the world ...).

  • Joe Gregorio on batteries (Joe writes sage stuff on python, web services and much else).

Now the thing is, I found all of them in my thirty minutes this morning. It's an eclectic bunch of sources, but that's my point! (... and yes, it is unusual for me to read things from the NS, the Observer and my akregator all within thirty minutes ... and none of the readings from the latter were from my environmental folder).

by Bryan Lawrence : 2008/01/21 : Categories environment : 0 trackbacks : 0 comments (permalink)

Why is climate modelling stuck?

Why is climate modelling stuck? Well, I would argue it's not stuck, so a better question might be: "Why is climate modelling so hard?". Michael Tobis is arguing that a modern programming language and new tools will make a big difference. Me, I'm not so sure. I'm with Gavin. So here is my perspective on why it's hard. It is of necessity a bit of an abstract argument ...

  1. We need to start with the modelling process itself. We have a physical system with components within it. Each physical component needs to be developed independently, checked independently ... This is a scientific, then a computational, then a diagnostic problem.

  2. Each component needs to talk to other components, so there needs to be a communication infrastructure which couples components. Michael has criticised ESMF (and by implication PRISM and OASIS etc), but regardless of how you do it, you need a coupling framework. This is a computational problem. I think it's harder than Michael thinks it is. Those ESMF and PRISM folks are not stupid ...

  3. All those independently checked components may behave in different ways when coupled to other components (their interactions are nonlinear). Understanding those interactions takes time. This is a scientific and diagnostic problem.

  4. We need a dynamical core. It needs to be fast, efficient, mass preserving, and stable in a computational sense. Stability is a big problem, given that the various parameterisations will perturb it in ways that are quite instability inducing. This is both a mathematical and a computational problem.

  5. We need to worry about memory. We need to worry a lot about memory actually. If in our discussion we're going to get excited about scalability in multi-core environments, then yes, I can have 80 (pick a number) cores on my chip, but can I have enough memory and memory bandwidth to exploit them? How do we distribute our memory around our cores?

  6. What about I/O bandwidth? Without great care, the really big memory-hungry climate models can often end up spinning empty CPU cycles waiting for I/O. This is a computational problem.

Every time we add a new process, we require more memory. The pinch points change and are very architecture dependent. Every time we change the resolution, nearly every component needs to be re-evaluated. This takes time.
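The memory worry in points 5 and 6 can be made concrete with a back-of-envelope sketch; the grid dimensions, field count, and core count below are invented for illustration, not taken from any real model.

```python
# Back-of-envelope memory for a single model state: an assumed ~25 km
# global grid with 70 levels and ~100 prognostic/diagnostic fields,
# held in double precision.
nlon, nlat, nlev = 1440, 720, 70   # assumed grid dimensions
nvars = 100                        # assumed number of 3-D fields
bytes_per_value = 8                # double precision

state_bytes = nlon * nlat * nlev * nvars * bytes_per_value
print('model state: %.1f GB' % (state_bytes / 2.0**30))

# Spread over 80 cores, each core's share is modest -- but a naive domain
# decomposition adds halo copies, and memory *bandwidth* per core is the
# scarcer resource on a multi-core chip anyway.
ncores = 80
print('per core (ignoring halos): %.2f GB' % (state_bytes / float(ncores) / 2.0**30))
```

And every new process or resolution increase multiplies `nvars` or the grid dimensions, which is exactly why the pinch points move around with architecture.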

At this point, we've not really talked about code per se. All that said, the concepts of software engineering do map onto much of what is (or should be) going on. Yes, scientists should build unit tests for their parameterisations. Yes, there should be system/model wide tests. Yes, task tracking and code control would help. But, every time we change some code there may be ramifications we don't understand, not only in terms of logical (accessible in computer science terms) consequences, but from a scientific point of view, there might be some non-linear (and inherently unpredictable) consequences. Distinguishing the two takes time, and I totally agree that better use of code maintenance tools would improve things, but sadly I think it would be a few percent improvement ... since most of the things I've listed above are not about code per se, they're about the science and the systems.

So, personally, I don't think it's the time taken to write lines of code that makes modelling so hard. Good programmers are productive in anything. I suspect changing to python wouldn't make a huge difference to the model development cycle. That said, anyone who writes diagnostic code in Fortran really ought to go on a time management course: yes, learning a high-level language (python) takes time, but it'll save you more ... the reason being that we write diagnostic code over and over. Core model code isn't written over and over ... even if it's agonised over and over :-)

Someone in one of the threads on this subject mentioned XML. Given that there might be a climate modeller or two reading this, let me assure you: XML solves nothing in this space. XML provides a syntax for encoding something; the hard part of this problem is deciding what to encode. That is, the hard part of the problem is the semantic description of whatever it is you want to encode (and developing an XML language to encapsulate your model of the model: remember, XML is only a toolkit, it's not a solution). If you want to use XML in the coupler, what do you need to describe to couple two (arbitrary) components? If it's the code itself, and you plan to write a code generator, then what is it you want to describe? Is it really that much easier to write a parameterisation for gravity wave drag in a new code generation language? What would you get from having done so?

So what is the way forward? Kaizen: small continuous improvements. Taking small steps we can go a long way ... Better coupling strategies. Better diagnostic systems. Yes: Better coding standards. Yes: more use of code maintenance tools. Yes: Better understanding of software engineering, but even more importantly: better understanding of the science (more good people)! Yes: Couple code changes to task/bug trackers. Yes: formal unit tests. No: Let's not try the cathedral approach. The bazaar has got us a long way ...

(Disclosure: I was an excellent fortran programmer, and a climate modeller. I guess I'm a more than competent python programmer, and I'm sadly expert with XML too. I hope to be a modeller again one day.)

by Bryan Lawrence : 2008/01/16 : Categories climate : 1 trackback : 10 comments (permalink)

Walking the Leonardo File System in Pylons

Now that we have access to the filesystem, the next step to porting is to get a pylons controller set up that can walk the filesystem ...

Start by installing pylons (in this case, 0.9.6):

easy_install Pylons

(watch that capitalisation: don't waste time with easy_install pylons ...)!

Now create the application ... in my case in a special sandbox directory called pylons ... and get a simple controller up ...

cd ~/sandboxes/pylons
paster create --template=pylons pyleo template_engine=genshi
cd pyleo
paster controller main

At this point it's worth fixing something that will cause you grief later: we've decided on genshi, so we need an __init__.py in the pyleo/templates directory. Create one now. It can be empty. We won't need it til later ... but best get it done now!
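From the pyleo application root, creating that empty package marker is a one-liner (the `mkdir -p` is only there so the snippet also runs standalone; in a real paster-generated tree the directory already exists):

```shell
# genshi template rendering needs pyleo/templates to be a python package;
# an empty __init__.py is all that takes.
mkdir -p pyleo/templates
touch pyleo/templates/__init__.py
```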

At this point if we start the paster server up with

paster serve --reload development.ini

We can see a hello world on http://localhost:5000/main and a pylons default page on http://localhost:5000

The next step is to get rid of the pylons default page, and make our main controller handle pretty much everything (we wouldn't normally do this, but we're porting leonardo not starting something new). We do this by replacing the lines after CUSTOM ROUTES HERE in config/routing.py with:

    map.connect('',controller='main')
    map.connect('*url',controller='main')

and removing public/index.html.

Now http://localhost:5000/anything gives us 'Hello World'.

The next step is to get hold of the path and echo it instead of 'Hello World'. We do that by accessing the pylons request object in our main controller, which we have available since in main.py we inherit from the base controller.

So instead of

        return 'Hello World'

we have

        path=request.environ['PATH_INFO']
        return path

And the next step is to pass it to a simple genshi template to echo it. We do this by

  • making a simple template. Here is one (path.html which lives in the pyleo/templates directory):

    <html xmlns="http://www.w3.org/1999/xhtml"
          xmlns:py="http://genshi.edgewall.org/"
          xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
      <body>
        <div class="mainPage">
          <h4> $c.path </h4>
        </div>
      </body>
    </html>
    

  • and call it from main.py after assigning the path value to something in the c object (itself visible to the template). Replace those two lines in main.py that we just replaced before, with:

    	c.path=request.environ['PATH_INFO']
    	return render('path')	
    

And now we're using Genshi to show us our path. The next step is to bring the leonardo file system into play, so we put filesystem.py into the model directory. (As an aside: Pylons is Model-View-Controller. If that doesn't ring any bells, see this).

Note that the MVC stuff is inside a directory embedded one more level than one might expect. That's why we have this weird structure: sandboxes/pylons/pyleo is the source for a new (distributable) egg, and sandboxes/pylons/pyleo/pyleo is where our pylons application (with MVC) lives.

Realistically, next time I do this, I won't be repeating every identical step, so I'm bound to stuff up, and I might need some decent debugging. Enable it by editing the development.ini file so that [logger_root] has level set to DEBUG rather than INFO. Be warned: it results in verbiage on the console!
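For reference, the edit is only to the [logger_root] section of the generated development.ini; something like this (the handlers line is whatever paster generated for you, shown here as the usual console handler):

```ini
[logger_root]
level = DEBUG
handlers = console
```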

Right back to our thread ... In this first stab, we'll use filesystem.py as our model, and we'll simply put in place a view which walks the content and then displays the text for the moment (without a wiki formatter). Nice and straight forward.

We simply modify our existing template: path.html:

<html xmlns="http://www.w3.org/1999/xhtml" xmlns:py="http://genshi.edgewall.org/"
      xmlns:xi="http://www.w3.org/2001/XInclude" lang="en">
<!-- Simple genshi template for walking the leonardo file system -->

<!-- We'll be using the javascript helpers later, so let's make sure we have them -->
${Markup(h.javascript_include_tag(builtins=True))}

<body>
    <div class="mainPage">
    <h4> $c.path </h4>
    <ol py:if="c.files!=[]">
        <li py:for="f in c.files">Page:<a href="${f['relpath']}">${f['title']}</a></li>
    </ol>
    <ol py:if="c.dirs!=[]">
        <li py:for="d in c.dirs">Directory:<a href="${d['relpath']}">${d['title']}</a></li>
    </ol>
    </div>
</body>
</html>

and our main controller. It's all in the main controller for now. We'll change that later. Meanwhile, our main controller now looks like this:

import logging

from pyleo.lib.base import *
from pyleo.model.filesystem import *

log = logging.getLogger(__name__)

class MainController(BaseController):

    def index(self):
        ''' Essentially we're bypassing all the Routes goodness and using this
        main controller to handle most of the Leonardo functionality '''
        c.path=request.environ['PATH_INFO']
        
        #Later we'll move this elsewhere so it doesn't get called every time ...
        self.lfs=LeonardoFileSystem('/home/bnl/sandboxes/pyleo/data/lfs/')
        
        #ok, what have we got?
        dirs,files=self.lfs.get_children(c.path.strip('/'))
        
        if (dirs,files)==([],[]): return self.getPage()
            
        c.dirs=[]
        c.files=[]
        for d in dirs:
            x={}
            x['relpath']=os.path.join(c.path,d)
            x['title']=d
            c.dirs.append(x)
        for f in files:
            x={}
            x['relpath']=os.path.join(c.path,f)
            leof=self.lfs.get(f)
            #print leof.get_properties()
            x['title']=(leof.get_property('page_title') or f)
            c.files.append(x)
        return render('path')

    def getPage(self):
        ''' Return an actual leonardo page '''
        leof=self.lfs.get(c.path)
        c.content=leof.get_content()
        if c.content is None:
            response.status_code=404
            return
        else:
            #This is a leo file instance
            c.content_type=leof.get_content_type()
            
            #for now let's just count these ...
            comments=leof.enclosures('comment')+leof.enclosures('trackback')
            c.ncomments=len(comments)
            
            if c.content_type.startswith('wiki'):
                # for now, just return the text ... without formatting etc ...
                return ''.join([c.content,'\n%s comments'%c.ncomments]+[com.get_content() for com in comments])
            else:
                t={'png':'image/png','jpg':'image/jpg','jpeg':'image/jpeg',
                   'pdf':'application/pdf','xml':'application/xml',
                   'css':'text/css','txt':'text/plain',
                   'htm':'text/html','html':'text/html'}
                if c.content_type in t: 
                    response.headers['Content-Type']=t[c.content_type]
                else: response.headers['Content-Type']='text/plain'
                return c.content


And believe it or not, that's all we need to walk our filesystem and return the contents ... it won't be a big step to add the formatter to get html, and to start thinking about how we want the layout to look in template terms. (We'll also need a config file to locate our lfs, rather than the hard-coded path above.)
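The formatting step can be sketched even before choosing a wiki parser. The wiki_to_html function below is just a stand-in (leonardo's real formatter has its own API); the point is the dispatch-by-content-type shape:

```python
#Sketch of a formatter hook; wiki_to_html is a placeholder, not
#leonardo's real formatter.
def wiki_to_html(text):
    #placeholder: a real formatter would parse wiki markup properly
    return '<p>%s</p>' % text

FORMATTERS = {'wiki': wiki_to_html}

def format_content(content, content_type):
    '''Return formatted output for known types, raw content otherwise.'''
    for prefix, fn in FORMATTERS.items():
        if content_type.startswith(prefix):
            return fn(content)
    return content
```

getPage could then call format_content instead of returning the raw text, without caring which formatter eventually gets plugged in.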

The steps after that will be to add login, upload, posting, trackbacks, feed generation etc ... none of which should be a big deal ... but I don't expect to do them quickly :-)

Update: I'm pleased to report that although this is a fork from the leonardo trunk (as is the django version), all three code stacks are now jointly hosted on google. The pylons code is in the pyleo_trunk, and this is rev 464 (I haven't worked out the subversion revision syntax for the web interface).

by Bryan Lawrence : 2008/01/03 : Categories python pyleo : 0 trackbacks : 0 comments (permalink)

the leonardo file system

The most difficult thing about porting leonardo is interfacing with the leonardo file system (lfs). The lfs was designed to allow multiple backends behind a relatively simple interface ... of course it's not properly documented anywhere, so remembering how it works took a while. The following piece of code shows the general principle:

from filesystem import LeonardoFileSystem
import sys,os.path
def WalkAndReport(leodir,inipath='/'):
    ''' Walks a leonardo filesystem and reports the contents in the same way
    as doing ls -R would do '''

    def walk(lfs,path):
        directories,files=lfs.get_children(path)
        for f in files:
            leof=lfs.get(os.path.join(path,f))
            #The following is the actual content at the path ... if it exists.
            #It's what you would feed to a presentation layer ...    
            content=leof.get_content()
            print '%s (%s)'%(f,leof.get_content_type())
            for p in leof.get_properties(): print '---',p,leof.get_property(p)
            #check for comments and trackbacks ... is there any other sort?
            comments=leof.enclosures('comment')+leof.enclosures('trackback')
            #comments and trackbacks are leo files ...
            for c in comments:
                for p in c.get_properties(): print '------',p,c.get_property(p)
        for d in directories:
            leod=os.path.join(path,d)
            print '*** %s ***  (%s)'%(d,leod)
            walk(lfs,leod)

    lfs=LeonardoFileSystem(leodir)
    walk(lfs,inipath)
if __name__=="__main__":
    lfsroot=sys.argv[1]
    if len(sys.argv)==3:
        inipath=sys.argv[2]
    else: inipath='/'
    WalkAndReport(lfsroot,inipath)


While I'm at it, I'd better document a small bug in the leonardo file system itself which manifested on this blog (python 2.4.3 on Suse 10) but nowhere else: the comments came back in the wrong order. The following diff on filesystem.py fixed that:

    def enclosures(self, enctype):
+        #BNL: modified to reorder by creation date, since we can't
+        #rely on the name or operating system.
         enc_list = []
         for d in os.listdir(self.get_directory_()):
             match = re.match("__(\w+)__(\d+)", d)
             if match and enctype == match.group(1):
                 index = match.group(2)
-                enc_list.append(self.enclosure(enctype, index))
-        return enc_list
+                e=self.enclosure(enctype, index)
+                sort_key=e.get_property('creation_time')
+                enc_list.append((sort_key,e))
+        enc_list.sort()
+        return [i[1] for i in enc_list]
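One caveat with the fix above: sorting (key, object) tuples relies on Python 2's willingness to compare arbitrary objects when keys tie. A key-based sort sidesteps that; here's a minimal sketch, with a stand-in Enclosure class rather than leonardo's real one:

```python
#Stand-in for a leonardo enclosure, just enough to show the sort.
class Enclosure:
    def __init__(self, name, creation_time):
        self.name = name
        self._props = {'creation_time': creation_time}

    def get_property(self, key):
        return self._props.get(key)

def sort_by_creation(enclosures):
    '''Order enclosures by creation_time without comparing the objects.'''
    return sorted(enclosures, key=lambda e: e.get_property('creation_time'))
```

With key= only the creation times are ever compared, so two enclosures created at the same instant can't blow up the sort.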

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 0 comments (permalink)

Playing with pylons and leonardo

I've suddenly been granted a couple of hours I didn't expect, so I thought I'd take the first steps towards forking leonardo (sorry James), so that we have a pylons version. I know James has a Django version, but I want pylons for a number of reasons:

  • I've done a lot of work with pylons. I understand it.

  • I have a number of extensions to James' codebase already (including full trackback support, inbound and outbound). (James' subversion repository broke; we never resolved it, so those changes never got committed back.)

  • I want to build an egg which allows external templating (i.e. you can completely control the look and feel via a genshi template, or use the default within the egg).

  • I want to do all of this so I have a nice small job which exercises and documents all the skills I've built up building the NDG portal (in my spare time) over the last few months.

  • I want to cache documents more efficiently (trivial in pylons).

  • I want to be able to produce archivable versions for previous years (not so trivial). Tim Bray reminded us all that this is important!

I expect it will take months to do what ought to be a few hours of coding :-( I wonder how well this will compare with previous announcements!

I guess we'll need a new name. pyleo will do, in this case for pylons leonardo.

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 2 comments (permalink)

virtualenv

One more thing to remember. I'm going to be building pyleo using pylons 0.9.6.1, but the ndg stuff (also on my laptop) is using pylons 0.9.5. Library incompatibility is scary. Fortunately, we have virtualenv to the rescue.

Using virtualenv, I can build a python instance that is independent of the stuff built into my main python (which is a virtual-python for historical reasons). It's better than virtual-python because I still get the benefit of anything in my system site-packages that I've installed since I created my virtual python.

What to remember? (I really will forget things like this when there are weeks between bursts of activity ... in particular, this way I'll know which python to use!)

Well, I built my new virtualenv instance by typing

python virtualenv.py pyleo

and I can change into it any time I like with

source ~/pyleo/bin/activate
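Once an environment is active, a quick sanity check shows which interpreter is actually running (this is plain Python introspection, nothing virtualenv-specific):

```python
#sys.prefix and sys.executable identify the active interpreter;
#inside a virtualenv, sys.prefix points at the environment's root.
import sys

print(sys.prefix)      #e.g. /home/bnl/pyleo when that environment is active
print(sys.executable)  #full path to the running python binary
```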

I expect I'll be able to ensure I use this python in my (test) webserver when I get to it. It looks like I need to adjust the path to the libraries inside the outer script with

import site
site.addsitedir('/home/bnl/pyleo/lib/python2.5/site-packages')
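To see what addsitedir actually does, here's a throwaway sketch, using a temp directory as a stand-in for the real site-packages path above:

```python
#site.addsitedir appends the directory to sys.path and also processes
#any .pth files it contains -- unlike a plain sys.path.append, which
#would skip the .pth handling.
import site
import sys
import tempfile

extra = tempfile.mkdtemp()   #stand-in for ~/pyleo/lib/python2.5/site-packages
site.addsitedir(extra)
assert extra in sys.path
```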

I used virtualenv 0.9.2.

It looks like I can make ipython respect this by copying /usr/bin/ipython into ~/pyleo/bin and editing it to use ~/pyleo/bin/python ...

by Bryan Lawrence : 2008/01/02 : Categories pyleo python : 0 trackbacks : 0 comments (permalink)


DISCLAIMER: This is a personal blog. Nothing written here reflects an official opinion of my employer or any funding agency.