COAPEC Beowulf cluster ('Lewis')
Documentation for users

SEE ALSO: SYSTEM STATUS INFORMATION

SUPPORT

Obtaining an account

User registration will be performed by the e-Science centre.

To request an account, please contact the e-Science centre (contact information below), with the following information:

Please note that your account will be created with a home directory which is readable by others, although you can change this if you wish.

A list of names associated with the COAPEC programme has been supplied to the e-Science centre by Helen Snaith. Provided that you are already on this list, an account will be set up for you. If you are not, the e-Science centre will not accept your application directly; please contact Helen Snaith instead, who will apply on your behalf if appropriate.

Support contacts

It is hoped in due course to provide a single email address for support issues. For now, please contact one of the following:

Systems support

For system-related support, please contact the e-Science Centre directly. The primary contact is Pete Oliver. In pressing cases where Pete Oliver appears to be absent, please contact Nick Hill.

Modelling support

For support with running the Unified Model or other scientific software on lewis, please contact Alan Iwi.

SYSTEM DOCUMENTATION

Machine specification

Lewis is a Beowulf Cluster consisting of 17 dual-processor 2.4GHz Pentium 4 Xeon machines, of which one is for interactive logins and control and 16 are for batch processing (hereafter, "master node" and "slave nodes"). The slave nodes (but not the master node) are connected with Myrinet 2000 networking, giving fast inter-process communication for parallel programs. The slave nodes each have 512MB memory; the master node has 1GB memory.

The following system software is provided: RedHat Linux with the usual bundled software, Intel Fortran compiler version 7, MPICH library for parallel code, OpenPBS queuing system with Maui job scheduler, and the Totalview debugger. (More specific modelling-related software is described under "Using the Unified Model" below.)

Logging in

For security reasons, all logins to lewis are via ssh.

The hostname is "lewis.esc.rl.ac.uk" (the machine is hosted at the e-Science Centre at Rutherford Lab; see also why "Lewis"?).

To use SSH:

Note that ssh connections described above will connect to the master node. If you get "permission denied" messages trying to connect, then your host probably needs to be authorised; see under "obtaining an account" above.

You should not normally log into the slave nodes interactively because that may adversely affect performance of other people's jobs; if for exceptional reasons you need to log into them, then from the master node you can type "rsh node2" through to "rsh node17".

File transfer

Please note that there is no FTP server running on lewis. The recommended way to copy files from or to lewis is via the secure copy ("scp") utility, which should be distributed with your SSH client.
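
As a minimal sketch of the scp approach described above, the two helper functions below wrap a copy in each direction. The hostname is the one given under "Logging in"; the username "auser" and the function names to_lewis and from_lewis are hypothetical, chosen only for illustration.

```shell
# Illustrative helpers for copying files with scp.
# LEWIS_USER is a placeholder: substitute your own lewis username.
LEWIS_HOST="lewis.esc.rl.ac.uk"
LEWIS_USER="${LEWIS_USER:-auser}"

# Copy a local file to a path on lewis (relative paths are taken
# relative to your home directory there).
to_lewis() {
    scp "$1" "${LEWIS_USER}@${LEWIS_HOST}:$2"
}

# Copy a file from lewis to a local path.
from_lewis() {
    scp "${LEWIS_USER}@${LEWIS_HOST}:$1" "$2"
}
```

For example, "to_lewis start.dump dumps/" would copy a local file into a directory under your home directory on lewis, and "from_lewis um_archive/output.nc ." would fetch a file into the current directory (both filenames are made up).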

Lewis does have FTP clients ftp and ncftp installed, although beware that using them will cause the passwords of your other accounts to be transmitted in cleartext.

Filestore

Your files on lewis are stored in the /home area (with no distinction between "home" and "data" areas). The entire filestore is backed up regularly, and is subject to quotas.

Quotas are applied on a group basis, with each group corresponding to a COAPEC member institution. The filespace quotas for each group are as follows:
Group         Quota
Birmingham    60GB
Liverpool     60GB
Oxford        120GB
RAL           60GB
Reading       300GB
Sheffield     60GB
SOC           300GB
UCL           60GB
UEA           180GB

The remaining unallocated space (~500GB) may be used to provide shared-access data, or alternatively may be used to increase quotas in future.

To show your group's quota and usage, use the command "quota -g". Note that your group cannot create files in excess of this quota, even for a short time; this should ensure that provided your group remains within quota it cannot be inconvenienced by other groups filling the disk.

In the event that your group cannot manage usage within its quota informally, it is also possible to impose filestore quotas on certain users. To request this, your Principal Investigator should contact the e-Science Centre. Please note that user quotas would not replace the group quotas, but merely limit how much of the group allocation can be used by those users.

Jobs may also write to the temporary area /tmp. Please note that every node sees a different /tmp directory, which is local to the given node, and that these are regularly purged of old files.

The queuing system

The following information may be useful, but you can skip this section if you only want a quick introduction: Unified Model jobs can be submitted as described below without needing to know about the queuing system.

Queuing configuration

Lewis is running the OpenPBS queuing software with the MAUI scheduler. These have been configured to allow a number of job execution queues, with the following characteristics:

Jobs are submitted to queues "normal", "dev" or "compile". "normal" and "dev" are for batch jobs, which run on the slave nodes of the cluster. "compile" jobs are single-processor, and run on the master node.

Jobs in the "normal" and "dev" queues are routed through to a number of execution queues depending on the number of processors. These execution queues are configured with differing time limits. For example the queue "normal8" allows up to 8 processors and has a time limit of 18 hours, whereas the queue "normal16" allows up to 16 processors and has a time limit of 9 hours.

Development queue

Jobs submitted to the "dev" queue have higher priority than "normal" jobs, but a shorter execution time limit. The queue is intended for development work, so please use "normal" for normal work where turnaround time is not critical. In addition to "dev" jobs having a higher priority, four nodes (eight processors) of the cluster are reserved exclusively for "dev" jobs from 9am to 6pm Monday to Friday, although they will run "normal" jobs overnight and at weekends.

Fair share

Apart from the distinction between "dev" and "normal" queues, the highest priority job is the one which has been waiting longest. There is, however, one exception to this: a "fair share" factor. This is configured on a per-group basis; each group has a permitted CPU allocation. If the group's recent CPU usage (over the last few days) has been above this level, then the group's jobs will be deprioritised, but will still run if the CPU is otherwise idle. The fair share usage level for each group will be set proportional to the group filesystem quotas shown above (although the scaling factor may be revised depending on levels of machine usage). As with filestore quotas, user-specific limits could also be defined if required.
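
As a rough illustration of what shares proportional to the filestore quotas would look like, the sketch below normalises the group quotas listed under "Filestore" to percentages. It assumes the shares are simply the quotas scaled to sum to 100%; the actual scaling factor used by the scheduler is not stated here and may differ. The function name fair_share_table is hypothetical.

```shell
# Relative fair-share weights, ASSUMING they are simply the filestore
# quotas (in GB) normalised to 100%. Real scheduler settings may differ.
fair_share_table() {
    awk '
        { quota[$1] = $2; order[NR] = $1; total += $2 }
        END {
            for (i = 1; i <= NR; i++)
                printf "%-10s %5.1f%%\n", order[i], 100 * quota[order[i]] / total
        }
    ' <<'EOF'
Birmingham 60
Liverpool 60
Oxford 120
RAL 60
Reading 300
Sheffield 60
SOC 300
UCL 60
UEA 180
EOF
}

fair_share_table
```

The quotas sum to 1200GB, so under this assumption Reading and SOC would each have 25%, UEA 15%, Oxford 10%, and each of the 60GB groups 5%.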

Specifying resource requirements

Resource requirements for UM jobs are set within the user interface. For other jobs, they can be specified as command-line options to the "qsub" command, or with control directives within the job script (see the OpenPBS documentation, including the "qsub" manual page).
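
For a non-UM job, control directives within the job script might look like the sketch below, which writes an example script to a file. The "#PBS" directive syntax and the variables PBS_O_WORKDIR and PBS_NODEFILE are standard OpenPBS; everything else is an assumption for illustration: the job name, the program ./my_model, the resource values, the choice of a walltime limit, and the MPICH-style mpirun launch line (the launcher syntax on lewis may differ).

```shell
# Write an example job script. The same settings could instead be given
# as qsub options, e.g. qsub -q normal -l nodes=4:ppn=2 ...
cat > example_job.pbs <<'EOF'
#!/bin/sh
#PBS -N example_job
#PBS -q normal
#PBS -l nodes=4:ppn=2
#PBS -l walltime=04:00:00
#PBS -j oe

# Start in the directory from which the job was submitted.
cd "$PBS_O_WORKDIR"

# Launch 8 MPI processes on the allocated nodes (MPICH-style; assumed).
mpirun -np 8 -machinefile "$PBS_NODEFILE" ./my_model
EOF

# Submit it from the master node with:
#   qsub example_job.pbs
```

Here "nodes=4:ppn=2" requests four dual-processor nodes, i.e. 8 processors in total, matching the 8 processes started by mpirun.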

When specifying a time limit for your job, it is to your advantage not to overstate the requirements. The scheduler has an intelligent "backfill" algorithm: if, for example, the highest priority job is a 16-processor job which it calculates can be started in 4 hours' time, but meanwhile there are 8 processors idle, it will run your 8-processor job if it requests no more than 4 hours' CPU.

However, beware that if you exceed the requested time limit your job will be terminated.

One approach to specifying time limits is empirically: submit a test job with a generous time limit, and base future job requirements on the actual run time, adding say a 10% safety margin. (See also the benchmark timings for the Unified Model below.)
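
The empirical approach above amounts to a small calculation, sketched here. The function name walltime_with_margin and the HH:MM:SS output format (as commonly used for PBS time limits) are assumptions for illustration; the 10% default is the margin suggested above.

```shell
# Turn a measured run time (in seconds) into a time-limit request with a
# safety margin (default 10%), rounded up and formatted as HH:MM:SS.
walltime_with_margin() {
    awk -v secs="$1" -v pct="${2:-10}" 'BEGIN {
        t = secs * (100 + pct) / 100
        total = int(t); if (total < t) total++
        printf "%02d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

# A test job that took 3h20m (12000 seconds):
walltime_with_margin 12000   # prints 03:40:00
```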

Memory requirements are not such an issue; because the cluster is largely a distributed memory environment, the scheduler is not configured to penalise jobs which ask for generous memory allocation (although as there is 512MB physical memory per dual-processor node, it is not sensible to use more than about 200MB per processor).

Queuing related commands

Here is a selection of useful commands to interact with the software. See also: OpenPBS info, MAUI user documentation.

PBS Command              Description
qsub jobscript           Submit a job script. (Not normally typed manually in UM submission.)
qdel jobid               Stop a queued or running job
qstat                    Show the job queue
qstat -n                 Show the job queue, reporting the nodes on which jobs are running
qstat -Qf [queuename]    Show the limits for all queues or a specified queue
qhold                    Hold a queued job to prevent execution
qrls                     Release a previously held job for potential execution
pbsnodes -l              List any nodes which are down (null output means all nodes functioning)

Maui Command             Description
showq                    Show the job queue, with job start time and duration information
showstart                Show the estimated start time of a queued job
showres [-n]             Show the reservations made by the scheduler (with optional node-specific detail)
diagnose -p              Show the priorities of queued jobs
diagnose -f              Show fair share usage information

UNIFIED MODEL

The following information relates to use of the Met Office Unified Model specifically on lewis. For general information about the model, please refer to the UM Users' guide. If you do not have this already, a copy resides under file:///usr/local/um/umdoc_system/index.htm on lewis; web browsers mozilla, netscape, lynx are installed for the purpose of viewing locally stored HTML files.

Running the UM

Installations

There are two installations of the portable UM version 4.5 on lewis: one at 64-bit and one at 32-bit. The base directories ($UM_HOME) for these installations are ~um64 (/usr/local/um64) and ~um32 (/usr/local/um32) respectively.

In addition, there is an installation of the UM User Interface (UMUI), located at ~umui (/usr/local/umui).

You should make sure that you have signed the usage agreement with the Met Office for the Portable Unified Model (even though there are currently no technical measures on lewis to enforce this). One way to do this is via the British Atmosphere Data Centre; if you register for access to the PUM Software, they will process the usage agreement on behalf of the Met Office, even though you do not need to download the software from the BADC because it is already installed.

Setup

Before you first run the UM, you need to create files in your home directory:

User interface

You can run the UM User Interface (UMUI) either on lewis (on the master node), or on a machine in your home institution.

To run the UMUI on lewis, simply type umui. This has the advantage of being self-contained, and also makes submitting jobs easy (you just press the "Submit" button). Also, if many people run the UMUI on lewis, they can all easily inspect one another's jobs. (The UMUI installation on lewis also incorporates these changes which allow automated running of "hand-edit" scripts.)

However, you may prefer to run the UMUI on another machine. In that case:

Example jobs

Assuming you have done the setup described above, you are ready to take one of the sample jobs found in experiment xaar in the UMUI on lewis, edit your username in the "General details" section (as a minimum; make any other changes required), then save, process and submit the job.

(Remember that where a job is described as 64 or 32 bit, it is still necessary to ensure that your setvars file points to the installation at the corresponding precision.)

Remember also that you can track the execution of the jobs with the "qstat" command.

Output location

The output from the runs is initially stored in the following locations:

However, the sample HadCM3 jobs shown above have post-processing options turned on -- see documentation. (The main purpose of this is to permit restart files to be written frequently, as required temporarily for climate meaning, but without leaving a great number of these permanently stored.) The result is that some output files and dump files which are to be retained are moved from $DATAM to the directory $HOME/um_archive/run_id/ (subdirectories ppfiles and dumps).

At the end of the run, the job output ("leave") file is copied to $HOME/umui_out.

Email reports from UM jobs

If you wish to receive email when your UM job starts and/or finishes, check the relevant options in "Submodel-independent" -> "Output management".

As a convenience, you can create a file called $HOME/.um_mail which will override any email address set in the job. If the file exists and contains an email address, then that address will be used if mail is sent, regardless of the address configured in the UMUI. If the file exists and is blank, then no mail will be sent from UM jobs, regardless of any options set in the UMUI.
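
The behaviours described above can be sketched as three small shell functions. The function names and the email address are hypothetical; the only detail taken from this page is the file $HOME/.um_mail and the effect of its contents. Removing the file to return to the UMUI settings is inferred from the description rather than stated explicitly.

```shell
# Send all UM job mail to one address, regardless of UMUI settings.
um_mail_set() {
    printf '%s\n' "$1" > "$HOME/.um_mail"
}

# Suppress all mail from UM jobs: the file exists but is blank.
um_mail_off() {
    : > "$HOME/.um_mail"
}

# Remove the override file, so the address set in each job applies again
# (inferred behaviour when the file does not exist).
um_mail_reset() {
    rm -f "$HOME/.um_mail"
}
```

For example, "um_mail_set a.user@example.ac.uk" would redirect mail from all of your UM jobs to that (placeholder) address.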

The HadCM3 control integration

A 100-year integration of HadCM3 has already been performed on lewis at 64-bit precision. It is intended for use as a control integration for perturbation experiments performed on lewis, and also for other statistical studies.

Time-average model output fields have been stored on the BADC, with different averaging periods (monthly, seasonal, annual, decadal). These files contain the same fields as for the COAPEC 100 year HadCM3 integration on the Cray, but with the addition of MEAD diagnostics in the ocean fields, and are in NetCDF format. Monthly restart dumps (some in compressed format) are also stored on the BADC.

The data has been added as a subdirectory of the COAPEC data directory on the BADC, so for access to this dataset, follow that link and register for that dataset on the BADC if you have not already done so. The subdirectory is called "100yr_beowulf", and it contains a README file with further information about the files.

If doing a perturbation run, take job xaara in the UMUI on lewis as a starting point. (The actual run was performed as xaaqa; however some special values were used for some of the resource specifications, which have been removed in xaara. Also xaara specifies 8 processors instead of 16, and the run length is changed to 1 year so that a 100-year run is not submitted inadvertently.)

Although the integration was started directly from initial conditions obtained from running HadCM3 on a Cray, the model does not appear to drift from the start, at least as regards the global mean surface temperature: see plot. Other fields will be added here as they are evaluated.

Global Mean Surface Temperature

For those who do not have access to lewis, but want to inspect the job, xaaqa is also available for download:

Porting UM jobs to lewis

Converting a job which is set up to run on another machine so that it runs on the cluster requires a variable amount of effort: at best it can be a relatively easy task, although you may encounter complications.

Here are some pointers to get you started. Feel free to request assistance from Alan Iwi, who will help on a best-efforts basis, but please first make an attempt yourself with the following information.

Speed Benchmarks: obtaining best use of the machine

The following graphs show the speed benchmarks for HadAM3 and HadCM3, with and without writing output diagnostics (STASH).

[Figures: HadAM3 benchmarks; HadCM3 benchmarks]

In the above units "Model years per total CPU days", total CPU days is the wallclock time multiplied by the number of processors. So, for example, a 4-processor job benchmarked at 0.8 model years per total CPU day would take about 3.1 days of wallclock time for a 10-year run.
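
The arithmetic behind that example can be sketched as follows; the function name wallclock_days is hypothetical, and the figures are those quoted above.

```shell
# Estimated wallclock days for a run:
#   total CPU days = model years / (model years per total CPU day)
#   wallclock days = total CPU days / number of processors
wallclock_days() {
    # $1 = benchmark rate, $2 = number of processors, $3 = run length (model years)
    awk -v rate="$1" -v nproc="$2" -v years="$3" \
        'BEGIN { printf "%.1f\n", years / rate / nproc }'
}

# 10 years at 0.8 model years per total CPU day is 12.5 total CPU days,
# which on 4 processors is 3.125 wallclock days.
wallclock_days 0.8 4 10   # prints 3.1
```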

Number of processors

Although the speed of the integration increases with the number of processors (scaling curves better than the "gradient for no further speedup" shown), it becomes decreasingly efficient (scaling curves worse than "gradient for perfect scaling").

This means that unless part of the cluster is otherwise idle, it is better to run jobs on fewer processors rather than more, but with more jobs running concurrently (e.g. different ensemble members or different people's jobs).

The one exception is the anomalous 1x2 configuration for HadCM3 at 64-bit. The cause is insufficient memory (see page on memory requirement, particularly last paragraph). This could be rectified by buying more memory, if there is a demand -- backed up with funds! -- but for the moment please avoid this configuration!

Precision

Except in some configurations which are heavily limited by network latency, running at 32-bit instead of 64-bit gives a speedup of about a factor of 2. Initial model validation tests (see document in PDF or .ps.gz format) suggest that 32-bit may be appropriate for the atmosphere model but that caution should be exercised in using 32-bit with the ocean model.

The role of STASH

The presence or absence of STASH makes a significant difference. In the case of the HadAM3 benchmarks, the set of diagnostics is the extremely large set in the standard HadAM3 job which is distributed with the portable model. In the case of the HadCM3 benchmarks, it is the set used in the COAPEC control integration.

Clearly you will want to write output from your model! However, there may be efficiency implications in writing large numbers of diagnostics when using many processors.

In separate tests, it has been found that the slowdown is largely due to network latency during sampling of diagnostics; therefore if you write time-averaged diagnostics, could the time average be calculated from diagnostics sampled every six hours instead of every timestep? This is configurable under "Edit Time Profile" in the STASH windows of the UMUI.

Software Utilities

A number of software utilities are available on lewis. Please keep usage of the machine which is not directly associated with running models to sensible levels; if your postprocessing is resource-hungry then please consider transferring the output files to a machine in your home institution.

Any reasonable request for additional free software will be considered. If you are considering installing a package in your home directory, please first contact Alan Iwi, who may decide instead to install it in a shared location for wider benefit.

Intel Fortran compiler

To run the Intel Fortran 90 compiler (version 7), first load the appropriate setup script, depending on your shell:

for csh / tcsh:
    source /usr/local/intel/compiler70/ia32/bin/ifcvars.csh
for bash / ksh:
    . /usr/local/intel/compiler70/ia32/bin/ifcvars.sh

(Note that when compiling/running UM jobs, this is already taken care of in setvars.)

Then the compiler is invoked with ifc.

Note that you also have to load the setup script before running programs compiled with this compiler (or at least set the environment variable LD_LIBRARY_PATH as done by the setup script).

For documentation on the Intel Fortran Compiler, see:

On Fortran programming:
    /usr/local/compiler70/docs/for_prg.pdf
On the portability library:
    /usr/local/compiler70/docs/for_lib.pdf

The PDF readers acroread and xpdf are installed.

Unified model file utilities

A number of Unified Model file utilities are installed alongside the UM distribution. You will find convsh and xconv in /usr/local/bin.

Additionally, there are ancillary file utilities which are specific to the 64- or 32-bit distributions (pumf, etc). These are in $UM_HOME/um/vn4.5/utils; use the correct value of $UM_HOME for the precision of your ancillary files.

See also these additional UM file utilities (written by Alan Iwi) -- some of these are installed on the system.

NetCDF

NetCDF libraries (version 3.5.0) are installed, including programming interfaces for Fortran, C and C++ (see /usr/local/lib), together with the bundled utilities ncdump and ncgen.

In addition, the NetCDF Operators (NCO) package (version 2.7.2) is installed. This includes various utilities for operating on sets of NetCDF files, e.g. concatenating, averaging, differencing.

Ferret

Ferret (version 5.51) is installed. This will enable visualisation of NetCDF files (e.g. converted from Unified Model output).

Note that to run Ferret, it is first necessary to type "source /usr/local/bin/ferret_paths". There is no equivalent script for other shells, so if your login shell is bash or ksh then first invoke tcsh manually.

Some documentation of use of Ferret from the command line is shown on the Tools for COAPEC web page.

Additionally, Ferret can be invoked with the Graphical User Interface, by typing "ferret -gui". Typically a datafile is opened by choosing: "File" -> "Open data set" -> "Search" -> .... A variable for plotting is then selected with the Data "Select" button, and a variety of plots (e.g. zonal mean plots) can then be produced.

When the GUI is used, a file called ferret.jnl is created in the current directory. This contains the commands which can be typed in command-line mode in order to perform the same operations which were performed using the GUI.

CDAT

Climate Data Analysis Tools (CDAT) is installed -- see documentation.

This will enable visualisation of NetCDF files (e.g. converted from Unified Model output). CDAT does not currently open UM output files directly; however, this functionality is being developed, and will be made available on lewis when it exists.

There is also a GUI, which works in a similar way to the Ferret GUI. It has some useful extra features, such as including a cos-latitude weighting when averaging fields in the meridional direction. To run the GUI, type vcdat.

IDL

A number of floating licences for IDL are owned by the research section of the BADC, and it has been decided to make them available to users on lewis on an informal basis. You are encouraged to log out of IDL sessions promptly after use, because this arrangement may be reviewed if the number of simultaneous users of these licences on lewis is enough to inconvenience the BADC research group. If this arises, it may be possible to fund dedicated IDL licences for lewis, but no promises.

To run IDL, first load the appropriate setup script, depending on your shell:

for csh / tcsh:
    source /usr/local/rsi/idl/bin/idl_setup
for bash / ksh:
    . /usr/local/rsi/idl/bin/idl_setup.ksh

Then type idl, or for documentation type idlhelp.

Totalview debugger

The Totalview debugger is installed on the system. However, this has not yet been tested and documented. (Once it is done, instructions will be added here. There is currently no timescale for this, but if you have a particular requirement for the debugger, then please contact Alan Iwi, who may be willing to bring forward this activity.)

Sharing model-related files

The following disk areas have been set up for model-related files which were not supplied with the standard model installation. To facilitate collaboration, they are world-writable so you can place model-related files which will be of benefit to others.

The following areas are common to both 32- and 64-bit installations:

$UM_HOME/local/vn4.5/mods
    Fortran mods
$UM_HOME/local/vn4.5/script_mods
    Script mods
$UM_HOME/local/vn4.5/ctldata
    ASCII datafiles (please do not put binary datafiles here)
$UM_HOME/local/vn4.5/user_stash
    User stashmaster files (for use with UMUI)
$UM_HOME/local/vn4.5/comp_overrides
    Compile-option override files
$UM_HOME/local/vn4.5/misc
    Everything else

The following areas exist separately for each of the 32- and 64-bit installations:

$UM_HOME/localdata/dumps
    Model start dumps
$UM_HOME/localdata/ancil
    Other ancillary datafiles

The guidelines for putting files there are:

In the standard HadAM3 and HadCM3 jobs described above, environment variables have been set which point to many of these directories.


Last edited: 8 August 2003
Alan Iwi <A.M.Iwi@rl.ac.uk>