How do we teach scientists to be programmers?

That is our challenge on the "Introduction to scientific computing" (ISC) course.

At the STFC Centre for Environmental Data Analysis (CEDA) we collaborate with colleagues in the National Centre for Atmospheric Science (NCAS) to deliver training courses in the foundations of scientific computing (Linux and Python). The courses are primarily taught to scientists working in the environmental sciences, and they play an essential role in enabling those scientists to make the transition to a big-data world.

In the past it has been feasible (in some fields) to do scientific analysis, processing and statistics in Excel or using packages running on a local machine. As data volumes have grown, so has the scale of the analysis task. At CEDA, we house a cutting-edge big-data analysis facility in JASMIN, on which the multi-petabyte CEDA archives sit alongside vast collections of project-specific data. Scientists need to become competent in automating their workflows - that is the key driver behind the ISC course.

We teach an introduction to Linux because we believe it is the platform that supports the richest toolkit for manipulating large data sets. Linux is also the basis for most High-Performance Computing (HPC) and cluster systems that underpin modern data workflows. We then build on that with the basics of Python (at version 2.7): the main data types, control flow, functions and modules (and we touch on classes). Having laid the foundations, we then start on the exciting stuff: numpy (array manipulation), matplotlib (visualisation) and netCDF4 (for data storage and I/O). Finally, we use a practical exercise that employs a simple temperature sensor to consolidate the material learnt during the week.
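
To give a flavour of the Python basics listed above, here is a small sketch of the sort of thing a newcomer might write. It is not taken from the ISC materials: the readings and the function names (mean, celsius_to_kelvin) are invented for illustration, and it deliberately sticks to the standard library rather than numpy.

```python
# Illustrative sketch only: the readings and function names are invented,
# not taken from the ISC course materials.


def mean(values):
    """Return the arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / float(len(values))


def celsius_to_kelvin(temp_c):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temp_c + 273.15


# Data types: a dictionary mapping hour of day to a (hypothetical)
# sensor reading in degrees Celsius.
readings = {9: 18.5, 10: 19.0, 11: 20.5, 12: 21.0}

# Functions: compute the mean of the readings.
average = mean(readings.values())

# Control flow: collect the hours whose reading is above the mean.
warm_hours = []
for hour in sorted(readings):
    if readings[hour] > average:
        warm_hours.append(hour)

print("Mean temperature: %.2f C (%.2f K)" % (average, celsius_to_kelvin(average)))
print("Hours above the mean: %s" % warm_hours)
```

The same calculation becomes a one-liner once the readings are held in a numpy array, which is roughly the step the course takes next.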

Version control is a hard sell and a hard topic for newcomers to grasp. We have decided to drop our students in at the deep end by introducing them to Git and GitHub on the first day. Throughout the course we remind them to push their coursework to their own GitHub repository, and we hope that by the end of the week they have the basics, and understand the benefits, of simple version control. We know it's a steep learning curve, but it is essential to good practice that the next generation of scientists have version control locked down.

We know there is a real desire for a follow-up course that looks at higher-level tools for data analysis (e.g. pandas, cfpython, iris, CDO) and offers more hands-on examples of working with real data. It is something we would like to consider if and when we get the head space to think about it.

Further information

CEDA Training:

ISC Training Materials: