September 30, 2014

GridPP Storage

Data format descriptions

The highlight of the data area working group meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language (DFDL). The idea is that if you have formatted or structured input from a sensor, or a scientific event, and it's not already in one of the established formats like (say) OPeNDAP or HDF5, you can use DFDL to describe it and then build a parser for records of that format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.
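The kind of work a DFDL-generated parser does can be illustrated with a short sketch. DFDL schemas themselves are XML, so this is not DFDL syntax; it is a plain Python sketch, for a hypothetical fixed-layout sensor record (the field layout and the sanity check are invented for the example), of parsing and validating records before ingest:

```python
import struct

# Hypothetical fixed-layout sensor record (an invented example, not a
# real instrument format): 4-byte sensor id, 4-byte timestamp and a
# 4-byte float reading, all big-endian.
RECORD_FORMAT = ">IIf"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 12 bytes

def parse_record(raw: bytes) -> dict:
    """Parse one record, validating it before ingest."""
    if len(raw) != RECORD_SIZE:
        raise ValueError("truncated record")
    sensor_id, timestamp, reading = struct.unpack(RECORD_FORMAT, raw)
    if not -100.0 <= reading <= 100.0:  # invented domain sanity check
        raise ValueError("reading out of range")
    return {"sensor_id": sensor_id, "timestamp": timestamp, "reading": reading}

# Round-trip a sample record.
raw = struct.pack(RECORD_FORMAT, 42, 1412035200, 21.5)
print(parse_record(raw))
```

The point of DFDL is that the description lives in a schema rather than in hand-written code like this, so the parser can be generated and reused.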

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor. The interactive tool looks and feels a bit like Eclipse, but is called the Integration Toolkit.
And for those eager for more, the appearance of DFDL v1.0 is imminent.

by Jens Jensen at September 30, 2014 20:39

The SSI Blog

Free help to forge better software

By Steve Crouch, Consultancy Leader.

The Institute's Open Call provides developer expertise and effort - free of charge - to UK-based researchers. If your project develops research software and you'd like some expert help, you should submit an application to the Open Call.

We've just opened the latest round of the Open Call, which closes on 5 December 2014.

You can ask for our help to improve your research software, your development practices, or your community of users and contributors (or all three!). You may want to improve the sustainability or reproducibility of your software, and need an assessment to see what to do next, or perhaps you need guidance or development effort to help improve specific aspects or make better use of infrastructure. We want applications from any discipline in relation to software at any level of maturity.

Open Call, RSG, Research Software Group, consultancy, author:Steve Crouch

read more

by s.crouch at September 30, 2014 13:00

From benign dictatorship to democratic association: the RSE AGM

By Simon Hettrick, Deputy Director.

If you don’t write papers, how should a university recognise your work? This and related topics were the focus of discussions at the first ever Annual General Meeting of Research Software Engineers, which took place on 15-16 September. The AGM was an important milestone in our campaign for Research Software Engineers: it marked the first formal meeting of the RSE community. 

Over fifty RSEs gathered at ORTUS, King’s College London, to collaborate and discuss their work. The day kicked off with an overview of the RSE campaign from staunch RSE supporter James Hetherington. This was followed by a talk from Kumar Jacob and Richard Dobson (respectively from Maudsley Digital and the NIHR BRC for Mental Health) about software use in mental health research. Maudsley Digital were the gold sponsor for the event, and were joined by our other sponsors: the NIHR BRC for Mental Health, GitHub, Microsoft Research and FitBit UK.

Research Software Engineers, RSE, Events

read more

by s.hettrick at September 30, 2014 11:59

September 25, 2014

GridPP Storage

Erasure-coding: how it can help *you*.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevant to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement all operate at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially fetch multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than depending on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you now depend on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). Assuming the failure modes are independent, the probability that the whole stripe survives is the product of the individual survival probabilities, so the chance of losing data grows quickly with the number of devices in the stripe.
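To put rough numbers on that fragility (the per-device failure probability here is made up for illustration):

```python
# An unprotected stripe of n devices survives only if every device
# survives, so P(stripe loss) = 1 - (1 - p)**n for independent failures
# with per-device failure probability p.
def stripe_loss_probability(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

p = 0.01  # illustrative per-device failure probability
for n in (1, 4, 16):
    print(f"{n:2d} devices: P(loss) = {stripe_loss_probability(p, n):.4f}")
```

With p = 1%, a 16-device stripe is already an order of magnitude more likely to lose data than a single device.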

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take the N chunks that you have split your data into. Design a function that, when fed N values, outputs an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial with more samples than its order - you can always reconstruct the polynomial from any N of the N+M samples you have.)
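The polynomial analogy can be checked numerically. This is not Reed-Solomon itself (real codes work over finite fields); it is a quick sketch using exact Lagrange interpolation: oversample a degree-2 polynomial at 5 points, discard any 2, and recover its values exactly from the remaining 3.

```python
from fractions import Fraction

def lagrange_eval(points, x):
    """Evaluate the unique polynomial through `points` at `x`, exactly."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total

# A degree-2 polynomial is fixed by any 3 samples; take 5 (two "spare").
def p(x):
    return 2 * x * x + 3 * x + 1

samples = [(x, p(x)) for x in range(5)]

# Drop any two samples: the remaining three still recover p exactly.
subset = [samples[0], samples[2], samples[4]]
assert lagrange_eval(subset, 10) == p(10)
```

Any choice of three samples works, which is exactly the property the erasure code needs: losing up to M pieces costs nothing.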
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 resilient values for every 1 unknown bad value you need to detect and fix.

If we decide how many device failures we want to tolerate, we can use an erasure code to 'preprocess' our stripes, writing out stripes of N+M chunks.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)
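A minimal sketch of the M=1 ('RAID5'-style) case, using XOR as the parity function: lose any one chunk of the stripe and it can be rebuilt from the survivors plus the parity chunk.

```python
def xor_parity(chunks):
    """XOR all chunks together: the single parity chunk of the M=1 case."""
    parity = bytearray(len(chunks[0]))
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    return bytes(parity)

data = b"GRIDSTORAGEDATA!"                         # 16 bytes of payload
chunks = [data[i:i + 4] for i in range(0, 16, 4)]  # N = 4 chunks
parity = xor_parity(chunks)                        # stripe is N + 1 chunks

# Simulate losing chunk 2: XORing the surviving chunks with the parity
# chunk reconstructs the missing one exactly.
survivors = chunks[:2] + chunks[3:]
recovered = xor_parity(survivors + [parity])
assert recovered == chunks[2]
```

Real erasure codes generalise this beyond M=1, but the storage cost stays at M extra chunks per N, rather than a whole extra copy.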

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.
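The upload path described above can be sketched as follows, under stated assumptions: the endpoint names are hypothetical, a single XOR parity chunk stands in for a real erasure code, and a plain dictionary stands in for the metadata a catalogue like the LFC (or DIRAC) would store.

```python
def encode_and_place(data: bytes, n: int, endpoints: list):
    """Split data into n chunks plus one XOR parity chunk (a stand-in for
    a real erasure code), and assign chunks round-robin to endpoints."""
    chunk_size = -(-len(data) // n)  # ceiling division
    chunks = [data[i:i + chunk_size].ljust(chunk_size, b"\0")
              for i in range(0, n * chunk_size, chunk_size)]
    parity = bytearray(chunk_size)
    for chunk in chunks:
        for i, b in enumerate(chunk):
            parity[i] ^= b
    stripe = chunks + [bytes(parity)]
    # Record the placement and scheme so retrieval can locate and decode.
    placement = {f"chunk{i}": endpoints[i % len(endpoints)]
                 for i in range(len(stripe))}
    metadata = {"scheme": "xor", "n": n, "m": 1,
                "size": len(data), "placement": placement}
    return stripe, metadata

stripe, meta = encode_and_place(b"some experiment data", n=4,
                                endpoints=["se1.example", "se2.example",
                                           "se3.example"])
print(meta["placement"])
```

Note the overhead: for n=4 and m=1 we store 1.25x the data while tolerating one lost chunk, versus 2x for a full additional replica.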

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

by Sam Skipsey at September 25, 2014 16:03

The SSI Blog

2015 Fellowships - how it helped a digital humanist

By Stuart Dunn, lecturer at the Centre for e-Research, King's College London, and 2014 Institute Fellow.

One problem with being a digital humanities academic these days is the sheer volume of scholarly activity available – from seminars and workshops to conferences and symposia. In London alone, one could easily attend three or four such events every week, if not more.

My Fellowship has provided me with an excellent heuristic for selecting which events to attend, and has helped me connect with the community around how digital humanists approach and practise the sustainability of what they use and build.

Especially to someone used to applying for research grants, the application process was extremely simple and lightweight. The focus was on your ideas and thinking, rather than just box-ticking. Even writing the application forced me to think succinctly about the challenges and questions facing the DH community in sustaining software. These include whether we are too reliant on proprietary software, what role crowdsourcing will play in the future, and in what ways the inherently collaborative nature of Digital Humanities impacts on sustainability issues.

Fellows, Fellowship 2015, Digital Humanities, King's College, author:Stuart Dunn

read more

by s.sufi at September 25, 2014 13:00

September 19, 2014

The SSI Blog

2015 Fellowships - what's it like being a Fellow?

By Stephen Eglen, senior lecturer in Computational Biology at the University of Cambridge and 2014 Institute Fellow.

I first heard about the Software Sustainability Institute in 2013, when Laurent Gatto and I were planning an R programming bootcamp.

I have long been a believer in the open sharing of software, and so I was glad to read about many of the complementary issues that the Institute has promoted, both within the UK and worldwide. Another thing that convinced me to apply was that a respected colleague in the R community, Barry Rowlingson, was also a Fellow.

I found the application procedure refreshingly short and straightforward. The most useful thing in the process was to propose what I would do in the course of the Fellowship. I had been discussing with colleagues in the neuroscience community about ways in which we could encourage data and code sharing.

author:Stephen Eglen, Fellows, Fellowship 2015, R, Bootcamp

read more

by s.sufi at September 19, 2014 16:37

September 17, 2014

The SSI Blog

Reducing the Distance between theory and practice

Polar bears

By Mike Jackson, Software Architect.

Clever theory about how to estimate the density or abundance of wildlife is of limited value unless this theory can be readily exploited and applied by biologists and conservationists. Distance sampling is a widely-used methodology for estimating animal density or abundance and the Distance project provides software, Distance, for the design and analysis of distance sampling surveys of wildlife populations. Distance is used by biologists, students, and decision makers to better understand animal populations without the need for these users to have degrees in statistics or computer science. Distance places statistical theory into the hands of practitioners.

Consultancy, open call, wildlife, statistics, population modelling, environment, biology, conservation

read more

by m.jackson at September 17, 2014 10:42

September 16, 2014

The SSI Blog

“Is this a good time?” – how ImprompDo can tell when you’re busy

By Liam Turner, PhD student at Cardiff School of Computer Science & Informatics.

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

Growth in smartphone technology has devolved the traditional trawl for information down to the individual level. This presents a challenge, as traditional methods of making information available depend on when it is readily available, rather than when it is most convenient for a busy user.

Currently users have to work out the best way to get information while still managing their other commitments at the same time, but it would be more useful if this could be managed proactively. This predictive estimation would analyse and arrange itself around its user’s behaviour before it sent them the new information. This forms the backbone of our project in using the technical capabilities of the smartphone to infer interruptibility and so make a decision as to whether to deliver or delay.

author:Liam Turner, Day in the software life, DISL, Android, Artificial Intelligence, Cardiff

read more

by a.hay at September 16, 2014 09:00

September 15, 2014

The SSI Blog

What makes good code good at EMCSR 2014

By Steve Crouch.

On August 8th 2014, I attended the first Summer School in Experimental Methodology in Computational Research at the University of St Andrews in Scotland. Run as a pilot aimed primarily at computer scientists, it explored the latest methods and tools for enabling reproducible and recomputable research; the aim is to build on this successful event and hold a bigger one next year.

The Institute already works with the Summer School organisers on a related project. Led by Ian Gent, this project aims to allow the reproduction of scientific results generated using software by other researchers, by packaging up software and its dependencies into a virtual machine that others can easily download and run to reproduce those results.

author:Steve Crouch, Consultancy, recomputation, reproducibility, EMCSR

read more

by s.crouch at September 15, 2014 13:00

September 11, 2014


Configuring CVMFS for smaller VOs

We have just configured CVMFS for the t2k, hone, mice and ilc VOs, after sitting on the request for a long time. The main reason for the delay was the assumption that we would need to change the CVMFS Puppet module to accommodate non-LHC VOs. It turned out to be quite straightforward, with little effort required.
We are using the CERN CVMFS module; there was an update a month ago, so it is worth keeping it up to date.

We use Hiera to pass parameters to the module; our Hiera configuration for CVMFS is:
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';;'

One important bit is the name of cvmfs repository e.g instead of

Another slight hitch is the public key distribution for the various CVMFS repositories. Installing CVMFS also fetches the cvmfs-keys-*.noarch RPM, which puts all the keys for the CERN-based repositories into /etc/cvmfs/keys/.

I had to copy the public key for and to /etc/cvmfs/keys. It can be fetched from the repository
wget -O
or copied from

We distributed the keys through Puppet, but outside the CVMFS module.
It would be great if someone could convince CERN to include the public keys of other repositories in the cvmfs-keys-* RPM. I am sure there are not going to be many CVMFS stratum 0s.

The last part of the configuration is to change SW_DIR in site-info.def or the vo.d directory.

WNs require re-running YAIM to configure SW_DIR in /etc/profile.d/. You can also edit the file manually and distribute it through your favourite configuration management system.

by Kashif Mohammad at September 11, 2014 14:09

The SSI Blog

Online psychological therapy for Bipolar Disorder

By Nicholas Todd, Psychologist in Clinical Training at Leeds Teaching Hospitals NHS Trust.

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

People with Bipolar Disorder often have problems gaining access to psychological therapy. Online interventions are an innovative solution to this accessibility problem and are recommended in clinical guidelines for mild to moderate anxiety and depression. These interventions provide round-the-clock, evidence based, self-directed support for a large number of people at a reduced cost to the NHS. 

The Living with Bipolar project was funded by Mersey Care NHS Trust and led by myself under the supervision of Professor Fiona Lobban and Professor Steven Jones, from the Spectrum Centre for Mental Health Research, Lancaster University. It was the first randomised controlled trial of an online psychological intervention for Bipolar Disorder to find preliminary evidence that the web-based treatment approach is feasible and potentially effective.

Day in the software life, DISL, author:Nicholas Todd, Lancaster, Leeds, Psychology

read more

by a.hay at September 11, 2014 09:00

September 09, 2014

The SSI Blog

The Wild Man Game - bringing historic places to life

By Gavin Wood and Simon Bowen, Digital Interaction Group, Newcastle University.

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

Heritage organisations, such as museums, and managers of historic sites are increasingly interested in using mobile phones as a way of adding value to visits and directly connecting with the general public. App designers have responded by creating gamified digital experiences by borrowing game mechanics and game elements in an attempt to engage the user.

However, these experiences often fall short and we are given uninteresting treasure hunts that are often more about achieving goals and collecting rewards rather than thinking about and connecting with the heritage space itself. In response, we are exploring how digital play can bring our cherished cultural spaces to life, challenging the typical role for mobile phone apps in such contexts.

author:Gavin Wood, author:Simon Bowen, Day in the software life, DISL, Newcastle, Wildman, Belsay Hall

read more

by a.hay at September 09, 2014 09:00

September 04, 2014

The SSI Blog

A map of many views - what Google Earth and a 1500 AD chart of Venice have in common

By Juraj Kittler, Assistant Professor of Communication at St. Lawrence University, and Deryck Holdsworth, Professor of Geography at Penn State University.

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

Our recent study, published last month in New Media & Society, surveyed the technical approaches adopted by Renaissance artist Jacopo de’ Barbari when he drafted his iconic bird’s-eye view of Venice in the last decade of the fifteenth century. We pointed out some important parallels between this masterpiece of Renaissance mapmaking and the current computer-supported digital representations of urban spaces.

The historical sources that we analysed indicate that de’ Barbari’s map was a composite image stitched together from a multitude of partial views. These were produced by surveyors using a technical device called the perspectival window, in a fashion that may be seen as a proto-digital technology. When constructing his two-dimensional image, the artist was intentionally tricking the eye of the observer into seeing a three-dimensional panoply, evoking what later became known as virtual reality.

author:Juraj Kittler, author:Deryck Holdsworth, Day in the software life, DISL, Virtual Reality, Maps, Venice

read more

by a.hay at September 04, 2014 09:00