October 17, 2014

The SSI Blog

Desert Island Hard Disks: Andrew Treloar

You find yourself stranded on a beautiful desert island. Fortunately, the island is equipped with the basics needed to sustain life: food, water, solar power, a computer and a network connection. Consummate professional that you are, you have brought the three software packages you need to continue your life and research. What software would you choose and - go on - what luxury item would you take to make life easier?

Today we hear from Andrew Treloar, Director of Technology at the Australian National Data Service and Co-chair of the Research Data Alliance Technical Advisory Board.

I still can't quite remember how I ended up on the desert island... It may have been a particularly bad performance review, or possibly my boss just took me a bit too seriously when I said I needed more time to sit and think? In any case, the only bit of the process that is still clear in my memory was having a short window in which to decide which device to take and what software packages I was allowed to load onto the internal flash storage. (Hard disks? Who uses hard disks anymore?)

Community
Desert Island Hard Disks, author:Andrew Treloar, Yosemite, Mac OS X, OmniGraffle, Framemaker

read more

by s.hettrick at October 17, 2014 13:00

Magnetic imaging software now FABBERlously easy to use

By Gillian Law, TechLiterate, talking with Michael Chappell, University of Oxford.

This article is part of our series: Breaking Software Barriers, in which Gillian Law investigates how our Research Software Group has helped projects improve their research software. If you would like help with your software, let us know.

Sometimes you just have to recognise that you can’t do everything, acknowledge that someone else has more experience and skills than you do, and accept their help.

That’s what Michael Chappell, Associate Professor in Engineering Science at the University of Oxford’s Institute of Biomedical Engineering, did when he turned to the Software Sustainability Institute for a steer on how to take his software forward.

Professor Chappell had developed an excellent piece of software that did exactly what he set out to make it do: the C++ tool, FABBER, processes functional magnetic resonance imaging (fMRI) to recognise blood flow patterns in the brain and measure brain activity. It works well for the research group that Chappell currently leads, QuBIc, and many other developers in the field are also keen to create their own analysis models to work with it, but that’s where things begin to become problematic for Chappell.

Consultancy
Breaking Software Barriers, fMRI, FABBER, Imaging, Open Call, author:Gillian Law

read more

by s.crouch at October 17, 2014 09:20

October 14, 2014

NorthGrid

Tired of full /var ?

This is how I prevent /var from getting full on any of our servers. I wrote these two scripts, spacemonc.py and spacemond.py. spacemonc.py is a client, and it is installed on each grid system and worker node as a cronjob:
# crontab -l | grep spacemonc.py
50 18 * * * /root/bin/spacemonc.py
Because the server is (almost) single threaded, I use puppet to run the client at a random time on each system. (I say "almost" because it actually uses method level locking to hold each thread in a sleep state, so I think it's really a queueing server; it won't drop simultaneous incoming connections, but it's unwise to allow too many of them at once.)
cron { "spacemonc":
  #ensure => absent,
  command => "/root/bin/spacemonc.py",
  user    => root,
  hour    => fqdn_rand(24),
  minute  => fqdn_rand(60),
}
And it's pretty small:
#!/usr/bin/python

import xmlrpclib
import os
import subprocess
from socket import gethostname

proc = subprocess.Popen(["df | perl -p00e 's/\n\s//g' | grep -v ^cvmfs | grep -v hepraid[0-9][0-9]*_[0-9]"], stdout=subprocess.PIPE, shell=True)
(dfReport, err) = proc.communicate()

s = xmlrpclib.ServerProxy('http://SOMESERVEROROTHER.COM.ph.liv.ac.uk:8000')

status = s.post_report(gethostname(),dfReport)
if (status != 1):
    print("Client failed")
The strange piece of perl in the middle is to stop a bad habit in df of breaking lines that have long fields (I hate that; ldapsearch and qstat also do it.) I don't want to know about cvmfs partitions, nor raid storage mounts.
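
As a rough illustration of what the perl one-liner does, the same join can be written in Python. This is only a sketch, not part of the scripts above: the function name join_wrapped_lines and the sample df output are made up for the example.

```python
import re


def join_wrapped_lines(df_output):
    """Rejoin df entries that were split because the device name was long.

    Like the perl substitution s/\\n\\s//g, this deletes every newline that
    is immediately followed by a whitespace character, so a continuation
    line is glued back onto the device name above it.
    """
    return re.sub(r"\n\s", "", df_output)


# Made-up df output in which the second entry has been wrapped.
sample = (
    "Filesystem 1K-blocks Used Available Use% Mounted on\n"
    "/dev/mapper/vg_example-lv_var\n"
    "                      1000 900 100 90% /var\n"
    "/dev/sda1 500 100 400 20% /boot\n"
)

for line in join_wrapped_lines(sample).splitlines():
    print(line)
```

Where it is available, df -P (POSIX output format) keeps each file system on a single line, which may make the join unnecessary.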

spacemond.py is installed as a service; you'll have to pinch a /etc/init.d script to start and stop it properly (or do it from the command line to start with.) And the code for spacemond.py is pretty small, too:
#!/usr/local/bin/python2.4

import sys
from SimpleXMLRPCServer import SimpleXMLRPCServer
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler
import time
import smtplib
import logging

# Optional first argument: the percentage at which a file system counts as full
if (len(sys.argv) == 2):
    limit = int(sys.argv[1])
else:
    limit = 90

# Maybe put logging in some time
logging.basicConfig(level=logging.DEBUG,
                    format='%(asctime)s %(levelname)s %(message)s',
                    filename="/var/log/spacemon/log",
                    filemode='a')

# Email details
smtpserver = 'hep.ph.liv.ac.uk'
recipients = ['sjones@hep.ph.liv.ac.uk','sjones@hep.ph.liv.ac.uk']
sender = 'root@SOMESERVEROROTHER.COM.ph.liv.ac.uk'
msgheader = "From: root@SOMESERVEROROTHER.COM.ph.liv.ac.uk\r\nTo: YOURNAME@hep.ph.liv.ac.uk\r\nSubject: spacemon report\r\n\r\n"

# Test the server started
session = smtplib.SMTP(smtpserver)
smtpresult = session.sendmail(sender, recipients, msgheader + "spacemond server started\n")
session.quit()

# Restrict to a particular path.
class RequestHandler(SimpleXMLRPCRequestHandler):
    rpc_paths = ('/RPC2',)

# Create server
server = SimpleXMLRPCServer(("SOMESERVEROROTHER.COM", 8000), requestHandler=RequestHandler)
server.logRequests = 0
server.register_introspection_functions()

# Class with a method to process incoming reports
class SpaceMon:
    def post_report(self, hostname, report):
        full_messages = []
        full_messages[:] = [] # Always empty it

        # Skip the df header line, then check the Use% column of each entry
        lines = report.split('\n')
        for l in lines[1:]:
            fields = l.split()
            if (len(fields) >= 5):
                fs = fields[0]
                pc = fields[4][:-1]  # strip the trailing '%'
                ipc = int(pc)
                if (ipc >= limit ):
                    full_messages.append("File system " + fs + " on " + hostname + " is getting full at " + pc + " percent.\n")
        # One email per report if anything is over the limit
        if (len(full_messages) > 0):
            session = smtplib.SMTP(smtpserver)
            smtpresult = session.sendmail(sender, recipients, msgheader + ("").join(full_messages))
            session.quit()
            logging.info(("").join(full_messages))
        else:
            logging.info("Happy state for " + hostname )
        return 1

# Register and serve
server.register_instance(SpaceMon())
server.serve_forever()
And now I get an email if any of my OS partitions is getting too full. It's surprising how small server software can be when you use a framework like XMLRPC. In the old days, I would have needed 200 lines of parsing code and case statements. Goodbye to all that.
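
The parsing step inside post_report can also be pulled out into a small function, which makes it easy to try against a pasted df report without starting the server or sending mail. This is a sketch under some assumptions: the name find_full_filesystems is hypothetical, the extra check for a non-numeric Use% column is an addition, and it is written for a current Python 3 rather than the Python 2.4 used above.

```python
def find_full_filesystems(report, limit=90):
    """Return (filesystem, percent_used) pairs at or above the limit.

    'report' is the text of a df listing; the first line is assumed to be
    the header and the fifth column is assumed to look like '87%'.
    """
    full = []
    for line in report.split("\n")[1:]:
        fields = line.split()
        if len(fields) < 5:
            continue  # blank or malformed line
        percent = fields[4][:-1]  # strip the trailing '%'
        if not percent.isdigit():
            continue  # e.g. '-' for pseudo file systems
        if int(percent) >= limit:
            full.append((fields[0], int(percent)))
    return full


# Made-up report for illustration.
example_report = (
    "Filesystem 1K-blocks Used Available Use% Mounted on\n"
    "/dev/sda1 100 95 5 95% /var\n"
    "/dev/sda2 100 10 90 10% /home\n"
    "tmpfs 100 0 100 - /dev/shm\n"
)

for fs, pct in find_full_filesystems(example_report):
    print("File system %s is getting full at %d percent." % (fs, pct))
```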

by Steve Jones (noreply@blogger.com) at October 14, 2014 16:23

SouthGrid

Nagios Monitoring for Non LHC VO’s



A brief description of the monitoring framework before coming to the actual topic of monitoring non-LHC VOs.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of several components that perform different functions. It can be broadly divided into:
‘What to Monitor’ or Topology Aggregation: Collection of service endpoints and metadata from different sources like GOCDB, BDII, VOMS etc. A custom topological source (VO feeds) can also be used.
Profile Management: Mapping of services to the tests to be performed. This service is provided by the POEM (Profile Management) database. It provides a web-based interface to group various metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is automatically configured based on the information provided by the Topology Aggregator and POEM.
The SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a regional SAM Nagios, and results from the regional SAM Nagios also go to the central MyEGI, which is used for reliability/availability calculation.
The UK regional Nagios is maintained at Oxford, with a backup instance at Lancaster.

VO-Nagios
For a long time there was no centralised monitoring of non-LHC VOs, which contributed to a bad user experience, as it was difficult to tell whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at Oxford, as we had experience with the WLCG Nagios.
It is currently monitoring five VOs:
gridpp
t2k
snoplus.snolab.ca
pheno
vo.southgrid.ac.uk

Sites can look for tests associated with only their site
VO managers may be interested to see tests associated with a particular VO only

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO feed available on a web server. Currently we are maintaining this VO feed.

The VO feed provides the list of services to be monitored. I am generating this VO feed through a script.

Jobs are submitted using a proxy generated from a Robot Certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and CA version. Jobs are submitted every eight hours, and this interval is configurable. We are monitoring CREAMCE, ARC-CE and SE only. Services like BDII, WMS etc. are already monitored by the regional Nagios, so there was no need for duplication.


For more information, these links can be consulted
https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Public+Site.html

by Kashif Mohammad (noreply@blogger.com) at October 14, 2014 13:08

October 10, 2014

The SSI Blog

Relearning Fortran through the medium of Star Wars

By Leanne Mary Wake, 2014 Fellow and Anniversary Research Fellow, Department of Geography, University of Northumbria.

The instructor heartened me when he kicked off the Introduction to F95 workshop, which took place at the Culham Centre for Fusion Energy on August 18th-19th, by saying "we want software engineers, not hackers." Science has reached a point where we produce and manipulate ever larger datasets, yet amongst the short-of-time and short-of-patience there is a temptation to produce code more with survival than sophistication in mind. This comes down to a clash between code that works versus code that works again, which is part of the Software Sustainability Institute’s mission and something which I will bring into my teaching from now on.

Immediately, I reflected on some of the less-than-useful code I had written in the past, along with my own learning experience as an undergraduate. Words such as ALLOCATION, SUBROUTINES, MODULES and FUNCTIONS (the latter has never been a characteristic of my writing) were mostly absent from my training as an undergraduate. Here are some others - PORTABILITY and GENERALISATION. These are not commands, but two virtues of code produced by a software engineer as opposed to a hacker. Before I proceed, I make no apologies for using metaphors from Star Wars - it was on my mind at the time of writing - to highlight some of the learning outcomes of the workshop that I intend to use as pillars of my own teaching material and which, funnily enough, align themselves with the sustainability mantra.

Community
author:Leanne Mary Wake, FORTRAN, Coding, Culham, Training, Fellows

read more

by a.hay at October 10, 2014 16:00

October 08, 2014

London T2

XrootD and ARGUS authentication

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:
  1. IPv6 (see blog post).
  2. Central authorisation via ARGUS (the subject of this blog post).
  3. XrootD 4.
We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non-local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's Fax for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS do however have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is, however, possible to get them to work.

The following packages are needed (or at least what I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c
  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install xerces-c
  yum install lcmaps-plugins-basic

Now the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:


 xrootd.seclib /usr/lib64/libXrdSec.so
 xrootd.fslib /usr/lib64/libXrdOfs.so
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfun:libXrdLcmaps.so -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 #                                                                              
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1

And in /etc/xrootd/lcmaps.cfg it is necessary to change the path and the argus server (my argus server is obscured in the example below). My config file looks like:

################################

# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note: put your own argus host in place of argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid http://esc.qmul.ac.uk/xrootd"
             " --actionid http://glite.org/xacml/action/execute"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

xrootd_policy:
pepc -> good | bad
################################################


Then after restarting xrootd, you just need to test that it works.

It seems to work: I was able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd - though perhaps if I'd had patience, it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.



by Christopher J. Walker (noreply@blogger.com) at October 08, 2014 11:20

October 07, 2014

The SSI Blog

Open Source - stop complaining about free software

By Devasena Inupakutika, Software Consultant at the Software Sustainability Institute.

The problem with open source software is not that it is free but that some people think this means they have got something for nothing. As an article by MongoDB vice president Matt Asay pointed out, developers really are spoilt these days.

Yet there is no such thing as free software. When we call software "free", it means that it respects the user's essential freedoms: the freedom to run it, to study and change it, and to redistribute copies with or without changes. The source code can be read and modified as much as the author allows. Despite the success of open source software development, most of the general public feels that the software itself is inaccessible to them. This way they abuse the whole idea of open source by not paying back with their development to help projects.

Consultancy
author:Devasena Inupakutika, Open Source Software, Documentation, Development, MongoDB, Zongo

read more

by d.inupakutika at October 07, 2014 13:00

The creation of the Software Carpentry Foundation

By Neil Chue Hong, Director.

The Software Sustainability Institute is delighted to affirm its continued support for the Software Carpentry initiative as it enters the next stage of its development. The Institute's Neil Chue Hong and Carole Goble have been invited to join the interim board, contributing their time and experience to assist the creation of an independent Software Carpentry Foundation.

Community
Training
Software Carpentry, Training, author:Neil Chue Hong

read more

by n.chuehong at October 07, 2014 09:51

October 03, 2014

The SSI Blog

Smartphones for improved disease spread modelling

By Katayoun Farrahi, Lecturer at the Department of Computing, Goldsmiths, and Rémi Emonet, Associate Professor and Software Engineer at Jean Monnet University

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

In our globalised world, people can travel across several continents in a single day - carrying diseases with them. The importance of containing disease outbreaks to prevent global epidemics cannot be overstated, as evidenced by the recent Ebola outbreak in West Africa.

Diseases spread through physical proximity. Having a mechanism to know or predict interactions between people would allow us to track the movement of diseases, which would be an invaluable tool in preventing an epidemic. But how would such a feat be achieved? Most individuals carry a mobile phone. A phone can tell us a lot about its owner by continuously collecting a wide range of information, such as location and interaction.

Community
author:Katayoun Farrahi, author:Rémi Emonet, Smartphones, Pathology, Medicine, Software Carpentry, Bluetooth

read more

by a.hay at October 03, 2014 09:00

September 30, 2014

GridPP Storage

Data format descriptions

The highlight of the data area working group meetings at the recent Open Grid Forum at Imperial was the Data Format Description Language (DFDL). The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OpeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor. The interactive tool looks and feels a bit like Eclipse but is called Integration Toolkit.
And for those eager for more, the appearance of DFDL v1.0 is imminent.
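
To make the "validate records before ingesting them" use case concrete, here is the kind of hand-written parser that a format description is meant to replace. This is plain Python and not DFDL; the record layout (timestamp, sensor id, reading), the accepted range and the name parse_record are all invented for the example.

```python
import struct

# Hypothetical sensor record (an assumption, not a real format):
# little-endian uint32 timestamp, uint16 sensor id, float32 reading.
RECORD = struct.Struct("<IHf")


def parse_record(raw):
    """Parse one binary record and reject it if it fails basic validation."""
    if len(raw) != RECORD.size:
        raise ValueError("expected %d bytes, got %d" % (RECORD.size, len(raw)))
    timestamp, sensor_id, reading = RECORD.unpack(raw)
    # Made-up plausibility check; this also rejects NaN readings.
    if not -50.0 <= reading <= 150.0:
        raise ValueError("reading out of range: %r" % reading)
    return {"timestamp": timestamp, "sensor_id": sensor_id, "reading": reading}


record = parse_record(RECORD.pack(1411000000, 7, 21.5))
print(record)
```

Every new sensor format needs another parser like this one, written and maintained by hand, which is the repetition a declarative description aims to remove.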

by Jens Jensen (noreply@blogger.com) at September 30, 2014 20:39

The SSI Blog

Free help to forge better software

By Steve Crouch, Consultancy Leader.

The Institute's Open Call provides developer expertise and effort - free of charge - to UK-based researchers. If your project develops research software and you'd like some expert help, you should submit an application to the Open Call.

We've just opened the latest round of the Open Call, which closes on 5 December 2014.

You can ask for our help to improve your research software, your development practices, or your community of users and contributors (or all three!). You may want to improve the sustainability or reproducibility of your software, and need an assessment to see what to do next, or perhaps you need guidance or development effort to help improve specific aspects or make better use of infrastructure. We want applications from any discipline in relation to software at any level of maturity.

Consultancy
Open Call, RSG, Research Software Group, consultancy, author:Steve Crouch

read more

by s.crouch at September 30, 2014 13:00

From benign dictatorship to democratic association: the RSE AGM

By Simon Hettrick, Deputy Director.

If you don’t write papers, how should a university recognise your work? This and related topics were the focus of discussions at the first ever Annual General Meeting of Research Software Engineers, which took place on 15-16 September. The AGM was an important milestone in our campaign for Research Software Engineers: it marked the first formal meeting of the RSE community. 

Over fifty RSEs gathered at ORTUS, at King’s College London, to meet, collaborate and discuss work. The day kicked off with an overview of the RSE campaign from staunch RSE supporter, James Hetherington. This was followed by a talk from Kumar Jacob and Richard Dobson (respectively from Maudsley Digital and the NIHR BRC for Mental Health) about software use in mental health research. Maudsley Digital were the gold sponsor for the event, and were joined by our other sponsors NIHR BRC for Mental Health, GitHub, Microsoft Research and FitBit UK.

Policy
Community
Research Software Engineers, RSE, Events, author:Simon Hettrick

read more

by s.hettrick at September 30, 2014 11:59

September 25, 2014

GridPP Storage

Erasure-coding: how it can help *you*.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevant to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement are all located at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than being dependent on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you are now dependent on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). This probability is likely to scale roughly linearly with the number of devices in the stripe (for small per-device failure probabilities), assuming their failure modes are independent.
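
The arithmetic behind that scaling is short enough to sketch. Assuming each device fails independently with the same probability p, a stripe over n devices is lost if any one of them fails. The function name and the example value p = 0.01 are made up for illustration.

```python
def stripe_failure_probability(p_device, n_devices):
    """Probability that at least one of n independent devices fails."""
    return 1.0 - (1.0 - p_device) ** n_devices


for n in (1, 2, 5, 10):
    print(n, round(stripe_failure_probability(0.01, n), 4))
```

For small p the result stays close to n times p, which is why a wider stripe is proportionally more fragile.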

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take the N chunks that you have split your data into. Design a function that, when fed N values, outputs an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by more samples than its order - you can always reconstruct an order N polynomial with any N of the N+M samples you have.)
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 redundant values for every 1 unknown bad value you need to detect and fix.
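
The polynomial analogy can be tried out directly. The sketch below treats N data values as samples of a polynomial with N coefficients (degree N-1), adds M extra samples, and rebuilds the original values from any N survivors using Lagrange interpolation. It is a toy: it uses exact fractions where real Reed-Solomon implementations use finite-field arithmetic, and the function names are hypothetical.

```python
from fractions import Fraction


def lagrange_eval(points, x):
    """Evaluate at x the unique polynomial through the given (xi, yi) points."""
    total = Fraction(0)
    for i, (xi, yi) in enumerate(points):
        term = Fraction(yi)
        for j, (xj, _) in enumerate(points):
            if i != j:
                term *= Fraction(x - xj, xi - xj)
        total += term
    return total


def encode(data, m):
    """Treat data[i] as the polynomial's value at x = i; append m more samples."""
    n = len(data)
    points = list(enumerate(data))
    return list(data) + [lagrange_eval(points, x) for x in range(n, n + m)]


def recover(survivors, n):
    """Rebuild the n original values from any n surviving (x, y) samples."""
    if len(survivors) < n:
        raise ValueError("need at least n surviving samples")
    points = survivors[:n]
    return [lagrange_eval(points, x) for x in range(n)]


data = [3, 1, 4, 1]
coded = encode(data, 2)  # N = 4 data values plus M = 2 extra samples
# Lose any two samples, here positions 1 and 2, and rebuild from the rest.
survivors = [(x, y) for x, y in enumerate(coded) if x not in (1, 2)]
print(recover(survivors, 4) == data)
```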

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)
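
The M=1 case is simple enough to show in full, because a single parity chunk can be computed with XOR. The sketch below splits some bytes into N chunks, appends one parity chunk, and rebuilds a stripe with one lost chunk. The function names are hypothetical, and padding the last chunk with zero bytes is a simplification a real tool would need to record somewhere.

```python
def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(data, n):
    """Split data into n equal chunks (zero padded) plus one XOR parity chunk."""
    chunk_len = -(-len(data) // n)  # ceiling division
    padded = data.ljust(chunk_len * n, b"\0")
    chunks = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(n)]
    parity = chunks[0]
    for chunk in chunks[1:]:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def rebuild_missing(stripe):
    """Fill in the single None entry of a stripe by XORing the other chunks."""
    missing = [i for i, chunk in enumerate(stripe) if chunk is None]
    if len(missing) != 1:
        raise ValueError("XOR parity can rebuild exactly one missing chunk")
    present = [chunk for chunk in stripe if chunk is not None]
    rebuilt = present[0]
    for chunk in present[1:]:
        rebuilt = xor_bytes(rebuilt, chunk)
    repaired = list(stripe)
    repaired[missing[0]] = rebuilt
    return repaired


payload = b"erasure codes!"
stripe = split_with_parity(payload, 4)  # 4 data chunks + 1 parity chunk
damaged = list(stripe)
damaged[2] = None  # simulate losing one endpoint
repaired = rebuild_missing(damaged)
print(b"".join(repaired[:4])[:len(payload)] == payload)
```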

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.
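
The trade-off can be put into numbers with a short calculation. Assuming every endpoint fails independently with the same probability p, an N+M stripe is lost only when more than M of its chunks are lost, while its storage cost is (N+M)/N times the original data. The function names and the value p = 0.01 are invented for the example, and math.comb needs Python 3.8 or later.

```python
from math import comb


def stripe_loss_probability(p, n, m):
    """Probability that more than m of the n+m chunks are lost."""
    total = n + m
    survive = sum(
        comb(total, k) * p ** k * (1.0 - p) ** (total - k)
        for k in range(m + 1)
    )
    return 1.0 - survive


def storage_overhead(n, m):
    """Stored bytes divided by original bytes for an n+m stripe."""
    return (n + m) / float(n)


p = 0.01
# One full extra copy is the n=1, m=1 case; 10+2 is an erasure-coded stripe.
for n, m in ((1, 1), (10, 2)):
    print(n, m, storage_overhead(n, m), stripe_loss_probability(p, n, m))
```

Choosing M for a real deployment would depend on measured endpoint failure rates, which this sketch does not attempt to model.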

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

by Sam Skipsey (noreply@blogger.com) at September 25, 2014 16:03

The SSI Blog

2015 Fellowships - a Humanities scholar's view on being a Fellow.

By Stuart Dunn, lecturer at the Centre for e-Research, King's College London, and 2014 Institute Fellow.

One problem with being a digital humanities academic these days is the sheer volume of scholarly activity available – from seminars and workshops to conferences and symposia. In London alone, one could easily attend three or four such events every week, if not more.

My Fellowship has provided me with an excellent heuristic for selecting which events one goes to, and helped me to connect my participation in the community around how digital humanists approach and practice the sustainability of what they use and build.

Especially to one used to applying for research grants, the application process was extremely simple and lightweight. The focus was on your ideas and thinking, rather than just box-ticking. Even writing the application forced me to think succinctly about the challenges and questions facing the DH community in sustaining software. These include whether we are too reliant on proprietary software, what role crowdsourcing will play in the future, and in what ways the inherently collaborative nature of Digital Humanities impacts on sustainability issues.

Community
Fellows, Fellowship 2015, Digital Humanities, King's College, author:Stuart Dunn

read more

by s.sufi at September 25, 2014 13:00