November 21, 2014

The SSI Blog

Building a firm foundation for solid mechanics software

By Gillian Law, TechLiterate, talking with Lee Margetts, University of Manchester.

This article is part of our series: Breaking Software Barriers, in which Gillian Law investigates how our Research Software Group has helped projects improve their research software. If you would like help with your software, let us know.

Software development for research into solid mechanics, particularly in High Performance Computing, has fallen behind other research areas such as fluid dynamics and chemistry, argues Lee Margetts, head of Synthetic Environments and Systems Simulation at the University of Manchester Aerospace Research Institute. It's time for some collaborative development to create some quality, shared code, he says.

Breaking Software Barriers, Open Call, Solid mechanics, ParaFEM, Lee Margetts, author:Gillian Law

read more

by s.crouch at November 21, 2014 14:00

November 20, 2014

The SSI Blog

Cartooning the First World War

By Rhianydd Biebrach, Project Officer for Cartooning the First World War at Cardiff University.

This article is part of our series: a day in the software life, in which we ask researchers from all disciplines to discuss the tools that make their research possible.

Cartoons are probably not the most obvious source material for the history of the First World War, but in the approximately 1,350 political cartoons drawn by Joseph Morewood Staniforth that appeared in The Western Mail and the News of the World throughout the conflict, we have a fascinating record of how those events were consumed daily at the nation’s breakfast tables.

This chronicle of the war is replete with clever caricatures of individuals such as Herbert Henry Asquith, David Lloyd George and the Kaiser, as well as Britannia, John Bull, and Staniforth’s own invention, the stout and matronly Dame Wales, in her traditional Welsh costume.

author:Rhianydd Biebrach, Day in the software life, Cardiff, Cartooning, World War One, Digital humanities

read more

by a.hay at November 20, 2014 14:00

November 18, 2014

GridPP Storage

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of ATLAS computing at Chilton where RAL is located. (The anniversary was actually earlier, we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster than light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results.  The idea is to get your data out, so it people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that it will not negatively impact people's careers for publishing it.On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like?  Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, which is supposed to be "given enough eyes, all bugs are shallow."  With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

by Jens Jensen ( at November 18, 2014 17:24

November 17, 2014


Condor Workernode Heath Script

This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I made use a a Condor feature; startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.

PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
StartJobs = False
RalNodeOnline = False

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiffe at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and “RalNodeOnline”; it's sets them initially to False; and it makes the START configuration setting dependant upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need this. It tells the system (startd) to run a cron script every three minutes.


# Make sure values get over
The script looks like this:


/usr/libexec/condor/scripts/ > /dev/null 2>&1

if [ $STATUS != 0 ]; then
MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/ | head -n 1 | sed -e "s/=.*//"`
if [[ -z "$MESSAGE" ]]; then

if [[ $MESSAGE =~ ^OK$ ]] ; then
echo "RalNodeOnline = True"
echo "RalNodeOnline = False"
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script which I reuse from out TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, I also lookup the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting to false. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as “root”; here it runs as “condor”. I had to use sudo for some of the sections which (e.g.) check disks etc. because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs. Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01

by Steve Jones ( at November 17, 2014 16:52

October 14, 2014


Tired of full /var ?

This is how I prevent /var from getting full on any of our servers. I wrote these two scripts, and is a client, and it is installed on each grid system and worker node as a cronjob:
# crontab -l | grep
50 18 * * * /root/bin/
Because it's going to be an (almost) single threaded server, I use puppet to make it run at a random time on each system (I say "almost" because it actually uses method level locking to hold each thread in a sleep state, so it's actually a queueing server, I think; it won't drop simultaneous incoming connections, but it's unwise to allow too many of them to occur at once.)
        cron { "spacemonc":
#ensure => absent,
command => "/root/bin/",
user => root,
hour => fqdn_rand(24),
minute => fqdn_rand(60),
And it's pretty small:

import xmlrpclib
import os
import subprocess
from socket import gethostname

proc = subprocess.Popen(["df | perl -p00e 's/\n\s//g' | grep -v ^cvmfs | grep -v hepraid[0-9][0-9]*_[0-9]"], stdout=subprocess.PIPE, shell=True)
(dfReport, err) = proc.communicate()

s = xmlrpclib.ServerProxy('')

status = s.post_report(gethostname(),dfReport)
if (status != 1):
print("Client failed");
The strange piece of perl in the middle is to stop a bad habit in df of breaking lines that have long fields (I hate that; ldapsearch and qstat also do it.) I don't want to know about cvmfs partitions, nor raid storage mounts. is installed as a service; you'll have to pinch a /etc/init.d script to start and stop it properly (or do it from the command line to start with.) And the code for is pretty small, too:

import sys
from SimpleXMLRPCServer import SimpleXMLRPCServer
from SimpleXMLRPCServer import SimpleXMLRPCRequestHandler
import time
import smtplib
import logging

if (len(sys.argv) == 2):
limit = int(sys.argv[1])
limit = 90

# Maybe put logging in some time
format='%(asctime)s %(levelname)s %(message)s',

# Email details
smtpserver = ''
recipients = ['','']
sender = ''
msgheader = "From:\r\nTo:\r\nSubject: spacemon report\r\n\r\n"

# Test the server started
session = smtplib.SMTP(smtpserver)
smtpresult = session.sendmail(sender, recipients, msgheader + "spacemond server started\n")

# Restrict to a particular path.
class RequestHandler(SimpleXMLRPCRequestHandler):
rpc_paths = ('/RPC2',)

# Create server
server = SimpleXMLRPCServer(("SOMESERVEROROTHER.COM", 8000), requestHandler=RequestHandler)
server.logRequests = 0

# Class with a method to process incoming reports
class SpaceMon:
def post_report(address,hostname,report):
full_messages = []
full_messages[:] = [] # Always empty it

lines = report.split('\n')
for l in lines[1:]:
fields = l.split()
if (len(fields) >= 5):
fs = fields[0]
pc = fields[4][:-1]
ipc = int(pc)
if (ipc >= limit ):
full_messages.append("File system " + fs + " on " + hostname + " is getting full at " + pc + " percent.\n")
if (len(full_messages) > 0):
session = smtplib.SMTP(smtpserver)
smtpresult = session.sendmail(sender, recipients, msgheader + ("").join(full_messages))
else:"Happy state for " + hostname )
return 1

# Register and serve
And now I get an email if any of my OS partitions is getting too full. It's surpising how small server software can be when you use a framework like XMLRPC. In the old days, I would have needed 200 lines of parsing code and case statements. Goodbye to all that.

by Steve Jones ( at October 14, 2014 15:23


Nagios Monitoring for Non LHC VO’s

A brief description of monitoring framework before coming to the actual topic of Non LHC VO's monitoring.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of many components to perform various functions. It can be broadly divided into
‘What to Monitor’ or Topology Aggregation:  Collection of service endpoints and metadata from different sources like GOCDB, BDII, VOMS etc. Custom topological source (VO Feeds) can also be used.
Profile Management:  Mapping of services to the test to be performed.  This service is provided by POEM ( Profile Management) database.  It provides a web based interface to group various metrics into profiles.
Monitoring: Nagios is used as monitoring engine. It is automatically configured based on the information provided by Topology Aggregator and POEM.
SAM software was developed under EGEE project at CERN and now maintained by EGI.
It is mandatory for grid sites to pass ops VO functional test to be part of WLCG. Every NGI maintains a Regional SAM Nagios and result from regional SAM Nagios also goes to central MyEGI which is used for Reliability/Availability calculation.   
UK Regional Nagios is maintained at Oxford
and a backup instance at Lancaster

There was no centralize monitoring of Non LHC VO’s for long time and it contributed to bad user experience as it was difficult to find whether a site is broken or problem at the user end.  It was decided to host a multi VO Nagios at Oxford as we had experience with WLCG Nagios.
It is currently monitoring five VO’s

Sites can look for tests associated with only their site
VO managers may be interested to see tests associated with a particular VO only

We are using VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a vo-feed available on a web server.  Currently we are maintaining this VO-feed 

VO feed provides list of services to be monitored. I am generating this VO-feed through a script

Jobs are submitted using a proxy generated from a Robot Certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like GCC version and CA version. Jobs are submitted every eight hour and this is a configurable option.  We are monitoring CREAMCE, ARC-CE and SE only. Services like BDII, WMS etc. are already monitored by Regional Nagios so there was no need for the duplication.  

For more information, these links can be consulted

by Kashif Mohammad ( at October 14, 2014 11:08

October 08, 2014

London T2

XrootD and ARGUS authentication

A couple of months ago, I  set up a test machine running XrootD version 4  at QMUL. This was to test three things:
  1. IPv6 (see blog post),
  2. Central authorisation via ARGUS (the subject of this blog post).
  3. XrootD 4
We  run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the  ATLAS federated storage system, FAX. This  allows local (and non local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's Fax for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS do however have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is, however, possible to get them to work.

The following packages are needed (or at least what I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install yum install xerces-c
  yum install lcmaps-plugins-basic

Now the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:

 xrootd.seclib /usr/lib64/
 xrootd.fslib /usr/lib64/
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1

And in /etc/xrootd/lcmaps.cfg it is necessary to change path and argus server (my argus server is obscured in the example below). My config file looks looks like:


# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note put your own argus host instead of for argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid"
             " --actionid"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

pepc -> good | bad

Then after restarting xrootd, you just need to test that it works.

It seems to work, I was successfully able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd - though perhaps if I'd had patience, it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.

by Christopher J. Walker ( at October 08, 2014 09:20

September 30, 2014

GridPP Storage

Data format descriptions

The highlight of the data area working groups meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language . The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OpeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor: the interactive tool looks and feels a bit like Eclipse but is called Integration Toolkit:
And for those eager for more, the appearance of DFDL v1.0 is imminent.

by Jens Jensen ( at September 30, 2014 18:39

September 11, 2014


Configuring CVMFS for smaller VOs

We have just configured cvmfs for t2k, hone, mice and ilc after sitting on the request for long time. The main reason for delay was the assumption that we need to change cvmfs puppet module to accommodate non lhc VOs.   It turns out to be quite straight forward with  little effort.
We are using cern cvmfs module and there was an update a month ago so it is better to keep it updated.

 Using hiera to pass parameters to module, our hiera bit for cvmfs
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';;'

One important bit is the name of cvmfs repository e.g instead of

Other slight hitch is public key distribution of various cvmfs repositories.  Installation of cvmfs also fetch cvmfs-keys-*.noarch rpm which put all the keys for cern based repository into /etc/cvmfs/keys/.

I have to copy publich key for and to /etc/cvmfs/keys. It can be fetched from  repository
wget -O
or copied from

we  distributed the keys through puppet but outside cvmfs module.
It would be great if some one can convince cern to include public keys of other repositories into cvmfs-keys-* rpm. I am sure that there is not going to be many cvmfs stratum 0s.

Last part of the configuration is to change SW_DIR in site-info.def or vo.d directory

WNs requires re-yaim  to configure SW_DIR in /etc/profile.d/  You can also edit file manually and distribute it through your favourite configuration management system.

by Kashif Mohammad ( at September 11, 2014 12:09

April 03, 2014

Tier1 Blog

Deploying 2013 worker nodes at RAL

We have just had a very busy week deploying the 2013 tranches of worker nodes here at RAL. WE had hoped to deploy these sooner but many staff were unavailable due to  conferences or  annual leave. Consequently there was a rush to ensure that we continued to meet our pledged capacity on the 1st April.

The new machines are in two tranches, 64 OCF machines and 64 Viglen machines. They all have dual Xeon E5-2650 @ 2.60GHz processors. They are running hyperthreading and each machine is configured to have 32 job slots. (Total additional job slots is 4096.)

The new machines have been put into production with the latest kernels and errata. So the next week or so will see staff at RAL doing a rolling upgrade on the batch farm to ensure that it is homogeneous, with all worker nodes running the same kernel, errata and EMI version. We are also taking the opportunity to do a slight update of the condor version, from condor-8.0.4-189770.x86_64 to condor-8.0.6-225363.x86_64.

As the new machines come in, it is also a reminder that we will be retiring the old 2008 machines. At the moment there is no hurry and we will continue to exploit whatever resources we have available.

cap-graph.weekThe graph shows the increase in HSPEC06 capacity of the past week. The HSPEC idle has increased because many machines are now being drained for kernel and errata updates.


by johnkelly at April 03, 2014 13:39

March 24, 2014


The Three Co-ordinators

It is has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.

We are presently, going through redevelopment of the site, the evaluation of new techniques for service delivery such as using Docker for containers and updating multiple services throughout the sites.

The development of the programme presented at CHEP on automation and different approaches to delivering HEP related Grid services is underway. An evaluation of container based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work on using Software Defined Networking hasn't progressed as quickly as we would have like but is still underway.

Graeme (left), Mark (center) and Gareth.

On other news, Gareth Roy is taking over as the Scotgrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University.And as Dr Who can do it, we can do. Co-ordinator Past, Present and Future all appear in the same place at the same time.

Will the fabric of Scotgrid be the same again?

Very much so.

by Mark Mitchell ( at March 24, 2014 23:13

January 06, 2014

Tier1 Blog

Tier 1 Hat Day – Christmas 2013

The Tier One Hat Day Organising Committee was in a triumphant mood during December, following its key role in the discovery of the Higgs Boson and the associated accolades from all concerned. The ‘Mexican Hat’ function that describes the Higgs potential (see picture) was clearly inspired by the tireless efforts of the T1HDOC, and we keenly await the discovery of other particles using headgear-based functions. Perhaps the ‘Egyptian Fez’ function or the ‘Russian Ushanka’ function are to be seen on the horizon in the very near future?


Fig 1: The ‘Mexican Hat’ function that describes the Higgs Potential. Note the innovative upward brim of the design.

In honour of its vital contribution, the T1HDOC chose to celebrate in the only way it knew – by organizing a Tier 1 Hat Day on the last Friday before Christmas. The turnout was excellent, we saw numerous Santas, a pirate captain, and a tinseled boater but the crowning glory was the innovative and predatory fez-crab, recently escaped from the shadowy ‘Cap Mesa’ research complex.


by Rob Appleyard at January 06, 2014 16:51

October 14, 2013


Welcome to CHEP 2013

Greetings from CHEP 2013 in a rather wet Amsterdam.

The conference season is upon us and Sam, Andy, Wahid and myself find ourselves in Amsterdam for CHEP 2013. CHEP started here in 1983 and it is hard to believe that it has been 18 months since New York.

As usual the agenda for the next 5 days is packed. Some of the highlights so far have included advanced facility monitoring, the future of C++ and Robert Lupton's excellent talk on software engineering for Science.

As with all of my visits to Amsterdam, the rain is worth mentioning. So much so that it made local news this morning. However, the venue is the rather splendid Beurs van Berlage in central Amsterdam.

CHEP 2013

There will be further updates during the week as the conference progresses.

by Mark Mitchell ( at October 14, 2013 12:53

June 04, 2013

London T2

Serial Consoles over ipmi

To get Serial Consoles over ipmi working properly with Scientific Linux 6.4 (aka RHEL 6.4 / centos 6.4) I had to modify several setting both in the BIOS and in the OS.

Hardware Configuration

For Dell C6100 I set these setting in the BIOS

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled

Note: "Redirection After Boot = Disabled" is required otherwise I get a 5 minute timeout before booting the kernel. Unfortunately with this set up you get a gap in output while the server attempts to pxeboot. However, you can interact with the BIOS and once Grub starts you will see and be able to interact with the grub and Linux boot processes.

For Dell R510/710 I set these setting in the BIOS

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled

Note: With these settings you will be unable to see the progress of the kickstart install on the non default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console

This allows you access grub via the consoles. The "serial" (ipmi) terminal will be default unless you press a key when asked during the boot process. This is only for grub and not for the rest of the linux boot process

SL6 Configuration

The last console specified in the linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!)

For the initial kickstart pxe boot I append "console=tty1 console=ttyS1,115200" to the linux kernel arguments. Here the serial console over ipmi will be the default during the install process, while the other console should echo the output of the ipmi console.

After install the kernel argument "console=ttyS1,115200" was already added to the kernel boot arguments. I have additionally added "console=tty1" before this, this may be required to enable interaction with the server via a directly connected terminal if needed.

With the ipmi port set as default (last console specified in the kernel arguments) SL6 will automatically start a getty for ttyS1. If it was not the default console we would have to add a upstart config file in /etc/init/. Note SL6 uses upstart, previous SL5 console configurations in /etc/inittab are ignored!

e.g. ttyS1.conf

start on stopping rc runlevel [345]
stop on starting runlevel [S016]

exec /sbin/agetty /dev/ttyS1 115200 vt100

by Daniel Traynor ( at June 04, 2013 14:49

October 02, 2012

National Grid Service

SHA2 certificates

We have started to issue certificates with the "new" more secure algorithms, SHA2 (or to be precise SHA256) - basically, it means that the hashing algorithm which is a part of the signature is more secure against attacks than the current SHA1 algorithm (which in turn is more secure than the older MD5).

But only to a lucky few, not to everybody.  And even they get to keep their "traditional" SHA1 certificates alongside the SHA2 one if they wish.

Because the catch is that not everything supports SHA2.  The large middleware providers have started worrying about supporting SHA2, but we only really know by testing it.

So what's the problem?  A digital signature is basically a one-way hash of something, which is encrypted with your private key: S=E(H(message)).  To verify the signature, you would re-hash the message, H(message), and also decrypt the signature with the public key (found in the certificate in the signer): D(S)=D(E(H(message)))=H(message) - and also check the validity of the certificate.

If someone has tampered with the message, the H would fail (with extremely high probability) to yield the same result, hence invalidate the signature, as D(S) would no longer be the same as H(tamper_message).

However, if you could attack the hash function and find a tamper_message which has the property that H(tamper_message)=H(message), then the signature is useless - and this is precisely the kind of problem people are worrying about today, for H being SHA1 signatures (and history repeats itself, since we went through the same stuff for MD5 some years ago.)

So we're now checking if it works. So far, we have started with PKCS#10 requests of a few lucky individuals; I'll do some SPKACs tomorrow.  If you want one to play with, send us a mail via the usual channels (eg email or helpdesk.)

Eventually, we will start issuing renewals with SHA2, but only once we're sure that they work with all the middleware out there... we also take the opportunity to test a few modernisations of extensions in the certificates.

by Jens Jensen ( at October 02, 2012 16:52

June 14, 2012

National Grid Service

Kick off - it's time for the NGS summer seminar series

In the midst of this summer of sport another event is kicking off soon but this time it's the NGS Summer Seminar series.

The first seminar will take place next Wednesday (20th June) at 10.30am (BST) and will give an overview of how accounting is done on the grid, and what it is used for.  It will cover the NGS accounting system at a high level and then go into more detail about the implementation of APEL, the accounting system for EGI, including the particular challenges involved and the plans for development.

The speaker will be Will Rogers from STFC Rutherford Appleton Laboratory who I'm sure would appreciate a good audience ready to ask lots of questions!

Please help spread the word about this event to any colleagues or organisations you think might be interested.  A Facebook event page is available so please invite your colleagues and friends!

by Gillian ( at June 14, 2012 11:09

March 14, 2011

December 21, 2010

gLite/Grid Data Management

GFAL / LCG_Util 1.11.16 release

There has been no blog post for almost half a year. It does not mean that nothing has happened since than. We devoted enormous effort to some background works (automated test bed, nightly builds and test runs, change to EMI era from EGEE, etc.). We will test the tools and the procedures in the first months of 2011, analyze if they have added value and how they could be improved. As for the visible part, we released GFAL/LCG_Util 1.11.16 (finally) in November - see the release notes. Better later than never!

by zsolt rossz molnár ( at December 21, 2010 14:11

October 29, 2010

Steve Lloyd's ATLAS Grid Tests

Upgrade to AtlasSetup

I have changed the setup for my ATLAS jobs so it uses AtlasSetup (rather than AtlasLogin). The magic lines are:

source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/ AtlasOffline $RELEASE

VO_ATLAS_SW_DIR is set up automatically and you have to set RELEASE yourself. Since AtlasSetup is only available from Release 16 onwards, jobs going to sites without Release 16 will fail.

by Steve Lloyd ( at October 29, 2010 15:11

July 23, 2010

Steve Lloyd's ATLAS Grid Tests

Steve's pages update

I have done some much needed maintenance and the gstat information is available again (from gstat2). There is also a new page giving the history of the ATLAS Hammercloud tests status

by Steve Lloyd ( at July 23, 2010 09:04

November 16, 2009

MonAMI at large

watching the ink dry

Yeah, it's been far too long since the last bit of news so here's a entry just to announce that MonAMI now has a new plugin: inklevel.

This plugin is a simple wrapper around Markus Heinz's libinklevel library. This is a nice library that allows easy discovery of how much ink is left in those expensive ink cartridges.

The library allows one to check the ink levels of Canon, Epson and HP printers. It can check printers directly attached (via the parallel port or USB port) or, for Canon printers, over the network via BJNP (a proprietary protocol that has been reverse engineered).

libinklevel supports many different printers, but not all of them. There's a small collection of printers that the library doesn't work with. There are some printers that are neither listed as working or not working. If your printer isn't listed, please let Markus know whether libinklevel works or not.

Credit for the photo goes to Matthew (purplemattfish) for his picture CISS - Day 304 of Project 365.

by Paul Millar ( at November 16, 2009 20:14

Trouble at Mill

With some unfortunate timing, it looks like the "Axis of Openness" webpages (SourceForge, Slashdot, Freshmeat, ...) have gone for a burton. There seems to be some networking problems with these sites, with web traffic timing out. Assuming traceroute output is valid, the problem appears soon after traffic leaves the Santa Clara location of the Savvis network [dead router(s)?]

This is a pain because we've just done the v0.10 release of MonAMI and both the website and the file download locations are hosted by SourceForge. Whilst SourceForge is down, no one can download MonAMI!

If you're keen to try MonAMI, in the mean-time, you can download the RPMs from the (rough and ready) dev. site:

The above site is generously hosted by the ScotGrid project [their blog].

Thanks guys!

by Paul Millar ( at November 16, 2009 20:13

October 13, 2009

Monitoring the Grid


It should be obvious to any following this blog that 10 weeks of this student-ship project have long since ended, however until now there were a few outstanding issues. I can now finally say that the project is finished and ready for public use. It can be found at, although the link may change at some point in the future & the "add to iGoogle" buttons won't work for now.

The gadget is currently configured to utilize all available data from 2009. Specifically it holds data for all jobs submitted in 2009, up-to around mid September (the most up to date data available from the Grid Observatory).

In addition to this I have produced a small report giving an overview of the project which is available here.

by Laurence Hudson ( at October 13, 2009 16:45

September 17, 2009

Steve at CERN

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm up the EGEE SA1 OAT sections a sneak preview.

by Steve Traylen ( at September 17, 2009 16:03

July 31, 2009

Monitoring the Grid

Another Update (Day 30)

As week six comes to a close I thought it was about time for another progress update. So same as last time, stealing the bullet points from 2 posts back, with new additions in italics.

  • GridLoad style stacked charts.

  • A "League Table" (totaled over the time period).

  • Pie charts (of the "League Table").

  • Filters and/or sub filters (just successful jobs, just jobs by one VO, just jobs for this CE etc).

  • A tabbed interface.

  • Regular Expression based filtering

  • Variable Y-axis parameter (jobs submitted, jobs started, jobs finished etc).

  • Transition to university web servers.

  • Move to a more dynamic chart legend style. Not done for pie chart yet.

  • Ensure w3 standards compliance/cross-browser compatibility & testing.

  • Automate back end data-source (currently using a small sample data set). Need automatic renewal of grid certificates.

  • Variable X-axis time-step granularity.

  • Data/image export option.

  • A list of minor UI improvements. (About 10 little jobs, not worth including in this list as they would be meaningless, without going into a lot more detail about how the gadget's code is implemented).

  • Optimise database queries and general efficiency testing.

  • Make the interface more friendly. (Tool-tips etc.)

  • Possible inclusion of more "real time" data.

  • Gadget documentation.

  • A Simple project webpage.

  • A JSON data-source API reference page.

  • 2nd gadget, to show all know infon for a given JobID.

  • 2nd gadget: Add view of all JobIDs for one user (DN string).

  • The items in this list are now approximately in the order I intend to approach them.

    On another note, I have finally managed to get some decent syntax highlighting for google gadgets, thanks to this blog post even if it means being stuck with VIM. To get this to work add the vim modeline to the very bottom of the xml gadget file, other wise it tends to break things, such as the gadget title, if added at the top. Whilst VIM is not my editor/IDE of choice it's pretty usable and can, with some configuration, match most of the key features (show whitespaces), I use in Geany. However Geany's folding feature would save a lot of time & scrolling.

    by Laurence Hudson ( at July 31, 2009 15:11

    June 18, 2009

    Grid Ireland

    DPM 1.7.0 upgrade

    I took advantage of a downtime to upgrade our DPM server. We need the upgrade as we want to move files around using dpm-drain and don't want to lose space token associations. As we don't use YAIM I had to run the upgrade script manually, but it wasn't too difficult. Something like this should work (after putting the password in a suitable file):

    ./dpm_db_310_to_320 --db-vendor MySQL --db $DPM_HOST --user dpmmgr --pwd-file /tmp/dpm-password --dpm-db dpm_db

    I discovered a few things to watch out for along the way though. Here's my checklist:

    1. Make sure you have enough space on your system disk: I got bitten by this on a test server. The upgrade script needs a good chunk of space (comparable to that already used by the MySQL DB?) to perform the upgrade
    2. There's a mysql setting you probably need to tweak first: add set-variable=innodb_buffer_pool_size=256M to the [mysqld] section in /etc/mysql.conf and restart mysql. Otherwise you get this cryptic error:

      Thu Jun 18 09:02:30 2009 : Starting to update the DPNS/DPM database.
      Please wait...
      failed to query and/or update the DPM database : DBD::mysql::db do failed: The total number of locks exceeds the lock table size at line 19.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      Also worth noting is that if this happens to you, when you try to re-run the script (or YAIM) you will get this error:

      failed to query and/or update the DPM database : DBD::mysql::db do failed: Duplicate column name 'r_uid' at line 18.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      This is because the script has already done this step. You need to edit /opt/lcg/share/DPM/dpm-db-310-to-320/ and comment out this line:

      $dbh_dpm->do ("ALTER TABLE dpm_get_filereq ADD r_uid INTEGER");

      You should then be able to run the script to completion.

    by Stephen Childs ( at June 18, 2009 09:19

    June 08, 2009

    Grid Ireland

    STEP '09 discoveries

    ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

    Intel 10G cards don't work well with SL4 kernels

    We're currently upgrading our networking to 10G and had it mostly in place by the time STEP'09 started. However, we discovered that the stock SL4 kernel (2.6.9) doesn't support the ixgbe 10G driver very well. It was hard to detect because we could get reasonable transmit performance but receive was limited to 30Mbit/s! It's basically an issue with interrupts (MSI-X and multi-queue weren't enabled). I compiled up a 2.6.18 SL5 kernel for SL4 and that works like a charm (once you've installed it using --nodeps).

    It's worth tuning RFIO

    We had loads of atlas analysis jobs pulling data from the SE and they were managing to saturate the read performance of our disk array. See this NorthGrid post for solutions.

    Fair-shares don't work too well if someone stuffs your queues

    We'd set up shares for the various different atlas sub-groups but the generic analysis jobs submitted via ganga were getting to use much more time. On digging deeper with Maui's diagnose -p I could see that the length of time they'd been queued was overriding the priority due to fairshare. I was able to fix this by increasing the value of FSWEIGHT in Maui's config file.

    You need to spread VOs over disk servers

    We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course this then got hammered ... we're now trying to spread out the data across multiple servers.

    by Stephen Childs ( at June 08, 2009 12:53

    March 23, 2009

    Steve at CERN

    Installed Capacity at CHEP

    This week is CHEP 09 proceeded by WLCG workshop. I presented some updates on the roll out of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
    Sites should consider taking the following actions.

    • Check gridmap. In particular the view obtained by clicking on the more label and selecting the "size by SI00 and LogicalCPUs".
    • Adjust your published #LogicalCPUS in your SubCluster. It should correspond to the number of computing cores that you have.
    • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
    The followup questions were the following. Now a chance for a a more reflected response.
    1. Will there be any opportunity to run a benchmark within the gcm framework?
      I answered that this was not possible since unless it could be executed in under 2 seconds then there was no room for it. Technically there would not be a problem with running something for longer, it could be ran rarely. We should check how the first deployment of GCM goes, longer tests are in no way planned though.
    2. What is GCM collecting and who can see its results?
      Currently no one can see on the wire since messages are encrypted. There should be a display at however currently it is down but once there it will be accessible to IGTF CA members. For now there are some test details available.
    3. When should sites start publishing the HEPSpecInt2006 benchmark?
      The management contributed "Now" which is of course correct, the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 bench mark. With the next YAIM release they will be able to publish the value also.
    4. If sites are measuring these benchmarks can they the values be made available on the worker nodes to jobs?
      Recently the new glite-wn-info made it as far as the PPS service. This allows the job to find on the WN to which GlueSubCluster it belongs. In principal this should be enough, the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that until some future date when all the WNWG recommendations are deployed along with CREAM also then this is not possible. So for now I will extend glite-
      wn-info to also return a HepSpec2006 value as configured by the site administrators.
    5. Do you know how many sites are currently publishing incorrect data?
      I did not know the answer nor is an answer easy other than collecting the ones of zero size. Checking now of 498 (468 unique?) SubClusters some 170 of them have zero LogicalCPUs.
    On a more random note a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making grid work. He got extra queues, reserved resources, process dumps and general job watching from me. It was the fist grid jobs we had approaching something similar to the analysis we now face. Quoting the gentleman from his grid experience and results using RAL he obtained his doctorate and CMS chose to use the grid.

    by Steve Traylen ( at March 23, 2009 10:11

    January 07, 2009

    GridPP Operations

    OpenSSL vulnerability

    There is a new vulnerability in OpenSSL in all versions prior to 0.9.8j, discovered by Google's security team. You will be happy to learn that the Grid PKI is not affected by the vulnerability since it uses RSA signatures throughout - only DSA signatures and ECDSA (DSA but with Elliptic Curves) are affected. (Of course you should still upgrade!)

    by Jens Jensen ( at January 07, 2009 21:06

    January 05, 2009

    GridPP Operations

    New MD5 vulnerability announced

    In 2006 two different MD5-signed certificates were created. A new stronger attack, announced last Wednesday (yes 30 Dec), allows the attacker to change more parts of the certificate, also the subject name. To use this "for fun and profit" one gets an MD5 end entity certificate from a CA (ideally one in the browser's keystore), and hacks it to create an intermediate CA which can then issue

    by Jens Jensen ( at January 05, 2009 11:12