March 28, 2015

GridPP Storage

EUDAT and GridPP

EUDAT2020 (the H2020 follow-up project to EUDAT) just finished its kick-off meeting at CSC. Might be useful to jot down a few thoughts on similarities and differences and such before it is too late.

Both EUDAT and GridPP are - as far as this blog is concerned - data e- (or cyber-) infrastructures. The infrastructure is distributed across sites, sites provide storage capacity or users, there is a common authentication and authorisation scheme, there are data discovery mechanisms, both use GOCDB for service availability.

  • EUDAT will be using CDMI as its storage interface - just like EGI does - and CDMI is in many ways fairly SRM-like. We have previously done work comparing the two.
  • EUDAT will also be doing HTTP "federations" (i.e. automatic failover when a replica is missing; this is confusingly referred to as "federation" by some people).
  • Interoperation with EGI is useful/possible/thought-about (delete as applicable). EUDAT's B2STAGE will be interfacing to EGI - there is already a mailing list for discussions.
  • GridPP's (or WLCG's) metadata management is probably a bit too confusing at the moment since there is no single file catalogue.
  • B2ACCESS is the authentication and authorisation infrastructure in EUDAT; it could interoperate with GridPP via SARoNGS (ask us at OGF44 where we will also look at AARC's relation to GridPP and EUDAT). Jos tells us that KIT also have a SARoNGS type service.
  • Referencing a file is done with a persistent identifier, rather like the LFN (Logical Filename) GridPP used to have.
  • "Easy" access via WebDAV is an option for both projects. GlobusOnline is an option (sometimes) for both projects. In fact, B2STAGE is currently using GO, but will also be using FTS.
Using FTS is particularly interesting because it should then be possible to transfer files between EUDAT and GridPP.

The differences between the projects are mainly that
  • GridPP is more mature - has had 14-15 years now to build its infrastructure; EUDAT is of course a much younger project (but then again, EUDAT is not exactly starting from scratch)
  • EUDAT is doing more "dynamic data" where the data might change later. Also looking at more support for the lifecycle.
  • EUDAT and GridPP have distinct user communities, to a first approximation at least.
  • The middleware is different; GridPP does of course offer compute where EUDAT will offer simpler server-side workflows. GridPP services are more integrated, where in EUDAT the B2 services are more separated (but will be unified by the discovery/lookup service and by B2ACCESS)
  • Authorisation mechanisms will be very different (but might hopefully interface to each other; there are plans for this in B2ACCESS).
There is some overlap between data sites in WLCG and those in EUDAT. This could lead to some interesting collaborations and cross-pollinations. Come to OGF44 and the EGI conference and talk to us about it.

by Jens Jensen at March 28, 2015 20:30

March 23, 2015

The SSI Blog

Software Management Plan Service prototype live

Software management plan guide and service

By Mike Jackson, Software Architect.

Software management plans set down goals and processes that ensure software is accessible and reusable throughout a project and beyond. To complement our guide on Writing and using a software management plan we have now developed a prototype software management plan service, powered by the Digital Curation Centre's data management plan service, DMPonline.

software, software management plan, data management plan, service, digital curation centre, research software


by m.jackson at March 23, 2015 11:00

March 20, 2015

GridPP Storage

ISGC 2015 Review and Musings..

The 2015 ISGC Conference is coming to a close, so I thought I would jot down some musings regarding some of the talks I have seen (and presented) over the last week. Not surprisingly, since the G and C are grids and clouds, a lot of talks were regarding compute; however, there were various talks on storage and data management (especially dCache). But the most interesting talk was regarding new technology which sees a CPU and network interface incorporated into an individual HDD. This can be seen here:

There were also many site discussions from the various Asian countries represented, of which network setup and storage were of particular interest (also including using InfiniBand between Singapore, Seattle and Australia). My perfSONAR talk seemed to be well received. It makes the distance our European dataflows have to travel seem trivial.

It was also interesting to listen to some of the Humanities and Arts themed talks. (First time I have ever heard post-modernism used at a conference!!) Their data volume may well be smaller than that of the WLCG VOs, but it is still complex and they use interesting visualisation methods.

by bgedavies at March 20, 2015 04:26

March 16, 2015

The SSI Blog

Releasing data service software as free open source software

Reflections of the same thing

By Mike Jackson, Software Architect.

Linked data is a way of representing and joining information from a variety of sources to allow it to be accessed, browsed, searched and used as easily as one would browse the web. One of the principles of linked data is that URIs are used to name things whether these be people, places, books, software, magazines, departments, machines and so on.

As anyone can develop their own linked data sets, and propose their own URIs, many URIs may be created for the same thing. is a service offered by Seme4 Limited that allows users to find out which URIs refer to the same thing. sameAs Lite is a refactored, open source, version of the software that powers We are providing consultancy to Seme4 on how to improve sameAs Lite for deployers and developers and to promote community engagement.

author:Mike Jackson, Software Development, Open Source, Linked Data, Service, Open Call, Research Software Group


by m.jackson at March 16, 2015 14:00

February 24, 2015


Replacing the Condor Defrag Daemon

I've replaced the standard DEFRAG daemon released with Condor with a simpler version that contains a proportional integral (PI) controller. I hoped this would give us better control over multicore slots. Preliminary results with the proportional part of the controller show that it fails to keep accurate control over the provision of slots. It is subject to hunting due to the long time lags between the onset of draining and the eventual change in the controlled variable (which is 'running mcore jobs'). The rate of provision was unexpectedly stable at first, considering the simplicity of the algorithm employed, but degraded over time as the controlled variable became more random.

The graph below shows the very preliminary picture, with a temporary period of stable control shown by the green line on the right of the plot. The setpoint is 250.

I have also now included an Integral component to the controller, and I'm in the process of tuning the reset rate on this. I hope to show the results of this test soon.
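For readers unfamiliar with PI control, a minimal sketch of the idea follows. It is not the daemon's actual code: the class name, the gains, the time step, the output limits and the interpretation of the output as "nodes to start draining" are all assumptions made for illustration. Only the setpoint of 250 running mcore jobs comes from the post.

```python
from dataclasses import dataclass


@dataclass
class PIController:
    """Textbook proportional-integral controller (illustrative only)."""
    setpoint: float        # desired value of the controlled variable
    kp: float              # proportional gain (hypothetical value)
    ki: float              # integral gain: kp times the reset rate in the classic form
    out_min: float = 0.0   # assumed: never ask for a negative number of drains
    out_max: float = 10.0  # assumed cap on drains started per cycle
    integral: float = 0.0  # accumulated error

    def update(self, measured: float, dt: float = 1.0) -> float:
        """Return the controller output for one control cycle."""
        error = self.setpoint - measured
        candidate = self.integral + error * dt
        output = self.kp * error + self.ki * candidate
        # Simple anti-windup: only accumulate while the output is not saturated.
        if self.out_min <= output <= self.out_max:
            self.integral = candidate
        return max(self.out_min, min(self.out_max, output))


controller = PIController(setpoint=250, kp=0.02, ki=0.005)
for running_mcore_jobs in (200, 250, 300):
    print(running_mcore_jobs, controller.update(running_mcore_jobs))
```

With the proportional term alone, the output drops to zero as soon as the measurement reaches the setpoint; the integral term keeps a non-zero output there, which is why tuning its reset rate matters when the plant responds as slowly as draining does.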

by Steve Jones at February 24, 2015 12:09

January 22, 2015

Tier1 Blog

Stress test of Ceph Cloud cluster

RAL has a Ceph storage cluster (referred to as the Cloud Cluster) that provides a Rados Block Device interface for our Cloud infrastructure. We recently ran a stress test of the Ceph instance.

We had 222 VMs running, of which 50 were randomly writing large volumes of data.  We realised we had maxed out when we noticed a slowdown in the responsiveness of our VMs. Increasing the number of VMs writing data did not increase the amount of data being written, so we believe we hit the limit on the cluster.

The write rate we hit into the cluster was 1044 MB/s (8.2 Gb/s), as reported by ‘Ceph status’. It is worth saying that this was the raw data in; as we store three copies, there was actually 24.6Gb/s being written (not including journaling). Investigation showed that the limiting factor was the storage node disks, which were all writing as fast as they could.

We have undertaken no optimisation with our mount commands in the Ceph configuration and this should probably be something we explore further in the future for performance gain.

The cluster currently consists of 15 storage nodes, each with 7 OSDs and 10Gb/s client and rebalancing networks.

The following graphs show the network, CPU and memory utilisation on one of the storage nodes. They are typical of the rest of the cluster. The step change represents the point where we fired up the VMs doing random writes. You will notice the network in was about 220MB/s; fifteen times this is 3300MB/s ~ 26Gb/s, which is approximately the same as the 24.6Gb/s figure I quote above, providing an independent check on the figure Ceph status quotes.
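The unit conversions in this post can be checked with a few lines of Python. The helper name and the two divisors are assumptions: the post does not say whether a gigabit was taken as 1000 or 1024 megabits, although 1044 MB/s only rounds to 8.2 Gb/s with the 1024 divisor.

```python
def mb_per_s_to_gb_per_s(mb_per_s: float, mb_per_gb: int = 1000) -> float:
    """Convert megabytes per second to gigabits per second (8 bits per byte)."""
    return mb_per_s * 8 / mb_per_gb


client_write = 1044    # MB/s, as reported by the cluster status
replicas = 3           # three copies of every object
nodes = 15             # storage nodes in the cluster
per_node_net_in = 220  # MB/s, read off the network graph

for divisor in (1000, 1024):
    raw = mb_per_s_to_gb_per_s(client_write, divisor)
    replicated = raw * replicas
    from_network = mb_per_s_to_gb_per_s(per_node_net_in * nodes, divisor)
    print(f"/{divisor}: client {raw:.1f} Gb/s, "
          f"three copies {replicated:.1f} Gb/s, "
          f"network estimate {from_network:.1f} Gb/s")
```

Whichever divisor is used, the replicated write rate and the estimate from the network graphs agree to within about 5%, which is the point of the cross-check.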





by Alastair Dewhurst at January 22, 2015 16:51

December 17, 2014

Tier1 Blog

RAL Tier1 – Plans for Christmas & New Year Holiday

RAL Tier1 – Plans for Christmas & New Year Holiday 2014/15

RAL closes at 3pm on Wednesday 24th December and will re-open on Monday 5th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

Furthermore we do not have support around the 25/26 December & 1st January for some site services we rely on. The impact of any failures around these particular dates may therefore be more extended. Also, over the holiday we have relaxed our expectation that the on-call person will respond within two hours, particularly on the specific dates just mentioned.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

Gareth Smith

by Gareth Smith at December 17, 2014 14:31

November 17, 2014


Condor Workernode Health Script

This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I made use of a Condor feature: startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.

PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
StartJobs = False
RalNodeOnline = False

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiffe at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and "RalNodeOnline"; it sets them initially to False; and it makes the START configuration setting dependent upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need this. It tells the system (startd) to run a cron script every three minutes.


# Make sure values get over
The script looks like this:


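# Default message; assumed, since the regex test further down expects OK
MESSAGE=OK

/usr/libexec/condor/scripts/ > /dev/null 2>&1
STATUS=$?

if [ $STATUS != 0 ]; then
  # Look up the name of the error code in the test script itself
  MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/ | head -n 1 | sed -e "s/=.*//"`
  if [[ -z "$MESSAGE" ]]; then
    # Fallback label for unknown codes (value assumed)
    MESSAGE=ERROR
  fi
fi

if [[ $MESSAGE =~ ^OK$ ]] ; then
  echo "RalNodeOnline = True"
else
  echo "RalNodeOnline = False"
fi
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0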

This just wraps an existing script which I reuse from our TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, I also look up the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting to False. This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as "root"; here it runs as "condor". I had to use sudo for some of the sections which (e.g.) check disks, because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs.

Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node to (say) rebuild it. It's done on the server, like this.

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01
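The decision logic of the wrapper described above (exit code in, attribute lines out) can also be sketched in Python. This is only an illustration of the logic, not a replacement for the shell script: the function name, the "ERROR" fallback label and the example code table are assumptions.

```python
from typing import Dict, List


def startd_cron_output(status: int, code_names: Dict[int, str]) -> List[str]:
    """Build the attribute lines that the cron job prints for the startd.

    status     -- exit code of the underlying node test (0 means healthy)
    code_names -- hypothetical map from exit code to a short error name
    """
    if status == 0:
        message = "OK"
    else:
        # Fall back to a generic label when the code is not in the table.
        message = code_names.get(status, "ERROR")
    online = message == "OK"
    return [
        f"RalNodeOnline = {online}",
        f"RalNodeOnlineMessage = {message}",
    ]


# Hypothetical error table, for illustration only.
codes = {3: "DISK_FULL", 7: "CVMFS_DOWN"}
for exit_code in (0, 3, 42):
    print("\n".join(startd_cron_output(exit_code, codes)))
```

The node only accepts jobs when both this attribute and the manually controlled StartJobs setting are True, so either one can take a node offline on its own.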

by Steve Jones at November 17, 2014 16:52

October 14, 2014


Nagios Monitoring for Non LHC VO’s

First, a brief description of the monitoring framework before coming to the actual topic of monitoring non-LHC VOs.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of many components performing various functions, and can be broadly divided into:
  • ‘What to Monitor’ or Topology Aggregation: collection of service endpoints and metadata from different sources such as GOCDB, BDII and VOMS. A custom topological source (VO feeds) can also be used.
  • Profile Management: mapping of services to the tests to be performed. This service is provided by the POEM (Profile Management) database, which provides a web-based interface to group various metrics into profiles.
  • Monitoring: Nagios is used as the monitoring engine. It is automatically configured based on the information provided by the Topology Aggregator and POEM.
The SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a regional SAM Nagios, and results from the regional SAM Nagios also go to the central MyEGI, which is used for the reliability/availability calculation.
The UK regional Nagios is maintained at Oxford, with a backup instance at Lancaster.

There was no centralised monitoring of non-LHC VOs for a long time, and this contributed to a bad user experience, as it was difficult to tell whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at Oxford, as we had experience with the WLCG Nagios.
It is currently monitoring five VOs.

Sites can look for tests associated with only their site
VO managers may be interested to see tests associated with a particular VO only

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO feed available on a web server. Currently we are maintaining this VO feed.

The VO feed provides the list of services to be monitored. I am generating this VO feed through a script.

Jobs are submitted using a proxy generated from a robot certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and CA version. Jobs are submitted every eight hours, and this is a configurable option. We are monitoring CREAMCE, ARC-CE and SE only. Services like BDII, WMS etc. are already monitored by the regional Nagios, so there was no need for duplication.

For more information, these links can be consulted

by Kashif Mohammad at October 14, 2014 11:08

October 08, 2014

London T2

XrootD and ARGUS authentication

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:
  1. IPv6 (see blog post),
  2. Central authorisation via ARGUS (the subject of this blog post),
  3. XrootD 4.
We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non-local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's Fax for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS do however have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is, however, possible to get them to work.

The following packages are needed (or at least what I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c
  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install xerces-c
  yum install lcmaps-plugins-basic

Now the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:

 xrootd.seclib /usr/lib64/
 xrootd.fslib /usr/lib64/
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1

And in /etc/xrootd/lcmaps.cfg it is necessary to change the path and the argus server (my argus server is obscured in the example below). My config file looks like:


# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note: put your own argus host in place of argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid"
             " --actionid"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

pepc -> good | bad

Then after restarting xrootd, you just need to test that it works.

It seems to work: I was able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd - though perhaps if I'd had patience, it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.

by Christopher J. Walker at October 08, 2014 09:20

September 11, 2014


Configuring CVMFS for smaller VOs

We have just configured cvmfs for t2k, hone, mice and ilc after sitting on the request for a long time. The main reason for the delay was the assumption that we would need to change the cvmfs puppet module to accommodate non-LHC VOs. It turned out to be quite straightforward, with little effort.
We are using the CERN cvmfs module; there was an update a month ago, so it is better to keep it updated.

We use hiera to pass parameters to the module; our hiera bit for cvmfs:
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';'
      cvmfs_server_url: ';;'

One important bit is the name of cvmfs repository e.g instead of

The other slight hitch is public key distribution for the various cvmfs repositories. Installation of cvmfs also fetches the cvmfs-keys-*.noarch rpm, which puts all the keys for CERN-based repositories into /etc/cvmfs/keys/.

I have to copy public key for and to /etc/cvmfs/keys. It can be fetched from  repository
wget -O
or copied from

We distributed the keys through puppet but outside the cvmfs module.
It would be great if someone could convince CERN to include the public keys of other repositories in the cvmfs-keys-* rpm. I am sure that there are not going to be many cvmfs stratum 0s.

Last part of the configuration is to change SW_DIR in site-info.def or vo.d directory

WNs require a re-run of YAIM to configure SW_DIR in /etc/profile.d/. You can also edit the file manually and distribute it through your favourite configuration management system.

by Kashif Mohammad at September 11, 2014 12:09

March 24, 2014


The Three Co-ordinators

It has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.

We are presently going through a redevelopment of the site, evaluating new techniques for service delivery, such as using Docker for containers, and updating multiple services throughout the sites.

The development of the programme presented at CHEP on automation and different approaches to delivering HEP related Grid services is underway. An evaluation of container based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work on using Software Defined Networking hasn't progressed as quickly as we would have liked but is still underway.

Graeme (left), Mark (center) and Gareth.

In other news, Gareth Roy is taking over as the Scotgrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University. And as Dr Who can do it, we can do it too: Co-ordinators Past, Present and Future all appear in the same place at the same time.

Will the fabric of Scotgrid be the same again?

Very much so.

by Mark Mitchell at March 24, 2014 23:13

October 14, 2013


Welcome to CHEP 2013

Greetings from CHEP 2013 in a rather wet Amsterdam.

The conference season is upon us and Sam, Andy, Wahid and myself find ourselves in Amsterdam for CHEP 2013. CHEP started here in 1983 and it is hard to believe that it has been 18 months since New York.

As usual the agenda for the next 5 days is packed. Some of the highlights so far have included advanced facility monitoring, the future of C++ and Robert Lupton's excellent talk on software engineering for Science.

As with all of my visits to Amsterdam, the rain is worth mentioning. So much so that it made local news this morning. However, the venue is the rather splendid Beurs van Berlage in central Amsterdam.

CHEP 2013

There will be further updates during the week as the conference progresses.

by Mark Mitchell at October 14, 2013 12:53

June 04, 2013

London T2

Serial Consoles over ipmi

To get serial consoles over IPMI working properly with Scientific Linux 6.4 (aka RHEL 6.4 / CentOS 6.4) I had to modify several settings both in the BIOS and in the OS.

Hardware Configuration

For Dell C6100 I set these settings in the BIOS

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled

Note: "Redirection After Boot = Disabled" is required otherwise I get a 5 minute timeout before booting the kernel. Unfortunately with this set up you get a gap in output while the server attempts to pxeboot. However, you can interact with the BIOS and once Grub starts you will see and be able to interact with the grub and Linux boot processes.

For Dell R510/710 I set these settings in the BIOS

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled

Note: With these settings you will be unable to see the progress of the kickstart install on the non default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console

This allows you to access grub via the consoles. The "serial" (ipmi) terminal will be the default unless you press a key when asked during the boot process. This is only for grub and not for the rest of the Linux boot process.

SL6 Configuration

The last console specified in the linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!)

For the initial kickstart pxe boot I append "console=tty1 console=ttyS1,115200" to the linux kernel arguments. Here the serial console over ipmi will be the default during the install process, while the other console should echo the output of the ipmi console.

After install, the kernel argument "console=ttyS1,115200" was already added to the kernel boot arguments. I have additionally added "console=tty1" before this; this may be required to enable interaction with the server via a directly connected terminal if needed.
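The two rules above (the last console= argument becomes the default console, and listing the same console twice causes trouble) are easy to check mechanically. The following is a small sketch with made-up function names; it only parses a kernel command line string and does not touch any system state.

```python
from typing import List, Optional, Set


def console_devices(cmdline: str) -> List[str]:
    """Return the devices named by console= arguments, in order."""
    devices = []
    for arg in cmdline.split():
        if arg.startswith("console="):
            # Drop options such as the baud rate: ttyS1,115200 -> ttyS1
            devices.append(arg[len("console="):].split(",")[0])
    return devices


def default_console(cmdline: str) -> Optional[str]:
    """The last console= argument is the default console (None if absent)."""
    devices = console_devices(cmdline)
    return devices[-1] if devices else None


def duplicated_consoles(cmdline: str) -> Set[str]:
    """Devices listed more than once, which the text above warns against."""
    devices = console_devices(cmdline)
    return {d for d in devices if devices.count(d) > 1}


cmdline = "ro root=/dev/sda1 console=tty1 console=ttyS1,115200"
print(default_console(cmdline))
print(duplicated_consoles(cmdline))
```

On a running system the string to check could be read from /proc/cmdline.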

With the ipmi port set as default (last console specified in the kernel arguments) SL6 will automatically start a getty for ttyS1. If it was not the default console we would have to add an upstart config file in /etc/init/. Note SL6 uses upstart; previous SL5 console configurations in /etc/inittab are ignored!

e.g. ttyS1.conf

start on stopping rc runlevel [345]
stop on starting runlevel [S016]

exec /sbin/agetty /dev/ttyS1 115200 vt100

by Daniel Traynor at June 04, 2013 14:49

October 02, 2012

National Grid Service

SHA2 certificates

We have started to issue certificates with the "new" more secure algorithms, SHA2 (or to be precise SHA256) - basically, it means that the hashing algorithm which is a part of the signature is more secure against attacks than the current SHA1 algorithm (which in turn is more secure than the older MD5).

But only to a lucky few, not to everybody.  And even they get to keep their "traditional" SHA1 certificates alongside the SHA2 one if they wish.

Because the catch is that not everything supports SHA2.  The large middleware providers have started worrying about supporting SHA2, but we only really know by testing it.

So what's the problem? A digital signature is basically a one-way hash of something, which is encrypted with your private key: S=E(H(message)). To verify the signature, you would re-hash the message, H(message), and also decrypt the signature with the public key (found in the certificate of the signer): D(S)=D(E(H(message)))=H(message) - and also check the validity of the certificate.

If someone has tampered with the message, the H would fail (with extremely high probability) to yield the same result, hence invalidate the signature, as D(S) would no longer be the same as H(tamper_message).

However, if you could attack the hash function and find a tamper_message which has the property that H(tamper_message)=H(message), then the signature is useless - and this is precisely the kind of problem people are worrying about today, for H being SHA1 (and history repeats itself, since we went through the same stuff for MD5 some years ago).
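To make the S=E(H(message)) notation concrete, here is a toy sketch. It is deliberately insecure and is not how certificates are really signed: the RSA key is the well-known textbook example (p=61, q=53), the hash is SHA-256 reduced modulo n so that it fits the tiny key, there is no padding, and all function names are made up for illustration. Because this reduced hash has only 3233 possible values, a colliding message can be found by brute force, which shows why a signature is only as good as its hash function.

```python
import hashlib
from itertools import count

# Textbook RSA toy key (p=61, q=53); far too small for real use.
MODULUS, PUBLIC_EXP, PRIVATE_EXP = 3233, 17, 2753


def toy_hash(message: bytes) -> int:
    """H(message): SHA-256 reduced modulo the toy modulus."""
    digest = hashlib.sha256(message).digest()
    return int.from_bytes(digest, "big") % MODULUS


def sign(message: bytes) -> int:
    """S = E(H(message)): transform the hash with the private key."""
    return pow(toy_hash(message), PRIVATE_EXP, MODULUS)


def verify(message: bytes, signature: int) -> bool:
    """Check that D(S), computed with the public key, equals H(message)."""
    return pow(signature, PUBLIC_EXP, MODULUS) == toy_hash(message)


def find_collision(message: bytes) -> bytes:
    """Brute-force a different message with the same toy hash."""
    target = toy_hash(message)
    for i in count():
        candidate = f"tampered message {i}".encode()
        if candidate != message and toy_hash(candidate) == target:
            return candidate


original = b"pay Alice 10 pounds"
signature = sign(original)
forged = find_collision(original)
print(verify(original, signature))  # the genuine message verifies
print(forged, verify(forged, signature))  # so does the colliding one
```

Real attacks on MD5 and SHA1 are far more sophisticated than this exhaustive search, but the consequence is the same: once H(tamper_message)=H(message) can be arranged, the signature no longer distinguishes the two messages.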

So we're now checking if it works. So far, we have started with PKCS#10 requests of a few lucky individuals; I'll do some SPKACs tomorrow. If you want one to play with, send us a mail via the usual channels (e.g. email or helpdesk).

Eventually, we will start issuing renewals with SHA2, but only once we're sure that they work with all the middleware out there... we also take the opportunity to test a few modernisations of extensions in the certificates.

by Jens Jensen at October 02, 2012 16:52

June 14, 2012

National Grid Service

Kick off - it's time for the NGS summer seminar series

In the midst of this summer of sport another event is kicking off soon but this time it's the NGS Summer Seminar series.

The first seminar will take place next Wednesday (20th June) at 10.30am (BST) and will give an overview of how accounting is done on the grid, and what it is used for.  It will cover the NGS accounting system at a high level and then go into more detail about the implementation of APEL, the accounting system for EGI, including the particular challenges involved and the plans for development.

The speaker will be Will Rogers from STFC Rutherford Appleton Laboratory who I'm sure would appreciate a good audience ready to ask lots of questions!

Please help spread the word about this event to any colleagues or organisations you think might be interested.  A Facebook event page is available so please invite your colleagues and friends!

by Gillian at June 14, 2012 11:09

March 14, 2011

December 21, 2010

gLite/Grid Data Management

GFAL / LCG_Util 1.11.16 release

There has been no blog post for almost half a year. It does not mean that nothing has happened since then. We devoted enormous effort to some background work (automated test bed, nightly builds and test runs, the change to the EMI era from EGEE, etc.). We will test the tools and the procedures in the first months of 2011, and analyze whether they have added value and how they could be improved. As for the visible part, we released GFAL/LCG_Util 1.11.16 (finally) in November - see the release notes. Better late than never!

by zsolt rossz molnár at December 21, 2010 14:11

October 29, 2010

Steve Lloyd's ATLAS Grid Tests

Upgrade to AtlasSetup

I have changed the setup for my ATLAS jobs so it uses AtlasSetup (rather than AtlasLogin). The magic lines are:

source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/ AtlasOffline $RELEASE

VO_ATLAS_SW_DIR is set up automatically and you have to set RELEASE yourself. Since AtlasSetup is only available from Release 16 onwards, jobs going to sites without Release 16 will fail.

by Steve Lloyd at October 29, 2010 15:11

July 23, 2010

Steve Lloyd's ATLAS Grid Tests

Steve's pages update

I have done some much needed maintenance and the gstat information is available again (from gstat2). There is also a new page giving the history of the ATLAS Hammercloud tests status.

by Steve Lloyd at July 23, 2010 09:04

November 16, 2009

MonAMI at large

watching the ink dry

Yeah, it's been far too long since the last bit of news so here's an entry just to announce that MonAMI now has a new plugin: inklevel.

This plugin is a simple wrapper around Markus Heinz's libinklevel library. This is a nice library that allows easy discovery of how much ink is left in those expensive ink cartridges.

The library allows one to check the ink levels of Canon, Epson and HP printers. It can check printers directly attached (via the parallel port or USB port) or, for Canon printers, over the network via BJNP (a proprietary protocol that has been reverse engineered).

libinklevel supports many different printers, but not all of them. There's a small collection of printers that the library doesn't work with. There are some printers that are listed as neither working nor not working. If your printer isn't listed, please let Markus know whether libinklevel works or not.

Credit for the photo goes to Matthew (purplemattfish) for his picture CISS - Day 304 of Project 365.

by Paul Millar at November 16, 2009 20:14

Trouble at Mill

With some unfortunate timing, it looks like the "Axis of Openness" webpages (SourceForge, Slashdot, Freshmeat, ...) have gone for a burton. There seem to be some networking problems with these sites, with web traffic timing out. Assuming traceroute output is valid, the problem appears soon after traffic leaves the Santa Clara location of the Savvis network [dead router(s)?].

This is a pain because we've just done the v0.10 release of MonAMI and both the website and the file download locations are hosted by SourceForge. Whilst SourceForge is down, no one can download MonAMI!

If you're keen to try MonAMI, in the mean-time, you can download the RPMs from the (rough and ready) dev. site:

The above site is generously hosted by the ScotGrid project [their blog].

Thanks guys!

by Paul Millar at November 16, 2009 20:13

October 13, 2009

Monitoring the Grid


It should be obvious to anyone following this blog that the 10 weeks of this studentship project have long since ended; however, until now there were a few outstanding issues. I can now finally say that the project is finished and ready for public use. It can be found at, although the link may change at some point in the future & the "add to iGoogle" buttons won't work for now.

The gadget is currently configured to utilize all available data from 2009. Specifically it holds data for all jobs submitted in 2009, up to around mid September (the most up to date data available from the Grid Observatory).

In addition to this I have produced a small report giving an overview of the project which is available here.

by Laurence Hudson at October 13, 2009 16:45

September 17, 2009

Steve at CERN

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm up the EGEE SA1 OAT sections a sneak preview.

by Steve Traylen at September 17, 2009 16:03

July 31, 2009

Monitoring the Grid

Another Update (Day 30)

As week six comes to a close I thought it was about time for another progress update. So same as last time, stealing the bullet points from 2 posts back, with new additions in italics.

  • GridLoad style stacked charts.

  • A "League Table" (totaled over the time period).

  • Pie charts (of the "League Table").

  • Filters and/or sub filters (just successful jobs, just jobs by one VO, just jobs for this CE etc).

  • A tabbed interface.

  • Regular Expression based filtering

  • Variable Y-axis parameter (jobs submitted, jobs started, jobs finished etc).

  • Transition to university web servers.

  • Move to a more dynamic chart legend style. Not done for pie chart yet.

  • Ensure w3 standards compliance/cross-browser compatibility & testing.

  • Automate back end data-source (currently using a small sample data set). Need automatic renewal of grid certificates.

  • Variable X-axis time-step granularity.

  • Data/image export option.

  • A list of minor UI improvements. (About 10 little jobs, not worth including in this list as they would be meaningless, without going into a lot more detail about how the gadget's code is implemented).

  • Optimise database queries and general efficiency testing.

  • Make the interface more friendly. (Tool-tips etc.)

  • Possible inclusion of more "real time" data.

  • Gadget documentation.

  • A Simple project webpage.

  • A JSON data-source API reference page.

  • 2nd gadget, to show all known info for a given JobID.

  • 2nd gadget: Add view of all JobIDs for one user (DN string).

  • The items in this list are now approximately in the order I intend to approach them.

    On another note, I have finally managed to get some decent syntax highlighting for Google gadgets, thanks to this blog post, even if it means being stuck with VIM. To get this to work, add the vim modeline to the very bottom of the XML gadget file; otherwise it tends to break things, such as the gadget title, if added at the top. Whilst VIM is not my editor/IDE of choice, it's pretty usable and can, with some configuration, match most of the key features (such as showing whitespace) that I use in Geany. However, Geany's folding feature would save a lot of time & scrolling.

    by Laurence Hudson at July 31, 2009 15:11

    June 18, 2009

    Grid Ireland

    DPM 1.7.0 upgrade

    I took advantage of a downtime to upgrade our DPM server. We need the upgrade as we want to move files around using dpm-drain and don't want to lose space token associations. As we don't use YAIM I had to run the upgrade script manually, but it wasn't too difficult. Something like this should work (after putting the password in a suitable file):

    ./dpm_db_310_to_320 --db-vendor MySQL --db $DPM_HOST --user dpmmgr --pwd-file /tmp/dpm-password --dpm-db dpm_db
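For the password file itself, here is a minimal Python sketch of one way to create it so that only the owner can read it. The helper names `write_password_file` and `is_private` are made up for illustration, and whether the upgrade script expects a trailing newline in the file is not something checked here, so the sketch writes the password exactly as given.

```python
import os
import stat


def write_password_file(path: str, password: str) -> None:
    """Create `path` with mode 0600 and write the password into it.

    O_EXCL makes the call fail if the file already exists, so a file
    pre-created by someone else (e.g. in /tmp) is never reused.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        # Assumption: no trailing newline is needed by the consumer.
        handle.write(password)


def is_private(path: str) -> bool:
    """Return True if neither group nor others have any access to `path`."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


# Hypothetical usage, matching the path used in the command above:
#   write_password_file("/tmp/dpm-password", "the-dpmmgr-password")
#   assert is_private("/tmp/dpm-password")
```

Remember to delete the file once the upgrade has finished.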

    I discovered a few things to watch out for along the way though. Here's my checklist:

    1. Make sure you have enough space on your system disk: I got bitten by this on a test server. The upgrade script needs a good chunk of space (comparable to that already used by the MySQL DB?) to perform the upgrade.
    2. There's a mysql setting you probably need to tweak first: add set-variable=innodb_buffer_pool_size=256M to the [mysqld] section in /etc/mysql.conf and restart mysql. Otherwise you get this cryptic error:

      Thu Jun 18 09:02:30 2009 : Starting to update the DPNS/DPM database.
      Please wait...
      failed to query and/or update the DPM database : DBD::mysql::db do failed: The total number of locks exceeds the lock table size at line 19.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      Also worth noting is that if this happens to you, when you try to re-run the script (or YAIM) you will get this error:

      failed to query and/or update the DPM database : DBD::mysql::db do failed: Duplicate column name 'r_uid' at line 18.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      This is because the script has already done this step. You need to edit /opt/lcg/share/DPM/dpm-db-310-to-320/ and comment out this line:

      $dbh_dpm->do ("ALTER TABLE dpm_get_filereq ADD r_uid INTEGER");

      You should then be able to run the script to completion.

    by Stephen Childs at June 18, 2009 09:19

    June 08, 2009

    Grid Ireland

    STEP '09 discoveries

    ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

    Intel 10G cards don't work well with SL4 kernels

    We're currently upgrading our networking to 10G and had it mostly in place by the time STEP'09 started. However, we discovered that the stock SL4 kernel (2.6.9) doesn't support the ixgbe 10G driver very well. It was hard to detect because we could get reasonable transmit performance but receive was limited to 30Mbit/s! It's basically an issue with interrupts (MSI-X and multi-queue weren't enabled). I compiled up a 2.6.18 SL5 kernel for SL4 and that works like a charm (once you've installed it using --nodeps).

    It's worth tuning RFIO

    We had loads of ATLAS analysis jobs pulling data from the SE and they were managing to saturate the read performance of our disk array. See this NorthGrid post for solutions.

    Fair-shares don't work too well if someone stuffs your queues

    We'd set up shares for the various different ATLAS sub-groups, but the generic analysis jobs submitted via Ganga were getting to use much more time. On digging deeper with Maui's diagnose -p I could see that the length of time they'd been queued was overriding the priority due to fairshare. I was able to fix this by increasing the value of FSWEIGHT in Maui's config file.
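To see why a long queue time can swamp fairshare, here is a deliberately simplified Python sketch. It is not Maui's actual formula: it assumes priority is just a weighted sum of minutes queued and a fairshare term (target share minus recent usage, in percent). The function name, parameter names and all numbers are invented for illustration.

```python
def priority(minutes_queued: float,
             fs_target_pct: float,
             fs_usage_pct: float,
             queue_weight: float = 1.0,
             fs_weight: float = 1.0) -> float:
    """Toy priority: weighted queue time plus weighted fairshare delta."""
    queue_part = queue_weight * minutes_queued
    # Positive when a group is under its target share, negative when over.
    fs_part = fs_weight * (fs_target_pct - fs_usage_pct)
    return queue_part + fs_part


if __name__ == "__main__":
    for fs_weight in (1, 100):
        # Generic job: queued for 10 hours, group well over its share.
        generic = priority(600, fs_target_pct=10, fs_usage_pct=40,
                           fs_weight=fs_weight)
        # Sub-group job: queued for 30 minutes, group well under its share.
        subgroup = priority(30, fs_target_pct=30, fs_usage_pct=5,
                            fs_weight=fs_weight)
        print(f"fs_weight={fs_weight}: generic={generic}, subgroup={subgroup}")
```

With the fairshare weight left small, the job that has simply waited longest wins; raising the weight lets the under-served group jump ahead, which is the effect increasing FSWEIGHT had here.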

    You need to spread VOs over disk servers

    We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course this then got hammered ... we're now trying to spread out the data across multiple servers.

    by Stephen Childs at June 08, 2009 12:53

    March 23, 2009

    Steve at CERN

    Installed Capacity at CHEP

    This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the roll-out of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
    Sites should consider taking the following actions.

    • Check gridmap. In particular the view obtained by clicking on the more label and selecting the "size by SI00 and LogicalCPUs".
    • Adjust your published #LogicalCPUs in your SubCluster. It should correspond to the number of computing cores that you have.
    • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
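As a rough illustration of why these two numbers matter, the Python sketch below assumes that a SubCluster's contribution to site power is simply its LogicalCPUs multiplied by its per-core SI00 value, and that a site's total is the sum over its SubClusters. That is a reading of the "size by SI00 and LogicalCPUs" view, not a statement of how gridmap computes it; the class, the function names and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubCluster:
    name: str
    logical_cpus: int   # published LogicalCPUs (number of cores)
    si00_per_core: int  # published SpecInt2000 rating per core

    def capacity(self) -> int:
        """Assumed capacity: cores multiplied by per-core SI00."""
        return self.logical_cpus * self.si00_per_core


def site_capacity(subclusters: List[SubCluster]) -> int:
    """Total assumed capacity of a site across all its SubClusters."""
    return sum(sc.capacity() for sc in subclusters)


def zero_capacity(subclusters: List[SubCluster]) -> List[str]:
    """Names of SubClusters that would count for nothing under this metric."""
    return [sc.name for sc in subclusters if sc.capacity() == 0]


if __name__ == "__main__":
    # Made-up sample data: the second SubCluster publishes no cores.
    site = [SubCluster("sc-a", 8, 1500), SubCluster("sc-b", 0, 2000)]
    print(site_capacity(site), zero_capacity(site))
```

Under this assumption a SubCluster that publishes zero for either value contributes nothing, however much hardware is actually behind it.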
    The follow-up questions were the following. Now a chance for a more reflective response.
    1. Will there be any opportunity to run a benchmark within the gcm framework?
      I answered that this was not possible: unless the benchmark could be executed in under 2 seconds, there was no room for it. Technically there would not be a problem with running something for longer, as it could be run rarely. We should check how the first deployment of GCM goes; longer tests are in no way planned though.
    2. What is GCM collecting and who can see its results?
      Currently no one can see on the wire since messages are encrypted. There should be a display at however currently it is down but once there it will be accessible to IGTF CA members. For now there are some test details available.
    3. When should sites start publishing the HEPSpecInt2006 benchmark?
      The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will also be able to publish the value.
    4. If sites are measuring these benchmarks, can the values be made available on the worker nodes to jobs?
      Recently the new glite-wn-info made it as far as the PPS service. This allows a job on the WN to find out which GlueSubCluster it belongs to. In principle this should be enough, since the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that this is not possible until some future date when all the WNWG recommendations are deployed along with CREAM. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
    5. Do you know how many sites are currently publishing incorrect data?
      I did not know the answer, nor is one easy to obtain other than by counting the SubClusters of zero size. Checking now, of 498 (468 unique?) SubClusters some 170 have zero LogicalCPUs.
    On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making grid work. He got extra queues, reserved resources, process dumps and general job watching from me. They were the first grid jobs we had approaching something similar to the analysis we now face. Quoting the gentleman: from his grid experience and results using RAL he obtained his doctorate and CMS chose to use the grid.

    by Steve Traylen at March 23, 2009 10:11

    January 07, 2009

    GridPP Operations

    OpenSSL vulnerability

    There is a new vulnerability in OpenSSL in all versions prior to 0.9.8j, discovered by Google's security team. You will be happy to learn that the Grid PKI is not affected by the vulnerability since it uses RSA signatures throughout - only DSA signatures and ECDSA (DSA but with Elliptic Curves) are affected. (Of course you should still upgrade!)
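If you want to check a batch of hosts, a small Python sketch like the following can compare reported version strings against 0.9.8j. It assumes the classic OpenSSL numbering of three numbers followed by an optional patch letter (0.9.8, 0.9.8a, ... 0.9.8j); anything else, such as beta or vendor suffixes, is rejected rather than guessed at. The function names are made up for illustration, and distribution packages may backport the fix without changing the upstream version string, so treat the result as a hint only.

```python
import re
from typing import Tuple

_VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)([a-z]*)$")

FIXED_IN = "0.9.8j"


def parse_version(version: str) -> Tuple[int, int, int, str]:
    """Split e.g. '0.9.8j' into (0, 9, 8, 'j') for ordered comparison."""
    match = _VERSION_RE.match(version.strip())
    if match is None:
        raise ValueError(f"unrecognised OpenSSL version string: {version!r}")
    major, minor, fix, letter = match.groups()
    # An empty patch letter sorts before 'a', so 0.9.8 < 0.9.8a.
    return int(major), int(minor), int(fix), letter


def predates_fix(version: str) -> bool:
    """True if `version` is older than the release containing the fix."""
    return parse_version(version) < parse_version(FIXED_IN)


if __name__ == "__main__":
    for candidate in ("0.9.7m", "0.9.8", "0.9.8i", "0.9.8j", "1.0.0"):
        print(candidate, predates_fix(candidate))
```

Note that Python's own `ssl.OPENSSL_VERSION` reports the library Python was linked against, which is not necessarily the one your grid services use.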

    by Jens Jensen at January 07, 2009 21:06

    January 05, 2009

    GridPP Operations

    New MD5 vulnerability announced

    In 2006 two different MD5-signed certificates were created. A new stronger attack, announced last Wednesday (yes 30 Dec), allows the attacker to change more parts of the certificate, also the subject name. To use this "for fun and profit" one gets an MD5 end entity certificate from a CA (ideally one in the browser's keystore), and hacks it to create an intermediate CA which can then issue

    by Jens Jensen at January 05, 2009 11:12