February 09, 2016

The SSI Blog

Introductions, peer-review of plans, a giant squid and speed blogging - Fellows 2016 inaugural meeting

13 of this year's Fellows 2016 met in the Neil Chalmers lecture theatre at the rather grand Natural History Museum in London. The inaugural meeting allows the Fellows to get an introduction to the Institute, to receive feedback on their plans and to discuss something topical in research software.

(best of the tweets)


read more

by s.sufi at February 09, 2016 14:27

January 28, 2016

The SSI Blog

Community update: Fellows 2016, Collaborations Workshop 2016, snakes and support

By Shoaib Sufi, Community Lead.

For some, November and December are a time to wind down and meet the holiday season at a gentle pace … but not at the Institute! The community team, with the help of the wider Institute, fellows and friends, was busy putting together and finalising the building blocks for a most productive 2016 of research software advocacy!

Read on to find out more about Fellows 2016 and Collaborations Workshop 2016 (CW16), and to see where we are supporting research software related activities in … you guessed it!... 2016 (and beyond).

In addition, there are two reports from our Fellows, on the rise of Python in HPC and on software training.


read more

by n.chuehong at January 28, 2016 14:28

January 27, 2016

GridPP Storage

Xrootd for all

Xrootd provides, amongst other things, a convenient method to externally access files at a site anywhere in the world using your grid credentials.

Specialist storage systems such as DPM and dCache now include an xrootd server in their deployment.
If you use a standard POSIX file system (e.g. Lustre, GPFS, NFS), it's possible to set up a standalone xrootd server to export all or part of the file system to external clients.

The LHC experiments have gone a step further and have set up federated storage services, combining storage from several separate sites into one namespace and allowing seamless client access without having to worry about where the data is stored. They have provided instructions to set up such a service, but only for their own VO.

Extending this to allow other VOs to access their data via xrootd, but without the federated storage service, is simple.

The xrootd server runs as the user xrootd. In order to access files it must have the correct permissions on them. This can be done by making the xrootd user a member of the appropriate groups across the site (e.g. via NIS):

ypcat -k group 
dteam dteam:x:12345:user1,user2, …, xrootd
atlas atlas:x:13345:user1,user2, …, xrootd

For simplicity I'll make a symlink to the file system I want to export on the xrootd server, e.g.

ln -sf /mnt/lustre_2/storm_3/atlas/ /atlas
ln -sf /mnt/lustre_2/storm_3/dteam/ /dteam

The xrootd server configuration file is /etc/xrootd/xrootd-clustered.cfg. Within this file we need to define the file system to export and do so read only for security,

all.export /dteam r/o
all.export /atlas r/o

We also need to add the VOs to the X509 configuration.

sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw | vos=atlas,dteam | grps=/atlas,/dteam
acc.authdb /etc/xrootd/auth_file

The /etc/xrootd/auth_file specifies the group/user access rights. The following gives read and list rights to members of the atlas group for files under /atlas, and to dteam group members for files under /dteam:

g /atlas /atlas rl
g /dteam /dteam rl

The final configuration files look like this:

cat xrootd-clustered.cfg
frm.xfr.copycmd /bin/cp /dev/null $PFN

# atlas redirection
all.manager atlas-xrd-uk.cern.ch+:1098
xrootd.redirect atlas-xrd-uk.cern.ch:1094 ? /atlas
all.sitename SITENAME

all.export /dteam r/o
all.export /atlas r/o

all.role server
all.adminpath /var/run/xrootd
all.pidpath /var/run/xrootd
xrootd.async off

# atlas Monitoring
if exec xrootd
xrd.report atl-prod05.slac.stanford.edu:9931 every 60s all -buff -poll sync
# if your site is in the EU uncomment the next line
xrootd.monitor all flush 30s window 5s fstat 60 lfn ops xfr 5 dest redir files info user atlas-fax-eu-collector.cern.ch:9330

# N2N configuration. Please change for your site
oss.namelib /usr/lib64/XrdOucName2NameLFC.so

# X509 configuration, change nothing
xrootd.seclib /usr/lib64/libXrdSec.so
sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw|vos=atlas,dteam|grps=/atlas,/dteam
sec.protocol /usr/lib64 gsi -ca:1 -crl:3 -gridmap:/dev/null
acc.authdb /etc/xrootd/auth_file
acc.authrefresh 60

[root@xrootd02 xrootd]# cat auth_file
g /atlas /atlas rl
g /dteam /dteam rl

Additional file systems can be added in the same fashion for more VOs, e.g. snoplus, t2k, etc.
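For instance, adding the snoplus VO mentioned above would touch the same three places as before (the Lustre path here is illustrative, following the pattern used earlier):

```
# 1. symlink the VO's area into the exported namespace (path illustrative)
ln -sf /mnt/lustre_2/storm_3/snoplus/ /snoplus

# 2. in /etc/xrootd/xrootd-clustered.cfg, export it read-only and add the VO:
all.export /snoplus r/o
sec.protparm gsi -vomsfun:/usr/lib64/libXrdSecgsiVOMS.so -vomsfunparms:certfmt=raw|vos=atlas,dteam,snoplus|grps=/atlas,/dteam,/snoplus

# 3. in /etc/xrootd/auth_file, grant the group read/list rights:
g /snoplus /snoplus rl
```

Remember that the xrootd user also needs membership of the new VO's group, as described above.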

It's also possible to use an ARGUS server for the authentication: http://londongrid.blogspot.co.uk/2014/10/xrootd-and-argus-authentication.html

The bandwidth to the file system will be limited by the performance of the xrootd server. For local file access it's still better to use native POSIX access, especially with parallel file systems like Lustre.

by Daniel Traynor (noreply@blogger.com) at January 27, 2016 22:02

January 05, 2016

GridPP Storage

Update on vo.dirac.ac.uk data movement and filesize distribution.

So....... I should have known that the information I posted in the blog post in November of last year would soon be out of date; but I didn't think it would be this soon! DiRAC have successfully developed their system to tar and split their data samples before transferring into the RAL Tier1. This system has dramatically increased the data transfer rates.
What has also changed is the number of files per tape, due to the change in the average file size per tape.
This has meant the number of files per tape varied from a starting value of 2-3 thousand per tape, swelling to 2-3 million, before finally settling on 20-40 per tape (the file size is now ~250-300 GB per file).
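The tar-and-split preparation itself can be sketched in a few lines of shell (the directory name and chunk size are illustrative; the production chunks are ~250 GB):

```shell
# Sketch of DiRAC-style tar-and-split preparation before an FTS transfer.
# Streaming tar through split avoids writing a second full copy to disk.
mkdir -p sample_dir
echo "demo data" > sample_dir/file1
CHUNK=1k   # ~250G in production; tiny here purely for illustration
tar -cf - sample_dir | split -b "$CHUNK" -d - sample.tar.
# the numbered chunks (sample.tar.00, sample.tar.01, ...) are what gets
# transferred; the receiving end reassembles with:  cat sample.tar.* | tar -xf -
ls sample.tar.*
```

The same pipeline scales to the real chunk size simply by changing CHUNK.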
Moving large files requires good transfer rates, which we have been able to achieve, as can be seen in this log snippet:

Tue Dec 29 08:00:28 2015 INFO bytes: 293193121792, avg KB/sec:286321, inst KB/sec:308224, elapsed:1001
Tue Dec 29 08:00:33 2015 INFO bytes: 294824181760, avg KB/sec:286481, inst KB/sec:318566, elapsed:1006
Tue Dec 29 08:00:38 2015 INFO bytes: 296458387456, avg KB/sec:286643, inst KB/sec:319180, elapsed:1011
Tue Dec 29 08:00:43 2015 INFO bytes: 298053795840, avg KB/sec:286766, inst KB/sec:311603, elapsed:1016
Tue Dec 29 08:00:45 2015 INFO bytes: 298822410240, avg KB/sec:286715, inst KB/sec:268071, elapsed:1018

Incidentally, the large file size also helps reduce the overall rate loss due to the individual setup and completion overhead per transfer (an overhead of ~15 seconds for this file, which then took 1018 seconds to transfer). This has allowed us to transfer ~125 TB of data over the new year period:

And a completion rate of ~90%

However, the low number of transfers does not allow the FTS optimizer to change its settings so as to improve the throughput rate:

Let's hope we can continue at this rate. My next step is to look at the rate at which we can create the tarballs on the source host in preparation for transfer, and at whether this technique can be applied at other source sites within vo.dirac.ac.uk.

by bgedavies (noreply@blogger.com) at January 05, 2016 11:08

December 16, 2015

Tier1 Blog

RAL Tier1 – Plans for Christmas & New Year Holiday

RAL Tier1 – Plans for Christmas & New Year Holiday 2015/16

RAL closes at the end of the working day on Thursday 24th December and will re-open on Monday 4th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

Furthermore, we do not have support around 25/26 December and 1 January for some site services we rely on. The impact of any failures around these particular dates may therefore be more extended. Also, over the holiday we have relaxed our expectation that the on-call person will respond within two hours, particularly on the specific dates just mentioned.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:


Gareth Smith

by Gareth Smith at December 16, 2015 09:52

July 27, 2015


Simple CVMFS puppet Module

Oxford was one of the first sites to test CVMFS and also to use the CERN CVMFS puppet module. Initially, installation of CVMFS was not well documented, so the CERN puppet module was very helpful in installing and configuring it.
Installation became easier and clearer with newer versions of CVMFS. One of my ops actions was to install the GridPP multi-VO CVMFS repo with the CERN CVMFS puppet module. We realised that it was easier to write a trimmed-down version of the CERN module than to use it directly. The result is the cvmfs_simple module, which is available on GitHub.

'include cvmfs_simple' will set up the LHC repos and the gridpp repo.

The only mandatory parameter is

cvmfs_simple::config::cvmfs_http_proxy : 'squid-server'

It is also possible to add a local cvmfs repository. Extra repos can be configured by passing values from hiera:

cvmfs_simple::extra::repo: ['gridpp', 'oxford']

Oxford is using a local cvmfs repo to distribute software to local users. oxford.pp can be used as a template for setting up a new local cvmfs repo.
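Putting the two parameters above together, a minimal hiera sketch (the proxy URL is illustrative, not Oxford's real one) might look like:

```yaml
# minimal hiera data for cvmfs_simple (values are examples)
cvmfs_simple::config::cvmfs_http_proxy: 'http://squid.example.ac.uk:3128'
cvmfs_simple::extra::repo: ['gridpp', 'oxford']
```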

cvmfs_simple doesn't support all use cases, and it expects that everyone is using hiera ;). Please feel free to change it for your use case.

by Kashif Mohammad (noreply@blogger.com) at July 27, 2015 09:53

July 03, 2015

Tier1 Blog

Analysis of Callout Data

Analysis of Callout Data

As a work experience student at RAL, I have collected and analysed the data detailing the callouts made to the Tier 1 on-call team. The team provide 24×7 cover for the Tier 1 service.


Total number of Callouts per year

Over the past few years a trend has emerged, highlighted by the above graph of total callouts per year. The graph shows a decrease from 467 callouts in 2011 to 91 halfway through 2015. This is a significant decrease of 285 callouts (estimating the 2015 total as double the mid-year figure of 91, i.e. 182), which could reflect the weekly review of callouts carried out by the Tier 1 team; another explanation is improvements in technology that reduce the risk of faults and callouts. The only anomaly is 2014, which shows a higher number of callouts with no known specific cause, as the team has not analysed all of the data. However, even with this anomaly, the overall data shows a trend towards fewer failures each year. Hopefully, we will hit zero soon!

Alarms by Service

Types of Alarms by Server

During 2014 there were a total of 294 callouts; the graph above divides this total among the different services and types of alarms. We can conclude from this data that Castor, Database, Disk Server and SRM cause the most callouts. This could be because we treat storage services as more critical, so these are more often configured to call out. We do note that we have a large number of storage servers, which could lead to more callouts. We also note that the (Condor) batch system doesn't produce many callouts, and that there are relatively few for other grid services.


Types of Alarms by who handled them

The on-call team consists of a ‘Primary on-call’ (PoC) person who receives the message from the automated call-out system. The PoC makes an initial assessment of the problem and will attempt to resolve it. Should further assistance be needed the PoC passes the problem onto the on-call ‘expert’ from each of the support teams (Fabric, Castor, Database, Grid Services).

The graph above shows the difference between the problems handled by the PoC alone and those handled by the PoC plus an expert. We can see from this data that in 2014 around two thirds of the problems that arose were too complex or too big for the PoC alone, and so were referred to the on-call expert.

by Dan O'Riordan at July 03, 2015 13:47

February 24, 2015


Replacing the Condor Defrag Daemon

I've replaced the standard defrag daemon released with Condor with a simpler version that contains a proportional-integral (PI) controller. I hoped this would give us better control over multicore slots. Preliminary results with only the proportional part of the controller show that it fails to keep accurate control over the provision of slots. It is subject to hunting, due to the long time lag between the onset of draining and the eventual change in the controlled variable (the number of running mcore jobs). The rate of provision was unexpectedly stable at first, considering the simplicity of the algorithm employed, but degraded over time as the controlled variable became more random.

The graph below shows the very preliminary picture, with a temporary period of stable control shown by the green line on the right of the plot. The setpoint is 250.

I have also now included an Integral component to the controller, and I'm in the process of tuning the reset rate on this. I hope to show the results of this test soon.
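The idea can be sketched in a few lines of shell (the gains, setpoint and clamping policy here are my own illustrative choices, not the values used in the real daemon):

```shell
# Sketch of a proportional-integral controller deciding, each cycle, how many
# slots to drain given the number of running multicore jobs.
pi_drain() {
  local running=$1
  local setpoint=250          # target number of running mcore jobs
  local kp=10 ki=2            # P and I gains, in percent (illustrative)
  local error=$((setpoint - running))
  INTEGRAL=$((INTEGRAL + error))   # accumulated error, carried between cycles
  local drain=$(( (kp * error + ki * INTEGRAL) / 100 ))
  if [ "$drain" -lt 0 ]; then drain=0; fi   # never request negative draining
  echo "$drain"
}

INTEGRAL=0
pi_drain 200    # well below the setpoint, so request draining
pi_drain 245    # close to the setpoint, so back off
```

The integral term is what removes the steady-state offset a purely proportional controller leaves; the long lag between draining and its effect is what makes tuning the gains (and the reset rate) delicate.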

by Steve Jones (noreply@blogger.com) at February 24, 2015 12:09

November 17, 2014


Condor Workernode Health Script

This is a script that makes some checks on a worker node and "turns it off" if any of them fail. To implement this, I made use of a Condor feature: startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.

PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
StartJobs = False
RalNodeOnline = False
# the node only starts jobs when both attributes are True
START = (StartJobs =?= True) && (RalNodeOnline =?= True)

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiff at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and "RalNodeOnline"; it sets them initially to False; and it makes the START configuration setting dependent upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need a section that tells the system (startd) to run a cron script every three minutes.


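A minimal startd_cron stanza along these lines (the knob names are standard HTCondor configuration; the job name TESTNODE is illustrative, and the 180s period is the three minutes mentioned above) would be:

```
# make sure the values get over to the collector when they change
STARTD_CRON_AUTOPUBLISH = If_Changed
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) TESTNODE
STARTD_CRON_TESTNODE_EXECUTABLE = /usr/libexec/condor/scripts/testnodeWrapper.sh
STARTD_CRON_TESTNODE_PERIOD = 180s
STARTD_CRON_TESTNODE_MODE = Periodic
```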
The testnodeWrapper.sh script looks like this:


#!/bin/bash
# Run the underlying test script, discarding its output; we only want the code.
MESSAGE=OK
/usr/libexec/condor/scripts/testnode.sh > /dev/null 2>&1
STATUS=$?

if [ $STATUS != 0 ]; then
# look up the symbolic name of the error code in the test script itself
MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/testnode.sh | head -n 1 | sed -e "s/=.*//"`
if [[ -z "$MESSAGE" ]]; then
MESSAGE=UNKNOWN
fi
fi

if [[ $MESSAGE =~ ^OK$ ]] ; then
echo "RalNodeOnline = True"
else
echo "RalNodeOnline = False"
fi
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script that I reuse from our TORQUE/MAUI cluster. The existing script returns a non-zero code if any error happens. To add a bit of extra info, I also look up the meaning of the code. The important thing to notice is that it echoes out a line that sets RalNodeOnline to True or False; this is then used in the setting of START. Note: on TORQUE/MAUI the script ran as "root"; here it runs as "condor". I had to use sudo for some of the sections which (e.g.) check disks, because condor could not get smartctl settings etc. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs.

Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can take a node offline whether or not it has an error. This is useful for stopping a node in order to (say) rebuild it. It's done on the server, like this:

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01

by Steve Jones (noreply@blogger.com) at November 17, 2014 16:52

October 14, 2014


Nagios Monitoring for Non LHC VO’s

A brief description of the monitoring framework before coming to the actual topic of non-LHC VO monitoring.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of many components performing various functions, and can be broadly divided into:
'What to monitor', or topology aggregation: the collection of service endpoints and metadata from different sources such as GOCDB, BDII and VOMS. A custom topological source (a VO feed) can also be used.
Profile management: the mapping of services to the tests to be performed. This service is provided by the POEM (Profile Management) database, which provides a web-based interface to group various metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is automatically configured from the information provided by the topology aggregator and POEM.
The SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a regional SAM Nagios, and results from the regional SAM Nagios also go to the central MyEGI, which is used for reliability/availability calculations.
UK Regional Nagios is maintained at Oxford
and a backup instance at Lancaster

There was no centralised monitoring of non-LHC VOs for a long time, which contributed to a bad user experience, as it was difficult to tell whether a site was broken or the problem was at the user's end. It was decided to host a multi-VO Nagios at Oxford, as we had experience with the WLCG Nagios.
It is currently monitoring five VOs.

Sites can look at the tests associated with just their site.
VO managers may be interested in seeing the tests associated with a particular VO only.

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO feed available on a web server. Currently we are maintaining this VO feed.

The VO feed provides the list of services to be monitored. I am generating this VO feed through a script.

Jobs are submitted using a proxy generated from a robot certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and CA version. Jobs are submitted every eight hours, and this is a configurable option. We are monitoring CREAM CE, ARC CE and SE only; services like BDII, WMS etc. are already monitored by the regional Nagios, so there was no need for duplication.

For more information, these links can be consulted

by Kashif Mohammad (noreply@blogger.com) at October 14, 2014 11:08

October 08, 2014

London T2

XrootD and ARGUS authentication

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:
  1. IPv6 (see blog post),
  2. central authorisation via ARGUS (the subject of this blog post),
  3. XrootD 4 itself.
We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non-local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's FAX for POSIX storage sites instructions. These document how to use VOMS authentication, but not central banning via ARGUS. CMS, however, have some instructions on using xrootd-lcmaps to do the authorisation - though with RPMs from different (and therefore potentially incompatible) repositories. It is nevertheless possible to get them to work.

The following packages are needed (or at least this is what I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c
  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install xerces-c
  yum install lcmaps-plugins-basic

Now the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:

 xrootd.seclib /usr/lib64/libXrdSec.so
 xrootd.fslib /usr/lib64/libXrdOfs.so
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfun:libXrdLcmaps.so -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1

And in /etc/xrootd/lcmaps.cfg it is necessary to change the path and the argus server (my argus server is obscured in the example below). My config file looks like:


# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note: put your own argus host instead of argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid http://esc.qmul.ac.uk/xrootd"
             " --actionid http://glite.org/xacml/action/execute"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

pepc -> good | bad

Then after restarting xrootd, you just need to test that it works.

It seems to work: I was successfully able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd - though perhaps if I'd had more patience it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.

by Christopher J. Walker (noreply@blogger.com) at October 08, 2014 09:20

March 24, 2014


The Three Co-ordinators

It has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.

We are presently going through a redevelopment of the site, the evaluation of new techniques for service delivery (such as using Docker for containers) and the updating of multiple services throughout the sites.

The development of the programme presented at CHEP on automation and different approaches to delivering HEP-related Grid services is underway. An evaluation of container-based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work, on using Software Defined Networking, hasn't progressed as quickly as we would have liked but is still underway.

Graeme (left), Mark (center) and Gareth.

In other news, Gareth Roy is taking over as the ScotGrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University. And since Dr Who can do it, so can we: Co-ordinators Past, Present and Future all appear in the same place at the same time.

Will the fabric of Scotgrid be the same again?

Very much so.

by Mark Mitchell (noreply@blogger.com) at March 24, 2014 23:13

October 14, 2013


Welcome to CHEP 2013

Greetings from CHEP 2013 in a rather wet Amsterdam.

The conference season is upon us, and Sam, Andy, Wahid and I find ourselves in Amsterdam for CHEP 2013. CHEP started here in 1983, and it is hard to believe that it has been 18 months since New York.

As usual the agenda for the next 5 days is packed. Some of the highlights so far have included advanced facility monitoring, the future of C++ and Robert Lupton's excellent talk on software engineering for Science.

As with all of my visits to Amsterdam, the rain is worth mentioning. So much so that it made local news this morning. However, the venue is the rather splendid Beurs van Berlage in central Amsterdam.

CHEP 2013

There will be further updates during the week as the conference progresses.

by Mark Mitchell (noreply@blogger.com) at October 14, 2013 12:53

June 04, 2013

London T2

Serial Consoles over ipmi

To get serial consoles over IPMI working properly with Scientific Linux 6.4 (aka RHEL 6.4 / CentOS 6.4) I had to modify several settings, both in the BIOS and in the OS.

Hardware Configuration

For the Dell C6100 I set these settings in the BIOS:

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled

Note: "Redirection After Boot = Disabled" is required, otherwise I get a 5 minute timeout before booting the kernel. Unfortunately, with this set-up you get a gap in output while the server attempts to PXE boot. However, you can interact with the BIOS, and once Grub starts you will see and be able to interact with the Grub and Linux boot processes.

For the Dell R510/710 I set these settings in the BIOS:

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled

Note: With these settings you will be unable to see the progress of the kickstart install on the non default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console

This allows you to access Grub via the consoles. The "serial" (IPMI) terminal will be the default unless you press a key when asked during the boot process. This applies only to Grub, not to the rest of the Linux boot process.

SL6 Configuration

The last console specified in the linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!)

For the initial kickstart pxe boot I append "console=tty1 console=ttyS1,115200" to the linux kernel arguments. Here the serial console over ipmi will be the default during the install process, while the other console should echo the output of the ipmi console.

After the install, the kernel argument "console=ttyS1,115200" had already been added to the kernel boot arguments. I have additionally added "console=tty1" before it; this may be required to enable interaction with the server via a directly connected terminal if needed.

With the IPMI port set as the default (the last console specified in the kernel arguments), SL6 will automatically start a getty for ttyS1. If it were not the default console we would have to add an upstart config file in /etc/init/. Note that SL6 uses upstart; previous SL5 console configurations in /etc/inittab are ignored!

e.g. ttyS1.conf

start on stopped rc RUNLEVEL=[345]
stop on runlevel [S016]

exec /sbin/agetty /dev/ttyS1 115200 vt100
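With the BIOS and getty configured, the console itself is reached through ipmitool's serial-over-LAN support. A typical session (the BMC hostname and credentials below are placeholders) looks like:

```
# open the serial-over-LAN console (host/user/password are placeholders)
ipmitool -I lanplus -H bmc.example.ac.uk -U admin -P secret sol activate
# tear down a stale session if the console appears stuck
ipmitool -I lanplus -H bmc.example.ac.uk -U admin -P secret sol deactivate
```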

by Daniel Traynor (noreply@blogger.com) at June 04, 2013 14:49

October 02, 2012

National Grid Service

SHA2 certificates

We have started to issue certificates with the "new" more secure algorithms, SHA2 (or to be precise SHA256) - basically, it means that the hashing algorithm which is a part of the signature is more secure against attacks than the current SHA1 algorithm (which in turn is more secure than the older MD5).

But only to a lucky few, not to everybody.  And even they get to keep their "traditional" SHA1 certificates alongside the SHA2 one if they wish.

Because the catch is that not everything supports SHA2.  The large middleware providers have started worrying about supporting SHA2, but we only really know by testing it.

So what's the problem? A digital signature is basically a one-way hash of something, encrypted with your private key: S=E(H(message)). To verify the signature, you re-hash the message, H(message), and also decrypt the signature with the public key (found in the signer's certificate): D(S)=D(E(H(message)))=H(message) - and you also check the validity of the certificate.

If someone has tampered with the message, H would fail (with extremely high probability) to yield the same result, and hence invalidate the signature, as D(S) would no longer be the same as H(tamper_message).

However, if you could attack the hash function and find a tamper_message with the property that H(tamper_message)=H(message), then the signature is useless - and this is precisely the kind of problem people are worrying about today for H being SHA1 (and history repeats itself, since we went through the same thing for MD5 some years ago).
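As a concrete illustration of the S=E(H(message)) scheme, here is a minimal openssl session (the filenames and key size are arbitrary choices; this is a sketch of the maths, not of how the CA operates):

```shell
# sign = encrypt-the-hash with the private key; verify = recompute and compare
echo "important message" > message.txt
openssl genrsa -out key.pem 2048 2>/dev/null          # signer's private key
openssl rsa -in key.pem -pubout -out pub.pem 2>/dev/null
openssl dgst -sha256 -sign key.pem -out message.sig message.txt   # S = E(H(message))
openssl dgst -sha256 -verify pub.pem -signature message.sig message.txt
# tamper with the message: H(tamper_message) no longer matches D(S)
echo "tampered message" > message.txt
openssl dgst -sha256 -verify pub.pem -signature message.sig message.txt || true
```

The first verification succeeds ("Verified OK"); after the tampering, the recomputed hash no longer matches the decrypted signature and verification fails.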

So we're now checking whether it works. So far, we have started with PKCS#10 requests from a few lucky individuals; I'll do some SPKACs tomorrow. If you want one to play with, send us a mail via the usual channels (e.g. email or the helpdesk).

Eventually, we will start issuing renewals with SHA2, but only once we're sure that they work with all the middleware out there... We are also taking the opportunity to test a few modernisations of the extensions in the certificates.

by Jens Jensen (noreply@blogger.com) at October 02, 2012 16:52

June 14, 2012

National Grid Service

Kick off - it's time for the NGS summer seminar series

In the midst of this summer of sport another event is kicking off soon but this time it's the NGS Summer Seminar series.

The first seminar will take place next Wednesday (20th June) at 10.30am (BST) and will give an overview of how accounting is done on the grid, and what it is used for.  It will cover the NGS accounting system at a high level and then go into more detail about the implementation of APEL, the accounting system for EGI, including the particular challenges involved and the plans for development.

The speaker will be Will Rogers from STFC Rutherford Appleton Laboratory who I'm sure would appreciate a good audience ready to ask lots of questions!

Please help spread the word about this event to any colleagues or organisations you think might be interested.  A Facebook event page is available so please invite your colleagues and friends!

by Gillian (noreply@blogger.com) at June 14, 2012 11:09

December 21, 2010

gLite/Grid Data Management

GFAL / LCG_Util 1.11.16 release

There has been no blog post for almost half a year. That does not mean that nothing has happened since then. We devoted enormous effort to background work (an automated test bed, nightly builds and test runs, the change from the EGEE era to EMI, etc.). We will test the tools and the procedures in the first months of 2011, and analyse whether they have added value and how they could be improved. As for the visible part, we (finally) released GFAL/LCG_Util 1.11.16 in November - see the release notes. Better late than never!

by zsolt rossz molnár (noreply@blogger.com) at December 21, 2010 14:11

October 29, 2010

Steve Lloyd's ATLAS Grid Tests

Upgrade to AtlasSetup

I have changed the setup for my ATLAS jobs so it uses AtlasSetup (rather than AtlasLogin). The magic lines are:

source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/asetup.sh AtlasOffline $RELEASE

VO_ATLAS_SW_DIR is set up automatically and you have to set RELEASE yourself. Since AtlasSetup is only available from Release 16 onwards, jobs going to sites without Release 16 will fail.

by Steve Lloyd (noreply@blogger.com) at October 29, 2010 15:11

July 23, 2010

Steve Lloyd's ATLAS Grid Tests

Steve's pages update

I have done some much needed maintenance and the gstat information is available again (from gstat2). There is also a new page giving the history of the ATLAS Hammercloud tests status http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html.

by Steve Lloyd (noreply@blogger.com) at July 23, 2010 09:04

November 16, 2009

MonAMI at large

watching the ink dry

Yeah, it's been far too long since the last bit of news, so here's an entry just to announce that MonAMI now has a new plugin: inklevel.

This plugin is a simple wrapper around Markus Heinz's libinklevel library. This is a nice library that allows easy discovery of how much ink is left in those expensive ink cartridges.

The library allows one to check the ink levels of Canon, Epson and HP printers. It can check printers directly attached (via the parallel port or USB port) or, for Canon printers, over the network via BJNP (a proprietary protocol that has been reverse engineered).

libinklevel supports many different printers, but not all of them. There's a small collection of printers that the library doesn't work with, and some that are listed as neither working nor not working. If your printer isn't listed, please let Markus know whether libinklevel works for you.

Credit for the photo goes to Matthew (purplemattfish) for his picture CISS - Day 304 of Project 365.

by Paul Millar (noreply@blogger.com) at November 16, 2009 20:14

Trouble at Mill

With some unfortunate timing, it looks like the "Axis of Openness" webpages (SourceForge, Slashdot, Freshmeat, ...) have gone for a burton. There seem to be networking problems with these sites, with web traffic timing out. Assuming the traceroute output is valid, the problem appears soon after traffic leaves the Santa Clara location of the Savvis network [dead router(s)?]

This is a pain because we've just done the v0.10 release of MonAMI and both the website and the file download locations are hosted by SourceForge. Whilst SourceForge is down, no one can download MonAMI!

If you're keen to try MonAMI in the meantime, you can download the RPMs from the (rough and ready) dev. site:

The above site is generously hosted by the ScotGrid project [their blog].

Thanks guys!

by Paul Millar (noreply@blogger.com) at November 16, 2009 20:13

October 13, 2009

Monitoring the Grid


It should be obvious to anyone following this blog that the 10 weeks of this studentship project have long since ended; however, until now there were a few outstanding issues. I can now finally say that the project is finished and ready for public use. It can be found at http://epdt77.ph.bham.ac.uk:8080/webgrid/, although the link may change at some point in the future and the "add to iGoogle" buttons won't work for now.

The gadget is currently configured to use all available data from 2009. Specifically, it holds data for all jobs submitted in 2009 up to around mid-September (the most up-to-date data available from the Grid Observatory).

In addition to this I have produced a small report giving an overview of the project which is available here.

by Laurence Hudson (noreply@blogger.com) at October 13, 2009 16:45

September 17, 2009

Steve at CERN

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm-up, here is a sneak preview of the EGEE SA1 OAT sessions.


by Steve Traylen (noreply@blogger.com) at September 17, 2009 16:03

July 31, 2009

Monitoring the Grid

Another Update (Day 30)

As week six comes to a close, I thought it was about time for another progress update. So, same as last time, I'm stealing the bullet points from two posts back, with new additions in italics.

  • GridLoad style stacked charts.

  • A "League Table" (totaled over the time period).

  • Pie charts (of the "League Table").

  • Filters and/or sub filters (just successful jobs, just jobs by one VO, just jobs for this CE etc).

  • A tabbed interface.

  • Regular Expression based filtering

  • Variable Y-axis parameter (jobs submitted, jobs started, jobs finished etc).

  • Transition to university web servers.

  • Move to a more dynamic chart legend style. Not done for pie chart yet.

  • Ensure w3 standards compliance/cross-browser compatibility & testing.

  • Automate back end data-source (currently using a small sample data set). Need automatic renewal of grid certificates.

  • Variable X-axis time-step granularity.

  • Data/image export option.

  • A list of minor UI improvements. (About 10 little jobs, not worth including in this list as they would be meaningless, without going into a lot more detail about how the gadget's code is implemented).

  • Optimise database queries and general efficiency testing.

  • Make the interface more friendly. (Tool-tips etc.)

  • Possible inclusion of more "real time" data.

  • Gadget documentation.

  • A Simple project webpage.

  • A JSON data-source API reference page.

  • 2nd gadget, to show all known info for a given JobID.

  • 2nd gadget: Add view of all JobIDs for one user (DN string).

    The items in this list are now approximately in the order I intend to approach them.

    On another note, I have finally managed to get some decent syntax highlighting for Google gadgets, thanks to this blog post, even if it means being stuck with VIM. To get this to work, add the vim modeline to the very bottom of the xml gadget file; otherwise it tends to break things, such as the gadget title, if added at the top. Whilst VIM is not my editor/IDE of choice, it's pretty usable and can, with some configuration, match most of the key features (show whitespace) I use in Geany. However, Geany's folding feature would save a lot of time & scrolling.
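For reference, a modeline at the foot of the gadget XML might look like this (the exact options are an assumption - use whatever suits your setup):

```xml
<!-- keep this as the very last line of the gadget .xml file -->
<!-- vim: set syntax=html ts=2 sw=2 et: -->
```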

    by Laurence Hudson (noreply@blogger.com) at July 31, 2009 15:11

    June 18, 2009

    Grid Ireland

    DPM 1.7.0 upgrade

    I took advantage of a downtime to upgrade our DPM server. We need the upgrade as we want to move files around using dpm-drain and don't want to lose space token associations. As we don't use YAIM I had to run the upgrade script manually, but it wasn't too difficult. Something like this should work (after putting the password in a suitable file):

    ./dpm_db_310_to_320 --db-vendor MySQL --db $DPM_HOST --user dpmmgr --pwd-file /tmp/dpm-password --dpm-db dpm_db
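As for "putting the password in a suitable file", something like this keeps it away from other users' eyes (a sketch; the placeholder password is obviously mine):

```shell
# Create the DB password file for --pwd-file, readable only by the
# invoking user (umask 077 gives the new file mode 600).
umask 077
printf '%s\n' 'REPLACE-WITH-DPM-DB-PASSWORD' > /tmp/dpm-password
```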

    I discovered a few things to watch out for along the way though. Here's my checklist:

    1. Make sure you have enough space on your system disk: I got bitten by this on a test server. The upgrade script needs a good chunk of space (comparable to that already used by the MySQL DB?) to perform the upgrade.
    2. There's a mysql setting you probably need to tweak first: add set-variable=innodb_buffer_pool_size=256M to the [mysqld] section in /etc/mysql.conf and restart mysql. Otherwise you get this cryptic error:

      Thu Jun 18 09:02:30 2009 : Starting to update the DPNS/DPM database.
      Please wait...
      failed to query and/or update the DPM database : DBD::mysql::db do failed: The total number of locks exceeds the lock table size at UpdateDpmDatabase.pm line 19.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      Also worth noting is that if this happens to you, when you try to re-run the script (or YAIM) you will get this error:

      failed to query and/or update the DPM database : DBD::mysql::db do failed: Duplicate column name 'r_uid' at UpdateDpmDatabase.pm line 18.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().

      This is because the script has already done this step. You need to edit /opt/lcg/share/DPM/dpm-db-310-to-320/UpdateDpmDatabase.pm and comment out this line:

      $dbh_dpm->do ("ALTER TABLE dpm_get_filereq ADD r_uid INTEGER");

      You should then be able to run the script to completion.
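Spelled out, the /etc/mysql.conf tweak from step 2 is just the following (256M is the value suggested above; newer MySQL versions drop the set-variable= prefix):

```ini
[mysqld]
# enlarge the InnoDB buffer pool so the schema upgrade's large
# transactions don't exhaust the lock table
set-variable=innodb_buffer_pool_size=256M
```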

    by Stephen Childs (noreply@blogger.com) at June 18, 2009 09:19

    June 08, 2009

    Grid Ireland

    STEP '09 discoveries

    ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

    Intel 10G cards don't work well with SL4 kernels

    We're currently upgrading our networking to 10G and had it mostly in place by the time STEP'09 started. However, we discovered that the stock SL4 kernel (2.6.9) doesn't support the ixgbe 10G driver very well. It was hard to detect because we could get reasonable transmit performance but receive was limited to 30Mbit/s! It's basically an issue with interrupts (MSI-X and multi-queue weren't enabled). I compiled up a 2.6.18 SL5 kernel for SL4 and that works like a charm (once you've installed it using --nodeps).

    It's worth tuning RFIO

    We had loads of atlas analysis jobs pulling data from the SE and they were managing to saturate the read performance of our disk array. See this NorthGrid post for solutions.

    Fair-shares don't work too well if someone stuffs your queues

    We'd set up shares for the various different atlas sub-groups but the generic analysis jobs submitted via ganga were getting to use much more time. On digging deeper with Maui's diagnose -p I could see that the length of time they'd been queued was overriding the priority due to fairshare. I was able to fix this by increasing the value of FSWEIGHT in Maui's config file.
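In maui.cfg terms the change is along these lines (the weights are illustrative, not our production values):

```text
# Give fairshare more weight relative to queued time, so long-queued
# jobs can no longer override fairshare priority.
FSWEIGHT          100
QUEUETIMEWEIGHT   1
```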

    You need to spread VOs over disk servers

    We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course this then got hammered ... we're now trying to spread out the data across multiple servers.

    by Stephen Childs (noreply@blogger.com) at June 08, 2009 12:53

    March 23, 2009

    Steve at CERN

    Installed Capacity at CHEP

    This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the roll-out of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
    Sites should consider taking the following actions.

    • Check gridmap. In particular the view obtained by clicking on the more label and selecting the "size by SI00 and LogicalCPUs".
    • Adjust your published #LogicalCPUS in your SubCluster. It should correspond to the number of computing cores that you have.
    • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
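In the GlueSubCluster entry these show up roughly as follows (GLUE 1.3 attribute names; the values are illustrative):

```text
GlueSubClusterLogicalCPUs: 120
GlueHostBenchmarkSI00: 2000
```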
    The follow-up questions were as follows; now is a chance for a more considered response.
    1. Will there be any opportunity to run a benchmark within the gcm framework?
      I answered that this was not possible: unless it could be executed in under 2 seconds there was no room for it. Technically there would not be a problem with running something for longer; it could be run rarely. We should check how the first deployment of GCM goes, though longer tests are in no way planned.
    2. What is GCM collecting and who can see its results?
      Currently no one can see anything on the wire, since the messages are encrypted. There should be a display at https://gridops.cern.ch/gcm; it is currently down, but once up it will be accessible to IGTF CA members. For now there are some test details available.
    3. When should sites start publishing the HEPSpecInt2006 benchmark?
      The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will also be able to publish the value.
    4. If sites are measuring these benchmarks, can the values be made available to jobs on the worker nodes?
      Recently the new glite-wn-info made it as far as the PPS service. This allows a job to find out, on the WN, which GlueSubCluster it belongs to. In principle this should be enough, as the Spec benchmarks can be retrieved from the GlueSubClusters. The reality, of course, is that this is not possible until some future date when all the WNWG recommendations are deployed along with CREAM. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
    5. Do you know how many sites are currently publishing incorrect data?
      I did not know the answer, nor is one easy to obtain other than by collecting the ones of zero size. Checking now: of 498 (468 unique?) SubClusters, some 170 have zero LogicalCPUs.
    On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making the grid work. He got extra queues, reserved resources, process dumps and general job watching from me. Those were the first grid jobs we had that approached anything like the analysis we now face. By the gentleman's own account, his grid experience and the results obtained using RAL earned him his doctorate, and CMS chose to use the grid.

    by Steve Traylen (noreply@blogger.com) at March 23, 2009 10:11

    January 07, 2009

    GridPP Operations

    OpenSSL vulnerability

    There is a new vulnerability in OpenSSL in all versions prior to 0.9.8j, discovered by Google's security team. You will be happy to learn that the Grid PKI is not affected by the vulnerability since it uses RSA signatures throughout - only DSA signatures and ECDSA (DSA but with Elliptic Curves) are affected. (Of course you should still upgrade!)
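A quick way to check which signature algorithm a particular certificate uses (a sketch; the sig_alg helper name is mine):

```shell
# Print the signature algorithm of an X.509 certificate.
# RSA-based algorithms are unaffected by this vulnerability;
# DSA and ECDSA signatures are the vulnerable ones.
sig_alg() {
    openssl x509 -in "$1" -noout -text | grep 'Signature Algorithm' | head -1
}
```

For example, sig_alg ~/.globus/usercert.pem should report something RSA-based for a Grid certificate.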

    by Jens Jensen (noreply@blogger.com) at January 07, 2009 21:06

    January 05, 2009

    GridPP Operations

    New MD5 vulnerability announced

    In 2006, two different MD5-signed certificates were created. A new, stronger attack, announced last Wednesday (yes, 30 Dec), allows the attacker to change more parts of the certificate, including the subject name. To use this "for fun and profit" one gets an MD5 end entity certificate from a CA (ideally one in the browser's keystore), and hacks it to create an intermediate CA which can then issue

    by Jens Jensen (noreply@blogger.com) at January 05, 2009 11:12