May 03, 2016

The SSI Blog

The Importance of Grace and Why We Need More Women in Scientific Computing

By Toni Collis, Director - Women in HPC.

"What is a good (or bad) percentage of female users of a scientific computing facility?"

In this article, originally published to accompany a talk given at the recent public opening of Grace, UCL's new computing facility, Toni Collis uses this question to look at the diversity of the community in general, focussing on the High Performance Computing (HPC) community that represents the large-scale use of scientific computing. By looking at other studies, including analysing data collected by the Institute about research software use and development in the UK, Toni calls out the things we still need to be aware of if we are to continue to honour the work of pioneering computer scientist Grace Hopper, inventor of the first compiler for a computer programming language.  

Toni Collis is Director of Women in HPC, and works with the Institute's Policy team as a collaborator on diversity and engagement campaigns.


by n.chuehong at May 03, 2016 10:22

April 28, 2016

The SSI Blog

Collaborations Workshop 2016 (CW16) Report

By Shoaib Sufi, Community Lead, Software Sustainability Institute.

The Institute’s Collaborations Workshop 2016 (CW16) took place from 21-23 March 2016 at the Royal College of Surgeons of Edinburgh. The opening slide set the scene by displaying a weighted representation of the software that attendees used in their daily work. Shoaib Sufi’s welcome to attendees was followed by an introduction from the Institute Director, Neil Chue Hong. Neil spoke about the work of the Institute and how to get the most out of a Collaborations Workshop (CW). The clue was very much in the name: the main idea was to meet people, including people you may not have met before, thus widening your network of potential collaborators. With over 80 people attending, it made for a real opportunity to learn and share. We cover below how the workshop unfolded.


by s.sufi at April 28, 2016 15:40

April 27, 2016

Tier1 Blog

Long delayed hat day

It has been a while since there was a Tier1 Hat Day. It took numerous meetings, including an unprecedented full committee meeting, to arrange the latest display of millinery finesse.

All in all it was a good turnout. In addition to the more ‘normal’ members, the less normal attendees included:

The Ceph member of staff being transformed into a man-octopus hybrid.
The pipe-wielding mad scientist who denies all knowledge of man-octopus transformations.
The captain of the Tier1 finally donning his official hat, hopefully to navigate R89 through the rocks of uncertainty.
Fidel sent his personal look-alike representative, or maybe even Fidel himself in disguise.
The wizard wore the magical conical hat covered in magical invisible symbols.

The dark figure in the background, who may be a fencer or maybe just a ‘dark figure’ mimic.

by johnkelly at April 27, 2016 09:40

April 22, 2016

GridPP Storage

ZFS vs Hardware Raid System, Part II

This post focuses on other differences between a ZFS-based software raid and a hardware raid system that could be important for use as a GridPP storage backend. In a later post, the differences in read/write rates will be tested more intensively.
For the current tests, the system configuration is the same as described previously.

The first test is: what happens if we just take a disk out of the current raid system?
In both cases, the raid gets rebuilt using the hot spare that was provided in the initial configuration. However, the times needed to fully restore redundancy are very different:

ZFS based raid recovery time: 3min
Hardware based raid recovery time: 9h:2min

For both systems, only the test files from the previous read/write tests were on disk; the hardware raid was re-initialized to remove the corrupted filesystem left by the failure test, and the test files were then recreated. Neither system was doing anything else during the recovery period.

The large difference is due to the fact that ZFS is a raid system, volume manager, and file system all in one. ZFS knows about the structure on disk and the data that was on the broken disk, so it only needs to restore what was actually used, in the above test case just 1.78GB.
The hardware raid system, on the other hand, knows nothing about the filesystem or the space actually used on a disk, so it needs to restore the whole disk even if, as in our test case, it is nearly empty: that is 2TB to restore while only about 2GB are actually useful data! This difference will become even more important in the future as the capacity of a single disk grows.

ZFS is now also used on one of our real production systems behind DPM. The zpool on that machine, consisting of 3 x raidz2 vdevs with 11 x 2TB disks each, also had a real failed disk which needed to be replaced. There are about 5TB of data on that zpool, and restoring full redundancy took about 3h while the machine was still in production and being used by jobs to request or store data. This is again much faster than what a hardware raid based system would need, even if the machine were doing nothing else in the meantime than restoring full redundancy.
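The replacement itself is a single command. A minimal sketch of the pattern is below; the pool and device names are purely illustrative, not the ones from our production setup:

# swap the failed disk for a new one (pool and device names are examples only)
zpool replace tank-2TB /dev/sdq /dev/sdy

# follow the resilver progress and the estimated time until redundancy is restored
zpool status tank-2TB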



Another large difference between the two systems is the time needed to set up a raid configuration.
For the hardware based raid, one first needs to create the configuration in the controller, initialize the raid, then create partitions on the virtual disk, and lastly format the partitions to put a file system on them. When all of this is finished, the newly formatted partitions need to be added to /etc/fstab so that they are mounted when the system starts, mount points need to be created, and then the partitions can be mounted. That is a very long process which takes a lot of time before the system can be used.
To give an idea, formatting a single 8TB partition with ext4 took:
on the H700: about 1h:11min
on the H800: about 34min

In our configuration, 24 x 2TB on the H800 and 12 x 2TB on the H700, the formatting alone would take about 6h (if done one partition after another)!

For ZFS, we still need to create a raid0 for each disk separately on the H700 and H800 since neither controller supports JBOD. However, once this is done, it is very easy and fast to create a production-ready raid system. There is a single command which does everything:
zpool create NAME /dev/... /dev/... ..... spare /dev/....
After that single command, the raid system is created, formatted ready for use, and mounted under /NAME (the mount point can also be changed using options when the zpool is created, or later). There is no need to edit /etc/fstab, and the whole setup takes less than 10s.
To have single 8TB "partitions", one can create additional zfs filesystems in this pool and set a quota on each, for example:
zfs create -o refquota=8T NAME/partition1
After that command, a new ZFS filesystem is created, a quota is placed on it which makes sure that tools like "df" only see 8TB available for use, and it is mounted under /NAME/partition1. Again there is no need to edit /etc/fstab, and the setup takes just a second or two.
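Putting the pieces together, a complete setup might look like the sketch below; the pool name, device names and spare are illustrative only, not our actual layout:

# create a raidz2 pool with a hot spare (device names are examples only)
zpool create tank raidz2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde spare /dev/sdf

# carve out an 8TB "partition" as its own filesystem with a quota
zfs create -o refquota=8T tank/gridstorage01

# the new filesystem is already mounted; check what "df" sees
df -h /tank/gridstorage01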




Another important consideration is what happens to the data if parts of the system fail.
In our case, we have 12 internal disks on an H700 controller and 2 MD1200 devices with 12 disks each connected through an H800 controller, and 17 of the 2TB disks we have used so far in the MD devices need to be replaced by 8TB disks. Different setups are possible, and it is interesting to see what happens in each case if a controller or one of the MD devices fails.
The scenarios below assume that the system is used as a DPM server in production.

Possibility 1: 1 raid6/raidz2 for all 8TB disks, 1 raid6/raidz2 for all 2TB disks on the H800, and 1 raid6/raidz2 for all 2TB disks on the H700

What happens if the H700 controller fails and can't be replaced (soon)?

Hardware raid system
The 2 raid systems on the H800 would still be available. The associated filesystems could be drained in DPM, and the disks on the H700 controller could be swapped with the disks in the MD devices associated with the drained file systems. However, whether the data on the disks would be usable depends a lot on the compatibility of the on-disk format used by different raid controllers. Also, even if it were possible to recreate the raid on a different controller without any data loss (which is probably not possible), it would appear to the OS as a different disk and all the mount points would need to be changed to have the data available again.

ZFS based raid system
Again, the 2 raid systems associated with the H800 would still be available and the pool of 8TB disks could be drained. Then the disks on the failed H700 could simply be swapped with the 8TB disks, and the zpool would be available again, mounted under the same directories as before, no matter which bays or which controller the disks sit in.

What happens if one of the MD devices fails?

Hardware raid system
Since one MD device has only 12 disks and we need to put in 17 x 8TB disks, one raid system consists of disks from both MD devices. If either MD device fails, the data on all 8TB disks will be lost. If the MD device with the remaining 2TB disks fails, it depends on whether the raid on the 2TB disks can be recognized by the H700 controller. If so, then the data on the disks associated with the H700 could be drained and at least the data from the 2TB disks on the failed MD device could be restored, although with some manual configuration changes in the OS, since DPM needs the data mounted in the previously used directories.

If anyone has experience of using disks from a raid created on one controller in another controller of a different kind, I would appreciate it if you could share it in the comments section.


ZFS based raid system
If either MD device fails, the pool with the internal disks on the H700 could be drained, and then those disks swapped with the disks from the failed MD device. All data would be available again and no data would be lost.

Possibility 2: one raidz2 for all 8TB disks (pool1) and one raidz2 for all 2TB disks (pool2)

This setup is only possible when using a ZFS based raid since it's not supported by hardware raid controllers.

If the H700 fails, then pool1 could be drained and the 8TB disks replaced with the 2TB disks which were connected to the H700. Then all data would be available again.
It is similar when one of the MD devices fails. If the MD device with only 8TB disks fails, then pool2 can be drained and the disks on the H700 replaced with the 8TB disks from the failed MD device.
If the MD device with both 2TB and 8TB disks fails, it is a bit more complicated, but all data could be restored in a 2-step process:
1) replace the 2TB disks on the H700 with the 8TB disks from the failed MD device, which makes pool1 available again so that it can be drained;
2) put the original 2TB disks that were connected to the H700 back in and replace the 8TB disks on the working MD device with the remaining 2TB disks from the failed MD device, which makes pool2 available again.
In either case, no data would be lost.




by Marcus Ebert (noreply@blogger.com) at April 22, 2016 19:45

April 20, 2016

GridPP Storage

ZFS compression for LHC experiments data

An interesting feature of ZFS is that it supports transparent compression. Unlike typical file compression, ZFS compression works on the record/block size that it writes (which in ZFS is variable, depending on the data and the file size itself). Since a fast compression/decompression algorithm is needed to keep the overhead low compared to file access without compression, one cannot expect compression results similar to, for example, bzip2 at its highest compression level. Also, the data files of the LHC experiments are ROOT files, which already store data in a compressed format.

Therefore, I was not expecting any benefit from enabling compression on our servers, but since the newly implemented LZ4 algorithm has nearly no overhead even for non-compressible data, it shouldn't hurt to enable it, especially since our storage servers have dual CPUs with 12 cores each which are idle most of the time.
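For reference, enabling and checking compression only takes a couple of commands; a minimal sketch, using one of the pool names that appears in the output below:

# enable the default lz4 compression on a pool (only affects newly written data)
zfs set compression=lz4 tank-2TB

# check that it is active and what ratio has been achieved so far
zfs get compression,compressratio tank-2TB
zpool list tank-2TB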

After enabling the default lz4 compression on 4 machines that had already been migrated to ZFS and copying data onto them, the first compression results look like this:


NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T  8.73T  23.8T         -    15%    26%  1.00x  ONLINE  -
tank-8TB   116T  24.0T  92.0T         -    10%    20%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.03x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.03x  -
tank-8TB                compressratio  1.03x  -
tank-8TB/gridstorage01  compressratio  1.03x  -
tank-8TB/gridstorage02  compressratio  1.03x  -
tank-8TB/gridstorage03  compressratio  1.03x  -
tank-8TB/gridstorage04  compressratio  1.03x  -
tank-8TB/gridstorage05  compressratio  1.04x  -
tank-8TB/gridstorage06  compressratio  1.03x  -
tank-8TB/gridstorage07  compressratio  1.03x  -
tank-8TB/gridstorage08  compressratio  1.03x  -
tank-8TB/gridstorage09  compressratio  1.03x  -
tank-8TB/gridstorage10  compressratio  1.03x  -
tank-8TB/gridstorage11  compressratio  1.03x  -



NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T  8.45T  24.0T         -    11%    26%  1.00x  ONLINE  -
tank-8TB   116T  24.1T  91.9T         -     7%    20%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.03x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.04x  -
tank-8TB                compressratio  1.03x  -
tank-8TB/gridstorage01  compressratio  1.03x  -
tank-8TB/gridstorage02  compressratio  1.03x  -
tank-8TB/gridstorage03  compressratio  1.03x  -
tank-8TB/gridstorage04  compressratio  1.03x  -
tank-8TB/gridstorage05  compressratio  1.03x  -
tank-8TB/gridstorage06  compressratio  1.03x  -
tank-8TB/gridstorage07  compressratio  1.03x  -
tank-8TB/gridstorage08  compressratio  1.03x  -
tank-8TB/gridstorage09  compressratio  1.03x  -
tank-8TB/gridstorage10  compressratio  1.03x  -
tank-8TB/gridstorage11  compressratio  1.03x  -


NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-4TB   127T  9.05T   118T         -     3%     7%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-4TB                compressratio  1.03x  -
tank-4TB/gridstorage01  compressratio  1.03x  -
tank-4TB/gridstorage02  compressratio  1.03x  -
tank-4TB/gridstorage03  compressratio  1.04x  -
tank-4TB/gridstorage04  compressratio  1.02x  -
tank-4TB/gridstorage05  compressratio  1.03x  -
tank-4TB/gridstorage06  compressratio  1.03x  -
tank-4TB/gridstorage07  compressratio  1.03x  -
tank-4TB/gridstorage08  compressratio  1.03x  -
tank-4TB/gridstorage09  compressratio  1.03x  -
tank-4TB/gridstorage10  compressratio  1.04x  -
tank-4TB/gridstorage11  compressratio  1.03x  -
tank-4TB/gridstorage12  compressratio  1.02x  -
tank-4TB/gridstorage13  compressratio  1.03x  -
tank-4TB/gridstorage14  compressratio  1.03x  -



NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  63.5T  15.4T  48.1T         -    11%    24%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.04x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.03x  -
tank-2TB/gridstorage05  compressratio  1.03x  -
tank-2TB/gridstorage06  compressratio  1.03x  -
tank-2TB/gridstorage07  compressratio  1.03x  -


Although not much data is stored on each of the machines so far, this shows that we can still reduce the used disk space by a few percent, 2-4% here depending on the file system and the data on it.
We have a bit more than 1PB of disk storage in total on our site, and the servers with 2TB disks provide about 50TB of usable storage each. If we can get 4% compression for all the data, that is roughly 40TB saved, so we would gain nearly the space provided by one of the 2TB-disk servers for free, without the cost of a new machine, power, extra disks, and so on! And that is just with the default compression, while the compression level can also be tuned in ZFS...
This saving could be even bigger if we consider that in the future sites will also store more non-LHC data, for example for LSST, which uses a different and possibly uncompressed file format.
Another positive aspect of compression is that it reduces disk I/O, since fewer data blocks need to be read from disk.

It will be interesting to see what the compression ratio looks like after all our servers have been switched over to ZFS.





by Marcus Ebert (noreply@blogger.com) at April 20, 2016 10:20

December 16, 2015

Tier1 Blog

RAL Tier1 – Plans for Christmas & New Year Holiday

RAL Tier1 – Plans for Christmas & New Year Holiday 2015/16

RAL closes at the end of the working day on Thursday 24th December and will re-open on Monday 4th January. During this time we plan for services at the RAL Tier1 to remain up. The usual on-call cover will be in place (as per nights and weekends). This cover will be enhanced by daily checks of key systems.

Furthermore we do not have support around the 25/26 December & 1st January for some site services we rely on. The impact of any failures around these particular dates may therefore be more extended. Also, over the holiday we have relaxed our expectation that the on-call person will respond within two hours, particularly on the specific dates just mentioned.

During the holiday we will check for tickets in the usual manner. However, only service critical issues will be dealt with.

The status of the RAL Tier1 can be seen on the dashboard at:

http://www.gridpp.rl.ac.uk/status/

Gareth Smith

by Gareth Smith at December 16, 2015 09:52

July 27, 2015

SouthGrid

Simple CVMFS puppet Module

Oxford was one of the first sites to test CVMFS and also to use the CERN CVMFS puppet module. Initially the installation of CVMFS was not well documented, so the CERN cvmfs puppet module was very helpful for installing and configuring cvmfs.
Installation became easier and clearer with newer versions of cvmfs. One of my ops actions was to install the gridpp multi-VO cvmfs repo with the CERN cvmfs puppet module. We realised that it is easier to write a trimmed-down version of the CERN cvmfs module than to use it directly. The result is the cvmfs_simple module, which is available on GitHub.

'include cvmfs_simple' will set up the LHC repos and the gridpp repo.

The only mandatory parameter is

cvmfs_simple::config::cvmfs_http_proxy : 'squid-server'

It is also possible to add a local cvmfs repository. Extra repos can be configured by passing values from hiera:

cvmfs_simple::extra::repo: ['gridpp', 'oxford']

Oxford is using a local cvmfs repo to distribute software to local users. oxford.pp can be used as a template for setting up a new local cvmfs repo.
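Putting the two settings above together, a site's hiera data might look roughly like the following; the file name and the proxy URL are placeholders, not our actual values:

# e.g. in common.yaml (file name and values are site-specific examples)
cvmfs_simple::config::cvmfs_http_proxy: 'http://squid-server.example.ac.uk:3128'
cvmfs_simple::extra::repo: ['gridpp', 'oxford']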

cvmfs_simple doesn't support all use cases, and it expects that everyone is using hiera ;). Please feel free to change it for your use case.

by Kashif Mohammad (noreply@blogger.com) at July 27, 2015 09:53

February 24, 2015

NorthGrid

Replacing the Condor Defrag Daemon

I've replaced the standard DEFRAG daemon released with Condor with a simpler version that contains a proportional-integral (PI) controller. I hoped this would give us better control over multicore slots. Preliminary results with the proportional part of the controller show that it fails to keep accurate control over the provision of slots. It is subject to hunting due to the long time lag between the onset of draining and the eventual change in the controlled variable (which is 'running mcore jobs'). The rate of provision was unexpectedly stable at first, considering the simplicity of the algorithm employed, but degraded over time as the controlled variable became more random.

The graph below shows the very preliminary picture, with a temporary period of stable control shown by the green line on the right of the plot. The setpoint is 250.




I have also now included an integral component in the controller, and I'm in the process of tuning its reset rate. I hope to show the results of this test soon.
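For anyone curious what the proportional part amounts to, here is a toy sketch. This is not the actual daemon: the job constraint, the gain value and the final call to condor_drain are all illustrative assumptions.

#!/bin/bash
# Toy proportional controller: decide how many startds to drain this cycle.
SETPOINT=250   # desired number of running multicore jobs
KP=0.1         # proportional gain (illustrative value)

# count running multicore jobs (assumes RequestCpus > 1 marks them as multicore)
RUNNING=$(condor_q -global -constraint 'JobStatus == 2 && RequestCpus > 1' \
          -format '%d\n' ClusterId | wc -l)

ERROR=$((SETPOINT - RUNNING))
# drain more nodes when below the setpoint, never a negative number of them
DRAIN=$(awk -v e="$ERROR" -v kp="$KP" 'BEGIN { d = int(e * kp); print (d > 0 ? d : 0) }')

echo "running=$RUNNING error=$ERROR startds_to_drain=$DRAIN"
# a real controller would now pick $DRAIN suitable startds and call condor_drain on each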

by Steve Jones (noreply@blogger.com) at February 24, 2015 12:09

November 17, 2014

NorthGrid

Condor Workernode Health Script

This is a script that makes some checks on the worker node and "turns it off" if it fails any of them. To implement this, I made use of a Condor feature: startd_cron jobs. I put this in my /etc/condor_config.local file on my worker nodes.

ENABLE_PERSISTENT_CONFIG = TRUE
PERSISTENT_CONFIG_DIR = /etc/condor/ral
STARTD_ATTRS = $(STARTD_ATTRS) StartJobs, RalNodeOnline
STARTD.SETTABLE_ATTRS_ADMINISTRATOR = StartJobs
StartJobs = False
RalNodeOnline = False

I use the prefix "Ral" here because I inherited some of this material from Andrew Lahiff at RAL! Basically, it's just to de-conflict names. I should have used "Liv" right from the start, but I'm not changing it now. Anyway, the first section says to keep a persistent record of configuration settings; it adds new configuration settings called "StartJobs" and "RalNodeOnline"; it sets them initially to False; and it makes the START configuration setting dependent upon them both being set. Note: the START setting is very important because the node won't start jobs unless it is True. I also need the next section. It tells the system (startd) to run a cron script every five minutes (300s).

STARTD_CRON_JOBLIST=TESTNODE
STARTD_CRON_TESTNODE_EXECUTABLE=/usr/libexec/condor/scripts/testnodeWrapper.sh
STARTD_CRON_TESTNODE_PERIOD=300s

# Make sure values get over
STARTD_CRON_AUTOPUBLISH = If_Changed
The testnodeWrapper.sh script looks like this:
#!/bin/bash

MESSAGE=OK

/usr/libexec/condor/scripts/testnode.sh > /dev/null 2>&1
STATUS=$?

if [ $STATUS != 0 ]; then
MESSAGE=`grep ^[A-Z0-9_][A-Z0-9_]*=$STATUS\$ /usr/libexec/condor/scripts/testnode.sh | head -n 1 | sed -e "s/=.*//"`
if [[ -z "$MESSAGE" ]]; then
MESSAGE=ERROR
fi
fi

if [[ $MESSAGE =~ ^OK$ ]] ; then
echo "RalNodeOnline = True"
else
echo "RalNodeOnline = False"
fi
echo "RalNodeOnlineMessage = $MESSAGE"

echo `date`, message $MESSAGE >> /tmp/testnode.status
exit 0

This just wraps an existing script which I reuse from our TORQUE/MAUI cluster. The existing script just returns a non-zero code if any error happens. To add a bit of extra info, I also look up the meaning of the code. The important thing to notice is that it echoes out a line to set the RalNodeOnline setting (to False when a check fails). This is then used in the setting of START. Note: on TORQUE/MAUI, the script ran as "root"; here it runs as "condor". I had to use sudo for some of the sections which (e.g.) check disks etc., because condor could not get smartctl readings and the like. Right, so I think that's it. When a node fails the test, START goes to False and the node won't run more jobs. Oh, there's another thing to say. I use two settings to control START. As well as RalNodeOnline, I have the StartJobs setting. I can control this independently, so I can turn a node offline whether or not it has an error. This is useful for stopping the node in order to (say) rebuild it. It's done on the server, like this.

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = false"
condor_reconfig r21-n01
condor_reconfig -daemon startd r21-n01
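To bring the node back online afterwards, the same mechanism is used in reverse, for example (using the same node name as above):

condor_config_val -verbose -name r21-n01 -startd -set "StartJobs = true"
condor_reconfig -daemon startd r21-n01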

by Steve Jones (noreply@blogger.com) at November 17, 2014 16:52

October 14, 2014

SouthGrid

Nagios Monitoring for Non LHC VO’s



A brief description of the monitoring framework before coming to the actual topic of non-LHC VO monitoring.
Service Availability Monitoring (SAM) is a framework for monitoring grid sites remotely. It consists of many components performing various functions. It can be broadly divided into:
‘What to Monitor’ or Topology Aggregation: collection of service endpoints and metadata from different sources like GOCDB, BDII, VOMS etc. A custom topological source (VO feeds) can also be used.
Profile Management: mapping of services to the tests to be performed. This service is provided by the POEM (Profile Management) database, which provides a web-based interface to group various metrics into profiles.
Monitoring: Nagios is used as the monitoring engine. It is automatically configured based on the information provided by the Topology Aggregator and POEM.
SAM software was developed under the EGEE project at CERN and is now maintained by EGI.
It is mandatory for grid sites to pass the ops VO functional tests to be part of WLCG. Every NGI maintains a regional SAM Nagios, and results from the regional SAM Nagios also go to the central MyEGI, which is used for reliability/availability calculations.
The UK Regional Nagios is maintained at Oxford,
with a backup instance at Lancaster.

VO-Nagios
There was no centralised monitoring of non-LHC VOs for a long time, and this contributed to a bad user experience, as it was difficult to tell whether a site was broken or the problem was at the user end. It was decided to host a multi-VO Nagios at Oxford as we had experience with WLCG Nagios.
It is currently monitoring five VOs:
gridpp
t2k
snoplus.snolab.ca
pheno
vo.southgrid.ac.uk

Sites can look for tests associated with only their site
VO managers may be interested to see tests associated with a particular VO only

We are using the VO-feed mechanism to aggregate site metadata and endpoint information. Every VO has a VO-feed available on a web server. Currently we are maintaining this VO-feed.

The VO-feed provides the list of services to be monitored. I am generating this VO-feed through a script.

Jobs are submitted using a proxy generated from a robot certificate assigned to Kashif Mohammad. These jobs are like normal grid user jobs and test things like the GCC version and CA version. Jobs are submitted every eight hours, and this is configurable. We are monitoring CREAM CE, ARC CE and SE only. Services like BDII, WMS etc. are already monitored by the regional Nagios, so there was no need for duplication.


For more information, these links can be consulted
https://tomtools.cern.ch/confluence/display/SAMDOC/SAM+Public+Site.html

by Kashif Mohammad (noreply@blogger.com) at October 14, 2014 11:08

October 08, 2014

London T2

XrootD and ARGUS authentication

A couple of months ago, I set up a test machine running XrootD version 4 at QMUL. This was to test three things:
  1. IPv6 (see blog post),
  2. central authorisation via ARGUS (the subject of this blog post),
  3. XrootD 4 itself.
We run StoRM/Lustre on our grid storage, and have run an XrootD server for some time as part of the ATLAS federated storage system, FAX. This allows local (and non-local) ATLAS users interactive access, via the xrootd protocol, to files on our grid storage.

For the new machine, I started by following ATLAS's Fax for Posix storage sites instructions. These instructions document how to use VOMS authentication, but not central banning via ARGUS. CMS, however, do have some instructions on using xrootd-lcmaps to do the authorisation, though with RPMs from different (and therefore potentially incompatible) repositories. It is, however, possible to get them to work.

The following packages are needed (or at least this is what I have installed):

  yum install xrootd4-server-atlas-n2n-plugin
  yum install argus-pep-api-c
  yum install lcmaps-plugins-c-pep
  yum install lcmaps-plugins-verify-proxy
  yum install lcmaps-plugins-tracking-groupid
  yum install xerces-c
  yum install lcmaps-plugins-basic

Now that the packages are installed, xrootd needs to be configured to use them - the appropriate lines in /etc/xrootd/xrootd-clustered.cfg are:


 xrootd.seclib /usr/lib64/libXrdSec.so
 xrootd.fslib /usr/lib64/libXrdOfs.so
 sec.protocol /usr/lib64 gsi -certdir:/etc/grid-security/certificates -cert:/etc/grid-security/xrd/xrdcert.pem -key:/etc/grid-security/xrd/xrdkey.pem -crl:3 -authzfun:libXrdLcmaps.so -authzfunparms:--osg,--lcmapscfg,/etc/xrootd/lcmaps.cfg,--loglevel,5|useglobals -gmapopt:10 -gmapto:0
 #                                                                              
 acc.authdb /etc/xrootd/auth_file
 acc.authrefresh 60
 ofs.authorize 1

And in /etc/xrootd/lcmaps.cfg it is necessary to change the path and the argus server (my argus server is obscured in the example below). My config file looks like this:

################################

# where to look for modules
#path = /usr/lib64/modules
path = /usr/lib64/lcmaps

good = "lcmaps_dummy_good.mod"
bad  = "lcmaps_dummy_bad.mod"
# Note put your own argus host instead of for argushost.mydomain
pepc        = "lcmaps_c_pep.mod"
             "--pep-daemon-endpoint-url https://argushost.mydomain:8154/authz"
             " --resourceid http://esc.qmul.ac.uk/xrootd"
             " --actionid http://glite.org/xacml/action/execute"
             " --capath /etc/grid-security/certificates/"
             " --no-check-certificates"
             " --certificate /etc/grid-security/xrd/xrdcert.pem"
             " --key /etc/grid-security/xrd/xrdkey.pem"

xrootd_policy:
pepc -> good | bad
################################################


Then after restarting xrootd, you just need to test that it works.
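A quick way to exercise the door is to read a file back with a VOMS proxy; a rough sketch, where the host, port and file path are placeholders rather than our real ones:

# get an ATLAS VOMS proxy and copy a file out through the xrootd door
voms-proxy-init --voms atlas
xrdcp root://xrootd.example.ac.uk:1094//atlas/path/to/some/file /tmp/xrootd-test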

It seems to work: I was successfully able to ban myself. Unbanning didn't work instantly, and I resorted to restarting xrootd, though perhaps if I'd had more patience it would have worked eventually.

Overall, whilst it wasn't trivial to do, it's not actually that hard, and is one more step along the road to having central banning working on all our grid services.



by Christopher J. Walker (noreply@blogger.com) at October 08, 2014 09:20

March 24, 2014

ScotGrid

The Three Co-ordinators

It has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.
We are presently going through a redevelopment of the site, the evaluation of new techniques for service delivery such as using Docker for containers, and updating multiple services throughout the sites.
We are presently, going through redevelopment of the site, the evaluation of new techniques for service delivery such as using Docker for containers and updating multiple services throughout the sites.

The development of the programme presented at CHEP on automation and different approaches to delivering HEP-related Grid services is underway. An evaluation of container-based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work on using Software Defined Networking hasn't progressed as quickly as we would have liked, but is still underway.

Graeme (left), Mark (center) and Gareth.

In other news, Gareth Roy is taking over as the ScotGrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University. And as Dr Who can do it, so can we: Co-ordinator Past, Present and Future all appear in the same place at the same time.

Will the fabric of Scotgrid be the same again?

Very much so.

by Mark Mitchell (noreply@blogger.com) at March 24, 2014 23:13

October 14, 2013

ScotGrid

Welcome to CHEP 2013

Greetings from CHEP 2013 in a rather wet Amsterdam.

The conference season is upon us and Sam, Andy, Wahid and I find ourselves in Amsterdam for CHEP 2013. CHEP started here in 1983 and it is hard to believe that it has been 18 months since New York.

As usual the agenda for the next 5 days is packed. Some of the highlights so far have included advanced facility monitoring, the future of C++ and Robert Lupton's excellent talk on software engineering for Science.

As with all of my visits to Amsterdam, the rain is worth mentioning. So much so that it made local news this morning. However, the venue is the rather splendid Beurs van Berlage in central Amsterdam.

CHEP 2013


There will be further updates during the week as the conference progresses.




by Mark Mitchell (noreply@blogger.com) at October 14, 2013 12:53

June 04, 2013

London T2

Serial Consoles over ipmi

To get serial consoles over ipmi working properly with Scientific Linux 6.4 (aka RHEL 6.4 / CentOS 6.4) I had to modify several settings, both in the BIOS and in the OS.

Hardware Configuration

For the Dell C6100 I set these settings in the BIOS:

Remote Access = Enabled
Serial Port Number = COM2
Serial Port Mode = 115200 8,n,1
Flow Control = None
Redirection After BIOS POST = Always
Terminal Type = VT100
VT-UTF8 Combo Key Support = Enabled


Note: "Redirection After Boot = Disabled" is required, otherwise I get a 5 minute timeout before booting the kernel. Unfortunately with this setup you get a gap in the output while the server attempts to pxeboot. However, you can interact with the BIOS, and once Grub starts you will see and be able to interact with the grub and Linux boot processes.

For the Dell R510/710 I set these settings in the BIOS:

Serial Communication = On with Console Redirection via COM2
Serial Port Address = Serial Device1=COM1,Serial Device2=COM2
External Serial Connector = Serial Device1
Failsafe Baud Rate = 115200
Remote Terminal Type = VT100/VT220
Redirection After Boot = Disabled


Note: With these settings you will be unable to see the progress of the kickstart install on the non-default console.

Grub configuration

In grub.conf you should have these two lines (they were there by default in my installs).

serial --unit=1 --speed=115200
terminal --timeout=5 serial console


This allows you to access grub via the consoles. The "serial" (ipmi) terminal will be the default unless you press a key when asked during the boot process. This only applies to grub and not to the rest of the Linux boot process.

SL6 Configuration

The last console specified in the linux kernel boot options is taken to be the default console. However, if the same console is specified twice this can cause issues (e.g. when entering a password the characters are shown on the screen!)

For the initial kickstart pxe boot I append "console=tty1 console=ttyS1,115200" to the linux kernel arguments. Here the serial console over ipmi will be the default during the install process, while the other console should echo the output of the ipmi console.

After install, the kernel argument "console=ttyS1,115200" had already been added to the kernel boot arguments. I have additionally added "console=tty1" before this; this may be required to enable interaction with the server via a directly connected terminal if needed.
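One way to add these arguments to every installed kernel after the fact is with grubby; this wasn't part of my setup, just an illustration:

# append the console arguments to the kernel line(s) in grub.conf
grubby --update-kernel=ALL --args="console=tty1 console=ttyS1,115200"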

With the ipmi port set as the default (the last console specified in the kernel arguments), SL6 will automatically start a getty for ttyS1. If it were not the default console we would have to add an upstart config file in /etc/init/. Note that SL6 uses upstart; previous SL5 console configurations in /etc/inittab are ignored!

e.g. ttyS1.conf

start on stopping rc runlevel [345]
stop on starting runlevel [S016]

respawn
exec /sbin/agetty /dev/ttyS1 115200 vt100



by Daniel Traynor (noreply@blogger.com) at June 04, 2013 14:49

October 02, 2012

National Grid Service

SHA2 certificates

We have started to issue certificates with the "new" more secure algorithms, SHA2 (or to be precise SHA256) - basically, it means that the hashing algorithm which is a part of the signature is more secure against attacks than the current SHA1 algorithm (which in turn is more secure than the older MD5).

But only to a lucky few, not to everybody.  And even they get to keep their "traditional" SHA1 certificates alongside the SHA2 one if they wish.

Because the catch is that not everything supports SHA2.  The large middleware providers have started worrying about supporting SHA2, but we only really know by testing it.

So what's the problem?  A digital signature is basically a one-way hash of something, which is encrypted with your private key: S=E(H(message)).  To verify the signature, you would re-hash the message, H(message), and also decrypt the signature with the public key (found in the certificate of the signer): D(S)=D(E(H(message)))=H(message) - and also check the validity of the certificate.

If someone has tampered with the message, the H would fail (with extremely high probability) to yield the same result, hence invalidate the signature, as D(S) would no longer be the same as H(tamper_message).

However, if you could attack the hash function and find a tamper_message which has the property that H(tamper_message)=H(message), then the signature is useless - and this is precisely the kind of problem people are worrying about today when H is SHA1 (and history repeats itself, since we went through the same thing for MD5 some years ago).

So we're now checking if it works. So far, we have started with PKCS#10 requests from a few lucky individuals; I'll do some SPKACs tomorrow. If you want one to play with, send us a mail via the usual channels (e.g. email or helpdesk).
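If you want to check what you are sending or being issued, openssl will show the signature algorithm; a minimal sketch, with the file names just examples:

# generate a key pair and a PKCS#10 request whose signature uses SHA-256
openssl req -new -newkey rsa:2048 -sha256 -keyout userkey.pem -out userreq.pem

# confirm which hash was used, in the request and in an issued certificate
openssl req -in userreq.pem -noout -text | grep 'Signature Algorithm'
openssl x509 -in usercert.pem -noout -text | grep 'Signature Algorithm'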

Eventually, we will start issuing renewals with SHA2, but only once we're sure that they work with all the middleware out there... we also take the opportunity to test a few modernisations of extensions in the certificates.

by Jens Jensen (noreply@blogger.com) at October 02, 2012 16:52

June 14, 2012

National Grid Service

Kick off - it's time for the NGS summer seminar series

In the midst of this summer of sport another event is kicking off soon but this time it's the NGS Summer Seminar series.

The first seminar will take place next Wednesday (20th June) at 10.30am (BST) and will give an overview of how accounting is done on the grid, and what it is used for.  It will cover the NGS accounting system at a high level and then go into more detail about the implementation of APEL, the accounting system for EGI, including the particular challenges involved and the plans for development.

The speaker will be Will Rogers from STFC Rutherford Appleton Laboratory who I'm sure would appreciate a good audience ready to ask lots of questions!

Please help spread the word about this event to any colleagues or organisations you think might be interested.  A Facebook event page is available so please invite your colleagues and friends!

by Gillian (noreply@blogger.com) at June 14, 2012 11:09

December 21, 2010

gLite/Grid Data Management

GFAL / LCG_Util 1.11.16 release


There has been no blog post for almost half a year. That does not mean that nothing has happened since then. We devoted enormous effort to some background work (automated test bed, nightly builds and test runs, the change from the EGEE era to EMI, etc.). We will test the tools and the procedures in the first months of 2011, and analyse whether they have added value and how they could be improved. As for the visible part, we (finally) released GFAL/LCG_Util 1.11.16 in November - see the release notes. Better late than never!

by zsolt rossz molnár (noreply@blogger.com) at December 21, 2010 14:11

October 29, 2010

Steve Lloyd's ATLAS Grid Tests

Upgrade to AtlasSetup

I have changed the setup for my ATLAS jobs so it uses AtlasSetup (rather than AtlasLogin). The magic lines are:

RELEASE=xx.xx.xx
source $VO_ATLAS_SW_DIR/software/$RELEASE/cmtsite/asetup.sh AtlasOffline $RELEASE

VO_ATLAS_SW_DIR is set up automatically and you have to set RELEASE yourself. Since AtlasSetup is only available from Release 16 onwards, jobs going to sites without Release 16 will fail.

by Steve Lloyd (noreply@blogger.com) at October 29, 2010 15:11

July 23, 2010

Steve Lloyd's ATLAS Grid Tests

Steve's pages update

I have done some much needed maintenance and the gstat information is available again (from gstat2). There is also a new page giving the history of the ATLAS Hammercloud test status: http://pprc.qmul.ac.uk/~lloyd/gridpp/hammercloud.html.

by Steve Lloyd (noreply@blogger.com) at July 23, 2010 09:04

November 16, 2009

MonAMI at large

watching the ink dry

Yeah, it's been far too long since the last bit of news, so here's an entry just to announce that MonAMI now has a new plugin: inklevel.

This plugin is a simple wrapper around Markus Heinz's libinklevel library. This is a nice library that allows easy discovery of how much ink is left in those expensive ink cartridges.

The library allows one to check the ink levels of Canon, Epson and HP printers. It can check printers directly attached (via the parallel port or USB port) or, for Canon printers, over the network via BJNP (a proprietary protocol that has been reverse engineered).

libinklevel supports many different printers, but not all of them. There's a small collection of printers that the library doesn't work with, and some printers that are listed neither as working nor as not working. If your printer isn't listed, please let Markus know whether libinklevel works or not.

Credit for the photo goes to Matthew (purplemattfish) for his picture CISS - Day 304 of Project 365.

by Paul Millar (noreply@blogger.com) at November 16, 2009 20:14

Trouble at Mill

With some unfortunate timing, it looks like the "Axis of Openness" webpages (SourceForge, Slashdot, Freshmeat, ...) have gone for a burton. There seem to be some networking problems with these sites, with web traffic timing out. Assuming the traceroute output is valid, the problem appears soon after traffic leaves the Santa Clara location of the Savvis network [dead router(s)?].

This is a pain because we've just done the v0.10 release of MonAMI and both the website and the file download locations are hosted by SourceForge. Whilst SourceForge is down, no one can download MonAMI!

If you're keen to try MonAMI in the meantime, you can download the RPMs from the (rough and ready) dev site:
http://monami.scotgrid.ac.uk/

The above site is generously hosted by the ScotGrid project [their blog].

Thanks guys!

by Paul Millar (noreply@blogger.com) at November 16, 2009 20:13

October 13, 2009

Monitoring the Grid

Finished

It should be obvious to anyone following this blog that the 10 weeks of this studentship project have long since ended; however, until now there were a few outstanding issues. I can now finally say that the project is finished and ready for public use. It can be found at http://epdt77.ph.bham.ac.uk:8080/webgrid/, although the link may change at some point in the future and the "add to iGoogle" buttons won't work for now.

The gadget is currently configured to use all available data from 2009. Specifically, it holds data for all jobs submitted in 2009, up to around mid-September (the most up-to-date data available from the Grid Observatory).

In addition to this I have produced a small report giving an overview of the project which is available here.

by Laurence Hudson (noreply@blogger.com) at October 13, 2009 16:45

September 17, 2009

Steve at CERN

Next Week EGEE 09

Next week is of course EGEE 09 in Barcelona. As a warm-up, the EGEE SA1 OAT section has a sneak preview:



http://www.youtube.com/watch?v=PADq2x8q0kw

by Steve Traylen (noreply@blogger.com) at September 17, 2009 16:03

July 31, 2009

Monitoring the Grid

Another Update (Day 30)

As week six comes to a close I thought it was about time for another progress update. So same as last time, stealing the bullet points from 2 posts back, with new additions in italics.

  • GridLoad style stacked charts.

  • A "League Table" (totaled over the time period).

  • Pie charts (of the "League Table").

  • Filters and/or sub filters (just successful jobs, just jobs by one VO, just jobs for this CE etc).

  • A tabbed interface.

  • Regular Expression based filtering

  • Variable Y-axis parameter (jobs submitted, jobs started, jobs finished etc).

  • Transition to university web servers.

  • Move to a more dynamic chart legend style. Not done for pie chart yet.

  • Ensure w3 standards compliance/cross-browser compatibility & testing.

  • Automate back end data-source (currently using a small sample data set). Need automatic renewal of grid certificates.

  • Variable X-axis time-step granularity.

  • Data/image export option.

  • A list of minor UI improvements. (About 10 little jobs, not worth including in this list as they would be meaningless, without going into a lot more detail about how the gadget's code is implemented).

  • Optimise database queries and general efficiency testing.

  • Make the interface more friendly. (Tool-tips etc.)

  • Possible inclusion of more "real time" data.

  • Gadget documentation.

  • A Simple project webpage.

  • A JSON data-source API reference page.

  • 2nd gadget, to show all known info for a given JobID.

  • 2nd gadget: Add view of all JobIDs for one user (DN string).


  • The items in this list are now approximately in the order I intend to approach them.

    On another note, I have finally managed to get some decent syntax highlighting for google gadgets, thanks to this blog post, even if it means being stuck with VIM. To get this to work, add the vim modeline to the very bottom of the xml gadget file; otherwise it tends to break things, such as the gadget title, if added at the top. Whilst VIM is not my editor/IDE of choice, it's pretty usable and can, with some configuration, match most of the key features (e.g. showing whitespace) that I use in Geany. However, Geany's folding feature would save a lot of time & scrolling.

    by Laurence Hudson (noreply@blogger.com) at July 31, 2009 15:11

    June 18, 2009

    Grid Ireland

    DPM 1.7.0 upgrade

    I took advantage of a downtime to upgrade our DPM server. We need the upgrade as we want to move files around using dpm-drain and don't want to lose space token associations. As we don't use YAIM I had to run the upgrade script manually, but it wasn't too difficult. Something like this should work (after putting the password in a suitable file):


    ./dpm_db_310_to_320 --db-vendor MySQL --db $DPM_HOST --user dpmmgr --pwd-file /tmp/dpm-password --dpm-db dpm_db


    I discovered a few things to watch out for along the way though. Here's my checklist:

    1. Make sure you have enough space on your system disk: I got bitten by this on a test server. The upgrade script needs a good chunk of space (comparable to that already used by the MySQL DB?) to perform the upgrade
    2. There's a mysql setting you probably need to tweak first: add set-variable=innodb_buffer_pool_size=256M to the [mysqld] section in /etc/mysql.conf and restart mysql. Otherwise you get this cryptic error:

      Thu Jun 18 09:02:30 2009 : Starting to update the DPNS/DPM database.
      Please wait...
      failed to query and/or update the DPM database : DBD::mysql::db do failed: The total number of locks exceeds the lock table size at UpdateDpmDatabase.pm line 19.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().


      Also worth noting is that if this happens to you, when you try to re-run the script (or YAIM) you will get this error:

      failed to query and/or update the DPM database : DBD::mysql::db do failed: Duplicate column name 'r_uid' at UpdateDpmDatabase.pm line 18.
      Issuing rollback() for database handle being DESTROY'd without explicit disconnect().


      This is because the script has already done this step. You need to edit /opt/lcg/share/DPM/dpm-db-310-to-320/UpdateDpmDatabase.pm and comment out this line:

      $dbh_dpm->do ("ALTER TABLE dpm_get_filereq ADD r_uid INTEGER");


      You should then be able to run the script to completion.

    by Stephen Childs (noreply@blogger.com) at June 18, 2009 09:19

    June 08, 2009

    Grid Ireland

    STEP '09 discoveries

    ATLAS have been giving our site a good thrashing over the past week, which has helped us shake out a number of issues with our setup. Here's some of what we've learned.

    Intel 10G cards don't work well with SL4 kernels

    We're currently upgrading our networking to 10G and had it mostly in place by the time STEP'09 started. However, we discovered that the stock SL4 kernel (2.6.9) doesn't support the ixgbe 10G driver very well. It was hard to detect because we could get reasonable transmit performance but receive was limited to 30Mbit/s! It's basically an issue with interrupts (MSI-X and multi-queue weren't enabled). I compiled up a 2.6.18 SL5 kernel for SL4 and that works like a charm (once you've installed it using --nodeps).

    It's worth tuning RFIO

    We had loads of atlas analysis jobs pulling data from the SE and they were managing to saturate the read performance of our disk array. See this NorthGrid post for solutions.

    Fair-shares don't work too well if someone stuffs your queues

    We'd set up shares for the various different atlas sub-groups but the generic analysis jobs submitted via ganga were getting to use much more time. On digging deeper with Maui's diagnose -p I could see that the length of time they'd been queued was overriding the priority due to fairshare. I was able to fix this by increasing the value of FSWEIGHT in Maui's config file.

    You need to spread VOs over disk servers

    We had a nice tidy setup where all the ATLAS filesystems were on one DPM disk server. Of course this then got hammered ... we're now trying to spread out the data across multiple servers.

    by Stephen Childs (noreply@blogger.com) at June 08, 2009 12:53

    March 23, 2009

    Steve at CERN

    Installed Capacity at CHEP

    This week is CHEP 09, preceded by the WLCG workshop. I presented some updates on the roll-out of the installed capacity document. It included examples of a few sites that would have zero capacity if considered under the new metrics.
    Sites should consider taking the following actions.

    • Check gridmap. In particular the view obtained by clicking on the more label and selecting the "size by SI00 and LogicalCPUs".
    • Adjust your published #LogicalCPUS in your SubCluster. It should correspond to the number of computing cores that you have.
    • Adjust your #Specint2000 settings in the SubCluster. The aim is to make your gridmap box the correct size to represent your total site power.
    The follow-up questions were the following; now is a chance for a more reflected response.
    1. Will there be any opportunity to run a benchmark within the gcm framework?
      I answered that this was not possible since, unless it could be executed in under 2 seconds, there was no room for it. Technically there would not be a problem with running something for longer; it could be run rarely. We should check how the first deployment of GCM goes; longer tests are in no way planned though.
    2. What is GCM collecting and who can see its results?
      Currently no one can see anything on the wire since the messages are encrypted. There should be a display at https://gridops.cern.ch/gcm; it is currently down, but once it is up it will be accessible to IGTF CA members. For now there are some test details available.
    3. When should sites start publishing the HEPSpecInt2006 benchmark?
      The management contributed "Now", which is of course correct; the procedure is well established. Sites should be in the process of measuring their clusters with the HEPSpec06 benchmark. With the next YAIM release they will be able to publish the value as well.
    4. If sites are measuring these benchmarks can they the values be made available on the worker nodes to jobs?
      Recently the new glite-wn-info made it as far as the PPS service. This allows the job to find, on the WN, to which GlueSubCluster it belongs. In principle this should be enough; the Spec benchmarks can be retrieved from the GlueSubClusters. The reality of course is that until some future date when all the WNWG recommendations are deployed along with CREAM, this is not possible. So for now I will extend glite-wn-info to also return a HepSpec2006 value as configured by the site administrators.
    5. Do you know how many sites are currently publishing incorrect data?
      I did not know the answer, nor is an answer easy to come by other than collecting the ones of zero size. Checking now: of 498 (468 unique?) SubClusters, some 170 have zero LogicalCPUs.
    On a more random note, a member of CMS approached me afterwards to thank me for the support I gave him 3 or so years ago while working at RAL. At the time we both had an interest in making the grid work. He got extra queues, reserved resources, process dumps and general job watching from me. It was the first set of grid jobs we had that approached something similar to the analysis we now face. According to the gentleman, from his grid experience and results using RAL he obtained his doctorate, and CMS chose to use the grid.

    by Steve Traylen (noreply@blogger.com) at March 23, 2009 10:11

    January 07, 2009

    GridPP Operations

    OpenSSL vulnerability

    There is a new vulnerability in OpenSSL in all versions prior to 0.9.8j, discovered by Google's security team. You will be happy to learn that the Grid PKI is not affected by the vulnerability since it uses RSA signatures throughout - only DSA signatures and ECDSA (DSA but with Elliptic Curves) are affected. (Of course you should still upgrade!)

    by Jens Jensen (noreply@blogger.com) at January 07, 2009 21:06

    January 05, 2009

    GridPP Operations

    New MD5 vulnerability announced

    In 2006 two different MD5-signed certificates were created. A new stronger attack, announced last Wednesday (yes, 30 Dec), allows the attacker to change more parts of the certificate, including the subject name. To use this "for fun and profit" one gets an MD5 end entity certificate from a CA (ideally one in the browser's keystore), and hacks it to create an intermediate CA which can then issue

    by Jens Jensen (noreply@blogger.com) at January 05, 2009 11:12