March 16, 2010

National Grid Service

Have you tried our UI-WMS yet?

It sounds painful but over 50 users since November have proved that it doesn't have to be!

The NGS UI-WMS service has accumulated over 50 users since it was released in November last year, a substantial fraction of the active NGS user base. Users are using the service for all kinds of work from computational chemistry through to bioinformatics.

Their work is automatically distributed by the WMS over the entire NGS, rather than just using one or two sites. If you would like to find out more about how this capability can improve your workflow, check the case study in the last NGS Newsletter (page 5-11) or the resources and tutorials linked from the UI-WMS page. Most of the popular pre-installed applications already have examples of how to run them via the UI-WMS.

If you have any questions about the UI-WMS and how it can help you and your research, email our helpdesk - support(at)grid-support(dot)ac(dot)uk.

by Gillian (noreply@blogger.com) at March 16, 2010 15:59

gLite/Grid Data Management

Vnode works

How do you work with virtual nodes, FTS and GFAL at CERN? A guide for developers can be found here.

by Zsolt Molnar (noreply@blogger.com) at March 16, 2010 13:00

March 12, 2010

National Grid Service

Let's get together

And so back from the OMII-UK collaborative workshop which took place in Edinburgh Wednesday and Thursday this week.

Like last year's event, it was a really interesting and productive 2 days with many actions resulting from the in-depth discussion groups. The great thing about the workshop is that it's not death by PowerPoint. The sessions are discussion sessions where delegates choose from a range of topics, disappear off into groups in various parts of NeSC and then after an hour everyone reconvenes and reports back their findings. Importantly it's not just findings but short-term and long-term actions that are reported back!

Unsurprisingly, as the NGS outreach person, I attended several sessions on collaboration and was the "report back" person for the session on "assistance with publicity, outreach and dissemination", which had a healthy audience (obviously a pertinent topic!).

There were many findings over the past two days and these can be found ordered by session on the event website. However the one that really stuck in my mind was something that was raised by several of the researchers who attended the meeting.

They have things they need done which they can't do themselves, e.g. code written, or advice on optimising or tweaking existing software. They need computer scientists for this but they don't know how to "hook up" with them, for want of a better description! The things they need done quite often aren't large projects and may only take a few weeks or less, but they don't know where to go for help, who to approach and who is best suited for the job.

Several possible ways of tackling this problem were suggested at the event and if you are interested in the ideas that came up, the slides from this particular topic can be found on the website under - "Long-term researcher-driven collaborations" and "Collaboration 2". This was such a popular topic that we ended up with 2 discussion sessions.

So the question is for those research scientists out there who work with computer scientists - how did you hook up? How did you get together and form a collaboration?

by Gillian (noreply@blogger.com) at March 12, 2010 15:03

ScotGrid

SSDs - the testing begins!

This Monday (finally!) we received (half) of the SSDs we ordered for our storage testing plans.
These are the Intel G2 X-25s which are intended to represent the mid-range of the SSDs available currently (the low end ones are still due to arrive, and our high end card is being tested differently).

Just as a sneak preview, we had a chance to run iozone against one of the X-25s, in the same configuration as I've previously run against our newer disk servers (in RAID6 mode). As you can see from the graphs below, the SSDs behave exactly as we'd expect - the throughput is almost identical on random or ordered reads, whilst the RAID array suffers significantly from having to seek. Indeed, although the 22 drives in the array give it much better read performance when not seeking, the single X25 seems to equal the RAID array's performance when seeking is needed...





Next thing on the list is testing them in Worker nodes against Analysis and Production workloads.

by Sam Skipsey (noreply@blogger.com) at March 12, 2010 11:26

LHCb Production Failures

Over the last week we have been investigating why we have a failure rate of around 50% with LHCb jobs. All seem to be failing with the same issue: intermittently they are unable to copy their results back to the Tier 0 or to the subsequent fail-over Tier 1 site. This is not strictly just a Glasgow issue; it has also affected Sheffield and Brunel, although the issue appears to have gone away at Brunel.

We have tried pretty much everything. Simple lcg-ls and lcg-cp commands actually work from the worker nodes, so it's not a certificate issue. The failures are not particular to a CE. Nothing changed at our site prior to the failures and LHCb say nothing changed at their end. In fact they have sites in the UK, such as Manchester, working fine.

The failures do not correspond to a particular set of worker nodes, which is what we would expect to see if it were a NAT issue, since we split our odd and even nodes across separate NATs. However, it does look like network contention at some point in the process, as we see either broken pipes or timeouts in the logs direct from Globus.


2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: SRM2Storage.__putFile: Failed to put file to storage. file:/tmp/8230840/CREAM603030715/7472318/00005987_00009161_3.dst: globus_xio: System error in writev: Broken pipe
2010-03-04 04:04:56 UTC dirac-jobexec.py ERROR: globus_xio: A system call failed: Broken pipe


The only constant so far is that there appears to be a 50% failure rate from failed uploads, which happens consistently for submissions from DIRAC.

It's certainly a puzzler and we are fast running out of ideas!

by dug mcnab (noreply@blogger.com) at March 12, 2010 09:51

NATs Maxing Out

During our investigation of our LHCb failures we noticed that the conntrack entries on our two NAT hosts were being completely used up, i.e. all 43200! Looking at /proc/net/ip_conntrack, we saw that most of the connections were UDP DNS lookups by Camont jobs. We also noticed that we had not changed the default timeouts, 32768 for TCP and 3600 for UDP, which was probably the reason the entries were being used up. So we have tweaked the timeouts and increased the maximum.
So our new NAT settings look like this:

/etc/sysctl.conf
# original values of 43200, 32768, 3600 respectively.
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 21600
net.ipv4.netfilter.ip_conntrack_max = 65536
net.ipv4.netfilter.ip_conntrack_udp_timeout = 30

Now our NATs look much healthier. Only problem: it didn't help with LHCb production jobs not being able to upload their results back to CERN. Back to the drawing board.
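For anyone wanting to repeat the check on their own NAT hosts, here is a minimal Python sketch of how conntrack entries could be tallied by protocol and destination port. It is not the tool we used; the function name tally_conntrack and the sample lines (with documentation-range addresses) are made up for illustration, and it assumes the usual /proc/net/ip_conntrack layout where each line starts with the protocol name and carries key=value fields.

```python
from collections import Counter


def tally_conntrack(lines):
    """Count conntrack entries by (protocol, destination port).

    Assumes each line starts with the protocol name (tcp, udp, ...) and
    that the first dport= field describes the original direction.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        proto = fields[0]
        dport = next(
            (f.split("=", 1)[1] for f in fields if f.startswith("dport=")),
            None,
        )
        counts[(proto, dport)] += 1
    return counts


# Hypothetical sample entries, NOT taken from our hosts.
SAMPLE = [
    "udp      17 25 src=10.141.0.5 dst=192.0.2.53 sport=40001 dport=53 "
    "src=192.0.2.53 dst=198.51.100.1 sport=53 dport=40001 use=1",
    "udp      17 28 src=10.141.0.6 dst=192.0.2.53 sport=40002 dport=53 "
    "[UNREPLIED] src=192.0.2.53 dst=198.51.100.1 sport=53 dport=40002 use=1",
    "tcp      6 431999 ESTABLISHED src=10.141.0.7 dst=192.0.2.80 sport=50000 "
    "dport=2811 src=192.0.2.80 dst=198.51.100.1 sport=2811 dport=50000 "
    "[ASSURED] use=1",
]

if __name__ == "__main__":
    for (proto, dport), n in tally_conntrack(SAMPLE).most_common():
        print(proto, dport, n)
```

On a real NAT host the same function could be fed the open file instead of SAMPLE, e.g. with open("/proc/net/ip_conntrack") as f: tally_conntrack(f). A large count for ("udp", "53") would point at DNS lookups.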

by dug mcnab (noreply@blogger.com) at March 12, 2010 09:39

March 08, 2010

National Grid Service

OMII go visual

Our colleagues at OMII-UK have recently launched the OMII-UK YouTube channel which lets you view the latest software demonstrations and videos from their staff and collaborators.

The channel will grow over the next few months but currently it contains software demonstrations of Campus Grid Toolkit, Middleware interoperation, OSCAR, Rapid and JSDL Applications Repository. However it's not all demonstrations, as they also have videos showing the sustainability lecture recently presented at NeSC in Edinburgh by Neil Chue Hong (OMII-UK's Director) and interviews with OMII-UK's PIs and some of their partners.

The OMII-UK collaboration workshop will take place in Edinburgh this week so look out for blog posts from the various NGS staff attending including myself.

by Gillian (noreply@blogger.com) at March 08, 2010 13:25

March 04, 2010

Tier1 Blog

ATLAS memory limit changes

A small number of ATLAS jobs are starting to require memory in excess of 3GB.  If these jobs are killed by the batch system it makes it much harder for any problems to be debugged.  As a result ATLAS requested a change to the memory limit of its batch jobs from 3GB to 4GB.

The majority of worker nodes have 8 cores with 16 GB of RAM.  The RAM is overcommitted by 50%, which allows eight 3 GB jobs to run on the same worker node.  With ATLAS’ request applied, a worker node can run a maximum of six 4 GB jobs, thus blocking 2 job slots.  If the entire cluster were running just ATLAS jobs there could be a reduction in capacity of 25%.  However, as well as ATLAS jobs the worker nodes also run jobs requiring 2, 1 and 0.5 GB of RAM.  It is thus not immediately obvious what effect this change would have on the batch system.
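The arithmetic behind those numbers can be checked with a short Python sketch. The function name max_jobs_per_node is purely illustrative, and the model is a simplification: it assumes every job on the node requests the same memory limit.

```python
def max_jobs_per_node(ram_gb, overcommit, job_limit_gb, cores):
    """Jobs that fit on one node, limited by both memory and cores.

    Simplified model: every job requests the same memory limit.
    """
    usable_gb = ram_gb * (1 + overcommit)  # 16 GB overcommitted by 50% -> 24 GB
    return min(cores, int(usable_gb // job_limit_gb))


before = max_jobs_per_node(ram_gb=16, overcommit=0.5, job_limit_gb=3, cores=8)
after = max_jobs_per_node(ram_gb=16, overcommit=0.5, job_limit_gb=4, cores=8)
blocked_slots = before - after
capacity_loss = blocked_slots / before

if __name__ == "__main__":
    print(before, after, blocked_slots, capacity_loss)
```

With a 3 GB limit the node is core-bound (8 jobs); with a 4 GB limit it becomes memory-bound (6 jobs), which is where the 2 blocked slots and the worst-case 25% come from.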

As ATLAS are the only VO that runs 4 GB jobs this change is most likely to have the largest effect on the batch farm when ATLAS is trying to run a lot of jobs.  Over the weekend of the 20th – 21st February ATLAS have been doing a lot of re-processing.  This involves the merging of lots of small files in MCDisk.  There were also some Monte Carlo jobs running.

The plot shows the performance of the batch farm over the weekend.  For the first half of the weekend the number of blocked job slots remained fairly constant at around 30, or ~1% of running jobs.  On Sunday this started to rise and is currently at around 500 blocked jobs.  This rise can be attributed to the fact that ATLAS jobs (since Sunday) have been having problems.  This has taken worker nodes offline.  However, the number of ATLAS jobs has remained roughly constant while the total number of job slots available has dropped, leading to more machines having to run ATLAS jobs.  The current number of blocked jobs is roughly 16% of the running jobs.

by Alastair Dewhurst at March 04, 2010 16:14

GridPP Storage

Checksumming and Integrity: The Challenge

One key focus of the Storage group as a whole at the moment is the thorny issue of data integrity and consistency across the Grid. This turns out to be a somewhat complicated, multifaceted problem (the full breakdown is on the wiki here), and one of which some parts have already been solved by some of the VOs.
ATLAS, for example, has some scripts managed by Cedric Serfon which do the checking of data catalogue consistency correctly, between ATLAS's DDM system, the LFC and the local site SE. They don't, however, do file checksum checks, and therefore there is potential for files to be correctly placed, but corrupt (although this would be detected by ATLAS jobs when they run against the file, since they do perform checksums on transferred files before using them).
The Storage group has an integrity checker which does checksum and catalogue consistency checks between LFC and the local SE (in fact, it can be run remotely against any DPM), but it's much slower than the ATLAS code (mainly because of the checksums).

Currently, the plan is to split effort between improving VO-specific scripts (adding checksums) and enhancing our own script. One issue of key importance is that the big VOs will always be able to write better scripts, specific to their own data management infrastructures, than we will, but the small VOs deserve help too (perhaps more so than the big ones), and all these tools need to be interoperable. One aspect of this that we'll be talking about a little more in a future blog post is standardisation of input and output formats: we're planning on standardising on SynCat, or a slightly-derived version of SynCat, as a dump and input specifier format.

This post exists primarily as an informational post, to let people know what's going on. More detail will follow in later blog entries. If anyone wants to volunteer their SE to be checked, however, we're always interested...

by Sam Skipsey (noreply@blogger.com) at March 04, 2010 04:26

March 03, 2010

National Grid Service

GlobusWorld 2010

Globus is one of the most prevalent middlewares in use for Grid computing and is used extensively in the NGS. GlobusWorld is the main shindig for globus developers and users to share their experiences of the software. This year's GlobusWorld is being held at Argonne National Laboratory, on the outskirts of Chicago.

The NGS presentation on "Authentication in the NGS" went down well with interest in MEG and the Certificate Wizard. It was also interesting to find out that GSI-SSHTerm is now embedded into the US Teragrid portal.

Certificate ease of use issues were echoed by other institutions who all seemed to be doing their own integration of identity provider and grid software so users can use their institutional credentials to get short-lived certificates.

There have also been many interesting talks about the latest advances in globus-related Middleware, in particular version 5 of the Globus Toolkit (gt5) and their cloud-like service globus.org.

Tomorrow is the final day of the conference with tutorial sessions on the Globus Toolkit and globus.org.

Further information on the above NGS tools/services can be found under Use the NGS / User Tools on the NGS Web site.

by JK (john.kewley@stfc.ac.uk) at March 03, 2010 23:12

Tier1 Blog

ATLAS data transfers problems

There are two primary ways in which ATLAS moves data around the grid: DDM transfers, which simply move data around the grid, and jobs, which may read or write data from one site before processing it and then registering it to another site.

Over the weekend of the 13th – 14th February 2010, RAL experienced problems with data transfers to Tier 2 sites.  Many jobs were stuck in a transferring state despite there being no obvious problems.  The problem was initially thought to be related to a large transfer rate from CMS jobs; however, it remained even when the CMS rate dropped significantly.  It was eventually traced to a problem with the ATLAS LSF (which schedules the data transfers), which was solved by minor configuration changes and a reboot.

The following plots are taken from the panda graph generator which can be found at:

http://gridinfo.triumf.ca/panglia/graph-generator/

It shows the panda (i.e. production) jobs for the entire UK for the past week.  The majority of these jobs will have transferred their data to RAL once finished.  The binning along the bottom is slightly confusing.  The word (Fri, Sat etc.) is at the centre of the bin, so at midday.  The re-processing started on Friday around 8pm.  Graeme Stewart noticed that our problems started at around midnight and you can see that the red line indicating the number of transferring jobs rises considerably.  The total number of running jobs also seems to drop slightly.  This drop was not uniform: some sites such as Manchester found they had no running jobs at one point.

Chris Kruk identified the problem over the weekend and was able to implement his fix at around 9-10pm on Sunday night. You can see that the number of transferring jobs dropped steadily after that.   No new jobs were coming in because the re-processing was finished.  On Tuesday morning the next stage of the ATLAS production was started, as you can see from the jump in running jobs.

The results from the UK can be compared with those of the German (top) and French (bottom) Clouds for the same time period:

The French cloud was slightly special in that it had to have a whole bunch of jobs aborted which are shown as the big failed spike.  Also it was given more data than it expected to re-process which is why it didn’t finish by Monday like the UK and Germany.

The following 3 tables show the transfer rate, the number of completed (successful) file transfers and the total number of transfer errors for the MC Disk for the UK, German and French Clouds for each day over the last week.  There is unfortunately no better resolution for the transfer rate.

As you can see the total number of file transfers was higher for the UK but not by that much.  By the final day the merging step of the re-processing had started which is why the files being transferred are so much larger.

The error rates are slightly misleading.  The majority of the transfer errors for the UK cloud were caused by Tier 2 sites having problems.  For example, on 16/2 ~10000 errors were caused by QMUL.  The error rate from the failing jobs can be better estimated from the light green line on the plots.

UK CLOUD
Date     Throughput (MB/s)  Completed File Transfers  Total Number of Transfer Errors
12/2/10  7                  55231                     6771
13/2/10  7                  50746                     4470
14/2/10  10                 45838                     3384
15/2/10  6                  45625                     6502
16/2/10  5                  29157                     12156
17/2/10  46                 2506                      71

DE CLOUD
Date     Throughput (MB/s)  Completed File Transfers  Total Number of Transfer Errors
12/2/10  10                 47283                     187
13/2/10  7                  37613                     75
14/2/10  10                 40698                     102
15/2/10  2                  8531                      2
16/2/10  6                  24154                     24
17/2/10  41                 4638                      1

FR CLOUD
Date     Throughput (MB/s)  Completed File Transfers  Total Number of Transfer Errors
12/2/10  15                 52125                     1579
13/2/10  11                 29609                     207
14/2/10  11                 35318                     11
15/2/10  7                  38575                     1227
16/2/10  6                  26014                     9742
17/2/10  93                 7541                      3444

The final table shows the estimated average file size of each file transferred.  This was calculated from Throughput * 86400 / Completed File Transfers.  (86400 = number of seconds in the day; for the Thursday, when the day hadn’t been completed, I took 16 hours.)

It is clear that the average file size to the UK was significantly smaller than both DE and French clouds when we experienced the problem.  However as said before the number of files transferred wasn’t that much higher and certainly not that unusual in terms of what we can expect from re-processing.

Estimated average file size (MB)
Date     UK    DE    FR
12/2/10  11.0  18.3  24.9
13/2/10  11.9  16.1  32.1
14/2/10  18.8  21.2  26.9
15/2/10  11.4  20.3  15.7
16/2/10  14.8  21.5  19.9
17/2/10  1060  509   710
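The estimate in the final table can be reproduced from the three transfer tables with a few lines of Python. This is only a sketch of the formula quoted above (Throughput * seconds / Completed File Transfers); the function name is illustrative, and the UK figures are copied from the UK table.

```python
SECONDS_PER_DAY = 86400


def estimated_file_size_mb(throughput_mb_s, completed_transfers,
                           seconds=SECONDS_PER_DAY):
    """Average MB per file = (MB/s * seconds in the period) / files moved."""
    return throughput_mb_s * seconds / completed_transfers


# (date, throughput in MB/s, completed file transfers) from the UK table.
UK = [
    ("12/2/10", 7, 55231),
    ("13/2/10", 7, 50746),
    ("14/2/10", 10, 45838),
    ("15/2/10", 6, 45625),
    ("16/2/10", 5, 29157),
]

if __name__ == "__main__":
    for date, rate, files in UK:
        print(date, round(estimated_file_size_mb(rate, files), 1))
    # The last day was incomplete, so only 16 hours are counted.
    print("17/2/10", round(estimated_file_size_mb(46, 2506, seconds=16 * 3600)))
```

Because the throughput is only available as a whole number of MB/s, the estimates are coarse; they are good enough to show the jump in file size once the merging step started.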

by Alastair Dewhurst at March 03, 2010 16:35

March 02, 2010

Tier1 Blog

Low efficiency LHCb jobs

Jobs on the batch farm are automatically killed if they exceed a certain amount of wall or CPU time.  This, however, doesn’t prevent inefficient jobs from running.  Job efficiency is defined as CPU time / Wall time.  One of the most common causes of job inefficiency is a job waiting on files.  It was noticed that LHCb seemed to have a large number of inefficient jobs compared to the other LHC VOs.  One possible reason for this is that LHCb allows users to run their analysis at the Tier 1 and their code may not be well designed.  However, it is also possible for jobs that are using the same computing resources to affect each other. There could be a hardware bottleneck which means that all LHCb jobs slow down once a certain threshold is passed.

One way to test if this is happening is by producing a plot of the throughput against the number of running jobs.  The throughput is defined as the Job Efficiency x number of jobs running.  If the number of jobs is affecting the efficiency then the plot should reach a plateau.  The plot shows just this for each of the 4 LHC experiments. As there is no plateau, the inefficiency is assumed to be due to poor coding.
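As a rough illustration of the test described above, the following Python sketch computes the throughput from the two definitions given (efficiency = CPU time / wall time, throughput = efficiency x running jobs) and applies a very crude plateau check. The function looks_like_plateau, its tolerance and the two sample series are invented for illustration; they are not the batch farm data and not the method used to make the plot.

```python
def job_efficiency(cpu_time, wall_time):
    """Job efficiency as defined in the post: CPU time / wall time."""
    return cpu_time / wall_time


def throughput(efficiency, running_jobs):
    """Throughput as defined in the post: efficiency x running jobs."""
    return efficiency * running_jobs


def looks_like_plateau(samples, tolerance=0.05):
    """Crude check on (running_jobs, efficiency) samples sorted by job count.

    Returns True if the last increase in running jobs raised the
    throughput by no more than `tolerance` (as a fraction).
    """
    values = [throughput(eff, jobs) for jobs, eff in samples]
    return values[-1] - values[-2] <= tolerance * values[-2]


# Hypothetical series: constant efficiency vs. efficiency that collapses.
NO_BOTTLENECK = [(100, 0.5), (200, 0.5), (400, 0.5)]
BOTTLENECK = [(100, 0.5), (200, 0.5), (400, 0.25)]

if __name__ == "__main__":
    print(looks_like_plateau(NO_BOTTLENECK), looks_like_plateau(BOTTLENECK))
```

In the first series the throughput keeps growing with the number of jobs, so the low efficiency is a property of the jobs themselves; in the second it flattens, which is what a shared hardware bottleneck would look like.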

by Alastair Dewhurst at March 02, 2010 17:55

National Grid Service

Blowing your trumpet!

So two NGS roadshows last week down south and both went very well! It's always nice to see our users present their work as it's the best way to really find out what you are all up to out there! Quite often the research presented is literally hot off the machine so the roadshow audiences are getting a truly up to the minute view of research being performed on the NGS.

As the liaison officer it's my job to let the world know what the NGS and its users are up to so we can advertise your research and our resources through a wide variety of publications, events etc. We have recently had some of our users' research featured in SCW (the role of geographic isolation and dispersal limitation in generating high endemic plant species diversity) and in iSGTW in an article on "Supporting the arts and humanities with e-science".

We also feature our users' research on our website in our case studies section, posters, research papers etc. If you have anything you would like added into these sections (particularly the research paper section) then please contact me at support@grid-support(dot)ac(dot)uk.

by Gillian (noreply@blogger.com) at March 02, 2010 15:48

March 01, 2010

ScotGrid

local users before pool users

Further to the original post by Graeme, 'to voms or not to voms': the Nikhef documentation has been thoroughly overhauled and I have now been able to switch lcmaps in CREAM and SCAS over to use local unix group mappings before pool accounts, if they exist.

The main change is to make localaccount pull in the Glasgow-centric grid-mapfile.

localaccount = "lcmaps_localaccount.mod"
" -gridmapfile /usr/local/etc/grid-mapfile-local"
# " -gridmapfile /etc/grid-security/grid-mapfile"

Some small tweaks are required to move localaccount from the last check to the first check. If this is successful it uses that account, otherwise it moves to check voms and pool accounts.

glexec_get_account:
proxycheck -> localaccount
localaccount -> good | vomslocalgroup
#proxycheck -> vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good #| localaccount

glexec_verify_account:
proxycheck -> localaccount
localaccount -> good | vomslocalgroup
#proxycheck -> vomslocalgroup
vomslocalgroup -> vomspoolaccount | poolaccount
vomspoolaccount -> good | vomslocalaccount
vomslocalaccount -> good | poolaccount
poolaccount -> good #| localaccount
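
To make the evaluation order easier to follow, here is a toy Python model of the policy above. It is only a sketch under stated assumptions, not LCMAPS itself: it assumes a rule "a -> b | c" means "run a, go to b on success and to c on failure", that a missing branch ends the policy with the current result, and that "good" always succeeds. The names POLICY and evaluate are invented for the illustration.

```python
# module -> (next module on success, next module on failure); None = stop.
# Transcribed from the glexec_get_account policy, under the assumptions
# stated above.
POLICY = {
    "proxycheck": ("localaccount", None),
    "localaccount": ("good", "vomslocalgroup"),
    "vomslocalgroup": ("vomspoolaccount", "poolaccount"),
    "vomspoolaccount": ("good", "vomslocalaccount"),
    "vomslocalaccount": ("good", "poolaccount"),
    "poolaccount": ("good", None),
}


def evaluate(policy, start, outcomes):
    """Walk the policy from `start`.

    `outcomes` maps module name -> bool (does that module succeed for this
    user?). Modules not listed are treated as failing. Returns a tuple of
    (overall success, list of modules visited).
    """
    current, path = start, []
    while True:
        path.append(current)
        ok = True if current == "good" else outcomes.get(current, False)
        rule = policy.get(current)
        nxt = rule[0 if ok else 1] if rule else None
        if nxt is None:
            return ok, path
        current = nxt


# A user listed in the local grid-mapfile is mapped straight away ...
LOCAL_USER = {"proxycheck": True, "localaccount": True}
# ... while anyone else falls through to the VOMS pool accounts.
POOL_USER = {"proxycheck": True, "localaccount": False,
             "vomslocalgroup": True, "vomspoolaccount": True}

if __name__ == "__main__":
    print(evaluate(POLICY, "proxycheck", LOCAL_USER))
    print(evaluate(POLICY, "proxycheck", POOL_USER))
```

The point of the reordering is visible in the paths: localaccount is now consulted before any of the VOMS or pool modules.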

SCAS works in the same way and all that is required is to change the localaccount setting to pull in our Glasgow local grid-mapfile, à la

localaccount = "lcmaps_localaccount.mod"
" -gridmapfile /usr/local/etc/grid-mapfile-local"
# " -gridmapfile /etc/grid-security/grid-mapfile"


Job done. I can now flit between gla or pool accounts depending on my existence in /usr/local/etc/grid-mapfile-local

Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
2013.svr008               cream_441636610  ssp001                 0 R q2d
2014.svr008               cream_963867097  gla057                 0 Q q2d

by dug mcnab (noreply@blogger.com) at March 01, 2010 15:28

VMware Web admin vs SL5.4: fight!

Recently, we've acquired some hefty servers for the purposes of running virtual machines (initially for test purposes and cheap dev boxes, but potentially for service hosting depending on how well it goes). We're using VMware Server, which, although it comes with some command line tools, very much wants you to use the fancy web interface that it runs on non-standard ports.

This was fine, except that it seemed extremely flaky on all our test servers - randomly crashing, sometimes taking out a running VM with it.

It turns out that this is all the fault of our running an up-to-date version of SL. SL5.4 (actually, anything based on RHEL5.4, one assumes) has a version of glibc that VMware really doesn't get on well with.
Once we copied the 5.3 release of libc.so.6 from a 5.3 server into a suitable place, and pointed VMware's LD_LIBRARY_PATH at it, it seemed much more stable.

(The relevant bug report, including fix suggestions is:
http://bugs.centos.org/view.php?id=3884 )

by Sam Skipsey (noreply@blogger.com) at March 01, 2010 13:57

GridPP Storage

A Phew Good Files

The storage support guys finished integrity checking of 5K ATLAS files held at Lancaster and found no bad files.


This, of course, is a Good Thing™.


The next step is to check more files, and to figure out how implementations cache checksums. Er, the next two steps are to check more files and document handling checksums, and do it for more experiments. Errr, the next three steps are to check more files, document checksum handling, add more experiments, and integrate toolkits more with experiments and data management tools.


There have been some reports of corrupted files but corruptions can happen for more than one reason, and the problem is not always at the site. The Storage Inquisition investigation is ongoing.

by Jens Jensen (noreply@blogger.com) at March 01, 2010 10:26

National Grid Service

The M-Word

There is one term that you really can't avoid in Grid computing - as much as you might want to.

It is the M-word - Middleware - a term that apparently dates back to the late 60s and has been gathering new definitions ever since.

As far as the current work of the NGS is concerned, Middleware is the software that you install on an existing computer system to hook it into the grid.

For sites associated with the GridPP project, this software is gLite.

If you are outside GridPP and do not wish to use gLite, we currently recommend packages from the Virtual Data Toolkit (VDT). VDT is a project from the US Open Science Grid to collect together and package as much grid software as practical. There are more than enough packaged applications within VDT to get an NGS affiliate site on the grid.

Full NGS partners - who might need something gLite-like and capable of supporting many virtual organisations - can extend VDT with another set of seemingly random letters LCAS/LCMAPS (Local Centre Authorisation Service/ Local Credential Mapping Service).

To simplify the job of selecting the relevant packages from VDT, and building LCAS/LCMAPS - NGS staff have developed the 'NGS VDT Installer scripts'. These are maintained via the National eScience Centre's NeSCForge service (http://forge.ngs.ac.uk/projects/ngs)

The scripts started as a way of collating and documenting the knowledge of NGS operations staff.

Three years ago, the original NGS 'core' sites at RAL (STFC), Oxford, Manchester and Leeds had all deployed VDT but had set the service up independently.

So we got together, agreed on what packages were needed, how they should be configured and what local tweaks were needed to make things work. Rather than simply documenting this information, we turned it into a set of executable scripts - the 'NGS VDT Installer' scripts - which could produce a consistent VDT-based service on a host.

The scripts can be thought of as runnable documentation. Someone who needed to reproduce a standard NGS installation could either run the script or read it as a guide to what to do.

More recently, following work done at Manchester, we scripted the process of building LCAS/LCMAPS. There is now a set of scripts that can take a site to full NGS partner status.

Maintaining and developing the installer scripts is one of the jobs of the NGS research and development group and we released the latest version last Friday.

Bug reports permitting, this will be the last of this generation of install scripts.

So where do we go from here?

That depends on what happens to the M-word over the next few months.

We do know VDT is still being developed. We also know there are plans to produce a pan-European Unified Middleware Distribution incorporating gLite.

Whatever happens, we'll be watching.

by Jason Lander (noreply@blogger.com) at March 01, 2010 09:49

February 23, 2010

National Grid Service

South way south

Tomorrow sees three NGS staff head way south to Canterbury for an NGS roadshow event which will be held at Christ Church Canterbury University or CCCU for short!

As well as the usual NGS staff in attendance we will also have two NGS users who will be speaking about their research using the NGS. Paul Townend from the University of Leeds will be speaking about the use of NGS in the social sciences while Sulman Sarwar will be speaking about the use of NGS in the humanities. It should be a very interesting event and something slightly different!

Meanwhile if you would like to see the presentations from our last roadshow event which was held at the University of Hull these are now available on the event website.

And coincidentally some social science research using NGS resources was mentioned in this recent iSGTW article entitled "Supporting the arts and humanities with e-science".

by Gillian (noreply@blogger.com) at February 23, 2010 16:02

February 19, 2010

London T2

RHUL 'Newton' cluster comes home

After two years hosted by Imperial College, our 'Newton' Grid computing cluster has finally been relocated to Royal Holloway's new state-of-the-art computer centre. The move was carried out by Clustervision and everything went smoothly. Before the cluster goes back into production, analysing LHC data, a software upgrade to SL5 is planned.

A small part of Newton remains at IC: the racks were donated to become part of the particle physics cluster.

by Simon George (noreply@blogger.com) at February 19, 2010 17:40

National Grid Service

Fancy a change of scenery?

If so then have a look at the new job opportunities advertised from the recently created European Grid Initiative (EGI). The EGI organisation is being developed to coordinate the European Grid Infrastructure, based on the federation of individual National Grid Infrastructures, to support a multi-disciplinary user community.

The jobs appear to be based in Amsterdam and include technology and operation officers, dissemination roles and much more. Details of all the roles can be found in the jobs section of the EGI website.

by Gillian (noreply@blogger.com) at February 19, 2010 16:13

February 18, 2010

National Grid Service

An Introduction to NGS Research and Development

The goal of the NGS is to deliver a production quality national e-infrastructure in support of academic research. This is stated on the website and mentioned at the roadshows.

Technology moves on and what constitutes a production quality national e-infrastructure changes.

The task of keeping up with the Globuses is the responsibility of the NGS Research and Development group. This is the first of a series of postings from Research and Development describing what we do.

Let us introduce ourselves: we are a group of people from STFC, Oxford, Leeds, Edinburgh and Manchester. We typically meet by Access Grid once a week and it is our job to evaluate and develop new services. These services may be suggested by people within the group or suggested by our colleagues in the Operations or Partnership.

The services we are working on can be broadly classified as...

Advanced Reservation

To provide ways of booking time on one or more sites

Middleware deployment

Simplifying and documenting the installation of the software needed to join a grid.

User facing services

Services intended to make life easier for end users

Data access

Ways of getting the data you want where you want it to be.

Cloud computing

Computing on demand.
If you want to know more, keep watching the blog for future R+D postings.

by Jason Lander (noreply@blogger.com) at February 18, 2010 10:03

February 17, 2010

ScotGrid

more openmpi tweaking

Whilst testing MPI on our cluster and getting it into a usable state, I uncovered a rather nasty bug in openmpi-1.3.4. It manifested itself as jobs never being able to run on a single node with more than 4 cores. It was a weird one: openmpi communication over two nodes worked fine with 8 cores on each node, but when a job requested more than 4 cores on the same node, the job just hung. An strace of the mpiexec process suggested some sort of TIMEOUT/WAIT issue.

In the release notes for openmpi-1.4.1 it appears they discovered this bug and provided a fix:
- Fix a shared memory "hang" problem that occurred on x86/x86_64
platforms when used with the GNU >=4.4.x compiler series.

This sounded plausible and in fact an upgrade has fixed the issue.

So now, with all 8 cores running on the same node, the next issue to arise was one related to Maui. Sometimes when you requested nodes=8, Maui scheduled the job on 3 cores; a qdel and a resubmission later, Maui rescheduled the job onto 5 cores. On one test I even qrun'd the job and it appeared to start on the correct number of nodes, but there appeared to be no reason for Maui not getting this correct. So it was time to get out the Maui docs.

From the docs:
Maui is by default very liberal in its interpretation of <NODECOUNT>:PPN=<X>. In its standard configuration, Maui interprets this as 'give the job <NODECOUNT>*<X> tasks with AT LEAST <X> tasks per node'. Set the JOBNODEMATCHPOLICY parameter to EXACTNODE to have Maui support PBS's default allocation behavior of <NODECOUNT> nodes with exactly <X> tasks per node.

This seemed to suggest that Maui's default behaviour is to pack a job into as few nodes as possible. So I tried out setting the JOBNODEMATCHPOLICY to EXACTNODE and this seems to have done the trick.

nodes=24 means 24 nodes, not 8, not 6 but 24

This does have a drawback in that it will be 24 separate nodes. The setting relies upon being able to set :ppn (processes per node), so that you can say nodes=3:ppn=8 to get 24 cores, which is really what you want to say: you probably have fast machines with loads of memory and cores, so you would rather target all the cores on a few nodes than 24 separate nodes. However, it is a start.
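
To spell out the difference between the two requests, here is a small Python sketch that interprets a PBS-style nodes specification the way EXACTNODE is described above: exactly that many nodes, each with exactly ppn tasks. The function name exact_allocation is made up, and it only handles the simple nodes=N[:ppn=M] form used in this post.

```python
def exact_allocation(nodes_spec):
    """Interpret 'nodes=N[:ppn=M]' as N nodes with exactly M tasks each.

    Returns (nodes, tasks_per_node, total_cores). ppn defaults to 1 when
    it is not given. Only the simple form used in this post is handled.
    """
    parts = dict(item.split("=", 1) for item in nodes_spec.split(":"))
    nodes = int(parts["nodes"])
    ppn = int(parts.get("ppn", 1))
    return nodes, ppn, nodes * ppn


if __name__ == "__main__":
    for spec in ("nodes=24", "nodes=3:ppn=8"):
        print(spec, exact_allocation(spec))
```

Both requests give 24 cores, but the first spreads them over 24 machines while the second packs them onto 3 eight-core machines.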

Wouldn't it be nice if you could specify :ppn in JDL. The only way round this I can see for now is to manually change the job manager or use the local batch attributes of CREAM to allow a custom cerequirement to be specified. Possible but not nice.

by dug mcnab (noreply@blogger.com) at February 17, 2010 16:02

National Grid Service

It's that time of year again

Last year's UK All Hands Meeting is still fresh in the minds of many people but we're now beginning to look ahead to this year's.

The AHM has been moved back to its traditional slot of September and will be held in Cardiff from the 13th - 15th - a week when every other conference also seems to be on judging by my inbox!

The call is however out now for workshops for this event and all the relevant information is on the new AHM website.

by Gillian (noreply@blogger.com) at February 17, 2010 15:26

ScotGrid

cream sours

Well, we now know how much it takes to kill our CREAM instance. Yesterday it stopped working completely and it appeared to be caught in a tailspin with the Lease and Proxy Renew processes within CREAM. Grepping the logs indicated that most of the Renewals and Lease Manager entries were related to Condor submission from ATLAS.

Massimo at INFN explained that Proxy and Lease renewals are operations which are executed with higher priority than other commands. One hypothesis is that the CREAM CE was so overloaded with these commands that it was unable to deal with basic job submission, since none of the test jobs I submitted ever made it out of the REGISTERED state.

It looked bad on Ganglia:


The first course of action was to disable job submission using the command line tool: glite-ce-disable-submission and try to deal with the renewals. This worked for a time but they reoccurred later on that evening.

The timestamps on these ATLAS CREAM jobs seemed to be very old and hinted at stale jobs, so the next course of action was to manually purge the database using the tool provided by the CREAM developers: here. The easiest way I could see to do this was to connect to the creamdb, select out the ids and create a script that called the purger for each id. Note: you need jdk 1.6 in order to run the purger!

This ended up removing around 3000 CREAM entries.

Ganglia looked much happier:


So I think you have to be careful when getting submissions from Condor at the moment, as it looks to be quite easy to cause a denial of service on your CREAM CE.

Roll on CREAM 1.6

- That proxy renewal is not very efficient in the release now in production (already addressed in the coming CREAM CE: see here)
- When there are too many pending commands, new job submissions will be disabled by the limiter: see here

by dug mcnab (noreply@blogger.com) at February 17, 2010 15:17