April 15, 2014

NorthGrid

Kernel Problems at Liverpool

Introduction

Liverpool recently updated its cluster to SL6. After the update, the kernel began to lock up during normal operations. The signs are unresponsiveness, drop-outs in Ganglia and (later) many "task...blocked for 120 seconds" messages in /var/log/m.. and dmesg.


Description

Kernels in the range 2.6.32-431* exhibited a type of deadlock when run on certain hardware with BIOS dated after 8th March 2010.

This problem occurred on Supermicro hardware, with main boards:
  • X8DTT-H
  • X9DRT

Notes:

1) No hardware with BIOS dated 8th March 2010 or before showed this defect, even on the same board type.

2) The oldest kernel of the 2.6.32-358 range is solid. This is corroborated by operational experience with the 358 range.

3) All current kernels in the 2.6.32-431 range exhibited the problem on our newest hardware, and a few nodes of the older hardware that had had unusual BIOS updates.

Testing

The lock-ups are hard to reproduce, but after a great deal of trial and error, a predictor that is about 90% effective was found.

The procedure is to:

  • Build the system completely new in the usual way, and
  • When Yaim gets to "config_users", use a script (stress.sh) to run 36 threads of gzip and one of iozone.

On a susceptible node, this is reasonably certain to make it lock up after a minute. The signs are unresponsiveness and (later) "task...blocked for 120 seconds" messages in /var/log/m.. and dmesg.
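For anyone wanting to spot these messages automatically, here is a minimal sketch in Python. It assumes the kernel's hung-task warning has the usual "INFO: task NAME:PID blocked for more than N seconds" shape. The sample lines are made up for illustration and are not taken from our nodes.

```python
import re

# General shape of the kernel's hung-task warning (assumed, see above).
HUNG_TASK = re.compile(
    r"INFO: task (\S+):(\d+) blocked for more than (\d+) seconds"
)


def find_hung_tasks(log_text):
    """Return (task_name, pid, seconds) for each hung-task warning found."""
    return [
        (m.group(1), int(m.group(2)), int(m.group(3)))
        for m in HUNG_TASK.finditer(log_text)
    ]


# Hypothetical sample log lines, for illustration only.
SAMPLE = """\
[ 1234.567890] INFO: task gzip:4321 blocked for more than 120 seconds.
[ 1234.567900] some unrelated kernel message
[ 1354.000000] INFO: task jbd2/sda1-8:612 blocked for more than 120 seconds.
"""

if __name__ == "__main__":
    for name, pid, secs in find_hung_tasks(SAMPLE):
        print(f"{name} (pid {pid}) blocked for more than {secs} s")
```

In practice the text would come from dmesg or the system log rather than a built-in sample.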

I observed that if the procedure is not followed exactly, it is unreliable as a predictor. In particular, if you stop Yaim and try again, the predictor is useless.

To test that, I isolated the config_users script from Yaim, and ran it separately along with the stress.sh script. Result: useless - no lock-ups were seen.

Note: This result was rather unexpected because the isolated config_users.sh script works in the same way as the original.





Unsuccessful Theories

A great many theories were tested and rejected or not pursued further (APIC problems, disk problems, BIOS differences, various kernels, examination of kernel logs, much googling, etc.). Eventually, we stumbled upon a seemingly successful theory, which I describe below.

The Successful Theory

All our nodes had unusual vm settings:

# grep dirty /etc/sysctl.conf
vm.dirty_background_ratio = 100
vm.dirty_expire_centisecs = 1800000
vm.dirty_ratio = 100


These custom settings facilitate the storage of ATLAS "short files" in RAM. Basically, they force files to remain off disk for a long time, allowing very fast access.
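To put "a long time" into numbers, the following Python sketch parses the settings shown above and converts the expiry interval from centiseconds to hours. The function names are illustrative only; the parser is a simplification that handles plain "key = value" lines and skips comment lines.

```python
def parse_sysctl(text):
    """Parse 'key = value' lines from sysctl.conf-style text.

    Blank lines and comment lines (starting with '#' or ';') are skipped.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def centisecs_to_hours(value):
    """Convert a duration in centiseconds (1/100 s) to hours."""
    return int(value) / 100 / 3600


# The settings as listed in this post.
SETTINGS = """\
vm.dirty_background_ratio = 100
vm.dirty_expire_centisecs = 1800000
vm.dirty_ratio = 100
"""

if __name__ == "__main__":
    conf = parse_sysctl(SETTINGS)
    hours = centisecs_to_hours(conf["vm.dirty_expire_centisecs"])
    print(f"dirty pages may stay in RAM for up to {hours:g} hours")
    print(f"dirty_ratio = {conf['vm.dirty_ratio']}% (a percentage of memory)")
```

So dirty data could sit in the page cache for up to five hours, with both ratio thresholds set to their maximum of 100%.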

The modification had been tested almost exhaustively for several years on earlier kernels - but perhaps some change (or latent bug?) in the kernel had invalidated them somehow.

We came up with the idea that the issue originates in the memory operations that occur prior to Yaim/config_users. This would explain why anything but the exact activity created by the procedure might well not trigger the defect. We thought this could tally with the idea of the ATLAS "short file" modifications in sysctl.conf. The theory is that these mods set up the problem during the memory read/write operations (i.e. the asynchronous OS loading and flushing of the page cache).

To test this, I used the predictor on susceptible nodes, but without applying the ATLAS "short file" patch. Default vm settings were adopted instead.

Result

Very satisfying at last - absolutely no sign of the defect. As the ATLAS "short file" patch is not very beneficial given the current data traffic, we have decided to go back to default "vm.dirty" settings and monitor the situation carefully.



by Steve Jones (noreply@blogger.com) at April 15, 2014 10:40

April 03, 2014

The SSI Blog

Software and reproducible research - best of the tweets

This year's Collaboration Workshop, which took place between March 26th and 28th at the Oxford e-Research Centre, was a great success. Its theme was software and reproducible research, and it ended with a special Hackday where competing teams coded against the clock to create the best software.

Our sponsors this year were Microsoft, GitHub and the Oxford e-Research Centre itself, and we would in particular like to thank Kenji Takeda, Arfon Smith and the OeRC staff for all their help, not to mention all our attendees!

Naturally, lots of tweeting relating to the Workshop (and its hashtag, #CW14) took place before, during and after the event. So here are some of the best and most revealing tweets that resulted, including manatees, rabbits, pizza, writing on the walls and "hardcore Python deployment."

Community
CW14, Twitter

read more

by a.hay at April 03, 2014 14:07

Tier1 Blog

Deploying 2013 worker nodes at RAL

We have just had a very busy week deploying the 2013 tranches of worker nodes here at RAL. We had hoped to deploy these sooner, but many staff were unavailable due to conferences or annual leave. Consequently there was a rush to ensure that we continued to meet our pledged capacity on the 1st April.

The new machines are in two tranches, 64 OCF machines and 64 Viglen machines. They all have dual Xeon E5-2650 @ 2.60GHz processors. They are running hyperthreading and each machine is configured to have 32 job slots. (Total additional job slots is 4096.)
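The slot count can be checked with a few lines of Python. The figure of 8 physical cores per processor is an assumption on our part (it is not stated above); the other numbers come straight from the paragraph.

```python
OCF_MACHINES = 64
VIGLEN_MACHINES = 64
SLOTS_PER_MACHINE = 32

# Assumption (not stated in the post): 8 physical cores per processor.
SOCKETS = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2  # hyperthreading enabled


def hardware_threads(sockets, cores_per_socket, threads_per_core):
    """Number of hardware threads a machine presents to the OS."""
    return sockets * cores_per_socket * threads_per_core


def total_slots(machines, slots_per_machine):
    """Total job slots contributed by a set of identical machines."""
    return machines * slots_per_machine


if __name__ == "__main__":
    threads = hardware_threads(SOCKETS, CORES_PER_SOCKET, THREADS_PER_CORE)
    print(f"hardware threads per machine: {threads}")
    added = total_slots(OCF_MACHINES + VIGLEN_MACHINES, SLOTS_PER_MACHINE)
    print(f"additional job slots: {added}")
```

Under that assumption, one job slot per hardware thread gives the 32 slots per machine quoted above.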

The new machines have been put into production with the latest kernels and errata. So the next week or so will see staff at RAL doing a rolling upgrade on the batch farm to ensure that it is homogeneous, with all worker nodes running the same kernel, errata and EMI version. We are also taking the opportunity to do a slight update of the condor version, from condor-8.0.4-189770.x86_64 to condor-8.0.6-225363.x86_64.

As the new machines come in, it is also a reminder that we will be retiring the old 2008 machines. At the moment there is no hurry and we will continue to exploit whatever resources we have available.

The graph shows the increase in HSPEC06 capacity over the past week. The HSPEC idle has increased because many machines are now being drained for kernel and errata updates.

 

by johnkelly at April 03, 2014 13:39

April 02, 2014

The SSI Blog

Top tips for writing a press release

Make sure your press release is a right 'ribbiting' read...

By Simon Hettrick and Alexander Hay.

Whether you're researching a cure for cancer or the eating habits of the common toad, every now and then you'll want to tell the outside world about your research. It's time for a press release! Here are our five top tips on preparing one.

1. Do you need professional help?

Press releases need to be written in a journalistic style that will appeal to publishers. Most organisations will have press officers whose job it is to write press releases for researchers. This is typically a free service, because it's in your employer's interest to showcase your successes. To find a press officer, ask the faculty member responsible for marketing or contact the marketing department of your university or employer.

Community
Training
author:Simon Hettrick, author:Alexander Hay

read more

by s.hettrick at April 02, 2014 11:00

April 01, 2014

GridPP Storage

Dell OpenManage for disk servers

As we've been telling everyone who'll listen, we at Oxford are big fans of the Dell 12-bay disk servers for grid storage (previously R510 units, now R720xd ones). A few people have now bought them and asked about monitoring them.

Dell's tools all go by the general 'OpenManage' branding, which covers a great range of things, including various general purpose GUI tools. However, for the disk servers, we generally go for a minimal command-line install.

Dell have the necessary bits available in a YUM-able repository as described on the Dell Linux wiki. Our setup simply involves:
  • Installing the repository file,
  • yum install srvadmin-storageservices srvadmin-omcommon,
  • service dataeng start
  • and finally logging out and back in again, or otherwise picking up the PATH variable change from the newly installed srvadmin-path.sh script in /etc/profile.d
At that point, you should be able to query the state of your array with the 'omreport' tool, for example:
    # omreport storage vdisk controller=0
    List of Virtual Disks on Controller PERC H710P Mini (Embedded)

    Controller PERC H710P Mini (Embedded)
    ID : 0
    Status : Ok
    Name : VDos
    State : Ready
    Hot Spare Policy violated : Not Assigned
    Encrypted : No
    Layout : RAID-6
    Size : 100.00 GB (107374182400 bytes)
    Associated Fluid Cache State : Not Applicable
    Device Name : /dev/sda
    Bus Protocol : SATA
    Media : HDD
    Read Policy : Adaptive Read Ahead
    Write Policy : Write Back
    Cache Policy : Not Applicable
    Stripe Element Size : 64 KB
    Disk Cache Policy : Enabled

We also have a rough and ready Nagios plugin which simply checks that each physical disk reports as 'OK' and 'Online' and complains if anything else is reported.
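The idea behind such a check can be sketched in Python as follows. This is not our actual plugin, just an illustration. It assumes the physical disk report uses the same "Key : Value" layout as the virtual disk output above, with one record per "ID" line. The sample report is invented for the example.

```python
def parse_disks(report):
    """Split 'Key : Value' report text into one dict per disk.

    A new record starts at each 'ID' line; lines without a colon are ignored.
    """
    disks = []
    for line in report.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if key == "ID":
            disks.append({})
        if disks:
            disks[-1][key] = value
    return disks


def check_disks(disks):
    """Return (nagios_exit_code, message): 0 OK, 2 CRITICAL, 3 UNKNOWN."""
    if not disks:
        return 3, "UNKNOWN: no physical disks found in report"
    bad = [
        d.get("ID", "?")
        for d in disks
        if d.get("Status", "").lower() != "ok"
        or d.get("State", "").lower() != "online"
    ]
    if bad:
        return 2, "CRITICAL: disks not OK/Online: " + ", ".join(bad)
    return 0, f"OK: {len(disks)} disk(s) OK and Online"


# Hypothetical report text, for illustration only.
SAMPLE = """\
List of Physical Disks on Controller PERC H710P Mini (Embedded)

ID : 0:1:0
Status : Ok
Name : Physical Disk 0:1:0
State : Online
ID : 0:1:1
Status : Critical
Name : Physical Disk 0:1:1
State : Failed
"""

if __name__ == "__main__":
    code, message = check_disks(parse_disks(SAMPLE))
    print(code, message)
```

A real plugin would read the report from omreport instead of a built-in sample, print the message and exit with the returned code.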

    by Ewan (noreply@blogger.com) at April 01, 2014 14:57

    The SSI Blog

    Exploring the integration of Subversion and Git with CVS

    By Mike Jackson, Software Architect.

    Michael Chappell leads the Quantitative Biomedical Inference (QuBIc) research group within the Institute of Biomedical Engineering at the University of Oxford. Michael has developed a method of processing functional magnetic resonance image (MRI) data that can be used to recognise blood flow patterns in the brain. I have been helping Michael through one of our consultancy projects, which he applied for through our open call. Part of our collaboration looked at issues around integrating Subversion or Git repositories with CVS.

    QuBIc's method is implemented as part of a C++ code, FABBER, which can be used on its own or via BASIL (Bayesian Inference for Arterial Spin Labelling MRI), a shell-script that provides a richer command-line interface. Both FABBER and BASIL are distributed as part of FSL, the FMRIB Software Library, which is produced by The Oxford Centre for Functional MRI of the Brain, Nuffield Department of Clinical Neurosciences, University of Oxford and the John Radcliffe Hospital.

    Consultancy
    revision control, version control, Subversion, SVN, Git, CVS, collaboration, author:Mike Jackson

    read more

    by m.jackson at April 01, 2014 13:00

    March 31, 2014

    GridPP Storage

    Highlights of ISGC 2014

    ISGC 2014 is over. Lots of interesting discussions - on the infrastructure end, ASGC developing fanless machine room, interest in (and results on) CEPH and GLUSTER, dCache tutorial, and an hour of code with the DIRAC tutorial.

    All countries and regions presented overviews of their work in e-/cyber-Infrastructure.

    Interestingly, although this wasn't a HEP conference, practically everyone is doing >0 on LHC, so the LHC really is binding countries and researchers (well, at least physicists and infrastructureists) and e-Infrastructures together (and NRENs). When one day, someone sits down to tally up the benefit and impact of the LHC, this ought to be one of the top ones: the ability to work together and to (mostly) be able to move data to each other, and to trust each other's CAs.

    Regarding the DIRAC tutorial, I was there and went through as much as I could ("I am not doing that to my private key"). Something to play with a bit more when I have time - an hour (of code) is not much time; there are always compromises between getting stuff done realistically and cheating in tutorials, but as long as there's something you can take away and play with later. As regards the key shenanigans, DIRAC say they will be working with EGI on SSO, so that's promising. Got the T-shirt, too. "Interware," though?

    On the security side, OSG have been interfacing to DigiCert, following the planned termination of the ESNET CA. Once again grids have demands that are not seen in the commercial world, such as the need for bulk certificates (particularly cost effective ones - something a traditional Classic IGTF can do fairly well.) Other security questions (techie acronym alert, until end of paragraph) include how Argus and XACML compare for implementing security policies, and the EMI STS - CERN looking at linking with ADFS. And Malaysia are trialling an online CA based on a FIPS level three token with a Raspberry π.

    EGI federated cloud got mentioned quite a few times - KISTI interested in offering IaaS, also Australia interested in joining. Philippines providing resources. EGI have a strategy for engagement. Interesting the extent to which they are driving the of CDMI.

    I should mention Shaun gave a talk on "federated" access to data, comparing the protocols - which I missed - the talk, I mean - being in another session, but I understand it was well received and there was a lot of interest.

    Software development - interesting experiences from the dCache team and building user communities with (for) DIRAC. How are people taught to develop code? The closing session was by Adam Lyon from Fermilab who talked about the lessons learned - the HEP vision of big data being different from the industry one. And yet HEP needs a culture shift to move away from the not-invented-here.

    ISGC really had a great mix of Asian and European countries, as well as the US and Australia. This post was just a quick look through my notes; there'll be much more to pick up and ponder over the coming months. And I haven't even mentioned the actual science stuff ...

    by Jens Jensen (noreply@blogger.com) at March 31, 2014 18:45

    The SSI Blog

    PyData London 2014

    By Mark Basham, Senior Software Scientist, Diamond Light Source and 2014 Institute fellow.

    As a scientist, the chance to glimpse inside the world of data analytics in the financial sector was something I was really keen on, and if nothing else, the setting for PyData London did not disappoint. Level 39 is the 39th floor of 1 Canada Square, at the heart of Canary Wharf. Its breath-taking views and modern layout and design made for a really good conference location, and set the mood for the conference well.

    PyData is all about using Python to analyse data, and as such the delegates were a mix of academic and commercial programmers, which made for an interesting diversity of presentations and conversation. In addition to this, there was a two track program, the first generally targeted at novice Python users, and the other with more advanced talks.

    Community
    author:Mark Basham, Python, PyData

    read more

    by a.hay at March 31, 2014 13:00

    GridPP Storage

    Storage thoughts from GRIDPP32

    Last week saw me successfully talk about the planned CEPH installation at the RAL Tier1. Here is a list of other thoughts which came up from GRIDPP32:

    ATLAS and CMS plans for Run2 of the LHC seem to involve an increase in the churn rate of data at their Tier2s, which will require a higher deletion rate. We will also need to make sure dark data is discovered and deleted in a more timely manner.

    A method for discovering and deleting empty directories which are no longer needed needs to be created. As an example, at the Tier1 there are currently 1071 ATLAS users, each of whom can create up to 131072 sub-directories, which can end up being dark directories under ATLAS's new RUCIO namespace convention.
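As a rough illustration of the discovery step, here is a minimal Python sketch that walks a local directory tree bottom-up and reports directories with no files anywhere beneath them. It is only a sketch under a big assumption: a real tool would have to work against the storage element's namespace rather than a local filesystem, and should report rather than delete. The function name and the demo layout are made up. The last lines show the worst-case directory count implied by the numbers above.

```python
import os
import tempfile


def find_empty_dirs(root):
    """Return directories under root that hold no files at any depth.

    The tree is walked bottom-up, so a directory containing only empty
    directories is itself reported. The root itself is never reported.
    """
    empty = set()
    for path, dirnames, filenames in os.walk(root, topdown=False):
        children = [os.path.join(path, d) for d in dirnames]
        if path != root and not filenames and all(c in empty for c in children):
            empty.add(path)
    return sorted(empty)


# Worst case from the figures quoted above.
ATLAS_USERS = 1071
MAX_SUBDIRS_PER_USER = 131072
MAX_DARK_DIRS = ATLAS_USERS * MAX_SUBDIRS_PER_USER

if __name__ == "__main__":
    # Hypothetical layout, created in a temporary directory for the demo.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "user1", "aa", "bb"))
        os.makedirs(os.path.join(root, "user2", "cc"))
        open(os.path.join(root, "user2", "cc", "data.root"), "w").close()
        for d in find_empty_dirs(root):
            print(os.path.relpath(d, root))
    print(f"worst case: {MAX_DARK_DIRS} potential dark directories")
```

The worst case works out at roughly 140 million directories, which is why an automated method is needed.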

    To help with deletion, some of the bulk tools the site admins can use are impressive (but also possibly hazardous). One small typo when deleting may lead to huge unintentional data loss!

    Data rates shown by Imperial College of over 30Gbps WAN traffic are impressive (and make me want to compare all UK sites to see what rates have been recorded via the WLCG monitoring pages).

    Wahid Bhimji's storage talk also got me thinking again that, with the rise of the WLCG VOs' FAX/AAA systems and their relative increase in usage, perhaps it is time to re-investigate WAN tuning not only of WNs at sites but also of the XROOT proxy servers used by the VOs. In addition, I am still worried about monitoring and controlling the number of xrootd connections per disk server in each of the types of SE which we have deployed on the WLCG.

    I was also interested to see his work using DAV and its possible usefulness for smaller VOs.
     

    by bgedavies (noreply@blogger.com) at March 31, 2014 12:14

    March 27, 2014

    GridPP Storage

    dCache workshop at (with) ISGC 2014

    Shaun and I took part in the dCache workshop. Starting with a VM with a dCache RPM, the challenge was to set it up with two pools, NFS4, and WebDAV. A second VM got to access the data, mainly via NFS or HTTP(S) - security ranged from IP address to X.509 certificates. The overall impression was that it was pretty easy to get set up and configure the interfaces and get it to do something useful: dCache is not "an SRM" or "an NFS server" but rather storage middleware which provides a wide range of interfaces to storage. One of the things the dCache team is looking into is the cloud interface, via CDMI. This particular interface is not ready (as of March 2014) for production, but it's something we may want to look into and test with the EGI FC's version, Stoxy.

    by Jens Jensen (noreply@blogger.com) at March 27, 2014 04:24

    March 26, 2014

    The SSI Blog

    Supporting and showcasing women in technology

    By Catherine Breslin, Cam Women in Tech (@CamTechWomen).

    This article is part of our series Women in Software, in which we hear perspectives on a range of issues related to women who study and work with computers and software.

    Just 17% of the UK’s technical workforce is female, and in many tech companies it’s still worthy of comment when there’s more than one woman in the room. That’s why we set up Cam Women in Tech, to showcase and support women who work in Cambridge’s tech industry.

    The lack of diversity in the tech workforce has received increasing amounts of media attention in recent years. The general consensus is that there’s no one cause for the lack of diversity, but that a combination of factors work to steer girls away from the industry long before they leave school. Great initiatives like Lady Geek and Stemettes are working to address the gender imbalance by breaking down stereotypes, and are increasing future numbers by encouraging more girls to consider science and engineering careers.

    Policy
    Women in Software, author:Catherine Breslin, Cam Women in Tech

    read more

    by s.hettrick at March 26, 2014 10:00

    March 25, 2014

    The SSI Blog

    How a photo from Playboy became part of scientific culture

    By Hannah Dee, computer science lecturer, Aberystwyth University.

    When I was approached to write a guest post on women in software, my first thought was to try and pull together another post about the leaky pipeline, school science, or girls' toys. But that’s not the field in which I do most of my software development. It’s what I tend to pontificate on, but not what I research. I’m a vision researcher. So, could I come up with a computer vision topic that was somehow gendered? Easy!

    When doing research in computer vision or image processing, it's useful to have a test image or two. Writing programs that reduce noise, alter brightness, or enhance edges is all very well and good, but without test images, we can't know if they work. Early on in vision science, the acquisition of images was hard, and there were a handful of images everyone used. This was partly due to expediency (not everyone had access to a scanner) and partly due to comparability (we want to be able to see the results of each algorithm on the same image or set of images). Today, nearly everyone has a digital camera as part of the device in their pocket, but in the 70s and 80s such devices simply didn't exist.

    Policy
    Women in software, Image processing

    read more

    by s.hettrick at March 25, 2014 10:00

    March 24, 2014

    ScotGrid

    The Three Co-ordinators

    It has been a while since we posted on the blog. Generally, this means that things have been busy and interesting. Things have been busy and interesting.

    We are presently going through a redevelopment of the site, evaluating new techniques for service delivery (such as using Docker for containers) and updating multiple services throughout the sites.

    The development of the programme presented at CHEP on automation and different approaches to delivering HEP related Grid services is underway. An evaluation of container based solutions for service deployment will be presented at the next GridPP collaboration meeting later this month. Other evaluation work on using Software Defined Networking hasn't progressed as quickly as we would have liked but is still underway.

    Graeme (left), Mark (center) and Gareth.

    In other news, Gareth Roy is taking over as the Scotgrid Technical Co-ordinator this month. Mark is off for adventures with the Urban Studies Big Data Group within Glasgow University. And as Dr Who can do it, so can we: Co-ordinators Past, Present and Future all appear in the same place at the same time.

    Will the fabric of Scotgrid be the same again?

    Very much so.

    by Mark Mitchell (noreply@blogger.com) at March 24, 2014 23:13

    The SSI Blog

    Top tips for writing a case for funding a software developer

    By Mike Jackson, Software Architect.

    You have been developing software that is becoming more popular. But now you are struggling to balance the need to develop and support your software, against the need to do your research. How do you convince funders to give you money to recruit a software developer to keep your users happy?

    Here are our top tips in the form of four sets of questions that, by answering, will help you to convince funders.

    Training
    Funding, top tips, author:Mike Jackson

    read more

    by m.jackson at March 24, 2014 10:00

    March 21, 2014

    The SSI Blog

    Reproducible research – an impossible dream?

    By Kenji Takeda, Microsoft Research.

    Research results in peer-reviewed publications are reproducible, right? If only it was so clear cut. With high profile paper retractions and pushes for better data sharing by funders, publishers and the community, the spotlight is now focussing on the whole way research is conducted around the world.

    While research software provides the potential for better reproducibility, most people agree that we are some way from achieving this. It’s not just a matter of throwing your source code online. Even though tools such as GitHub provide excellent sharing and versioning, it is down to the researcher and developer to make sure the code can not only be re-run, but also understood by others. There are still technical issues to overcome, but the social ones are even harder to tackle. The development of scientific software, choices by researchers, its use and reuse, are all intertwined. We at Microsoft Research are concerned with this: see Troubling Trends in Scientific Software.

    Policy
    Community
    CW14, Collaborations Workshop, Reproducibility, Research Software Engineers, recomputation.org, author:Kenji Takeda

    read more

    by s.hettrick at March 21, 2014 16:00

    "I am used to being in a male dominated classroom" - experiences of an A-level computing student

    By Phoebe Chapman, A-level student, Barton Peveril College.

    My journey into the unknown field of Computer Science started at an open day held at my college, Barton Peveril. I had not come across computing before and I (naively) thought it was pretty much the same as ICT (Information and Communications Technology), which I had studied at school and wasn’t very keen on. Computing has a lot more application than ICT. In ICT, all we did was take screen shots, hear about how to use Microsoft Word and PowerPoint and access files and folders on a computer (all of which everyone knew how to do anyway). In my opinion, students should start learning programming skills at an earlier age, because it is too late to wait until A-level, when most of the important decisions on subjects have already been made.

    I have always had an interest in maths and a logical approach to problems, and I was told that these skills would be very useful when studying computing. The idea of putting logic into practice is what encouraged me to try the subject. As I sat in my first-year classroom I was, at first, intimidated by the large number of guys who had been programming for years, and seemed to know a great amount about programming. This turned out not to be a problem, because the lessons were about getting everyone up to a certain standard, and I did not feel left behind. The lessons were interesting: we created programs to carry out all sorts of different functions. It quite quickly became my most enjoyable subject, despite my initial doubts. You really can make a program to do just about anything you want with knowledge of computer science.

    Policy
    Women in Software, author:Phoebe Chapman

    read more

    by s.hettrick at March 21, 2014 10:00

    March 19, 2014

    The SSI Blog

    Geeks who love the NHS: NHS Hack Days

    By Helen Jackson and Carl Reynolds, academic clinical fellow and CEO, Open Healthcare UK.

    The NHS Hack Day (NHSHD) series was the brainchild of Dr Carl Reynolds, an academic clinical fellow in respiratory medicine and founder and CEO of Open Healthcare UK. NHS Hack Day, London Edition 2014, will take place in May with the final date to be confirmed very soon. This will be the seventh event in a very successful series of hack events with a healthcare theme. 

    On how he had the idea, Carl says "I was whinging about broken NHS IT, and Tom [Taylor] told me about the recent hack day he had participated in at the Cabinet Office and said why not have an NHS Hack Day? It seemed like a good idea, so when we got home we decided to make it happen...".

    Community
    Hackdays, NHS, author:Helen Jackson, author:Carl Reynolds

    read more

    by s.hettrick at March 19, 2014 15:00