Analysis of Callout Data
As a work experience student at RAL, I have collected and analysed the data detailing the callouts made to the Tier 1 on-call team. The team provide 24×7 cover for the Tier 1 service.
Over the past few years a trend has emerged, highlighted by the above graph of Total Callouts Yearly. The graph shows a decrease from 467 callouts in 2011 to 91 halfway through 2015. If we estimate the 2015 total as double that figure (182), this is a significant decrease of 285 callouts, which could reflect the weekly review of callouts carried out by the Tier 1 team. Another explanation is improvements in technology that reduce the risk of faults and callouts. The only anomaly is 2014, which shows a higher number of callouts; no specific cause is known, as the team has not analysed all of the data. However, even with this anomaly, the overall data shows a trend towards fewer failures each year. Hopefully, we will hit zero soon!
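The arithmetic behind that estimate can be checked with a short sketch. The figures 467 and 91 come from the text; doubling the half-year count assumes the second half of 2015 matches the first, and the variable names are purely illustrative.

```python
# Figures quoted in the text above.
callouts_2011 = 467
callouts_2015_half_year = 91

# Assumption: the second half of 2015 sees the same number of
# callouts as the first half, so the full-year estimate is double.
estimated_2015 = 2 * callouts_2015_half_year

# Absolute and relative decrease compared with 2011.
decrease = callouts_2011 - estimated_2015
percent_decrease = 100 * decrease / callouts_2011

print(f"Estimated 2015 total: {estimated_2015}")  # 182
print(f"Decrease since 2011: {decrease} ({percent_decrease:.0f}%)")  # 285 (61%)
```

On these assumptions the estimated 2015 total is roughly 61% lower than the 2011 figure.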
During 2014 there were a total of 294 callouts; the graph above divides this total among the different services and types of alarm. We can conclude from this data that Castor, Database, DISK Server and SRM cause the most callouts. This could be because we treat storage services as more critical, so they are more often configured to call out. We also have a large number of storage servers, which could lead to more callouts. By contrast, the (Condor) batch system doesn’t produce many callouts, and there are relatively few for other grid services.
The on-call team consists of a ‘Primary on-call’ (PoC) person, who receives the message from the automated call-out system. The PoC makes an initial assessment of the problem and attempts to resolve it. Should further assistance be needed, the PoC passes the problem on to the on-call ‘expert’ from the appropriate support team (there is one for each of Fabric, Castor, Database and Grid Services).
The graph above shows the split between problems handled by the PoC alone and those handled by the PoC together with an expert. In 2014, two thirds of the problems that arose were too complex or too big for the PoC alone and were referred to an expert.