RAL has a Ceph storage cluster (referred to as the Cloud Cluster) that provides a RADOS Block Device (RBD) interface for our Cloud infrastructure. We recently ran a stress test of this Ceph instance.
We had 222 VMs running, of which 50 were randomly writing large volumes of data. We realised we had maxed out the cluster when we noticed a slowdown in the responsiveness of our VMs: increasing the number of VMs writing data did not increase the amount of data being written, so we believe we hit the limit of the cluster.
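We have not described the exact load generator here, but for illustration, a minimal Python sketch of the kind of random-write workload each of those 50 VMs was producing against its RBD-backed disk might look like the following (the file path and sizes are hypothetical):

    import os
    import random

    # Hypothetical random-write load generator, roughly the kind of workload
    # each of the 50 test VMs was producing on its RBD-backed disk.
    TARGET = "/mnt/test/randwrite.dat"   # assumed mount point inside the VM
    FILE_SIZE = 10 * 1024**3             # 10 GiB scratch file (illustrative)
    BLOCK_SIZE = 4 * 1024**2             # 4 MiB writes

    def random_write_loop():
        # Pre-allocate the scratch file once.
        with open(TARGET, "wb") as f:
            f.truncate(FILE_SIZE)
        # Then keep writing random blocks at random offsets until killed.
        with open(TARGET, "r+b", buffering=0) as f:
            while True:
                offset = random.randrange(0, FILE_SIZE - BLOCK_SIZE)
                f.seek(offset)
                f.write(os.urandom(BLOCK_SIZE))
                os.fsync(f.fileno())     # force the write down to the device

    if __name__ == "__main__":
        random_write_loop()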
The write rate we hit into the cluster was 1044 MB/s (8.2 Gb/s), as reported by ‘ceph status’. It is worth noting that this was just the raw client data coming in; as we store three copies of everything, there was actually around 24.6 Gb/s being written to disk (not including journaling). Investigation showed that the limiting factor was the storage node disks, which were all writing as fast as they could.
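As a quick sanity check on those figures, the total rate hitting the disks is just the client rate multiplied by the replication factor; the short calculation below uses the same MB-to-Gb conversion (×8, ÷1024) as the numbers above:

    # Back-of-the-envelope check of the aggregate write rate.
    client_rate_mb_s = 1044      # raw client data rate reported by ceph status
    replicas = 3                 # we keep three copies of every object

    client_rate_gb_s = client_rate_mb_s * 8 / 1024   # ~8.2 Gb/s
    total_rate_gb_s = client_rate_gb_s * replicas    # ~24.5 Gb/s, i.e. the ~24.6 Gb/s quoted above within rounding

    print(f"client: {client_rate_gb_s:.1f} Gb/s, written to disk: {total_rate_gb_s:.1f} Gb/s")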
We have not yet done any optimisation of the OSD mount options in our Ceph configuration, and this is probably something we should explore in the future for a potential performance gain.
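Purely as an illustration of where such tuning would go (we have not tested these values, and they assume XFS-backed OSDs), the mount and mkfs options can be set in the [osd] section of ceph.conf, for example:

    [osd]
    # Illustrative only - untested example values for XFS-backed OSDs.
    osd mount options xfs = rw,noatime,inode64
    osd mkfs options xfs = -f -i size=2048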
The cluster currently consists of 15 storage nodes, each with 7 OSDs and 10 Gb/s client and rebalancing networks.
The following graphs show the network, CPU and memory utilisation on one of the storage nodes; they are typical of the rest of the cluster. The step change marks the point where we fired up the VMs doing random writes. You will notice that inbound network traffic was about 220 MB/s; fifteen times this is 3300 MB/s, or roughly 26 Gb/s, which is approximately the same as the 24.6 Gb/s figure I quote above and provides an independent check on the figure ‘ceph status’ reports.
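Spelling that cross-check out in the same way as before:

    # Independent check: per-node inbound network rate scaled to the whole cluster.
    per_node_in_mb_s = 220        # inbound rate seen on one storage node
    storage_nodes = 15

    cluster_in_mb_s = per_node_in_mb_s * storage_nodes   # 3300 MB/s
    cluster_in_gb_s = cluster_in_mb_s * 8 / 1024         # ~25.8 Gb/s, i.e. roughly 26 Gb/s

    print(f"{cluster_in_mb_s} MB/s across the cluster ~= {cluster_in_gb_s:.0f} Gb/s")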