The other day I started to get paged around 12:10AM because one of our Xenserver hosts decided it was time to reboot. We have high availability (HA) turned on, so all the VMs running on this host were rebooted on other hosts per our HA config. This was good, that means paging will stop once all the servers are running again.
But what was the cause of this reboot? Of course I go straight to the logs and this is what I found in /var/log/kern.log
Nov 15 00:05:59 xenserverhostname kernel: [2461330.653319] nfs: server 10.0.0.11 not responding, timed out
Looks like issues with NFS timing out to my storage backend, which is a NetApp FAS22xx. We’ve never see performance issue with our NetApp ever, but it looks like we have something going on now. I noticed that we had a lot of volumes scheduled to run deduplication jobs starting off at midnight. I spread those out a bit so they weren’t all trying to run at the same time. I also noticed that our HA Xenserver Heartbeat was getting dedup’d as well. I turned that off because the heartbeat only takes up a few MB’s.
I also noticed that this timeout has been logged happen before, but not enough to cause a host to reboot. I believe HA/Xen will reboot the host once it goes over a timeout threshold, and that is why the server rebooted. I think we are dealing with a couple issues and I hate to use the term “perfect storm”, but it seems fitting. I think because there were a lot of NetApp jobs kicking off at midnight, jobs with lots of I/O getting kicked off on VMs at midnight, and issues with XenServer handling timeouts were at play. I think spreading out jobs on the NetApp, on the VMs, and applying patches will help, but only time will tell if it does.