After a power outage, we ran into some problems while bringing everything back up. When trying to reconnect a CIFS share on our NetApp filer, we got the following errors:
Jan 3, 2014 10:40:10 AM Error: Repairing SR CIFS ISO library On Netapp - Unable to mount the directory specified in device configuration request
Jan 3, 2014 10:40:54 AM Error: Detaching SR 'CIFS ISO library On Netapp' from 'SOME POOL' - General backend error
At first I thought the issue was XenServer not being joined to the Active Directory domain. But I was wrong; the issue was that the NetApp wasn't joined to the AD domain. So make sure everything is on the domain and you may get rid of these errors. For us, the root cause was that the time on the NetApp was off by 6 minutes, which caused errors when trying to join AD (Kerberos needs clocks to be in sync within 5 minutes).
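If you hit the same thing, the checks might look roughly like this on the filer console (Data ONTAP 7-Mode syntax assumed for illustration; the NTP server address is a made-up placeholder):

```shell
# On the NetApp console (7-Mode, assumed):
date                             # compare the filer's clock against your domain controllers
options timed.enable on          # keep the clock synced via NTP going forward
options timed.servers 10.0.0.5   # hypothetical NTP/DC address
cifs testdc                      # verify the filer can reach a domain controller
cifs setup                       # re-run the wizard to (re)join the AD domain
```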
Today I rebooted a XenServer host. Everything came back up as I would expect, but when I tried to live migrate VMs back onto the host I got this error message:
Internal error: File "xapi_xenops.ml", line 1788, characters 3-9: Assertion failed
I thought something was wrong with XenCenter, but then I tried via the command line and got this error:
Error code: SR_BACKEND_FAILURE_46
Error parameters: , The VDI is not available [opterr=VDI SOME-UUID already attached RW]
I've seen the message “The VDI is not available” many times before, so I knew what to do.
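For anyone who hasn't seen it: a common way to clear a stale RW attachment like this (a general approach, not necessarily the exact steps I used) is to find and remove the leftover VBD, then rescan the SR. UUIDs below are placeholders:

```shell
# Find any VBDs still holding the VDI (use the UUID from the error message)
xe vbd-list vdi-uuid=SOME-UUID

# Unplug and destroy the stale VBD left over from the old attachment
xe vbd-unplug uuid=<vbd-uuid>
xe vbd-destroy uuid=<vbd-uuid>

# Rescan the SR so xapi refreshes the VDI's state
xe sr-scan uuid=<sr-uuid>

# If xapi still believes the VDI is attached RW, restarting the toolstack
# on the host clears its in-memory state (running VMs are not affected)
xe-toolstack-restart
```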
The other day I started to get paged around 12:10AM because one of our XenServer hosts decided it was time to reboot. We have high availability (HA) turned on, so all the VMs running on this host were restarted on other hosts per our HA config. That was good; it meant the paging would stop once all the servers were running again.
But what was the cause of this reboot? Of course I went straight to the logs, and this is what I found in /var/log/kern.log:
Nov 15 00:05:59 xenserverhostname kernel: [2461330.653319] nfs: server 10.0.0.11 not responding, timed out
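A quick grep shows how often that message has been hitting the log. The sample log file below is made up so the example is self-contained; on a real host you would point the same grep at /var/log/kern.log:

```shell
# Create a small sample log so the command is self-contained
cat > /tmp/kern.log.sample <<'EOF'
Nov 14 23:58:01 xenserverhostname kernel: [2461200.100000] nfs: server 10.0.0.11 not responding, timed out
Nov 15 00:05:59 xenserverhostname kernel: [2461330.653319] nfs: server 10.0.0.11 not responding, timed out
Nov 15 00:06:02 xenserverhostname kernel: [2461333.200000] nfs: server 10.0.0.11 OK
EOF

# Count the timeouts (on a real host: grep -c '...' /var/log/kern.log)
grep -c 'not responding, timed out' /tmp/kern.log.sample   # → 2
```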
Looks like NFS was timing out to my storage backend, which is a NetApp FAS22xx. We've never seen performance issues with our NetApp before, but it looks like something is going on now. I noticed that we had a lot of volumes scheduled to run deduplication jobs starting at midnight, so I spread those out a bit so they weren't all trying to run at the same time. I also noticed that our XenServer HA heartbeat volume was getting dedup'd as well. I turned that off, because the heartbeat only takes up a few MB.
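Spreading out the dedup schedules looks roughly like this on the filer (7-Mode `sis` syntax assumed; volume names and hours are made up for illustration):

```shell
# Stagger dedup start times so the volumes don't all kick off at midnight
sis config -s sun-sat@0 /vol/vol_a        # leave one at midnight
sis config -s sun-sat@1 /vol/vol_b        # push the others to 1 AM, 2 AM, ...
sis config -s sun-sat@2 /vol/vol_c

# Turn dedup off entirely for the tiny XenServer HA heartbeat volume
sis off /vol/xen_ha_heartbeat

# Verify the new schedules and status
sis config
sis status
```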
I also noticed that this timeout had been logged before, just not often enough to cause a host to reboot. I believe HA/Xen will reboot the host once it exceeds a timeout threshold, and that is why the server rebooted. I think we are dealing with a couple of issues, and I hate to use the term “perfect storm,” but it seems fitting: a lot of NetApp jobs kicking off at midnight, I/O-heavy jobs kicking off on VMs at midnight, and XenServer's handling of the timeouts were all at play. I think spreading out the jobs on the NetApp and on the VMs, and applying patches, will help, but only time will tell.
I found this white paper from the early days of XenServer. I think it's worth a once-over, as most of the information and logic remain valid. The flow chart of recovery steps makes things look pretty simple and could help someone out of a jam. This is definitely something you want in your disaster recovery plan!
XenServer System Recovery Guide
Original source: http://support.citrix.com/servlet/KbServlet/download/17140-102-671536/XenServer%20System%20Recovery%20Guide.pdf