Random Server hang issues result in a required hard reset
Windows Server 2008 R2 64-bit systems hang at random and require a hard reboot to recover. You can reach the system over KVM (RDP is not accessible) and even send CTRL+ALT+DEL, but after the lock screen goes away and the system tries to give you a login screen… YOU GET NOTHING. Only silence…
We ended up with a three-headed root cause on this set of issues:
1.) Our blades had a bad BIOS version that put the systems into an inconsistent state, and only a power cycle would clear it.
2.) The hardware vendor had Data Execution Prevention (DEP) turned on at the hardware layer by default.
3.) By default, Windows has its own version of DEP turned on for all programs and services unless you add exceptions.
How did we diagnose this beast? Many team members (Dan, Don, Christian, Jim, and Cornè) weighed in, and each found part of it, with support from our hardware vendor and Microsoft.
The issues plagued us for several weeks because the failure was not predictable and there was NOTHING in the logs to correlate the hangs. The only common factor was a single model of blade server.
A call with Microsoft and the hardware vendor suggested that the Microsoft DEP might be part of the issue as well. Luckily, our support level was good enough to get both vendors on the same line and have them work together. Support calls like this are not cheap if you don’t have the agreements in place already.
The fix came in three parts:
1.) Flash the BIOS with an updated and vendor-verified version.
2.) Turn off the hardware DEP.
3.) Set the Windows DEP to on “for essential Windows programs and services only”.
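For step 3, here is a minimal Python sketch of how you might check and script the Windows DEP setting instead of clicking through the System Properties dialog. It is not what we ran during the incident; it is just an illustration. It assumes the policy codes reported by the WMI property `DataExecutionPrevention_SupportPolicy` (0 to 3) and the `bcdedit /set {current} nx <policy>` command, where `OptIn` corresponds to “essential Windows programs and services only”. The function names are made up for this example, and the `wmic` call only works on a Windows box that still ships `wmic`.

```python
import platform
import subprocess

# Policy codes as reported by Win32_OperatingSystem.DataExecutionPrevention_SupportPolicy.
DEP_POLICIES = {
    0: ("AlwaysOff", "DEP is disabled for all processes"),
    1: ("AlwaysOn", "DEP is enabled for all processes, no exceptions"),
    2: ("OptIn", "DEP covers essential Windows programs and services only"),
    3: ("OptOut", "DEP covers all programs and services except those you list"),
}


def describe_dep_policy(code):
    """Turn a numeric DEP policy code into a readable description."""
    name, meaning = DEP_POLICIES[code]
    return f"{name}: {meaning}"


def bcdedit_nx_command(policy_name):
    """Build (but do not run) the bcdedit command that sets the DEP policy."""
    valid = {name for name, _ in DEP_POLICIES.values()}
    if policy_name not in valid:
        raise ValueError(f"unknown DEP policy: {policy_name!r}")
    return ["bcdedit", "/set", "{current}", "nx", policy_name]


def read_dep_policy():
    """Query the current DEP policy code through wmic (Windows only)."""
    out = subprocess.check_output(
        ["wmic", "OS", "Get", "DataExecutionPrevention_SupportPolicy", "/value"],
        text=True,
    )
    # With /value the output has the form "Name=value".
    for line in out.splitlines():
        if "=" in line:
            return int(line.split("=", 1)[1].strip())
    raise RuntimeError("could not parse wmic output")


if __name__ == "__main__":
    if platform.system() == "Windows":
        print("Current:", describe_dep_policy(read_dep_policy()))
    # Only print the command; run it yourself from an elevated prompt.
    print("To apply:", " ".join(bcdedit_nx_command("OptIn")))
```

The script only prints the `bcdedit` command rather than executing it, since changing the policy needs an elevated prompt and a reboot to take effect, and you probably want a human in the loop for that on a production server.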
Since making these changes we have not seen a recurrence of the random system hang issues. I will update this post if things change… but so far, so good!