Data Center Migrations & Pretesting – How to Avoid an Executive-Caused Disaster
Relocation puts a strain on systems that results in higher-than-normal failure rates. To eliminate those failures, and many that result from changes, we encourage "pretesting". Pretesting is just another tool to help you find peace of mind.
When we at DCMWORKS coach clients on relocation, we ask them to pretest the systems that are going to be moved. Often we get pushback. So let's define what pretesting is and why it needs to occur.
DCMWORKS always asks our clients to shut off and restart all equipment that is going to be moved at least a week before the live move. During this process, all drivers should be updated and tested, and any IP address changes should be tested. Doing this at least a week in advance allows adequate time to discover problems. Additionally, in the case of critical applications, such as accounting applications, we suggest the shutdown take place before a month-end or quarter-end preceding the move.
What to include in Pretesting
Any machine that has not been shut down recently should be power-cycled. This proves that the power supplies and disk drives can take the added strain of a cold start. For that reason, we recommend the machines stay down for more than a few minutes during the test.
Any machine that needs drivers changed to function in the new data center should have those new drivers installed during the pretest. The drivers do not necessarily have to be the newest, but they should be at a level known to be compatible with the new site.
If the IP addresses of the equipment will change at the new site, they should be changed during the pretest. For example, if 10.192.80.121 must become 192.168.64.12 at the new site, then during the pretest you may want to change it to 10.192.80.140 (or any usable, site-specific IP). The DHCP service that distributes the address range must also be updated and given time to distribute the new addresses. Some users may have to be reminded (or forced) to reboot to pick up the new addresses.
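As a minimal sketch of that sanity check (the subnet and address plan here are hypothetical, not from any particular client), a few lines of Python can confirm that each planned pretest address actually falls in the old site's usable range and does not collide with the DHCP pool:

```python
import ipaddress

def check_pretest_addresses(planned, subnet, dhcp_pool):
    """Return (ip, problem) tuples for planned pretest addresses that
    fall outside the subnet or collide with the DHCP pool."""
    net = ipaddress.ip_network(subnet)
    pool = {ipaddress.ip_address(a) for a in dhcp_pool}
    problems = []
    for raw in planned:
        ip = ipaddress.ip_address(raw)
        if ip not in net:
            problems.append((raw, "outside subnet"))
        elif ip in pool:
            problems.append((raw, "collides with DHCP pool"))
    return problems

# Hypothetical plan: pretest addresses inside the old site's 10.192.80.0/24
planned = ["10.192.80.140", "10.192.81.9", "10.192.80.150"]
dhcp_pool = ["10.192.80.150", "10.192.80.151"]
print(check_pretest_addresses(planned, "10.192.80.0/24", dhcp_pool))
# → [('10.192.81.9', 'outside subnet'), ('10.192.80.150', 'collides with DHCP pool')]
```

A check like this is cheap to run against the whole address plan before anyone touches a machine.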
So a machine that has recently been rebooted, that will keep the same IP address, and that has the latest drivers would appear not to need testing. Don't let the appearance deceive you.
Examples of Discoveries with Pretesting
Often we discover that disk drives or power supplies fail during or after pretesting. If they are still under warranty or maintenance at that point, the repairs should be covered; however, many vendors will charge for a failure on move day that would have been covered a week beforehand at the old data center.
Often, even with minor IP address changes, we find systems that nobody thought were related failing because of a hard-coded IP address in a "trivial" connection (who gets to tell an executive that his or her scheduling tool is "trivial"?).
Making these discoveries ahead of time gives you the opportunity to change the IP address in the trivial system, or even to reschedule one system's move, to avoid problems caused by one system's reliance on another.
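One inexpensive way to hunt for those hidden dependencies before the pretest (a rough first-pass sketch, not a substitute for a proper audit; the config snippet is hypothetical) is to scan configuration files for hard-coded IPv4 addresses:

```python
import re

# Matches dotted-quad IPv4 addresses; octet ranges are not strictly
# validated, which is usually acceptable for a first-pass audit.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")

def find_hardcoded_ips(text, ignore=("127.0.0.1",)):
    """Return hard-coded IP addresses found in a config file's text,
    skipping addresses on the ignore list (e.g. loopback)."""
    return [ip for ip in IPV4.findall(text) if ip not in ignore]

# Hypothetical config snippet hiding a dependency on the old address
config = "scheduler.host=10.192.80.121\nloopback=127.0.0.1\n"
print(find_hardcoded_ips(config))
# → ['10.192.80.121']
```

Every address this turns up is a candidate for the "trivial connection" failure described above and should be traced to an owner before the move.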
Many clients find that when drivers are changed, there are unexpected consequences in those or other dependent systems. For example, changing a storage driver to be compatible with a device at the new data center may cause an incompatibility with an older piece of equipment that was not planned to be upgraded. Driver changes therefore sometimes lead to multiple upgrades to maintain compatibility, requiring additional time and resources that one probably does not have on the move weekend.
Examples of Discoveries without Pretesting
There are many horror stories in this category. I have chosen three to illustrate the point.
1) A client went to shut down System-X. They saw a System-X label and pressed the button. System-Y went down. Months prior, System-X had failed, and the server intended for (the then-new) System-Y had been reconfigured to run System-X. When the damaged System-X server was fixed, System-Y was brought up on it. Unfortunately, nobody remembered to swap the labels on the machines. The entirely wrong machine was turned off by the very systems administrator who had remotely loaded both System-X and System-Y.
Which application would you least like to see fail unexpectedly? A pretest would have turned this up in a maintenance window and saved the embarrassment. Because this was supposedly a machine that had been shut down about three months before the move, with no IP changes and no driver changes, the customer specifically excluded it from testing. D'oh!
2) A company moving out of a $250,000/month building discovered on move day that no one on staff could remember how to stop and start an old DOS machine that had been running an essential program for years. After paying the next month's rent (and making several calls to keep circuits and utilities running), they called in a specialist with experience handling such situations. A pretest would have flagged the problem in time to save that month's rent.
3) A car manufacturer was certain a server was accessed only through its FQDN, so it did not get an IP address change as part of the pretest. After the move, it was discovered that several small-parts suppliers had hard-coded the IP address (thus not using the FQDN at all). When the move occurred, the suppliers could no longer reach the application. The result was a failure to receive specific parts (in this case, screws) for manufacturing.
There will undoubtedly be failures, and the number that turn up during pretesting can look like a spike. I am taking the liberty of expressing this as two applications of the 80:20 rule.
80% of system failures are from people; 20% from machines.
80% of system failures are from changes made that affect the system; 20% are the standing system unable to continue as normal.
When you apply these both, you get this breakdown of most system failures:
64% are from changes made by people (change a parameter, upgrade a feature, etc.),
16% are from changes made to equipment (memory or disk upgrades, the move, etc.),
16% are from input changes (epidemic overload, new diseases, market spikes, etc.), and
4% are from hardware spontaneously failing in place.
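The breakdown above is just the two 80:20 splits multiplied together, as a few lines of arithmetic confirm (the cause labels follow the article's mapping of the four cells):

```python
# Two 80:20 splits: people vs. machines, and changes vs. the standing system
people, machines = 0.80, 0.20
changes, steady = 0.80, 0.20

breakdown = {
    "changes made by people": people * changes,        # 64%
    "changes made to equipment": machines * changes,   # 16%
    "input changes": people * steady,                  # 16%
    "hardware failing in place": machines * steady,    #  4%
}
for cause, share in breakdown.items():
    print(f"{share:.0%}  {cause}")
```

The four shares necessarily sum to 100%, which is a quick way to check that no failure mode has been double-counted.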
Fortunately, these failures are almost completely avoidable through well-run pretesting.
In summation, one of my life's mantras is "No thinking on the weekends!" I rarely accomplish that completely, but in finer terms it really means saving your 2:00 AM brain cells for real problems by avoiding issues for which you can economically test and build contingencies. That is in keeping with the Six Sigma principle of putting in effort to eliminate failures until the marginal cost outweighs the marginal savings.
Expect to see about four times the normal level of errors during the pretest cycle. Even then, you will probably still see about a 1% failure-to-start rate on move day, roughly one-quarter of the failure rate you would otherwise experience.
You do not want to have even a normal level of failures spoil the look and feel of a data center migration.
Pretesting is just another tool to help you find peace of mind.