Cut Your Computer Systems Downtime in Half!

Submitted by Anonymous on Sat, 06/13/2009 - 09:10

Why does computer maintenance take so much time? Why are computer systems down for so long when they fail?

Computer systems inevitably fail, but you can minimize the costs and reduce workstation downtime to hours instead of days if you fully apply these strategic principles in your computer support operation.

There are four main causes of system failure: “domino” effects, moving parts (hard drives), data corruption, and configuration management issues. Electronics failures occasionally happen, but these are typically the quick failure of defective components, or the result of heat or power issues.

The “Domino” Effect

The “Domino” effect can be seen even in the smallest of businesses. The trend over the last twenty years has been to load up the main office computer with word and number crunching programs, email, and web browsers, followed by an endless list of other small, operation-critical applications. When that computer fails, -all- of these applications go out at once, and the repair must often rebuild and test every one of these dependent components.

When a personal computer cost three or four thousand dollars, this “old school” approach was a necessary evil. However, today, computer systems can be had for around three hundred dollars! It is now possible to compartmentalize critical applications on small and inexpensive dedicated computers, and contain the “domino” effect.

Rule #1: Don't load up a system with all of your critical applications. Compartmentalize. Isolate systems so that a hardware failure does not take down everything at once.
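To make Rule #1 concrete, here is a minimal sketch that compares how many applications a single hardware failure takes down under each layout. The host names and application lists are made up for illustration; substitute your own inventory.

```python
# Hypothetical inventory: which applications run on which machine.
# All host and application names below are invented for illustration.

def apps_lost(layout, failed_host):
    """Return the applications that go down when failed_host fails."""
    return sorted(layout.get(failed_host, []))


def worst_case(layout):
    """Return the largest number of applications any single failure takes down."""
    return max((len(apps) for apps in layout.values()), default=0)


# The "old school" approach: everything on the main office computer.
all_in_one = {
    "office-pc": ["email", "web browser", "word processor",
                  "time clock", "key card log"],
}

# The same applications, with the critical ones on dedicated machines.
compartmentalized = {
    "office-pc": ["email", "web browser", "word processor"],
    "timeclock-box": ["time clock"],
    "access-box": ["key card log"],
}

print(worst_case(all_in_one))         # 5
print(worst_case(compartmentalized))  # 3
print(apps_lost(compartmentalized, "timeclock-box"))  # ['time clock']
```

In the compartmentalized layout, losing the time clock machine leaves email and building access untouched, and only one application has to be rebuilt and tested.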

Moving Parts (Hard Drives)

Google has extensive hard drive reliability data from their huge server farms. Their data suggests that after a couple of years of use, the risk of hard drive failure increases dramatically. In my own experience, I would attribute about 40% of system failures to the hard drive.

The good news is that this new class of small and inexpensive computers typically has no moving parts! The hard drive has been replaced by flash technology. Computer systems like security gate controllers, industrial process controllers, cash registers, and the computers inside CNC machines originally had no hard drives. Plus, they were “compartmentalized” systems – isolated from any domino effect. These last-generation embedded systems would often run for years without failure. Only the occasional electronics failure would shut them down.

Rule #2: Selectively replace hard drives with storage components that have no moving parts. Per thousand gigabytes, hard drive storage is still more affordable, but on small compartmentalized systems fifty to one hundred dollars of non-disk storage is often all that is needed – and the smallest capacity hard drives are more expensive than that.

A hard drive failure can corrupt the data it contains, but about half the data corruption incidents I have seen have other causes.

Data Corruption

Data corruption is typically a side effect of web browsing and email, software updates, frequent hard reboots of the computer system, and well-meaning “geek squad” boo boos.

Once again, the compartmentalization strategy helps. Systems dedicated to tasks like time clock data collection, logging building access for a key card system, or computer-controlled video security should NEVER, EVER be used for email and web browsing.

These systems should have software updates TURNED OFF and should only be updated when configuration management demands it.

Reboots of these systems should be infrequent, and I would prescribe turning off any preventative reboot programs that restart the system once a day or week – unless the nature of the system demands it.

Rule #3: No automated software updates, reboots, or Internet access on compartmentalized systems that run business-critical applications.

Configuration Management

Most “geek squads” can replace a hard drive or a printer, but they have no idea about configuration management because they don't get to see the big picture. Configuration Management means having a plan or template for how systems are configured. After a failure, systems are put back -exactly- in accordance with the plan.
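As a rough illustration of checking a rebuilt machine against the plan, the sketch below compares a machine's settings to a template and reports every difference. The setting names and values are hypothetical, chosen to mirror Rule #3; they do not come from any particular tool, and gathering the actual settings from a real machine is left out.

```python
# Hypothetical template for a dedicated (compartmentalized) system.
# The setting names are illustrative only, not taken from any real tool.
TEMPLATE = {
    "automatic_updates": False,
    "scheduled_reboot": False,
    "internet_access": False,
    "installed_apps": ["time clock"],
}


def find_drift(template, actual):
    """Return {setting: (expected, found)} for each setting that differs."""
    drift = {}
    for key, expected in template.items():
        found = actual.get(key)  # None if the machine lacks the setting
        if found != expected:
            drift[key] = (expected, found)
    # Settings on the machine that the plan never mentions also count as drift.
    for key in actual.keys() - template.keys():
        drift[key] = (None, actual[key])
    return drift


# Assumed state of a machine after a repair; updates were left switched on.
rebuilt_machine = {
    "automatic_updates": True,
    "scheduled_reboot": False,
    "internet_access": False,
    "installed_apps": ["time clock"],
}

for setting, (expected, found) in sorted(find_drift(TEMPLATE, rebuilt_machine).items()):
    print(f"{setting}: expected {expected!r}, found {found!r}")
```

An empty result means the machine matches the plan; anything else is a list of what still has to be put back.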

Configuration Management is such a huge issue that it deserves its own discussion, and I have other articles on this web site that touch on this subject.

Rule #4: Practice good Configuration Management.

Summary

These are all business-class support issues. If your home laptop fails, chances are good the neighbor's kid can reload your word processor and QuickBooks. You can even afford to leave it with him for a couple of days while he sorts it out. Keeping business system downtime contained to hours instead of days is an entirely different matter.

If you are responsible for desktop maintenance at your organization, these four principles can slash your costs and minimize your system downtime.