Guides

Straightforward guides and papers on a range of relevant topics

You're here: Skip Navigation LinksHome > Guides > Achieving Reliable Systems

Achieving Reliable Systems

Every business needs its computer systems to be reliable. System crashes can waste huge amounts of time and also carry the risk of data loss.

This guide highlights some of the key factors in making systems reliable. It is intended for smaller systems where stand-by equipment, cluster servers or other more elaborate precautions are unrealistic.

It is perfectly possible these days for Windows-based computer systems to run with very high levels of reliability and minimum unplanned downtime. Following the guidelines below will help you achieve the reliability that is essential for your business to work efficiently.

The factors influencing system reliability are discussed under six headings:

  • Hardware
  • Operating Systems
  • Networks
  • Application Software
  • Virus attack or system intrusion
  • Maintenance and monitoring

Hardware

Specification

Get the right equipment for the job. In the case of computers, that means enough memory (RAM) for the software you want to run, and sufficient disk space to store your programs and data – allowing for future growth. Systems that are under-specified will be more highly stressed and, apart from anything else, are likely to test your patience. Some programs are very resource-hungry and you really do need to be careful what equipment you buy for them. Video editing is an example of an application requiring a fast processor, lots of memory (RAM) and huge amounts of disk space. In the case of printers, make sure you buy something that is intended to handle the volumes you need to print.

RAID disk systems

A RAID disk array (of which a mirror disk is one option) allows a computer to suffer failure of a hard disk with no ill effect. Disks and RAID controllers are relatively inexpensive, and RAID systems should always be specified for servers.

Component quality

Cheap memory or poor video cards can cause real issues, such as the dreaded blue screen crash. Try to buy good quality components in the first place, and get upgrades or add-ons from reputable firms. Any differences in cost are usually minor.

Peripheral equipment or accessories 

These can be a bit of an issue – for example printers that don’t communicate properly (usually through USB connection) can mean a computer has to be rebooted to clear the problem. Buy reputable brands and make sure you get the latest drivers for your operating system from the manufacturer. Wireless equipment (eg mice, printers and routers), although now generally good, is almost invariably going to be less reliable than that connected by a piece of wire.

Mechanical fitting

Make sure all plugs and network wires are securely connected. In time, internal components of a PC may need re-seating to ensure a good connection – for example memory, hard drive connectors or video cards. Temperature variation may cause loosening of components.

Operating environment

Excessive heat or cold may cause difficulties over time, as components expand and contract more the greater the temperature variation. Liquid spillage should be avoided at all times. Very cold environments can result in condensation that could destroy circuit boards. Damp may cause corrosion. Dust and dirt may eventually clog up ventilation slots, and reduce the effectiveness of cooling fans.

Power

Use an Uninterruptible Power Supply. This not only allows the system to keep running through short power interruptions, but can protect against damage or corruption caused by unexpected shutdowns or mains surges.

Firmware levels

The expression firmware refers to low-level software that is stored in hardware devices such as disk drives, printers, routers, and the BIOS software in computer motherboards. Manufacturers often update firmware, particularly in the early period of a product’s lifecycle, and these updates are usually to fix faults or improve performance. They are almost invariably worth downloading and applying.

Wear, tear and age

Older equipment will tend to become less reliable. A typical average life for a component might be expressed as 40000 hours MTBF (Mean Time Between Faults) – that’s about 4 and a half years of continuous use. Generally speaking, turning equipment on and off every day will reduce that life, so although you may not leave your computer on all the time that does not mean it will last proportionately longer. Life expectancy may also be reduced by poor operating environment (see above).

Operating Systems

In the past, some desktop operating systems were inherently unreliable. If you got through a working week without having to reboot a Windows 98 PC, you had done well. A program crash would usually mean a reboot. Nowadays, Windows XP, Vista, 7 and Server 2003/2008, and Unix/Linux should all be solid if installed and configured correctly on good hardware. Generally speaking these operating systems will all handle errors arising from the programs you run pretty well, and the system itself should not crash or need rebooting when program errors happen.

  • Use up to date drivers for all components and peripherals – eg video cards, printers. If a product or operating system is relatively new, there may be frequent driver updates as problems are found and fixed. Check the manufacturer’s website. Faulty drivers are more likely to crash a modern operating system than software applications.
  • Install operating system service packs. (Many would say, with some justification that you should not use a new Windows operating system until Service Pack 1 has been released.)
  • Install operating system updates (eg Windows Updates). On balance, these do more good than harm and security fixes can be important. But you may wish to control which ones are installed and when, rather than let the system do the updates automatically.
  • Restrict Administrator rights to those users who really need them. If a user does not have Administrative rights, it restricts the damage they can do to the system (unintentionally or otherwise).
  • Resolve all configuration or other issues which result in errors or warnings in Event Logs.
  • Do not unnecessarily mess with server configuration or installed software if it is working OK.

Networks

  • Use good quality network components such switches and routers.
  • Use wired connections wherever possible.
  • Protect wireless systems with the highest level of security available.
  • Ensure that network devices (connections, routers) and software (such as DHCP) are correctly configured, and that IP address conflicts are avoided.

Application Software

  • Do not install software or add-ons unless you are confident they are from a reputable source.
  • Windows Vista and 7 include features to control the installation of software (User Account Control). You are recommended not to disable this feature, even if it can sometimes be irritating. It can prevent a malicious program installing itself without the user noticing.
  • Do take care on the Internet. Ensure that you use browser pop-up blockers provided. Don’t click on things that tell you have won prizes, that you may have a virus and need a free check, or similar. Keep away from dubious websites. Do not install unnecessary browser add-ons.

Virus attack and intrusion

  • Install reputable anti-virus software. Use a firewall – this may be built into a router, a separate physical device, or part of an anti-virus software suite. There is also the firewall built into more recent versions of Windows.
  • Ensure that anti-virus software is set to update automatically, and that subscriptions are renewed where required.
  • Ensure that the firewall is turned on in internet routers.
  • Ensure that wireless routers are secured.

Monitoring and maintenance

One of the most important aspects of ensuring system reliability is to monitor a system. Key issues to watch out for are:

  1. Lack of free disk space
     
  2. Disk fragmentation causing loss of performance
     
  3. Failure of automatic backups
     
  4. Failure of automatic updates to anti-virus software
     
  5. Failure to keep up to date with operating system and browser security updates
     
  6. The appearance of system errors and warnings in event logs
     
  7. Dirt and dust clogging air inlets, outlets and heatsinks, or failed cooling fans causing overheating
     
  8. Unduly hot, cold or damp operating conditions
     

If you have any questions or require advice on this topic, please Contact Us