Nobody gets the root password.
That's right - nobody; not the DBA, not the developers, not even your Aunt Martha.
With the root password, anybody can do anything on your server. Traceability is, at
best, minimal. Of course, when something goes wrong, nobody is responsible - except
you!
Know your System Administrator. Do a
background check on them. Include a credit check and a criminal background check.
Whoever you hire as a system administrator has to be trustworthy. Would you
authorize a known criminal or someone with a bad credit rating to withdraw funds from your
bank accounts at their discretion? The reality of the situation is that few managers
have any idea what their system administrators are doing and why. The system
administrator has a great deal of discretion, has access to valuable data and can assume
the identity of anyone and do almost anything.
Document your systems. If your
documentation is NOT in text format then your systems are NOT documented. If the
system administrator can't get to it in an emergency then it is no good. The
documentation that you put together last week (in terms of system information) is now out
of date. Insist that your system administrator maintains 'how-to' notes on the
system in text format. The 'how-to' notes should include the names and telephone
numbers of contacts, how software was installed and why, etc.
Insist on written plans. Written
plans should include step-by-step commands. The preparation of the plan forces the
system administrator to think about what is going to happen, the risks involved, etc.
Deviations from the plan should be expected. Sometimes errors are discovered
during a review. By thinking ahead, the risk of something BAD happening is
minimized. Another benefit of the plan is that it provides documentation for future
maintenance activity.
Don't rely on tools. If you believe
the salesmen, the chimp from "Bedtime for Bonzo" could run your entire data
center with just a click of a mouse. There is no substitute for an experienced
system administrator checking your systems. Many of the tools are awkward, intrusive
and expensive to maintain. True, some tools are very helpful, but they are not a
substitute for good systems administration.
Implement a maintenance plan. The
plan needs to include scheduled reboots of the system, disk reorganization, and the collection,
review and installation of system patches. The maintenance plan needs the full
support of upper management. A small amount of scheduled downtime is a lot better
than a system failure in the middle of a critical event such as closing accounting records
for the end of the year.
Develop a disaster recovery plan.
This means being able to rebuild your system at a remote site, not merely being able
to restore data from tapes. A UNIX operating system is dynamic and is changing all
the time. See the section on "What you should know...Business
Continuance".
Back the system up. Some companies
think that some systems don't have to be backed up because they are a 'development system'
or a 'test system', etc. Sometimes it seems unimportant - until a critical file of
source code is erased - then it becomes important. Backups are cheap insurance and
are easy to implement. Back up EVERYTHING and put it on a retention cycle.
Do scheduled audits of your backups. Review the logs, double-check the
configuration, etc. It's easy to forget a detail when you are busy
doing 5 things at the same time!
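Part of a scheduled backup audit can be automated. The following is a minimal sketch in Python. It assumes a hypothetical record of the last successful backup time for each system. The system names, the `find_stale_backups` helper and the 24-hour threshold are illustrative assumptions, not features of any particular backup product. It flags every system whose last good backup is missing or too old, development and test systems included.

```python
from datetime import datetime, timedelta

def find_stale_backups(last_success, now, max_age=timedelta(hours=24)):
    """Return sorted names of systems whose last good backup is missing or too old.

    last_success maps a system name to the datetime of its last successful
    backup, or None if no successful backup has ever been recorded.
    """
    stale = []
    for system, finished in last_success.items():
        if finished is None or now - finished > max_age:
            stale.append(system)
    return sorted(stale)

# Hypothetical audit data; in practice this would come from the backup logs.
now = datetime(2024, 1, 10, 8, 0)
last_success = {
    "prod-db": datetime(2024, 1, 10, 2, 0),   # 6 hours old: within the limit
    "dev-build": datetime(2024, 1, 7, 2, 0),  # 3 days old: stale
    "test-app": None,                         # never backed up
}
print(find_stale_backups(last_success, now))
```

A check like this only supplements the manual review of logs and configuration; it does not prove that a backup can actually be restored.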
Assign a lead system administrator to EVERY server.
When everyone is responsible for a system then NO ONE is responsible for the
system. The lead administrator should check the system daily, should know what is
running on the system, etc. The lead system administrator coordinates all work done
on that server.
Calculate the cost of each server being down for 1
day. Put a number on it! This exercise puts into perspective how
important good system administration is. It gives a non-emotional, non-anecdotal
basis for making decisions about extra tape drives, disks, training, etc. For
example, what is the cost associated with having a team of 10 developers (working on a
high visibility project) sitting idle for a day or two? Will the project leader or
development manager sign off on a written document stating that 2 or 3 days of downtime is
OK? How much uptime do they want, and are they willing to pay for it?
Have the business line or manager sign off on how long a system can be
down. If it is important for them to have it up and running, then they
will back up their concern with money.
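The downtime exercise above comes down to simple arithmetic. Here is a minimal sketch in Python. The hourly rate, the 8-hour working day and the function name `downtime_cost` are assumptions chosen for illustration only, so substitute your own figures.

```python
def downtime_cost(people, hourly_rate, hours_per_day=8, days=1):
    """Estimate the labor cost of staff sitting idle while a server is down.

    This counts only idle staff time; lost revenue, missed deadlines and
    recovery work would have to be added on top.
    """
    return people * hourly_rate * hours_per_day * days

# Example from the text: 10 developers idle for 1 day and for 2 days.
# The rate of 75 per hour is a made-up figure.
one_day = downtime_cost(people=10, hourly_rate=75)
two_days = downtime_cost(people=10, hourly_rate=75, days=2)
print(f"1 day: {one_day}, 2 days: {two_days}")
```

Putting a figure like this in front of the business line gives them something concrete to weigh against the price of extra hardware or training.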