Documentation is very important. I started a new SysAdmin gig a couple of months ago, and the people here have done a good job of it. The systems themselves are well documented, as are our maintenance contracts and that sort of thing. All this is good stuff.
But what are not documented are the relationships and dependencies between the company's various sites (at least on the Unix side of the house). Those sites are spread out all over the place: Canada, India, Texas, Louisiana, D.C.
Then the time came to upgrade DNS. Management got wind of the problem and decided it was a matter of some urgency. Never mind that their main DNS and mail server was running an unpatched copy of Solaris with the RPC portmapper open to the world; this problem needed to be fixed now.
The first time through, I discovered that they were depending on internal MX records in DNS to do mail routing. Uh… wrong! So, I prepared to take out the internal MX records. However, this meant that I had to change the sendmail configuration. Since they were running an old, unpatched copy of that, I decided to upgrade sendmail as well. I set up a mailertable and tried to get all the internal MX records into it. In the process, I discovered some relatively unknown machines running SMTP. You’d think they’d want to get rid of them if no one knew about them, eh? But no, the political climate (and some special people) guaranteed that they would stay.
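For illustration, here is a minimal Python sketch of the MX-to-mailertable step. It is not the script I used at the time, and it rests on assumptions: the MX records are available as zone-file style lines (`name IN MX preference target`), the lowest preference wins, and each route uses the `esmtp` mailer with the target in square brackets so sendmail skips the MX lookup. The function name and the `example.com` hosts are hypothetical.

```python
def mx_to_mailertable(zone_lines):
    """Turn zone-file style MX lines into sendmail mailertable lines.

    Assumes each record looks like: <name> IN MX <preference> <target>
    Anything else (A records, blank lines, comments) is skipped.
    """
    best = {}  # domain -> (preference, target); lowest preference wins
    for line in zone_lines:
        fields = line.split(";", 1)[0].split()  # drop trailing ';' comments
        if len(fields) != 5:
            continue
        if fields[1].upper() != "IN" or fields[2].upper() != "MX":
            continue
        domain = fields[0].rstrip(".").lower()
        preference = int(fields[3])
        target = fields[4].rstrip(".").lower()
        if domain not in best or preference < best[domain][0]:
            best[domain] = (preference, target)
    # One tab-separated route per domain; brackets mean "do not look up MX".
    return [
        f"{domain}\tesmtp:[{target}]"
        for domain, (_, target) in sorted(best.items())
    ]


if __name__ == "__main__":
    # Hypothetical records, standing in for the internal MX entries.
    records = [
        "eng.example.com.   IN MX 10 mail1.eng.example.com.",
        "eng.example.com.   IN MX 20 backup.example.com.",
        "sales.example.com. IN MX 5  mail.sales.example.com.  ; legacy box",
        "www.example.com.   IN A  192.0.2.10",
    ]
    for route in mx_to_mailertable(records):
        print(route)
```

Generating the table from the zone data, rather than typing it by hand, also gives you a list of every host that was receiving mail, which is how forgotten SMTP machines tend to surface.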
I was able to clean up DNS a bit as a result of this upgrade. I had to; the new BIND was far more sensitive to configuration problems than the old one.
After extensive testing, I put the changes in place. It took longer than expected — things always do — but it got done.
Oops! There was no checklist of things to make sure that everything was done right (and this was a rush project, so there was no time to create one), so 6000 users lost their mail for about 12 hours.
Of course, a bigger deal was made of it than was necessary. It was a big deal, but really, no one believed the specter of a lost nuclear power plant sale because email was down.
Finally, though, all the problems were fixed. What were the lessons I learned?
- Document everything, for your sake and the sake of the person who comes after you. Especially document dependencies. People shouldn’t be able to give you grief over something you had no way of knowing about. If it isn’t documented, it doesn’t exist.
- Make sure you have management’s support. You’ll need these guys saying “I gave him the go-ahead” if something goes wrong.
- Try to get as much information about the changes as you can. Test the information you have. Test it again.
- Get someone else to review what you are doing if you can. You might miss something.
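
To make the "document dependencies" lesson concrete, here is a minimal Python sketch of a dependency map and a lookup that answers "if I touch this, what else is affected?" The service names and the map itself are hypothetical and only illustrate the idea; a real map would come from whatever you manage to document about your own sites.

```python
from collections import deque


def affected_by(change, depends_on):
    """Return every service that directly or indirectly depends on `change`.

    `depends_on` maps a service name to the list of services it relies on.
    """
    # Invert the map: service -> set of services that rely on it.
    dependents = {}
    for service, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the thing being changed.
    seen = set()
    queue = deque([change])
    while queue:
        current = queue.popleft()
        for service in dependents.get(current, ()):
            if service not in seen:
                seen.add(service)
                queue.append(service)
    seen.discard(change)  # guard against dependency cycles
    return sorted(seen)


if __name__ == "__main__":
    # Hypothetical map: each entry lists what that service needs to work.
    depends_on = {
        "sendmail-hub": ["dns-primary"],
        "texas-smtp": ["sendmail-hub", "dns-primary"],
        "india-smtp": ["sendmail-hub"],
        "user-mail": ["sendmail-hub"],
    }
    for service in affected_by("dns-primary", depends_on):
        print(service)
```

The output of a lookup like this doubles as a starting checklist: each affected service is something to test before and after the change.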