5 best practices for successful system administration
Successful system administration takes more than technical skill. Below are five slightly non-technical abilities you should develop in order to become the best system admin ever.
1. Monitor, measure, and record.
Yes, you know what the swap usage is today because there’s a problem with the disks thrashing and it’s slowing the server down. But your users are complaining to management that it’s an ongoing issue and now management is asking you for data. What, you haven’t been documenting this, so it’s now your word against Sales and Marketing? Guess who wins that argument by default? You’re responsible for the system, so they will make this your problem. So get, build, or buy a system to monitor, measure, and record that data so you can build pretty PowerPoint slides for finance the next time you need to ask for hardware upgrades, or to prove that the issues are caused by bad software rather than your perfectly functioning servers. Even if you are just running a single server for an employer, a client, or even yourself, it’s good data to have for some unforeseen reason someday.
A shortlist of things to start monitoring/recording/charting/graphing:
- Load average
- Memory usage
- Disk I/O (transactions per second)
- Network throughput (in Mbits/sec)
- Network throughput per virtual host/site
- Transfer (in GB/month)
- Transfer per virtual host
- Disk storage (monthly in GB, plus a daily rolling average if files are uploaded and deleted regularly)
- Average response time of test URI under your control (in milliseconds)
- Average response time of a PHP (or Ruby/Python/etc.) page under your control that does not change. Testing real web pages gives you a consistent baseline that you can use to narrow the problem to the server, the OS, or the web code itself.
- SSH logins per day/month by user and IP address
- Anything you feel is necessary, or expect to get questions about later
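If you aren't ready to buy or install a full monitoring system, even a tiny script run from cron gets you started on the "record" part. Here is a minimal sketch, assuming a Unix-like host and Python 3. The file name metrics.csv, the field names, and the two helper functions are my own inventions for illustration, and it only covers load average and disk space from the list above.

```python
import csv
import os
import shutil
import time
from pathlib import Path

# Column names are arbitrary; pick whatever you will recognize in a year.
FIELDS = ["timestamp", "load_1m", "load_5m", "load_15m",
          "disk_used_gb", "disk_free_gb"]


def take_sample(path="/"):
    """Return one row of measurements for the host this runs on."""
    try:
        load_1m, load_5m, load_15m = os.getloadavg()  # Unix only
    except (AttributeError, OSError):
        # Load average is unavailable here; record blanks instead of failing.
        load_1m = load_5m = load_15m = None
    usage = shutil.disk_usage(path)
    return {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "load_1m": load_1m,
        "load_5m": load_5m,
        "load_15m": load_15m,
        "disk_used_gb": round(usage.used / 1e9, 2),
        "disk_free_gb": round(usage.free / 1e9, 2),
    }


def record_sample(log_file, sample):
    """Append a sample to a CSV file, writing the header on first use."""
    log_path = Path(log_file)
    write_header = not log_path.exists() or log_path.stat().st_size == 0
    with log_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(sample)


if __name__ == "__main__":
    record_sample("metrics.csv", take_sample())
```

Run it every five minutes from cron and you have a history you can open in any spreadsheet when management comes asking.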
Once you have consistent information, you’ll start seeing patterns and can look for things out of the ordinary. It’s also good for correlating data to behaviors when you’re troubleshooting issues and aren’t sure where to start.
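One rough way to look for "things out of the ordinary" is to compare each new reading against a rolling average of the readings before it. The sketch below is only an illustration: the function name, the seven-reading window, and the three-standard-deviation threshold are assumptions of mine, not a recommendation, and the sample numbers are made up.

```python
from collections import deque
from statistics import mean, stdev


def flag_outliers(values, window=7, threshold=3.0):
    """Return (index, value) pairs that sit far from the recent rolling average."""
    recent = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(values):
        # Only judge a reading once a full window of history exists.
        if len(recent) == window:
            baseline = mean(recent)
            spread = stdev(recent)
            # A perfectly flat baseline (spread of 0) is skipped in this sketch.
            if spread > 0 and abs(value - baseline) > threshold * spread:
                flagged.append((index, value))
        recent.append(value)
    return flagged


# Made-up daily load averages with one obvious spike.
daily_load = [0.8, 0.9, 0.7, 0.8, 1.0, 0.9, 0.8, 0.9, 6.5, 0.8]
print(flag_outliers(daily_load))  # prints [(8, 6.5)]
```

Note that the spike itself enters the window afterwards and inflates the baseline for a while, so treat this as a starting point for your own charts rather than a finished alerting rule.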
2. Develop project management habits.
Even for small, one-person projects. Write up a small scope of work, write requirements, get sign-off from stakeholders on their expectations, plan a schedule, and record your activities. Write up a postmortem document at the end. Even if it’s just for yourself. It doesn’t have to be fancy, and it certainly doesn’t have to be formal PMBoK activities. It may seem bureaucratic managing all that paper and it may seem like you’re spending more time on paperwork than sysadminning, but it helps keep you organized when your boss hands you a random high-priority assignment that pulls you away from your task. It’s also handy when you build a new system and users complain that it doesn’t do what they wanted it to do. See? You got their sign-off on the requirements document right there…
Even if it’s just for yourself, one day you’ll ask yourself, “now why on earth did I install Acme::Phlegethoth on this server? Oh yeah, it was for that weird commune who needs it for their application code…”
3. Develop a system for day-to-day work.
Again, this may seem bureaucratic, but if you spend your days just “doing stuff” without a To-Do list, you may find it difficult to explain to your boss next week exactly what you’ve been doing with your time. I’ve become a fan of Kanban boards lately because a board is a visual device that your boss (or anyone who assigns you work) can interact with. Let’s say I’ve got three items I plan to work on today that should fill up my 8 hours. “Oh, you need me to work on this other item instead? Yes sir! Here is what I planned to work on today. Which one should I deprioritize in favor of this one? Oh, so it’s more important than this one, but not as important as these two? That’s fine, I can requeue that lower priority one and get to it later.” This helps set expectations. I know of one graphic designer who used a board to coordinate her work between three competing project managers. If one asked her to prioritize something, she’d show him her board and send him to the other project managers to negotiate the conflict and coordinate their deadlines. Even if no one else looks at your board but you, it helps to keep you organized.
4. Develop communication skills (sales, presentation, etc.).
It took me a while to really understand why this is important. Yes, today you just want to sit in a server room, keep things running, and look at Lolcats. But tomorrow, you may have other people assisting (or working for) you. You need to be able to communicate expectations. You need to propose and advocate for your ideas to your peers or to management (great ideas never stand on their own merit unless and until they are properly communicated). Maybe you need to convince someone that they need to upgrade the web server. Maybe you need to explain your new server proposal that will fix all their problems. Maybe you need to convince the developer that his code is really causing those memory leaks, but you need to present it in a non-accusatory manner. I’m personally a big fan of Toastmasters for this, as it’s the cheapest and most effective way I’ve found to improve your ability to communicate.
5. Start preparing for “what if” scenarios.
Your servers will crash. Your servers will be hax0r3d. Your backups will be corrupted. So start figuring out how to react when that happens. One of the unhappiest days of my life was when my personal server was r00t3d. I did all the right things, but the attackers were more dedicated to getting in than I was to keeping them out. How do you remove a rootkit after it’s discovered? I didn’t know then, because I never asked the question (remember? I thought I did all the right things to prevent it in the first place). You can bet I certainly know now! What happens when the server drops off the network because of a power outage, and now it’s saying “kernel not found”? What happens when your client or internal user asks you to restore a backup, and the backup is corrupted? You may not get all the answers to these until you actually experience them first-hand, but it’s better to start asking the questions now and not when you have angry people yelling at you. Also, once you start asking the questions, you can start setting up “self-training” scenarios to test your answers. Set up a test box and remove the kernel. See if you can get it back to operational. Try to get someone to install a rootkit on it, or at least do a bunch of stuff that you have to troubleshoot and fix. By asking these questions now, you’ll be in a much better position to deal with them later.
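For the corrupted-backup question, one "self-training" exercise is to record a checksum for every file when the backup is written and re-check those checksums on a schedule, long before anyone asks for a restore. This is a hedged sketch only: the manifest file name and the three helper functions are hypothetical, and it assumes the backup is a plain directory of files rather than whatever format your backup tool produces.

```python
import hashlib
from pathlib import Path

MANIFEST = "MANIFEST.sha256"  # hypothetical file name for the checksum list


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large backups do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(backup_dir):
    """Record a checksum for every file in the backup directory."""
    backup_dir = Path(backup_dir)
    lines = []
    for path in sorted(backup_dir.rglob("*")):
        if path.is_file() and path.name != MANIFEST:
            relative = path.relative_to(backup_dir).as_posix()
            lines.append(f"{sha256_of(path)}  {relative}")
    (backup_dir / MANIFEST).write_text("\n".join(lines) + "\n")


def verify_manifest(backup_dir):
    """Return the relative paths that are missing or no longer match."""
    backup_dir = Path(backup_dir)
    problems = []
    for line in (backup_dir / MANIFEST).read_text().splitlines():
        if not line.strip():
            continue
        expected, _, relative = line.partition("  ")
        target = backup_dir / relative
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(relative)
    return problems
```

Call write_manifest right after the backup finishes and verify_manifest from a scheduled job; an empty list means every file still matches. A matching checksum only tells you the file hasn't changed since it was written, so you still want to practice an actual restore on a test box.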