Thursday, December 29, 2005

Building Reliable Infrastructure

Here are some quick tips I've used in the past when building out infrastructure and applications. Most of these are common sense and well known. First, here is a quote I give customers when going over availability options: "There are two types of systems: those that have failed and those that are waiting to fail." Simply put, almost all systems, no matter how well architected, will crash and burn at some point. How fast you recover from that is what high availability (HA) is all about.

The quote is very important when communicating with a customer because it sets realistic expectations for the system. As long as humans are writing the code and building the infrastructure, mistakes will happen. Minimizing and recovering from those mistakes is what makes up your HA plan. Here are the tips:

-Document, Document, Document. Every aspect of the system needs to be well documented. Concentrate on "what to do if this piece fails" scenarios. Make it simple and concise. Make sure everyone has it and that it has been exercised. This is the single most important step when building HA systems. The best architected system in the world means nothing if the folks doing the support don't understand it.

-Buy at least the second node. Clustering your hardware, even in a passive mode, can save you a lot of downtime. Convince your client it's in their best interest. If you can't, make sure similar hardware is available on site for parts. Trying to acquire parts while your server is down is not much fun.

-If you can, buy Veritas cluster software. It's the best. If you get it, practice, practice, practice. Go over as many what-if situations as possible, and do this before the application goes live.

-Make sure you have a good change management process. Changes in the environment are responsible for a lot of application failures. Developers tend to want to throw code over the wall without a lot of testing. They are under a lot of pressure to meet deadlines and will take shortcuts. Make sure they don't if you want your systems to be stable. Always have build and test systems and make sure the code is exercised there before it goes live. Your test systems should be just like production.

-Application architecture also has to be designed with HA in mind. The construction of your applications should take into account network failures, hardware failures, database failures, etc. The way you handle errors within the application can also affect availability. More on this in another post.

-Network infrastructure is a key component of most applications. You can have the best architecture in the world, but if your network is down then your application is down as far as the clients are concerned. Make sure you work with the network guys to understand the infrastructure and where the failure points are.
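As one small illustration of handling errors inside the application, here is a minimal Python sketch of retrying a call that fails on a network error, waiting longer between each attempt. The function name `call_with_retry` and its parameters are hypothetical, made up for this example, and it only retries on Python's built-in `ConnectionError`. A real application would need to decide which errors are actually safe to retry.

```python
import time


def call_with_retry(operation, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying on ConnectionError with exponential backoff.

    All names here are hypothetical. `sleep` is a parameter only so the
    waiting can be swapped out when testing.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of retries: let the caller decide what to do
            # Wait base_delay, then 2x, then 4x, and so on between attempts.
            sleep(base_delay * 2 ** attempt)
```

Retrying hides a short blip from the client. It does nothing for a long outage, which is why the last failure is raised again instead of being swallowed.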

These are just a few tips I've learned over the years. Once you have a good handle on your what-if scenarios and how you recover from them, you will be able to give your clients a good idea of availability and potential downtime. Five nines (99.999% availability) is very difficult to achieve, and almost no one does it. Make sure you are honest with the client and set realistic expectations. And remember, just because your application has not failed does not mean it won't. It just means you have been lucky. Make sure you plan for when the luck runs out.
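To put numbers on those expectations, here is a rough Python sketch that turns an availability percentage into the downtime it allows per year. The function name is made up for this example, and it assumes a 365-day year with every outage, planned or not, counted against the target.

```python
# Downtime allowed per year at a given availability level.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability_percent):
    """Return the minutes of downtime per year that an availability target permits."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = downtime_minutes_per_year(target)
    print(f"{target}% -> {minutes:.1f} minutes/year")
```

Under those assumptions, five nines leaves a little over five minutes of downtime for the whole year, which is less time than it takes to reboot a lot of servers.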
