Software design and testing are directly related to the availability of an application. We all know that already. No amount of money spent on hardware can overcome a flaw in the application. We sometimes call these flaws undocumented features. They can be very challenging to troubleshoot in a production system.
With integration, these challenges become more intense. Multiple systems wired together, and yes, even loosely coupled systems, can impact each other despite what all of the analysts and software vendors tell you. The question is how to overcome this complexity and turn out reliable systems that have little to no impact on each other. Well, I would love to say I have the answer, but the truth is there is no one answer.
Good sunny-day and rainy-day testing will take you far, as will good design and good documentation. Well-trained operations and application support staff will also contribute. But in the end, the combinations of variables in a large integrated environment, both known and unknown, are just too great to overcome. There are going to be issues.
So what to do? A few pointers from hard-learned experience:
1-Educate your management and your clients; setting expectations for realistic availability is key. For example, deploying an app that you designed while coding it to a single server is not going to give you five nines.
2-Educate your staff. It's great that you know the app, but you are not available 24x7 unless, of course, food, sleep, and vacation are not things you require.
3-Practice recovering your app. Don't know how? When it is hosed in production is not the time to learn.
4-Know what else your app depends on. This is really key in integration work. Change management is a big deal.
5-Did I mention educate your management and your clients?
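To put numbers behind pointers 1 and 4, here is a minimal sketch of the arithmetic. It assumes every dependency must be up for the app to be up (a simple serial chain) and that failures are independent, which real integrated environments rarely honor. The function names and the 99.9% figures are made up for illustration, not measurements from any real system.

```python
# Rough availability arithmetic. All figures here are hypothetical.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_minutes_per_year(availability):
    """Minutes of downtime per year allowed by an availability fraction."""
    return (1.0 - availability) * MINUTES_PER_YEAR


def serial_availability(availabilities):
    """Combined availability when every component must be up at once.

    Assumes independent failures, which is optimistic for wired-together systems.
    """
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    # Five nines leaves very little room for error in a year.
    print(downtime_minutes_per_year(0.99999))

    # An app that depends on a database and a message broker,
    # each assumed (hypothetically) to be 99.9% available.
    chain = serial_availability([0.999, 0.999, 0.999])
    print(chain, downtime_minutes_per_year(chain))
```

Five nines works out to a little over five minutes of downtime a year, while three 99.9% pieces chained together already allow roughly a day. That is the conversation to have with management and clients before the fire, not during it.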
Just a few thoughts for the day as the fires have finally died out. :)