Diagnosis
It happens to every piece of software. No matter how far-sighted the developers, no matter how heavily over-built the code, every application gets sick once in a while. Functionality that end users were enjoying suddenly stops working. Someone added a new module that turned out to be less thoroughly tested than everyone thought; now it's not working or, worse, it's causing strange errors in a supposedly unrelated portion of the application. If the development team that built the software did not follow the kinds of best practices we've been discussing with regard to the FSDLC, that kind of thing happens far more often than it should and eats up huge amounts of time fixing things that were supposed to be working automatically. There are even places that prioritize hiding the symptoms from customers over diagnosing the problem and stopping the flood of problems. Only once you stop the application from generating more work than you can handle can you begin to have time in your day to clean up the mess.
The first step in the process is to sit down with the testing site and see if you can duplicate the error. If you can't, there may be a screw loose in the production version, and the thing to do is wipe the production site and reload it. That may be all that's necessary to correct the problem. In one of the worst cases, you can't reproduce the error in testing, but you've reloaded the production site a time or two and the error is still there every time. That means the error is due to a difference between the testing and production sites. On one level, if your architecture is well designed, this leaves only a few places to look. However, you have two strikes against you. One, you're going to be testing on the production site (always a horrible idea), and two, these errors have a tendency to be really nasty.
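When the error only shows up in production, the hunt comes down to listing what differs between the two sites. Here is a minimal sketch of that comparison, assuming each site's settings can be loaded into a dictionary. The setting names and values are made up for illustration; they are not from any real application.

```python
def diff_settings(testing, production):
    """Return {key: (testing_value, production_value)} for every key that differs."""
    missing = object()  # sentinel so a key absent on one side still shows up
    differences = {}
    for key in sorted(set(testing) | set(production)):
        t_val = testing.get(key, missing)
        p_val = production.get(key, missing)
        if t_val != p_val:
            differences[key] = (
                "<absent>" if t_val is missing else t_val,
                "<absent>" if p_val is missing else p_val,
            )
    return differences


# Hypothetical settings; in practice these would be loaded from each site.
testing_site = {"db_driver": "2.1", "cache": "off", "locale": "en_US"}
production_site = {"db_driver": "2.0", "cache": "off", "timeout": 30}

for key, (t_val, p_val) in diff_settings(testing_site, production_site).items():
    print(f"{key}: testing={t_val!r} production={p_val!r}")
```

Each line of output is one of the "few places to look." Settings that match on both sites drop out of the list entirely.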
More typically, you can reproduce the error on the testing server, and you immediately sit down to see if you can simply roll back to a version that works. In almost all cases this is possible, and once you identify your fallback position, you can fix the production site by rolling it back to that working version. Test the rolled-back site to make sure everything is OK; you have now at least bought yourself some time to fix the problem.
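Identifying the fallback position can be as simple as walking back through your releases on the testing server until the error disappears. The sketch below assumes you have an ordered list of releases and some way to check one for the error. The `reproduces_error` callable is a hypothetical stand-in for "deploy this release to testing and try to trigger the problem."

```python
def find_fallback(releases, reproduces_error):
    """Return the newest release that does not show the error, or None.

    releases: release identifiers ordered oldest to newest.
    reproduces_error: callable that returns True if the error still
    appears in the given release (hypothetical; supplied by the caller).
    """
    for release in reversed(releases):
        if not reproduces_error(release):
            return release
    return None


# Stand-in check: pretend the error was introduced in release "1.4".
releases = ["1.1", "1.2", "1.3", "1.4", "1.5"]
broken = {"1.4", "1.5"}
fallback = find_fallback(releases, lambda release: release in broken)
print(fallback)  # prints 1.3
```

A return value of None means no release in the list is clean, and rolling back will not buy you any time.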
Obviously, halting development to embrace an old version can only be a very temporary solution. Diagnosis is the first step in the long-term solution to your problem. You don't want to panic, but anyone who can be reasonably helpful in this process should be yanked from whatever they are working on until the problem is solved. You don't wind up with a strong working code base by accident; you build a strong code base by planning it well, implementing it professionally, and guarding it... gee, as if your life depended on it. Therefore, you need to diagnose the problem. There are two key things to do when making a diagnosis.
First, use your error handling to spot errors and figure out what's going on. Sometimes errors are misleading, but certainly the first step is to read your errors. See what they tell you and experiment with resolving the problem assuming that the error is correct, and if that fails, take a stab or two in that general vicinity. You will note that the success rate of this technique is directly proportional to the value of the error handling which has been built into the application. What seemed like a huge waste of time in construction is now the only thing that can save your life.
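To make "the value of the error handling" concrete, here is a small sketch of an error that says which step failed and with what input, while keeping the original low-level error attached. The `OrderError` class and `parse_quantity` function are invented for illustration.

```python
class OrderError(Exception):
    """Application-level error that records which step failed and with what input."""


def parse_quantity(raw):
    try:
        quantity = int(raw)
    except ValueError as exc:
        # Chain the original exception so the low-level cause is not lost.
        raise OrderError(f"parse_quantity: cannot read quantity from {raw!r}") from exc
    if quantity <= 0:
        raise OrderError(f"parse_quantity: quantity must be positive, got {quantity}")
    return quantity


try:
    parse_quantity("twelve")
except OrderError as err:
    print(err)                  # names the step and the offending input
    print(repr(err.__cause__))  # the original ValueError is still attached
```

An error like this gives you a place to start. A bare "invalid literal" message with no context is the kind that sends you off taking stabs in the general vicinity.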
Second, testing is not a random or gut-feeling process, even if that's how you pursue construction. If you don't understand the results that are coming back from a test, you have wasted all the time it took and learned nothing. As I said, if my error message seems off base, I will take a stab or two to see if it at least leads me to the answer indirectly. However, my patience with the error message is now very short. The stabs I take at this point are deep tests: white box tests that check specifically implemented functionality, and they are usually the kind that finally break the problem open. However, because they're so specific, you may have to run a handful just to check one function, and that means it takes a lot of time to locate a difficult problem. If I feel like I'm starting to wander around in the dark, it's time to take a new tack.
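Here is a rough illustration of why deep tests add up quickly. The `shipping_cost` function is hypothetical and has only three internal branches, yet it takes four tests to cover it, each written with knowledge of the code inside.

```python
import unittest


def shipping_cost(weight_kg, express=False):
    """Hypothetical function under test, with three internal branches."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    cost = 5.0 if weight_kg <= 2 else 5.0 + 1.5 * (weight_kg - 2)
    return cost * 2 if express else cost


class ShippingCostWhiteBoxTests(unittest.TestCase):
    # Each test targets one branch that we know exists from reading the code.
    def test_rejects_non_positive_weight(self):
        with self.assertRaises(ValueError):
            shipping_cost(0)

    def test_flat_rate_at_boundary(self):
        self.assertEqual(shipping_cost(2), 5.0)

    def test_per_kg_surcharge_above_boundary(self):
        self.assertEqual(shipping_cost(4), 8.0)

    def test_express_doubles_cost(self):
        self.assertEqual(shipping_cost(4, express=True), 16.0)


# Run the suite in place without exiting the interpreter.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShippingCostWhiteBoxTests)
result = unittest.TextTestRunner(verbosity=2).run(suite)
```

Four tests for one small function is the "handful" in question. Multiply that by every function you suspect and you can see where the time goes.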
Assuming neither of the first two approaches worked, it's time to test systematically, which implies a pattern or method. I now switch from deep tests to shallow, black box tests. The disadvantage is that a shallow test can't solve the problem by itself; it can only tell me where the problem lives. The advantage is that I can test virtually any size portion of the application that I want, which means I can check large sections of the application relatively quickly. Start by taking your best guess as to which application path is creating the problem (if you're wrong, you have to make your next best guess and start over). Now, take a look at the steps in that application path. Again, shallow, black box tests don't test the inside of the section being examined, but all I want to do is locate where the problem is. So step by step I hardcode input to the steps and check the output; usually I start with the final step to see if it produces the output I expect. If it does, then I back up a step, and so on; this allows you to use the application as your testing harness. This lets me identify the section of the program where the problem is, and I can now dive a little bit deeper and run slightly deeper tests on the steps in that portion of the program. This process allows me to create a window around the problem code and narrow it level by level until I can nail it down and then solve the problem.
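The backward walk described above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the application path is a straight chain of functions, and you have a trusted, hardcoded input for each step. The `parse`, `total`, and `report` steps are invented, and `total` carries a deliberate bug.

```python
def locate_faulty_step(steps, known_good):
    """Walk an application path backward to find the step that breaks it.

    steps: the functions on the path, in execution order.
    known_good: known_good[i] is a hardcoded, trusted input for steps[i];
    known_good[-1] is the output the whole path should produce.
    Returns the index of the step whose inclusion first makes the output
    wrong (working from the end), or None if every tail is correct.
    """
    expected = known_good[-1]
    for start in range(len(steps) - 1, -1, -1):
        value = known_good[start]
        # The rest of the application path acts as the testing harness.
        for step in steps[start:]:
            value = step(value)
        if value != expected:
            return start
    return None


def parse(text):
    return [int(part) for part in text.split(",")]


def total(numbers):
    return sum(numbers) - 1  # deliberate bug for the demonstration


def report(amount):
    return f"Total: {amount}"


steps = [parse, total, report]
known_good = ["1,2,3", [1, 2, 3], 6, "Total: 6"]
print(steps[locate_faulty_step(steps, known_good)].__name__)  # prints total
```

Because every later step has already passed with trusted input, the first failing tail points at the step just added. That step is the window to open up with deeper tests.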
As you can see, depending on what the problem is, diagnosis can be pretty involved. However, using these techniques, you have a reasonable chance of diagnosing what's going on.