Why Problems Are Good

By Bob Aiello - September 18, 2015

Many large enterprise technology systems have suffered incidents that had a significant impact on customers as well as on the firm itself. Serious outages can be devastating for the firm, and few technology managers relish the thought of the next large-scale glitch. Yet experienced IT professionals know that learning from our mistakes is good, and so, too, is harnessing the lessons learned from a serious incident or problem.

However, all too often, IT operations professionals get caught by surprise when problems occur, and developers may not be available when their knowledge and experience could be of the most value. QA, testing, and the business are also frequently left out of managing problems, which typically differ from incidents in that they require root-cause analysis.

The ITIL v3 framework used by many IT operations organizations provides valuable guidance on the processes, procedures, and industry best practices for responding to problems when they do occur. Organizations that embrace ITIL have a robust approach to capturing events, ranging from disk space running low to runtime processes terminating unexpectedly. These incidents also need to be analyzed, as they may indicate an underlying problem that could become more serious and have widespread impact.

Many companies have a Critical Incident Response Team (CIRT) that bears responsibility for organizing the response to incidents, whether they occur in a cluster or are isolated as one-time-only occurrences. But sometimes serious incidents require expert analysis to determine the root cause. This is where the problem management function usually helps to handle the situation.

Recently, a colleague of mine noted that while ITIL v3 prescribes analyzing events as aggregates and many companies manage to identify related incidents, few organizations do a good job of analyzing trends among problems that, in fact, may be related. I quickly broke out my well-loved texts on ITIL v3 to find some processes to handle this requirement, but sadly, I had to admit that I could not find a well-defined process for evaluating problems as related entities.

Problem management is usually focused on getting the right stakeholders to participate in a call or online meeting to evaluate what has occurred, with the goal of understanding the underlying root cause of the problem itself. With this knowledge, software bugs can be identified or operational procedures adjusted to avoid customer impact should a problem recur. But truthfully, many organizations have anemic problem response functions that do little more than provide a short-term response.

What is really needed is a strong DevOps approach to ensuring that all the experts are involved and trends among problems are understood, so that strategies are identified to address problems and prevent them from recurring. Involvement of the business end-user is essential in problem analysis to understand customer impact and the risk of taking action (or failing to take an action).
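As one possible starting point for spotting trends among problems, the sketch below groups problem records that share an affected component and a root-cause category. It is only an illustration under stated assumptions: the `ProblemRecord` fields, the `find_problem_clusters` helper, and the sample records are all hypothetical, and a real problem management tool would likely expose different attributes.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ProblemRecord:
    # All fields are hypothetical; adapt them to what your tooling records.
    problem_id: str
    opened: date
    component: str   # system or service implicated in the problem
    root_cause: str  # category assigned during root-cause analysis


def find_problem_clusters(records, min_size=2):
    """Group problems that share a component and root-cause category.

    Returns only the groups with at least `min_size` members, since a
    single problem on its own does not suggest a trend.
    """
    groups = defaultdict(list)
    for record in records:
        groups[(record.component, record.root_cause)].append(record)
    return {
        key: sorted(members, key=lambda r: r.opened)  # oldest first
        for key, members in groups.items()
        if len(members) >= min_size
    }


# Made-up sample data, for illustration only.
records = [
    ProblemRecord("PRB-1", date(2015, 6, 2), "billing-db", "disk space"),
    ProblemRecord("PRB-2", date(2015, 7, 14), "web-frontend", "bad deploy"),
    ProblemRecord("PRB-3", date(2015, 8, 30), "billing-db", "disk space"),
]

clusters = find_problem_clusters(records)
for (component, cause), members in clusters.items():
    ids = ", ".join(m.problem_id for m in members)
    print(f"{component} / {cause}: {ids}")
```

Grouping on exact field matches is deliberately crude. It depends on root causes being categorized consistently, which is itself a reason to involve developers, QA, and the business in the analysis.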

We are still searching for a good way to analyze clusters of problems. What are your ideas and suggestions?
