What Is Problem Management?

Problem management is the set of processes and activities responsible for managing the lifecycle of all problems that could occur in an IT service. Its main goal is to prevent problems and their resulting incidents from happening. For incidents that have already occurred, problem management seeks to prevent them from happening again or, if they are unavoidable, to minimize their impact on the business.

To understand problem management, it helps first to define what a problem is. ITIL defines a problem as the cause of one or more incidents. Put another way, a problem is an underlying condition that could have a negative impact on the service and therefore needs to be addressed. Problems have a lifecycle that starts when the problem is created (often by a change in the environment), includes identification and the stages of diagnosis and remediation, and ends when the problem is resolved, either through some action being taken or through the underlying situation going away.
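The lifecycle described above can be pictured as a small state machine. The following is only a minimal sketch: the class names, state names and allowed transitions are assumptions made for illustration and are not taken from ITIL or from any particular service management tool.

```python
from enum import Enum


class ProblemState(Enum):
    CREATED = "created"          # problem exists but has not been noticed yet
    IDENTIFIED = "identified"
    DIAGNOSIS = "diagnosis"
    REMEDIATION = "remediation"
    RESOLVED = "resolved"


# Allowed transitions (an assumption for illustration). Every open state may
# move straight to RESOLVED because the underlying situation can go away.
TRANSITIONS = {
    ProblemState.CREATED: {ProblemState.IDENTIFIED, ProblemState.RESOLVED},
    ProblemState.IDENTIFIED: {ProblemState.DIAGNOSIS, ProblemState.RESOLVED},
    ProblemState.DIAGNOSIS: {ProblemState.REMEDIATION, ProblemState.RESOLVED},
    ProblemState.REMEDIATION: {ProblemState.RESOLVED},
    ProblemState.RESOLVED: set(),
}


class Problem:
    """Hypothetical problem record that tracks state and linked incidents."""

    def __init__(self, summary):
        self.summary = summary
        self.state = ProblemState.CREATED
        self.incident_ids = []

    def link_incident(self, incident_id):
        # A problem is the cause of one or more incidents.
        self.incident_ids.append(incident_id)

    def advance(self, new_state):
        # Reject any transition that is not listed in TRANSITIONS.
        if new_state not in TRANSITIONS[self.state]:
            raise ValueError(
                f"cannot move from {self.state.value} to {new_state.value}"
            )
        self.state = new_state
```

In this sketch a resolved problem cannot be reopened. An organization that reopens problem records would add that transition explicitly.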

Problem management is both a transactional process of managing the lifecycle of an individual problem and a portfolio management process of deciding which problems should be addressed, what resources should be applied to them and what risks they present to the organization. Problem management includes the activities required to diagnose the root cause of incidents and determine the appropriate resolution steps. It is also responsible for ensuring that any resolutions are implemented safely and effectively, in accordance with change management and release management policies and procedures.

The portfolio part of problem management is responsible for maintaining information about problems that exist in the environment, any workarounds that have been developed and the resolution options that have been identified. This information enables leaders to make decisions that will reduce the number and impact of incidents.

 

The Role of a Problem Manager

Problem managers are responsible for managing the lifecycle of problems to ensure that they are clearly understood and that appropriate actions are taken. Their goal is to prevent incidents from happening and to minimize the impact of incidents that cannot be prevented. Problem managers often work with incident management staff and technical resources to ensure that diagnostic data is captured about associated incidents and the environmental conditions related to the problem.

Problem managers are responsible for performing root cause analysis (RCA) to help the organization identify not only why an incident occurred, but also when and how the underlying problem was introduced into the environment. Root cause analysis often identifies several alternative resolutions, and the problem manager plays a key role in qualifying those alternatives by cost, benefit and risk in order to give management decision makers a recommendation.

It is common for resolution actions to take some time to pass through the appropriate change and release management processes. During this time, problem managers are responsible for keeping knowledge management resources and known-error databases up to date so that incident management staff can effectively address any recurring incidents or service requests. Although incident management and problem management are separate processes, problem managers typically use the same tools and similar categorization, impact and priority coding systems as their incident manager counterparts, which fosters effective collaboration between the process areas.

Proactive Problem Management

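Problem management consists of two major processes: reactive problem management, which responds to incidents that have already occurred, and proactive problem management.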

Proactive problem management aims to anticipate future incidents and prevent them from occurring by identifying and eliminating root causes before they can lead to service-impacting incidents. Proactive problem analysis is heavily influenced by data generated through automated monitoring capabilities, analysis of change records and the use of trend analysis. Proactive problem management differs from its reactive counterpart by addressing three key areas:
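As an illustration of the trend analysis mentioned above, the sketch below counts recent incidents per category and flags categories that recur often enough to deserve a problem investigation. The function name, the 30-day window and the threshold of three incidents are all assumptions chosen for the example, not values prescribed by ITIL.

```python
from collections import Counter
from datetime import date, timedelta


def recurring_categories(incidents, today, window_days=30, threshold=3):
    """Return (category, count) pairs seen at least `threshold` times recently.

    `incidents` is an iterable of (opened_on, category) pairs, where
    `opened_on` is a datetime.date. Window and threshold are illustrative.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        category for opened_on, category in incidents if opened_on >= cutoff
    )
    # Sort by count (descending), then by name, so the output is deterministic.
    return sorted(
        ((category, n) for category, n in counts.items() if n >= threshold),
        key=lambda item: (-item[1], item[0]),
    )


sample = [
    (date(2024, 3, 10), "email"),
    (date(2024, 3, 15), "email"),
    (date(2024, 3, 20), "email"),
    (date(2024, 3, 12), "vpn"),
    (date(2024, 1, 5), "email"),  # outside the 30-day window
]
candidates = recurring_categories(sample, today=date(2024, 3, 31))
```

With the sample data, only the "email" category reaches the threshold inside the window, so it would be the single candidate for a proactive problem record.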


Workarounds

Problem management can be a time-consuming process, and while it is going on, the IT organization still needs to provide services to users. One of the most common techniques that service management staff use to do this is the workaround. Workarounds are temporary solutions aimed at reducing or eliminating the impact of known issues and problems for which a full resolution is not yet available. This may be because the underlying causes cannot be readily identified, resolution steps have not been developed or the organization has not yet implemented a permanent resolution.

Workarounds do not correct the root cause of a problem; they simply address its symptoms and impacts. Common examples of workarounds include rebooting servers, clearing application caches or using an alternative process or system to complete the business activity. Workarounds may be executed by incident management staff or by end users, and they could be in use for any length of time (from seconds to years).

Most organizations document workarounds as a part of their knowledge management system, linking to records in the known error database. These records may be presented to users in the form of FAQs or they may be visible only to service management staff in the form of diagnostic instructions. It is important to keep in mind that workarounds follow the same lifecycle as the underlying problem and as such, when the problem is resolved, the workaround should be retired to avoid creating confusion.
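The link between a known error and its workarounds, and the rule that workarounds are retired once the problem is resolved, can be sketched as follows. The class names, fields and audience values here are hypothetical and do not reflect the schema of any real known error database.

```python
from dataclasses import dataclass, field


@dataclass
class Workaround:
    instructions: str
    audience: str = "staff"  # "staff" or "end_user" (hypothetical values)
    retired: bool = False


@dataclass
class KnownError:
    problem_id: str
    description: str
    workarounds: list = field(default_factory=list)
    resolved: bool = False

    def resolve(self):
        """Mark the problem resolved and retire every linked workaround."""
        self.resolved = True
        for workaround in self.workarounds:
            workaround.retired = True

    def active_workarounds(self, audience=None):
        """Return workarounds still in use, optionally filtered by audience."""
        return [
            w
            for w in self.workarounds
            if not w.retired and (audience is None or w.audience == audience)
        ]


record = KnownError("PRB-001", "Report page times out under load")
record.workarounds.append(Workaround("Clear the application cache"))
record.workarounds.append(Workaround("Use the export feature instead", "end_user"))
```

Calling `record.resolve()` retires both workarounds at once, so neither a staff-only instruction nor a user-facing FAQ entry outlives the problem it was written for.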

ITIL Problem Management Processes

ITIL defines problem management as part of Service Operation, with strong relationships to incident management, change management and continual service improvement. ITIL v3 breaks problem management down into the following sub-processes:

Incident vs. Problem Management

The difference between incident and problem management is one of the biggest sources of confusion in ITIL and service management processes. While these processes are closely related, they are intentionally different and separate. Incident management is tasked with responding to an event that has occurred, minimizing the impact on the business and restoring service as quickly as possible. Problem management is tasked with understanding the root cause of the event and how to prevent it from happening again.

We have already discussed the processes involved in problem management. In contrast, the main activities in incident management are:

It might take multiple incidents before problem management has enough data to analyze what is going wrong and work out what steps can be taken to correct the situation. As a result, communication and coordination between incident managers and problem managers are essential.

Problem management is an essential part of your IT service management function. It takes the knowledge gained through monitoring, incident management and other parts of service operations and feeds it into the continuous service improvement processes that help make the services you provide to users more robust and dependable.
