Problem management is the set of processes and activities responsible for managing the lifecycle of all problems that could occur in an IT service. Its main goal is to prevent problems and their resulting incidents from happening. For incidents that have already occurred, problem management seeks to prevent them from happening again or, if they are unavoidable, to minimize their impact on the business.
To understand problem management, it helps first to define what a problem is. ITIL defines a problem as the cause of one or more incidents. Put another way, a problem is an underlying condition that could have a negative impact on the service and therefore needs to be addressed. Problems have a lifecycle that starts when the problem is created (often by a change in the environment), includes identification and the stages of diagnosis and remediation, and ends when the problem is resolved, either through some action being taken or through the underlying situation going away.
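The lifecycle described above can be sketched as a small state model. This is a minimal illustration, not an ITIL-defined schema: the `Stage` names and the transition rules in `can_advance` are assumptions made for the example, including the rule that a problem may be resolved from any earlier stage because the underlying situation can simply go away.

```python
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names taken from the lifecycle described above."""
    CREATED = 1
    IDENTIFIED = 2
    DIAGNOSIS = 3
    REMEDIATION = 4
    RESOLVED = 5


def can_advance(current, target):
    """Return True if moving from `current` to `target` is allowed.

    Assumed rules: a problem moves forward one stage at a time, or jumps
    straight to RESOLVED from any earlier stage (the underlying situation
    may go away without any action being taken).
    """
    if current is Stage.RESOLVED:
        return False  # a resolved problem has reached the end of its lifecycle
    if target is Stage.RESOLVED:
        return True
    return target.value == current.value + 1


print(can_advance(Stage.CREATED, Stage.IDENTIFIED))  # True
print(can_advance(Stage.CREATED, Stage.DIAGNOSIS))   # False
```

A real problem management tool would attach these stages to a problem record and log each transition; the sketch only shows the ordering.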
Problem management is both a transactional process of managing the lifecycle of an individual problem and a portfolio management process of deciding which problems should be addressed, what resources should be applied to them and what risks they present to the organization. Problem management includes the activities required to diagnose the root cause of incidents and determine the appropriate resolution steps. It is also responsible for ensuring that resolutions are implemented safely and effectively in accordance with change management and release management policies and procedures.
The portfolio part of problem management is responsible for maintaining information about problems that exist in the environment, any workarounds that have been developed and the resolution options that have been identified. This information enables leaders to make decisions that will reduce the number and impact of incidents.
Problem managers are responsible for managing the lifecycle of problems to ensure that they are clearly understood and that appropriate actions are taken. Their goal is to prevent incidents from happening and to minimize the impact of incidents that cannot be prevented. Problem managers often work with incident management staff and technical resources to ensure that diagnostic data is captured about associated incidents and the environmental conditions related to the problem. Problem managers are responsible for performing root cause analysis (RCA) to help the organization identify not only why an incident occurred, but also when and how the underlying problem was introduced into the environment. Root cause analysis often identifies several alternative resolutions, and the problem manager plays a key role in evaluating those alternatives by cost, benefit and risk in order to make a recommendation to management decision makers.
It is common for resolution actions to take some time to move through the appropriate change and release management processes. During this time, problem managers are responsible for ensuring that knowledge management resources and known-error databases are kept up to date so that incident management staff can effectively address any recurring incidents or service requests. Although incident management and problem management are separate processes, problem managers typically use the same tools and similar categorization, impact and priority coding systems as their incident manager counterparts to foster effective collaboration between the process areas.
Problem management consists of two major processes:
Reactive problem management - executed as part of service operation, it focuses on following up on incidents that have already occurred.
Proactive problem management - initiated in service operation but generally considered part of continual service improvement, it focuses on identifying problems from environmental signals and preventing incidents from occurring at all.
Proactive problem management aims to anticipate future incidents and stop them from occurring by identifying and eliminating root causes before they lead to service-impacting incidents. Proactive problem analysis is heavily influenced by data generated through automated monitoring, analysis of change records and trend analysis. Proactive problem management differs from its reactive counterpart by addressing the following areas: proactive detection, problem prevention, preemptive action and fault diagnosis.
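The trend analysis that feeds proactive problem management can be illustrated with a short sketch. It assumes a hypothetical `Incident` record with `service` and `category` fields and a made-up recurrence threshold; real tools use richer data and more careful statistics. The idea is only that recurring incident patterns are flagged as candidates for a problem record.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    incident_id: str
    service: str
    category: str


def candidate_problems(incidents, threshold=3):
    """Return (service, category) pairs that recur at least `threshold` times.

    Recurring pairs are only candidates for a problem record; a problem
    manager still has to confirm that they share an underlying cause.
    """
    counts = Counter((i.service, i.category) for i in incidents)
    return sorted(
        (key for key, n in counts.items() if n >= threshold),
        key=lambda key: (-counts[key], key),  # most frequent first
    )


incidents = [
    Incident("INC-1", "email", "timeout"),
    Incident("INC-2", "email", "timeout"),
    Incident("INC-3", "vpn", "auth-failure"),
    Incident("INC-4", "email", "timeout"),
    Incident("INC-5", "vpn", "auth-failure"),
]

print(candidate_problems(incidents))  # [('email', 'timeout')]
```

The threshold of three is arbitrary; in practice it would be tuned per service and combined with monitoring data and change records.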
Problem management can be a time-consuming process, and while it is under way the IT organization still needs to provide services to users. One of the most common techniques that service management staff use to do this is the workaround. Workarounds are temporary solutions aimed at reducing or eliminating the impact of known issues and problems for which a full resolution is not yet available. This may be because the underlying causes cannot be readily identified, resolution steps have not been developed or the organization has not yet implemented a permanent resolution.
Workarounds do not correct the root cause of a problem; they simply address its symptoms and impacts. Common examples of workarounds include rebooting servers, clearing application caches or using an alternative process or system to complete the business activity. Workarounds may be executed by incident management staff or by end users, and they may remain in use for any length of time, from seconds to years.
Most organizations document workarounds as a part of their knowledge management system, linking to records in the known error database. These records may be presented to users in the form of FAQs or they may be visible only to service management staff in the form of diagnostic instructions. It is important to keep in mind that workarounds follow the same lifecycle as the underlying problem and as such, when the problem is resolved, the workaround should be retired to avoid creating confusion.
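The link between a workaround and its underlying problem can be sketched as follows. This is a minimal illustration under assumed names (`ProblemRecord`, `Workaround`, `resolve`), not the data model of any particular service management tool; it shows only that resolving the problem retires every workaround attached to it.

```python
from dataclasses import dataclass, field


@dataclass
class Workaround:
    """Hypothetical workaround entry linked to a problem record."""
    summary: str
    active: bool = True


@dataclass
class ProblemRecord:
    """Hypothetical problem record holding its documented workarounds."""
    problem_id: str
    resolved: bool = False
    workarounds: list = field(default_factory=list)

    def add_workaround(self, summary):
        """Document a temporary workaround for this problem."""
        workaround = Workaround(summary)
        self.workarounds.append(workaround)
        return workaround

    def resolve(self):
        """Mark the problem resolved and retire every linked workaround."""
        self.resolved = True
        for workaround in self.workarounds:
            workaround.active = False


problem = ProblemRecord("PRB-1")
problem.add_workaround("Clear the application cache")
problem.resolve()
print([w.active for w in problem.workarounds])  # [False]
```

Retiring workarounds in the same step as resolution is one way to keep stale instructions out of FAQs and diagnostic guides.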
ITIL defines problem management as a part of service operation with strong relationships to incident management, change management and continual service improvement. ITIL v3 breaks problem management down into the following sub-processes:
Proactive problem identification - improve overall availability of services by proactively identifying problems so they can be solved, or workarounds identified before future incidents occur.
Problem diagnosis and resolution - identify the underlying root cause of a problem and initiate the most appropriate solution.
Problem and error control - monitor outstanding problems and their processing status so that corrective measures can be introduced where necessary.
Problem closure and evaluation - ensure that after a problem is solved, the problem record contains a full historical description and that known-error and knowledge records are updated.
Major problem review - review the resolution of a major problem to ensure problem situations have been fully eliminated, capture lessons learned and identify preventative actions (such as process changes) that should be undertaken to avoid problem recurrence.
Problem management reporting - ensure that other service management processes and IT management are informed of outstanding problems, their status and existing workarounds.
The difference between incident and problem management is one of the biggest sources of confusion in ITIL and service management processes. While these processes are very closely related, they are intentionally different and separate. Incident management is tasked with responding to an event that has occurred, minimizing impact to the business and restoring service as quickly as possible. Problem management is tasked with understanding the root cause of why the event occurred and how to prevent it from happening in the future.
We’ve already discussed the processes involved in problem management. In contrast, the main activities in incident management include:
Escalation, as necessary
Communication with the user community throughout the life of the incident
It might take multiple incidents before problem management has enough data to analyze what is going wrong and determine what steps can be taken to correct the situation. As a result, communication and coordination between incident managers and problem managers are essential.
Problem management is an essential part of your IT service management function. It takes the knowledge gained through monitoring, incident management and other parts of service operation and feeds it into the continual service improvement processes that help make the services you provide to users more robust and dependable.