Introduction

Businesses spend heavily on firefighting, and it is crucial to resolve issues as fast as possible because they directly affect productivity. ITIL is a framework of best practices for service support and delivery. Problem Management is the ITIL process that aims to prevent incidents from occurring. Businesses often confuse it with Incident Management because the two look similar, and many organizations have no Problem Management process at all. Incident Management resolves issues as soon as possible and restores normal service, whereas the primary goal of Problem Management is to provide a permanent resolution and, with the help of the Change Management process, prevent those incidents from recurring.

It is essential to understand the differences between the two before implementing either process. Problem Management helps businesses reduce cost by identifying and preventing critical incidents, which means fewer service interruptions and less lost productivity. Businesses striving for service excellence must deliver seamless support and excellent service to their users. Problem Management is part of the ITIL service operation lifecycle stage. It is closely aligned with other ITIL modules such as Change Management and Release Management, which plan and deploy the permanent fix for a recurring incident. Most organizations do not appreciate the importance of Problem Management when they implement ITIL, so it is important to understand the business value and benefits of this process.

In this Problem Management guide, let us look in detail at the objectives, scope, process flow, techniques, benefits, feature checklist and KPIs associated with the Problem Management process, along with suitable examples.

What is ITIL Problem Management

Problem Management is an IT Service Management (ITSM) process that prevents problems and incidents from occurring and resolves known problems with a permanent solution. Recurring incidents give rise to a problem. The objective of Problem Management is to diagnose the root cause of repeated incidents, so Root Cause Analysis (RCA) is an important step in the process. Incident Management aims to restore services as fast as possible; if the same incident recurs frequently and has a high impact, it is handed over to the Problem Management team to analyse the root cause and find a solution. Problem Management then provides either a workaround or a permanent solution.

Problem Management uses a common database to track problems. It starts with problem diagnosis and tries to provide a workaround or a permanent solution. A known error database (KEDB) is maintained for open problems. The KEDB tracks known issues, and resolving them often involves changes to Configuration Items (CIs). Problem Management and Configuration Management therefore share CI-related details. Whenever a problem is reported, it is vital to check the CIs involved and update them if needed in order to resolve the issue permanently. Consistent information across these modules is important for faster resolution of incidents and problems and for timely deployment. To remain competitive, businesses need speed to market and agility.

Objectives

Uninterrupted service is a dream come true for any service desk. In reality, issues do arise, and it is the responsibility of the service desk to mitigate the impact and respond as fast as possible. At the same time, end users' expectations have increased and they demand easily accessible service desk touchpoints. The primary objective of Problem Management is to identify and troubleshoot repeating incidents by finding their root cause. It aims to proactively prevent problems from occurring and to find a workaround or a permanent solution for those that do. By being proactive, Problem Management reduces the number of incidents and the long-term cost of firefighting and service downtime. End user satisfaction improves as a result, and the organization realizes real business and customer value.

Definitions

Problem Management in the ITIL service lifecycle

Problem Management belongs to the ITIL service operation stage and interacts with a number of other processes in the ITIL service lifecycle. Within service operation, it interacts closely with Incident Management to address repeated incidents and prevent major incidents from occurring. In service design, problem history is a crucial input to Availability Management. Knowledge Management, which belongs to service transition, records known errors and their workarounds as knowledge base articles. While performing RCA, Problem Management consults Knowledge Management to look for a solution that may already be available. Finally, proactive Problem Management feeds Continual Service Improvement to improve service quality.

Problem Management is relevant at every stage of the ITIL service lifecycle, so it is a costly mistake to ignore it while setting up ITIL processes in your organization. When choosing a service desk solution, ensure that it supports all the features needed to perform Problem Management.

Problem Management Process flow

ITIL Problem Management follows a predefined sequence of steps to identify, diagnose and resolve problems. Following this process flow helps organizations carry out Problem Management correctly without confusing it with Incident Management. The process flow covers the following steps.

Problem detection

The first step is to detect the problem, and this can be done in a variety of ways. The Tier I team escalates incidents that it is unable to resolve. A problem can also be identified by reviewing incident reports. When one or more incidents occur with an unknown cause, a problem record is created. In some cases, a reported incident is clearly associated with a known problem; if a problem record does not yet exist, create one and link the related incidents. Detecting a problem at the right time saves resources and makes diagnosis easier. The symptoms of a problem include

Problem logging

Every detected problem has to be logged in a problem record for tracking purposes. It is vital to capture details such as problem type, description, associated incidents, affected CIs from the CMDB, category, user information, status, resolution and closure. This information is needed to tag known errors and manage them in a database. Every problem record has two key attributes: impact and urgency. Impact refers to the number of users and CIs affected by the problem. Urgency refers to how quickly a resolution is needed. Together these two factors determine the Service Level Agreement (SLA), which sets the due-by date for problem resolution. The Problem Management team relies on this information to perform root cause analysis. A service desk ticketing system supports problem logging by capturing all relevant details in a form template, and a complete database makes it easier to generate problem reports.
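The way impact and urgency combine into an SLA due-by date can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three-level scales, the additive priority rule and the SLA durations are invented for the example, and `problem_priority` and `sla_due_date` are hypothetical helper names rather than part of ITIL or any service desk product.

```python
from datetime import datetime, timedelta

# Assumed three-level scales; real service desks define their own.
LEVELS = {"high": 1, "medium": 2, "low": 3}

# Assumed SLA targets in hours per priority (priority 1 is the most severe).
SLA_HOURS = {1: 24, 2: 72, 3: 168, 4: 336, 5: 720}


def problem_priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (highest) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def sla_due_date(logged_at: datetime, impact: str, urgency: str) -> datetime:
    """Return the due-by date implied by the assumed SLA table."""
    priority = problem_priority(impact, urgency)
    return logged_at + timedelta(hours=SLA_HOURS[priority])
```

Keeping the scales and SLA targets in plain lookup tables means the same two functions can be reused if an organization agrees different thresholds.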

Investigation & Diagnosis

Prioritization and categorization of problem records help in selecting which record to investigate. During investigation, stakeholders discuss possible root causes. RCA is carried out using one of the available Problem Management techniques, and the problem is diagnosed once RCA is complete. Investigation involves cross-team collaboration, and diagnosis is performed by the Problem Research team. When investigating a problem record, it is recommended to search the KEDB first to find out whether it is a known problem.

KEDB

After diagnosis, either the problem record is added to the known error database (KEDB) or a permanent solution is delivered and the record is closed. Investigation and diagnosis may produce a workaround that solves the issue temporarily; services are restored with the workaround until a permanent resolution is found. As soon as a workaround is found, it is added to the KEDB. It is important to keep the KEDB up to date. Whenever an incident or problem arises in future, the service desk agent refers to this database first to check for a possible workaround.
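The lookup an agent performs against the KEDB can be sketched as follows. This is a minimal illustration, not how any particular service desk tool works: the `KnownError` fields, the `find_workarounds` helper and the simple word-overlap matching are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class KnownError:
    """One KEDB entry; the field names are illustrative."""
    error_id: str
    symptoms: str
    affected_ci: str
    workaround: str


def find_workarounds(kedb, description, affected_ci):
    """Return KEDB entries for the same CI whose symptoms share words
    with the incident description, best match first."""
    words = set(description.lower().split())
    matches = []
    for entry in kedb:
        if entry.affected_ci != affected_ci:
            continue
        overlap = words & set(entry.symptoms.lower().split())
        if overlap:
            matches.append((len(overlap), entry))
    # Entries sharing the most symptom words come first.
    matches.sort(key=lambda pair: pair[0], reverse=True)
    return [entry for _, entry in matches]
```

An empty result tells the agent that no known error matches, which is the point at which a new problem record would be considered.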

Resolution

Problem resolution involves other ITIL modules such as Change Management and Release Management. To fix the problem permanently, a change has to be raised. Change Management handles the evaluation, planning and execution of changes. The Problem Management team submits a Request for Change (RFC). The Change team evaluates the impact and plans the change, using the appropriate change type: standard, normal or emergency. Release Management is responsible for the actual deployment of approved changes, which involves packaging the change and testing it in a sandbox environment before it is rolled out to production. The resolution provided to the user must be documented, and the problem record is associated with the respective change and release records. Closure can be handled through automation.
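One way to automate closure is to close a problem record only when its resolution is documented and every linked change and release has finished. The sketch below assumes a hypothetical `ProblemRecord` structure and assumed status values ("closed" for changes, "deployed" for releases); real tools will use their own fields and workflow states.

```python
from dataclasses import dataclass, field


@dataclass
class ProblemRecord:
    """Illustrative problem record; the fields are assumptions."""
    problem_id: str
    resolution_notes: str = ""
    change_statuses: list = field(default_factory=list)   # e.g. ["closed"]
    release_statuses: list = field(default_factory=list)  # e.g. ["deployed"]
    status: str = "open"


def auto_close(problem: ProblemRecord) -> str:
    """Close the problem only if the fix is documented and fully deployed."""
    documented = bool(problem.resolution_notes.strip())
    changes_done = bool(problem.change_statuses) and all(
        s == "closed" for s in problem.change_statuses
    )
    releases_done = bool(problem.release_statuses) and all(
        s == "deployed" for s in problem.release_statuses
    )
    if documented and changes_done and releases_done:
        problem.status = "closed"
    return problem.status
```

Requiring at least one linked change and release keeps a record from being closed before any fix has actually been rolled out.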

Problem Management techniques

There are different Problem Management techniques available. Let us discuss some of the popular techniques that can be implemented easily.

Brainstorming

Brainstorming means discussing the problem statement and possible causes with key stakeholders. It is a group discussion that encourages participation from everyone in the room.

Kepner Tregoe Problem analysis

Kepner Tregoe (KT) is a logical approach to problem-solving that starts with problem definition and elaboration. Possible causes are vetted, then tested, and finally the true cause is identified. It is a systematic four-phase Root Cause Analysis (RCA) method for complex problems. KT is applicable to both proactive and reactive Problem Management: it involves problem analysis as well as potential problem analysis.

Possible Causes        Evidence              Result
Memory issue           Memory leakage        Cause
Server speed issue     Log files             Cause
Data retrieval issue   Configuration issue   Not a cause

Cause and effect analysis

Cause and effect analysis describes the relationships between a problem and its possible causes. The method is also known as the Ishikawa or fishbone diagram, and it analyses the primary and secondary causes of a problem. Causes fall into categories such as people, product, process and partners. For example, a network outage might have causes such as router malfunction, configuration error or natural disaster. This method is used for reactive Problem Management, and it depends on defining the problem statement precisely.

5 Whys

The 5 Whys strategy is a simple technique for finding the root cause by asking successive “why” questions. It is one of the Six Sigma techniques for identifying the actual root cause of a problem and taking appropriate countermeasures to prevent it from recurring. It helps reveal the relationships between various root causes. However, it is important to frame the questions properly in order to arrive at the actual root cause. Asking “why” five times is just a rule of thumb; the number varies with the complexity of the problem.

Proactive vs Reactive Problem Management

Reactive Problem Management

Reactive Problem Management reacts to recurring incidents by analysing the root cause and providing a long-term fix. It is crucial to identify these repeating incidents as problems. Incident Management aims to restore services as fast as possible and therefore often misses the underlying cause of incidents. The Incident Management team transfers such incidents to the Problem Management team for detailed research and analysis. This handover is crucial, and its timing matters for maintaining service integrity.

The Incident Management team should pass on information such as incident category, affected CIs, criticality and impact. Reactive Problem Management uses this information to perform a detailed RCA, submit an RFC and update the problem record in the KEDB. It starts with checking incident patterns, which includes reviewing past incidents in the service desk.
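Checking incident patterns can be as simple as grouping past incidents by category and affected CI and flagging the groups that recur. The sketch below assumes each incident is a dictionary with "id", "category" and "ci" keys and uses an arbitrary threshold of three occurrences; both the record shape and the `find_problem_candidates` helper are hypothetical.

```python
from collections import defaultdict


def find_problem_candidates(incidents, min_occurrences=3):
    """Group incidents by (category, CI) and return the groups that recur.

    `incidents` is a list of dicts with assumed keys "id", "category", "ci".
    The threshold of 3 is an arbitrary example value.
    """
    groups = defaultdict(list)
    for incident in incidents:
        groups[(incident["category"], incident["ci"])].append(incident["id"])
    return {key: ids for key, ids in groups.items() if len(ids) >= min_occurrences}
```

Each returned group already lists the incident IDs to link when a problem record is created for it.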

Proactive Problem Management

Proactive Problem Management acts as a gatekeeper, continuously identifying potential issues and avoiding them. It does not wait for incidents to occur; it aims to prevent incidents and problems from occurring in the future. It is a preventive technique that involves big data and trend analysis: patterns are identified in historical incident and problem data so that potential issues can be avoided. This requires analysis of past incident data and major events, asset health checks and situational appraisal. Kepner Tregoe analysis is an example of a Problem Management technique that can be used proactively. Examples of proactive activities include maintenance and periodic audits.
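A very small form of trend analysis compares the latest incident count for each category with that category's historical average. The sketch below is illustrative only: the input shape (counts per period, oldest first), the 1.5 multiplier and the `rising_trends` helper are assumptions, and real trend analysis would normally use richer statistics.

```python
from statistics import mean


def rising_trends(counts_by_category, factor=1.5):
    """Return categories whose latest count exceeds `factor` times the
    mean of their earlier counts.

    `counts_by_category` maps a category to its incident counts per
    period, oldest first. The factor of 1.5 is an arbitrary example.
    """
    flagged = []
    for category, counts in counts_by_category.items():
        if len(counts) < 2:
            continue  # not enough history to compare against
        *earlier, latest = counts
        if latest > factor * mean(earlier):
            flagged.append(category)
    return sorted(flagged)
```

Categories flagged this way are candidates for a proactive health check before they turn into major incidents.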

Interrelationships with other ITIL modules

Incident Management

Problem Management usually starts once Incident Management is completed. A problem record can be created from one or more incidents or on its own. Problem Management analyses recurring incidents and finds their root cause. Incident Management shares information such as the incident description, affected users, affected assets and criticality, and Problem Management uses this information to determine whether the issue is a known error. Therefore, Incident Management acts as a prerequisite to Problem Management in most cases.

Change Management

When a permanent solution requires a change, Problem Management hands over to Change Management to execute it. The RCA from Problem Management is crucial for Change Management to understand the associated risk and urgency, and it simplifies the change evaluation phase. Change Management delivers the permanent fix by rolling out the change and decides the change schedule depending on the problem's impact and criticality. The Change Advisory Board (CAB) involves relevant stakeholders from the Problem Research team to assess the planned change. Known errors or known problems result in a Request for Change (RFC). Relevant problems are associated with the change record for better execution.

Configuration Management

Recurring incidents call for an asset health check to find the cause. While Problem Management owns root cause analysis, it must work closely with the Configuration Management team to understand asset details, the asset owner, interdependencies with other assets, impact and vendor-related information. With these details, the Problem Research team suggests the next step: either execute a change to the Configuration Item (CI) or provide a suitable workaround. The two modules are closely connected, and the problem analysis phase revolves around CIs in order to minimize the impact.

Knowledge Management

Problem Management leverages Knowledge Management by accessing its central repository and solution database. Knowledge base articles are fundamental to trend analysis, and they speed up resolution in both proactive and reactive Problem Management. Relevant knowledge articles are associated with the problem record. The known error database and its workarounds are stored in the knowledge base as well; the KEDB is a subset of the broader Knowledge Management system. Once a permanent solution is found, it is stored in Knowledge Management for future reference.

Problem Management best practices

DOs
DON’Ts

Problem Management Key Performance Indicators (KPIs)

Problem Manager roles and responsibilities

The Problem Manager role does not exist in many organizations, but it is important for companies to recognize the value of this ITIL practice. The Problem Manager acts as a liaison between Incident Management and Change Management.

Feature checklist

  • Create, modify and delete problem records
  • Search problem records
  • Filter problems based on created date, assigned agent, requester, status, priority and category
  • Ability to mark a problem as a known error
  • Create multiple dashboards to store relevant problem records
  • Ability to add a detailed root cause and attach relevant files
  • Placeholder to add impact and symptoms
  • Ability to add a solution - permanent or workaround
  • Integrated knowledge management module within Problem Management solution
  • Ability to add or remove solution articles from Knowledge base within the problem record
  • Ability to add or remove the right Configuration Items (CI) to problem record
  • Ability to assign tasks to other people in the same team or another team
  • Automated email notifications based on events
  • Associate the related incidents for better reference
  • SLA information and due by date visibility
  • Ability to maintain and search in KEDB
  • Ability to associate a change record to this problem record
  • Unique Problem identification number for future reference
  • Ability to export problem records
  • Reporting and analytics based on problem data

Benefits

Having discussed the various aspects of Problem Management, it is necessary to highlight the business benefits of Problem Management.
