An effective incident management process is an indispensable part of every enterprise business. As technology and workflows become more complex and integrated, systems become increasingly susceptible to unplanned downtime, which can affect business operations both internally and externally.

What is an Incident?

An incident is an unexpected interruption to a service. When an activity fails and causes the system to behave in an unplanned fashion, it is termed an incident. A single problem can result in more than one incident, each of which is to be resolved as soon as possible.

An incident disrupts normal operation and so affects the productivity of the end user. An incident may be due to a network failure or an asset that is not operating correctly. Examples include issues with wifi connectivity, printers, server crashes, misconfigured systems, application issues, email service issues, laptop crashes, user authentication errors, and file sharing issues.

With such potential impact, enterprises are rapidly evolving their incident response practices to ensure that incidents can be dealt with as quickly and successfully as possible. This requires taking a holistic approach to an incident, understanding how it progresses, and learning how to continuously improve the resilience of systems. From an academic standpoint, there are several opinions on how many stages make up a typical incident response workflow.

What is ITIL Incident Management?

The IT Service Desk is a single point of contact between IT teams and end users. Organizations implement ITIL to deliver efficient services and enhance productivity. ITIL service operation includes incident management practices whose most important objective is to ensure smooth business operations with minimal or no downtime. A competent incident management process reduces the communication gap between IT teams and end users. The ITIL incident management process consists of a set of best practices to actively handle and resolve incidents. These best practices also help distinguish between incidents, problems, and service requests.

Service Request

A formal request from a user for something to be provided, or a request for information or advice, is termed a Service Request. These requests are often pre-approved standard changes requested by end users. For example, a UX designer requesting Photoshop tools or additional RAM is making a service request.

Problem

A problem can be described as a series of incidents with an unidentified root cause, while an incident arises from a breakdown or from something that ceases to work, disrupting normal service. Incident handling is generally a reactive process, while problem management is more proactive. An incident management process aims at restoring services quickly, whereas problem management aims at bringing about a permanent fix.
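To make the incident/problem distinction concrete, here is a minimal Python sketch that flags candidate problems when several incidents share the same configuration item and category. The field names (`ci`, `category`) and the threshold of three incidents are illustrative assumptions, not ITIL requirements.

```python
from collections import Counter


def problem_candidates(incidents, threshold=3):
    """Return (ci, category) pairs that recur at least `threshold` times.

    Recurring incidents on the same item suggest an unidentified root
    cause worth handing to problem management. The threshold is assumed.
    """
    counts = Counter((i["ci"], i["category"]) for i in incidents)
    return [key for key, n in counts.items() if n >= threshold]


# Made-up sample incidents for illustration only.
incidents = [
    {"ci": "printer-02", "category": "hardware"},
    {"ci": "mail-01", "category": "email"},
    {"ci": "printer-02", "category": "hardware"},
    {"ci": "printer-02", "category": "hardware"},
]
print(problem_candidates(incidents))
```

Each incident would still be resolved on its own; the grouping only suggests where a permanent fix might be worth investigating.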

The stages in the Incident Management Process

Incident Management process encompasses the following sequence of actions:

Incident logging

Reporting an identified incident, or incident logging, is the first step in the incident management process. End users can do this themselves through any ticket source, or they can ask agents to raise tickets on their behalf. An incident form template that records details about the issue speeds up recovery, because the recorded field values can drive automation. Relevant channels, such as email, mobile apps, and self-service platforms, are configured to allow users to raise a ticket.
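As a rough sketch of what an incident form might capture, the Python below models a ticket record and a logging function. The field names, the status values, and the set of channels are assumptions made for illustration; a real service desk tool defines its own.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from itertools import count

# Hypothetical intake channels; a real service desk configures its own.
CHANNELS = {"email", "mobile_app", "self_service", "agent"}

_ticket_ids = count(1)  # simple in-memory ID generator for this sketch


@dataclass
class Incident:
    reporter: str
    summary: str
    channel: str
    category: str = "uncategorized"
    status: str = "open"
    ticket_id: int = field(default_factory=lambda: next(_ticket_ids))
    logged_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


def log_incident(reporter: str, summary: str, channel: str) -> Incident:
    """Validate the minimum required details and create a ticket."""
    if channel not in CHANNELS:
        raise ValueError(f"unknown channel: {channel}")
    if not summary.strip():
        raise ValueError("summary must not be empty")
    return Incident(reporter=reporter, summary=summary.strip(), channel=channel)


ticket = log_incident("alice", "Cannot connect to office wifi", "email")
print(ticket.ticket_id, ticket.status, ticket.channel)
```

Capturing the same structured fields on every ticket is what makes later steps, such as routing and prioritization, possible to automate.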

Incident Classification

Classifying incidents enables proper cataloging and assignment of tickets to a suitable agent. Category and sub-category fields in the incident template help choose the associated incident category. Categorization also streamlines prioritization. For example, if an incident is classified as a system outage, it might automatically be escalated to a higher priority. Categorization also helps problem management teams track and identify patterns across incidents, improving incident prevention.
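A minimal sketch of category-based routing is shown below. The categories, group names, and default priorities in the table are hypothetical; the point is only that a category and sub-category pair can determine both the assignment group and a starting priority, so that an outage starts at the top of the queue.

```python
# Hypothetical (category, sub-category) -> (assignment group, priority) table.
# Priorities run from 1 (highest) to 4 (lowest) in this sketch.
ROUTING = {
    ("network", "wifi"): ("network-team", 3),
    ("hardware", "printer"): ("desktop-support", 4),
    ("application", "email"): ("messaging-team", 3),
    ("infrastructure", "outage"): ("infrastructure-team", 1),
}
DEFAULT_ROUTE = ("service-desk", 4)  # fallback for unlisted categories


def classify(category: str, subcategory: str) -> dict:
    """Look up the assignment group and starting priority for a ticket."""
    group, priority = ROUTING.get((category, subcategory), DEFAULT_ROUTE)
    return {
        "category": category,
        "subcategory": subcategory,
        "group": group,
        "priority": priority,
    }


print(classify("infrastructure", "outage"))
print(classify("hardware", "scanner"))
```

Keeping the table small and flat is a deliberate choice here: a short list of categories is easier for agents to pick from correctly than a deep tree.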

Incident Prioritization

A Service Level Agreement (SLA) depends on ticket priority to define response and resolution times. Assigning the right priority to tickets is essential, as it helps address critical issues on time. It is therefore important to configure a realistic SLA definition for better customer satisfaction. Ranking incidents by their urgency and their impact on end users saves time during the incident management process.
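One common way to rank incidents is an impact and urgency matrix. The sketch below assumes three levels for each and derives a priority from 1 (critical) to 5; the resolution targets per priority are made-up values, since real targets come from the agreed SLA.

```python
from datetime import datetime, timedelta

LEVELS = {"high": 1, "medium": 2, "low": 3}

# Hypothetical resolution targets per priority; real values come from the SLA.
RESOLUTION_TARGET = {
    1: timedelta(hours=1),
    2: timedelta(hours=8),
    3: timedelta(hours=24),
    4: timedelta(hours=48),
    5: timedelta(hours=120),
}


def priority(impact: str, urgency: str) -> int:
    """Combine impact and urgency into a priority from 1 (critical) to 5."""
    return LEVELS[impact] + LEVELS[urgency] - 1


def resolution_due(logged_at: datetime, impact: str, urgency: str) -> datetime:
    """Return the time by which the ticket should be resolved."""
    return logged_at + RESOLUTION_TARGET[priority(impact, urgency)]


logged = datetime(2024, 1, 8, 9, 0)
print(priority("high", "high"), resolution_due(logged, "high", "high"))
print(priority("medium", "low"), resolution_due(logged, "medium", "low"))
```

Deriving the deadline from the priority, rather than setting it by hand, keeps the SLA clock consistent across agents.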

Investigation & Diagnosis

Diagnosis, also referred to as the response stage of the incident management process, often takes longer than the other steps. After a help desk employee receives a ticket, the first task is to form a preliminary hypothesis about the likely cause of the issue.

A troubleshooting runbook or flowchart can streamline the investigation process and make it less time-consuming, enabling help desk teams to identify or eliminate possible causes.

Tier I teams perform the initial analysis and investigation. If they cannot resolve the ticket at this stage based on their hypothesis and the resources available to them, the issue is escalated to Tier II and Tier III teams, who conduct a more detailed investigation using their additional expertise or resources. The incident is linked with the relevant CI (Configuration Item) for a quicker resolution.
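The tiered escalation described above can be sketched as a loop over support tiers. The resolver functions here are hypothetical placeholders for what each tier is able to fix, and the ticket fields (`category`, `ci`) are assumed names.

```python
from typing import Callable, Optional

# A resolver takes a ticket and returns a resolution note, or None if the
# tier cannot resolve it. These resolvers are hypothetical placeholders.
Resolver = Callable[[dict], Optional[str]]


def tier1(ticket: dict) -> Optional[str]:
    if ticket["category"] == "password":
        return "Password reset via runbook"
    return None


def tier2(ticket: dict) -> Optional[str]:
    if ticket["category"] == "network":
        return f"Restarted {ticket['ci']}"
    return None


def tier3(ticket: dict) -> Optional[str]:
    return None  # stands in for specialist or vendor support


TIERS = [("Tier I", tier1), ("Tier II", tier2), ("Tier III", tier3)]


def investigate(ticket: dict) -> dict:
    """Try each tier in order; escalate when a tier cannot resolve the ticket."""
    escalation_path = []
    for name, resolver in TIERS:
        escalation_path.append(name)
        note = resolver(ticket)
        if note is not None:
            return {**ticket, "status": "resolved", "resolution": note,
                    "escalation_path": escalation_path}
    return {**ticket, "status": "open", "resolution": None,
            "escalation_path": escalation_path}


result = investigate({"category": "network", "ci": "switch-07"})
print(result["status"], result["escalation_path"], result["resolution"])
```

Recording the escalation path on the ticket shows which tiers handled it, which can later feed reporting on how often Tier I resolves issues without escalation.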

Incident Resolution & Closure

Incident resolution is critical to meeting the Service Level Agreement, so it is imperative that an incident is resolved in a timely fashion. Communicating the resolution effectively is equally important, so that users can resume normal operation. Tickets can be closed by users through self-service portals or automatically by the system.
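A minimal sketch of the two closure paths, user confirmation and automatic closure, might look like the following. The three-day grace period and the status names are assumptions for illustration.

```python
from datetime import datetime, timedelta

AUTO_CLOSE_AFTER = timedelta(days=3)  # assumed grace period, not a standard


def next_status(status: str, resolved_at: datetime,
                user_confirmed: bool, now: datetime) -> str:
    """Close a resolved ticket on user confirmation or after the grace period."""
    if status != "resolved":
        return status  # only resolved tickets are eligible for closure
    if user_confirmed or now - resolved_at >= AUTO_CLOSE_AFTER:
        return "closed"
    return "resolved"


resolved_at = datetime(2024, 1, 8, 12, 0)
print(next_status("resolved", resolved_at, True, datetime(2024, 1, 8, 13, 0)))
print(next_status("resolved", resolved_at, False, datetime(2024, 1, 9, 12, 0)))
print(next_status("resolved", resolved_at, False, datetime(2024, 1, 12, 12, 0)))
```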

Incident Management Process

The incident management process aims to rapidly restore services, in adherence to the service level agreements. Unlike Problem Management, where finding the root cause of problems is key, Incident Management is fundamentally about getting things back up quickly, even if this means implementing workarounds and quick fixes.

Technology plays a crucial role in optimizing this process, both by automating the process activities themselves (such as incident recording and classification) and by providing access to the outputs of other associated processes. Integration with other processes (particularly Problem, Change, Configuration, and Service Level Management) is very important to make sure that incidents are kept to a minimum and that the highest levels of service availability are maintained.

The Incident Management process is responsible for managing the lifecycle of all incidents, regardless of their origin. The key goals for the Incident Management process are:

Incident Management involves IT service providers and internal and external resources in reporting, recording, and working on an incident. All Incident Management process activities should be fully implemented, operated as implemented, measured, and amended as necessary.

A successful Incident Management process also highlights other areas that need attention. There are numerous qualitative and measurable benefits, for both IT service providers and end users, in implementing an effective and efficient Incident Management process. Here are some of the key benefits that an Incident Management process brings to the organization:

Key benefits for IT service providers:
• Improved capability to identify prospective improvements to IT services
• Prioritization of efforts
• Better use of resources, reduction in unplanned work and associated costs
• Better control over IT services
• Better coordination between departments
• Empowered IT staff
• More control over vendor management through Incident Management metrics

Key benefits for end users:
• Better service availability due to less service downtime
• Reduction in unplanned work and associated costs
• IT activity in line with real-time business priorities
• Identification of prospective improvements to services
• Recognizing additional service or training requirements for the business or IT

Incident manager roles & responsibilities

An incident manager creates and manages the enterprise incident management process for the organization and implements ITIL best practices within that process. The incident manager is responsible for restoring normal service operation as rapidly as possible to limit any adverse impact on business operations. The key roles and responsibilities are:

Incident management process - Key Performance Indicators

The Incident Manager is responsible for defining the right KPIs. This ensures business alignment, and KPI reports are reviewed with management periodically. KPIs are tied to Critical Success Factors (CSFs), and CSFs, in turn, are tied to the primary business objectives. A service desk solution helps assess these KPIs with advanced analytics and reports. These reports are automated and used to improve existing processes and the overall health of the business. KPI reports include ticket trends, agent performance, CSAT, and SLA reports.
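As an illustration of how such reports might be derived from ticket data, the sketch below computes two widely used figures: mean time to resolve and the percentage of tickets resolved within their SLA target. The ticket fields and the sample records are made up for the example.

```python
from datetime import datetime, timedelta
from statistics import mean

# Made-up sample tickets; "target" is the assumed SLA resolution target.
tickets = [
    {"opened": datetime(2024, 1, 8, 9, 0),
     "resolved": datetime(2024, 1, 8, 10, 0),
     "target": timedelta(hours=4)},
    {"opened": datetime(2024, 1, 8, 11, 0),
     "resolved": datetime(2024, 1, 8, 17, 0),
     "target": timedelta(hours=4)},
    {"opened": datetime(2024, 1, 9, 9, 0),
     "resolved": datetime(2024, 1, 9, 11, 0),
     "target": timedelta(hours=8)},
]


def mean_time_to_resolve(tickets) -> timedelta:
    """Average elapsed time between opening and resolving a ticket."""
    seconds = mean(
        (t["resolved"] - t["opened"]).total_seconds() for t in tickets
    )
    return timedelta(seconds=seconds)


def sla_compliance(tickets) -> float:
    """Percentage of tickets resolved within their SLA target."""
    met = sum(1 for t in tickets if t["resolved"] - t["opened"] <= t["target"])
    return 100.0 * met / len(tickets)


print(mean_time_to_resolve(tickets))
print(round(sla_compliance(tickets), 1))
```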

Typical Incident Management process metrics include:

Optimizing the Incident management process

Since the Incident Management process aims to enable users to resume work as quickly as possible, process activities should include technologies that support the tasks of identifying, classifying, monitoring, and resolving incidents. Tools that help augment the Incident Management process should provide:

Some issues to look out for to avoid problems in the Incident Management process:

• Incident Management Bypass
IT cannot gauge service levels and errors when users try to resolve incidents themselves. Centralizing the Service Desk function, with the help of technology, allows it to act as the clearinghouse for all incidents. Incident Management bypass can also happen when users informally ask SME groups for help. From a process perspective, the SME group should take on the work only after the incident has been logged.

• Holding on to Incidents
Fusing Incident Management and Problem Management into a hybrid Incident Management process can be detrimental from a metrics perspective. The processes have to be clearly distinguished, and incidents should be closed once the user confirms that the error has been rectified.

• Traffic Overload
Traffic overload occurs when there is an unexpectedly high number of incidents. This may lead to the incorrect recording of incidents, resulting in longer resolution times and degradation of the overall service. Automating procedures to arrange spare capacity and resources can help overcome traffic overload.

• Too Many Choices
Classifying incidents in very fine detail and navigating through many sub-levels may increase handling time and lead to incorrect classification, as the analyst may give up searching for the best match.

• Lack of a Service Catalog
A Service Catalog helps to clearly define IT services, the configuration components that support each service, and the agreed service levels.