Incident management is the most common of all IT support processes. It is the process that IT uses to respond when customers report something broken, unavailable or not functioning the way that it is expected to. The purpose of incident management is to provide a consistent and predictable way of engaging with users, identifying when something is broken and fixing it as quickly as possible, so business can return to normal. Often, incident management is performed either directly by a helpdesk function or coordinated by a helpdesk function through a network of subject matter experts.
In the context of IT, an incident is an event outside of normal operations that disrupts a user’s activities or a business process. An incident begins when normal operations are disrupted and continues until normal operations are restored. Incidents typically require some sort of intervention to resolve. An incident may involve the failure of a technology component or feature; it may be an issue with the integrations between components; or it may be the failure of workflows that are configured to run on a system. Some examples of incidents include:
Not all incidents are equal in importance, impact and urgency. Some incidents represent full work stoppages while others are only an inconvenience that can be easily navigated around. An incident may impact a single person (or no people at all), a team, a location or an entire organization. Depending on the time of day and degree of impact and the criticality of the business operations being impacted, incidents will require different levels of urgency and attention. Most companies utilize formalized Service Level Agreements (SLAs) to bring objective structure to the assessment of impact and urgency and to assign incident priority and criticality scores that are used to ensure resources are focused on addressing the most pressing incidents first.
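The way impact and urgency combine into a priority can be sketched as a simple lookup matrix. This is only an illustration: the three-level scale, the P1 to P5 values, and the names `Level`, `PRIORITY_MATRIX`, `assign_priority` and `triage` are assumptions made for this example, not values taken from any particular SLA or standard.

```python
from enum import IntEnum


class Level(IntEnum):
    """Hypothetical three-level scale used for both impact and urgency."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Hypothetical matrix: (impact, urgency) -> priority, where 1 is most pressing.
# A real organization would take these values from its own SLA definitions.
PRIORITY_MATRIX = {
    (Level.HIGH, Level.HIGH): 1,
    (Level.HIGH, Level.MEDIUM): 2,
    (Level.MEDIUM, Level.HIGH): 2,
    (Level.HIGH, Level.LOW): 3,
    (Level.MEDIUM, Level.MEDIUM): 3,
    (Level.LOW, Level.HIGH): 3,
    (Level.MEDIUM, Level.LOW): 4,
    (Level.LOW, Level.MEDIUM): 4,
    (Level.LOW, Level.LOW): 5,
}


def assign_priority(impact: Level, urgency: Level) -> int:
    """Look up the priority for a given impact and urgency."""
    return PRIORITY_MATRIX[(impact, urgency)]


def triage(incidents: list) -> list:
    """Return incidents ordered so the most pressing come first."""
    return sorted(
        incidents,
        key=lambda inc: assign_priority(inc["impact"], inc["urgency"]),
    )


queue = triage([
    {"id": "A", "impact": Level.LOW, "urgency": Level.MEDIUM},
    {"id": "B", "impact": Level.HIGH, "urgency": Level.HIGH},
    {"id": "C", "impact": Level.MEDIUM, "urgency": Level.MEDIUM},
])
print([inc["id"] for inc in queue])
```

Keeping the mapping in a table rather than in branching logic makes it easy to review against the SLA and to change without touching the triage code.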
Incident management is an IT Service Management (ITSM) process area that is focused on restoring normal service operation as quickly as possible and minimizing the impact of disruption on business operations. It is typically executed as a sequential process.
The process begins with identifying that an incident has occurred and capturing it in the incident management system. This may be done through automated monitoring systems or by a user contacting a helpdesk for support.
Reporting and communication begin at the time an incident is identified to acknowledge to the user or community that a disruption is taking place and investigation is in progress.
Impact and urgency assessments guide the assignment of priority and severity classifications that are used to determine the level of support to be provided.
Issue diagnosis and troubleshooting help to isolate symptoms from underlying causes and identify the relationship between the incident and changes in the environment or known issues.
Resolving incidents typically involves some sort of support action (making a change, rebooting a resource, etc.). Resolving an incident also includes documenting analysis findings and the steps taken to restore service.
Closing an incident involves communicating with the user or community, updating support documentation and initiating problem management processes (if necessary).
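The sequential steps above can be sketched as a small state machine that only allows an incident to move forward one stage at a time. This is a minimal sketch under stated assumptions: the stage names and the `Stage` and `Incident` types are hypothetical labels for the steps described here, not the data model of any real incident management system.

```python
from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """Hypothetical stage names mirroring the steps described above."""
    IDENTIFIED = "identified"
    REPORTED = "reported"
    CLASSIFIED = "classified"
    DIAGNOSED = "diagnosed"
    RESOLVED = "resolved"
    CLOSED = "closed"


# Enum members iterate in definition order, which gives the sequence.
ORDER = list(Stage)


@dataclass
class Incident:
    summary: str
    stage: Stage = Stage.IDENTIFIED
    history: list = field(default_factory=list)

    def advance(self, note: str) -> Stage:
        """Record a note for the current stage, then move to the next one."""
        if self.stage is Stage.CLOSED:
            raise ValueError("incident is already closed")
        self.history.append((self.stage, note))
        self.stage = ORDER[ORDER.index(self.stage) + 1]
        return self.stage


incident = Incident("Email gateway is not delivering messages")
incident.advance("Captured from a helpdesk call")
incident.advance("Acknowledged to affected users")
incident.advance("Impact and urgency assessed, priority assigned")
incident.advance("Cause traced to a recent change")
incident.advance("Service restored, findings documented")
print(incident.stage)
```

Recording a note at every transition reflects the documentation duties described for resolution and closure; a real system would likely also allow reopening or reassigning an incident, which this sketch leaves out.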
Many simple incidents are resolved by a single support resource; however, more complex incidents may involve teams of resources that need to be coordinated, different stakeholder audiences that require ongoing communication, or both. An incident manager is responsible for orchestrating the incident management process, coordinating the resources needed to resolve the incident and facilitating stakeholder communications.
An incident command system is a piece of software that is used to assist in coordinating resources working together on an incident. It will often extend beyond the basic capabilities of an incident management system and include things like conferencing and collaboration tools, resource management capabilities, a centralized monitoring console and the ability to publish incident updates through a variety of channels.
Helpdesks and IT service desks often address more user issues than just incidents. One of the more common issues they handle is the user request. While an incident is a disruption of normal processing (something is broken), a request is simply an activity that requires assistance to complete (something that needs to be done). Examples of requests might be setting up systems for a new employee, granting access to data, or performing an update to system software. While incidents and requests often follow similar processes and may leverage the same tools for tracking, requests do not represent a failure and disruption; they are a part of normal business processing.
There is a special classification of incidents for crisis situations called Major Incidents that represent widespread disruption to business operations, a critical security risk or inability for the company to deliver on expectations to customers. Major incidents often include increased management involvement to assess impacts and coordinate communication, enhanced incident manager involvement as well as more formalized decision-making structures. Major incidents are highly time-sensitive and could include engaging resources outside of normal business hours to assist in diagnosis and resolution of the issue.
The terms incident and problem are frequently used interchangeably; however, there is an important distinction between them. Incident management is concerned with addressing the symptoms and impact of an issue, while problem management is concerned with addressing the cause of the issue and its potential for recurrence.
The timeline of an incident starts when normal operations are disrupted and ends when service is restored and normal operations resume. The timeline of a problem starts when the underlying issue is introduced into the environment (often through a change in configuration, a release, or a change in usage) and ends when the underlying issue is removed (which may not be until a future release). Problem management includes activities like root cause analysis, risk assessment and the prioritization and selection of a long-term fix.
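The difference between the two timelines can be made concrete with a short sketch. The `Timeline` type and all dates below are hypothetical and chosen only for illustration: the problem begins when a faulty change is introduced and remains open, while the incident covers only the period in which users were disrupted.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Timeline:
    """A start time and an optional end time (None means still open)."""
    start: datetime
    end: Optional[datetime] = None

    def is_open(self) -> bool:
        return self.end is None

    def duration(self, now: datetime) -> timedelta:
        # For an open timeline, measure up to `now` instead of the end.
        return (self.end if self.end is not None else now) - self.start


# Hypothetical dates, for illustration only.
# The problem starts when the underlying issue enters the environment.
problem = Timeline(start=datetime(2024, 3, 1, 18, 0))

# The incident starts later, when normal operations are disrupted,
# and ends when service is restored.
incident = Timeline(
    start=datetime(2024, 3, 4, 9, 0),
    end=datetime(2024, 3, 4, 11, 30),
)

now = datetime(2024, 3, 10, 9, 0)
print("incident duration:", incident.duration(now))
print("problem still open:", problem.is_open())
print("problem age:", problem.duration(now))
```

In this example the incident is over within hours, but the problem stays open until the underlying issue is removed, which is why the two are tracked separately.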
There are two main industry standards that companies use to guide their incident management processes. Because incident management is so common and performed by almost every IT organization with very little need for customization, standards play a helpful role in instructing organizations on what needs to be done and how to manage the incident management process.
The IT Infrastructure Library (ITIL) is the reference most commonly cited by companies when designing their incident management processes. As part of the overall IT Service Operations process area, ITIL devotes considerable attention to the processes surrounding incident management, including interfaces with other processes such as change management, problem management and request management.
ISO/IEC 20000-1 is the international standard that covers incident management as part of IT service management. The ISO standard is used less frequently for operational design and more often for establishing contractual relationships and defining standards of performance for service provider organizations.