Key activities an incident manager performs
Support tickets and service requests from end users of IT systems, along with issues identified through system and service monitoring, frequently drive the daily tasks of incident managers. Most companies use ITSM ticketing software that aggregates support tickets into queues and facilitates the assignment of tasks to individual incident managers. An incident manager will typically be assigned anywhere from a few to as many as 20 support tickets (in various statuses) over the course of a day or shift. Even with what seems like a large ticket backlog, most incident managers will be actively working on only 1–3 incidents at a time.
The goal of the incident management process is to minimize the impact of IT incidents on system and service users and on business operations. The incident manager achieves this goal by performing a series of activities: some focused on understanding the issue, some on resolving it and others on minimizing future impact.
Triage
This is the incident manager’s first step when a new incident is encountered. He or she will seek to understand the reported symptoms and the extent of the disruption and determine the level of urgency to apply to the resolution of the issue. This initial triage will determine ticket priority/criticality, establish SLA expectations for response/resolution time and determine what processes and resources will be leveraged to resolve the issue.
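One common way to make this triage decision repeatable is an impact/urgency matrix that maps to a priority and its SLA targets. The sketch below is illustrative only: the three levels, the priority numbers and the response/resolution targets are assumptions chosen for the example, not values from any particular ITSM tool or SLA.

```python
from dataclasses import dataclass
from datetime import timedelta

# Hypothetical impact/urgency matrix: (impact, urgency) -> priority (1 = most critical).
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}

# Hypothetical SLA targets per priority: (time to respond, time to resolve).
SLA_TARGETS = {
    1: (timedelta(minutes=15), timedelta(hours=4)),
    2: (timedelta(minutes=30), timedelta(hours=8)),
    3: (timedelta(hours=2), timedelta(days=1)),
    4: (timedelta(hours=4), timedelta(days=3)),
    5: (timedelta(days=1), timedelta(days=5)),
}


@dataclass(frozen=True)
class TriageResult:
    priority: int
    respond_within: timedelta
    resolve_within: timedelta


def triage(impact: str, urgency: str) -> TriageResult:
    """Map an impact/urgency pair to a priority and its SLA targets."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"unknown impact/urgency combination: {key}")
    priority = PRIORITY_MATRIX[key]
    respond_within, resolve_within = SLA_TARGETS[priority]
    return TriageResult(priority, respond_within, resolve_within)
```

For example, triage("high", "medium") yields priority 2 under these assumed tables. Keeping the matrix and the targets as plain data makes them easy to adjust when the organization's SLA expectations change.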
Assessing impact
Incidents vary significantly in their impact on users and business operations. Most incidents will be relatively low-impact, disrupting the activities of one or a few users, with workarounds available that allow business activities to continue (even if productivity suffers). Other incidents have a much greater impact on the company, such as a critical system outage, a security breach or the failure of automated workflows. These more critical incidents can affect entire business functions and locations, impair the company’s ability to deliver to customers or jeopardize its reputation. The incident manager is responsible for performing an initial impact assessment and then re-assessing impact periodically as the incident evolves.
Diagnostics and data collection
For the incident manager to resolve an issue, he or she must first understand what is occurring, both technically and within the operating environment of the affected system. When an incident is created (a ticket is opened), it is common for the initial description to be incomplete and to describe only the symptoms of the issue. A ticket rarely states the underlying problem clearly at the outset. The incident manager will perform a variety of diagnostic tests, talk with users and collect data about the incident to develop a clearer and more complete understanding of what is occurring. This data will be compared against known issues, knowledge articles and the incident manager’s personal experience to drive the resolution of the issue.
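The comparison against known issues can be approximated with a simple text-similarity check. The following is a minimal sketch that ranks known-issue entries by word overlap (Jaccard similarity) with the ticket description. The function names, the dictionary layout and the min_score threshold are all hypothetical; real ITSM tools typically provide their own, more capable search.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def rank_known_issues(description: str, known_issues: dict[str, str],
                      min_score: float = 0.1) -> list[tuple[str, float]]:
    """Return (title, score) pairs for known issues that overlap with the description.

    known_issues maps an article title to its summary text. The score is the
    Jaccard similarity between the two word sets, rounded to two decimals.
    """
    query = tokens(description)
    scored = []
    for title, summary in known_issues.items():
        candidate = tokens(title + " " + summary)
        union = query | candidate
        score = len(query & candidate) / len(union) if union else 0.0
        if score >= min_score:
            scored.append((score, title))
    # Highest score first; ties broken alphabetically by title.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(title, round(score, 2)) for score, title in scored]
```

Word overlap is crude (it ignores synonyms and word order), so a result list like this is best treated as a set of candidates for the incident manager to review rather than as a diagnosis.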
Troubleshooting and remediation
This part of the incident manager’s job is the most recognizable. Once the incident manager has identified the technical issue and collected some data, he or she will work through a series of troubleshooting activities to identify what is causing the issue and how to fix it. It is important to keep in mind that the incident manager’s primary objective is to minimize impact and restore service quickly. As a result, he or she may restart services, reboot hardware or suggest reinstalling software as a means of remediation. These are common remediation steps, designed to resolve the impact of the incident even when there isn’t a clear and complete understanding of what caused the initial problem.
Interacting with requestors
Incident managers aren’t just responsible for working on technical systems; they are also responsible for interacting with the users who open IT support tickets. This interaction occurs throughout the incident lifecycle, from initial impact assessment and data collection, through troubleshooting and remediation (updating the user on status), to follow-up after the incident is closed to confirm the issue was resolved completely. Incident managers require strong communication skills as well as the ability to show empathy for users and elicit information that can aid in diagnosis.
Collecting data to enable problem management
While incident managers’ primary responsibility is alleviating the immediate business impact and disruption, they also play a critical role in collecting data that helps establish the root cause of problems, so permanent fixes can be developed. While an incident is actively being addressed, incident managers can view a wide range of system and environmental data that may not be available to problem managers once the incident has been resolved. For this reason, before performing activities such as restarts and reinstallations, incident managers will often spend some time collecting the information that problem managers will need for a more detailed root-cause analysis.
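The "collect first, then remediate" habit described above can be sketched as a small wrapper that records a snapshot before running the fix. This is an illustration under stated assumptions: capture_snapshot and remediate_with_evidence are hypothetical helper names, the ticket ID format is made up, and the fields captured here (hostname, OS, disk usage) are only a small sample of what a real investigation would preserve.

```python
import platform
import shutil
from datetime import datetime, timezone
from typing import Any, Callable


def capture_snapshot(ticket_id: str, notes: str = "") -> dict[str, Any]:
    """Record basic system state at the moment of capture."""
    usage = shutil.disk_usage(".")  # disk holding the current working directory
    return {
        "ticket_id": ticket_id,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "hostname": platform.node(),
        "os": platform.platform(),
        "disk_free_bytes": usage.free,
        "disk_total_bytes": usage.total,
        "notes": notes,
    }


def remediate_with_evidence(ticket_id: str, remediation: Callable[[], Any],
                            notes: str = "") -> dict[str, Any]:
    """Capture a snapshot first, then run the remediation step.

    The snapshot is taken before remediation() is called, so state that a
    restart or reinstall would erase is preserved for root-cause analysis.
    """
    snapshot = capture_snapshot(ticket_id, notes)
    outcome = remediation()
    return {"before": snapshot, "outcome": outcome}
```

A call such as remediate_with_evidence("INC-1", restart_service), where restart_service is whatever function performs the fix, returns a record that could be attached to the ticket for the problem manager.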
Creating knowledge resources
Knowledge management is an important part of the incident management process. Incident managers use knowledge articles and known-issue databases as part of their diagnostic and troubleshooting activities to compare current incidents with past situations. Incident managers are also important contributors to the IT organization’s collective knowledge-management database, as they update knowledge resources for previous issues and create new knowledge articles for undocumented situations.