Every incident is a disruption for a company. Some incidents impact a single user; others disrupt productivity for entire business functions. An incident manager’s job is to respond to incidents when they occur and take any necessary steps to restore service and return the business to normal operations as quickly as possible. Incident managers are the IT staff members with whom employees, suppliers, and customers interact when they are stuck and need help. To serve their needs, incident managers must possess technical skills, access to tools and information, and a customer-service mindset for interacting with users.

Where do incident managers fit within an IT organization?

Incident management is a generic job category that includes many positions, from generalist call-center agents to deeply technical engineering staff. Many companies seek to centralize incident management into an IT service management (ITSM) function, but it isn’t uncommon to find employees performing incident manager duties throughout the IT organization. The most common incident manager roles are found at IT service desks, call-centers, operations centers, specialized support teams, and field-support functions. Each of these functions is responsible for providing support for IT systems, which includes responding to incidents when they occur. Other less-obvious places for incident managers in an IT organization include information security, data management, governance and compliance, and solution-development teams. In these functions, incident managers focus on specialized types of incidents and, often, have unique protocols they use for incident response.

Incident management is where many IT careers start

For many IT professionals recently entering the industry or graduating from college, incident manager positions are an ideal start to their technical career. The structured processes of incident management combined with knowledge resources and training provide an environment where new IT professionals can develop skills and experience while making a positive contribution to their companies. Many professionals who start their careers in incident management move to positions in solution development, operations and specialties, such as security and risk management.

Incident managers interact with a great variety of users and systems, giving them the opportunity to learn about the company, how it operates and the specific tasks and responsibilities within IT and business groups. Incident managers who remain in the position for many years often gain a very broad perspective on business operations and have a better overall understanding of the business than top-level executives. The subject-matter expertise gained in an incident manager position creates some of a company’s most valuable IT employees.

New IT professionals often start their working careers with a general understanding of different types of technology (perhaps a programming language, some hardware knowledge or some technical-support skills), but they may lack the experience needed to thoroughly understand technology’s importance in business and how to apply their technical knowledge to actual IT systems. Being an incident manager gives IT professionals opportunities to refine their technical skills by applying them to real-world IT issues, and to develop the confidence to tackle more complex technical projects. Resolving IT incidents requires diagnostic and troubleshooting skills and an understanding of how IT systems work and how they interact with each other and with the users who rely on them. The experience gained as an incident manager helps an employee develop the foundation for a productive career in the IT industry.

Key activities an incident manager performs

The daily tasks of incident managers are frequently driven by support tickets and service requests from end users of IT systems, or by issues identified through system and service monitoring. Most companies have ITSM ticketing software that aggregates support tickets into queues and facilitates the assignment of tasks to individual incident managers. An incident manager will typically be assigned anywhere from a few to as many as 20 support tickets (in various statuses) throughout the day or shift. Even with what seems like a large ticket backlog, most incident managers will actively focus on only 1–3 incidents at a time.

The goal of the incident management process is to minimize the impact of IT incidents on system/service users and on business operations. The incident manager achieves this goal by performing a series of activities: some focused on understanding the issue, some on resolving it, and others on minimizing future impacts.

Triage

This is the incident manager’s first step when a new incident is encountered. He or she will seek to understand the reported symptoms and the extent of the disruption and determine the level of urgency to apply to the resolution of the issue. This initial triage will determine ticket priority/criticality, establish SLA expectations for response/resolution time and determine what processes and resources will be leveraged to resolve the issue.
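The triage step described above can be sketched as a simple lookup. This is a minimal illustration, not a standard: the three-level impact/urgency scale, the priority numbers and the SLA targets in minutes are all hypothetical values chosen for the example, and each organization would define its own.

```python
from dataclasses import dataclass

# Hypothetical three-level scale; real organizations define their own.
LEVELS = ("high", "medium", "low")

# Assumed priority matrix keyed by (impact, urgency). Priority 1 is most critical.
PRIORITY_MATRIX = {
    ("high", "high"): 1,
    ("high", "medium"): 2,
    ("medium", "high"): 2,
    ("high", "low"): 3,
    ("medium", "medium"): 3,
    ("low", "high"): 3,
    ("medium", "low"): 4,
    ("low", "medium"): 4,
    ("low", "low"): 5,
}

# Illustrative SLA targets in minutes: (response, resolution). Assumed values.
SLA_TARGETS = {
    1: (15, 240),
    2: (30, 480),
    3: (60, 1440),
    4: (240, 2880),
    5: (480, 7200),
}


@dataclass(frozen=True)
class TriageResult:
    priority: int
    response_minutes: int
    resolution_minutes: int


def triage(impact: str, urgency: str) -> TriageResult:
    """Map an impact/urgency assessment to a priority and its SLA targets."""
    key = (impact.lower(), urgency.lower())
    if key not in PRIORITY_MATRIX:
        raise ValueError(f"impact and urgency must each be one of {LEVELS}")
    priority = PRIORITY_MATRIX[key]
    response, resolution = SLA_TARGETS[priority]
    return TriageResult(priority, response, resolution)
```

For example, `triage("high", "medium")` yields priority 2 under these assumed tables. Keeping the matrix as data rather than branching logic makes it easy to adjust when the business re-defines what counts as critical.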

Assessing impact

Incidents vary significantly in their impact on users and business operations. Most incidents will be relatively low-impact, disrupting the activities of one or a few users, with some workarounds available to enable business activities to continue (even if productivity is impacted). Other incidents have a much greater impact on the company – a critical system outage, a security breach or the failure of automated workflows. These more critical incidents can impact entire business functions and locations and the company’s capability to deliver to customers or jeopardize the company’s reputation. The incident manager is responsible for performing an initial impact assessment and then re-assessing impact periodically as the incident evolves.

Diagnostics and data collection

For the incident manager to resolve an issue, he or she must first develop an understanding of what is occurring, both technically and within the operating environment of the affected system. When an incident is created (a ticket is opened), the initial description is commonly incomplete and describes only the symptoms of the issue. Rarely does a ticket clearly state the underlying problem at the outset. The incident manager will perform a variety of diagnostic tests, talk with users and collect data about the incident to develop a clearer and more complete understanding of what is occurring. This data will be compared against known issues, knowledge articles and the incident manager’s personal experience to drive the resolution of the issue.

Troubleshooting and remediation

This part of the incident manager’s job is the most recognizable. Once the incident manager has identified the technical issue and collected some data, he or she will work through a series of troubleshooting activities to identify what is causing the issue and how to fix it. It is important to keep in mind that the incident manager’s primary objective is to minimize impact and restore service quickly. As a result, he or she may restart services, reboot hardware or suggest re-installing software as a means of remediation. These are common remediation steps, designed to resolve the impact of the incident even if there isn’t a clear and complete understanding of what caused the initial problem.

Interacting with requestors

Incident managers aren’t just responsible for working on technical systems; they are also responsible for interacting with the users who open IT-support tickets. This interaction occurs throughout the incident lifecycle, from initial impact assessment and data collection, through troubleshooting and remediation (updating the user on status), to following up after the incident is closed to ensure the issue was resolved completely. Incident managers require strong communication skills as well as the ability to show empathy for users and elicit information that can aid in diagnosis.

Collecting data to enable problem management

While incident managers’ primary responsibility is alleviating the immediate business impact and disruption, they are also critical to collecting data that aids in understanding the root cause of problems, so permanent fixes can be developed. While an incident is actively being addressed, incident managers can view a wide range of system and environmental data that may not be available to problem managers once the incident has been resolved. For this reason, before performing activities such as restarts and re-installations, incident managers will often spend some time collecting the information that problem managers will need to perform a more detailed root-cause analysis.

Creating knowledge resources

Knowledge management is an important part of the incident management process. Incident managers use knowledge articles and known-issue databases as parts of their diagnostic and troubleshooting activities to compare current incidents with past situations. Incident managers are also important contributors to the IT organization’s collective knowledge-management database, as they update knowledge resources for previous issues and create new knowledge articles for undocumented situations.

Tools incident managers use

Incident management in most IT organizations is a remote activity, with incident managers sitting in offices that are isolated from the users and systems on which they are working. Granted, there are some incident managers performing field service, but most incidents can now be resolved remotely. This is possible because of the wide range of IT tools available to incident managers to aid in the incident management process. Some of the most common tools that incident managers use include:

In most companies, a consolidated ITSM platform provides many of these capabilities, which enable incident managers (and others) to access all the information and tools they need to do their job from a single interface. The “single-pane-of-glass” concept has led to significant productivity improvements in ITSM functions (including incident management) during the past few years.

Look for these skills when hiring incident managers

The skills and experience level of your incident management staff may vary greatly – from new college graduates with little industry experience to subject-matter experts with decades of incident management and technical experience. When hiring for incident manager roles, it is important to understand the strengths and weaknesses of your current staff and seek candidates who will complete your team profile. Companies often look for these 5 traits when evaluating their incident manager needs and the value of an individual candidate:

In addition to these specific traits, incident managers must thrive working in high-stress environments with multiple priorities and a strong sense of urgency. They should have an innate curiosity for understanding how systems work and a learning mindset. Incident managers can come from many different backgrounds and the position is ideal for introducing motivated candidates to the IT industry.

Measuring incident managers’ performance

Incident managers must maintain a balance between customer satisfaction, timely resolution of incidents and productivity/cost. It is easy for an incident manager to excel at one or two of these goals, but balancing all three can be quite challenging. The metrics used to evaluate incident manager performance are important: they help incident managers understand the company’s expectations of their role and responsibilities, and they guide activities toward the outcomes the company views as most important. Here are some of the most common measurements used to evaluate incident manager performance:

SLA compliance

The primary measure companies use to evaluate incident-management performance is compliance with response-time and resolution-time SLAs. SLAs seek to measure whether the IT-support function (and incident manager) has achieved the committed expectations set with the user. Keep in mind that IT sets SLAs, and they may not represent what the user actually expects. SLA-compliance rates are good indicators of whether the services provided to users are within acceptable levels (as IT management defines them).
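As a rough sketch of how a resolution-time SLA-compliance rate could be computed, the example below counts the share of resolved tickets closed within their target. The `Ticket` fields are hypothetical names for this illustration; an actual ITSM platform will expose its own schema and may also pause the SLA clock while waiting on the user, which this sketch ignores.

```python
from datetime import datetime, timedelta
from typing import Iterable, NamedTuple


class Ticket(NamedTuple):
    # Hypothetical fields; a real ITSM export will use its own names.
    opened: datetime
    resolved: datetime
    resolution_target: timedelta


def sla_compliance_rate(tickets: Iterable[Ticket]) -> float:
    """Fraction of resolved tickets closed within their resolution target."""
    tickets = list(tickets)
    if not tickets:
        raise ValueError("no resolved tickets to measure")
    met = sum(1 for t in tickets if t.resolved - t.opened <= t.resolution_target)
    return met / len(tickets)
```

A response-time SLA could be measured the same way by swapping the resolved timestamp for the first-response timestamp.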

First-call resolution

The goal of incident management is to resolve incidents as quickly as possible to minimize business disruption. Repeated calls/tickets on the same incident or the need to engage the user multiple times are indicators of delays in incident resolution. First-call resolution rates are strong, primary indicators of the effectiveness of the solutions the incident manager provides and of his or her success in minimizing business impacts.
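One possible way to compute a first-call resolution rate is sketched below. The definition used here (exactly one user contact and no reopening) is an assumption made for illustration; organizations define "first call" differently, and the field names are hypothetical.

```python
from typing import Iterable, NamedTuple


class ClosedTicket(NamedTuple):
    # Hypothetical fields for this sketch.
    contacts: int   # number of user contacts needed before closure
    reopened: bool  # True if the ticket was reopened after closure


def first_call_resolution_rate(tickets: Iterable[ClosedTicket]) -> float:
    """Fraction of closed tickets resolved in one contact and never reopened."""
    tickets = list(tickets)
    if not tickets:
        raise ValueError("no closed tickets to measure")
    first_call = sum(1 for t in tickets if t.contacts == 1 and not t.reopened)
    return first_call / len(tickets)
```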

Knowledge-article contribution

Incident-management processes that leverage shared knowledge are much more efficient and quicker than those that rely on incident managers to diagnose each new incident from scratch. Knowledge-article contribution (authoring, updates, and reviews) is a good measurement of how well the incident manager is contributing to the overall success of the incident-management function.

Issues resolved per shift

This is a basic productivity measure of how many tickets are resolved during a given period. This measurement method does not consider issue complexity or incident managers’ skill level. It is common for senior incident managers to resolve fewer but more complex incidents than junior incident managers, who address simpler incidents. Issue-resolution rates should be benchmarked against peers with similar skill levels and workloads.

Escalation rate

Incident managers aren’t expected to resolve every ticket themselves. Sometimes, escalation is required. Monitoring escalation rates, together with the amount of time the incident manager spends working on an issue before escalating it, is a good way to evaluate whether he or she is spending too much or too little time on each issue. It is also a good indicator of whether the assigned workload is appropriate for the incident manager’s skill level.
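The two figures discussed above, the escalation rate and the time worked before escalating, can be summarized together. The sketch below assumes a simple input format invented for this example: one entry per ticket, holding the minutes worked before escalation, or `None` when the ticket was resolved without escalating.

```python
from statistics import median
from typing import Optional, Sequence


def escalation_summary(minutes_before_escalation: Sequence[Optional[float]]) -> dict:
    """Summarize escalation behavior for one incident manager.

    Each entry represents one ticket: minutes worked before escalating,
    or None if the ticket was resolved without escalation.
    """
    if not minutes_before_escalation:
        raise ValueError("no tickets to measure")
    escalated = [m for m in minutes_before_escalation if m is not None]
    return {
        "escalation_rate": len(escalated) / len(minutes_before_escalation),
        # The median is used so one unusually long ticket does not skew the picture.
        "median_minutes_before_escalation": median(escalated) if escalated else None,
    }
```

A high rate with a very short median may suggest tickets are being passed on too quickly, while a long median may suggest too much time is spent before asking for help. Where those thresholds sit is a judgment for each team.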

User satisfaction score

This is the most common measure of the incident manager’s customer service and soft skills. The purpose of IT support is to help users throughout the company be more productive (to support them). While satisfaction scores are subjective, they provide a better indicator than SLA compliance of how well user expectations are being fulfilled. They also often provide valuable clues to improvement and training opportunities for incident managers.

Incident managers in IT organizations of different sizes

In small IT organizations, incident managers are likely to have a more generalized service desk role, addressing a wide variety of technical issues and user requests (service requests). There will typically only be a few people assigned to this activity, with the express goal of avoiding the need to transfer work to others in the IT organization (incident managers are there so others in IT don’t have to be disrupted as often). In smaller organizations, processes, systems, and metrics may be less formal and incident managers will have broad discretion to “do what needs to be done” to resolve the incident.

As IT organizations expand, incident-management functions become more structured and formal. Standard processes, such as ITIL, are adopted, ITSM platforms are implemented and SLAs/metrics become more formalized. In these organizations, incident managers find their roles more constrained and focused, but also typically have greater access to support and knowledge resources from throughout the organization to aid in the incident-management effort.

Many IT organizations now leverage external third-party support vendors as part of their incident-management process. This can include the suppliers of technology components, outsourced helpdesk functions, and managed service providers. In IT organizations that include a supplier ecosystem, incident managers must often engage with third-party resources to resolve incidents. How well the incident manager does this will impact both the quality of incident resolution and the ability to achieve stated resolution SLAs.

Larger IT organizations will frequently segment their incident management teams into specialties (networking, data center, desktop support, etc.) and implement business rules and support workflows within their ITSM systems to help direct incidents to the proper teams for resolution. In these environments, it is important that incident managers know the scope of their responsibilities and how to engage with resources on other teams, either to hand off an incident or to collaborate on resolution.
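The kind of business rule that directs incidents to specialty teams might look like the sketch below. The keyword lists, team identifiers and the `service-desk` fallback are all hypothetical; ITSM platforms typically provide their own rule engines, and this only illustrates the idea of ordered rules with a default queue.

```python
import re

# Hypothetical routing rules: (team, keywords). The first matching rule wins.
ROUTING_RULES = (
    ("networking", {"vpn", "wifi", "dns", "firewall"}),
    ("data-center", {"server", "storage", "rack"}),
    ("desktop-support", {"laptop", "printer", "monitor"}),
)

# Assumed fallback queue for incidents that no rule matches.
DEFAULT_TEAM = "service-desk"


def route_incident(description: str) -> str:
    """Pick a team by matching whole words in the ticket description."""
    words = set(re.findall(r"[a-z0-9]+", description.lower()))
    for team, keywords in ROUTING_RULES:
        if words & keywords:
            return team
    return DEFAULT_TEAM
```

Matching on whole words rather than substrings avoids accidental hits inside longer words. Unmatched incidents fall back to a general queue so that nothing is left unassigned.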

Many global companies have implemented “follow-the-sun” incident-management processes, with multiple incident management teams working in shifts, often in different geographic locations. This enables the IT department to provide continuous, 24-hour incident-management coverage. For incident managers, this creates both an added complexity to their work and an opportunity. In follow-the-sun operations, at the end of a shift, open incidents are either retained and put on hold until the next day or transferred to the incoming shift of incident managers. Incidents that are transferred require a greater level of documentation rigor as well as a structured hand-off process to ensure all important information and activities are transferred effectively.
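The documentation rigor needed for a shift hand-off can be enforced with a simple completeness check before an incident is transferred. The required fields below are hypothetical examples, not a prescribed template.

```python
# Hypothetical fields that a hand-off note must contain before transfer.
REQUIRED_HANDOFF_FIELDS = (
    "summary",
    "current_status",
    "actions_taken",
    "next_steps",
    "user_contact",
)


def missing_handoff_fields(note: dict) -> list:
    """Return the required fields that are absent or blank in a hand-off note."""
    return [
        field
        for field in REQUIRED_HANDOFF_FIELDS
        if not str(note.get(field) or "").strip()
    ]
```

An empty result means the note is complete under these assumed rules. A non-empty result lists what the outgoing incident manager still needs to document before the incoming shift takes over.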

Incident management safety and risk considerations

Incident managers are often given broad administrative access to IT systems. This is necessary to enable diagnostics and troubleshooting, but it can also present some risks to the organization. It is important that incident managers understand that with access to information and powerful support tools comes a responsibility to use those tools safely and with an understanding of the impacts of the actions they may take.

Restarting services and systems

Incident managers will often restart services to resolve an incident without a complete awareness that other users and business activities are attempting to use the service. It is not uncommon for restarts during peak business hours to have more of a disruptive impact than the original issue the incident manager is seeking to resolve.

Destructive fixes

IT systems are complex and have many dependencies that are not well understood or documented. Changing a configuration or applying a fix to resolve one issue creates the potential for another. Incident managers must be aware of the potential for destructive fixes and ensure proper testing and roll-back plans have been prepared when making changes to production systems.

Access to sensitive data

IT incident managers often have access to production business data, which may contain company secrets, employees’ personal data or sensitive customer data. Inadvertent disclosure of sensitive data is one of incident managers’ most common safety mistakes. It typically happens when sensitive information is included in user communications or ticket notes that do not have the access controls of the source systems.
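One safeguard against this kind of disclosure is to mask obviously sensitive values before text is pasted into a ticket note. The sketch below is deliberately simplistic and rests on assumptions: the two patterns (email addresses and 13 to 16 digit numbers) are illustrative only and will miss many kinds of sensitive data, so it should not be read as a complete data-protection control.

```python
import re

# Illustrative patterns only; real redaction needs far broader coverage.
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LONG_NUMBER_PATTERN = re.compile(r"\b\d{13,16}\b")  # e.g. card-like numbers


def redact(text: str) -> str:
    """Mask email addresses and long digit runs before saving a ticket note."""
    text = EMAIL_PATTERN.sub("[REDACTED EMAIL]", text)
    return LONG_NUMBER_PATTERN.sub("[REDACTED NUMBER]", text)
```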

Destroying clues and symptoms

Effective problem management requires understanding what was happening in the IT environment when the incident was occurring. While troubleshooting and resolving incidents, incident managers frequently must take actions that destroy clues and eliminate symptoms that can help in root-cause analysis. It is important for incident managers and problem managers to collaborate closely to collect any needed data during the incident management process.

Reverting changes without understanding dependencies

Planned changes and releases to IT systems are the causes of many incidents. A common resolution of post-release incidents is to roll back or revert the changes to the previous version. Unfortunately, releases are typically tested in bundles, which masks the internal dependencies that may be affected if a single component is reverted. Before rolling back changes deployed as part of a release, the incident manager should consult with the release manager and project team, as additional testing may be required.

Bypassing change-control mechanisms

Most organizations have robust change-control mechanisms that include change review and approval. These are intended to safeguard the IT infrastructure from adverse impacts and ensure due diligence and risk mitigation. Incident managers often have the access and authority to act independently and apply changes to production systems that should be reviewed as part of the normal change-control process. Incident managers must be trained to understand when they are empowered to act and when they should seek approval before applying changes.

The value of incident managers within your IT organization

Incident managers are essential to any IT organization as the front-line of interaction between business users and IT staff. They assess the impact of incidents and evaluate the urgency and importance of the issue to the business to ensure the highest impact activities receive the most attention. Incident managers perform much of the diagnostics, data collection and troubleshooting necessary to understand what is occurring when an IT system isn’t working properly, and they take steps to remediate the issue, so business disruption can be minimized.

Incident managers are tasked with resolving operational impacts quickly to minimize disruption to user productivity and business processes. Every incident is a disruption that costs the company time, resources and capacity and may harm the company’s reputation in the marketplace. Incident managers are tasked with responding to unplanned events and issues to avoid distracting other IT resources from project work and daily tasks.

In addition to resolving near-term impacts, incident managers are responsible for capturing data to support problem-management processes, so the root cause of incidents can be identified and long-term fixes developed to prevent future incidents. Incident managers also capture valuable organizational knowledge in the form of knowledge articles and known issues that help improve support for future issues and enable solution-development teams to build better quality systems and services to fulfill business needs.
