Incident management safety and risk considerations
Incident managers are often given broad administrative access to IT systems. This access is necessary for diagnostics and troubleshooting, but it also exposes the organization to risk. Incident managers must understand that access to information and powerful support tools carries a responsibility to use those tools safely and with an understanding of the impact of their actions.
Restarting services and systems
Incident managers often restart services to resolve an incident without full awareness of the other users and business activities that depend on the service. It is not uncommon for a restart during peak business hours to be more disruptive than the original issue the incident manager is trying to resolve.
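A pre-restart check can make that awareness explicit. The following is a minimal sketch, not a definitive implementation: the function name restart_risk, the weekday 08:00 to 18:00 peak window, and the session threshold are all assumptions chosen for illustration, and the count of active sessions is assumed to come from whatever monitoring the organization already has.

```python
from datetime import datetime, time
from typing import List


def restart_risk(
    now: datetime,
    active_sessions: int,
    peak_start: time = time(8, 0),    # assumed start of peak business hours
    peak_end: time = time(18, 0),     # assumed end of peak business hours
    session_threshold: int = 0,       # assumed tolerance for dropped sessions
) -> List[str]:
    """Return the reasons a restart right now could disrupt the business.

    An empty list means none of the checks in this sketch raised a concern.
    """
    reasons = []
    # Monday is weekday 0, so values below 5 are working days.
    if now.weekday() < 5 and peak_start <= now.time() < peak_end:
        reasons.append("inside peak business hours")
    if active_sessions > session_threshold:
        reasons.append(f"{active_sessions} active sessions would be dropped")
    return reasons


# Usage: review the reasons before deciding whether to restart now or wait.
concerns = restart_risk(datetime(2024, 1, 10, 10, 0), active_sessions=12)
```

A non-empty result does not forbid the restart; it gives the incident manager a concrete list to weigh against the impact of the original incident.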
Destructive fixes
IT systems are complex and have many dependencies that are not well understood or documented. Changing a configuration or applying a fix to resolve one issue can create another. Incident managers must be aware of the potential for destructive fixes and ensure that proper testing and rollback plans are in place before making changes to production systems.
Access to sensitive data
IT incident managers often have access to production business data, which may contain company secrets, employees’ personal data or sensitive customer data. Inadvertent disclosure of sensitive data is one of the most common safety mistakes incident managers make: specifically, including sensitive information in user communications or ticket notes that do not have the access controls of the source systems.
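One way to reduce this risk is to redact obvious sensitive values before text is pasted into a ticket or a user-facing message. The sketch below is illustrative only: the function name redact_for_ticket and the two patterns (email addresses and card-like digit runs) are assumptions, and a real deployment would follow the organization's own data-classification rules rather than these simplified expressions.

```python
import re

# Hypothetical patterns for illustration; they are deliberately simple and
# will not catch every form of sensitive data.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    # 13 to 16 digits, optionally separated by single spaces or hyphens.
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),
]


def redact_for_ticket(text: str) -> str:
    """Replace each pattern match with a labeled placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[REDACTED {label}]", text)
    return text


# Usage: run log excerpts through the filter before adding them to a ticket.
note = redact_for_ticket(
    "User jane.doe@example.com paid with 4111 1111 1111 1111 today"
)
```

Pattern-based redaction is a safety net, not a substitute for judgment; the incident manager should still review what is being shared.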
Destroying clues and symptoms
Effective problem management requires understanding what was happening in the IT environment while the incident was occurring. While troubleshooting and resolving incidents, incident managers must frequently take actions that destroy clues and eliminate symptoms that could help in root-cause analysis. It is important for incident managers and problem managers to collaborate closely so that any needed data is collected during the incident management process, before it is lost.
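A simple way to preserve evidence is to capture diagnostic output before the remediation step runs. The following is a minimal sketch under stated assumptions: capture_then_remediate, the collectors mapping, and the directory layout are hypothetical names introduced for illustration, and each collector is assumed to be a function that returns diagnostic text (for example, a process list or recent log lines).

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Dict


def capture_then_remediate(
    collectors: Dict[str, Callable[[], str]],
    remediate: Callable[[], None],
    evidence_root: Path,
) -> Path:
    """Save the output of every collector, then run the remediation.

    Returns the directory that holds the captured evidence.
    """
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    evidence_dir = evidence_root / f"incident-{stamp}"
    evidence_dir.mkdir(parents=True, exist_ok=True)

    for name, collect in collectors.items():
        try:
            output = collect()
        except Exception as exc:
            # A failed collector is recorded rather than blocking the fix.
            output = f"collector failed: {exc!r}"
        (evidence_dir / f"{name}.txt").write_text(output, encoding="utf-8")

    # Evidence is on disk before anything that might destroy it runs.
    remediate()
    return evidence_dir
```

The ordering is the point of the design: the remediation is only called after every collector has been attempted, so a restart or configuration change cannot erase symptoms that were never recorded.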
Reverting changes without understanding dependencies
Planned changes and releases to IT systems are the cause of many incidents. A common resolution for post-release incidents is to roll back or revert the changes to the previous version. Unfortunately, releases are typically tested as bundles, which masks the internal dependencies that may be affected if a single component is reverted. Before rolling back changes deployed as part of a release, the incident manager should consult with the release manager and project team, as additional testing may be required.
Bypassing change-control mechanisms
Most organizations have robust change-control mechanisms that include change review and approval. These are intended to safeguard the IT infrastructure from adverse impacts and ensure due diligence and risk mitigation. Incident managers often have the access and authority to act independently and apply changes to production systems that should be reviewed as part of the normal change-control process. Incident managers must be trained to understand when they are empowered to act and when they should seek approval before applying changes.
The value of incident managers within your IT organization
Incident managers are essential to any IT organization as the front line of interaction between business users and IT staff. They assess the impact of incidents and evaluate the urgency and importance of each issue to the business to ensure the highest-impact issues receive the most attention. Incident managers perform much of the diagnostics, data collection and troubleshooting necessary to understand what is occurring when an IT system isn’t working properly, and they take steps to remediate the issue so that business disruption is minimized.
Incident managers are tasked with resolving operational impacts quickly to minimize disruption to user productivity and business processes. Every incident is a disruption that costs the company time, resources and capacity and may harm the company’s reputation in the marketplace. By responding to unplanned events and issues themselves, incident managers also keep other IT staff from being pulled away from project work and daily tasks.
In addition to resolving near-term impacts, incident managers are responsible for capturing data to support problem-management processes, so that the root cause of incidents can be identified and long-term fixes developed to prevent future incidents. Incident managers also capture valuable organizational knowledge in the form of knowledge articles and known issues, which help improve support for future issues and enable solution-development teams to build better-quality systems and services to fulfill business needs.