What is incident management?

Incident management is the process used by development and IT Operations teams to respond to an unplanned event or service interruption and restore the service to its operational state.

At Atlassian, we define an incident as an event that causes disruption to or a reduction in the quality of a service which requires an emergency response. Teams who follow ITIL or ITSM practices may use the term major incident for this instead.

Incident Management Handbook

Get our Incident Management Handbook

Download the PDF to learn incident management principles and practices, and how to apply these lessons using Jira Service Management.

Incidents are events of any kind that disrupt or reduce the quality of service (or threaten to do so). A business application going down is an incident. A crawling-but-not-yet-dead web server can be an incident, too. It’s running slowly and interfering with productivity. Worse yet, it poses the even-greater risk of complete failure. Incidents can vary widely in severity, ranging from an entire global web service crashing to a small number of users having intermittent errors.

An incident is resolved when the affected service resumes functioning in its intended state. This includes only those tasks required to mitigate impact and restore functionality.

The importance of incident management

Incident management values

Atlassian’s incident management values

Incident management is one of the most critical processes an organization needs to get right. Service outages can be costly to the business and teams need an efficient way to respond to and resolve these issues quickly. Teams need a reliable method to prioritize incidents, get to resolution faster, and offer better service for users.

When teams are facing an incident they need a plan that helps them:

Want to see how Atlassian handles major incidents? We’ve published our internal incident management handbook. Anyone is welcome to learn from it, adapt it, and use it however they see fit.

Types of incident management processes

Different types of companies tend to gravitate toward different types of incident management processes. No single process is best for all companies, so you’re likely to see various approaches across different companies.

Many teams rely on a more traditional IT-style incident management process, such as those outlined in ITIL certifications. Other teams lean toward a more Site Reliability Engineer- (SRE) or DevOps-style incident management process.

IT incident management process

An incident management process helps IT teams investigate, record, and resolve service interruptions or outages. The ITIL incident management workflow aims to reduce downtime and minimize impact on employee productivity from incidents. Using templates designed to manage incidents, you can create a repeatable incident management workflow, which ensures teams log, diagnose, and resolve incidents—and have a record of their activities.

The ITIL framework is chiefly used by IT teams running services inside businesses. Typically teams take what they need from ITIL—which covers almost every type of incident and issue and process IT teams might face—and leave the rest. ITIL is great when teams need to focus on cultivating a culture of active troubleshooting. The prescribed processes help teams track incidents and actions in a consistent manner, which improves reporting and analysis, and can lead to a healthier service and a more successful team.

Steps in the IT incident management process

Identify an incident and log it

An incident can come from anywhere: an employee, a customer, a vendor, monitoring systems. No matter the source, the first two steps are simple: someone identifies an incident, then someone logs it. These incident logs (i.e., tickets) typically include:

Categorize

Assign a logical, intuitive category (and subcategory, as needed) to every incident. This helps you analyze your data for trends and patterns, which is a critical part of effective problem management and preventing future incidents.

Prioritize

Every incident must be prioritized. Start by assessing its impact on the business, the number of people who will be impacted, any applicable SLAs, as well as the potential financial, security, and compliance implications of the incident. Compare this incident to all other open incidents to determine its relative priority. As a best practice, define your severity and priority levels before an incident happens, making it simpler for incident managers to gauge priority quickly.

Respond

DevOps and SRE incident management process

With a DevOps or SRE approach to incident management, the team that builds the service also runs it—and fixes it if it breaks. This approach has exploded in popularity alongside the growth of always-on cloud services, globally-accessed web applications, microservices, and software as a service.

Increasingly the software you rely on for life and work is not being hosted on a server in the same physical location as you. It’s likely a web-accessed application deployed in a data center for thousands or millions of users around the globe. For teams tasked with running these services, agility and speed are paramount. Any downtime has the potential to affect thousands of organizations, not just one.

An advantage of the “you build it, you run it” approach is that it offers the flexibility agile teams need, but it can also obscure who is responsible for what and when. DevOps teams can be comfortable—and successful—with less structured development processes. But it’s best to standardize on a core set of processes for incident management so there is no question how to respond in the heat of an incident, and so you can track issues and report how they’re resolved.

Three beliefs of DevOps incident management teams

This approach assures fast response times and faster feedback to the teams who need to know how to build a reliable service.

We outline a very DevOps-friendly approach to incident management in our Atlassian Incident Handbook.

Incident management tools

Incident management isn’t done just with a tool, but the right blend of tools, practices, and people. Here are several of the most common tool categories for effective incident management:

Incident management topics

The Atlassian Incident Management Handbook

This handbook features the real incident management processes we've created as a global company with thousands of employees and over 200,000 customers.

Incident communication best practices

Incident communication is the process of alerting users that a service is experiencing some type of outage or degraded performance.

Incident response

Discover strategies and best practices for swift and effective incident response and how to safeguard your business against cyber threats.

On call

On call teams are rapidly evolving. Explore the pros and cons of different approaches to on call management.

Tools

Explore the key features of incident management software. Learn how to choose the right tools for effective incident response and seamless operations.

Postmortem

An incident postmortem, also known as a post-incident review, is the best way to work through what happened during an incident and capture lessons learned.

DevOps

For teams practicing DevOps, the Incident Management (IM) process focuses on transparency and continuous improvements to the incident lifecycle.

Featured tutorials

Incident communication

In this tutorial, we’ll show you how to use incident templates to communicate effectively during outages. Adaptable to many types of service interruption.

On call schedule

In this tutorial, you’ll learn how to set up an on-call schedule, apply override rules, configure on-call notifications, and more, all within Opsgenie.

Want to learn about incident management in Jira Service Management?