SOC Incident Management

Incident management is the process responsible for managing the lifecycle of all incidents to ensure that normal service operation is restored as quickly as possible, and that business impact is minimized.

Incident (ITIL definition)

An unplanned interruption to an IT service or a reduction in the quality of an IT service. Failure of a configuration item that has not yet impacted one or more services is also an incident. (For example, the failure of one disk from a mirror set.)

Sometimes it is very challenging to assess whether a fired incident is a security incident or an infrastructure issue. In this post, I would like to help SOC teams process incidents correctly by providing an appropriate analytic approach with classifications.

A simple incident processing flow is shown below:

Incident Management
NOTE: For the DR phase, please refer to our Disaster Recovery Planning post.

Detect Incident

Identification is the part of the process where we figure out that something’s wrong or isn’t performing as it should be.

As we are aware, there are several different ways an incident can be detected. The incident could be identified by the customer and then reported to support or through monitoring, such as critical or warning alerts.

In the case of an end user or customer identifying an incident, the priority is for the support team to get a ticket raised and prioritized, with the incident corrected as quickly as possible. In some instances, the client support team will require assistance from the Operations team, and an ITSM ticket will need to be logged.

When an incident is identified by a SOC team member, the onus is on that person to ensure that an incident ticket is raised in ITSM, and to work with other SOC team members, or other support teams, to manage it through to resolution.

Log Incident

This is the part of the process where the incident is captured in an ITSM ticket. Our goal is to have automation in place so that all incidents are automatically logged in ITSM, whether they originate from a client support ticket, an infrastructure alert, etc. However, there may be instances when members of the SOC team log a ticket manually.
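As a minimal sketch of what "captured in an ITSM ticket" can mean in practice, the snippet below normalizes an incident from any source into one record shape. The field names, the `Source` values and the `INC-` ID format are assumptions made for illustration; a real ITSM tool defines its own schema and ID generator.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from itertools import count


class Source(Enum):
    """Where the incident came from (hypothetical values)."""
    CLIENT_SUPPORT = "client_support"
    MONITORING_ALERT = "monitoring_alert"
    MANUAL = "manual"


# Stand-in for the ITSM tool's own ID generator (assumption).
_ticket_ids = count(1)


@dataclass
class IncidentTicket:
    summary: str
    source: Source
    reported_by: str
    ticket_id: str = field(default_factory=lambda: f"INC-{next(_ticket_ids):05d}")
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    status: str = "open"


def log_incident(summary: str, source: Source, reported_by: str) -> IncidentTicket:
    """Create a ticket record, rejecting incidents with no description."""
    cleaned = summary.strip()
    if not cleaned:
        raise ValueError("an incident needs a non-empty summary")
    return IncidentTicket(summary=cleaned, source=source, reported_by=reported_by)


ticket = log_incident("Disk latency on db-01", Source.MONITORING_ALERT, "monitoring")
print(ticket.ticket_id, ticket.status, ticket.source.value)
```

Using the same function for automated and manual logging keeps every ticket in one shape, whichever path created it.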

Establish scope of responsibility

At this stage, the SOC team must identify whether the incident falls into its scope of responsibility, or whether a different support area should be notified. As an example, does the incident concern a client’s on-prem services or SaaS? If the incident is not within the SOC scope of responsibility, then the correct support area must be contacted, and the ITSM ticket assigned to the correct ITSM support queue. If the correct support area is not known, then the ITSM ticket should be assigned to the Global Service Desk to be triaged and assigned to the correct area.
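The routing rule described above can be sketched as a small lookup with a fallback. Which service types belong to the SOC and the queue name for on-prem services are assumptions made for illustration; only the Global Service Desk fallback comes from the process itself.

```python
# Hypothetical: service types the SOC is responsible for.
SOC_SCOPE = {"saas"}

# Hypothetical: known support queues for out-of-scope service types.
SUPPORT_QUEUES = {"on_prem": "Client Infrastructure Support"}

# From the process: unknown ownership goes to the Global Service Desk for triage.
FALLBACK_QUEUE = "Global Service Desk"


def route_ticket(service_type: str) -> str:
    """Return the queue that should own a ticket for the given service type."""
    if service_type in SOC_SCOPE:
        return "SOC"
    return SUPPORT_QUEUES.get(service_type, FALLBACK_QUEUE)


for service in ("saas", "on_prem", "printer_fleet"):
    print(service, "->", route_ticket(service))
```

Keeping the fallback explicit means a ticket with an unknown owner is always assigned somewhere.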

Incident classification

Once the ‘SOC scope of responsibility’ has been established, the SOC team must perform an initial investigation to determine the categorization/classification of the incident.

In the incident classification phase, the SOC team categorizes the incident in terms of:

  • What type of incident is this? Is it a normal incident, security incident or a major incident?

The criteria used to establish this are as follows:

  • The Impact, as in who and what is affected

ITILv3 defines impact as “a measure of the effect of an incident, problem, or change on business processes.”

  • Number of customers/users affected?
  • Amount of lost revenue or incurred costs?
  • Number of IT systems/services/elements involved?
  • Etc…

Based on the answers to the above and similar questions, the SOC team can decide whether the impact of the incident is low (minor), medium (significant) or high (critical).

  • Urgency, or the speed required for resolution

Urgency is not about effect as much as it is about time. A function of time, urgency depends on the speed at which the business or the customer would expect or want something, such as restoring service to normal operation.

The longer the customer is willing to wait or can afford to be delayed, the lower the urgency. Anything that significantly affects the business from an operational, compliance, or financial perspective is generally more pressing than impacts on other perspectives. For example, an outage to a cloud service covering a whole region would require shorter response and resolution times because it is a more urgent issue. So, understanding the SLAs we have in place with our customers allows us to establish whether an incident is normal or major/critical.

  • Priority, regarding business and customer perspectives

Priority is the intersection of impact and urgency. Considering impact and urgency together gives us a clearer understanding of what is more important, whether we are dealing with a change, a request or an incident.

Successful incident management relies on having a clear understanding of what the customer agreed to or is willing to tolerate regarding the duration and handling of any particular incident. This is usually defined in service level agreements (SLAs) or contracts, which include timelines for responding and resolving incidents based on some criteria, usually priority, as a function of impact and urgency.
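To make "priority as a function of impact and urgency" concrete, here is a minimal sketch of a 3x3 priority matrix. The P1 to P5 scale and the additive rule are a commonly used layout, assumed here for illustration; the actual matrix and the response times attached to each priority should come from your SLAs.

```python
# Lower number means more severe; the levels match the low/medium/high scale above.
LEVELS = {"high": 1, "medium": 2, "low": 3}


def priority(impact: str, urgency: str) -> str:
    """Combine impact and urgency into a priority label from P1 to P5.

    Assumed rule: rank(impact) + rank(urgency) - 1, so high/high gives P1
    and low/low gives P5.
    """
    try:
        score = LEVELS[impact.lower()] + LEVELS[urgency.lower()] - 1
    except KeyError as exc:
        raise ValueError(f"unknown level: {exc.args[0]!r}") from None
    return f"P{score}"


print(priority("high", "high"))      # P1
print(priority("high", "medium"))    # P2
print(priority("medium", "medium"))  # P3
print(priority("low", "low"))        # P5
```

An additive rule treats impact and urgency symmetrically. If your SLAs weight one more heavily than the other, an explicit lookup table per cell is easier to adjust.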

Normal incident

A minor incident with low impact.

For example:

  • A system bug is creating a minor inconvenience to customers, but does not impact overall system function
  • A minor inconvenience to customers, workaround available
  • Usable performance degradation
  • An incident with the potential to become a major incident if not quickly addressed, for example, a partial loss of functionality for a small sub-set of customers.
  • A minor incident that impacts product usability but doesn’t bring it to a halt, for example, slower-than-average load times.

Major security incident

A security incident is an event that may indicate an attack on an organization’s system or network. It can also signal that security measures in place failed to protect one’s computer from an attack. Most security incidents involve unauthorized system access that may disrupt a target’s normal operations, violate policies, and expose sensitive data.

A major incident disrupts a business. It also requires a response that goes beyond a company’s traditional incident management cycle. Additionally, a major incident is urgent, and it requires an incident management team to act quickly to resolve the issue, because the longer it takes the team to address a major incident, the more costly the incident is likely to become for the business.

For example:

  • Confidentiality or privacy is breached
  • A customer-facing service is unavailable for a subset of customers
  • Core functionality is significantly impacted

Critical security incident

Not all incidents are created equal. Losing data from one database is not the same as losing data from all of your databases. Dealing with an outage that impacts 20% of your users is a whole different ballpark than dealing with an outage that impacts 90 or 100%. Handling a system outage during peak hours is a lot more stressful than handling one when most of your customers are asleep. Even two incidents that look identical on paper are unique under the surface.

This is an incident with significant/critical impact.

For example:

  • A customer-facing service is down for a sub-set of customers, or the customer-facing service is down for all customers.
  • Core functionality is significantly impacted
  • Confidentiality or privacy is breached
  • Customer data loss

Diagnose incident

Once the incident classification has been performed and impact/severity established, the SOC team must perform investigation and diagnosis tasks to determine what has gone wrong, and to determine the fastest way to recover normal service.

All actions performed by the SOC team members must be documented in the ITSM ticket, so that a complete historical record of all activities is always maintained.
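A minimal sketch of such a historical record is an append-only list of timestamped work notes that can be read back in chronological order. The class and field names are hypothetical; in practice the ITSM tool's work-notes or activity-log feature plays this role.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass(frozen=True)
class WorkNote:
    """One documented action; frozen so an entry cannot be edited afterwards."""
    at: datetime
    author: str
    action: str


@dataclass
class TicketHistory:
    ticket_id: str
    _notes: list = field(default_factory=list)

    def record(self, author: str, action: str, at: Optional[datetime] = None) -> WorkNote:
        """Append a note, timestamped now (UTC) unless a time is supplied."""
        note = WorkNote(at or datetime.now(timezone.utc), author, action)
        self._notes.append(note)
        return note

    def timeline(self) -> tuple:
        """Return all notes ordered by the time the action happened."""
        return tuple(sorted(self._notes, key=lambda n: n.at))


history = TicketHistory("INC-00001")
history.record("analyst_a", "Restarted ingestion service",
               at=datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc))
history.record("analyst_b", "Confirmed alert is not a false positive",
               at=datetime(2024, 1, 1, 10, 5, tzinfo=timezone.utc))
for note in history.timeline():
    print(note.at.isoformat(), note.author, note.action)
```

Sorting by when the action happened, rather than by when it was typed in, gives the chronological order of events that the diagnosis step relies on.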

Incident Investigation and Diagnosis includes the following actions:

  • Establishing the exact cause of the incident
  • Understanding the chronological order of events
  • Confirming the full impact of the incident, including the number and range of users affected
  • Identifying any events that could have triggered the incident (for example, a recent change or user action)
  • Searching known errors or the knowledgebase for a workaround or resolution
  • Discovering any previous occurrences, including previously logged incidents or problems, known errors, the knowledgebase, and the error logs and knowledgebases of associated manufacturers and suppliers
  • Identifying and registering a possible resolution for the incident

Resolution and recovery

At this stage, the SOC team identifies and evaluates possible resolutions before applying one. Finding a resolution means that a way of resolving the problem has been identified. An action is then taken to rectify the root cause of the incident or to implement a workaround. The SOC team must confirm that the affected service has been restored to the required standard.

Documentation and incident closure

Once the incident is resolved, formal closure of the incident record takes place. This process includes comprehensive documentation of the incident lifecycle in the ITSM ticket, communication with the customer/business/users, and confirmation from them that the service experience has returned to normal.
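The closure conditions above can be sketched as a simple pre-closure check. The ticket is modelled as a plain dictionary and the key names are hypothetical; the three checks mirror the requirements in this post (service restored, lifecycle documented, customer confirmation).

```python
def closure_blockers(ticket: dict) -> list:
    """Return the reasons a ticket cannot be closed yet (empty list means ready)."""
    blockers = []
    if not ticket.get("service_restored"):
        blockers.append("service not confirmed as restored")
    if not (ticket.get("resolution_notes") or "").strip():
        blockers.append("resolution not documented")
    if not ticket.get("customer_confirmed"):
        blockers.append("no confirmation from customer/business/users")
    return blockers


ticket = {
    "service_restored": True,
    "resolution_notes": "Rolled back the faulty configuration change.",
    "customer_confirmed": False,
}
print(closure_blockers(ticket))
```

Returning the list of blockers, rather than a bare yes or no, tells the analyst exactly what is still missing.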

Incident review/post-mortem

Once the incident has been closed, it is vital that the SOC team perform an incident review, which is also known as a post-mortem. The post-mortem should document detailed information regarding every aspect of the incident: from the root cause to the successful resolution, and all the lessons learnt along the way.

The benefit of a post-mortem review to the SOC is not only in reviewing how the incident resolution was handled; it can also reveal unknown system problems and highlight areas in which Operations can improve or automate to reduce risk.

Root cause analysis and post-incident review are two key aspects of the incident review/post-mortem stage.

Root cause analysis (RCA)

Root cause analysis (RCA) is the process of discovering the root causes of problems to identify appropriate solutions. Root cause analysis assumes that it is much more effective to systematically prevent and solve underlying issues rather than just treating ad-hoc symptoms and putting out fires.

Post-incident review (PIR)

Post-incident review/analysis guides you through identifying improvements to your incident response, including time to detection and mitigation. An analysis can also help you understand the root cause of the incident. This review brings the members of the SOC team together to discuss the details of an incident: why it happened, what impact it had, what actions were taken to resolve it, and how the team can prevent it from happening again.

Examples of benefits of a post-incident review/analysis:

  • Improve incident response
  • Work through what happened
  • Understand the root cause of the problem
  • Address root causes with deliverable action items
  • Discover previously unknown system vulnerabilities
  • Mitigate the possibility of repeat incidents
  • Uncover any potential process improvements that could speed up resolution of the next major incident
  • Analyse the impact of incidents
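For the time-to-detection and time-to-mitigation figures mentioned above, a minimal sketch is shown below. It assumes three timestamps are available from the ticket history and that time to mitigate is measured from detection; some teams measure it from the start of the incident instead, so treat the definitions as an assumption.

```python
from datetime import datetime, timedelta


def response_metrics(started_at: datetime, detected_at: datetime,
                     mitigated_at: datetime) -> dict:
    """Compute detection and mitigation durations for one incident."""
    if not started_at <= detected_at <= mitigated_at:
        raise ValueError("expected started_at <= detected_at <= mitigated_at")
    time_to_detect: timedelta = detected_at - started_at
    time_to_mitigate: timedelta = mitigated_at - detected_at
    return {"time_to_detect": time_to_detect, "time_to_mitigate": time_to_mitigate}


metrics = response_metrics(
    started_at=datetime(2024, 1, 1, 9, 0),
    detected_at=datetime(2024, 1, 1, 9, 12),
    mitigated_at=datetime(2024, 1, 1, 10, 0),
)
print(metrics["time_to_detect"])    # 0:12:00
print(metrics["time_to_mitigate"])  # 0:48:00
```

Tracking these two durations across incidents gives the review a measurable way to see whether the response is improving.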

For more information please follow our #CyberTechTalk WIKI pages.

Be ethical, save your privacy!
