Creating an Effective Disaster Recovery Plan

Disaster recovery (DR) is an organization’s method of regaining access to and functionality of its IT infrastructure after events such as a natural disaster, a cyber attack, or other business disruptions. A disaster recovery plan can combine a variety of DR methods, and disaster recovery is itself one aspect of business continuity.

An effective disaster recovery plan should contain a few essential elements:

  1. Disaster recovery team: This is a group of individuals focused on planning, implementing, maintaining, auditing and testing an organization’s procedures for business continuity (BC) and recovery. The plan should define each team member’s role and responsibilities. In the event of a disaster, recovery team members should know how to communicate with each other and with employees, vendors, and customers.
  2. Risk evaluation: Assess potential hazards that put your organization at risk. Depending on the type of event, strategize what measures and resources will be needed to resume business. For example, in the event of a cyber attack, what data protection measures will the recovery team have in place to respond?
  3. Business-critical asset identification: A good disaster recovery plan includes documentation of which systems, applications, data, and other resources are most critical for business continuity, as well as the necessary steps to recover data.
  4. Backups: Determine what needs to be backed up (or relocated), who should perform backups, and how backups will be implemented. Include a recovery point objective (RPO), the maximum amount of data loss the organization can tolerate, expressed as time, which in turn sets the minimum backup frequency, and a recovery time objective (RTO) that defines the maximum amount of downtime allowable after a disaster. These metrics set limits that guide the choice of IT strategy, processes and procedures that make up an organization’s disaster recovery plan. The amount of downtime an organization can handle and how frequently the organization backs up its data will inform the disaster recovery strategy.
  5. Failover and failback: Failover is the process of transferring mission-critical workloads from the primary production center to an off-site location and recovering the system there. The main goal of failover is to mitigate the negative impact of a disaster or service disruption on business services and customers. Once you have recovered your primary site after a disaster and resolved any associated issues, you can transfer business operations back to the source. This is failback. Failback recovers the original workload on the source host (or at a new location of your choice) and returns workloads from the replica to the original.
  6. DR test: Disaster recovery testing is the process of ensuring that an organization can restore data and applications and continue operations after an interruption of its services, a critical IT failure or a complete disruption. Document this process and review it periodically with your clients, so that you know how to recover a client’s services in the event of any failure.
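The relationship between backup frequency, RPO and RTO in the Backups item can be illustrated with a small check. This is a minimal sketch, not a prescribed tool: it assumes the RPO is read as the maximum tolerable data loss expressed as time, that the worst-case loss equals the interval between backups, and that a restore-time estimate is available. The class name, function names and all numbers are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryObjectives:
    """Hypothetical container for the two targets named in the Backups item."""
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable downtime


def worst_case_data_loss(backup_interval: timedelta) -> timedelta:
    # Assumption: if a disaster strikes just before the next backup runs,
    # everything written since the previous backup is lost.
    return backup_interval


def meets_objectives(objectives: RecoveryObjectives,
                     backup_interval: timedelta,
                     estimated_restore_time: timedelta) -> tuple:
    """Return (rpo_ok, rto_ok) for a proposed backup schedule and restore estimate."""
    rpo_ok = worst_case_data_loss(backup_interval) <= objectives.rpo
    rto_ok = estimated_restore_time <= objectives.rto
    return rpo_ok, rto_ok


# Illustrative numbers only: a 4-hour RPO and a 2-hour RTO.
objectives = RecoveryObjectives(rpo=timedelta(hours=4), rto=timedelta(hours=2))
nightly = meets_objectives(objectives, timedelta(hours=24), timedelta(hours=1))
hourly = meets_objectives(objectives, timedelta(hours=1), timedelta(hours=1))
print(nightly)  # (False, True): nightly backups can lose up to 24 hours of data
print(hourly)   # (True, True)
```

With these example targets, a nightly backup fails the RPO even though the restore itself is fast enough, which is why the two objectives have to be evaluated separately.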

Below is a simple diagram of a formal DR process:

Recovery Plan Phases

The activities necessary to recover from a disruption are divided into four phases, which follow each other sequentially in time.

  • Disruption Occurrence: This phase begins with the occurrence of the disruption event and continues until a decision is made to activate the recovery plans.  The major activities that take place in this phase include emergency response measures, notification of management, and disruption assessment activities.
  • Plan Activation: In this phase, the Business Continuity Plans are put into effect. This phase continues until critical business functions are re-established and application access has been restored. The major activities in this phase include notification and assembly of the recovery team, implementation of interim procedures, and steps needed to fully recover application functions.
  • Restore and Recovery: Apply the organization’s chosen DR technique, such as restoring from backups, failing over to a cold or hot site, or using a DRaaS solution.
  • Postmortem/Lessons learned: To give the team information that can increase effectiveness and efficiency, and to build on the experience gained from the disruption, a lessons-learned review will be conducted and all relevant information documented for future occurrences.
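Because the four phases follow each other strictly in order, they can be modelled as a simple state machine. The sketch below is one possible illustration, assuming a tracker that only ever moves to the next phase; the `Phase` and `RecoveryTracker` names are hypothetical and not part of any existing tool.

```python
from enum import Enum


class Phase(Enum):
    DISRUPTION_OCCURRENCE = 1
    PLAN_ACTIVATION = 2
    RESTORE_AND_RECOVERY = 3
    POSTMORTEM = 4


class RecoveryTracker:
    """Hypothetical helper that only allows moving to the next phase in order."""

    def __init__(self):
        self.phase = Phase.DISRUPTION_OCCURRENCE
        self.history = [self.phase]

    def advance(self) -> Phase:
        if self.phase is Phase.POSTMORTEM:
            raise ValueError("recovery has already reached the final phase")
        # Phases are numbered 1..4, so the next phase is the next value.
        self.phase = Phase(self.phase.value + 1)
        self.history.append(self.phase)
        return self.phase


tracker = RecoveryTracker()
tracker.advance()  # decision made to activate the recovery plans
tracker.advance()  # recovery team assembled, restore work begins
tracker.advance()  # services restored, postmortem starts
print([p.name for p in tracker.history])
```

Keeping the history list makes it easy to record when each phase was entered, which is useful input for the postmortem.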

Recovery Tasks and Procedures

First of all, we have to formally initiate the process by creating a major incident ticket, or follow an incident management process that has already been initiated. Then it is time to activate the DR process:

  • Start DR timer: The disaster recovery timer is started based on the customer SLA.
  • Personnel Notification: This procedure identifies who is to be notified in the event of a disruption and specifies how team members are notified when the plan is put into effect.
  • Recovery Activities and Tasks: After a disruption occurs, quickly assess the situation to determine whether there is an immediate need to initiate recovery steps. Once a significant disruption has been identified, proceed with the steps as outlined in this document.
  • Notification: The first step in the recovery process is to notify the Cloud Operations Centre team and ensure that the team leader has been made aware of the disruption.
NOTE: If the outage has lasted more than 1 hour and there is no expected recovery time, Operations management must be informed if they have not been already. If there is an expected recovery time, the application support process is to be followed and the situation will not be treated as a disaster recovery.
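The rule in the NOTE above can be written down as a small decision function. This is only a sketch: the 1-hour threshold and the two documented outcomes come from the NOTE, while the function name, the returned strings, and the two cases the NOTE does not cover (marked in the comments) are assumptions made for illustration.

```python
from datetime import timedelta
from typing import Optional

ESCALATION_THRESHOLD = timedelta(hours=1)  # threshold stated in the NOTE


def next_action(outage_duration: timedelta,
                expected_recovery: Optional[timedelta],
                management_informed: bool = False) -> str:
    """Hypothetical helper returning the next step for an ongoing outage."""
    if expected_recovery is not None:
        # A recovery time is expected, so this is not treated as a disaster recovery.
        return "follow application support process"
    if outage_duration > ESCALATION_THRESHOLD:
        if management_informed:
            # Assumption: once management knows, the DR process simply continues.
            return "continue disaster recovery"
        return "inform Operations management"
    # Assumption: the NOTE does not say what to do under 1 hour with no estimate.
    return "continue assessment"


print(next_action(timedelta(minutes=90), None))
print(next_action(timedelta(minutes=90), timedelta(minutes=30)))
```

Encoding the rule this way also exposes the gaps in it, such as what should happen during the first hour, which are worth settling in the plan itself.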

Starting formal disaster recovery means, first of all, assessing the problem:

  • Assessment: The next step in the process is to assess the situation in an effort to figure out the extent of the disruption. A pre-defined procedure is outlined below to determine if a disruption event has occurred and what steps to take if an event is discovered.
  • Choose recovery strategy: Determine which workloads are affected and which recovery strategy best matches the situation.
  • Resolution: Check whether the issue has been resolved and notify the product stakeholders.

Postmortem analysis

  • Avoid Blame and Keep It Constructive: Blameless postmortems can be challenging to write, because the postmortem format clearly identifies the actions that led to the incident. Removing blame from a postmortem gives people the confidence to escalate issues without fear. It is also important not to stigmatize frequent production of postmortems by a person or team. An atmosphere of blame risks creating a culture in which incidents and issues are swept under the rug, leading to greater risk for the organization.
  • Share Knowledge: In practice, teams share the first postmortem draft internally and ask a group of senior engineers to assess the draft for completeness. Review criteria might include:
– Was key incident data collected for posterity?
– Are the impact assessments complete?
– Was the root cause analysis sufficiently deep?
– Is the action plan appropriate and are resulting bug fixes at appropriate priority?
– Did we share the outcome with relevant stakeholders?
  • No Postmortem Left Unreviewed: An unreviewed postmortem might as well never have existed. To ensure that each completed draft is reviewed, we encourage regular review sessions for postmortems. In these meetings, it is important to close out any ongoing discussions and comments, to capture ideas, and to finalize the state. Once those involved are satisfied with the document and its action items, the postmortem is added to a team or organization repository of past incidents. Transparent sharing makes it easier for others to find and learn from the postmortem.
  • Visibly Reward People for Doing the Right Thing
  • Ask for Feedback on Postmortem Effectiveness: Regularly survey teams on how the postmortem process is supporting their goals and how the process might be improved.
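The review criteria listed under Share Knowledge can be kept as a simple checklist so that no draft is closed with open items. The sketch below is purely illustrative: the criteria wording is paraphrased from the list above, and the `unmet_criteria` function and the sample answers are hypothetical.

```python
REVIEW_CRITERIA = [
    "Key incident data collected",
    "Impact assessments complete",
    "Root cause analysed in sufficient depth",
    "Action plan appropriate and fixes prioritised",
    "Outcome shared with relevant stakeholders",
]


def unmet_criteria(answers: dict) -> list:
    """Return the criteria that reviewers have not yet confirmed.

    A criterion missing from `answers` counts as unmet.
    """
    return [c for c in REVIEW_CRITERIA if not answers.get(c, False)]


# Illustrative review of a first draft.
draft_review = {
    "Key incident data collected": True,
    "Impact assessments complete": True,
    "Root cause analysed in sufficient depth": False,
}
open_items = unmet_criteria(draft_review)
print(open_items)
```

A postmortem would be added to the repository of past incidents only once this list comes back empty.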

For more information please follow our #CyberTechTalk WIKI pages.

Be ethical, save your privacy!
