Recovery Point Objective & Recovery Time Objective Demystified

2014-12-23 - Backup, Disaster Recovery, General, High Availability

Designing a Disaster Recovery Strategy to Match your Business Needs

Disaster recoverability is an important part of any Business Continuity Plan. When designing a Disaster Recovery Strategy, it is imperative to meet the needs of your business. For example, the most sophisticated backup solution is worthless to you if recovering from a disaster takes longer than your Maximum Tolerable Downtime (MTD).

A short downtime is only one part of a good disaster recovery plan (DRP). You also need to consider data loss. DRPs are designed around two objectives: the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). Put simply, the RPO gives a target for maximum data loss, whereas the RTO sets a goal for maximum downtime in case of a disaster.

The following infographic gives an overview of how the two terms are related:

Recovery Point Objective and Recovery Time Objective

The Recovery Time Objective (RTO)

The recovery time objective or RTO sets a goal for the maximum time a restore operation after a disaster should take. Wikipedia defines the RTO as follows:

The recovery time objective (RTO) is the targeted duration of time [...] within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.

Important in the definition is the word "targeted". Unlike the MTD (the maximum tolerable downtime), the RTO does not represent a hard deadline after which the business cannot continue to operate. Instead, the RTO should be chosen so that even under dire circumstances all business processes can resume operations within their respective MTDs.

The Recovery Point Objective (RPO)

The recovery point objective or RPO concerns the amount of data that may be lost. In most disaster situations, data loss is not completely avoidable. Data in most businesses is in a constant state of change. The amount of data loss is therefore measured as a time span rather than in bytes. Wikipedia defines the RPO as follows:

(The) recovery point objective, or “RPO”,[...] is the maximum targeted period in which data might be lost from an IT service due to a major incident.

This definition also uses the word "targeted", implying that the RPO is not a hard boundary but rather should be set conservatively, so that the business does not fail if the objective is not met during a disaster.

Most businesses store more than one backup set in case the most recent backup set is not available during a disaster recovery effort. Your "target" might be to restore from the most recent backup, so you design your backups so that the most recent one is always within the RPO. Even so, you should make sure that your older backup sets still fall within the "tolerable" confines of your business.
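To make the difference between the "target" and the "tolerable" window concrete, here is a minimal sketch that checks a list of backup timestamps against both. The function names and the `tolerable_loss` parameter are my own illustration, not part of any backup product; it assumes you know when each backup set was taken and when the incident occurred.

```python
from datetime import datetime, timedelta


def data_loss_windows(backup_times, incident_time):
    """Data-loss window for every backup taken before the incident, newest first."""
    usable = sorted((t for t in backup_times if t <= incident_time), reverse=True)
    return [incident_time - t for t in usable]


def check_backup_sets(backup_times, incident_time, rpo, tolerable_loss):
    """Check the newest backup against the RPO and older sets against a looser limit.

    `tolerable_loss` is a hypothetical upper bound on data loss that the business
    could still survive; it is an assumption of this sketch, not a standard term.
    Returns (newest_ok, [ok for each older backup set, newest first]).
    """
    windows = data_loss_windows(backup_times, incident_time)
    if not windows:
        return False, []
    newest_ok = windows[0] <= rpo
    fallback_ok = [w <= tolerable_loss for w in windows[1:]]
    return newest_ok, fallback_ok


# Example: backups every 15 minutes, incident at 12:10.
incident = datetime(2014, 12, 23, 12, 10)
backups = [
    datetime(2014, 12, 23, 11, 30),
    datetime(2014, 12, 23, 11, 45),
    datetime(2014, 12, 23, 12, 0),
]
newest_ok, fallback_ok = check_backup_sets(
    backups, incident, rpo=timedelta(minutes=15), tolerable_loss=timedelta(minutes=30)
)
```

In this example the newest backup is 10 minutes old and meets the RPO, the previous one (25 minutes) is still tolerable, and the oldest (40 minutes) is not.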

How to Determine Appropriate Values for RPO and RTO in Your Business

There are different types of disaster. While a dropped network connection does not necessarily involve data loss, a fire in the cabinet holding your SAN array likely will. When selecting your RPO and RTO values, you have to keep all types of disaster in mind.

What is "Downtime"?

Your Business Continuity Plan contains the maximum tolerable downtime for each resource. If the downtime is caused by a defective network switch, the downtime starts when the switch goes down and ends when it is repaired or replaced. Data that was written to the database before this incident is still available. However, depending on the business function and the supporting IT architecture, some data loss might still occur during the downtime.

If, on the other hand, the outage was caused by a fatal SAN failure (making a database restore necessary), the situation is not quite as straightforward, because the incident now involves both data loss and downtime. Two hours of lost data has an impact on the business that, while comparable in nature, is usually more severe than two hours of downtime. Therefore, you should consider the time between the last usable backup and the incident as part of the downtime.
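The following sketch shows one way to express this view of downtime in code. The function name and parameters are hypothetical; it simply adds the data-loss window to the outage duration whenever a restore from backup is involved.

```python
from datetime import datetime, timedelta


def effective_downtime(last_usable_backup, incident_start, service_restored, data_lost=True):
    """Downtime as seen by the business.

    Without data loss (e.g. a failed switch), only the outage itself counts.
    With data loss (e.g. a SAN failure), the time between the last usable
    backup and the incident is counted as well.
    """
    outage = service_restored - incident_start
    if not data_lost:
        return outage
    return (incident_start - last_usable_backup) + outage


# Example: last usable backup at 10:00, incident at 12:00, service back at 13:30.
last_backup = datetime(2014, 12, 23, 10, 0)
incident = datetime(2014, 12, 23, 12, 0)
restored = datetime(2014, 12, 23, 13, 30)

switch_failure = effective_downtime(last_backup, incident, restored, data_lost=False)
san_failure = effective_downtime(last_backup, incident, restored, data_lost=True)
```

For the switch failure only the 1.5 hours of outage count. For the SAN failure the two hours since the last backup are added, giving 3.5 hours to compare against the MTD.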

(RTO + RPO) * n < MTD

When selecting your RPO and RTO, the sum of both values therefore has to fit within the MTD. However, it should not merely fit; it should be well within the MTD.

Consider the case of a backup checksum failure that is found during the restore. A problem like that potentially takes as long to discover as it takes for the restore to run. In this situation, you now have to start over with your restore efforts, using the next most recent backup. Remember, the MTD (the maximum tolerable downtime) is defined as the length of downtime beyond which the business would not survive. That means that in the situation described above, the second restore still has to be able to complete within the MTD.

Because of the above considerations, you should set RTO and RPO so that their sum is at most half of the MTD. I would even recommend keeping their sum under a third of the MTD.
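As a quick sanity check, the rule from the heading can be written as a small function. This is only a sketch: it reads the factor n as the number of restore attempts you want to be able to afford (2 for "half of the MTD", 3 for "a third"), which is my interpretation, and the function names are made up for this example.

```python
from datetime import timedelta


def fits_within_mtd(rto, rpo, mtd, attempts=2):
    """Return True if (RTO + RPO) * n < MTD, with n = attempts (assumed meaning)."""
    return (rto + rpo) * attempts < mtd


def max_objective_sum(mtd, attempts=2):
    """Upper bound for RTO + RPO when budgeting for the given number of attempts."""
    return mtd / attempts


# Example: an MTD of 8 hours.
mtd = timedelta(hours=8)

ok_two = fits_within_mtd(timedelta(hours=2), timedelta(minutes=30), mtd, attempts=2)
ok_three = fits_within_mtd(timedelta(hours=2), timedelta(minutes=30), mtd, attempts=3)
too_tight = fits_within_mtd(timedelta(hours=3), timedelta(hours=1), mtd, attempts=2)
```

With an RTO of 2 hours and an RPO of 30 minutes, both checks pass (5 hours and 7.5 hours are below 8 hours). An RTO of 3 hours with an RPO of 1 hour fails for two attempts, because two full attempts would use up the entire MTD.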

RPO < RTO

Finding the proper RTO and RPO values for your business depends on many factors. Is data from 15 minutes ago worth more to you than five disgruntled customers running to the competition? You have to ask yourself questions like this to come up with adequate values for your business.

When working through this problem, keep the following in mind:

  • SQL Server High Availability and Disaster Recovery technologies like AlwaysOn allow us to stay within RPOs and RTOs of seconds. However, that comes at a significant price tag.
  • With a simple offsite backup strategy, it is possible to keep the timeframe between viable backups within a few minutes. The only cost associated with this is the cost of the storage required to keep the backups.
  • Recognizing that there are exceptions, for most businesses a larger RPO increases the damage potential of a disaster more than a larger RTO.

Therefore, in most cases you want to keep the RPO smaller than the RTO.

Additional Considerations

While a best practice for determining RTO and RPO is to not let yourself be influenced by current hardware constraints, you need to keep the price associated with these numbers in mind. It is easy to say that you should always set both values to a few seconds at most. However, implementing that might cost you more than your business can afford. In a case like that, you need to re-evaluate your business requirements (and potentially adjust customer Service Level Agreements) to allow for affordable RTO and RPO values.

You also need to consider your risk aversion. No matter how much you plan and invest, there will always be some residual risk. It is possible for a disaster to happen that takes you out of business. If you have a data center in the US and a data center in Australia, it is not very likely that both are hit by a natural disaster in short succession, but it is possible. You can now either add a third data center in China or accept the risk. You have to decide how much you want to invest to protect your company from the likely and less-likely disasters. Along the same lines, if you have well-trained staff and a well-tested business recovery plan, you might consider the risk of a second disaster hitting during the restore operations small enough to select an RPO and RTO that get you recovered just under the maximum tolerable downtime.

Summary

When designing your Business Continuity Plan (BCP), you have to include a Disaster Recovery Plan (DRP). The DRP aims to fulfill two objectives, the Recovery Point Objective (RPO) and the Recovery Time Objective (RTO). The RPO provides a guideline for how much data can be lost during a disaster. The RTO, on the other hand, sets a target for the maximum duration any recovery operation should take. Both objectives should be set as low as possible while staying within the confines of the business.
