Disaster comes in many shapes and forms and it can affect any of us. The companies that survive a disaster are the ones that are prepared. Are you prepared? Do you have a Business Continuity Plan?
Many think of a disaster as something huge that affects many people. A hurricane, for example, is certainly a (natural) disaster. However, smaller events can be just as disastrous to a single company. Disasters that can affect your business continuity include:
There seems to be an odd disparity within this list. However, all have in common that they can immediately lead to significant data loss. In the end, it does not matter whether the database was lost in a fire or through an accidentally executed DROP DATABASE statement. In many businesses, the database is the most critical asset. Without it, the business cannot survive. But even if it is not the most critical asset, losing access to the data will have a significant negative impact on a business's viability. Therefore, it is important to have an adequate business continuity plan in place.
A business continuity plan helps your company survive a disastrous event. It defines the order in which assets and resources need to be repaired or replaced. It provides step-by-step instructions on how to revive each part, and it documents the timeframe in which each resource has to be back in shape so that the business is not at risk of closing. To create your own business continuity plan, follow these five steps.
The first step in creating a business continuity plan is to identify the important parts of your business. The list should include items like HR, Accounting, Marketing and Sales. Depending on the type of your business, you might have business functions along the lines of Material Flow Management. If you are heading a larger enterprise, you should use a more fine-grained approach to each function. For example, instead of Accounting use separate entries for Accounts Payable, Accounts Receivable and General Ledger.
Once you have the list of business functions, you need to define the resources that they rely on. Add every type of resource here, from the workforce and the workspace down to the IT infrastructure, including the database(s). Make sure to include external supplies that you rely on, like rare-earth magnets (if you use them in your products) or a fast connection to the internet.
The result of this effort is a comprehensive list of all functions and resources that are currently involved in running your business. Refrain from making judgment calls in this phase. Include everything that is currently part of your business, even if you do not think that it is important.
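One way to keep such an inventory reviewable is to record it as plain data. The following is a minimal sketch with made-up function and resource names (none of them are prescribed by this article). It inverts the function-to-resource mapping so that you can see which business functions depend on each resource.

```python
from collections import defaultdict

# Hypothetical inventory: business function -> resources it relies on.
# All names are illustrative assumptions; replace them with your own list.
inventory = {
    "Accounts Payable": ["staff", "office", "accounting database"],
    "Accounts Receivable": ["staff", "office", "accounting database"],
    "Sales": ["staff", "internet connection", "CRM database"],
}


def functions_by_resource(inventory):
    """Invert the inventory: resource -> sorted list of functions using it."""
    dependents = defaultdict(set)
    for function, resources in inventory.items():
        for resource in resources:
            dependents[resource].add(function)
    return {resource: sorted(funcs) for resource, funcs in dependents.items()}


if __name__ == "__main__":
    for resource, funcs in sorted(functions_by_resource(inventory).items()):
        print(f"{resource}: {', '.join(funcs)}")
```

A resource that many functions depend on is a natural candidate for closer attention in the steps that follow, but the inventory itself should stay free of such judgments.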
During the second step, the different resources have to be ranked by importance to the business. It is important to prioritize the resources, so that in the case of a disaster the most critical resources are attended to first. You would not want the paramedic to stabilize a broken ankle first, if the patient also has an injured neck artery.
To find adequate priorities for your business resources, go through the entire list and determine the maximum tolerable downtime for each item. That is the longest time span that the business can survive without that resource in place. Be realistic here. If you are an IT company and your building is destroyed, your staff can probably work from home for quite a while. On the other hand, if you run an electronic stock exchange and an incident disrupts access to your database, you might lose money at a rate that will force you out of business in a matter of minutes.
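Once every resource has a maximum tolerable downtime, the ranking follows from a simple sort: the shorter the tolerable downtime, the earlier the resource is attended to. The sketch below illustrates this with assumed resources and assumed durations; the numbers are placeholders, not recommendations.

```python
from datetime import timedelta

# Hypothetical resources with assumed maximum tolerable downtimes.
max_tolerable_downtime = {
    "office building": timedelta(days=30),
    "trading database": timedelta(minutes=5),
    "internet connection": timedelta(hours=4),
}


def recovery_order(mtd_by_resource):
    """Return resource names, shortest maximum tolerable downtime first."""
    return sorted(mtd_by_resource, key=mtd_by_resource.get)


if __name__ == "__main__":
    for rank, resource in enumerate(recovery_order(max_tolerable_downtime), 1):
        print(rank, resource, max_tolerable_downtime[resource])
```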
Preventative controls are measures that you can put in place to prevent a disaster from happening. Those measures come at a cost, but for the more critical items in your list, this is usually money well invested. For example, it is relatively cheap today to protect against a failing drive by using a form of RAID storage, so storing your database on a single drive could be considered negligence. However, there are other possible points of failure, and you need to consider all of them when setting up preventative controls. You need not only redundant storage, but also redundant network connections and even redundant server hardware.
Measures that prevent a disaster in the IT infrastructure are usually grouped under the term "High Availability" or HA. SQL Server provides a few different HA technologies like AlwaysOn availability groups, AlwaysOn Failover Clustering or database mirroring. Each of these comes at a different price, and you need to decide whether the resource you are protecting is valuable enough to justify that additional money.
Note that all these SQL Server technologies, while usually called HA solutions, do not actually prevent a disaster by themselves. If a disk that is part of a RAID set fails, it can be exchanged without any interruption to the system. The same is true for a redundant power supply. On the other hand, if a clustering solution executes a failover, existing connections will be cut. This can lead to unexpected system behavior. To prevent these effects, the application needs to be designed to be resilient to a database failover. Only then can these techniques be called "HA".
Regardless of your investment in high availability, you also need a strategy in place for when the resource becomes unavailable anyway. For example, AlwaysOn does protect against sudden hardware failure, but it does not protect against a dropped table or a dropped database. In addition, the redundant hardware tends to be in the same room, so a flood or a fire can easily affect the entire system.
A recovery strategy is a step-by-step plan that details how to get the business back up and running after a resource or a group of resources becomes unavailable. In the case of a database, it would include information about where to find the most recent backups and how to restore them. But it also includes a strategy for making sure that those backups actually exist and that they are still available after a disaster (think offsite storage).
For each database, you want to have a backup strategy, offsite storage and a restore strategy. When designing these strategies, you need to take into account that a disaster will likely cause data loss and will require time to repair. You need to make sure that your backup strategy can support the constraints of the company. For example, if the latest available backup is 24 hours old, the clock counting down the maximum tolerable downtime can be considered to have started 24 hours ago as well, giving you that much less time for restore operations.
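One common way to make an application more tolerant of a failover is to retry an operation when its connection is cut. The following is a minimal sketch of that idea, not a complete solution: `TransientConnectionError` is a stand-in for whatever error your database driver actually raises, and `run_with_retry` is a hypothetical helper, not part of any library.

```python
import time


class TransientConnectionError(Exception):
    """Stand-in for the error a driver raises when a connection is cut."""


def run_with_retry(operation, attempts=5, initial_delay=0.5, sleep=time.sleep):
    """Call operation(); on a transient error, wait and try again.

    The delay doubles after each failed attempt. If the final attempt
    also fails, the error is re-raised to the caller.
    """
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except TransientConnectionError:
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= 2
```

Retrying is only safe if the operation can be repeated without side effects, for example because it opens a fresh connection and runs inside a transaction that was rolled back when the connection was lost.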
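The arithmetic behind that example can be written down explicitly. This sketch follows the conservative framing used above, in which the age of the newest usable backup is subtracted from the maximum tolerable downtime; the function name and the sample durations are assumptions made for illustration.

```python
from datetime import timedelta


def remaining_restore_window(max_tolerable_downtime, backup_age):
    """Time left for restore work once the backup's age is counted.

    Returns a zero duration if the backup is already older than the
    maximum tolerable downtime.
    """
    remaining = max_tolerable_downtime - backup_age
    return max(remaining, timedelta(0))


if __name__ == "__main__":
    # Assumed figures: 72 hours tolerable downtime, backup taken 24 hours ago.
    window = remaining_restore_window(timedelta(hours=72), timedelta(hours=24))
    print(window)
```

If the result comes out too small for a realistic restore, that is a sign that backups need to be taken more frequently for that database.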
The best plan is worthless if it cannot be executed at the time of need because important pieces are missing or the current staff cannot understand parts of it. Even worse is finding out after the disaster that the backup strategy was implemented incorrectly and that you do not have a viable backup at all.
The only way to make sure that the disaster recovery strategy can be executed is to run training events regularly and test all parts of it. It is not necessary to test all pieces at once, but you want to make sure that everything is tested on a regular basis.
The entire strategy needs to be very well documented. You do not want to be in a situation where only one person can execute a critical piece of the puzzle, as that person might be unavailable when disaster strikes. Therefore, it is important that everybody follows the documentation closely during a disaster training event, so that documentation gaps can be discovered and fixed. Make sure everybody understands the importance of that. Gaps in the documentation of the disaster recovery plan can be just as bad as not having a plan at all.
Disaster can strike at any time. It comes in different forms, ranging from failing hardware to natural catastrophes. When you are unprepared, even a simple hard drive failure could destroy your business. Do not get caught in that situation. Create your Business Continuity Plan today.