Sunday, August 11, 2013

Disaster Recovery : Be The DBA Who Saves The Day!

Those of us who work in the "big data" arena know just how bad things can be when a server turns into a smoking slab, someone mislabeled the tapes, screwing up the rotation and retention or any of the other horrible things that can happen when good data goes bad (or gone).

How do we approach this without being too heavy handed with cron jobs running every 5 minutes, Diffs on the 15's and daily Fulls including system DBs. Plan for failure. Plan for success. Just plan. Before you spend a single penny on new blades, CAL's and a team of high priced "data architects", spend a few minutes to evaluate your current policies and plans for backups and data retention and whether or not you are meeting your SLA's. If you did this and got that sour feeling in your stomach then it is definitely time to revamp your DR plans and begin documenting more.

So without further to-do, here is a great post from Steve over @ RedGate Software.

10 Tips for Efficient Disaster Recovery

by Steve Jones

We experience "disasters" at work all the time, ranging from accidentally deleted data, to hard drives that fail, to misconfigured servers that cause deployments to fail. Most of these disasters are, in fact, relatively minor annoyances that we deal with inside of our department, finding workarounds or accepting delays in completing our work. However, sometimes we have real disasters of the sort that result in management peering impatiently over our shoulder as we strive to find a solution.
When a true disaster occurs, it's often unexpected and forces us to respond in unfamiliar ways. It is a stressful situation, but it's one through which we learn a lot about our server environment and our processes. We learn about the "weaknesses" in our setup, and about what we should have done or practiced in advance to be ready for this particular situation.
All technology workers should adopt the Boy Scout motto, Be Prepared. We can't eradicate disasters entirely; complex systems, especially database systems, can fail in myriad ways. And at times we will have regrets and make mistakes while fixing these systems. But, with better preparation, and by adopting a few good habits and skills, we can respond quickly and decisively to restore normal working order; especially during those times when a manger is staring grimly over our shoulder.
This article presents ten ways to Be Prepared for those database-related bumps in the road, big and small:
  1. Test your restores
  2. Know the location of "the bits"
  3. Have pre-approved expenses
  4. Understand partial restore processes
  5. Have spare hardware
  6. Monitor your systems for changes
  7. Keep test scripts handy
  8. Maintain up-to-date contact information
  9. Use trained staff
  10. Consider the cloud

Test your restores

If you experience "total environment failure", perhaps due to fire, flood, or other natural disaster, your decision to rebuild the system might require a restore at a new location with a backup copy of the data. These days most companies could not survive for any length of time on non-computerized systems.
However, how confident are you that you know the backup process for each of the important databases in the system? Do you know exactly which backup files you need and how to get hold of them? Most importantly, do you know how to recover each of the database backup files?
The mere existence of a backup file in a folder doesn't imply that the file is usable. If you haven't tested fully how to recover each component of your system, including the databases, there will be many things about the disaster recovery process that you don't know. You don't know if your backup process is working, you don't know if you can restore the latest database backup, and you don't know what might go wrong with the restore process.
SQL Server backup and restore operations are quite straightforward, but there are lots of options and potential places where you can make mistakes. Third party tools, like SQL Backup Pro, can make the process of restoring systems simple enough so that anyone can perform a restore.While responding to a disaster, you don't want to discover that certain databases use striped backups, and that you don't know how to perform the restore operation. It certainly would not be very impressive to any directors or vice presidents standing in your office.
However, if you aren't testing the restores, you'll never know if you can complete one.
Test your restores monthly, using a different server and a different restore technique each time, to ensure that you and your team can restore a database or log backup, if needed.

Know the location of "the BITS"

A disaster is usually unexpected. It is very hard to predict what part of your system will fail and the exact manner in which it will fail. Consequently, you will not know exactly which computers you'll need to rebuild, or what you might need to reinstall or restore. All you can do is be prepared.
Of course, a key aspect of your disaster recovery strategy will dictate that you have copies of all database backup files stored off-site, in a location that is unlikely to suffer the same disaster that befell the primary site. However, restoring a working database application requires more than just the database. It will likely require third-party software and relevant license keys, for example. Of course, these days we could simply download another copy of the software from the vendor site, but that isn't a reliable disaster recovery solution. The vendor site may be down, or the version you need may be unavailable. More likely, it will take too long to re-download the files. In addition, there will be patches to apply, data feeds to re-establish, and so on, in order to restore the system fully.
Good supply chain planning involves "staging" supplies so they are ready to go in response to demand. The same is true for disaster recovery planning. An exceptional DBA will ensure that copies of the software, backups, patches and all other required system components, including passwords, encryption keys, and so on are stored in a known location. In addition, they will have documented the location of each and made this available to all IT staff.

Have pre-approved expenses!

I've experienced disasters that lasted minutes and some that have lasted days. In many cases, there's been the need to purchase something in order to facilitate the recovery of our systems. The expense might be very small, such as late night food to sustain the recovery team, or more substantial, such as airfare and hotel for an expert to come in and resolve a problem. On all occasions, the goal of the purchase is to help get the systems back up and running as quickly as possible.
In one case, we lost a password on a system that had taken days to set up. A simple error on the part of a staff member resulted in eight of us staring at a screen, unable to log in. One of my colleagues had used a utility to recover the password at a previous job, but it cost $300. No one was willing to spend the money and take the chance on a reimbursement since the organization had denied a number of expense claims from a previous disaster, including a piece of software. So instead of a $300 charge, and a system back up and running in minutes, three or four people spent a day rebuilding the system from scratch. Our department, including the CIO, was embarrassed. From that point forward, we received a pre-set disaster fund, which the company guaranteed to reimburse.
Most organizations recognize that disaster situations result in unexpected expenses and will often reimburse any reasonable expense after the event. However, it is always best to have the conversation in advance. A simple conversation with management will usually result in a workable disaster recovery budget.

Understand partial restore processes

Many disasters don't result in the total destruction of your hardware. Sometimes a recovery simply entails rerunning a failed process, such as performing a reload of a data warehouse. For others, a problem affects only a small subset of the whole system, such as a particular database, or table within it.
For each of your main systems, especially those complex systems with multiple software processes, or large amounts of data, you ought to have documented methods for restoring a part of the system. In SQL Server, this would be a process for recovering just a filegroup, table, or specific rows. It's worth taking the time to investigate whether your third party backup tool, if used, offers any built-in support for such "partial" restore processes (in the form of object level recovery, or ability to compare live data with backup data, and so on).
If the recovery team is unaware of such methods then they may spend an unacceptable amount of time performing a complete system restore, when they could have resolved the issue by restoring a single object.

Have spare hardware!

Hardware will fail. Some hardware components are more likely to fail than are others. Until you have a replacement component, or you move the system to new hardware, you can't proceed with system recovery. In quite a few organizations I have seen recovery operations halt, while waiting for a replacement part to arrive via FedEx or UPS.
It is a good idea to have spare hardware readily available, especially for those parts most likely to fail, such as hard drives.
In one organization, we had a guarantee with our major vendors, such as Dell, to deliver parts within a given time frame. In such cases, it's important that your team members know who to contact, and have access to the necessary account numbers, associated logins and passwords, and any other information they need to ensure equipment is delivered in a timely manner.

Monitor your systems for changes

Today's complex systems are too large to be able to rely on any DBA's memory for knowledge of each component, and its configuration options and settings. These days, it is increasingly likely that you'll install a new instance of SQL Server with virtual machines and scripted installations, not to mention cloud services. Such systems can change and expand rapidly, as we spin up new VMs, tweak configurations, plug in new services, and so on.

n such cases, it becomes vital that the team is aware of all new systems, and of changes made to existing systems. If not, then any recovery operation is likely to result in an incorrectly configured or incomplete system.

Monitoring your server environment is important for many reasons, but system recovery might be the most important. Your routine method of work should include monitoring any change that might affect the ability to recover the system. Your monitoring process should periodically check for any new systems that appear on the network, of which you were not previously aware. In this way, if a new VM suddenly appears, you can document it, ensure that it's included in the backup and recovery plan.

Keep test scripts handy

When you've recovered a system, how do you know for sure that it's complete and working correctly? How do you know you didn't restore an older version of the software? Are you sure that you recovered all the data that you should have and the application still works?
Customers rely on the IT team to keep their systems running. When there is system downtime, regardless of cause, there is a lot of pressure on the IT team to restore normal working order. In a disaster situation that's already stressful, and one of the most damaging things an IT team can do to its reputation is to recover the system in a state where it doesn't work.
The recovery team needs to run tests to verify the state of a restored system, in much the same way that developers write scripts to test and verify the behavior of their code. In fact, many of the scripts and tools that the development team uses to certify a new deployment are useful for the DBA team too.
Keep copies of the test scripts your development team uses, and run them to verify the state of a restored system. You may even want to get copies of their testing tools and automate these post-restore tests.

Maintain up-to-date contact information

A corollary to Murphy's Law states that disaster is most likely to strike when your senior people are out of the office. Can you contact your staff? Do you know which vendors and consultants you need to contact, and how? If you reach a support person, do you have your company's account information? We often assume, wrongly, that everyone has everyone else's contact information, and that the person that knows about the vendor relationship will handle all communication.
Periodically ask people to update their contact information in a central location that will still be available if your local systems are down. Include vendor contact information and account details as well.

Train your team

Technology changes quickly, and management have unrealistic expectations about the breadth of knowledge of their IT staff. However, if your team's job is to perform recovery tasks, then you need to make sure they know how to use the tools that are necessary to complete the work.
It is not enough that an expert in a particular technology can perform their set of recovery tasks. They need training in how to relay instructions over the phone, to a non-expert. Junior members of staff need training in basic procedures, and need to be comfortable restoring more complex systems, following instructions relayed to them over the phone (or via documented procedures, if they exist and are up-to-date).
Practice makes perfect. Perform some cross training so that all your staff can perform a recovery operation, either solo or under direction.

Consider the cloud

The cloud is an amorphous, poorly defined technology. It means different things to different organizations, and different things to different departments within an organization. It can mean IaaS, PaaS, or SaaS. It might take the form of software on a web site, or a virtual machine provided by a vendor, or a database service, or a simple file storage location.
Ultimately, however, it means a low cost and instant installation of software and services, and as such, it would be remiss not to at least consider the cloud for part of your environment.
The cloud isn't for every application or every situation, but many companies should examine the cloud to determine if it can help lower the cost of some systems, or perhaps just reduce the impact of a disaster.Moving systems to the Cloud doesn't remove the need to manage these systems or perform backups, but it can make disaster recovery a little easier. The day your data center gets flooded from a broken pipe, you might be glad that you backed up certain databases to the cloud, which could be a leased server in a co-location facility, as well as locally. Likewise, use of a remote cloud-based service instead of a one in your data center might help minimize the impact of a disaster.


We cannot anticipate or train for every disaster situation, but preparation and practice can reduce the impact of many disasters. The items presented in this paper can help you better prepare for future incidents without having to learn these lessons the hard way.

No comments:

Post a Comment