The project’s scope was limited but important. We were tasked with selecting a new backup and recovery platform and building a new process that would take advantage of the new system.
The project was initially planned as a fairly short selection and implementation effort: four months estimated for the selection process and six months for implementation, testing and rollout.
After three months of review, testing and demos, we selected a solution. It was one of the top-tier backup and recovery solutions at the time. The implementation phase of the project went well with no real hiccups.
As part of the selection and implementation project, a new backup/recovery process was created for the organization. This new process was built around the new platform and was created from scratch to allow the organization to take full advantage of the features and functionality of the new solution.
We built a fairly robust process combining on-site backup with off-site tape storage.
The solution we chose would automatically create multiple tape copies of each backup for off-site storage. For redundancy, we used three off-site locations, which required three separate tape copies of the backup volumes. The locations were geographically distributed to ensure safety, with one close enough to the data center for same-day delivery if needed.
As part of the backup process, before shipping tapes off-site, we tested each tape to ensure it contained a backup and could be read. For this test, the tapes were inserted into a computer provided as part of the new backup platform. This machine had one task – to check each tape for data and verify that nothing was corrupted.
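Conceptually, that integrity check amounts to a full read-back pass with a checksum comparison. Here is a minimal sketch in Python; the block size and the use of SHA-256 are my own assumptions for illustration, not the vendor's actual on-tape format:

```python
import hashlib

# Illustrative fixed read size; real tape drives read in their own block sizes.
TAPE_BLOCK_SIZE = 64 * 1024


def verify_image(path, expected_sha256):
    """Read a backup image back block by block, the way a verify pass
    would, and compare the digest against the one recorded at backup time.

    Returns True if every block was readable and the digest matches.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(TAPE_BLOCK_SIZE)
            if not block:
                break
            # An unreadable block (e.g. a media error) would raise an
            # OSError here, failing the check just like a bad tape.
            digest.update(block)
    return digest.hexdigest() == expected_sha256
```

Note that a check like this only tells you the media is readable and the bits match what was written; it says nothing about whether the backup itself captured the right data, which is why restore tests are still worthwhile.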
The new process was built. The system was implemented and rolled out. We were all happy . . . until we weren’t.
Crystal Bedell recently wrote a piece titled Tape vs. Cloud: Reducing the Risk of Data Loss During Backup and Recovery. In that post she wrote:
Despite all your best efforts, there are no guarantees that data backed up to tape will be there when you need it most. It’s not uncommon for backup data to become corrupted due to operational error or mishandling of the tape. It’s also possible to accidentally overwrite critical business data by inserting or partially formatting the wrong tape.
How true. And that’s exactly what happened to my organization about three months after going live.
One day, while running integrity checks on a new batch of tapes, we started seeing quite a few errors, and not just on one tape but on several. The first time the issue occurred, we assumed it was a glitch and asked the vendor for support. The vendor identified it as a known bug and issued a fix.
The next week, we saw issues again. And then again. Our vendor provided updates and fixes for the integrity checking process, but we still saw issues throughout the next few weeks.
This integrity issue drove us crazy. We spent a little over a month working through the issue with little luck. Our backups were still running and our on-site storage was working, but our tape backups weren’t passing the integrity tests so we couldn’t rely on them.
Then one day we received a call from our vendor. They had shipped us a new tape drive to use with the integrity checking process. They wouldn’t really provide us with more information about why we needed to replace the drive other than to say the change was required.
When we received the new drive, we noticed it didn’t look much different from the original. It was the same size as the original and tapes were loaded in the same manner. After we installed the new drive, we started getting tapes to pass the integrity test. In fact, in the new drive, every tape passed the integrity test every time.
We were happy to see the issue resolved but I still wondered why we had seen the issues, so I asked to have the original drive added to another machine for testing. I grabbed a few tapes to test and planned to spend a few hours getting to the bottom of what the problem had been.
As it turns out, it didn’t take me a few hours to find the problem. I inserted the first tape and ran the integrity test and it failed. I then took that tape to the new drive and ran the tests and found that it passed. Then I took the tape back to the original drive and it failed. But now I realized why it failed.
The failure of the integrity checking process wasn’t due to the tapes or the backup process. It was due to a small (but very important) defect in the tape drive itself. The defect? The drive didn’t actually force the tape into a “seated” position. In other words, the tape was never fully inserted into the drive, so the data could not be read.
After three months of headaches and many hours spent chasing the problem, a simple defect turned out to be the culprit: a plastic mount that kept the tapes from seating fully in the drive, causing the integrity checks to fail.
Crystal finishes off her post with this:
Backup and recovery is a necessity, but the headaches and risk associated with tape are not.
Talk about headaches. I had plenty from that tape drive fiasco. While there will always be headaches in IT, reducing the number of moving parts in the backup and recovery process helps keep them to a minimum.
Moving to the cloud can help reduce those moving parts. The cloud isn’t a perfect solution, but it’s a solution that can help reduce some risks (and headaches) with your backup and recovery solution.
Image Credit: Tape Backups on flickr