You’ve spent a great deal of time and money getting your SaaS application “live.” You went with SaaS as a way to take advantage of the cloud and to offload some of the operational burden of running your applications. Your SaaS application is built on a private cloud that your organization manages.
You’ve trained your users. You’ve made certain that the right governance model is in place to ensure everything is covered for the future.
Three weeks after you begin using your SaaS application, you run across a small problem. You start hearing about data corruption issues from users. The corruption isn’t major and can easily be repaired by having users make a few minor data edits.
After a few more days, you’re still hearing about data corruption. You start to worry that you’ve got a major issue with your application—or perhaps a data security issue. You analyze the application’s workflow to see whether any part of the process might introduce corrupt data, but you find nothing out of the ordinary. Your team dives into access and security logs to see if you’ve had a security breach of the SaaS system, but nothing points to nefarious activity.
Your team continues to use the system and continues to see small areas of data corruption. Again, it’s nothing major, but the corruption is there. After more digging, it appears that the corruption arises only at certain times of the day, when the number of users reaches a very high level.
You reach out to your SaaS application vendor and ask them to take a look at the issues. They determine that there are some configuration issues that prevent the application from running correctly under high user load.
Your vendor describes the steps your team needs to take to reconfigure the system. Over the weekend, after running a full backup of the current data, your operations team makes the changes needed to meet the vendor’s requirements.
During these changes, the application’s database has to be reset so the tables can be rebuilt to meet the new requirements. Your team gets to the last few steps of the reconfiguration and discovers a major issue.
When the SaaS database was initially configured, it was spread across multiple servers. The backup procedure required each server to run its own backup process, all storing backup data in a single location, which was then replicated off-site in the cloud.
The problem? The backup process didn’t take into account the complex, distributed nature of your SaaS database. Each backup agent stored each server’s backup data as a single file within the backup location. When it came time to do the necessary restore, the recovery process failed because the backup system hadn’t stored the metadata required to restore the data to the appropriate location.
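The failure described above comes down to missing metadata: the backup files existed, but nothing recorded which server each file came from. The sketch below is purely illustrative—the function names, the manifest format, and the `.bak` file convention are my assumptions, not the vendor’s actual tooling—but it shows the kind of restore-time mapping that was missing.

```python
import json
import os
import tempfile

def back_up_shard(server_name: str, shard_data: bytes, backup_dir: str) -> dict:
    """Write one server's shard to the shared backup location and return
    the metadata needed to restore it to the right place later."""
    path = os.path.join(backup_dir, f"{server_name}.bak")
    with open(path, "wb") as f:
        f.write(shard_data)
    return {"server": server_name, "file": os.path.basename(path)}

def write_manifest(entries: list, backup_dir: str) -> str:
    """Record which backup file belongs to which server -- the metadata
    the backup agents in the story failed to capture."""
    manifest_path = os.path.join(backup_dir, "manifest.json")
    with open(manifest_path, "w") as f:
        json.dump({"shards": entries}, f, indent=2)
    return manifest_path

def restore(backup_dir: str) -> dict:
    """Use the manifest to map each backup file back to its server."""
    with open(os.path.join(backup_dir, "manifest.json")) as f:
        manifest = json.load(f)
    restored = {}
    for entry in manifest["shards"]:
        with open(os.path.join(backup_dir, entry["file"]), "rb") as f:
            restored[entry["server"]] = f.read()
    return restored

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as backup_dir:
        shards = {"db-node-1": b"orders", "db-node-2": b"customers"}
        entries = [back_up_shard(name, data, backup_dir)
                   for name, data in shards.items()]
        write_manifest(entries, backup_dir)
        assert restore(backup_dir) == shards
        print("restore matches original shards")
```

Without the manifest, the restore step has a directory full of anonymous files and no way to put them back where they belong—which is essentially what happened here.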
This wasn’t caught during the initial test phase. During testing, the single-server test system passed the backup/recovery process with flying colors. But the production system was never tested.
All isn’t lost; the data can be restored. The downside is that the recovery procedure takes a great deal of time and must be followed exactly, step by step. The downtime for this SaaS application grows from a few hours to a few days. You are quite upset. Your team is quite upset. The organization is extremely upset.
The reason for going to the cloud was the robustness of the systems and the increased uptime (along with other benefits). But the lack of a well-planned backup process has quickly negated the benefits of going with the cloud.
The above issues could have been mitigated by more robust testing to ensure the backup/recovery process worked before going live. That said, this type of thing happens more often than most folks like to admit.
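One way to make that testing routine is a periodic restore drill: record a checksum for each piece of data at backup time, then verify a real restore against those checksums before you ever need it in anger. The sketch below is a simplified illustration—the function names and the idea of modeling data as named shards are my assumptions—but the pattern of comparing restored data against recorded checksums is the check that would have caught this failure before go-live.

```python
import hashlib

def record_checksums(shards: dict) -> dict:
    """At backup time, record a SHA-256 checksum for each shard."""
    return {name: hashlib.sha256(data).hexdigest()
            for name, data in shards.items()}

def restore_drill(restored: dict, expected: dict) -> list:
    """Compare a restored data set against the recorded checksums and
    report anything missing or corrupt."""
    problems = []
    for name, digest in expected.items():
        if name not in restored:
            problems.append(f"{name}: missing after restore")
        elif hashlib.sha256(restored[name]).hexdigest() != digest:
            problems.append(f"{name}: checksum mismatch")
    return problems

if __name__ == "__main__":
    shards = {"db-node-1": b"orders", "db-node-2": b"customers"}
    expected = record_checksums(shards)

    # Simulate a bad restore that silently drops one shard
    bad_restore = {"db-node-1": b"orders"}
    print(restore_drill(bad_restore, expected))  # reports db-node-2 missing

    # A complete, uncorrupted restore produces no problems
    assert restore_drill(shards, expected) == []
```

Run against the real production topology—not a single-server stand-in—a drill like this turns “we think restores work” into something you can verify on a schedule.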
Have you considered all of the points of failure in your applications and systems? Do you know if your backup/recovery process will actually work the way it’s designed to work when it needs to work?