
Recently we had a client come to us for help with SharePoint 2013 disaster recovery. The requirement was straightforward. They had two datacenters in two different cities, one primary and one secondary. They wanted a tested SharePoint disaster recovery plan in place that would guarantee restoration of SharePoint functionality within their SLA (a generous 48 hours) in the event of a catastrophic failure at their primary datacenter.

We told them we would be happy to assist with that requirement, but the first point we had to make was that their need for SharePoint disaster recovery was in fact more pressing than they realized. Even though they had reliable and tested SQL backup-restore plans and tools in place, every SharePoint farm is just one bad database away from disaster. We will begin this series on disaster recovery by examining why that is the case.

A SharePoint farm is at risk of disaster in the event of SQL database corruption even if robust SQL backups are in place for two reasons: 1) not all of the necessary configuration information for the Farm is stored in databases, and 2) the Farm Configuration Database in particular is not restorable via SQL backups.

The Farm Config DB exists in a synchronous relationship with the many scripted processes that are constantly running on all of the SharePoint servers in the Farm. These processes are complex, stateful, and cyclical; while they are in progress, their completion depends on information that exists not only in the Config DB, but also in memory on the farm servers, and even in the packets passing back and forth between all the servers in the farm. For this reason, the state of the farm at any one moment does not exist in the Config DB alone. This web of related, interdependent data is why simply restoring the configuration database to an earlier point in time doesn’t work; the restored database would be completely out of sync with the state of the processes on the farm servers and even with any packets moving on the network. Things might initially look fine, but the lack of synchronization will cause unpredictable problems; in Microsoft’s words, “users may experience various random errors.”
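To make the synchronization problem concrete, here is a deliberately simplified toy model in Python. It is not how SharePoint works internally: the ConfigDB and FarmServer classes, the version counter, and the server names are all hypothetical, invented purely for illustration. The only point it sketches is that when state lives both in a database and in the memory of the servers using it, rolling back the database alone leaves the two in disagreement.

```python
from dataclasses import dataclass


@dataclass
class ConfigDB:
    """Hypothetical stand-in for the farm configuration database."""
    version: int = 0

    def snapshot(self) -> int:
        # Stands in for a SQL backup: captures only what is in the database.
        return self.version

    def restore(self, snapshot: int) -> None:
        # Stands in for a SQL restore: rolls back only the database.
        self.version = snapshot


@dataclass
class FarmServer:
    """Hypothetical farm server that keeps config state in memory."""
    name: str
    cached_version: int = 0

    def sync(self, db: ConfigDB) -> None:
        self.cached_version = db.version


def out_of_sync(db: ConfigDB, servers: list[FarmServer]) -> list[str]:
    """Return the names of servers whose in-memory state disagrees with the DB."""
    return [s.name for s in servers if s.cached_version != db.version]


def simulate_restore() -> list[str]:
    db = ConfigDB()
    servers = [FarmServer("web1"), FarmServer("app1")]

    backup = db.snapshot()      # backup taken while the farm is at version 0

    for _ in range(3):          # the farm keeps running after the backup
        db.version += 1         # configuration changes land in the database
        for server in servers:
            server.sync(db)     # and the servers track them in memory

    db.restore(backup)          # restore the database, and only the database
    return out_of_sync(db, servers)


if __name__ == "__main__":
    print("Servers out of sync after restore:", simulate_restore())
```

In this toy model every server ends up out of step with the restored database, because the backup never captured what the servers were holding in memory. A real farm is far more complicated, but the shape of the problem is the same.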

What all this means is that if the Config DB were ever corrupted or lost to the point where a restore from SQL backup would be desirable, the Farm would have entered a DR situation, and a full DR package would have to be implemented to rebuild or restore the entire farm.

So as far as SharePoint is concerned, disaster recovery is not simply a question of datacenters catching fire or servers blowing up. A SharePoint disaster can be as simple as one corrupted database, and for that reason every production SharePoint farm needs a tested disaster recovery plan.

Next time we will look at what approaches Microsoft recommends for SharePoint disaster recovery and evaluate the strengths and weaknesses of each approach.

