I have been trying to figure out the best way to HA and DR protect SQL servers for some time now.
My customer is currently running SQL server 2008 R2 on Veritas cluster with Veritas Volume Manager as block level replication between primary and DR site. The cluster has 4 physical nodes on each site with multiple instances for different system. Two of the hosts on each site has FusionIO flash cards for tempdb storage. All other data is stored in a NetApp SAN connected through dual 8 GB FC HBA’s. Symantec NetBackup 7.6 is used for backups, it’s using the latest snapshot technology in NetApp and Vmware to make backups of databases in just seconds. Restore is also done in just minutes.
Upgrade to SQL 2012 is in the pipe and we are also looking into the possibility of making these servers virtual in Vmware. Load on the physical boxes is not that high and the performance of the Vmware environment would be sufficient. Vmware is running on HP blades, blade chassi is connected to NetApp through dual (or more) 10 GB Ethernet. Datastores in Vmware uses NFS protocol. Veritas is considered expensive and the goal is to get rid of it.
IOmeter och Crystal diskmark gives us almost the same disk performance between physical and virtual servers. We are looking into the possibility of using either InfinIO and/or IOTurbine in the Vmware hosts to get even better performance.
Last 8 months we have been testing AlwaysON to see if that is the solution for a new environment. AlwaysOn is very nice when used with non-shared storage and both synchronous and asynchronous replication between hosts and sites seems to be working very well. We setup a cluster with 3 plus 3 hosts on primary and DR site and created AG’s for both Sharepoint environments and other both big and small databases. We setup AG’s local to only one site and AG’s spanning both sites and we used a file share witness that could also be moved to DR site.
All worked well until we performed a planned DR test where we failed over all AG’s to the DR site and then pulled the plug between the sites. Primary site cluster stayed up but DR site stopped because of lost quorum…. My mistake of course, had to manually change the quorum configuration after failing over. Have fixed that now with a powershell script that fails over all AG’s, resumes replication and changes quorum. So far so good.
But there is still a problem. As I said, one of the AG’s I have created is only local to the servers on the DR site, it’s a database only used by application installed on the DR site so we don’t need it on the primary site. Problem is that the cluster on the DR site goes down when we lose connectivity between the sites.. Microsoft Cluster is designed this way to avoid a split brain situation, but it’s not what I want….
One way to solve this would be two have separate AlwaysOn clusters, but that’s expensive and it would be better to utilize the same servers for more than one thing.
Another way would be two have two FCSI clusters and put AlwaysON on top of that, but that would require shared storage for each FCSI on each site and in Vmware this requires FC or ISCSI. If I understand it correctly it could work in Windows 2012 and Vmware 5.5. But we are not there yet.
On top of this there is the problem with logins and jobs that need s to be synched between the servers. Having six servers and 4 AG’s makes this a bit complicated.
Jonathan Kehayias at SQLskills has written a small application to help out with this, but it’s still not enough. Check it out here http://www.sqlskills.com/blogs/jonathan/synchronize-availability-group-logins-and-jobs/
So, I have started to look into different solutions. I have found the following alternatives that we should consider:
- Vmware native HA with SRM for DR protection
What I’m considering right now is to just put the SQL servers in Vmware as is and use SRM for DR protection, it would give basically the same protection level as the Veritas solution. Block level snapshot replication in the NetApp instead of block level volume replication with Veritas. Failover between local host on primary would be slightly slower because it would require the virtual machine to restarted from scratch on a new hosts, compared to Veritas where only SQL server has to startup. But Veritas has to take care of shared storage and unmounting and mounting diskgroups and that can also take some time. Failing to DR site is manual in Veritas and would be the same in Vmware, data loss would possibly be the same if the line between the sites is not fast enough for high peak load.
Using some other third-party clustering like SIOS or DH2i would maybe be better than Veritas, less complicated, possibly cheaper but something new that needs to be implemented.
With the protection that Vmware, NetApp and Symantec provides is it really neccesary to add AlwaysOn protection as well if we don’t need readable secondarys or backup offloadning ?
What’s your opinion ?