
SQL Server HA and DR solutions?

I have been trying to figure out the best way to provide HA and DR protection for SQL Servers for some time now.

My customer is currently running SQL Server 2008 R2 on a Veritas cluster with Veritas Volume Manager providing block-level replication between the primary and DR sites. The cluster has 4 physical nodes on each site with multiple instances for different systems. Two of the hosts on each site have Fusion-io flash cards for tempdb storage. All other data is stored in a NetApp SAN connected through dual 8 Gb FC HBA’s. Symantec NetBackup 7.6 is used for backups; it uses the latest snapshot technology in NetApp and VMware to make backups of databases in just seconds. Restores are also done in just minutes.

An upgrade to SQL 2012 is in the pipeline and we are also looking into the possibility of making these servers virtual in VMware. The load on the physical boxes is not that high and the performance of the VMware environment would be sufficient. VMware is running on HP blades, and the blade chassis is connected to NetApp through dual (or more) 10 Gb Ethernet. Datastores in VMware use the NFS protocol. Veritas is considered expensive and the goal is to get rid of it.

Iometer and CrystalDiskMark give us almost the same disk performance between physical and virtual servers. We are looking into the possibility of using Infinio and/or IOTurbine in the VMware hosts to get even better performance.

For the last 8 months we have been testing AlwaysOn to see if that is the solution for a new environment. AlwaysOn is very nice when used with non-shared storage, and both synchronous and asynchronous replication between hosts and sites seem to be working very well. We set up a cluster with 3 plus 3 hosts on the primary and DR sites and created AGs for both SharePoint environments and other databases, big and small. We set up AGs local to only one site and AGs spanning both sites, and we used a file share witness that could also be moved to the DR site.

All worked well until we performed a planned DR test where we failed over all AGs to the DR site and then pulled the plug between the sites. The primary site cluster stayed up but the DR site stopped because quorum was lost… My mistake of course; I had to manually change the quorum configuration after failing over. I have fixed that now with a PowerShell script that fails over all AGs, resumes replication and changes the quorum. So far so good.
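For reference, the failover and resume steps in that script boil down to T-SQL along these lines. This is only a sketch with made-up AG and database names, and the quorum change itself still has to be done through the cluster tools rather than T-SQL:

-- Run on the replica that should become the new primary (DR site)
-- during a planned failover while the sites can still see each other:
ALTER AVAILABILITY GROUP [AG_Sharepoint] FAILOVER;

-- In a real disaster, with the primary site unreachable, a forced
-- failover is needed instead (accepting possible data loss):
-- ALTER AVAILABILITY GROUP [AG_Sharepoint] FORCE_FAILOVER_ALLOW_DATA_LOSS;

-- After a forced failover the databases on the other replicas are
-- suspended and replication has to be resumed per database:
ALTER DATABASE [SharepointContent] SET HADR RESUME;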

But there is still a problem. As I said, one of the AGs I have created is local to the servers on the DR site only; it’s a database used only by an application installed on the DR site, so we don’t need it on the primary site. The problem is that the cluster on the DR site goes down when we lose connectivity between the sites. Microsoft clustering is designed this way to avoid a split-brain situation, but it’s not what I want…

One way to solve this would be to have separate AlwaysOn clusters, but that’s expensive and it would be better to utilize the same servers for more than one thing.

Another way would be to have two FCI (failover cluster instance) clusters and put AlwaysOn on top of that, but that would require shared storage for each FCI on each site, and in VMware this requires FC or iSCSI. If I understand it correctly it could work in Windows 2012 and VMware 5.5. But we are not there yet.

On top of this there is the problem with logins and jobs that need to be synced between the servers. Having six servers and 4 AGs makes this a bit complicated.

Jonathan Kehayias at SQLskills has written a small application to help out with this, but it’s still not enough. Check it out here: http://www.sqlskills.com/blogs/jonathan/synchronize-availability-group-logins-and-jobs/
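The tricky part with SQL logins is that they must be created with the same SID on every replica, otherwise the database users end up orphaned after a failover. A minimal sketch, with a made-up login name, password and SID:

-- On the current primary, find the SID of the login:
SELECT name, sid FROM sys.server_principals WHERE name = 'AppUser';

-- On every other replica, create the login with that same SID:
CREATE LOGIN [AppUser]
WITH PASSWORD = 'S0mePassw0rd!',
SID = 0x912EC803B2CE49E4A541068D495AB570;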

So, I have started to look into different solutions. I have found the following alternatives that we should consider:

  • VMware native HA with SRM for DR protection
  • SIOS
  • DH2i
  • Veritas

What I’m considering right now is to just put the SQL servers in VMware as is and use SRM for DR protection; it would give basically the same protection level as the Veritas solution: block-level snapshot replication in the NetApp instead of block-level volume replication with Veritas. Failover between local hosts on the primary site would be slightly slower because it would require the virtual machine to be restarted from scratch on a new host, compared to Veritas where only SQL Server has to start up. But Veritas has to take care of shared storage and unmounting and mounting disk groups, and that can also take some time. Failing over to the DR site is manual in Veritas and would be the same in VMware; data loss would possibly be the same if the line between the sites is not fast enough for high peak load.

Using some other third-party clustering like SIOS or DH2i would maybe be better than Veritas: less complicated, possibly cheaper, but something new that would need to be implemented.

With the protection that VMware, NetApp and Symantec provide, is it really necessary to add AlwaysOn protection as well if we don’t need readable secondaries or backup offloading?

What’s your opinion?


SQL Pass Rally 2013


Rally is back in Stockholm, and apparently this year’s event was the largest Rally event ever outside the US, even bigger than last time. That is nice, and even though I only participated part-time this year I just love it.

This blog post was written during the event but finished almost a week later. The timeline within the post is therefore a bit confusing, but I prefer to leave it that way so that you can read it as I experienced it.

This conference is obviously small compared to the PASS Summit in Charlotte a few weeks ago, but the speakers are just as good, the food seems even better and Tuesday’s entertainment sounds promising.

Monday

“Setup best practice of the year” – Brent Ozar


I attended Brent Ozar’s pre-con the first day, and it’s amazing that there is always something new to pick up. Brent has the same kind of background as me: he learned SAN and virtualization because the admins in those areas didn’t understand SQL-related problems. SQL Server is, after all, the most demanding server in most SAN or virtualization environments.

I won’t go into detail about all that was discussed during the day; I’m just going to comment on some of the notes I took during the session.

BrentOzar.com is a great source to dig deeper into all of this. There are tons of free material to explore.

Try out sp_AskBrent with @expertmode=1 to find what your SQL Server is waiting on.

Try sp_Blitz to get a start on finding trouble in your SQL Server.

Combine with Adam Machanic’s sp_WhoIsActive, use the @get_plans=1 parameter, and you have a wonderful combination of tools to troubleshoot your environment.
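A quick sketch of how these are typically called, assuming the procedures are installed (normally in master):

EXEC sp_AskBrent @ExpertMode = 1;    -- what the server is waiting on right now
EXEC sp_Blitz;                       -- prioritized health check findings
EXEC sp_WhoIsActive @get_plans = 1;  -- running requests including their query plans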

Look also for the triage_waitstats script to look deeper into waits.

The best advice to boost performance of your SQL Server is to buy memory instead of upgrading the SAN. RAM is so cheap these days that getting enough RAM to hold your complete DB in memory is the best way to get rid of storage IO.

To define your hardware for HA and DR the best way, you need to define RPO (Recovery Point Objective) and RTO (Recovery Time Objective). In other words: how much data can you lose and how long can your environment be down.
All systems need to be placed in the following matrix.

HA (RPO/RTO)         DR (RPO/RTO)
none to 1 sec        none to 1 sec
1 min                1 min
1 hour               1 hour
1 day                1 day

Depending on your choices above you then have several options to achieve this RPO and RTO: AlwaysOn, clustering, mirroring and replication in various combinations, together with underlying storage replication and such.
AlwaysOn is not always the best choice; it is still a product that requires a lot of hands-on work, and it’s not as easy as just running a wizard. You might end up with a less reliable environment if you don’t do it right.

Run more virtual hosts instead of instances. Combining multiple instances on one host is often more difficult than running multiple VMs instead. And if you are running Enterprise edition, licensing it at the hypervisor level means you can run an unlimited number of instances on that host. Might be a solution for your environment.

Always start all VMs with 2 CPUs and increase only when you can see waits on CPU; giving a VM more CPUs can actually make it slower.

I’m a big fan of PCI flash cards and the Intel 910 seems to be a bargain; I need to check it out. It comes in 600 and 800 GB sizes and delivers up to 2 GB/s.

PCI cards with caching software are maybe a good idea, but products like OCZ ZD-XL or Fusion-io with IOTurbine add an extra layer of potential problems, bugs etc. I was interested in these at first but I think for SQL Server it is better to avoid them.

“Putting SSD in a SAN is like flushing money down the toilet”, according to Brent Ozar. SSD is better placed locally in the SQL Server. The fabric will still be the biggest bottleneck.

A normal core these days is capable of receiving about 200 MB/s. To saturate all cores in a 2-socket server with 4 cores each, the SAN needs to deliver 2 × 4 × 200 = 1,600 MB/s.

Imagine the SAN needed to saturate a 2-socket server with 10 cores each: 2 × 10 × 200 = 4,000 MB/s…

VMware 5.5 with local SSD storage is ideal for SQL Server; it can be very good for Standard edition with its limited memory, since the cache can appear bigger with this solution. Less SAN IO.

Infinio also released a product last week that uses RAM in the VMware hosts to cache IO. All hosts form a pool of RAM that is shared to get a better cache hit rate. I would like to try that out in the future.

You can also check out brentozar.com/go/filecache for more info on this subject.

To check if there is enough RAM in your server, run perfmon. Target memory should be the same as max memory; otherwise we have memory pressure.
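If you prefer a query over perfmon, the same counters are exposed in a DMV. A minimal sketch:

SELECT counter_name, cntr_value / 1024 AS memory_mb
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Memory Manager%'
  AND (counter_name LIKE 'Target Server Memory%'
       OR counter_name LIKE 'Total Server Memory%');
-- If Total stays well below Target, or Target never reaches your
-- max server memory setting, the instance is under memory pressure.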

Tuesday

“Getting the most from your SAN – file and filegroup layout” – Stephen Archbold


User databases actually have the same hotspot problem as tempdb; adding extra files to a filegroup gives an immediate performance increase of up to 15%.
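As a sketch of what that looks like, with made-up database name, file name, size and path:

ALTER DATABASE [SalesDW]
ADD FILE
(
    NAME = SalesDW_Data2,
    FILENAME = 'D:\SQLData\SalesDW_Data2.ndf',
    SIZE = 10GB,
    FILEGROWTH = 1GB
)
TO FILEGROUP [PRIMARY];
-- Keep the files in the filegroup the same size so that SQL Server's
-- proportional fill spreads the IO evenly across them.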

Check that read-ahead is reaching 512 KB reads: test with an aggregate select on a fact table and watch Avg. Disk Bytes/Transfer in perfmon.

The following startup parameters for SQL Server can be used to get better IO performance; investigate and make sure you understand how they should be used and for which environments. Google for the SQL Server Data Warehouse Fast Track reference architecture from Microsoft and read it all.

-E (allocates more extents per data file at a time)
-T1117 (grows all files in a filegroup at the same time)
-T834 (uses large page allocations for the buffer pool)

“Use your brain to beat SQL Server” – Adam Machanic and Thomas Kejser


This session had a Sherlock Holmes kind of feeling. Thomas and Adam tried to teach us tricks we can use to manipulate how the optimizer creates query plans, especially when the query plans are not looking the way we want and we can’t fix it with normal solutions. I will not even try to recap what was said during the session, I was just amazed. Looking forward to the presentation material when it becomes available.

“Index Impact Workload Analysis” – Klaus Aschenbrenner


Klaus Aschenbrenner had an interesting session about the new SQL 2012 feature named Distributed Replay. This tool can be used to record a trace of all activity in a system for a specified time. The trace can then be moved to another system where it can be re-run. This gives excellent opportunities to test out indexes or other changes to a system without affecting the production environment. Load testing can also be done by utilizing up to 16 clients to run the trace.
I have only one problem with this, and that is that the production and test environments need to be very similar so that we don’t get different query plans and results just because of hardware differences.

In the evening on Tuesday there was a dinner with Brynolf and Ljung entertaining; they are two Swedish magicians who gave us an amazing show of magic and illusions.


Wednesday

“Data Warehouse in the Cloud – Marketing or Reality?” – Alexei Khalyako


Alexei tried to convince us that we can actually run DWs in Azure. He has a point, and for some customers this is probably a viable solution, but for the majority of customers I still think the performance is not enough. Give it a year and it will be possible.

“Automating Data Warehouse Patterns Through Metadata” – Davide Mauri


Davide is a great speaker and has a way of getting your attention that only Brent Ozar can beat. In this presentation he discussed different DW load patterns and how we can solve them, and in the end he gave us insight into how we can automate much of the “monkey work” in DW development with BIML (Business Intelligence Markup Language).

“HA and DR in the context of the SLA” – Tobiasz Koprowski


This was a nice reminder of what we actually have SLAs for. It’s not always as easy as just saying that a DB has 99.999% uptime; all the components a system is relying on must be added to the equation to get the real uptime. This Polish guy did a great job describing the consequences.

“Does Optimistic Concurrency Resolve Your Blocking Problems?” – Margarita Naumova


Maggie Naumova walked us through the problems with locking and how we can solve some of them with row-level versioning. I have personally used this on several systems to solve locking problems with good results.
She finished off by explaining how most of the locking problems we have today will be solved by SQL 2014.

In summary, I had some very nice days at Arlanda Airport where the conference was held. I met old friends and also made some new ones. I learned a lot, especially from Brent, who made me think twice about some upcoming purchases.

I’m sorry I could not attend SQL Rally in Amsterdam also. That would have been nice.

SSIS Unexpected termination when running multiple packages in parallel

An escalating problem at a customer site is SSIS terminating unexpectedly when running multiple SSIS packages in parallel.
We get the following error intermittently, sometimes after just a couple of minutes, sometimes after 2-3 hours of running.

We have applied the latest SQL 2012 CU3 patch and also a hotfix from Microsoft (http://support.microsoft.com/kb/2837964/en-us?sd=rss&spid=1044), but the problem persists. We get the following error in the Windows application log:

Faulting application name: ISServerExec.exe, version: 11.0.3349.0, time stamp: 0x513a9017
Faulting module name: ntdll.dll, version: 6.1.7601.17725, time stamp: 0x4ec4aa8e
Exception code: 0xc0000374
Fault offset: 0x00000000000c40f2
Faulting process id: 0x1fe0
Faulting application start time: 0x01ce4ffb609c5176
Faulting application path: C:\Program Files\Microsoft SQL Server\110\DTS\Binn\ISServerExec.exe
Faulting module path: C:\Windows\SYSTEM32\ntdll.dll
Report Id: 6b101cc8-bc06-11e2-8a88-e41f13b85634

Yesterday we finally found a blog post where someone had gotten rid of this problem by adding the SQL Agent account to the local admin group on the server. They were not able to figure out exactly which permission was missing. In our case, and on the first server where I tried this, the same account was used for both the SQL engine and the agent. It’s not good practice to give the SQL account local admin permissions, but in this case it’s the only solution I have found that seems to be working.

And guess what, it seems to have worked!!!

At least it worked last night… But this error is very intermittent and I will let it run for at least a week before I consider this a working solution.

We will be eagerly awaiting Cumulative Update 4, which is due to be released this week and might include a fix for this.

I’ll continue to work on the permission clue and see if something more can be found, and then I will update this post.

Update 2013-05-24

Giving the SQL service accounts local admin permissions didn’t help…
I applied a bunch of Windows .NET patches yesterday. We’ll see if that helps.