Tag Archives: Microsoft SQL Server

SQL Server HA and DR solutions?

I have been trying to figure out the best way to provide HA and DR protection for SQL Servers for some time now.

My customer is currently running SQL Server 2008 R2 on a Veritas cluster with Veritas Volume Manager providing block-level replication between the primary and DR sites. The cluster has 4 physical nodes on each site with multiple instances for different systems. Two of the hosts on each site have FusionIO flash cards for tempdb storage. All other data is stored on a NetApp SAN connected through dual 8 Gb FC HBAs. Symantec NetBackup 7.6 is used for backups; it uses the latest snapshot technology in NetApp and VMware to back up databases in just seconds. Restores are also done in just minutes.

An upgrade to SQL Server 2012 is in the pipeline and we are also looking into the possibility of making these servers virtual in VMware. Load on the physical boxes is not that high, and the performance of the VMware environment would be sufficient. VMware runs on HP blades; the blade chassis is connected to NetApp through dual (or more) 10 Gb Ethernet. Datastores in VMware use the NFS protocol. Veritas is considered expensive and the goal is to get rid of it.

IOmeter and CrystalDiskMark give us almost the same disk performance on physical and virtual servers. We are looking into the possibility of using Infinio and/or IOTurbine in the VMware hosts to get even better performance.

For the last 8 months we have been testing AlwaysOn to see if that is the solution for a new environment. AlwaysOn is very nice when used with non-shared storage, and both synchronous and asynchronous replication between hosts and sites seem to be working very well. We set up a cluster with 3 plus 3 hosts across the primary and DR sites and created AGs for both SharePoint environments and for other databases, big and small. We set up AGs local to only one site as well as AGs spanning both sites, and we used a file share witness that could also be moved to the DR site.

All worked well until we performed a planned DR test where we failed over all AGs to the DR site and then pulled the plug between the sites. The primary site cluster stayed up, but the DR site stopped because of lost quorum. My mistake of course; I had to manually change the quorum configuration after failing over. I have fixed that now with a PowerShell script that fails over all AGs, resumes replication and changes the quorum. So far so good.
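My script is PowerShell, but the failover step itself can be sketched in Python too (the AG names below are hypothetical, and the quorum reconfiguration is a separate cluster-level operation not shown here). A planned failover is issued on the secondary replica that should become primary:

```python
# Sketch only: build the planned-failover statement for each AG.
# These would be executed on the DR-site replica (e.g. via pymssql);
# changing the cluster quorum afterwards is a separate step.

def failover_statements(availability_groups):
    """One ALTER AVAILABILITY GROUP ... FAILOVER statement per AG."""
    return ["ALTER AVAILABILITY GROUP [%s] FAILOVER;" % ag
            for ag in availability_groups]

for stmt in failover_statements(["AG_SharePoint", "AG_Misc"]):
    print(stmt)
```
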

But there is still a problem. As I said, one of the AGs I have created is local to the servers on the DR site only; it's a database used only by an application installed on the DR site, so we don't need it on the primary site. The problem is that the cluster on the DR site goes down when we lose connectivity between the sites. Microsoft clustering is designed this way to avoid a split-brain situation, but it's not what I want.

One way to solve this would be to have separate AlwaysOn clusters, but that's expensive and it would be better to utilize the same servers for more than one thing.

Another way would be to have two FCI (Failover Cluster Instance) clusters and put AlwaysOn on top of that, but that would require shared storage for each FCI on each site, and in VMware this requires FC or iSCSI. If I understand it correctly it could work in Windows 2012 and VMware 5.5. But we are not there yet.

On top of this there is the problem with logins and jobs that need to be synchronized between the servers. Having six servers and 4 AGs makes this a bit complicated.

Jonathan Kehayias at SQLskills has written a small application to help out with this, but it's still not enough. Check it out here: http://www.sqlskills.com/blogs/jonathan/synchronize-availability-group-logins-and-jobs/
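As a minimal sketch of the login half of that problem (the login names are hypothetical; in practice the lists would come from querying sys.server_principals on each replica via pymssql), comparing two replicas could look like this:

```python
# Hypothetical sketch: find SQL logins present on the primary replica
# but missing on a secondary, so they can be created there by hand.
# Each input list would come from a query like:
#   SELECT name FROM sys.server_principals WHERE type IN ('S', 'U', 'G')

def diff_logins(primary_logins, secondary_logins):
    """Return (missing on secondary, extra on secondary), both sorted."""
    primary, secondary = set(primary_logins), set(secondary_logins)
    return sorted(primary - secondary), sorted(secondary - primary)

missing, extra = diff_logins(["app_login", "report_login"], ["app_login"])
print(missing)  # logins that need to be created on the secondary
```
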

So, I have started to look into different solutions. I have found the following alternatives that we should consider:

  • VMware native HA with SRM for DR protection
  • SIOS
  • DH2i
  • Veritas

What I'm considering right now is to just put the SQL Servers in VMware as-is and use SRM for DR protection; it would give basically the same protection level as the Veritas solution: block-level snapshot replication in the NetApp instead of block-level volume replication with Veritas. Failover between local hosts on the primary site would be slightly slower, because it would require the virtual machine to be restarted from scratch on a new host, compared to Veritas where only SQL Server has to start up. But Veritas has to take care of shared storage and unmounting and mounting disk groups, and that can also take some time. Failing over to the DR site is manual in Veritas and would be the same in VMware; data loss would possibly be the same if the line between the sites is not fast enough for high peak load.

Using some other third-party clustering like SIOS or DH2i would maybe be better than Veritas, less complicated and possibly cheaper, but it is something new that needs to be implemented.

With the protection that VMware, NetApp and Symantec provide, is it really necessary to add AlwaysOn protection as well if we don't need readable secondaries or backup offloading?

What's your opinion?


SQL Pass Rally 2013


Rally is back in Stockholm; apparently this year's event was the largest Rally event ever outside the US, even bigger than last time. That is nice, and even though I only participated part-time this year I just love it.

This post was written during the event but finished almost a week later. The timeline within it is therefore a bit confusing, but I prefer to leave it that way so that you can read it as I experienced it.

This conference is obviously small compared to the SQL Pass Summit in Charlotte some weeks ago, but the speakers are just as good, the food seems even better and Tuesday's entertainment sounds promising.


“Setup best practice of the year” – Brent Ozar


I attended Brent Ozar's pre-con the first day, and it's amazing that there is always something new to pick up. Brent has the same kind of background as me: he learned SAN and virtualization because the admins in those areas didn't understand SQL-related problems. SQL Server is, after all, the most demanding server in most SAN or virtualization environments.

I won’t go into detail about all that was discussed during the day, I’m just going to comment on some of the notes I took during the session.

BrentOzar.com is a great source to get deeper into all of this. There are tons of free material to explore.

Try out sp_AskBrent with @expertmode=1 to find what your SQL Server is waiting on.

Try sp_Blitz to get a start on troubles in your SQL Server.

Combine these with Adam Machanic's sp_WhoIsActive, use the @get_plans=1 parameter, and you have a wonderful combination of tools to troubleshoot your environment.

Look also for the triage_waitstats script to look deeper into waits.

The best advice to boost the performance of your SQL Server is to buy memory instead of upgrading the SAN. RAM is so cheap these days that getting enough RAM to hold your complete DB in memory is the best way to get rid of storage IO.

To size your hardware for HA and DR you need to define the RPO (Recovery Point Objective) and RTO (Recovery Time Objective). In other words: how much data can you lose, and how long can your environment be down?
Every system needs to be placed in the following matrix, once for HA and once for DR:

  • none to 1 sec
  • 1 min
  • 1 hour
  • 1 day

Depending on your choices above you then have several options to achieve this RPO and RTO: AlwaysOn, clustering, mirroring and replication in various combinations, together with underlying storage replication and such.
AlwaysOn is not always the best choice; it is still a product that requires a lot of hands-on work, and it's not as easy as just running a wizard. You might end up with a less reliable environment if you don't do it right.

Run more virtual hosts instead of instances. Combining several instances on one host is often more difficult than running multiple VMs instead. And if you are running Enterprise Edition, licensing it at the hypervisor level means you can run an unlimited number of instances on that host. Might be a solution for your environment.

Always start all VMs with 2 CPUs and increase when you can see waits on CPU; giving a VM more CPUs can actually make it slower.

I'm a big fan of PCI flash cards and the Intel 910 seems to be a bargain; I need to check it out. It comes in 600 and 800 GB sizes and delivers up to 2 GB/s.

PCI cards with caching software are maybe a good idea, but products like OCZ ZD-XL or FusionIO with IOTurbine add an extra layer of problems, bugs etc. I was interested in these at first, but I think that for SQL Server it is better to avoid them.

"Putting SSD in a SAN is like flushing money down the toilet", according to Brent Ozar. SSD is better placed local to the SQL Server; the fabric will still be the biggest bottleneck.

A normal core these days is capable of consuming about 200 MB/s. To saturate all cores in a 2-socket server with 4 cores each, the SAN needs to deliver 1600 MB/s.

Imagine the SAN needed to saturate 2 sockets with 10 cores each…
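The arithmetic behind those numbers, as a quick sketch (the 200 MB/s per core figure is Brent's rule of thumb, not a measured value):

```python
# MB/s a SAN must sustain to keep every core busy, assuming
# roughly 200 MB/s consumed per core.

def san_throughput_mb_s(sockets, cores_per_socket, mb_per_core=200):
    return sockets * cores_per_socket * mb_per_core

print(san_throughput_mb_s(2, 4))   # 2 sockets x 4 cores  -> 1600 MB/s
print(san_throughput_mb_s(2, 10))  # 2 sockets x 10 cores -> 4000 MB/s
```
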

VMware 5.5 with local SSD storage is ideal for SQL Server. It can be very good for Standard Edition with its limited memory, since the cache can appear bigger with this solution. Less SAN IO.

Infinio also released a product last week that uses RAM in the VMware hosts to cache IO. All hosts form a pool of RAM that is shared to get a better cache hit rate. I would like to try that out in the future.

You can also check out brentozar.com/go/filecache for more info on this subject.

To check if there is enough RAM in your server, run perfmon. Target memory should be the same as max memory; otherwise we have memory pressure.
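The same check can also be made from a query instead of perfmon. A sketch, assuming the two standard counters exposed in sys.dm_os_performance_counters (the values below are made up for illustration):

```python
# Sketch: flag memory pressure by comparing two perfmon counters that
# are also exposed in sys.dm_os_performance_counters:
#   'Target Server Memory (KB)' - how much memory SQL Server wants
#   'Total Server Memory (KB)'  - how much it has actually been granted
# If total stays below target, the instance wants more RAM than it has.

def memory_pressure(target_kb, total_kb):
    return total_kb < target_kb

# The counter values would come from a query like:
#   SELECT counter_name, cntr_value
#   FROM sys.dm_os_performance_counters
#   WHERE counter_name IN ('Target Server Memory (KB)',
#                          'Total Server Memory (KB)')
print(memory_pressure(16777216, 12582912))  # target 16 GB, granted 12 GB
```
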


“Getting the most from your SAN – file and filegroup layout” – Stephen Archbold


User databases actually have the same hotspot problem as tempdb; adding extra files to a filegroup can give an immediate performance increase of up to 15%.
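What adding files looks like, as a sketch (database name, filegroup, path and size are all hypothetical; this just generates the T-SQL statements that would then be run against the server):

```python
# Hypothetical sketch: build ALTER DATABASE ... ADD FILE statements to
# spread a filegroup over several equally sized files.

def add_file_statements(db, filegroup, base_path, count, size_mb=4096):
    stmts = []
    for i in range(1, count + 1):
        stmts.append(
            "ALTER DATABASE [%s] ADD FILE "
            "(NAME = '%s_data%d', FILENAME = '%s\\%s_data%d.ndf', "
            "SIZE = %dMB) TO FILEGROUP [%s];"
            % (db, db, i, base_path, db, i, size_mb, filegroup))
    return stmts

for stmt in add_file_statements("SalesDW", "FG_FACT", r"F:\Data", 4):
    print(stmt)
```
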

Check that you get read-ahead of 512 KB: test with an aggregate SELECT on a fact table and watch Avg. Disk Bytes/Transfer in perfmon.

There are startup parameters for SQL Server that can be used to get better IO performance; investigate and make sure you understand how they should be used and for which environments. Google for the SQL Server Data Warehouse Fast Track material from Microsoft and read it all.


“Use your brain to beat SQL Server” – Adam Machanic and Thomas Kejser


There was a Sherlock Holmes kind of feeling in this session. Thomas and Adam tried to teach us tricks we can use to manipulate how the optimizer creates query plans, especially when the query plans don't look the way we want and we can't fix it with normal solutions. I will not even try to recap what was said during the session; I was just amazed. Looking forward to the presentation material when it becomes available.

“Index Impact Workload Analysis” – Klaus Aschenbrenner


Klaus Aschenbrenner had an interesting session about a new SQL 2012 feature named Distributed Replay. This tool can be used to record a trace of all activity in a system for a specified time. The trace can then be moved to another system where it can be re-run. This gives excellent opportunities to test indexes or other changes to a system without affecting the production environment. Load testing can also be done by utilizing up to 16 clients to run the trace.
I have only one problem with this: the prod and test environments need to be very similar, so that we don't get different query plans and results just because of hardware differences.

In the evening on Tuesday there was a dinner with Brynolf and Ljung entertaining; they are two Swedish magicians who gave us an amazing show of magic and illusions.



“Data Warehouse in the Cloud – Marketing or Reality?” – Alexei Khalyako


Alexei tried to convince us that we can actually run DWs in Azure. He has a point, and for some customers this is probably a viable solution, but for the majority of customers I still think the performance is not enough. Give it a year and it will be possible.

“Automating Data Warehouse Patterns Through Metadata” – Davide Mauri


Davide is a great speaker and has a way of getting your attention that only Brent Ozar can beat. In this presentation he discussed different DW load patterns and how we can solve them, and at the end he gave us insight into how we can automate much of the "monkey work" in DW development with BIML (Business Intelligence Markup Language).

“HA and DR in the context of the SLA” – Tobiasz Koprowski


This was a nice reminder of what we actually have SLAs for. It's not always as easy as just saying that a DB has 99.999% uptime; all components a system relies on must be added to the equation to get the real uptime. This Polish guy did a great job describing the consequences.

“Does Optimistic Concurrency Resolve Your Blocking Problems?” – Margarita Naumova


Maggie Naumova walked us through the problems with locking and how we can solve some of them with row-level versioning. I have personally used this on several systems to solve locking problems with good results.
She finished off by explaining how many of the locking problems we have today will be solved by SQL 2014.

In summary, I had some very nice days at Arlanda Airport where the conference was held. I met old friends and also made some new ones. I learned a lot, especially from Brent, who made me think twice about some upcoming purchases.

I’m sorry I could not attend SQL Rally in Amsterdam also. That would have been nice.

Last day at #SQLPass summit 2013

It's been a very long week but now it comes to an end. I'm already looking forward to next year in Seattle and I have not even left Charlotte yet.

I'm getting a bit tired; all this new knowledge every day, and then all the social activities at night, is beginning to take its toll.

But this last day at SQLPass was also a great one. I started out with a session by Tim Chapman called "SQL Server Index Internals: A Deep Dive"; that was almost a bit too much at 8.00 in the morning, especially after last night's party at the NASCAR Hall of Fame. But Tim had great presentation skills and he kept us all awake with great demos.

Then I went to a session about Central Management Server and policy-based management with Ryan Adams. I have never really gotten the hang of policy-based management, but Ryan sorted it out just fine, and I now think it's time to start using this feature to get better control that all servers adhere to a given standard.

Then I went to a great session with David Klee, "Squeezing Top Performance from Your Virtualized SQL Server", and came away with some really useful tips on how to configure VMware, Windows and SQL Server to get top-notch performance. A lot of this can be found on his website.

Then for lunch there was an event called "Birds of a Feather Luncheon", where every table has a sign describing the topic to be discussed. I went for a table called AlwaysOn and got to talk to some great guys about the difficulties of clustering and AlwaysOn.

Then I continued on that track after lunch with a session called "CAT: AlwaysOn Customer Panel – Lessons Learned & Best Practices". Four guys from different companies described their environments and the problems they had encountered while setting up various HA & DR solutions. Very interesting.

The last session of this year's summit was Paul White describing parallel query execution in detail. I didn't know there was such a thing as a parallel deadlock, but Paul managed to create such a scenario and then described why it happened and how we may avoid it. He also went into detail about a lot of other parallel execution problems one might run into. Paul was also a very skilled presenter and managed to keep us awake even though it was late on Friday afternoon.

To summarize: this has been a fantastic week, with so many interesting sessions and so many magnificent speakers that it is hard to imagine where one could get this elsewhere.

I'm so looking forward to next year's SQL Pass Summit in Seattle. See you all there.



Planning my #SQLPass session schedule….

I started planning my session schedule for SQLPASS while I was waiting in line for immigration at O'Hare airport. It was a long line, so I had time to go through most of the details in the #Guidebook, and I picked everything I liked.

Now I have a schedule with 3, 4 or even more sessions I want to go to at any given time. How should I choose? As a #sqlfirsttimer I'm not sure what the best approach is. Should I go for the speakers I have read and listened to on webinars and YouTube, the ones I'm following on Twitter and who seem to really know their stuff? Or should I go for the ones I have never heard of and hope that they have something completely new to teach me?

There are so many interesting areas I would like to dig into. Should I stay in one area all the way, or should I mix them as much as possible to get new input?

BigData, SANLess Clustering, Extended Events, Power BI and Power Map, database corruption, AlwaysOn, Powershell, PDW, 2014 InMemory OLTP, Clustered updatable columnstore indexes, Indexes, Central Management Server, SQLCAT session, SQL Clinic, Query optimising, etc, etc…

Not to mention all the other activities: First-Timers orientation, Welcome reception, Exhibitor reception, #SQLRun, Sponsor breakfasts, Keynote, Community Appreciation Party at the NASCAR Hall of Fame, Birds of a Feather Luncheon, SQL Server Clinic and SQLCAT, SQL Kilt Day and #SQLKaraoke.

I'm starting to believe I will not make it to all of these events, but I will sure try. I will probably end up buying the DVD…

Monitor failed SQL jobs in Nagios

This script can be used to monitor for failed SQL agent jobs in Nagios.

It requires python-argparse and python-pymssql, which can be installed like this on Debian Linux:

sudo apt-get install python-pymssql

sudo apt-get install python-argparse

Script is then run like this:

./check_mssql_job_history.py -H server1 -U username -P password -c 1

Here are the optional parameters that can be used:

usage: check_mssql_job_history.py [-h] [-t TIMEOUT]
                                  [--query-timeout QUERY_TIMEOUT] [-w WARNING]
                                  [-c CRITICAL] -H HOST [-p PORT] -U USER -P
                                  PASSWORD [-j JOB] [-x EXCLUDE] [-l] [-v]

Check a SQL Server for failed jobs - results based on state of last run for
all jobs or for a specific job if specified

optional arguments:
  -h, --help            show this help message and exit
  -t TIMEOUT, --timeout TIMEOUT
                        Timeout for connection and login in seconds
  --query-timeout QUERY_TIMEOUT
                        Query timeout in seconds
  -w WARNING, --warning WARNING
                        Failed jobs before a warning alert is generated
  -c CRITICAL, --critical CRITICAL
                        Failed jobs before a critical alert is generated
  -H HOST, --host HOST  Host name or IP address to connect to
  -p PORT, --port PORT  SQL Server port number (only change if you know you
                        need to)
  -U USER, --user USER  User name to connect as (does not support Windows
                        built in or Active Directory accounts)
  -P PASSWORD, --password PASSWORD
                        Password for the user you are authenticating as
  -j JOB, --job JOB     A comma separated list of jobs to check instead of
                        all enabled jobs
  -x EXCLUDE, --exclude EXCLUDE
                        A comma separated list of jobs not to check
  -l, --list            This will list all jobs on your server. This does
                        not return a nagios check and is used for setup and
                        debugging
  -v, --verbose         This shows the Transact-SQL code that will be
                        executed to help debug

#!/usr/bin/env python
# Written by Joseph "jetole" Tole on 10/31/11
# This is a simple script because I saw the idea of monitoring failed jobs at
# http://goo.gl/qDqGL (shortened Nagios Exchange URL) however I really didn't
# want to implement an unnecessarily complex, multi part, compiled java program
# on my nagios hosts to perform this same check. I added a couple features as
# well such as listing the failed jobs and the time they had last run as well
# as allowing you to specify a single job to check which will return
# OK|CRITICAL for that job alone.

import sys
import argparse
import pymssql

# Nagios exit codes:
# 0 - OK
# 1 - Warning
# 2 - Critical
# 3 - Unknown
def nagios_exit(exit_type, msg):
    try:
        exit_code = int(exit_type)
        if exit_code < 0 or exit_code > 2:
            exit_code = 3
    except ValueError:
        exit_code = 3
    if exit_code == 0:
        status = "OK"
    elif exit_code == 1:
        status = "WARNING"
    elif exit_code == 2:
        status = "CRITICAL"
    else:
        status = "UNKNOWN"
    print "%s - %s" % (status, msg)
    sys.exit(exit_code)

# msdb stores run_date as yyyymmdd and run_time as hhmmss integers;
# convert them to a readable timestamp.
def run_datetime(run_date, run_time):
    run_date = str(run_date)
    run_time = str("%06d" % (int(run_time)))
    run_year = run_date[0:4]
    run_month = run_date[4:6]
    run_day = run_date[6:8]
    run_hour = run_time[0:2]
    run_minute = run_time[2:4]
    run_second = run_time[4:6]
    return "%s/%s/%s %s:%s:%s" % (run_month, run_day, run_year, run_hour, run_minute, run_second)

parser = argparse.ArgumentParser(description="Check a SQL Server for failed jobs - results based on state of last run for all jobs or for a specific job if specified")
parser.add_argument("-t", "--timeout", action="store", type=int, help="Timeout for connection and login in seconds", default="60", dest="timeout")
parser.add_argument("--query-timeout", action="store", type=int, help="Query timeout in seconds", default="60", dest="query_timeout")
parser.add_argument("-w", "--warning", action="store", type=int, help="Failed jobs before a warning alert is generated", default="0", dest="warning")
parser.add_argument("-c", "--critical", action="store", type=int, help="Failed jobs before a critical alert is generated", default="1", dest="critical")
parser.add_argument("-H", "--host", action="store", help="Host name or IP address to connect to", required=True, dest="host")
parser.add_argument("-p", "--port", action="store", type=int, help="SQL Server port number (only change if you know you need to)", default=1433, dest="port")
parser.add_argument("-U", "--user", action="store", help="User name to connect as (does not support Windows built in or Active Directory accounts)", required=True, dest="user")
parser.add_argument("-P", "--password", action="store", help="Password for the user you are authenticating as", required=True, dest="password")
parser.add_argument("-j", "--job", action="store", help="A comma separated list of jobs to check instead of all enabled jobs", dest="job")
parser.add_argument("-x", "--exclude", action="store", help="A comma separated list of jobs not to check", dest="exclude")
parser.add_argument("-l", "--list", action="store_true", help="This will list all jobs on your server. This does not return a nagios check and is used for setup and debugging", dest="list_jobs")
parser.add_argument("-v", "--verbose", action="store_true", help="This shows the Transact-SQL code that will be executed to help debug", dest="verbose")
results = parser.parse_args()

connect_host = results.host
if results.port != 1433:
    connect_host += ":%s" % (str(results.port))

try:
    conn = pymssql.connect(user = results.user, password = results.password, host = connect_host, timeout = results.query_timeout, login_timeout = results.timeout)
except pymssql.Error:
    nagios_exit(3, "Unable to connect to SQL Server")

cur = conn.cursor()

if results.list_jobs:
    tsql_cmd = """SELECT [name], [enabled]
FROM [msdb]..[sysjobs]"""
    if results.verbose:
        print "%s\n" % (tsql_cmd)
    cur.execute(tsql_cmd)
    rows = cur.fetchall()
    print "Jobs on %s" % (results.host)
    print "\"-\" at the beginning means the job is disabled\n"
    for row in rows:
        if int(row[1]) == 1:
            print "  %s" % (row[0])
        else:
            print "- %s" % (row[0])
    sys.exit(0)

# Find the last run of every enabled job and keep those whose last
# run failed (run_status = 0).
tsql_cmd = """SELECT [j].[name], [h].[run_date], [h].[run_time]
FROM [msdb]..[sysjobs] [j]
INNER JOIN [msdb]..[sysjobhistory] [h] ON [j].[job_id] = [h].[job_id]
INNER JOIN (
SELECT [job_id], MAX(instance_id) AS max_instance_id
FROM [msdb]..[sysjobhistory]
GROUP BY [job_id]
) [tmp_sjh] ON [h].[job_id] = [tmp_sjh].[job_id] AND [h].[instance_id] = [tmp_sjh].[max_instance_id]
WHERE [j].[enabled] = 1
AND [h].[run_status] = 0"""

if results.job:
    tsql_cmd += "\nAND (\n\t[j].[name] = '%s'" % (results.job.split(',')[0].strip())
    if len(results.job.split(',')) > 1:
        for x in results.job.split(',')[1:]:
            tsql_cmd += "\n\tOR [j].[name] = '%s'" % (x.strip())
    tsql_cmd += "\n)"
elif results.exclude:
    for x in results.exclude.split(','):
        tsql_cmd += "\nAND [j].[name] != '%s'" % (x.strip())

if results.verbose:
    print "%s\n" % (tsql_cmd)

cur.execute(tsql_cmd)
rows = cur.fetchall()
rowcount = cur.rowcount

if rowcount == 0:
    nagios_exit(0, "All jobs completed successfully on their last run")

failed_stats = "Number of failed jobs: %d - Failed Jobs: " % (rowcount)
for row in rows:
    failed_stats += "%s last run at %s, " % (row[0], run_datetime(row[1], row[2]))
failed_stats = failed_stats.rstrip(', ')

if rowcount >= results.warning and rowcount < results.critical:
    nagios_exit(1, failed_stats)
elif rowcount < results.warning:
    nagios_exit(0, "%d failed jobs is below the warning count specified." % (rowcount))
elif rowcount >= results.critical:
    nagios_exit(2, failed_stats)

nagios_exit(3, "This should never appear")

Data Quality Services setup with AlwaysOn

Setting up Data Quality Services with AlwaysOn is well described in this document:


I used that as guide when I installed DQS at a customer site today.

The procedure was very straightforward and I had no problems at all.

1. First I added the DQS feature from the SQL 2012 installation media. I did not have to reboot or restart SQL Server, so that was easy.

2. I applied the SP1 and CU3 patches that were previously applied to the rest of the SQL Server.

3. I repeated this on all the nodes where the Availability Group was set up to run.

4. After that I had to run the DQSInstaller.exe file in the …\MSSQL11.MSSQLSERVER\MSSQL\Binn folder on each node. This required me to enter a Database Master Key; you must use the same key on all nodes and be careful not to forget it. It is needed when a SQL upgrade takes place.

The DQSInstaller then created three databases.

5. Then I added a Windows account on the first node and gave it permissions to the DQS_MAIN database according to the document.

6. Then I deleted the new databases from the two other nodes in the cluster.

7. Now I could add the three databases as availability databases in the AG, and they were copied out to all nodes.

8. I then copied the Windows account to all three nodes. Keep in mind that AlwaysOn doesn't replicate any logins. I use an SP called dbs_CopyLogins, which can be downloaded here: http://www.sqlsoldier.com/wp/sqlserver/transferring-logins-to-a-database-mirror. It is very useful since it copies the password correctly.

9. Then I installed the DQS client from the SQL distribution and patched it. When I started the client I could just enter the AG listener name and successfully log in.

Staffan Olofsson
MCSE: Data Platform