Category Archives: Infrastructure

Virtualizing SQL server on Hyper-V

Here is very good withepaper about virtualizing SQL Server on Hyper-V. It’s a Microsoft written paper but I think it gives a very good insight about waht to consider when virtualizing SQL Server.

http://www.google.com.eg/url?sa=t&rct=j&q=&esrc=s&source=web&cd=2&cad=rja&ved=0CDIQFjAB&url=http%3A%2F%2Fdownload.microsoft.com%2Fdownload%2F6%2F1%2FD%2F61DDE9B6-AB46-48CA-8380-D7714C9CB1AB%2FBest_Practices_for_Virtualizing_and_Managing_SQL_Server_2012.pdf&ei=C_e6UsT6L-KQ0AWdl4DQDQ&usg=AFQjCNFDzRHvfSnvAieXDrt8WLZ5lbNH9Q&bvm=bv.58187178,d.d2k

Nutanix, is it a gamechanger ?

Ever heard of Nutanix ?

I think its time now to tell the story about something that might change the data center as we know it.

Traditional infrastructure for virtualization includes servers, network and storage. Normally operated by different departments within the IT organization. This often leads to problems when there is a performance issue, everybody start to blaming each other.

This is also in general a very expensive solution with both servers, switches and SAN equipment. All of this often duplicated to make it high available with redundant extra HW, often doing nothing most of the time.

With VDI becoming more and more common, even in smaller organization, it’s has become a challenge to provide adequate performance for that environment without being affected by other storage activity.

One way could be to separate storage for the VDI environment on it’s on storage, preferably on something with low latency, able to provide high IOPS during bootstroms etc. Here flash-based arrays like Nimbus, Purestorage, Violin or hybrid arrays like Nimble could play a part. Even the big storage providers like NetApp, EMC, HP, IBM now have both hybrid and pure flash-based arrays to compete with the new kids on the block.
But even with these solutions we still have a solution divided into three different areas. Server, network and  storage are just faster, but they will still need to be maintained by different groups of people. And they will still need to communicate over a network.

We are more and more moving towards the software defined data center with public, hybrid and private cloud becoming the standard for the new data center. We need to adopt to this and get rid off all the old legacy systems to be able to smoothly migrate resources between in-house and hosted solutions.

With IT becoming more centralized and outsourcing into cloud solutions becoming some kind of best practice, IT technicians must strive to make the infrastructure as easy to maintain as possible, otherwise it will be outsourced to someone doing it faster and cheaper.

Nutanix, delivering kind of appliance box with CPU and storage locally in the box is definitely on the right track here. By delivering a 2U box with four servers, each with 2 CPU’s, up to 512 GB RAM,SSD and SATA drives with a total of 8 TB useable storage with high IOPS and a distributed filesystem where everything scales linear when adding more 2U boxes it becomes a very easy to manage solution.

Inline dedupe, compression, remote replication to a DR site box, support for running Vmware, Hyper-V or KVM as hypervisor makes it a very attractive solution for running VM’s.

There are a number of different HW specs to choose from, from high density boxes with four nodes to low latency with 2 and added storage capacity or even with Nvidia GPUs to provide VDI solutions capable of delivering 3D and CAD performance.

if you are already running a Vmware shop, just add the box to your Vmware data center or DRS cluster and live migrate machines over to the new hosts.

There is so much new about this that I must urge you all to look deeper into it.  Below are links to some interesting sites and some withepapers.

Update #1:

vSAN is a new feature in vSphere 5.5 that have some of the features that Nutanix offers. This article explains Nutanix more in detail and also compares it to vSAN.

http://up2v.nl/2013/08/26/nutanix-virtual-computing-platform-compared-to-vmware-vsan-1-0/

http://www.nutanix.com

http://stevenpoitras.com/

http://stevenpoitras.com/the-nutanix-bible/

http://stevenpoitras.com/2013/11/advanced-nutanix-sql-server-nutanix-best-practices-released/

http://www.nutanix.com/blog/2013/01/02/2013-predictions-the-end-of-big-iron/

SQL Server HA and DR solutions ?

I have been trying to figure out the best way to HA and DR protect SQL servers for some time now.

My customer is currently running SQL server 2008 R2 on Veritas cluster with Veritas Volume Manager as block level replication between primary and DR site.  The cluster has 4 physical nodes on each site with multiple instances for different system. Two of the hosts on each site has FusionIO flash cards for tempdb storage.  All other data is stored in a NetApp SAN connected through dual 8 GB FC HBA’s. Symantec NetBackup 7.6 is used for backups, it’s using the latest snapshot technology in NetApp and Vmware to make backups of databases in just seconds. Restore is also done in just minutes.

Upgrade to SQL 2012 is in the pipe and we are also looking into the possibility of making these servers virtual in Vmware. Load on the physical boxes is not that high and the performance of the Vmware environment would be sufficient. Vmware is running on HP blades, blade chassi is connected to NetApp through dual (or more) 10 GB Ethernet. Datastores in Vmware uses NFS protocol. Veritas is considered expensive and the goal is to get rid of it.

IOmeter och Crystal diskmark gives us almost the same disk performance between physical and virtual servers. We are looking into the possibility of using either InfinIO and/or IOTurbine in the Vmware hosts to get even better performance.

Last 8 months we have been testing AlwaysON to see if that is the solution for a new environment. AlwaysOn is very nice when used with non-shared storage and both synchronous and asynchronous replication between hosts and sites seems to be working very well. We setup a cluster with 3 plus 3 hosts on primary and DR site and created AG’s for both Sharepoint environments and other both big and small databases. We setup AG’s local to only one site and AG’s spanning both sites and we used a file share witness that could also be moved to DR site.

All worked well until we performed  a planned DR test where we failed over all AG’s to the DR site and then pulled the plug between the sites. Primary site cluster stayed up but DR site stopped because of lost quorum…. My mistake of course, had to manually change the quorum configuration after failing over. Have fixed that now with a powershell script that fails over all AG’s, resumes replication and changes quorum. So far so good.

But there is still a problem. As I said, one of the AG’s I have created is only local to the servers on the DR site, it’s a database only used by application installed on the DR site so we don’t need it on the primary site. Problem is that the cluster on the DR site goes down when we lose connectivity between the sites..  Microsoft  Cluster is designed this way to avoid a split brain situation, but it’s not what I want….

One way to solve this would be two have separate AlwaysOn clusters, but that’s expensive and it would be better to utilize the same servers for more than one thing.

Another way would be two have two FCSI clusters and put AlwaysON on top of that, but that would require shared storage for each FCSI on each site and in Vmware this requires FC or ISCSI. If I understand it correctly it could work in Windows 2012 and Vmware 5.5. But we are not there yet.

On top of this there is the problem with logins and jobs that need s to be synched between the servers. Having six servers and 4 AG’s makes this a bit complicated.

Jonathan Kehayias at SQLskills has written a small application to help out with this, but it’s still not enough. Check it out here http://www.sqlskills.com/blogs/jonathan/synchronize-availability-group-logins-and-jobs/

So, I have started to look into different solutions. I have found the following alternatives that we should consider:

  • Vmware native HA with SRM for DR protection
  • SIOS
  • DH2i
  • Veritas

What I’m considering right now is to just put the SQL servers in Vmware as is and use SRM for DR protection, it would give basically the same protection level as the Veritas solution. Block level snapshot replication in the NetApp instead of block level volume replication with Veritas. Failover between local host on primary would be slightly slower because  it would require the virtual machine to restarted from scratch on a new hosts, compared to Veritas where only SQL server has to startup. But Veritas has to take care of shared storage and unmounting and mounting diskgroups and that can also take some time. Failing to DR site is manual in Veritas and would be the same in Vmware, data loss would possibly be the same if the line between the sites is not fast enough for high peak load.

Using some other third-party clustering like SIOS or DH2i would maybe be better than Veritas, less complicated, possibly cheaper but something new that needs to be implemented.

With the protection that Vmware, NetApp and Symantec provides is it really neccesary to add AlwaysOn protection as well if we don’t need readable secondarys or backup offloadning ?

What’s your opinion ?

Step-by-Step: Build Your Private Cloud with System Center 2012, Windows Server 2012, Hyper-V and Windows Azure

Very nice article describing how to create your own Private Cloud with site failover

http://blogs.technet.com/b/keithmayer/archive/2013/04/01/build-your-private-cloud-in-a-month-new-article-series.aspx

Combine it with Hydration Kit and yoy can build yur own POC environment fast and easy either localy or in Azure

http://www.deploymentresearch.com/Research/tabid/62/EntryId/98/The-Hydration-Kit-for-ConfigMgr-2012-R2-Preview-is-available-for-download.aspx

Windows 2012 R2 CTP and SQL Server 2014 CTP1 bits is now available for MSDN subscribers

Several developers reported availability of the new Windows 2012 R2 Datacenter preview , SQL 2014 CTP1 bits and System Center 2012 R2 last night. More news will be available during this weeks Tech Ed conference in Madrid.

Microsoft is said to focus on storage and networking in the new Windows 2012 R2 version and one of the news is automatic storage tiering, which allows the system to automatically decide which files are most accesses and place them on the fastest avaliable storage (SSD) within a Storage Space. Less frequently access files will be keept on normal hardrives.

Deduplication is also new in windows 2012 R2 and could potentially save lots of space, in particular when it comes to virtual hard disk in virtul environments. It might even increase performance for virtual hard drives when streamed servers are used.

On the networking side migration of virtual machines between hosts have been improved by compressing the virtual hard drive before copying. There is also a update to RDMA (remote direct memory access) which allows a server to be copied directly between each hosts memory without touching the CPU in either server, cutting transmission time in half, according to Microsoft.

There are also some updates to the Azure integration, one being the Hyper-V Recovery Service, which operates from Azure but can handle both on site and off site backup/recovery procedures.

Update 1: Download SQL 20124 CTP1 here : http://technet.microsoft.com/sv-se/evalcenter/dn205290?WT.mc_id=Social_TW_OutgoingEvents_20130625_25292_SQLServer

Monitor failed SQL jobs in Nagios

This script can be used to monitor for failed SQL agent jobs in Nagios.

It requires python-argparse and python-pymssql which can be installed like this in debian Linux:

sudo yum install pymssql

Sudo yum install python-argparse

Script is then run like this:

./check_mssql_job_history.py -H server1 -U username -P password -c 1

Hera are oprtional parameters that can be used.

usage: check_mssql_job_history.py [-h] [-t TIMEOUT]

[–query-timeout QUERY_TIMEOUT] [-w WARNING]

[-c CRITICAL] -H HOST [-p PORT] -U USER -P

PASSWORD [-j JOB] [-x EXCLUDE] [-l] [-v]

Check a SQL Server for failed jobs – results based on state of last run for

all jobs or for a specific job if specified

optional arguments:

-h, –help show this help message and exit

-t TIMEOUT, –timeout TIMEOUT

Timeout for connection and login in seconds

–query-timeout QUERY_TIMEOUT

Query timeout in seconds

-w WARNING, –warning WARNING

Failed jobs before a warning alert is generated

-c CRITICAL, –critical CRITICAL

Failed jobs before a critical alert is generated

-H HOST, –host HOST Host name or IP address to connect to

-p PORT, –port PORT SQL Server port number (only change if you know you

need to)

-U USER, –user USER User name to connect as (does not support Windows

built in or Active Directory accounts)

-P PASSWORD, –password PASSWORD

Password to for user you are authenticating as

-j JOB, –job JOB A comma seperate list of jobs to check instead of all

enabled jobs

-x EXCLUDE, –exclude EXCLUDE

A comma seperated list of jobs not to check

-l, –list This will list all jobs in on your server. This does

not return a nagios check and is used for setup and

debugging

-v, –verbose This shows the Transaction SQL code that will be

executed to help debug

#!/usr/bin/env python

# Written by Joseph "jetole" Tole on 10/31/11

# This is a simple script because I saw the idea of monitoring failed jobs at

# <a href="http://goo.gl/qDqGL">http://goo.gl/qDqGL</a> (shortened Nagios Exchange URL) however I really didn't

# want to implement an unnecessarily complex, multi part, compiled java program

# on my nagios hosts to perform this same check. I added a couple features as

# well such as listing the failed jobs and the time they had last run as well

# as allowing you to specify a single job to check which will return

# OK|CRITICAL for that job alone.

import sys

import argparse

import pymssql

# 0 - OK

# 1 - Warning

# 2 - Critical

# 3 - Unknown

def nagios_exit(exit_type, msg):

try:

exit_code=int(exit_type)

if exit_code &lt; 0 or exit_code &gt; 2:

exit_code = 3

except ValueError:

exit_code=3

if exit_code == 0:

status = "OK"

elif exit_code == 1:

status = "WARNING"

elif exit_code == 2:

status = "CRITICAL"

else:

status = "UNKNOWN"

print "%s - %s" % (status, msg)

sys.exit(exit_code)

def run_datetime(run_date, run_time):

run_date = str(run_date)

run_time = str("%06d" % (int(run_time)))

run_year = run_date[0:4]

run_month = run_date[4:6]

run_day = run_date[6:8]

run_hour = run_time[0:2]

run_minute = run_time[2:4]

run_second = run_time[4:6]

return "%s/%s/%s %s:%s:%s" % (run_month, run_day, run_year, run_hour, run_minute, run_second)

parser = argparse.ArgumentParser(description="Check a SQL Server for failed jobs - results based on state of last run for all jobs or for a specific job if specified")

parser.add_argument("-t", "--timeout", action="store", type=int, help="Timeout for connection and login in seconds", default="60", dest="timeout")

parser.add_argument("--query-timeout", action="store", type=int, help="Query timeout in seconds", default="60", dest="query_timeout")

parser.add_argument("-w", "--warning", action="store", type=int, help="Failed jobs before a warning alert is generated", default="0", dest="warning")

parser.add_argument("-c", "--critical", action="store", type=int, help="Failed jobs before a critical alert is generated", default="1", dest="critical")

parser.add_argument("-H", "--host", action="store", help="Host name or IP address to connect to", required=True, dest="host")

parser.add_argument("-p", "--port", action="store", type=int, help="SQL Server port number (only change if you know you need to)", default=1433, dest="port")

parser.add_argument("-U", "--user", action="store", help="User name to connect as (does not support Windows built in or Active Directory accounts)", required=True, dest="user")

parser.add_argument("-P", "--password", action="store", help="Password to for user you are authenticating as", required=True, dest="password")

parser.add_argument("-j", "--job", action="store", help="A comma seperate list of jobs to check instead of all enabled jobs", dest="job");

parser.add_argument("-x", "--exclude", action="store", help="A comma seperated list of jobs not to check", dest="exclude");

parser.add_argument("-l", "--list", action="store_true", help="This will list all jobs in on your server. This does not return a nagios check and is used for setup and debugging", dest="list_jobs")

parser.add_argument("-v", "--verbose", action="store_true", help="This shows the Transaction SQL code that will be executed to help debug", dest="verbose")

results = parser.parse_args()

connect_host = results.host

if results.port != 1433:

connect_host += ":%s" % (str(results.port))

try:

conn = pymssql.connect(user = results.user, password = results.password, host = connect_host, timeout = results.query_timeout, login_timeout = results.timeout)

except:

nagios_exit(3, "Unable to connect to SQL Server")

cur = conn.cursor()

if results.list_jobs:

tsql_cmd = """ SELECT [name], [enabled]

FROM [msdb]..[sysjobs]"""

if results.verbose:

print "%s\n" % (tsql_cmd)

cur.execute(tsql_cmd)

rows = cur.fetchall()

print "Jobs on %s" % (results.host)

print "\"-\" at the begining means the job is disabled\n"

for row in rows:

if int(row[1]) == 1:

print " %s" % (row[0])

else:

print "- %s" % (row[0])

sys.exit()

tsql_cmd = """SELECT [j].[name], [h].[run_date], [h].[run_time]

FROM [msdb]..[sysjobs] [j]

INNER JOIN [msdb]..[sysjobhistory] [h] ON [j].[job_id] = [h].[job_id]

INNER JOIN (

SELECT [job_id], MAX(instance_id) AS max_instance_id

FROM [msdb]..[sysjobhistory]

GROUP BY [job_id]

) [tmp_sjh] ON [h].[job_id] = [tmp_sjh].[job_id] AND [h].[instance_id] = [tmp_sjh].[max_instance_id]

WHERE [j].[enabled] = 1

AND [h].[run_status] = 0"""

if results.job:

tsql_cmd += "\nAND (\n\t[j].[name] = '%s'" % (results.job.split(',')[0].strip())

if len(results.job.split(',')) &gt; 1:

for x in results.job.split(',')[1:]:

tsql_cmd += "\n\tOR [j].[name] = '%s'" % (x.strip())

tsql_cmd += "\n)"

elif results.exclude:

for x in results.exclude.split(','):

tsql_cmd += "\nAND [j].[name] != '%s'" % (x.strip())

if results.verbose:

print "%s\n" % (tsql_cmd)

cur.execute(tsql_cmd)

rows = cur.fetchall()

rowcount = cur.rowcount

if rowcount == 0:

nagios_exit(0, "All jobs completed successfully on their last run")

else:

failed_stats = "Number of failed jobs: %d - Failed Jobs: " % (rowcount)

for row in rows:

failed_stats += "%s last run at %s, " % (row[0], run_datetime(row[1], row[2]))

failed_stats = failed_stats.rstrip(', ')

if rowcount &gt;= results.warning and rowcount &lt; results.critical:

nagios_exit(1, failed_stats)

elif rowcount &lt; results.warning:

nagios_exit(0, "%d failed jobs is below the warning count specified." % (rowcount))

elif rowcount &gt;= results.critical:

nagios_exit(2, failed_stats)

else:

nagios_exit(3, "This should never appear")

Monitor Sql server blocking sessions in Nagios

Here is a script that uses Freetds to monitor sql server blocking it can be used by Nagios to trigger alarms.

#Script is designed to be used by Nagios. It checks for blocking sessions in MSSQL.

#

# Requirements:

#

# FreeTDS 6.0+ (http://www.freetds.org/)

#

# Original check_mssql.sh written by Tom De Blende (tom.deblende@village.uunet.be) in 2003.

# Changed script to check for blocking sessions in a database by Staffan.olofsson@b3it.se in 2013

#

# Script takes 4 parameters 1. Servername 2.username 3.password 4. 2000 (for SQL 2000 and up)

# Freedtds has to be configured with connection information

#

# Current threshold for blocking is 35000 ms. This can be changed in the SQL code below.

#

# You might want to change these values:

tsqlcmd=’/usr/bin/tsql’

catcmd=`which cat`

grepcmd=`which grep`

rmcmd=`which rm`

mktempcmd=`which mktemp`

wccmd=`which wc`

sedcmd=`which sed`

trcmd=`which tr`

uniqcmd=`which uniq`

###################################################################################################################

hostname=$1

usr=$2

pswd=$3

srv=$4

if [ ! "$#" == "4" ]; then

echo -e "\nYou did not supply enough arguments. \nUsage: $0 <host> <username> <password> <version> \n \n$0 checks Microsoft SQL Server connectivity."

elif [ $tsqlcmd == "" ]; then

echo -e "tsql not found! Please verify you have a working version of tsql (included in the FreeTDS version 6.0+) and enter the full path in the script."

fi

exit="3"

# Creating the command file that contains the sql statement that has to be run on the SQL server.

tmpfile=`$mktempcmd /tmp/$hostname.XXXXXX`

if [ $srv == "7" ]; then

spid=7

elif [ $srv == "2000" ]; then

spid=50

else

echo -e "$srv is not a supported MS SQL Server version!" && exit "3"

fi

echo -e "SELECT host_name,program_name,login_time,login_name,convert(nvarchar(4000), [text]),wait_duration_ms

FROM sys.dm_os_waiting_tasks w

INNER JOIN sys.dm_exec_sessions s

ON w.session_id = s.session_id

INNER JOIN sys.dm_exec_requests r

ON s.session_id = r.session_id

OUTER APPLY sys.dm_exec_sql_text (r.sql_handle) t

WHERE s.is_user_process = 1

and wait_duration_ms >35000\ngo" > $tmpfile

# Running tsql to get the results back.

resultfile=`$mktempcmd /tmp/$hostname.XXXXXX`

errorfile=`$mktempcmd /tmp/$hostname.XXXXXX`

$tsqlcmd -S $hostname -U $usr -P $pswd < $tmpfile 2>$errorfile > $resultfile

$grepcmd -q "Login failed for user" $errorfile

if [ "$?" == "0" ]; then

$rmcmd -f $tmpfile $resultfile $errorfile;

echo CRITICAL – Could not make connection to SQL server. Login failed.;

exit 2;

fi

$grepcmd -q "There was a problem connecting to the server" $errorfile

if [ "$?" == "0" ]; then

$rmcmd -f $tmpfile $resultfile $errorfile;

echo CRITICAL – Could not make connection to SQL server. Incorrect server name or SQL service not running.;

exit 2;

fi

resultfileln=`$catcmd $resultfile | $wccmd -l | $sedcmd ‘s/ //g’`

if [ "$resultfileln" == "2" ]; then

$rmcmd -f $tmpfile $resultfile $errorfile;

echo CRITICAL – Could not make connection to SQL server. No data received from host.;

exit 2;

else

nmbr=`$catcmd $resultfile | $grepcmd -v locale | $grepcmd -v charset| $grepcmd -v 1\> | $sedcmd ‘/^$/d’ | $sedcmd ‘s/ //g’ | $wccmd -l | sed ‘s/ //g’`

hosts=`$catcmd $resultfile | $grepcmd -v locale | $grepcmd -v charset| $grepcmd -v 1\> | $sedcmd ‘/^$/d’ | $sedcmd ‘s/ //g’ | $uniqcmd -c | $trcmd \\\n , | $sedcmd ‘s/,$/./g’ | $sedcmd ‘s/,/, /g’ | $sedcmd ‘s/ //g’ | $trcmd \\\t " " | $sedcmd ‘s/ \./\./g’ | $sedcmd ‘s/ ,/,/g’`

$rmcmd -f $tmpfile $resultfile;

if [ "$nmbr" == "0" ]; then

echo "OK – MS SQL Server $srv has $nmbr blocks: $hosts" | sed ‘s/: $/./g’;

else

echo "$hosts $nmbr"

fi

exit 0;

fi

# Cleaning up.

$catcmd $tmpfile

$catcmd $resultfile

$catcmd $errorfile

$rmcmd -f $tmpfile $resultfile $errorfile

echo $stdio

exit $exit