SQL Server Health Check: How to Identify and Solve SQL Server Performance Issues

SQL Server Health Check: How to Identify and Solve SQL Server Performance Issues

Here at WARDY IT Solutions we live and breathe SQL Server every single day.

With years of experience in the Microsoft Data Platform, we’ve found that performance issues in SQL Server typically manifest as a bottleneck in one of three broad areas:

  • CPU
  • I/O, and\or
  • Memory.

A.k.a. the Performance Triangle.

Here are our top troubleshooting tips to release your bottleneck and give your server the SQL Health Check it needs.

Wait Statistics – the key to unlocking your SQL Server performance issues

SQL Health Check: Look at the Performance Triangle

The first step in resolving any SQL Server performance issue is to pin-point exactly where in the Performance Triangle the problem is coming from. When we’re performing a SQL Server Healthcheck for our customers we always start with Wait Statistics.

What are Wait Statistics?

SQL Server uses a queue mechanism for the different actions a request must take to satisfy the queries submitted to the database engine. For example, there is a queue for fetching data pages from the disk into the buffer pool and another for getting CPU time to process the request.

The time spent waiting in a queue is recorded by SQL Server for every thread of every request. These wait times are called the Wait Statistics. By analysing the Wait Statistics, you can see where your requests are bottlenecking. This helps to narrow down where your performance pains are on the Performance Triangle.

How to check your Wait Statistics

SQL Server expert Paul Randal maintains a very useful script for determining the key Wait Statistics on your server. This script excludes Wait Types that are not performance related. Meaning you can easily identify where the server is struggling for performance without clouding the results with any of the other wait types.

Paul keeps this script up-to-date with new versions and Wait Types, so it’s a good idea to regularly check for updates.

Since SQL Server only maintains Wait Statistics from startup, it’s also a good idea to store this wait data in a database to capture the history over longer time periods. This will help you baseline your server and more quickly identify slow moving performance degradations.

Analysing your Wait Statistics: the most common Wait Types

Once you have your Wait Statistics, you need to analyse them to reveal where the performance impacts lie. To help, we’ve compiled the most common performance impacting Wait Types and the area of the Performance Triangle they relate to. The prevalence of these wait types on your server will help narrow down your search for the root cause of performance problems.

This isn’t an exhaustive list – just a few common examples we often see when we’re performing a SQL Server Healthcheck.

Wait Type Wait Category Notes
CXPACKET CPU This Wait Type indicates parallel threads are waiting on their related threads to complete. This indicates potential CPU related server configuration issues, or query optimisation issues. This wait type is also normal on a system with parallelism enabled, so determining if the number\duration of the waits is a problem requires correlation with other statistics. Look for waits of this type with small, simple queries as an area of optimisation. Optimising simple queries with CXPACKET waits can make them run serially, and this is often quicker for smaller queries.
SOS_SCHEDULER_YIELD CPU/Memory This Wait Type indicates a thread exhausted the 4ms quantum for CPU execution time and yielded voluntarily. This is not necessarily a bad thing, it’s simply indicating high CPU use due to lack of contention for resources (CPU, Memory or IO). Scanning through pages of a large index, for example, can cause this wait on some systems. It can also indicate poor CPU performance or poor query design causing excessive CPU utilisation. If this wait is a high percentage of waits, check your server Power settings (High Performance is recommended) and review your top CPU queries for possible optimisation or index tuning.
WRITELOG IO This Wait Type indicates the IO subsystem cannot keep up with the log flush rate. This can be caused by disk configuration issues, database configuration issues or DML query optimisation issues. Check that your log files are on separate disks and your disks are formatted with 64K block sizes. Also validate that your VLF count is appropriate and your growth settings are not excessively large or small. Lastly, check your INSERT\UPDATE\DELETE queries and look to perform large operations in batches where possible to allow more frequent log flushes of smaller size.
PAGEIOLATCH_SH IO/Memory This Wait Type indicates a thread is waiting for a data page to be fetched from disk. Fetching data pages from disk occurs when a page has aged out of the buffer pool and needs to be retrieved from the disk. This can be normal behaviour or abnormal behaviour and resolving it depends on which it is. For example, highly transactional systems that perform frequent read-back of written data may see high waits of this type, but this doesn’t indicate an issue. A reporting system that relies on cached data to provide rapid report response times would want low numbers for this wait. You need to identify your top IO queries and tune them where possible. You may also need to increase the available memory for SQL Server as this can indicate the buffer cache is simply too small. Optimise first, increase resources later.
PAGELATCH_* IO These Wait Types indicate page contention – most commonly it is tempdb contention due to poor configuration or clustered index contention due to poor query design (loss of concurrent inserts on a clustered index can cause contention at the leaf level pages). If the contention is occurring on tempdb, validate your tempdb configuration meets best practise for size and number of files, separate disks etc. If contention is on a user database, you need to identify if read or write or both are encountering the contention. For write, you may need to reassess the clustering key for your clustered index. For read workload, analyse the queries and indexing strategy, additional NCIs may alleviate this, or you may need to remove excess indexes if they’re unused.
LCK_M_* IO/Memory These Wait Types indicate threads are being blocked by other threads, usually due to lock escalation. This can indicate query design issues, index design issues or poor disk level configurations. You should identify the queries most affected by this wait type and tune them where possible.
RESOURCE_SEMAPHORE Memory This Wait Type simply indicates that a query cannot get the memory needed to execute. This can be because of poor indexing, out of date statistics or poor query design. It can also indicate there is simply not enough memory available. As with PAGEIOLATCH_*, optimise first, then increase resources.

You can find a complete list of Wait Types here. Or chat with us at WARDY IT Solutions. We can help!

SQL Health Check: CPU Issues

Analysed your wait statistics and found most waits are CPU related? The next step is to determine if this is a configuration issue or a query design issue.

1. Common CPU Configuration Issues

  • SQL Server Standard Edition sockets and cores – SQL Server Standard Edition is limited to a maximum number of sockets and cores (4 and 16 respectively in SQL Server 2017). However, quite often virtual machines are built with a one-to-one ratio of cores to sockets. This means the SQL Server is only able to use 4 of 16 cores because it hits the 4-socket limit first. You need to look at reconfiguring the VM to have a higher core to socket ratio to ensure you’re not hitting the socket limit first. You can check for this issue with this script:SELECT scheduler_id, cpu_id, status
    FROM sys.dm_os_schedulers
    WHERE scheduler_id < 256
    AND status = 'VISIBLE OFFLINE'
  • Leaving parallelism settings at their server default – the Max Degree of Parallelism (MAXDOP) and Cost Threshold for Parallelism settings are, by default, 0 and 5. These defaults are legacy defaults based on much older, slower hardware expectations. They’re not appropriate for modern hardware specifications. Typically, if you haven’t profiled your system to determine the best settings, you start simple – ½ the number of cores up to a maximum of 8 for MAXDOP and 35-50 for Cost Threshold. If you have the time, analyse the queries executing on your server and identify the tipping point where queries begin to perform better in parallel over serial. Take the estimated cost of the queries at this tipping point and use the middle ground as your Cost Threshold. This ensures the queries that perform better in parallel will run parallel and those that perform better serially will run serially.There are caveats to this though. For example, SharePoint requires that the server MAXDOP setting is 1, effectively disabling parallelism. This is also a setting you will need to monitor and tune over time. As your databases grow, you may find that your cost threshold needs to increase, or new application developments might mean you need to lower the threshold.
  • Power Plan settings – a default installation of Windows Server uses the Balanced power plan which lowers the CPU performance to save power. Typically, this is overwritten by Group Policy. However, it is a good idea to check your server is using the High Performance power plan, and your physical hosts (for virtual machines) do not have any power saving settings enabled in BIOS. You can find a useful PowerShell script for checking the CPU clock speed to validate this issue here.

2. CPU Query Design

High CPU use can be an indicator of poor query design. If you have validated your configurations are correct, the next step is to analyse your queries for high CPU use and possible fixes. A list of common causes for high CPU in your queries are:

  • Parallelism on low cost queries
  • Cardinality estimation issues, due to poor statistics or implicit conversions
  • Excessive compilation or recompilation
  • UDFs, TVFs and computation intensive queries
  • Row-by-Row operations
  • Calculated fields in SELECT statements or non-persisted calculated columns in tables.
  • High I/O operations

You should analyse your top queries by CPU cost for possible optimisation. Out of date statistics and implicit conversions are some of the most common causes of high CPU utilisation and poor query plan choice, and are also often quite simple to fix.

SQL Health Check: Common Memory Configuration Issues

For performance issues residing in memory, you’ll need to look at whether this is a common configuration or query design issue – just like we did with CPU.

1. Memory: Common Configuration Issues

  • Setting Max Server Memory too low – this is the most obvious memory misconfiguration. Limiting the available space for the buffer pool can cause SQL Server to go to disk for data too much and slow everything down. In addition, setting it too high can starve the OS and other applications of memory. It’s important to balance these competing restrictions. Monitor non-SQL application memory requirements to make sure you’re leaving enough resources and look to increase the available memory if need be.
  • Enabling Lock Pages in Memory – this can help prevent the OS from forcing SQL Server to page out memory. Provided you have catered for your OS and other applications sufficiently, the risks aren’t too substantial for enabling this. If you haven’t analysed your system, then be aware that this setting can lead to an “Out of Memory” condition on your server and result in outages.
  • Turn on the server option ‘Optimise for Adhoc Workloads’ – on most modern systems, this option should be enabled. It enforces more efficient usage of the plan cache and is particularly effective if you have a system with a lot of Adhoc queries that aren’t properly parameterised, particularly those with applications using ORMs (NHibernate, EntityFramework etc).
  • Analyse CHECKPOINT operations – these can cause performance issues related to memory. If you’re seeing high Checkpoints/sec (in PerfMon), check if there is a database with AUTO_CLOSE enabled or if a user is running DBCC CHECK commands during peak periods. There are several operations that trigger a CHECKPOINT including several automatic ones, but typically they won’t cause performance issues. AUTO_CLOSE has been known to cause performance problems and DBCC CHECK commands are very resource intensive.
  • Check with your sysadmins that your hypervisor is reserving resources for your SQL Servers – if your SQL Servers are virtual, check to ensure you’re not getting memory dynamically. The hypervisor may decide to revoke memory from the SQL Server if another machine is requesting memory. This has a drastic impact on SQL Server performance. SQL Server best practise for virtualised environments it to reserve the resources for the SQL Server guest to prevent dynamic reallocation.

2. Memory: Query Design

There are several factors to consider when your queries are slowed by memory bottlenecks.

  • Firstly, check your execution plans for queries with excessive memory requests or excessive memory grants. These queries will often have Sort or Hash Match operations contributing to the excess memory requirements. Try to tune or eliminate these operations as much as possible by rewriting your code or adjusting your indexing.
  • Then, ensure your statistics are up-to-date and eliminate implicit or explicit conversions from your query as cardinality estimation affects memory requests as well. This should help alleviate memory grant issues.
  • Lastly, ensure you’re querying accurately to return only the data you need. If your applications are applying filters to data after SQL Server provides the data set from a query, this might mean you’re reading more data then you need. Move those predicates into your query so it is more selective. With proper indexing this can greatly reduce the memory requirements of queries returning larger data sets.

SQL Health Check: IO Issues

Just like CPU and Memory it’s important to establish whether your IO Wait Issues are configuration or query design issues.

1. Common IO Configuration Issues

  • Check that your data, log, tempdb and backup files are on separate drives – for physical servers, these should be separate physical disks. For virtual machines or SAN-attached storage, separate LUNs/spindles are preferred. However, even without separate LUNs/spindles, you should separate them at the OS level – this is still Microsoft’s recommendation, even in Azure. Check that your data, log, tempdb and backup drives are formatted with 64K block sizes. SQL Server reads and writes in 64K blocks and performs best when the underlying disks align.
  • In an AlwaysOn Availability Group or Mirrored setup, verify the Disk Sector Size matches between all replicas – seeing the error message “There have been NNN misaligned log IOs which required falling back to synchronous IO” in your SQL Server error log? This indicates different disk sector sizes. If you cannot fix the disk sector size in your storage subsystem, apply the hotfix supplied by Microsoft. You can find more information in this Microsoft blog.

2. IO Query Design

  • Only do the minimum amount of work – this is the key to limiting the impact of IO Waits in query. If you’re fetching, inserting or updating data, be selective. Make sure you filter the data as much as possible in the WHERE clause. For INSERT\UPDATE\DELETE operations, performing in batches can help limit the log IO required.
  • Indexing – this is also crucial to limiting the impact of IO. Including the right columns as key columns, or included columns, prevents SQL from having to perform key lookups. Excessive include columns can cause excessive IO, so it is a definite balancing act to get your indexing right. Removing unused indexes is another way to limit the IO workload. Every update to a table must update at least the clustered index (if there is one) and any non-clustered indexes that have that column as a key or include column. Indexes that are not being used by SELECT statements are using IO to be kept up to date while providing no benefit for your read workload – remove them.
  • Leveraging Columnstore indexes and data compression – this can also improve IO performance by limiting the amount of IO required to fetch data. Generally, SQL can decompress using CPU cycles much quicker than storage subsystems can return uncompressed data, and Columnstore indexes, when designed correctly, can achieve substantial data compression.

How healthy is your SQL Server?

If don’t have the time to investigate the underlying issues or if you run into something we haven’t covered above, it may be worth considering specialist help. During our SQL Server Healthcheck service, WARDY IT Solutions’ Data Platform consultants carry out a comprehensive examination of your SQL Server environment. We’ll highlight areas of concern and provide actionable recommendations to improve the performance and availability of your SQL Server infrastructure. And if you prefer, we can also implement the recommendations from the work plan. Download an example of our SQL Server Healthchek report & work plan here.

Or if it’s all too much – talk to us about managing your SQL Server.

Like we said, we live and breathe SQL Server and would love to support your SQL database needs.