Friday, July 24, 2015

Querying IO Statistics, the quick and dirty way

(Apologies for formatting, still trying to figure this thing out.....)

Reducing IO is one of the "easiest" way to speed up SQL queries.  There are off course many other ways, but for many situations, IO is the most bang for the time spent optimizing.

Recently I came across a query that really needed help.  Not going to go into WHY it needed help, as that is a full on article by itself.  One of the tools I use for optimizing is SET STATISTICS IO ON.  This will give a nice line of reads per table used in a query.  (There are plenty of articles online about how to use this information).  This particular query had some, uhm, scary numbers.  And many of the tables where re-used multiple times.  With 2 (yes 2, not 2 thousand, or 2 million) records in the base table that needed processing, I saw values of over 1 million table scans.  At one part in the query plan with 1002 records in the base table, there was a node with 11 billion records output....

I started looking for a way to compile these numbers so I could compare, and get a better overall picture of what I was looking at. Below is a typical line from IO Stats.  This is not the complete line, just the first few pieces of information in it:

Table 'XYZ'. Scan count 5, logical reads 18575......

I decided what I need is a tool that I can relatively easily load these lines into a table, and be able to query and see the differences between different modifications I make to the query.


First we need a table to store and modify the IO lines:

CREATE TABLE [dbo].[StatsImport](
    
[StatsLine] [varchar](MAX) NULL,
    
[StatsXML] [xml] NULL
)  
ON [PRIMARY] TEXTIMAGE_ON [PRIMARY]



To get the stats we need, have these two lines before you run a query:

SET STATISTICS io ON
SET NOCOUNT ON

The nocount will make them all come out as one nice big list if there are multiple queries, this makes the next step easier.

Now take the text from messages and load it into the StatsImport table:


SET QUOTED_IDENTIFIER  OFF
DELETE FROM 
StatsImport
INSERT INTO StatsImport (StatsLine)VALUES("Table 'ABC'. Scan count 5, logical reads 18575, physical reads 68, read-ahead reads 16895, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.",),
(
"Table 'DEF'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.",),
(
"Table 'GHI'. Scan count 5, logical reads 8929, physical reads 63, read-ahead reads 8822, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.")
 

Now the tricky thing is to get this into a format that can easily be put into a table, I decided to go via XML (GASP!!!!!).

Part of the solution, was done with this excellent article by  , mostly the part of how to get the lines set to be valid XML.  But since I wanted to make all the lines query-able, had to make some slight modifications.


SET QUOTED_IDENTIFIER  OFF

UPDATE 
StatsImport
  SET StatsLine REPLACE(StatsLine"'", "'")
UPDATE StatsImport
SET StatsLine REPLACE(StatsLine' reads''_reads')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine'Table ''<Detail Table="')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine'Reads ''reads="')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine'. Scan count','" Scancount="')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine'.''" />')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine',''"')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine,',','"')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine'lob ''lob_')UPDATE StatsImport
SET StatsLine REPLACE(StatsLine,'-''_')
UPDATE StatsImport
SET StatsLine '<Output> ' StatsLine ' </Output>'
UPDATE StatsImport
SET StatsXML StatsLine
  


At this point we have all the data in a valid XML format in the StatsXML column.

Now, if we wanted to just read it and display it, all we would have to do is query that column, I however, wanted to make it so I could do all kinds of crazy things with it.

Table to store the shredded XML:


CREATE TABLE [dbo].[IOStats](
     
[TableName] [varchar](128) NOT NULL,
     
[ScanCount] [int] NOT NULL,
     
[LogicalReads] [int] NOT NULL,
     
[PhysicalReads] [int] NOT NULL,
     
[ReadAheadReads] [int] NOT NULL,
     
[RunIdentifier] [varchar](50) NULL
ON [PRIMARY]

Now to parse it into this table:

SET QUOTED_IDENTIFIER ON
INSERT INTO 
IOStats
    
SELECT
            
c.value('@Table[1]''varchar(128)'AS tableName,
            
c.value('@Scancount[1]''integer'AS ScanCountc,
            
c.value('@logical_reads[1]''integer'AS logical_reads ,
            
c.value('@physical_reads[1]''integer'AS physical_reads  ,
            
c.value('@read_ahead_reads[1]''integer'AS read_ahead_reads
    
--c.value('@lob_logical_reads[1]', 'integer') as lob_logical_reads
            
'Original'
    
FROM StatsImport
            
CROSS APPLY StatsXML.nodes('Output/Detail'AS T(C)


The line that reads 'Original' is just a text identifier for this load of stats.  Now you can see, I did not include all the data from stats as I did not need it for this particular problem, I think for most of you it would be easy to modify this if you need the lob etc. values.  One caveat here is you might need to go to a larger data type than integer if you have particularly crazy tables/queries.

With a little careful naming of your RunIdentifier values, it becomes easy to get output in the order you want.

So how did I use this?  Two ways.  First one was to see changes between different versions of the query.  With a baseline loaded as 'Original' and my first version loaded as 'New':


WITH TableNames AS (SELECT DISTINCT TableName FROM IOStats WHERE TableName NOT LIKE '#%')
IOStats2 AS (
   
SELECT
       
i.RunIdentifier
       
i.TableName
       
SUM(i.ScanCountAS ScanCount
       
SUM(i.LogicalReadsAS LogicalReads
       
SUM(i.PhysicalReadsAS PhysicalReads
       
SUM(i.ReadAheadReadsAS ReadAheadReads
   
FROM IOStats I
   
WHERE TableName NOT LIKE '#%'
   
GROUP BY i.RunIdentifieri.TableName
   
)SELECT TN.TableName,  
   
ISNULL(New.ScanCount,0) - ISNULL(Orig.ScanCount,0AS ScanCountChange,
   
ISNULL(New.LogicalReads,0) - ISNULL(Orig.LogicalReads,0AS LogicalReadsChange,
   
ISNULL(New.PhysicalReads,0) - ISNULL(Orig.PhysicalReads,0AS PhysicalReadsChange,
   
ISNULL(New.ReadAheadReads,0) - ISNULL(Orig.ReadAheadReads,0AS ReadAheadReadsChange--into #TablesFROM TableNames TNLEFT OUTER JOIN IOStats2 Orig
   
ON TN.TableName Orig.TableName
       
AND Orig.RunIdentifier 'Original'LEFT OUTER JOIN IOStats2 New
   
ON TN.TableName New.TableName
       
AND New.RunIdentifier 'new'ORDER BY TN.TableName ASC

The output looks like below, although table names here has been changed to protect the guilty.  Basically this shows the reduction in reads on the different tables.



Another way I used it was for a more overall view of changes:

SELECT RunIdentifier,SUM(scanCountSumScanCountSUM(LogicalReadssumLogicalReadsFROM IOStatsGROUP BY RunIdentifier


The 1k, 6k and 20k, are how many values I started with that would eventually need an update.  Although the particular query I was optimizing just inserts it into a temporary table that then will have further manipulations done to it.  I didn't even attempt 20k records with the original query, since I know when that happens (which is what caused me to start this project) we had to kill the job after 3 hours.  There are some abnormalities in this result that I need to go back and re-test, but at least I can easily see that the new query scales much much better as the number of records goes up.

So what is left to do?  I think writing a PowerShell script to be able to just save the IO Statistics to a file and load to the table and run the parsing.  Make the file name be the name of the parse.  That way re-loading into the tables would be much more straightforward, as well as make it possible to have my developers send me the stats and load them should be straight forward.