Anders On SQL: SentryOne cleanup, PerformanceAnalysisDataTableAndIndexCounter table

SentryOne is a great tool for monitoring servers, but with a lot of "chatty" servers, the amount of data collected can become unwieldy. With about 65 servers monitored, the delete functionality in SentryOne could not keep up with the deletes in the PerformanceAnalysisDataTableAndIndexCounter (hereafter just refered to as the table....) table. At the point I got brought in to clean it up there was close to a billion records in this table, and we were about to add another 110+ servers to be monitored. The table itself was about 320 GB, the cleanup function was not working, and the manual method provided by SentryOne support could not handle the deletes without causing a lot of blocking and time out issues.

The delete function provided had multiple manual steps, and the where clause was not well aligned with the indexes on the table. To be able to handle cleanup of this size, and not cause undue blocking and timeouts in the collection of data, I decided to find a better way.

Looking at the code provided by SentryOne, and digging a bit in the database itself, found how I could get the deletes into smaller, adjustable batches.

The first problem was to find the boundaries for the delete, and this step is the single most time consuming step in the process:

SELECT @EndID = MAX(id) FROM table WITH (nolock) WHERE TIMESTAMP <= @EndTimeStamp -- this step can take time
Since data in this table is just written to and never updated, using nolock here is ok.

After this it becomes a simple loop until you meet a goal in batches. The delete is done with a simple top statement in a CTE.

After I was done catching up to the 15 days retention rate we wanted, which is hardcoded into the procedure to avoid accidentally deleting more than we wanted and can be changed if your retention rate is different, the procedure was set up to simply be called once a day. We decided to set the retention rate in SentryOne to be 1 day longer than the procedure, so if anything was ever changed SentryOne delete routine would kick in if this one fails.

Hopefully this is useful for you if you run into this problem with SentryOne. The same setup can be used for other tables as well, but PLEASE check with their support before you start to delete data that it is ok since some tables are detail "collector" tables, some are summaries that you need for your time lines. The same format was used to delete in a couple of other tables, including the one that collects deadlock information.......

The code in its entirety is on Github, or below. I would highly recommend getting it from Github if you are going to use it instead of here.

The parameters to use needs to be adjusted for your use and server performance, the defaults are just a safe small run of deletes.

@BatchSize increased as I got less data in the table, and how busy the system was. The trick is to get each batch to be big enough you will get done in a reasonable time with your cleanup, but not so large that they cause time outs. Only one way to figure out, run the procedure and keep a close eye on your system while it is running.

@MaxIterations is simply how many times to loop through running deletes.

USE SQLSentry
-- use SentryOne -- depends if new install, or upgradeCREATE PROCEDURE dbo.BatchDeletePerformanceAnalysisDataTableAndIndexCounterData
   @EndDateTime DATETIME,
   @MaxIterations INT = 1,
   @BatchSize INT = 1000
AS/******************
Anders Pedersen
2017-01-18

Made to batch delete data out of the PerformanceAnalysisDataTableAndIndexCounter table.
Delete statement provided by SentryOne could not handle a lot of data without
a lot of manual intervention (manual batches).

This uses the column with the primary key on it for deletes instead of the timestamp column,
this substantially speeds up deletes.

*/BEGIN
   SET NOCOUNT ON
   --set DEADLOCK_PRIORITY low -- if you see a lot of other processes being victim, set this
   --(unless it is more important to get it cleaned up than loosing a few records)
-- get first time stamp
   DECLARE @BeginTimeStamp INT, @EndTimeStamp INT
   DECLARE @BeginID bigint, @EndID bigint
-- sanity check, do not let it delete to a date less than 15 days back.  Making it 16 to cover the entire day.  Close enough
   IF DATEDIFF(DAY, @EndDateTime, GETDATE()) <= 16 OR @EndDateTime IS NULL
       SET @EndDateTime = DATEADD(DAY, -16, GETDATE())
   SELECT @EndTimeStamp = [dbo].[fnConvertDateTimeToTimestamp] (@EndDateTime)

   -- get ID for that newest one to delete
   SELECT @EndID = MAX(id) FROM PerformanceAnalysisDataTableAndIndexCounter WITH (nolock) WHERE TIMESTAMP <= @EndTimeStamp -- this step can take time
   SELECT @BeginID = MIN(id) FROM PerformanceAnalysisDataTableAndIndexCounter WITH (nolock) WHERE ID <= @EndID

   DECLARE @cnt  bigint = 0
   DECLARE @X INT = 1
   WHILE @x <= @MaxIterations AND @BeginID < @EndID
   BEGIN
       IF @x <> 1 -- skip wait on first iteration. Save time if you only running small batches
           WAITFOR DELAY '00:00:03'; -- 3 second delay to let transactions through.  This might need adjusted
       WITH CTE AS
       (
           SELECT TOP (@BatchSize) * FROM PerformanceAnalysisDataTableAndIndexCounter WHERE ID BETWEEN @BeginID AND @EndID ORDER BY ID ASC
       )
       DELETE FROM CTE
       SELECT @cnt += @@ROWCOUNT
       SET @x +=1
 
 
   END
   SELECT @cnt AS RecordsDeleted

END

3 comments:

UnknownMarch 5, 2017 at 10:27 AM
Great post! Thorough and thoughtful. =^) Personally, I'm a little surprised our team didn't think of that in the first place. I know that both Aaron and I have been using batch size adjustments for process huge numbers of rows since the 1990's. I think it's just one of those outlyer issues where not many customers have a large enough estate to warrant this approach. Again, good post!

-Kev
GregGonzalezMay 9, 2017 at 9:03 AM
Anders,
This is a nice script... but this growth should never have happened as that table should never get large at all. After some quick analysis on your (former) environment, the root of the issue was a bug in our index analysis process -- if your S1 monitoring service and repository database were in a different time zone than the monitored target, the analysis could get stuck in a loop and run continually for hours, logging a bunch of duplicate data and leading to the issue you saw. This was immediately apparent by opening up the Defrag Schedule calendar, where the duplicate executions were plainly visible. The quick fix was to simply disable the index analysis schedule for the server in question. The excess data has stopped coming in, and we were able to disable the delete script. We have already coded a fix and it should be in the next SentryOne release.

Thanks,
Greg Gonzalez
CEO, SentryOne

Thursday, March 2, 2017

SentryOne cleanup, PerformanceAnalysisDataTableAndIndexCounter table

3 comments: