Sunday, July 2, 2017

Archiving Data

Many times as a DBA I have taken over a system that has been gathering data for years.  The worst I inherited had about 400 million records in the main table, an average of 24 related records in the secondary table per primary record, and data going back 12+ years.  In 2017, this might not sound that bad, but back in 2002....

Because the data had to be retrievable for any date, I could not simply delete the very old records.  These tables also took constant inserts and updates, so keeping them available mattered: the time each table spent locked had to stay acceptable, with room for waiting transactions to finish.

The solution I came up with archives the data in variable-size batches.  Modern versions of SQL Server offer other ways to do this, but the nice thing about this method is that it works regardless of SQL Server version or edition.  Azure SQL DB would need some modification to make it archive to a separate database.
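As one example of a newer alternative: on SQL Server 2005 and later, a DELETE with TOP and an OUTPUT INTO clause can move a batch of rows into an archive table in a single atomic statement.  This sketch is illustrative only (the Sale_Archive table name and the cutoff date are assumptions), and note it handles just one table; for a header/detail pair like the one below, you still need something like the batch loop in this post to keep the two tables consistent.

```sql
-- Move one batch of old header rows atomically: rows are deleted from
-- dbo.Sale and their values captured into dbo.Sale_Archive in one statement.
-- Assumes Sale_Archive has the same columns as Sale; names are illustrative.
DELETE TOP (5000) FROM dbo.Sale
OUTPUT deleted.SalesID, deleted.CustomerID, deleted.SalesPersonID,
       deleted.TotalSales, deleted.SaleDate
INTO dbo.Sale_Archive (SalesID, CustomerID, SalesPersonID, TotalSales, SaleDate)
WHERE SaleDate < '2010-01-01'  -- cutoff date chosen for the example
```

Run in a loop until @@ROWCOUNT hits zero, this gives you the same batched behavior without the temp tables, at the cost of requiring SQL Server 2005+.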

The following sample illustrates archiving data from two tables, a typical Sales header and detail pair, where data consistency must be maintained.


CREATE OR ALTER PROCEDURE prc_ArchiveSalesData (@StartDate date, @EndDate date, @BatchSize INT)
AS
BEGIN
    SET NOCOUNT ON
    SET XACT_ABORT ON

    DECLARE @rowcount INT

    -- set end date to be one day after the date sent in, to account for time data.
    -- Better performance than converting the data in the where clause
    SET @EndDate = DATEADD(DAY, 1, @EndDate)

    -- check for existing temp tables, drop and create
    IF OBJECT_ID('tempdb..#AllItems') IS NOT NULL
        DROP TABLE #AllItems

    CREATE TABLE #AllItems (
        SalesID INT
    )

    CREATE INDEX idx_AllItems ON #AllItems(SalesID) -- lots of joins to happen on this

    IF OBJECT_ID('tempdb..#CurrentBatch') IS NOT NULL
        DROP TABLE #CurrentBatch

    CREATE TABLE #CurrentBatch (
        SalesID INT
    )

    CREATE INDEX idx_Current ON #CurrentBatch(SalesID)

    -- gather all items for the given date range
    INSERT INTO #AllItems
    SELECT S.SalesID FROM Sale S WHERE S.SaleDate >= @StartDate AND S.SaleDate < @EndDate

    -- loop outside the transaction; that way, on a failure or a need to cancel,
    -- the successful batches stay committed
    INSERT INTO #CurrentBatch
    SELECT TOP (@BatchSize) SalesID FROM #AllItems ORDER BY SalesID ASC -- technically no need for this order by, but it makes debugging easier

    SELECT @rowcount = @@ROWCOUNT

    WHILE @rowcount > 0
    BEGIN
        BEGIN TRANSACTION
            -- copy to archive tables
            INSERT INTO SalesArchive.dbo.Sale (SalesID, CustomerID, SalesPersonID, TotalSales, SaleDate)
            SELECT S.*  -- this would make the archive fail if we change the table, which we would want
            FROM Sale S
            INNER JOIN #CurrentBatch CB
                ON S.SalesID = CB.SalesID

            INSERT INTO SalesArchive.dbo.SalesDetail (SalesDetailID, SalesID, ItemID, Quantity)
            SELECT SD.*
            FROM SalesDetail SD
            INNER JOIN #CurrentBatch CB
                ON SD.SalesID = CB.SalesID

            -- delete from production tables, detail rows first
            DELETE SD
            FROM SalesDetail SD
            INNER JOIN #CurrentBatch CB
                ON SD.SalesID = CB.SalesID

            DELETE S
            FROM Sale S
            INNER JOIN #CurrentBatch CB
                ON S.SalesID = CB.SalesID

            -- delete the current batch from all items
            DELETE AI
            FROM #AllItems AI
            INNER JOIN #CurrentBatch CB
                ON AI.SalesID = CB.SalesID

            -- clear the completed batch
            DELETE FROM #CurrentBatch
        COMMIT TRANSACTION        -- committing the transaction here makes it possible to stop the archiving and not lose everything

        WAITFOR DELAY '00:00:10'  -- 10 second pause before continuing, adjust as needed

        INSERT INTO #CurrentBatch
        SELECT TOP (@BatchSize) SalesID FROM #AllItems ORDER BY SalesID ASC

        SELECT @rowcount = @@ROWCOUNT
    END -- rowcount loop
END -- proc
  
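Calling the procedure is straightforward; the dates and batch size here are illustrative and would be tuned to the system's load and the 10-second pause between batches.

```sql
-- Archive all sales from 2002 through 2004, 5000 header rows per batch.
-- Dates and batch size are example values only.
EXEC prc_ArchiveSalesData
    @StartDate = '2002-01-01',
    @EndDate   = '2004-12-31',
    @BatchSize = 5000
```

Smaller batches mean shorter lock times on the production tables but a longer total run; start small and increase the batch size as monitoring allows.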

Hope you find this useful if you ever find a need for archiving large amounts of data.

