Introduction
Health Warning - This code features the infamous GOTO in the SQL statements
We had a problem. A big problem. Our centralised alarm event database system was growing to extremes: bigger than the box could handle, and throwing money at it wasn't the answer; it would only delay the problem to another day.
This particular system was taking the data from two offshore platforms and storing it in a centralised alarm/event historian in an onshore 'bunker'.
The two platforms normally pushed about 25,000 records per day each; however, if a significant platform event occurred or a fault existed on a field device, this would typically rise to around 100,000 events, or push towards 1,000,000 in a fault scenario.
Every so often the server would fill up, run out of disk space and prevent the offshore servers from flushing their local buffers to the historian.
The manufacturer provided a manual tool to purge history. This would take a long time and would regularly fail, for reasons like running out of the disk space needed to execute the queries. It could take days of picking away at the data, slowly running larger and larger time frames, until sufficient space was free again to put the server back 'online'. During this time, offshore data buffers were overflowing and data was being lost. Not critical, but annoying. If a platform tripped for any reason while the purge was ongoing, chances are we would not be able to analyse the event in any great detail.
The basic schematic of the (relevant parts of the) system is shown below:
A Quick Look At The Data
It was easy enough to open up the SQL management tools and have a look at the data and tables; this would help in understanding some of the issues.
Looking at the various tables for things that relate to Events and Alarms, the following schema elements were identified.
Now, we are starting to get an understanding of what is going on.
Step 1 - Understanding The Problem
The main problem was easy: not enough server resources, i.e. disks that were too small. As stated above, this could easily be fixed temporarily by throwing money at the problem, but that wouldn't solve the issue long term.
The secondary problem was probably the most significant. The purge tool would not run with clients still feeding data. The tool was proprietary third-party software and no one knew what it was doing internally; there was no documentation. Not forgetting, this was a manual tool: someone had to remember to check the SQL storage space and run the tool as necessary to free up space.
The first utility I decided to use to help understand what was going on was SQL Server Profiler. Starting this up and watching the activity allowed me to see which processes were connecting etc., enabling filters to be set to simplify the job of peeking under the hood.
Using the profiler, the purge tool was initiated and, by watching the queries coming through, it was easy to pick up which SQL statements were being fired off against the box, in what order, how far they were getting, and why things were failing.
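As a side note, if you don't have the profiler to hand, you can get a rough view of what is hitting the server from the dynamic management views available from SQL Server 2005 onwards. This is just an illustrative sketch, not part of the purge solution:

-- Peek at the statements currently executing against the server,
-- excluding our own session
Select r.session_id, r.start_time, t.text As RunningStatement
From sys.dm_exec_requests r
Cross Apply sys.dm_exec_sql_text(r.sql_handle) t
Where r.session_id <> @@SPID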
Recovered Queries
There were effectively eight steps recovered from the profiler. These were:
1. Select * From Comment Where CommentID In
   (Select CommentID From AlarmDetail Where AlarmID In
   (Select AlarmID From AlarmMaster Where OriginationTime < 'yyyy/mm/dd'))
2. Delete From AlarmDetail Where AlarmID In
   (Select AlarmID From AlarmMaster Where OriginationTime < 'yyyy/mm/dd')
3. Delete From Comment Where CommentID In (xxxx,yyyy,zzzz,....)
4. IF @@TRANCOUNT > 0 Rollback Tran
5. Delete From AlarmMaster Where OriginationTime < 'yyyy/mm/dd'
6. Delete From Events Where EventTime < 'yyyy/mm/dd'
7. Delete From TagStatus Where ChangeTime < 'yyyy/mm/dd'
8. IF @@TRANCOUNT > 0 Commit Tran
The 'yyyy/mm/dd' values were the same across each step and were the date submitted via the purge tool GUI.
The IDs used in (xxxx,yyyy,zzzz,....) were the record IDs recovered from Step 1 of the process.
You can also see the nested Selects; these are probably not helping with the size of the query and the intermediate processing required.
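Purely as an illustration, and assuming the table relationships are as the recovered queries imply, Step 1 could have been expressed with joins rather than nested INs (give or take duplicate rows where a comment is referenced by more than one detail record):

-- Illustrative rewrite of Step 1 using joins instead of nested INs
Select c.*
From Comment c
Inner Join AlarmDetail ad On ad.CommentID = c.CommentID
Inner Join AlarmMaster am On am.AlarmID = ad.AlarmID
Where am.OriginationTime < 'yyyy/mm/dd'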
@xyz denotes a local variable used within a SQL query/procedure, and @@xyz denotes a built-in server-level variable.
We are definitely getting somewhere now. What I also observed was that the purge tool failed at the point one of the offshore clients pushed data to the server, and that is down to the Rollback statement, which is conditional on @@TRANCOUNT, i.e. if a transaction is still pending, roll back and don't go any further.
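If you haven't met @@TRANCOUNT before, here is a quick demonstration of how it behaves within a session:

Select @@TRANCOUNT   -- 0: no transaction open
Begin Transaction
Select @@TRANCOUNT   -- 1
Begin Transaction
Select @@TRANCOUNT   -- 2: nesting increments the count
Commit Transaction   -- inner commit just decrements the count to 1
Rollback Transaction -- rolls the lot back, count drops straight to 0
Select @@TRANCOUNT   -- 0 again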
Step 2 - Working A Solution
Looking at what we recovered, it was easy to see where there were no dependencies. I also proved this by trying to delete data in the wrong order, which resulted in Foreign Key constraint exceptions.
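For anyone wanting to check the dependencies up front rather than by trial and error, the foreign keys can be listed straight from the system views (SQL Server 2005 onwards); a minimal sketch:

-- List foreign key relationships, child table -> parent table
Select fk.name As ConstraintName,
       Object_Name(fk.parent_object_id) As ChildTable,
       Object_Name(fk.referenced_object_id) As ParentTable
From sys.foreign_keys fk
Order By ParentTable, ChildTable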
I also wanted to do this in small batches, to keep the transaction log requirements down.
The process had to be fully automated. To do this, I set up a maintenance job that ran every night during a typically quiet time, e.g. 3am. This would be when it was very unlikely that anyone would be polling the database to investigate plant issues that had previously occurred. Note: the data is classified as 'non-critical' and after about four months it is pretty much useless; any events from before then have already been fully investigated. For this reason, we are happy to delete the data rather than export it to longer term storage.
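As an aside, a job like this can also be created in T-SQL using the standard SQL Server Agent procedures in msdb (mine was set up through the management GUI; the job and step names below are purely illustrative, and @command would hold the script from the relevant stage):

USE msdb
Exec dbo.sp_add_job @job_name = N'Nightly Alarm DB Purge'
Exec dbo.sp_add_jobstep @job_name = N'Nightly Alarm DB Purge',
     @step_name = N'Purge Events',
     @subsystem = N'TSQL',
     @database_name = N'WWALMDB',
     @command = N'/* Stage 1 script goes here */'
Exec dbo.sp_add_jobschedule @job_name = N'Nightly Alarm DB Purge',
     @name = N'Daily 3am',
     @freq_type = 4,              -- daily
     @freq_interval = 1,          -- every day
     @active_start_time = 030000  -- HHMMSS, i.e. 03:00:00
Exec dbo.sp_add_jobserver @job_name = N'Nightly Alarm DB Purge', @server_name = N'(local)'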
In the maintenance job I created four SQL queries that would run in sequence. These four queries would:
- Delete the Event records
- Delete the AlarmDetail records
- Delete the AlarmMaster records
- Delete the Alarm Comment records
Stage 1 - Remove Event Records
At the start of the query you will see we tell the server which database to USE, WWALMDB in this case. I can only assume that this name is short for WonderWare ALarM DataBase, as it was a WonderWare InTouch HMI (human machine interface). For more about this product, visit: www.wonderware.com
Next up, there are a few declarations for the variables that will be used in the query. We set these up as appropriate, using the DATEADD function to calculate the earliest date for which records will be retained.
We then establish the number of records to delete and simply loop through them, deleting and committing the transactions in the batch size defined.
We use @@ROWCOUNT to get the number of records deleted by the previously executed statement; this allows us to count down the records remaining as we loop through each batch.
The other stages are done in the same way, but just changing the relevant table and field names where required.
USE WWALMDB

-- Working variables
Declare @Days as BigInt             -- retention period in days (negative = days back from now)
Declare @StartDate as DateTime      -- earliest date to retain; everything older gets purged
Declare @Count as bigint            -- total records eligible for deletion
Declare @RowCountTotal as bigint    -- running total of records deleted so far
Declare @BatchSize as bigint        -- records deleted per transaction
Declare @MSG as nvarchar(256)       -- progress message buffer

Set @Days = -120
Set @StartDate = DateAdd(dd, @Days, GetDate());

Set @MSG = 'Start Date For Purge: ' + cast(@StartDate as varchar);
RAISERROR (@MSG,0,1) With NOWAIT    -- severity 0 with NOWAIT gives an immediate progress message

Set @RowCountTotal = 0
Set @BatchSize = 10000

-- How many records are we deleting in total?
Select @Count = (Select Count(EventID) From Events Where EventTime < @StartDate);
Set @MSG = 'Records To Delete = ' + cast(@Count as varchar)
RAISERROR (@MSG,0,1) With NOWAIT

LoopStart:
IF (@Count - @RowCountTotal) > @BatchSize
Begin
    -- More than a full batch still to go: delete one batch, then loop again
    Begin Transaction
        Delete Top (@BatchSize) From Events Where EventTime < @StartDate
        Set @RowCountTotal = @RowCountTotal + @@RowCount
    Commit Transaction
    Goto LoopStart
End
Else
    If (@Count - @RowCountTotal) > 0
    Begin
        -- Final partial batch: delete only what is left
        Begin Transaction
            Delete Top (@Count - @RowCountTotal) From Events Where EventTime < @StartDate
            Set @RowCountTotal = @RowCountTotal + @@RowCount
        Commit Transaction
    End

Set @MSG = 'Completed Purge Operation of Events.';
RAISERROR (@MSG,0,1) With NOWAIT
Set @MSG = 'Total Records Purged: ' + cast(@RowCountTotal as varchar);
RAISERROR (@MSG,0,1) With NOWAIT
Stage 2 - Remove Alarm Detail Records
USE WWALMDB

Declare @Days as BigInt
Declare @StartDate as DateTime
Declare @Count as bigint
Declare @RowCountTotal as bigint
Declare @BatchSize as bigint
Declare @MSG as nvarchar(256)

Set @Days = -120
Set @StartDate = DateAdd(dd, @Days, GetDate());

Set @MSG = 'Start Date For Purge: ' + cast(@StartDate as varchar);
RAISERROR (@MSG,0,1) With NOWAIT

Set @RowCountTotal = 0
Set @BatchSize = 10000

Select @Count = (Select Count(AlarmDetailID) From AlarmDetail Where TransitionTime < @StartDate);
Set @MSG = 'Records To Delete = ' + cast(@Count as varchar)
RAISERROR (@MSG,0,1) With NOWAIT

LoopStart:
IF (@Count - @RowCountTotal) > @BatchSize
Begin
    Begin Transaction
        Delete Top (@BatchSize) From AlarmDetail Where TransitionTime < @StartDate
        Set @RowCountTotal = @RowCountTotal + @@RowCount
    Commit Transaction
    Goto LoopStart
End
Else
    If (@Count - @RowCountTotal) > 0
    Begin
        Begin Transaction
            Delete Top (@Count - @RowCountTotal) From AlarmDetail Where TransitionTime < @StartDate
            Set @RowCountTotal = @RowCountTotal + @@RowCount
        Commit Transaction
    End

Set @MSG = 'Completed Purge Operation of AlarmDetail.';
RAISERROR (@MSG,0,1) With NOWAIT
Set @MSG = 'Total Records Purged: ' + cast(@RowCountTotal as varchar);
RAISERROR (@MSG,0,1) With NOWAIT
Stage 3 - Remove Alarm Master Records
USE WWALMDB

Declare @Days as BigInt
Declare @StartDate as DateTime
Declare @Count as bigint
Declare @RowCountTotal as bigint
Declare @BatchSize as bigint
Declare @MSG as nvarchar(256)

Set @Days = -120
-- Note the extra day here: AlarmMaster is purged one day behind AlarmDetail,
-- presumably so the detail records referencing a master are already gone
Set @StartDate = DateAdd(dd, @Days - 1, GetDate());

Set @MSG = 'Start Date For Purge: ' + cast(@StartDate as varchar);
RAISERROR (@MSG,0,1) With NOWAIT

Set @RowCountTotal = 0
Set @BatchSize = 10000

Select @Count = (Select Count(OriginationTime) From AlarmMaster Where OriginationTime < @StartDate);
Set @MSG = 'Records To Delete = ' + cast(@Count as varchar)
RAISERROR (@MSG,0,1) With NOWAIT

LoopStart:
IF (@Count - @RowCountTotal) > @BatchSize
Begin
    Begin Transaction
        Delete Top (@BatchSize) From AlarmMaster Where OriginationTime < @StartDate
        Set @RowCountTotal = @RowCountTotal + @@RowCount
    Commit Transaction
    Goto LoopStart
End
Else
    If (@Count - @RowCountTotal) > 0
    Begin
        Begin Transaction
            Delete Top (@Count - @RowCountTotal) From AlarmMaster Where OriginationTime < @StartDate
            Set @RowCountTotal = @RowCountTotal + @@RowCount
        Commit Transaction
    End

Set @MSG = 'Completed Purge Operation of Alarm Master.';
RAISERROR (@MSG,0,1) With NOWAIT
Set @MSG = 'Total Records Purged: ' + cast(@RowCountTotal as varchar);
RAISERROR (@MSG,0,1) With NOWAIT
Stage 4 - Remove Alarm Comment Records
USE WWALMDB

Declare @Days as BigInt
Declare @StartDate as DateTime
Declare @Count as bigint
Declare @RowCountTotal as bigint
Declare @BatchSize as bigint
Declare @MSG as nvarchar(256)

Set @Days = -120
-- The extra day again: comments are purged one day behind the
-- AlarmDetail records that reference them
Set @StartDate = DateAdd(dd, @Days - 1, GetDate());

Set @MSG = 'Start Date For Purge: ' + cast(@StartDate as varchar);
RAISERROR (@MSG,0,1) With NOWAIT

Set @RowCountTotal = 0
Set @BatchSize = 10000

Select @Count = (Select Count(CommentTime) From Comment Where CommentTime < @StartDate);
Set @MSG = 'Records To Delete = ' + cast(@Count as varchar)
RAISERROR (@MSG,0,1) With NOWAIT

LoopStart:
IF (@Count - @RowCountTotal) > @BatchSize
Begin
    Begin Transaction
        Delete Top (@BatchSize) From Comment Where CommentTime < @StartDate
        Set @RowCountTotal = @RowCountTotal + @@RowCount
    Commit Transaction
    Goto LoopStart
End
Else
    If (@Count - @RowCountTotal) > 0
    Begin
        Begin Transaction
            Delete Top (@Count - @RowCountTotal) From Comment Where CommentTime < @StartDate
            Set @RowCountTotal = @RowCountTotal + @@RowCount
        Commit Transaction
    End

Set @MSG = 'Completed Purge Operation of Comment.';
RAISERROR (@MSG,0,1) With NOWAIT
Set @MSG = 'Total Records Purged: ' + cast(@RowCountTotal as varchar);
RAISERROR (@MSG,0,1) With NOWAIT
Points of Interest
The Alarm Master stage would nearly always fail towards the end, as there would be some foreign key constraints still present. This wasn't a problem though, as those records would be swept up in the next day's purge cycle.
I also noted that the TagStatus query wasn't required, as there was never any data present in this table.
I am no SQL guru and this certainly was an interesting challenge. Poring over SQL Server Books Online for hours on end, lots of trial and error until, Bingo, it works!
Because it works in smaller transactional batches, the control system can continue to push new records to the database while the purge is running. The transaction log is also kept smaller.
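If you are curious about the effect on the log while tuning the batch size, SQL Server will report per-database log usage with a standard DBCC command:

-- Report transaction log size and percentage used for each database
DBCC SQLPERF(LOGSPACE)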
Also, and it's a big also, we did not need to go and spend more money on hardware that was going to be made redundant anyway.
This went on to serve us well for a long time (5+ years) and is reaching its retirement shortly as we have nearly finished the migration to a new control system with much improved alarm historian features.
Elegant it may not be, but it works. And more importantly, it works reliably. No more spending hours recovering the system, picking off data to get us through another couple of weeks, only to repeat the exercise.
Sometimes, a hacky hack is just what is needed. Even if there are a few GOTOs in the mix :)
Back-up Lesson
You will notice that the dates in the code are 2010, but the narrative says it has been in use for 5+ years. Well, we had a server meltdown and nobody had taken backups of the queries, so I had to rewrite them. They are probably not written the same the second time round, although from memory I think they are better than the first attempt; in principle they do the same thing.
This time, I kept copies of the scripts...
History
- 30th November 2013 - Narrative updates
- 29th November 2013 - First article version