SQL Server: Purging data using switch out
On very large database tables, most of my clients archive data by extracting it first and then purging the old rows based on a date column, using something like:
DELETE FROM <SchemaName>.<TableName> WHERE <DateField> < DATEADD(DAY, -X, GETDATE());
Not only does this usually take a long time, it is also dangerous.
If you stick with the DELETE approach:
- Index the column used to select the purged data. This way, SQL Server can do an index seek and avoid a more massive lock on the table.
- Delete in small batches, in order to keep the SQL transactions small and reduce the lock duration as well.
- Consider taking log backups between batches, or switching the recovery model to SIMPLE if the DELETE is massive (depending, obviously, on your SLA regarding data loss).
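For reference, a batched DELETE along those lines might look like the following sketch (table, column and the 30-day retention are placeholders):

```sql
-- Purge in small batches to keep transactions short and locks limited.
-- <SchemaName>.<TableName> and <DateField> are placeholder names.
DECLARE @BatchSize INT = 5000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM <SchemaName>.<TableName>
    WHERE <DateField> < DATEADD(DAY, -30, GETDATE());

    IF @@ROWCOUNT < @BatchSize
        BREAK;  -- last batch deleted fewer rows than the batch size: done

    -- In FULL recovery, a log backup between batches keeps the log from growing.
END
```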
But I would go another route:
Starting with SQL Server 2016 SP1, formerly Enterprise-only features such as table partitioning are available in Standard Edition. Even if we put all the partitions on the same filegroup, partitioning is still valuable in order to purge data fast and painlessly.
If the large table you need to purge already exists (which is usually the case), it is worth a maintenance window so that, afterwards, you can purge in a much nicer way.
Assuming you need to do a monthly purge keeping only one month's worth of data, you can create a rolling partition by adding a persisted computed column that holds the month number, such as:
ALTER TABLE <SchemaName>.<TableName> ADD MonthPartition AS ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) PERSISTED;
Remember two things:
- The persisted computed column cannot be NULL, so the ISNULL function is important: it tells the SQL Server engine that the computed column is not nullable.
- Convert the result to a specific datatype in order to minimize the impact of the persisted column on storage and overall performance. In this case there can only be 12 months, so a TINYINT is more than sufficient.
Once the maintenance is done, create the partition function and scheme, then drop and recreate the clustered index (usually the primary key) to include the partition column (example of a clustered PK on an ID column):
CREATE PARTITION FUNCTION FnMonthPartition (tinyint) AS RANGE LEFT FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12);
GO
CREATE PARTITION SCHEME PsMonthPartition AS PARTITION FnMonthPartition ALL TO ([PRIMARY]);
GO
ALTER TABLE <SchemaName>.<TableName> ADD CONSTRAINT [PK_SchemaNameTableName] PRIMARY KEY CLUSTERED ([ID] ASC, MonthPartition) ON PsMonthPartition(MonthPartition)
GO
Once this is done, align the nonclustered indexes on the same partition scheme, for example:
CREATE NONCLUSTERED INDEX [NCI_SchemaNameTableNameExample] ON <SchemaName>.<TableName> ([Field] ASC) WITH (DROP_EXISTING = ON) ON PsMonthPartition(MonthPartition)
GO
Next, create a stage table with the exact same structure as the table to purge, including the indexes, all aligned on the same partition scheme. Just right-click the table in SSMS and generate the script for the table; do the same for the indexes. Give the table a convenient name: I myself use a Stage schema and keep the same table name.
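As a sketch, assuming a hypothetical dbo.Orders table carrying the MonthPartition computed column from earlier, the generated stage table would look roughly like this (all names and columns here are illustrative):

```sql
-- Hypothetical mirror of dbo.Orders in the Stage schema.
-- Structure, computed column, clustered PK and nonclustered indexes must all
-- match the source table and be aligned on the same partition scheme.
CREATE TABLE Stage.Orders
(
    ID             BIGINT        NOT NULL,
    CreatedOn      DATETIME2     NOT NULL,
    Amount         DECIMAL(18,2) NULL,
    MonthPartition AS ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) PERSISTED,
    CONSTRAINT PK_StageOrders PRIMARY KEY CLUSTERED (ID ASC, MonthPartition)
        ON PsMonthPartition(MonthPartition)
);
GO
CREATE NONCLUSTERED INDEX NCI_StageOrders_CreatedOn
    ON Stage.Orders (CreatedOn ASC)
    ON PsMonthPartition(MonthPartition);
GO
```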
You now have everything in place to perform a scheduled monthly purge that takes seconds.
As I need to extract, then transform the monthly partition each month, I set up a scheduler with specific transition phases and checks in order to switch out, extract, transform and purge the data. If people are interested, I can share my methodology for this.
When you are ready to keep only the current and previous month of data and purge the rest, you can do so using the switch-out methodology followed by a truncation. To be able to do this in seconds, the foreign key constraints, if any, MUST be disabled for the duration of the operation.
I set up a specific tool to automate this switch-out methodology.
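Stripped of the automation, the operation boils down to a handful of statements (a sketch with placeholder names; partition 3 stands in for whichever partition holds the month to purge):

```sql
-- Disable constraint checking on the stage table for the duration (placeholder names).
ALTER TABLE Stage.<TableName> NOCHECK CONSTRAINT ALL;

-- Metadata-only operation: moves the whole partition into the stage table.
ALTER TABLE <SchemaName>.<TableName>
    SWITCH PARTITION 3 TO Stage.<TableName> PARTITION 3;

ALTER TABLE Stage.<TableName> CHECK CONSTRAINT ALL;

-- The old data now lives in the stage table and can be truncated instantly.
TRUNCATE TABLE Stage.<TableName>;
```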
Here are the objects I created:
1. A view that denormalizes the various pieces of information about the partitions:
CREATE VIEW Stage.[ObjectDetail]
AS
SELECT SCHEMA_NAME(so.schema_id) AS schema_name ,
OBJECT_NAME(p.object_id) AS object_name ,
p.partition_number ,
p.data_compression_desc ,
dbps.row_count ,
dbps.reserved_page_count * 8 / 1024. AS reserved_mb ,
si.index_id ,
CASE WHEN si.index_id = 0 THEN '(heap!)'
ELSE si.name
END AS index_name ,
si.is_unique ,
si.data_space_id ,
mappedto.name AS mapped_to_name ,
mappedto.type_desc AS mapped_to_type_desc ,
partitionds.name AS partition_filegroup ,
pf.name AS pf_name ,
pf.type_desc AS pf_type_desc ,
pf.fanout AS pf_fanout ,
pf.boundary_value_on_right ,
ps.name AS partition_scheme_name ,
rv.value AS range_value
FROM sys.partitions p
JOIN sys.objects so
ON p.object_id = so.object_id
AND so.is_ms_shipped = 0
LEFT JOIN sys.dm_db_partition_stats AS dbps
ON p.object_id = dbps.object_id
AND p.partition_id = dbps.partition_id
JOIN sys.indexes si
ON p.object_id = si.object_id
AND p.index_id = si.index_id
LEFT JOIN sys.data_spaces mappedto
ON si.data_space_id = mappedto.data_space_id
LEFT JOIN sys.destination_data_spaces dds
ON si.data_space_id = dds.partition_scheme_id
AND p.partition_number = dds.destination_id
LEFT JOIN sys.data_spaces partitionds
ON dds.data_space_id = partitionds.data_space_id
LEFT JOIN sys.partition_schemes AS ps
ON dds.partition_scheme_id = ps.data_space_id
LEFT JOIN sys.partition_functions AS pf
ON ps.function_id = pf.function_id
LEFT JOIN sys.partition_range_values AS rv
ON pf.function_id = rv.function_id
AND dds.destination_id = CASE pf.boundary_value_on_right
WHEN 0 THEN rv.boundary_id
ELSE rv.boundary_id + 1
END
GO
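Before switching, this view makes it easy to check which partition holds the data and how many rows it contains. For example (schema and table name are placeholders):

```sql
-- Inspect per-partition row counts and sizes for the table to purge.
SELECT object_name,
       index_name,
       partition_number,
       range_value,
       row_count,
       reserved_mb,
       partition_filegroup
FROM Stage.ObjectDetail
WHERE schema_name = 'dbo'
  AND object_name = 'TableName'   -- placeholder
ORDER BY partition_number;
```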
2. A stored procedure that performs the switch out dynamically (useful if you have multiple tables to switch out and want to deploy the same methodology):
/*
Sample:
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1, @SchemaName = 'Stage', @StageSchemaName = 'dbo'
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'dbo', @StageSchemaName = 'Stage'
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'Stage', @StageSchemaName = 'dbo'
*/
CREATE OR ALTER PROCEDURE Stage.SwitchData
@SchemaName VARCHAR(128) = 'dbo',
@StageSchemaName VARCHAR(128) = 'Stage',
@TableName VARCHAR(128),
@StageTableName VARCHAR(128)=NULL,
@PartitionToSwitch int = NULL,
@Debug BIT = 0
AS
BEGIN TRY
DECLARE @ErrMsg VARCHAR(256) = ''
IF @StageTableName IS NULL
SET @StageTableName = @TableName
IF @PartitionToSwitch IS NULL
SET @PartitionToSwitch = DATEPART(MONTH,DATEADD(MM,-2,GETDATE()))
DECLARE @Rows bigint = 0
SELECT @PartitionToSwitch = Partition_Number, @Rows = row_count FROM Stage.[ObjectDetail] WHERE Object_Name = @TableName AND schema_name = @SchemaName AND index_Name LIKE 'PK_%' AND Range_Value = @PartitionToSwitch
IF @Rows = 0 AND @Debug = 0
BEGIN
RAISERROR('No rows within partition',16,1)
END
IF @PartitionToSwitch IS NULL AND @Debug = 0
BEGIN
RAISERROR('Partition unknown',16,1)
END
DECLARE @FromTable VARCHAR(256) = '['+@SchemaName+'].['+@TableName+']'
DECLARE @ToTable VARCHAR(256) = '['+@StageSchemaName+'].['+@TableName+']'
DECLARE @Stmt NVARCHAR(512) = '', @ConstraintDisable NVARCHAR(512)='', @ConstraintEnable NVARCHAR(512)=''
SET @Stmt = 'ALTER TABLE @@FromTable@@ SWITCH PARTITION @@PartitionNumber@@ TO @@ToTable@@ PARTITION @@PartitionNumber@@'
SET @ConstraintEnable = 'ALTER TABLE @@ToTable@@ CHECK CONSTRAINT ALL'
SET @ConstraintDisable = 'ALTER TABLE @@ToTable@@ NOCHECK CONSTRAINT ALL'
SET @Stmt = REPLACE(@Stmt,'@@FromTable@@',@FromTable)
SET @Stmt = REPLACE(@Stmt,'@@ToTable@@',@ToTable)
SET @Stmt = REPLACE(@Stmt,'@@PartitionNumber@@',CONVERT(VARCHAR(16),@PartitionToSwitch))
SET @ConstraintEnable = REPLACE(@ConstraintEnable,'@@ToTable@@',@ToTable)
SET @ConstraintDisable = REPLACE(@ConstraintDisable,'@@ToTable@@',@ToTable)
IF @Debug = 1
BEGIN
PRINT @ConstraintDisable
PRINT @Stmt
PRINT @ConstraintEnable
END
ELSE
BEGIN
EXEC sp_executesql @ConstraintDisable
EXEC sp_executesql @Stmt
EXEC sp_executesql @ConstraintEnable
END
SELECT @Rows RowsSwitched, @ErrMsg ErrorTxt
END TRY
BEGIN CATCH
SET @ErrMsg = ERROR_MESSAGE()
SELECT @Rows RowsSwitched, @ErrMsg ErrorTxt
END CATCH
3. A stored procedure that performs the purge dynamically (again, useful if you have multiple tables and want to deploy the same methodology):
GO
-- EXEC Stage.PurgeData @TableName = 'SchemaName.TableName',@LastRowsExtracted='0'
CREATE OR ALTER PROCEDURE Stage.PurgeData
@TableName varchar(128),@LastRowsExtracted int
AS
DECLARE @Stmt NVARCHAR(MAX) = '', @ErrMsg varchar(512)=''
BEGIN TRY
SET @Stmt = 'SELECT COUNT(*) FROM ' + @TableName
CREATE TABLE #Rows (RowsAffected int)
INSERT INTO #Rows (RowsAffected)
EXEC sp_executesql @Stmt
DECLARE @RowsToPurge int
SELECT @RowsToPurge = RowsAffected FROM #Rows
IF @RowsToPurge <> @LastRowsExtracted
BEGIN
SET @ErrMsg = 'Table '+@TableName+' to purge has '+ CONVERT(VARCHAR(16),@RowsToPurge)+' row(s) and it was expected to be ' + CONVERT(VARCHAR(16),@LastRowsExtracted)+ ' row(s)'
RAISERROR(@ErrMsg,16,1)
END
IF @RowsToPurge = @LastRowsExtracted
BEGIN
SET @Stmt = 'TRUNCATE TABLE ' + @TableName
EXEC sp_executesql @Stmt
END
SELECT @RowsToPurge RowsPurged, @ErrMsg ErrorMsg
END TRY
BEGIN CATCH
SET @ErrMsg = ERROR_MESSAGE()
SELECT 0 RowsPurged, @ErrMsg ErrorMsg
END CATCH
GO
Then you can easily switch out and purge:
EXEC Stage.SwitchData @TableName = 'TableName'
EXEC Stage.PurgeData @TableName = 'SchemaName.TableName', @LastRowsExtracted = NumberOfRowsToPurge
Note that the @LastRowsExtracted parameter is a safeguard: since the truncation is instantaneous, it makes sure we know exactly how many rows we are about to purge. In my ETL package, I usually set this parameter to the number of rows I extracted first; this way, I am sure the purged count matches. If you purge without extracting, you can simply count what is in the staging table (ideally using sp_spaceused, which is faster) and then purge using the stored procedure. And if you are confident, you do not need the dynamic stored procedure at all: just truncate the staging table directly (with FKs disabled, if any exist).
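For the fast row count mentioned above, sp_spaceused (or a query against sys.dm_db_partition_stats) avoids scanning the table; for example (table name is a placeholder):

```sql
-- Fast row count without scanning the table.
EXEC sp_spaceused N'Stage.TableName';  -- placeholder name

-- Or straight from the DMV (heap or clustered index rows only):
SELECT SUM(row_count) AS row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'Stage.TableName')
  AND index_id IN (0, 1);
```

The resulting count can then be passed to Stage.PurgeData as @LastRowsExtracted.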
The main advantages of this technique:
- Speed: the switch is practically instantaneous.
- Less locking: the lock is very short and painless.
- Rollback: you can roll back the switch out very easily and quickly, as long as you created all the indexes and the exact same structure on the staging table.
- No maintenance on the partition scheme/function: there are only 12 months in a year and that will never change. Try to keep the rolling partition key like this: day of month, quarter of year, month, week, etc. All of these yield a very limited number of partitions, so you will never need to split partitions or create what are called sandwich partitions.
- Performance as a bonus: you can gain performance by placing the different partitions on different filegroups, and therefore different disk arrays, improving SELECTs that span multiple partitions (if you were to keep several months instead of one, in our case).
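To spread the partitions across filegroups, the partition scheme would map each range explicitly instead of using ALL TO (a sketch; the filegroup names FG1 through FG13 are assumptions and must already exist in the database):

```sql
-- Hypothetical: map the 13 partitions produced by the 12 RANGE LEFT boundary
-- values of FnMonthPartition to separate filegroups / disk arrays.
CREATE PARTITION SCHEME PsMonthPartitionSpread
AS PARTITION FnMonthPartition
TO (FG1, FG2, FG3, FG4, FG5, FG6, FG7,
    FG8, FG9, FG10, FG11, FG12, FG13);
```

Note that 12 boundary values always create 13 partitions, so the scheme must name 13 filegroups.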
The main disadvantages of this technique:
- Complexity: it is a bit more involved than the DELETE technique.
- Maintenance: if you change the structure of the primary table, such as adding a field, changing a datatype or adding an index, you must do the same on the staging table.
- Edition requirements: table partitioning was an Enterprise Edition feature and became available in Standard Edition with SQL Server 2016 SP1. On older versions or other editions, you might not be able to use the feature.