Clement Huge
04.11.2018

SQL Server: Purging Data Using Switch Out

On very large database tables, most of my clients archive old data by extracting it first and then purging it based on a date column, using something like:

DELETE FROM <SchemaName>.<TableName>
WHERE <DateField> < DATEADD(DAY, -<X>, GETDATE());

Not only does this usually take a long time, it is also dangerous.

If you go the DELETE route, you should index the date column used to filter the purged data. That way the engine can perform an index seek and avoid escalating to a more massive lock on the table.

If you go the DELETE route, you should delete in small batches, in order to keep the SQL transactions small and reduce the lock duration as well (see the sketch below).

If you go the DELETE route, you should also consider log backups, or switching the recovery model to SIMPLE if the DELETE is massive (depending, obviously, on your SLA regarding data loss).
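For reference, here is a minimal batched-DELETE sketch, using the same placeholder names as above; the batch size is an assumption you would tune against your workload:

DECLARE @BatchSize INT = 5000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM <SchemaName>.<TableName>
    WHERE <DateField> < DATEADD(DAY, -<X>, GETDATE());

    -- stop once the last, partial batch has been deleted
    IF @@ROWCOUNT < @BatchSize BREAK;

    -- optionally take a log backup (or CHECKPOINT under SIMPLE recovery)
    -- here to keep the transaction log small between batches
END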

But I would go another route:

Starting with SQL Server 2016 SP1, formerly Enterprise-only features such as table partitioning are available in Standard edition, and even if we put all the partitions in the same filegroup, partitioning is still valuable because it lets us purge data fast and painlessly.

If the large table you purge already exists (which is usually the case), it is worth a one-time maintenance operation so that, afterwards, you can handle purging in a much nicer way.

Assuming you need a monthly purge keeping only one month's worth of data, you can create a rolling partition by implementing a computed and persisted column that holds the month number, such as:

ALTER TABLE <SchemaName>.<TableName>
ADD MonthPartition AS 
    ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) 
    PERSISTED;

Remember two things:

  • the persisted computed column cannot be nullable (the ISNULL wrapper is important because it lets the SQL Server engine infer that the computed column is NOT NULL, which is required for it to participate in the primary key).
  • convert the result to a specific datatype in order to minimize the impact of the persisted column on storage and overall performance. Here there can only be 12 months, so a TINYINT is plenty sufficient.
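Before partitioning on the new column, you can sanity-check it with a quick look at sys.computed_columns (the table name below is a placeholder), confirming it is persisted and non-nullable:

SELECT name, is_persisted, is_nullable
FROM sys.computed_columns
WHERE object_id = OBJECT_ID('<SchemaName>.<TableName>')
  AND name = 'MonthPartition';
GO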

Once the maintenance is done, you can drop and recreate the clustered index (usually the primary key) so that it includes the partition column. Prior to this, create the partition function and scheme (example here of a clustered PK on an ID column):

CREATE PARTITION FUNCTION FnMonthPartition (TINYINT)  
AS RANGE LEFT FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12);  
GO
CREATE PARTITION SCHEME PsMonthPartition  
AS PARTITION FnMonthPartition  
ALL TO ([PRIMARY]);
GO
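-- Drop the existing clustered PK first, e.g.:
-- ALTER TABLE <SchemaName>.<TableName> DROP CONSTRAINT [PK_SchemaNameTableName];
-- then recreate it on the partition scheme: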
ALTER TABLE <SchemaName>.<TableName> 
ADD CONSTRAINT [PK_SchemaNameTableName] 
PRIMARY KEY CLUSTERED 
(
    [ID] ASC, 
    MonthPartition
) 
ON PsMonthPartition(MonthPartition);
GO
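To verify that the rows landed in the expected partitions after the rebuild, a quick check with the $PARTITION function (using the names defined above) looks like this:

SELECT $PARTITION.FnMonthPartition(MonthPartition) AS partition_number,
       COUNT(*) AS row_count
FROM <SchemaName>.<TableName>
GROUP BY $PARTITION.FnMonthPartition(MonthPartition)
ORDER BY partition_number;
GO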

Once this is done, align the nonclustered indexes on the same partition scheme, like:

CREATE NONCLUSTERED INDEX [NCI_SchemaNameTableNameExample] 
ON <SchemaName>.<TableName> ([Field] ASC) 
WITH (DROP_EXISTING = ON) 
ON PsMonthPartition(MonthPartition);
GO

Then create a Stage table with the exact same structure as the table to purge, including the indexes, all aligned on the same partition scheme. Just right-click the table in SSMS and generate the script for the table; do the same for the indexes. Give the table a convenient name. I myself use a Stage schema and keep the same table name.
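As a minimal illustration (a hypothetical dbo.Orders table with only the columns shown), the staging twin would look like this:

CREATE TABLE Stage.Orders
(
    ID             INT      NOT NULL,
    CreatedOn      DATETIME NOT NULL,
    -- same computed, persisted partition column as the source table
    MonthPartition AS ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) PERSISTED,
    CONSTRAINT [PK_StageOrders] PRIMARY KEY CLUSTERED (ID ASC, MonthPartition)
        ON PsMonthPartition(MonthPartition)
);
GO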

Now you have all the structure in place to perform a scheduled monthly purge that takes seconds.

As I need to extract, then transform the monthly partition each month, I need to set up a scheduler with specific transition phases and checks in order to: switch out, extract, transform, and purge the data. If people are interested, I can share my methodology for this.

When you are ready to keep only the current month and the last month of data and purge the rest, you can perform the purge using the switch-out methodology followed by truncation. In order for this to take seconds, any foreign key constraints MUST be disabled for the duration of the operation.

I set up a specific tool to automate this switch-out methodology.

Here are the objects I created:

  • a view to denormalize the various pieces of information about the partitions:

CREATE VIEW Stage.[ObjectDetail]
AS
SELECT 
    SCHEMA_NAME(so.schema_id) AS schema_name,
    OBJECT_NAME(p.object_id) AS object_name,
    p.partition_number,
    p.data_compression_desc,
    dbps.row_count,
    dbps.reserved_page_count * 8 / 1024. AS reserved_mb,
    si.index_id,
    CASE 
        WHEN si.index_id = 0 THEN '(heap!)'
        ELSE si.name
    END AS index_name,
    si.is_unique,
    si.data_space_id,
    mappedto.name AS mapped_to_name,
    mappedto.type_desc AS mapped_to_type_desc,
    partitionds.name AS partition_filegroup,
    pf.name AS pf_name,
    pf.type_desc AS pf_type_desc,
    pf.fanout AS pf_fanout,
    pf.boundary_value_on_right,
    ps.name AS partition_scheme_name,
    rv.value AS range_value
FROM sys.partitions p
JOIN sys.objects so
    ON p.object_id = so.object_id
   AND so.is_ms_shipped = 0
LEFT JOIN sys.dm_db_partition_stats AS dbps
    ON p.object_id = dbps.object_id
   AND p.partition_id = dbps.partition_id
JOIN sys.indexes si
    ON p.object_id = si.object_id
   AND p.index_id = si.index_id
LEFT JOIN sys.data_spaces mappedto
    ON si.data_space_id = mappedto.data_space_id
LEFT JOIN sys.destination_data_spaces dds
    ON si.data_space_id = dds.partition_scheme_id
   AND p.partition_number = dds.destination_id
LEFT JOIN sys.data_spaces partitionds
    ON dds.data_space_id = partitionds.data_space_id
LEFT JOIN sys.partition_schemes AS ps
    ON dds.partition_scheme_id = ps.data_space_id
LEFT JOIN sys.partition_functions AS pf
    ON ps.function_id = pf.function_id
LEFT JOIN sys.partition_range_values AS rv
    ON pf.function_id = rv.function_id
   AND dds.destination_id = CASE pf.boundary_value_on_right
                                WHEN 0 THEN rv.boundary_id
                                ELSE rv.boundary_id + 1
                            END;
GO
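A typical call against the view (table and schema names are placeholders):

SELECT partition_number, row_count, reserved_mb, index_name, range_value
FROM Stage.ObjectDetail
WHERE object_name = 'TableName'
  AND schema_name = 'dbo'
ORDER BY index_id, partition_number;
GO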

  • a stored procedure that performs the switch out dynamically (useful if you have multiple tables to switch out and want to deploy the same methodology):

/* Sample:
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1 
GO

EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1, @SchemaName = 'Stage', @StageSchemaName = 'dbo'
GO

EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'dbo', @StageSchemaName = 'Stage'  
GO

EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'Stage', @StageSchemaName = 'dbo' 
*/

ALTER PROCEDURE Stage.SwitchData 
    @SchemaName       VARCHAR(128) = 'dbo',
    @StageSchemaName  VARCHAR(128) = 'Stage',
    @TableName        VARCHAR(128),
    @StageTableName   VARCHAR(128) = NULL,
    @PartitionToSwitch INT = NULL,
    @Debug            BIT = 0
AS
BEGIN TRY
    DECLARE @ErrMsg VARCHAR(256) = ''

    IF @StageTableName IS NULL
        SET @StageTableName = @TableName

    IF @PartitionToSwitch IS NULL
        SET @PartitionToSwitch = DATEPART(MONTH, DATEADD(MM, -2, GETDATE()))

    DECLARE @Rows BIGINT = 0

    SELECT 
        @PartitionToSwitch = Partition_Number, 
        @Rows = row_count 
    FROM Stage.[ObjectDetail] 
    WHERE Object_Name = @TableName 
      AND schema_name = @SchemaName 
      AND index_Name LIKE 'PK_%' 
      AND Range_Value = @PartitionToSwitch

    IF @Rows = 0 AND @Debug = 0
    BEGIN
        RAISERROR('No rows within partition',16,1)
    END

    IF @PartitionToSwitch IS NULL AND @Debug = 0
    BEGIN
        RAISERROR('Partition unknown',16,1)
    END

    DECLARE @FromTable VARCHAR(256) = '[' + @SchemaName + '].[' + @TableName + ']'
    DECLARE @ToTable   VARCHAR(256) = '[' + @StageSchemaName + '].[' + @StageTableName + ']'

    DECLARE @Stmt NVARCHAR(512) = ''
    DECLARE @ConstraintDisable NVARCHAR(512) = ''
    DECLARE @ConstraintEnable  NVARCHAR(512) = ''

    SET @Stmt = 'ALTER TABLE @@FromTable@@ SWITCH PARTITION @@PartitionNumber@@ TO @@ToTable@@ PARTITION @@PartitionNumber@@'
    SET @ConstraintEnable  = 'ALTER TABLE @@ToTable@@ CHECK CONSTRAINT ALL'
    SET @ConstraintDisable = 'ALTER TABLE @@ToTable@@ NOCHECK CONSTRAINT ALL'

    SET @Stmt = REPLACE(@Stmt, '@@FromTable@@', @FromTable)
    SET @Stmt = REPLACE(@Stmt, '@@ToTable@@', @ToTable)
    SET @Stmt = REPLACE(@Stmt, '@@PartitionNumber@@', CONVERT(VARCHAR(16), @PartitionToSwitch))
    SET @ConstraintEnable  = REPLACE(@ConstraintEnable,  '@@ToTable@@', @ToTable)
    SET @ConstraintDisable = REPLACE(@ConstraintDisable, '@@ToTable@@', @ToTable)

    IF @Debug = 1
    BEGIN
        PRINT @ConstraintDisable
        PRINT @Stmt
        PRINT @ConstraintEnable
    END
    ELSE 
    BEGIN
        EXEC sp_executesql @ConstraintDisable
        EXEC sp_executesql @Stmt
        EXEC sp_executesql @ConstraintEnable
    END

    SELECT @Rows AS RowsSwitched, @ErrMsg AS ErrorTxt
END TRY
BEGIN CATCH
    SET @ErrMsg = ERROR_MESSAGE()
    SELECT @Rows AS RowsSwitched, @ErrMsg AS ErrorTxt
END CATCH

  • a stored procedure that performs the purge dynamically (again, useful if you have multiple tables to purge and want to deploy the same methodology):

GO
-- EXEC Stage.PurgeData @TableName = 'SchemaName.TableName', @LastRowsExtracted = '0'

ALTER PROCEDURE Stage.PurgeData 
    @TableName varchar(128),
    @LastRowsExtracted int
AS
DECLARE @Stmt NVARCHAR(MAX) = '', 
        @ErrMsg varchar(512) = ''
BEGIN TRY
    SET @Stmt = 'SELECT COUNT(*) FROM ' + @TableName

    CREATE TABLE #Rows (RowsAffected int)

    INSERT INTO #Rows (RowsAffected)
    EXEC sp_executesql @Stmt

    DECLARE @RowsToPurge int
    SELECT @RowsToPurge = RowsAffected FROM #Rows

    IF @RowsToPurge <> @LastRowsExtracted
    BEGIN
        SET @ErrMsg = 'Table ' + @TableName + ' to purge has ' 
                      + CONVERT(VARCHAR(16), @RowsToPurge) + ' rows but ' 
                      + CONVERT(VARCHAR(16), @LastRowsExtracted) + ' rows were extracted'
        RAISERROR(@ErrMsg, 16, 1)
    END

    -- Counts match: truncate the staging table (this is the instantaneous purge)
    SET @Stmt = 'TRUNCATE TABLE ' + @TableName
    EXEC sp_executesql @Stmt

    SELECT @RowsToPurge AS RowsPurged, @ErrMsg AS ErrorTxt
END TRY
BEGIN CATCH
    SET @ErrMsg = ERROR_MESSAGE()
    SELECT 0 AS RowsPurged, @ErrMsg AS ErrorTxt
END CATCH

Then you can easily switch out and purge:

EXEC Stage.SwitchData 
    @TableName = 'TableName';

EXEC Stage.PurgeData 
    @TableName = 'SchemaName.TableName',
    @LastRowsExtracted = <NumberOfRowsToPurge>;

Note that the @LastRowsExtracted parameter is a safeguard to make sure we know how many rows we are about to purge, since the truncation is performed instantaneously. In my ETL package, I usually set the parameter to the number of rows I extracted first; this way, I make sure I purge a matching number. However, if you just purge your data without extracting, you can count what is on the staging table (ideally using sp_spaceused, which is faster than COUNT(*)) and then purge the data using the sproc. And if you are confident, you do not need the dynamic stored procedure at all and can just truncate the staging table directly (with FKs disabled, if any exist).
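For the counting itself, either of these (the staging table name is a placeholder) avoids a full COUNT(*) scan by reading metadata instead:

EXEC sp_spaceused 'Stage.TableName';    -- quick summary, including row count

SELECT SUM(row_count) AS rows_staged    -- same idea via the DMV used by the view above
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID('Stage.TableName')
  AND index_id IN (0, 1);               -- heap or clustered index only
GO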

The main advantages of this technique:

  1. Speed: the purge is instantaneous.
  2. Less locking: the lock is very short and painless.
  3. Rollback: you can roll back the switch out very easily and fast, as long as you created all the indexes and the exact same structure on the staging table.
  4. No maintenance on the partition scheme/function: there are only 12 months in a year, and that will stay true forever. Try to keep the rolling partition key like this: day of month, quarter of year, month, week, etc. All of those have a very limited number of partitions, so you will never need to split partitions or create what we call sandwich partitions.
  5. Performance in the mix: you can gain performance by mapping the partitions to different filegroups, and therefore different disk arrays, improving SELECTs that span multiple partitions (if you were to keep multiple months instead of one, in our case); see the sketch after this list.
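For point 5, a hypothetical variant of the scheme spreads alternate months over two assumed filegroups, FG1 and FG2 (create them first; and remember that 12 RANGE LEFT boundaries produce 13 partitions, hence 13 entries):

CREATE PARTITION SCHEME PsMonthPartitionSpread
AS PARTITION FnMonthPartition
TO ([FG1], [FG2], [FG1], [FG2], [FG1], [FG2], [FG1], [FG2], [FG1], [FG2], [FG1], [FG2], [FG1]);
GO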

The main disadvantages of this technique:

  1. Complexity: it is a bit more complicated than the DELETE technique.
  2. Maintenance: if you change the structure of the primary table, such as adding a field, changing a datatype, or adding an index, you need to do the same on the staging table.
  3. Requires at least Standard edition: table partitioning was an Enterprise edition feature and became available in Standard edition with SQL Server 2016 SP1. On editions other than Standard and Enterprise, you might not be able to use the feature.