SQL Server: Purging data using switch out
On very large database tables, most of my clients archive data by extracting it first and then purging the old rows based on a date column, using something like:
DELETE FROM <SchemaName>.<TableName> WHERE <DateField> < DATEADD(DAY, -X, GETDATE());
Not only does this usually take a long time, it is also dangerous.
If you stick with the DELETE approach:
- Index the column used to select the purged data. This way, SQL Server can do an index seek and avoid a more massive lock on the table.
- Delete in small batches, in order to keep the SQL transactions small and reduce the lock duration as well.
- Consider taking log backups between batches, or switching the recovery model to SIMPLE if the DELETE is massive (depending, obviously, on your SLA regarding data loss).
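For reference, a batched DELETE along those lines might look like the following sketch (table, column and the 30-day retention are placeholders):

```sql
-- Purge in small batches to keep transactions short and locks limited.
-- <SchemaName>.<TableName> and <DateField> are placeholder names.
DECLARE @BatchSize INT = 5000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM <SchemaName>.<TableName>
    WHERE <DateField> < DATEADD(DAY, -30, GETDATE());

    IF @@ROWCOUNT < @BatchSize
        BREAK;  -- last batch deleted fewer rows than the batch size: done

    -- In FULL recovery, a log backup between batches keeps the log from growing.
END
```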
But I would go another route:
Starting with SQL Server 2016 SP1, formerly Enterprise-only features such as table partitioning are available in Standard Edition. Even if we put all the partitions on the same filegroup, partitioning is still valuable in order to purge data fast and painlessly.
If the large table you need to purge already exists (which is usually the case), it is worth a maintenance window so that, afterwards, you can purge in a much nicer way.
Assuming you need to do a monthly purge keeping only one month's worth of data, you can create a rolling partition by adding a persisted computed column that holds the month number, such as:
ALTER TABLE <SchemaName>.<TableName> ADD MonthPartition AS ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) PERSISTED;
Remember two things:
- The persisted computed column cannot be NULL, so the ISNULL function is important: it tells the SQL Server engine that the computed column is not nullable.
- Convert the result to a specific datatype in order to minimize the impact of the persisted column on storage and overall performance. In this case there can only be 12 months, so a TINYINT is more than sufficient.
Once the maintenance is done, create the partition function and scheme, then drop and recreate the clustered index (usually the primary key) to include the partition column (example of a clustered PK on an ID column):
CREATE PARTITION FUNCTION FnMonthPartition (tinyint) AS RANGE LEFT FOR VALUES (1,2,3,4,5,6,7,8,9,10,11,12);
GO
CREATE PARTITION SCHEME PsMonthPartition AS PARTITION FnMonthPartition ALL TO ([PRIMARY]);
GO
ALTER TABLE <SchemaName>.<TableName> ADD CONSTRAINT [PK_SchemaNameTableName] PRIMARY KEY CLUSTERED ([ID] ASC, MonthPartition) ON PsMonthPartition(MonthPartition)
GO
Once this is done, align the nonclustered indexes on the same partition scheme, for example:
CREATE NONCLUSTERED INDEX [NCI_SchemaNameTableNameExample] ON <SchemaName>.<TableName> ([Field] ASC) WITH (DROP_EXISTING = ON) ON PsMonthPartition(MonthPartition)
GO
Next, create a stage table with the exact same structure as the table to purge, including the indexes, all aligned on the same partition scheme. Just right-click the table in SSMS and generate the script for the table; do the same for the indexes. Give the table a convenient name: I myself use a Stage schema and keep the same table name.
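As a sketch, assuming a hypothetical dbo.Orders table carrying the MonthPartition computed column from earlier, the generated stage table would look roughly like this (all names and columns here are illustrative):

```sql
-- Hypothetical mirror of dbo.Orders in the Stage schema.
-- Structure, computed column, clustered PK and nonclustered indexes must all
-- match the source table and be aligned on the same partition scheme.
CREATE TABLE Stage.Orders
(
    ID             BIGINT        NOT NULL,
    CreatedOn      DATETIME2     NOT NULL,
    Amount         DECIMAL(18,2) NULL,
    MonthPartition AS ISNULL(CONVERT(TINYINT, MONTH(CreatedOn)), 0) PERSISTED,
    CONSTRAINT PK_StageOrders PRIMARY KEY CLUSTERED (ID ASC, MonthPartition)
        ON PsMonthPartition(MonthPartition)
);
GO
CREATE NONCLUSTERED INDEX NCI_StageOrders_CreatedOn
    ON Stage.Orders (CreatedOn ASC)
    ON PsMonthPartition(MonthPartition);
GO
```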
You now have everything in place to perform a scheduled monthly purge that takes seconds.
As I need to extract, then transform the monthly partition each month, I set up a scheduler with specific transition phases and checks in order to switch out, extract, transform and purge the data. If people are interested, I can share my methodology for this.
When you are ready to keep only the current and previous month of data and purge the rest, you can do so using the switch-out methodology followed by a truncation. To be able to do this in seconds, the foreign key constraints, if any, MUST be disabled for the duration of the operation.
I set up a specific tool to automate this switch-out methodology.
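Stripped of the automation, the operation boils down to a handful of statements (a sketch with placeholder names; partition 3 stands in for whichever partition holds the month to purge):

```sql
-- Disable constraint checking on the stage table for the duration (placeholder names).
ALTER TABLE Stage.<TableName> NOCHECK CONSTRAINT ALL;

-- Metadata-only operation: moves the whole partition into the stage table.
ALTER TABLE <SchemaName>.<TableName>
    SWITCH PARTITION 3 TO Stage.<TableName> PARTITION 3;

ALTER TABLE Stage.<TableName> CHECK CONSTRAINT ALL;

-- The old data now lives in the stage table and can be truncated instantly.
TRUNCATE TABLE Stage.<TableName>;
```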
Here are the objects I created:
1. A view that denormalizes the various pieces of information about the partitions:
CREATE VIEW Stage.[ObjectDetail]
AS
SELECT SCHEMA_NAME(so.schema_id) AS schema_name ,
OBJECT_NAME(p.object_id) AS object_name ,
p.partition_number ,
p.data_compression_desc ,
dbps.row_count ,
dbps.reserved_page_count * 8 / 1024. AS reserved_mb ,
si.index_id ,
CASE WHEN si.index_id = 0 THEN '(heap!)'
ELSE si.name
END AS index_name ,
si.is_unique ,
si.data_space_id ,
mappedto.name AS mapped_to_name ,
mappedto.type_desc AS mapped_to_type_desc ,
partitionds.name AS partition_filegroup ,
pf.name AS pf_name ,
pf.type_desc AS pf_type_desc ,
pf.fanout AS pf_fanout ,
pf.boundary_value_on_right ,
ps.name AS partition_scheme_name ,
rv.value AS range_value
FROM sys.partitions p
JOIN sys.objects so
ON p.object_id = so.object_id
AND so.is_ms_shipped = 0
LEFT JOIN sys.dm_db_partition_stats AS dbps
ON p.object_id = dbps.object_id
AND p.partition_id = dbps.partition_id
JOIN sys.indexes si
ON p.object_id = si.object_id
AND p.index_id = si.index_id
LEFT JOIN sys.data_spaces mappedto
ON si.data_space_id = mappedto.data_space_id
LEFT JOIN sys.destination_data_spaces dds
ON si.data_space_id = dds.partition_scheme_id
AND p.partition_number = dds.destination_id
LEFT JOIN sys.data_spaces partitionds
ON dds.data_space_id = partitionds.data_space_id
LEFT JOIN sys.partition_schemes AS ps
ON dds.partition_scheme_id = ps.data_space_id
LEFT JOIN sys.partition_functions AS pf
ON ps.function_id = pf.function_id
LEFT JOIN sys.partition_range_values AS rv
ON pf.function_id = rv.function_id
AND dds.destination_id = CASE pf.boundary_value_on_right
WHEN 0 THEN rv.boundary_id
ELSE rv.boundary_id + 1
END
GO
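Before switching, this view makes it easy to check which partition holds the data and how many rows it contains. For example (schema and table name are placeholders):

```sql
-- Inspect per-partition row counts and sizes for the table to purge.
SELECT object_name,
       index_name,
       partition_number,
       range_value,
       row_count,
       reserved_mb,
       partition_filegroup
FROM Stage.ObjectDetail
WHERE schema_name = 'dbo'
  AND object_name = 'TableName'   -- placeholder
ORDER BY partition_number;
```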
2. A stored procedure that performs the switch out dynamically (useful if you have multiple tables to switch out and want to deploy the same methodology):
/*
Sample:
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 1, @SchemaName = 'Stage', @StageSchemaName = 'dbo'
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'dbo', @StageSchemaName = 'Stage'
GO
EXEC Stage.SwitchData @TableName = 'TableName', @Debug = 0, @SchemaName = 'Stage', @StageSchemaName = 'dbo'
*/
CREATE OR ALTER PROCEDURE Stage.SwitchData
@SchemaName VARCHAR(128) = 'dbo',
@StageSchemaName VARCHAR(128) = 'Stage',
@TableName VARCHAR(128),
@StageTableName VARCHAR(128)=NULL,
@PartitionToSwitch int = NULL,
@Debug BIT = 0
AS
BEGIN TRY
DECLARE @ErrMsg VARCHAR(256) = ''
IF @StageTableName IS NULL
SET @StageTableName = @TableName
IF @PartitionToSwitch IS NULL
SET @PartitionToSwitch = DATEPART(MONTH,DATEADD(MM,-2,GETDATE()))
DECLARE @Rows bigint = 0
SELECT @PartitionToSwitch = Partition_Number, @Rows = row_count FROM Stage.[ObjectDetail] WHERE Object_Name = @TableName AND schema_name = @SchemaName AND index_Name LIKE 'PK_%' AND Range_Value = @PartitionToSwitch
IF @Rows = 0 AND @Debug = 0
BEGIN
RAISERROR('No rows within partition',16,1)
END
IF @PartitionToSwitch IS NULL AND @Debug = 0
BEGIN
RAISERROR('Partition unknown',16,1)
END
DECLARE @FromTable VARCHAR(256) = '['+@SchemaName+'].['+@TableName+']'
DECLARE @ToTable VARCHAR(256) = '['+@StageSchemaName+'].['+@TableName+']'
DECLARE @Stmt NVARCHAR(512) = '', @ConstraintDisable NVARCHAR(512)='', @ConstraintEnable NVARCHAR(512)=''
SET @Stmt = 'ALTER TABLE @@FromTable@@ SWITCH PARTITION @@PartitionNumber@@ TO @@ToTable@@ PARTITION @@PartitionNumber@@'
SET @ConstraintEnable = 'ALTER TABLE @@ToTable@@ CHECK CONSTRAINT ALL'
SET @ConstraintDisable = 'ALTER TABLE @@ToTable@@ NOCHECK CONSTRAINT ALL'
SET @Stmt = REPLACE(@Stmt,'@@FromTable@@',@FromTable)
SET @Stmt = REPLACE(@Stmt,'@@ToTable@@',@ToTable)
SET @Stmt = REPLACE(@Stmt,'@@PartitionNumber@@',CONVERT(VARCHAR(16),@PartitionToSwitch))
SET @ConstraintEnable = REPLACE(@ConstraintEnable,'@@ToTable@@',@ToTable)
SET @ConstraintDisable = REPLACE(@ConstraintDisable,'@@ToTable@@',@ToTable)
IF @Debug = 1
BEGIN
PRINT @ConstraintDisable
PRINT @Stmt
PRINT @ConstraintEnable
END
ELSE
BEGIN
EXEC sp_executesql @ConstraintDisable
EXEC sp_executesql @Stmt
EXEC sp_executesql @ConstraintEnable
END
SELECT @Rows RowsSwitched, @ErrMsg ErrorTxt
END TRY
BEGIN CATCH
SET @ErrMsg = ERROR_MESSAGE()
SELECT @Rows RowsSwitched, @ErrMsg ErrorTxt
END CATCH
3. A stored procedure that performs the purge dynamically (again, useful if you have multiple tables and want to deploy the same methodology):
GO
-- EXEC Stage.PurgeData @TableName = 'SchemaName.TableName',@LastRowsExtracted='0'
CREATE OR ALTER PROCEDURE Stage.PurgeData
@TableName varchar(128),@LastRowsExtracted int
AS
DECLARE @Stmt NVARCHAR(MAX) = '', @ErrMsg varchar(512)=''
BEGIN TRY
SET @Stmt = 'SELECT COUNT(*) FROM ' + @TableName
CREATE TABLE #Rows (RowsAffected int)
INSERT INTO #Rows (RowsAffected)
EXEC sp_executesql @Stmt
DECLARE @RowsToPurge int
SELECT @RowsToPurge = RowsAffected FROM #Rows
IF @RowsToPurge <> @LastRowsExtracted
BEGIN
SET @ErrMsg = 'Table '+@TableName+' to purge has '+ CONVERT(VARCHAR(16),@RowsToPurge)+' row(s) and it was expected to be ' + CONVERT(VARCHAR(16),@LastRowsExtracted)+ ' row(s)'
RAISERROR(@ErrMsg,16,1)
END
IF @RowsToPurge = @LastRowsExtracted
BEGIN
SET @Stmt = 'TRUNCATE TABLE ' + @TableName
EXEC sp_executesql @Stmt
END
SELECT @RowsToPurge RowsPurged, @ErrMsg ErrorMsg
END TRY
BEGIN CATCH
SET @ErrMsg = ERROR_MESSAGE()
SELECT 0 RowsPurged, @ErrMsg ErrorMsg
END CATCH
GO
Then you can easily switch out and purge:
EXEC Stage.SwitchData @TableName = 'TableName'
EXEC Stage.PurgeData @TableName = 'SchemaName.TableName', @LastRowsExtracted = NumberOfRowsToPurge
Note that the @LastRowsExtracted parameter is a safeguard: since the truncation is instantaneous, it makes sure we know exactly how many rows we are about to purge. In my ETL package, I usually set this parameter to the number of rows I extracted first; this way, I am sure the purged count matches. If you purge without extracting, you can simply count what is in the staging table (ideally using sp_spaceused, which is faster) and then purge using the stored procedure. And if you are confident, you do not need the dynamic stored procedure at all: just truncate the staging table directly (with FKs disabled, if any exist).
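For the fast row count mentioned above, sp_spaceused (or a query against sys.dm_db_partition_stats) avoids scanning the table; for example (table name is a placeholder):

```sql
-- Fast row count without scanning the table.
EXEC sp_spaceused N'Stage.TableName';  -- placeholder name

-- Or straight from the DMV (heap or clustered index rows only):
SELECT SUM(row_count) AS row_count
FROM sys.dm_db_partition_stats
WHERE object_id = OBJECT_ID(N'Stage.TableName')
  AND index_id IN (0, 1);
```

The resulting count can then be passed to Stage.PurgeData as @LastRowsExtracted.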
The main advantages of this technique:
- Speed: the switch is practically instantaneous.
- Less locking: the lock is very short and painless.
- Rollback: you can roll back the switch out very easily and quickly, as long as you created all the indexes and the exact same structure on the staging table.
- No maintenance on the partition scheme/function: there are only 12 months in a year and that will never change. Try to keep the rolling partition key like this: day of month, quarter of year, month, week, etc. All of these yield a very limited number of partitions, so you will never need to split partitions or create what are called sandwich partitions.
- Performance as a bonus: you can gain performance by placing the different partitions on different filegroups, and therefore different disk arrays, improving SELECTs that span multiple partitions (if you were to keep several months instead of one, in our case).
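To spread the partitions across filegroups, the partition scheme would map each range explicitly instead of using ALL TO (a sketch; the filegroup names FG1 through FG13 are assumptions and must already exist in the database):

```sql
-- Hypothetical: map the 13 partitions produced by the 12 RANGE LEFT boundary
-- values of FnMonthPartition to separate filegroups / disk arrays.
CREATE PARTITION SCHEME PsMonthPartitionSpread
AS PARTITION FnMonthPartition
TO (FG1, FG2, FG3, FG4, FG5, FG6, FG7,
    FG8, FG9, FG10, FG11, FG12, FG13);
```

Note that 12 boundary values always create 13 partitions, so the scheme must name 13 filegroups.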
The main disadvantages of this technique:
- Complexity: it is a bit more involved than the DELETE technique.
- Maintenance: if you change the structure of the primary table, such as adding a field, changing a datatype or adding an index, you must do the same on the staging table.
- Edition requirements: table partitioning was an Enterprise Edition feature and became available in Standard Edition with SQL Server 2016 SP1. On older versions or other editions, you might not be able to use the feature.