As I was saying in an earlier article, Always Encrypted is a key element of the GDPR compliance implementation: how to encrypt and be able to search the personal data. It is rather better not to be able to search as you can increase encryption by a randomization algorythm. But in some case, you still want to look for the data. For example, a customer service representative might want to look for a last name in an application that can decrypt the data.

Assuming you need on specific fields to search, there is still the issue of looking for a portion of the string. For example, we might want to look for ph12@adven in the EmailAddress field of a Person table. Even with deterministic encryption, the search pattern cannot be found as deterministic encryption only allows for exact match like Person.EmailAddress = @EncryptedPattern.

My first phase of research is then to build a simple solution for such pattern searches. I came up with a dictionary solution. To keep it less complex for now, I discarded the encryption aspect as deterministic encryption would work similarly. On another article, I will look into implementing the encryption and look how to disguise all patterns, including within the dictionary.

  1. Setting up test data:

Note: the scripts below are just for test purposes and filegroups, data types etc... could be even better optimized. There is no need to comment those things. Also for encryption purposes the fields and the tables would not have clear meanings and stored procedures code would be encrypted as well.

I first created the table named Unencrypted.Person:

CREATE TABLE [Unencrypted].[Person](

[Id] [int] NOT NULL,

[EmailAddress] [nvarchar](50) NULL,

CONSTRAINT [PK_Person] PRIMARY KEY CLUSTERED

(

[Id] ASC

)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

) ON [PRIMARY]

I then filled up test data (you can use adventureworks database for example.

Then I created the dictionary of emails:

CREATE TABLE [Unencrypted].[PersonEmailAddressSearchText](

[Id] [int] NOT NULL,

[SearchPattern] [nchar](1) NOT NULL,

[Rank] [smallint] NOT NULL

) ON [PRIMARY]

with the index:

CREATE NONCLUSTERED INDEX [IX_UnencryptedPersonEmailAddressSearchText_SearchPattern] ON [Unencrypted].[PersonEmailAddressSearchText]([SearchPattern] ASC)

To fill up the dictionary, I created a table below filled up up to 20000 numbers:

CREATE TABLE [Performance].[Number]([Num] [smallint] NOT NULL,PRIMARY KEY CLUSTERED ([Num] ASC)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY])

INSERT INTO Performance.Number (Num)

select TOP 20000 row_number() over (order by a.object_id) Rnum FROM master.sys.tables A cross join master.sys.tables B cross join master.sys.tables c cross join master.sys.tables d

And I created a function to store the letters and their positions:

CREATE FUNCTION [String].[Fn_GetLetterPosition]

(@String NVARCHAR (4000))

RETURNS TABLE

AS

RETURN

SELECT SUBSTRING(@String, N.Num, 1) AS SearchPattern,

RANK() OVER (ORDER BY N.Num) AS [Rank]

FROM  [Performance].[Number] AS N

WHERE (N.Num >= 0

AND N.Num <= LEN(@String))

Having this function, I was able to fill up the dictinary:

TRUNCATE TABLE Unencrypted.PersonEmailAddressSearchText

INSERT INTO Unencrypted.PersonEmailAddressSearchText

SELECT Id,SearchPattern,Rank from Unencrypted.Person

CROSS APPLY [String].[Fn_GetLetterPosition](EmailAddress)

As you noticed, if we had a new Person or an updated person, it would be quite easy to perform the inserts or the updates in the dictionary.

2. Create the search procedure

I tested many solution, with a gram, an accuracy option, a wild card, etc... After many tests, I came up with this stored procedure which I find elegant enough and performing well (using index seeks).

Here is the code:

/*

Prerequisites for performance:

CREATE INDEX IX_UnencryptedPersonEmailAddressSearchText_SearchPattern ON Unencrypted.PersonEmailAddressSearchText(SearchPattern)

*/

ALTER PROCEDURE [Unencrypted].[GetPersonFromEmailAddress]

@SearchPattern nvarchar(128) = 'ph__@adven',

@Accuracy smallint = 0, --> if 0 this means no difference, 1 means 1 letter can be off, etc...

@IsRegExpr bit = 1 --> this means that underscore can be used (more possible)

AS

BEGIN TRY

-- 1. get the lenth

DECLARE @Len smallint = LEN(@SearchPattern)

-- 2. get the gram (not actually used, except to forbid searches of less than the gram)

DECLARE @Gram smallint = 1

DECLARE @ErrMsg varchar(256) = 'Search pattern must be above '+CONVERT(VARCHAR(16),@Gram)+' characters'

IF @Len < @Gram

RAISERROR(@ErrMsg,16,1)

--3. Create the search values

DECLARE @tSearchPattern TABLE (SearchPattern NVARCHAR(128),[Rank] smallint)

DECLARE @TId table (Id Int, Accuracy smallint)

INSERT INTO @tSearchPattern (SearchPattern, [Rank])

SELECT SearchPattern,[Rank] FROM [String].[Fn_GetLetterPosition](@SearchPattern)

IF @IsRegExpr = 1 AND CHARINDEX('_',@SearchPattern)>0

BEGIN

WITH CTE AS (

SELECT A.Id, A.[Rank]

FROM [Unencrypted].[PersonEmailAddressSearchText] A

JOIN @tSearchPattern B ON A.SearchPattern = B.SearchPattern

WHERE B.[Rank] = 1

)

INSERT INTO @TId

SELECT C.Id, COUNT(*) Accuracy

FROM CTE C

JOIN Unencrypted.PersonEmailAddressSearchText D ON C.Id = D.Id

JOIN @tSearchPattern E ON

CASE WHEN E.SearchPattern = '_' THEN D.SearchPattern ELSE E.SearchPattern END = D.SearchPattern

AND E.[Rank] = D.[Rank] + 1 - C.[Rank]

GROUP BY C.Id, C.[Rank]

HAVING COUNT(*) >= @Len-@Accuracy

END

ELSE

BEGIN

WITH CTE AS (

SELECT A.Id, A.[Rank]

FROM [Unencrypted].[PersonEmailAddressSearchText] A

JOIN @tSearchPattern B ON A.SearchPattern = B.SearchPattern

WHERE B.[Rank] = 1

)

INSERT INTO @TId

SELECT C.Id, COUNT(*) Accuracy

FROM CTE C

JOIN Unencrypted.PersonEmailAddressSearchText D ON C.Id = D.Id

JOIN @tSearchPattern E ON

D.SearchPattern = E.SearchPattern

AND E.[Rank] = D.[Rank] + 1 - C.[Rank]

GROUP BY C.Id

HAVING COUNT(*) >= @Len-@Accuracy

END

SELECT B.Id, @Len - Accuracy LetterOff, A.EmailAddress FROM [Unencrypted].[Person] A

JOIN @TId B ON A.Id = B.Id

ORDER BY LetterOff

END TRY

BEGIN CATCH

SET @ErrMsg = ERROR_MESSAGE()

RAISERROR(@ErrMsg,16,1)

END CATCH

GO

I used the following parameters:

@SearchPattern: this is the actual search pattern we like to run. We can use as many letters as we want and we can include an underscore as a placeholder. As it is a search pattern, it is like having the wild card '%' at each extremity of the search.

@Accuracy: not necessary but useful. This option enables the user to search the string with absolute accuracy (@Accuracy = 0) or with some letters off: for exemple the email address clement@huge.com would be found with the search pattern clemont@huge.com with one letter off (@Accuracy = 1). This does not take any real hint of performance.

@IsRegExpr: not necessary either but useful. This option enables the user to use placeholders in the search. For example john@doe.co.uk would be found for the search pattern _o_n@do_.co.uk, with an exact accuracy. The @IsRegExpr option is useful because for some search the wild card character might be actually to be searched!

3. Examples of search

I executed on my test data hundreds of different runs and here are some results:

The first example is a simple execution of a pattern with exact accuracy and no regular expression. (By the way, in my code, I also check if there is an underscore used in the search pattern and switch the search on a non-wildcard search for performance purposes).

The second example is a similar search except we allow one letter to be a wild card (letter is highlighted in yellow in the capture screen):

The third example us similar with multiple wild card letters:

The fourth example shows accurate and inaccurate results (with a fault tolerance of 1 letter only). Note that I put @IsRegExpr = 1 even though I do not use a wild card and this does not affect performance.

4. Performance

Using statistics (set statistics time on; set statistics io on) on such storage (that I can optimize a slightly bit more):

I have good performance on my searches, using index seeks on the dictionary:

For the run: EXEC [Unencrypted].[GetPersonFromEmailAddress] @SearchPattern = 'ph12@adven', @Accuracy = 0, @IsRegExpr = 0

The statistics are:

SQL Server Execution Times:  CPU time = 0 ms, elapsed time = 0 ms.

Table '#BEC219B2'. Scan count 0, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'PersonEmailAddressSearchText'. Scan count 3022, logical reads 9909, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#BDCDF579'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 63 ms, elapsed time = 66 ms.

with a excution plan of this sort:

And for this run : EXEC [Unencrypted].[GetPersonFromEmailAddress] @SearchPattern = 'ph1_@a__en', @Accuracy = 0, @IsRegExpr = 1

I have a performance hint on the dictionary table as more rows are correct.

SQL Server Execution Times: CPU time = 0 ms, elapsed time = 0 ms.

Table '#BEC219B2'. Scan count 0, logical reads 53, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 76, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'PersonEmailAddressSearchText'. Scan count 30220, logical reads 93180, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table 'Workfile'. Scan count 3, logical reads 120, physical reads 9, read-ahead reads 111, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Table '#BDCDF579'. Scan count 2, logical reads 11, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times: CPU time = 281 ms, elapsed time = 320 ms.

Here is the execution plan for it:

Those numbers are not surprising and are acceptable.

I then conclude that this search capabilities is promising.

5. So, is this good? and what is next?

Pros: easy to implement and work relatively efficiently

Cons: large storage for the dictionary (but can be reduced by changing the SearchPattern on the dictionary to Nchar(1) and even char(1) if we know there is no unicode data. For 100 000 persons with emails, the dictionary is about 2.7 millions (letters). Some additional DML operation hints as any inserts/updates of a person's email address should add inserts/updates on the dictionary as well.

Challenges: now that we have the ability, we need to see how to convert this model to encrypted data and enforce the best encryption. With the dictionary we can probably use random encryption on the Person.EmailAdress but we need to enforce deterministric Encryption on the dictionary. Additionally, the position of the letter (the dictionary Rank field might help decypher the deterministic aspect of the letters more easily). I will look into this on another paper.

Please add comment if you like it or if you have any ideas of additional improvement of the search.