Using Data Classification in your obfuscation routines

In my previous post I discussed the various reasons why you should be classifying your data. These include meeting regulatory and compliance requirements, helping you bring a breach under control, and preventing the accidental sharing of sensitive data. I also walked through how to set this up in Azure SQL Database.

Great, but what else can it do?

One of the most common questions I get when I talk to customers about data classification is: that's great, but can I use it to automatically obfuscate data in my development environment? The answer is no. The data is classified, but there is nothing out of the box that will magic away that sensitive data in a non-production environment. However, there's nothing to stop you using the classifications to write your own obfuscation script. Here is my version of such a script to get you started.

Building the script

Get the data from sys.sensitivity_classifications

The first thing we need to do is find the objects we need to obfuscate. These are available in the sys.sensitivity_classifications view. I'm joining to sys.objects and sys.columns so I can get the names of the objects.

SELECT
    SCHEMA_NAME(o.schema_id) AS schema_name,
    o.[name] AS table_name,
    c.[name] AS column_name,
    CAST([information_type] AS NVARCHAR(128)) AS info_type
FROM sys.sensitivity_classifications AS sc
    JOIN sys.objects AS o ON sc.major_id = o.object_id
    JOIN sys.columns AS c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id

I am using AdventureWorks2017, and my data is classified as follows:

[Screenshot: sys.sensitivity_classifications results for AdventureWorks2017]
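If a column you expect to see is missing from the list, you can classify it yourself in T-SQL before running the obfuscation. Here is a minimal example; the label and information type are purely illustrative choices, and I've picked the AdventureWorks2017 email address column:

-- Classify a column so it appears in sys.sensitivity_classifications
ADD SENSITIVITY CLASSIFICATION TO Person.EmailAddress.EmailAddress
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');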

Loop through each item

Next I'm going to loop through each item in the list and create a statement to update my development copy of the database, based on the type of data the column holds. To get the data type information, we take the SchemaName, TableName, and ColumnName from the cursor and look up the relevant information in INFORMATION_SCHEMA.COLUMNS.

SELECT  
        @DataType = DATA_TYPE,
        @CharacterMaxLength = CHARACTER_MAXIMUM_LENGTH,
        @NumericPrecision = NUMERIC_PRECISION,
        @NumericScale = NUMERIC_SCALE
FROM    INFORMATION_SCHEMA.COLUMNS
WHERE   TABLE_SCHEMA = @SchemaName
AND     TABLE_NAME = @TableName
AND     COLUMN_NAME = @ColumnName;
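The full script on GitHub wraps all of this in a cursor. For context, here is a minimal sketch of how that cursor might be set up; the cursor and variable names are my own and may differ from the full script:

DECLARE @SchemaName SYSNAME, @TableName SYSNAME, @ColumnName SYSNAME, @InfoType NVARCHAR(128);
DECLARE @DataType NVARCHAR(128), @CharacterMaxLength INT, @NumericPrecision TINYINT, @NumericScale INT;
DECLARE @Mask NVARCHAR(200), @sql NVARCHAR(MAX);

DECLARE classified_columns CURSOR FAST_FORWARD FOR
    SELECT
        SCHEMA_NAME(o.schema_id),
        o.[name],
        c.[name],
        CAST([information_type] AS NVARCHAR(128))
    FROM sys.sensitivity_classifications AS sc
        JOIN sys.objects AS o ON sc.major_id = o.object_id
        JOIN sys.columns AS c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id;

OPEN classified_columns;
FETCH NEXT FROM classified_columns INTO @SchemaName, @TableName, @ColumnName, @InfoType;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- look up the data type and build the @sql statement for this column (shown below)...
    FETCH NEXT FROM classified_columns INTO @SchemaName, @TableName, @ColumnName, @InfoType;
END

CLOSE classified_columns;
DEALLOCATE classified_columns;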

Set some basic masking

In a lot of cases you just want to hide the value from the user without any fancy code. Ideally in a normalised database all your joins are on your CustomerId column, not your customer name or email address, so you don’t need to worry if all the names are “xxxxx”.

Start by deciding the format to apply to each type of data. Here is what I’m using:

-- Default Mask Values --
--Strings
DECLARE @StringMask CHAR(30) = REPLICATE('x', 30)
DECLARE @EmailMask CHAR(100) = '@xxxxxx.com'

-- Dates
DECLARE @DateMask CHAR(10) = '1900-01-01' 
DECLARE @YearMask CHAR(4) = '1900' 

-- Numbers
DECLARE @NumberMask CHAR(20) = REPLICATE('1', 20)

Because I have the data type and its maximum length or precision, I can mask my data without worrying that I will run into a truncation error on any of my columns. For strings I can use CHARACTER_MAXIMUM_LENGTH from INFORMATION_SCHEMA.COLUMNS to ensure my mask is no longer than the column permits. For integers and decimals I can use NUMERIC_PRECISION, and for decimal types NUMERIC_SCALE as well.

SET @Mask = (
  SELECT
    CASE
      WHEN @DataType IN ('char', 'nchar', 'varchar', 'nvarchar')
        THEN '''' + LEFT(@StringMask, @CharacterMaxLength) + ''''
      WHEN @DataType IN ('date', 'datetime', 'datetime2', 'smalldatetime',
                         'datetimeoffset')
        THEN '''' + @DateMask + ''''
      WHEN @DataType IN ('tinyint', 'smallint', 'int', 'bigint')
        THEN LEFT(@NumberMask, @NumericPrecision)
      WHEN @DataType IN ('decimal', 'numeric', 'money', 'smallmoney')
        THEN CONCAT(LEFT(@NumberMask, @NumericPrecision - @NumericScale), '.',
                    LEFT(@NumberMask, @NumericScale))
      ELSE 'NULL'
    END AS Mask)
	
SET @sql = 'UPDATE ' + @SchemaName + '.' + @TableName + 
	   ' SET ' + @ColumnName + ' = ' + RTRIM(@Mask) + ';';
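To make that concrete, here is the kind of statement the loop generates. For the sake of the example I'm assuming Person.Person.LastName is one of the classified columns; it is an nvarchar(50), so the 30-character string mask fits without truncation:

-- Example of a generated statement for an nvarchar(50) column
UPDATE Person.Person SET LastName = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx';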

And that will work well in a lot of cases. However, you may have application logic that means you need to be a bit more sophisticated in some areas. Here are a couple of examples:

A column holding a value for a year is likely defined as a numeric type, in this case a smallint. However, using a default value mask for a smallint would give a value of 11111. If the application requires a valid year, it will probably error on that value. So I can override all year fields to use a standard valid value.

-- Year fields must contain a valid year
SET @sql = 'UPDATE ' + @SchemaName + '.' + @TableName + 
           ' SET ' + @ColumnName + ' = ' + @YearMask + ';';

For email addresses, the application may have some logic that checks the format is correct, for example looking for the @ symbol and at least one full stop. In this case I have defined my email mask as '@xxxxxx.com', which is the last part of a valid email address. I can then prefix this with my default string mask and use the RIGHT function to get the right number of characters for my column, while retaining the important formatting.

-- Email must be in a valid format
SET @sql = 
  'UPDATE ' + @SchemaName + '.' + @TableName + 
    ' SET ' + @ColumnName + ' = ''' + 
    RIGHT(CONCAT(RTRIM(@StringMask), RTRIM(@EmailMask)),@CharacterMaxLength)       
    + ''';';
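For example, AdventureWorks2017's Person.EmailAddress.EmailAddress column is an nvarchar(50). The trimmed 30-character string mask plus the 11-character email suffix is 41 characters, so RIGHT doesn't cut anything off and the generated statement looks something like this:

-- Example of a generated statement for the email column
UPDATE Person.EmailAddress SET EmailAddress = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@xxxxxx.com';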

And here is one final example to demonstrate handling more complicated rules. In the AdventureWorks2017 database the Sales.CreditCard table has a unique index on the CardNumber column. This prevents us using a standard xxxx-xxxx-xxxx-xxxx mask. I don't need to worry about the numbers being valid; they just need to be unique. To achieve this I'm going to use a couple of CTEs and some dynamic SQL.

First of all, I need to know how many rows are in my table so I can generate a Numbers table with the appropriate number of rows. Then I generate the card numbers as 16 characters, combining x with the number from the Numbers table to give the format xxxxxxxxxxxxnnnn. Finally, I format the card number as xxxx-xxxx-xxxx-nnnn and update the table, joining back to it on CreditCardID so that each row gets its own unique number.

You can use this same method for any other columns in your database that have this type of requirement.

DECLARE @sqlcommand NVARCHAR(500) = N'SELECT @NumberOfRows = COUNT(*) FROM ' + 
                                    @SchemaName + '.' + @TableName + ';' 
DECLARE @MaxRows BIGINT 
EXECUTE sp_executesql @sqlcommand, N'@NumberOfRows BIGINT OUTPUT', 
                                     @NumberOfRows = @MaxRows OUTPUT;

SET @sql = 
  'DECLARE @MaxRows BIGINT = 
    (SELECT COUNT(*) FROM ' + @SchemaName + '.' + @TableName + '); 

WITH Numbers
 AS
(
  SELECT 1 AS Number
  UNION ALL
  SELECT Number + 1
  FROM Numbers
  WHERE Number < @MaxRows
),
MaskedCardNumbers
 AS
(
  SELECT
    Number,
    RIGHT(REPLICATE(''x'', 16) + CAST(Number AS VARCHAR(16)), 16) 
      AS MaskedCardNumber
  FROM Numbers
),
ExistingCards
 AS 
(
  SELECT ROW_NUMBER() OVER(ORDER BY CreditCardID) AS RowNo, CreditCardID
  FROM ' + @SchemaName + '.' + @TableName + '
)
UPDATE cc
  SET ' + @ColumnName + ' = CONCAT(
    SUBSTRING(MaskedCardNumber, 1, 4), ''-'',
    SUBSTRING(MaskedCardNumber, 5, 4), ''-'',
    SUBSTRING(MaskedCardNumber, 9, 4), ''-'',
    SUBSTRING(MaskedCardNumber, 13, 4))
FROM ' + @SchemaName + '.' + @TableName + ' AS cc
  JOIN ExistingCards
    ON ExistingCards.CreditCardID = cc.CreditCardID
  JOIN MaskedCardNumbers AS MaskedCards
    ON MaskedCards.Number = ExistingCards.RowNo
OPTION (MAXRECURSION ' + CAST(@MaxRows AS VARCHAR(20)) + ');'

You can find the full script on GitHub at https://bit.ly/2BNYMEG. All of the update statements are created in the @sql variable, and you can choose to print or execute them. In this version I am just writing them to a table.
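As a rough sketch of what that could look like (the table name and columns here are my own illustration, not necessarily what the full script uses):

-- Hypothetical table for collecting the generated statements
CREATE TABLE dbo.ObfuscationStatements
(
    StatementId  INT IDENTITY(1, 1) PRIMARY KEY,
    SqlStatement NVARCHAR(MAX) NOT NULL,
    CreatedAt    DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Inside the loop, after building @sql for the current column
INSERT INTO dbo.ObfuscationStatements (SqlStatement) VALUES (@sql);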

How is this better than my existing routines?

If you use the sys.sensitivity_classifications view, you will automatically be able to obfuscate any new columns that are added without having to revisit your script.

The script is also portable across multiple databases because nothing is hard coded, so you can write one script and have it in use across all your databases in a short space of time.

Of course this does assume you are classifying your data in the first place, and also that you continue to apply classification when new objects are created.
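If you want a quick way to spot columns that have slipped through without a classification, you can turn the same views around. A rough sketch, which you will probably want to filter down to the tables you actually care about:

-- Columns in user tables that have no sensitivity classification yet
SELECT SCHEMA_NAME(o.schema_id) AS schema_name,
       o.[name] AS table_name,
       c.[name] AS column_name
FROM sys.columns AS c
    JOIN sys.objects AS o ON c.object_id = o.object_id
    LEFT JOIN sys.sensitivity_classifications AS sc
        ON sc.major_id = c.object_id AND sc.minor_id = c.column_id
WHERE o.[type] = 'U'
  AND sc.major_id IS NULL;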
