In my previous post I discussed the various reasons why you should be classifying your data. These include regulatory and compliance reasons, to help you to bring a breach under control, and to prevent accidental sharing of sensitive data. I also walked through how to set this up in Azure SQL Database.
Great, but what else can it do?
One of the most common questions I get when I talk to customers about data classification is: well, that’s great, but can I use it to automatically obfuscate data in my development environment? The answer is no. The data is classified, but there is nothing out of the box that will magic away that sensitive data in a non-production environment. However, there’s nothing to stop you using the classifications to write your own obfuscation script. Here is my version of such a script, which you can use to get started.
Building the script
Get the data from sys.sensitivity_classifications
The first thing we need to do is find the objects we need to obfuscate. These are available in sys.sensitivity_classifications view. I’m joining with sys.objects and sys.columns so I can get the names of the objects.
SELECT schema_name(o.schema_id) AS schema_name,
       o.[name] AS table_name,
       c.[name] AS column_name,
       CAST([information_type] AS NVARCHAR(128)) AS info_type
FROM sys.sensitivity_classifications AS sc
JOIN sys.objects AS o
    ON sc.major_id = o.object_id
JOIN sys.columns AS c
    ON sc.major_id = c.object_id
    AND sc.minor_id = c.column_id;
I am using AdventureWorks2017, and my data is classified as follows:
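If you are following along and have not classified anything yet, you can classify a column with the ADD SENSITIVITY CLASSIFICATION statement (Azure SQL Database and SQL Server 2019 onwards). The column, label, and information type below are illustrative, not the exact values from my database:

```sql
-- Illustrative classification of an AdventureWorks2017 column.
-- The LABEL and INFORMATION_TYPE values here are examples only.
ADD SENSITIVITY CLASSIFICATION TO Person.EmailAddress.EmailAddress
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');
```

Once classified, the column appears in sys.sensitivity_classifications and will be picked up by the query above.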
Loop through each item
Next I’m going to loop through each item in the list and create a statement to update my development copy of the database, based on the type of data the column holds. To get the data type information, we’ll take the SchemaName, TableName, and ColumnName from the cursor and look up the relevant details in INFORMATION_SCHEMA.COLUMNS.
SELECT @DataType = DATA_TYPE,
       @CharacterMaxLength = CHARACTER_MAXIMUM_LENGTH,
       @NumericPrecision = NUMERIC_PRECISION,
       @NumericScale = NUMERIC_SCALE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = @SchemaName
    AND TABLE_NAME = @TableName
    AND COLUMN_NAME = @ColumnName;
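The loop itself isn’t shown above. As a sketch, the cursor driving it might look like this; the variable and cursor names are my assumptions, not necessarily those in the full script:

```sql
-- Sketch of the driving cursor. Names are assumptions based on the
-- queries in this post, not the exact full script.
DECLARE @SchemaName SYSNAME, @TableName SYSNAME, @ColumnName SYSNAME, @InfoType NVARCHAR(128);

DECLARE classification_cursor CURSOR FOR
    SELECT schema_name(o.schema_id), o.[name], c.[name],
           CAST([information_type] AS NVARCHAR(128))
    FROM sys.sensitivity_classifications AS sc
    JOIN sys.objects AS o ON sc.major_id = o.object_id
    JOIN sys.columns AS c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id;

OPEN classification_cursor;
FETCH NEXT FROM classification_cursor INTO @SchemaName, @TableName, @ColumnName, @InfoType;

WHILE @@FETCH_STATUS = 0
BEGIN
    -- Look up the data type from INFORMATION_SCHEMA.COLUMNS and
    -- build the UPDATE statement for this column here.
    FETCH NEXT FROM classification_cursor INTO @SchemaName, @TableName, @ColumnName, @InfoType;
END

CLOSE classification_cursor;
DEALLOCATE classification_cursor;
```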
Set some basic masking
In a lot of cases you just want to hide the value from the user without any fancy code. Ideally, in a normalised database all your joins are on your CustomerId column, not on the customer name or email address, so it doesn’t matter if all the names become “xxxxx”.
Start by deciding the format to apply to each type of data. Here is what I’m using:
-- Default Mask Values --

-- Strings
DECLARE @StringMask CHAR(30) = REPLICATE('x', 30);
DECLARE @EmailMask CHAR(100) = '@xxxxxx.com';

-- Dates
DECLARE @DateMask CHAR(10) = '1900-01-01';
DECLARE @YearMask CHAR(4) = '1900';

-- Numbers
DECLARE @NumberMask CHAR(20) = REPLICATE('1', 20);
Because I have the data type and its maximum length or precision, I can mask my data without worrying that I will run into a truncation error on any of my columns. For strings, I can use CHARACTER_MAXIMUM_LENGTH from INFORMATION_SCHEMA.COLUMNS to ensure my value is not longer than the column permits. For integers and decimals, I can use NUMERIC_PRECISION, together with NUMERIC_SCALE for decimal types.
SET @Mask =
    (SELECT CASE
        WHEN @DataType IN ('char', 'nchar', 'varchar', 'nvarchar')
            THEN '''' + LEFT(@StringMask, @CharacterMaxLength) + ''''
        WHEN @DataType IN ('date', 'datetime', 'datetime2', 'smalldatetime', 'datetimeoffset')
            THEN '''' + @DateMask + ''''
        WHEN @DataType IN ('tinyint', 'smallint', 'int', 'bigint')
            THEN LEFT(@NumberMask, @NumericPrecision)
        WHEN @DataType IN ('decimal', 'numeric', 'money', 'smallmoney')
            THEN CONCAT(LEFT(@NumberMask, @NumericPrecision - @NumericScale), '.', LEFT(@NumberMask, @NumericScale))
        ELSE 'NULL'
     END AS Mask);

SET @sql = 'UPDATE ' + @SchemaName + '.' + @TableName
    + ' SET ' + @ColumnName + ' = ' + RTRIM(@Mask) + ';';
And that will work well in a lot of cases. However, you may have application logic that means you need to be a bit more sophisticated in some areas. Here are a couple of examples:
A column holding a year is likely defined as a numeric type, in this case a smallint. However, the default numeric mask for a smallint would give a value of 11111. If the application requires a valid year, it will probably error on that value, so I override all year fields with a standard valid value.
-- Year fields must be a valid year
SET @sql = 'UPDATE ' + @SchemaName + '.' + @TableName
    + ' SET ' + @ColumnName + ' = ' + @YearMask + ';';
For email addresses, the application may have logic that checks the format is correct, for example looking for the @ symbol and at least one full stop. In this case I have defined my email mask as ‘@xxxxxx.com’, which is the last part of a valid email address. I can then prefix this with my default string mask and use the RIGHT function to take the correct number of characters for the column, while retaining the important formatting.
-- Email must be in a valid format
SET @sql = 'UPDATE ' + @SchemaName + '.' + @TableName
    + ' SET ' + @ColumnName + ' = '''
    + RIGHT(CONCAT(RTRIM(@StringMask), RTRIM(@EmailMask)), @CharacterMaxLength) + ''';';
And here is one final example to demonstrate handling more complicated rules. In the AdventureWorks2017 database, the Sales.CreditCard table has a unique index on the CardNumber field. This prevents us from using a standard xxxx-xxxx-xxxx-xxxx mask. I don’t need to worry about the numbers being valid, they just need to be unique. I’m going to use a couple of CTEs and some dynamic SQL.
First of all, I need to know how many rows are in the table, so I can generate a Numbers table with the appropriate number of rows. Then I build the card numbers as 16 characters, combining x with the number from the Numbers table to give the format xxxxxxxxxxxxnnnn. Finally, I format the card number as xxxx-xxxx-xxxx-nnnn and update the table.
You can use this same method for any other columns in your database that have this type of requirement.
DECLARE @sqlcommand NVARCHAR(500) = N'SELECT @NumberOfRows = COUNT(*) FROM ' + @SchemaName + '.' + @TableName + ';';
DECLARE @MaxRows BIGINT;
EXECUTE sp_executesql @sqlcommand, N'@NumberOfRows INT OUTPUT', @NumberOfRows = @MaxRows OUTPUT;

SET @sql = 'DECLARE @MaxRows BIGINT = (SELECT COUNT(*) FROM ' + @SchemaName + '.' + @TableName + ');
WITH Numbers AS
(
    SELECT 1 AS Number
    UNION ALL
    SELECT Number + 1 FROM Numbers WHERE Number < @MaxRows
),
MaskedCardNumbers AS
(
    SELECT Number,
           RIGHT(REPLICATE(''x'', 16) + CAST(Number AS VARCHAR(16)), 16) AS MaskedCardNumber
    FROM Numbers AS n
),
ExistingCards AS
(
    SELECT ROW_NUMBER() OVER (ORDER BY CreditCardID) AS RowNo,
           ' + @ColumnName + '
    FROM ' + @SchemaName + '.' + @TableName + '
)
UPDATE ec
SET ' + @ColumnName + ' = CONCAT(
        SUBSTRING(m.MaskedCardNumber, 1, 4), ''-'',
        SUBSTRING(m.MaskedCardNumber, 5, 4), ''-'',
        SUBSTRING(m.MaskedCardNumber, 9, 4), ''-'',
        SUBSTRING(m.MaskedCardNumber, 13, 4))
FROM ExistingCards AS ec
JOIN MaskedCardNumbers AS m
    ON ec.RowNo = m.Number
OPTION (MAXRECURSION ' + CAST(@MaxRows AS VARCHAR(20)) + ');';
You can find the full script on GitHub at https://bit.ly/2BNYMEG. All of the update statements are created in the @sql variable, and you can choose to print or execute them. In this version I am just writing them to a table.
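As a sketch of that last step, you could capture each statement for review before running anything. The table name here is a hypothetical choice for illustration, not necessarily the one in the full script:

```sql
-- Hypothetical staging table for the generated statements;
-- the full script may use a different name and shape.
CREATE TABLE dbo.ObfuscationStatements
(
    StatementId  INT IDENTITY(1, 1) PRIMARY KEY,
    SqlStatement NVARCHAR(MAX) NOT NULL
);

-- Inside the loop, instead of executing immediately:
INSERT INTO dbo.ObfuscationStatements (SqlStatement)
VALUES (@sql);

-- Review the statements, then execute each one when you are happy:
-- EXECUTE sp_executesql @sql;
```

Writing the statements to a table gives you an audit trail and a chance to eyeball the generated SQL before it touches any data.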
How is this better than my existing routines?
If you use the sys.sensitivity_classifications view, you will automatically obfuscate any newly classified columns without having to revisit your script.
The script is automatically portable across multiple databases because nothing is hard-coded, meaning you can write one script and have it in use across all your databases in a short period of time.
Of course this does assume you are classifying your data in the first place, and also that you continue to apply classification when new objects are created.