There are many facets to security when we’re talking about data. Hopefully our DBAs are doing the fundamental things like ensuring firewalls (you can check out the start of John’s seies of posts on protecting SQL Server with Windows Firewall here.) are in place on our servers making sure that database user access rights are appropriate, but what about the data itself? What are we doing on that level to ensure we’re not exposing data where we shouldn’t?
SQL Server offers us a few useful options in this space, such as dynamic data masking and row level security. Less well used in my experience is data classification, and that’s what I want to talk about today. So let’s start at the beginning and explain what data classification is and why you should use it.
Data classification is a process that allows us to identify and report on different types of data within our databases. In the SQL world there is a Data Discovery and Classification tool that makes it really simple for us. This is available for both SQL Server and Azure PaaS database services. For SQL Server you can use SSMS (version 18 and above) to classify your data; you will find the Data Discovery and Classification option by right clicking on the database and looking in the Tasks menu. For Azure databases you’ll find it under Advanced data security in the portal. You can of course also use the Rest API or PowerShell.
Why Classify Your Data?
As to why you should classify your data, there are couple of reasons:
Firstly, and most importantly, you have a duty to ensure that personal data is protected. Regulatory requirements may vary by region and industry but fundamentally the GDPR (General Data Protection Regulation) requires that companies can identify the data they hold on EU citizens, and they can delete it on request. Classifying your data will make this process faster and easier, and will also allow you to more easily demonstrate your compliance to regulators when requested.
Second, data breaches will happen. And they will happen to you. Being able to quickly identify what data may have been stolen will enable you to bring the situation under control much faster.
Thirdly, you can prevent inappropriate sharing of data. If a field is classified as Confidential, then chances are you shouldn’t be sharing that data, especially externally. Most people would probably consider that sending survey results to a third party for analysis which contains the respondent’s email address would be inappropriate. If the email address is marked as confidential, then it makes it much easier for data processors to understand what they should and should not do.
There is also an added bonus to classifying your data that helps with the age old question of how you provide production like data to developers without risking accidentally including sensitive data. We will come to that in a later post.
What labels you use, and what process you attach to them may be matter for your compliance or security departments but it is the responsibility of every developer and DBA to ensure that the data itself is labelled appropriately.
How to Classify Your Data
OK, so now we understand the importance of classifying our data, let’s look at how we can start the process in Azure SQL Database. I’m going to use the trusty AdventureWorks database here because it has a wide variety of data.
In the Azure Portal locate Advanced data security in the Security menu and open Data Discovery & Classification.
You will quickly be able to see the total number of tables and columns detected, and some recommendations for data classification based on the object names. It is important to note that this is looking at your database schema, not your data, so it is key that you have a good naming convention that allows this tool to identify likely classification targets. If you have gone down the route of using obscure names to make it harder for hackers to guess where the data is (and yes, I have seen that done), then you probably will not get a lot of recommendations and you will have to do a lot of the work manually.
For AdventureWorks we can see that there are 33 columns with classification recommendations.
Click on the link to see the recommendations. The first thing you should notice is that wide variety of data has been identified, and suggestions for classification and labelling have been made. There is an important distinction here. The information type is what you use to identify the type of data you are storing. The sensitivity label is what you use to determine how to handle that data. If you need to change the sensitivity labels or information types, you can so this by clicking Configure from the Overview screen. Note, you need to have permission to modify the Information Protection policy on the tenant root group to make changes. Ideally these should be agreed and defined once, and only changes if absolutely necessary.
The list of objects recommended for classification includes personal information such as name and date of birth as well as financial data such as credit cards. However, it is important to remember that this list is a best effort based on the object names. As we shall see, some items identified may not require classification and there may be others that are missed. And this just emphasises the importance of a solid naming convention.
Let’s start by accepting some of the recommendations. Firstly, Person.PersonPhone.PhoneNumber. That has been classified as Contact Info and labelled Confidential. That looks correct so we can tick the box and click Accept selected recommendations. We can now see that we have one column classified, and at the bottom of the screen we can expand the window to see the remaining 32. Let’s go back and take another look. There are some items here that do not look quite right.
PhoneNumberTypeID looks like a business key rather than personal data, and a quick check in the data confirms that it has a value of 1, 2, or 3. That does not look like data that needs to be classified so we can select those two items and click Dismiss selected recommendations. That will stop us from seeing these recommendations again, allowing us to concentrate on items we haven’t reviewed yet.
Once we have worked through the list, we need to start thinking about what other data we have that might have been missed. AdventureWorks has a table called HumanResources.JobCandidate, and if we look in there, we can see that the resume is being stored in an xml column. Candidate resumes should be correctly classified; they often contain PII (Personally Identifiable Information) data such as the candidate’s name as well as other confidential information such as contact numbers or email address. Let’s add a classification for that now. Click Add classification and in the Add classification blade locate the Resume column from HumanResources.JobCandidate column. Since the resume may contain multiple types of information, I have chosen to set the information type to Other. As it may contain PII data, we should set the sensitivity label to Confidential – GDPR.
Click Add classification and remember to save your changes. Back on the Overview tab we can start to see our classification is taking shape.
You can also get information about your data’s classification and labelling from the sys.sensitivity_classifications view for Azure SQL Database and SQL Server 2019. For earlier versions of SQL Server the data is stored in Extended Properties as sys_information_type_id, sys_information_type_name, sys_sensitivity_label_id, and sys_sensitivity_label_name.
SELECT schema_name(o.schema_id) AS schema_name, o.[name] AS table_name, c.[name] AS column_name, [information_type], [label] FROM sys.sensitivity_classifications AS sc JOIN sys.objects AS o ON sc.major_id = o.object_id JOIN sys.columns AS c ON sc.major_id = c.object_id AND sc.minor_id = c.column_id
Once you have worked through the list of recommendations and located and classified data that was missed, you are in good shape.
However, this is not the time to get complacent. Schemas change and new columns and tables will be added. You need to ensure that any new columns or tables that are added are appropriately classified.
In addition, while the data may be classified correctly, that is nothing without solid processes in place that dictate what can and cannot be done with that data. Each organisation must document what the sensitivity labels and information types mean, and ensure that staff are trained correctly to help prevent against incorrect usage of data.
In the next post I will take a look at how you can also use this classification to speed up the processes of providing copies of production databases to development teams without exposing your sensitive data.