How to work and store privacy sensitive information and why Data Vault is helping you without losing information
Of course deleting information from a data warehouse is in essence a no-go. Bill Inmon stated long time ago that one of the 4 characteristics of a data warehouse is that it is non-volatile. This means we do not delete data. The reason for this rule is that the data warehouse should be the one place where we keep track of the complete history of the organization. As we are removing data from this storage we essentially removing part of the history.
And there comes GDPR (or AVG in Dutch) and other privacy regulations. In essence this requires that “all information which can identify to an individual person” needs to be protected for misuse and eventually deleted whenever requested by that individual.
Let’s take some time and see what this last part means in terms of a data warehouse. What will be the consequence in my data warehouse when I do need to delete data in my data warehouse? Let’s take some time and see what this last part means in terms of a data warehouse. What will be the consequences if I need to delete data from my historized data warehouse?
If I need to delete, for instance, a customer from my customer table in 3NF or a dimensional schema then I am not only deleting the record itself but, because I need to cascade delete – I also delete all history of that person plus all relations attached to that individual. This will ultimately mean that I can end up with sales without an identified customer – the sale itself would still be in place, it has only not a related – because deleted – customer any more.
Because of the separation of business key, relations and context in Data Vault modeling (which is typical for all Ensemble modeling patterns) there is some hope.
In Data Vault the business key itself is stored in the Hub with a one to one relation to the primary key. As long as you are able to choose a business key without any to the individual related attributes (in case of a concatenation or just because it is THE business key in itself) there is no need to delete this information from the Hub. When you do need to create a unique business key when it is not available and you are not able to avoid individual related attributes in creating it (e.g. a concatenation which includes something like first name, last name, date of birth, email address, etc.) you need to take extra care. See later in this post. This is because a Hub is a construct that hold the business key and no context whatsoever. This will also mean that all relations will stay in place when all to the individual person identifiable data needs to be deleted.
The context or individual attributes which can be made up to identify an individual will be stored in their own satellite where we design according to more or less privacy sensitive areas. See below an example how this will work out.
Looking into more detail, how will this work?
Each Hub representing an individual person need to have a non-identifiable business key. This means that the business key might have to be scrambled / pseudonimize before it is stored into the Hub whenever there are to the individual related attributes as part of the business key. Every new entry in the Hub will automatically create a corresponding surrogate ID – just a stupid number – as its primary key (see further down why this is a good idea). This surrogate ID is of course the primary key of the Hub where all the Satellites and Links will refer to. To be extra secure, this business key attribute – the column – can be masked for all users, except of course for the data warehouse user for the loading of the Hub and lookups when loading Links and Satellites. The masking of this column can be seen as an extra layer of security and thus follows the intention of the GDPR and other privacy regulations.
Another option is to opt for a “Focal” style for each hub representing individual persons – store the business key separate from the Hub itself. On the downside meaning that we break the pure Data Vault pattern and make automation / generation more complex. On the plus we do not need to mask or hide the business key within the Hub.
Of course the design of the Satellites must be as such that privacy sensitive information will be stored separately from non privacy sensitive information. And also here you will have an extra choice
- scramble or pseudonimize that information before storing => meaning the data might be less valuable but will most likely not needed to be deleted.
- delete the specific information, including history, when requested => meaning that there will be keys in the Hub without corresponding context in that Satellite.
An extra option is to create a s_DEA (Satellite with Derived, Enhanced & Augmented context) for generic use where all privacy sensitive information is pseudonimized or grouped making it more general but still useful even after the details are deleted or non accessible. In this case you can also use different personas to replace the original information to provide some context in case the original data is no longer available.
Why a surrogate ID?
Last thing helping with privacy regulations is the use of stupid numbers or surrogate ID as the primary key of the Hub instead of hash keys or even natural business keys. Why am I in favor of using stupid numbers instead of hashing my business key? There is no way to know from this surrogate ID what the original business key ever was. Even when I do just a “try & error” check – hashing what might be the original key and see if the hash value is still available – will not work when using a surrogate ID. Let alone when you are using natural keys as your primary key, there is no other way then to delete that key completely if it contains to the individual related attributes. Replacing a natural key with a more generic – persona – key will also mean that you will need to replace that key in all Satellites and Links to maintain the referential integrity. Plus it will no longer be recognized as an unique person.
In the case the business key itself is non-accessible and might be scrambled/pseudonimized as well I still suggest to delete the business key (or replace with a standard business key referring to a deleted individual) whenever the request to remove an individual. But keep the unique surrogate ID to know it is recognized as an unique person.
Another important reason for doing this that in any case an individual might become a customer (or employee or …) again and allowing us to store information again this will lead to be just a completely new individual without any reference to its future or past self. So this is to prevent that there is a connection between the information we now have on the customer to be connected with data we are not allowed to connect to this individual from its past.
To come to a conclusion
The separation of business key, relations and the context, especially if the context itself can be separated between to the individual related context and generic context is a big help with privacy regulations like GDPR. In case of an opt out there is no need anymore to cascade delete major parts of your data. The use of a surrogate ID will make sure that all related information – like transactions, generic context, etc. – will still find an unique landing place at the Hub still providing valuable information. You are not loosing transactions, sales, events, etc. from your data warehouse and still comply with privacy regulations like GDPR.