Is Self-Service truly Self-Service?

The last couple of years the word is on “Self-Service” BI. Almost all BI-vendors are feeling the need to tell the world that their tool is a perfect match for “Self-Service”. Finally the business users don’t need IT anymore to analyze data and create reports, etc. Even nowadays with the possibility to explore all data available in data lakes (to use another buzzword). In this post I will explain why I do not agree with the promise of “Self-Service” vendors and SI’s try to sell as a concept to their customers.

When I started in BI in 1999 tool vendors and SI’s, including myself, where fighting against users with MS Excel and MS Access creating their own reports and analyze their data. For this they asked for direct ODBC access or rather “a connection” to the different source systems and work their way thru the data and try to integrate within different spreadsheets with lookups and such. Apart from being time consuming and labor intensive it was also error prone. But it was also very agile and direct for the users themselves. Of course they still needed help them combining the data and also understanding the data. And sometimes they indeed ask for that support.

The major reason for the fight against this type of “Self-Service BI” was that this was considered dangerous. Users could change or manipulate the data or create calculations without any governance, on purpose or just out of ignorance. The answer from vendors and IT-department was: use proper BI tools like Cognos, BO, Microstrategy, etc. Tool vendors, SI’s, myself included, embraced the possibility to deliver reports in PDF to restrict the ‘misuse’ of data.

We tried to create the proper reports and dashboards for all users and include as much filters as requested to make the users happy. In the end users still asked for the “export to XLS” button plus asked for filtered reports with lots of data in it at a lowest possible level of grain and a maximum of X number of rows. Of course this was their replacement of the direct ODBC access to the source systems which they no longer had.

For me this already changed my opinion in the restriction of users in accessing the (their own) data. I’d rather gave them the freedom on the governed data out of data marts instead of let them access the source systems again. I like to think this was also in the users interest, the datamarts made more sense then the source systems.

The need of users to control their data and have unlimited access was, and always will be, big. This is also heard by new and upcoming tool vendors like Qlik and later on Tableau and such. These claimed to be easy to use tools which provide the promise of full control by the user (according to Qlik in the early days you don’t need a data warehouse anymore!). For me it led to a discussion on an event in Amsterdam opposed to some tool vendors where I claimed that I don’t mind what tool a user is using to access their data. From my side this could even be Excel! Of course all the other people in the panel were not amused and even offended. How could I suggest to let users use the darkest of all tools! No control on presented data is the worst thing. The only thing is that they did not listen to what I suggested: Have a governed and trusted set of data available which users can freely access and explore without any restriction on tooling. Of course, but that is governance as well, when there is a meeting or presentation the data is presented in a formal way with a standard tool the whole organization is using . This is to prevent endless discussions about calculations and definitions. This is the architecture I want to create for my users .

So for me there is no such thing as true and complete Self-Service BI for the entire organization. There are a limited number of users capable of doing it all on their own and still got the right data / information on the right time for the actions to be made. These are however a minority if they ever exist in an organization.

In my opinion there will always be a governed set of data – let’s call these the data marts, virtual or physical, in any or non modeling pattern – which users can freely access to explore and analyze. Next to a fixed set of reports, dashboards, etc. which have a formal status in the organization with agreed definitions and use (which can be based upon the same datasets). This is what I call Managed-Service, the data is already combined/integrated/curated and ready for the users to explore. Which tool they are using for it and if they want to combine it with other data it is to them to decide. However in my experience the only true “Self-Service” tool is Excel, a tool which users feel free to use and explore data the way they want it. Tableau. Qlik, PowerBI and all others have a too big technology component which scares users. Except of course the power users, or those people on the department which are feeling more happy to work with more advanced tools and in fact are the BI people within each department. These are also the people you want in your “Center of Excellence” or at least help your team to create reports and analysis to be used in the entire organization.

I like to thank Ronald Damhof who brought up his 4 Quadrant model (see: Keynote  22 mei 2014 – dwh automation – 4 Quadrant) . For me that is a guidance when I tell people my thoughts on “Self-Service”. People are free to use and explore data as long as they are aware that this is a non (less) governed area or Quadrant IV even if the data itself comes directly out of the data warehouse or Quadrant I. Also thanks to Martijn Imrich in his post on Data Lab ( where he is suggesting to make the data full available for exploration where he is focusing on Quadrant I, III & IV of the Quadrant model and thus extending governed data with non-governed data.

In my Managed-Service thoughts the Data Lab is consisting of a governed set of data which the data scientist can and probably will extend with extra data (external or other data marts or other sources or all) to explore and discover useful possibilities for their organizations. A part of this data is still managed and the group of users are highly skilled and trained to work with tools and data.


Why I love using Post-Its for data modeling

 In my early days working as an environmental engineer, almost 20 yrs ago, I tried to be organized and use the famous ToDo lists we all had on our desks. The purpose was to write down all your tasks of that day / week preferably with some kind of ordering or prioritizing. Looking back this was actually a really Agile way of working before Agile was commonly used. Tasks could shift during the day or in the week depending on customer demands or priority. My only issue was and still is that I am not that organized. Writing everything down on the ToDo list on the right day or rewrite a task to another moment in time if that was more appropriate, this was not my style of working. My solution was writing it all down on Post-Its. Soon, these could be found on my desktop computer, screen, phone (not a mobile one but a proper desk phone), drawer, desk itself, etc. in no particular order or systematic way. In other words it help me write down tasks and organize it in my own special way of which I hoped I could found the right task at the right moment. Of course it should have been done in a better way to put the Post-Its in the ToDo lists and reorder them in there to make sure I could not forget a task but unfortunately I am not that organized kind of person.

In my first period as a BI consultant I stopped using Post-Its except for the usual notes about who to call back or other small notes which I found on various spots trying to read back what I wrote down as being important. All the work I had to do was written down in functional and technical requirements and came from Modeling tools like Erwin, Powerdesigner, ets.. In that period of time my work consisted of creating cubes and reports on top of Kimball style data warehouses or dimensional datamarts. I loved the Starschemas and BI-Matrices when I had a meeting with a user, These provide structure and overview in the complexity of the data and made it easily understood by users.

After a while I started as an ETL developer and needed to build the appropriate dimensions and fact tables. I still received the big functional and technical requirements, design documents and data models and try to maintain the BI matrices to ensure conformity around the dimensions.

The BI Matrix was giving me, my customers and colleagues, the structure needed to ensure we did it all right the first time. We already knew that changing a dimension or a fact table is hard work and needs thorough re-engineering. That is why we tried to communicate with the star schemas plus the BI-Matrices to avoid being forced to create a new dimension next to the existing one as a simple and secure solution. And thus started to loose conformity instead which was a quicker and more solid solution than doing the hard work of re-engineering and keeping conformity.

After building the 4th, active, Customer dimension for the same organization I felt there must be another / better way. My initial thought was to analyze more upfront to make sure we covered all before building. Of course that was not the real answer. It simply takes too much time to analyze all and information delivery will always be too late for the business. That is why Kimball style data warehouse became more popular in the first place over Inmonn’s Corporate Information Factory, at least in the Netherlands.

Around 2004 / 2005 I first heard about Data Vault. My initial thoughts on Data Vault were: ” A way too complex and complicated model, far to technical”. Being trained in Data Vault modeling in 2005/2006 improved a lot for me but I still faced some issues in the communication of the model with my Customers. Of course all Data Models were still written down in Erwin or Powerdesigner or any other modeling tool using ERD’s. It did not helped then back in the days we were still facing some modeling issues in Data Vault making it more complex which we now solved in the current state of Data Vault modeling (no more Link on Link constructs, no transactional links, reference tables, etc). But still it was hard to show the model to a user and discuss the model with them. Undoubtedly Linstedts Data Vault changed a lot for me but it needed a bit more time to see the real beauty of the promise of Data Vault.

It became more clear to me working with Hans Hultgren on some projects and learning about Ensemble Modeling (the family name for modeling patterns like Data Vault, Focal Point, Anchor, DV2.0, and many more). By designing Ensembles first instead of full blown Data Vault models, the model suddenly became clear again for the users which directly lead to valuable input by the users and creating a stronger model. This also led to a more interactive way of modeling using whiteboards to draw the ensembles together with the business users staying away from the technical stuff. By wiping out some ensembles and redrawing them on another, more appropriate, part of the whiteboard during the modelstorming process we got to the best possible model to be translated into a physical model and feed the data warehouse automation tool.

In our training, Genesee Academy’s Certified Data Vault Data Modeler, we started to use Post-Its again for flexibility and speed in the workshops. Also in the training rooms we do not always have enough whiteboards or flip overs to work on. And there is always room to stick Post-Its. The simple way of changing and reordering a model by just replacing the Post-Its is very helpful in the learning process.

Although this way of working is very Agile and communicative I had the feeling it could be improved further. This became clear for me after talking to Larence Corr a couple of years ago and reading his book “Agile Data Warehouse Design”. To me it felt that the BEAM methodology could be the part I was missing. However, the book and the methodology are completely focused on dimensional modeling as an end-result . For me this is not my desired model for a Data Warehouse., I really love the speed, flexibility and adaptability of Data Vault and other Ensemble Model Patterns and still wants to implement them for my customers.

Recently I followed Lawrence Corr’s training and learned his way of modelstorming with the 7W’s and the BEAM Canvas which really enriched my toolbox for Modelling again. I learned to write all down on Post-Its and place them in a certain order on the 7W framework and next transfer them to the BEAM Canvas (easy to not have to rewrite just re-use). This use of Post-Its provides me the structure I first missed when moving towards Data Vault modeling and leaving the BI Matrix behind.

Combining his techniques and thoughts with the premise of Ensemble Modeling from Hans Hultgren I am now able to completely build an entire datamodel staying away from the technical ERD’s or modelling tools except the most important tools: Post-It plus Whiteboard. And the best part of it is that is totally interactive with the users which makes the result, Data Warehouse plus (virtual) datamarts, understandable for my user. And to top that, staying away from ERD’s also means that implementation in JSON or any other Big Data (non) structure is possible from the same basis.

The only thing missing in my toolbox now is the Post-It app of 3M which gives the opportunity to rearrange the Post-Its in Powerpoint for documentation. The app is only available in for IoS :-(.

Hope this will help others to start using the most powerful modeling tools and get the user involvements. In the end, we are building data warehouses, data marts, reports, etc. for them, not just for ourselves

The Shuttle, connecting data

A new Data Vault construct to connect to the Big Data Environment
In my former post I wrote that if you really want to create and maintain a connection from a Data Vault model (or almost any kind of Ensemble) to the big data environment you need to search for an elegant way to solve this. In this article I will describe my search and my claim to have found this elegant solution. I will describe this construct from the perspective of Data Vault but I am convinced that other types of Ensemble Data Warehouses can use the same construct.

I wrote before that if one wants to create and maintain the connection to unstructured data on a hashkey, which is a hash representive of the Business key (the driver for the hub), this will not work the majority of the time. In fact only in a small number of cases the business key (or the Hash representation of that business key) is also meaningful or available in the unstructured world. For instance: where sku is the business key for products in retail organizations, this will rarely be used in social media, streams or calls etc. This can be of course resolved with a taxonomy or ontology which matches “white sneakers” to a range of sku’s (for each different size, etc). In this case it does not matter if you are matching Business keys, hashkeys or surrogate keys (if you are looking for the matching key or keys it doesn’t matter if it is a surrogate key, a hashkey or even the business key itself).
And if we take it one step further you might even want to analyze “running shoe” from your Big Data environment and run that against your structured data, this will lead to another lookup and matching effort. — Remember it is Big Data so we do not know upfront where to start. —
So it is not only a hashkey representation that won’t work but also the lookup seems not to be the solution we are looking for.

In Data Vault modeling links are representing business relations so we might want to consider using a link to connect to the Big Data Environment. A link construct consists of its own key and all the keys of its connecting hubs. In the case connecting to a Big Data Environment we will probably be able to calculate the necessary hashes to “white sneaker” and/or  “running shoe” upfront and use these in the link with all appropriate sku’s or use the search strings themselves as a key in the link. This however brings in another problem: a link construct will only store existing relations, not possible relations. Since we are connecting to the Big Data Environment we do not know upfront if there is something to connect to or it just represents a possible relation or a relation that just might be there in the not foreseen future.  — Remember it is Big Data so we do not know upfront where to start. —
So the use of a link is also not the solution.

Another path to investigate is to store the information in a satellite, just as we would do with reference information, have the search string with its hashkey  for all possible foreseen search strings in the appropriate satellite(s) and refer to the Big Data Environment. This make sense because this connection could be seen as some kind of context information. This is coming close to the solution I am looking for.
Thinking thoroughly on this one I noticed that I need to look further. The reason to look further is there might be a problem if the components of the search string are stored in different satellites due to different rate of change, source system, etc. (this will also prevent virtualization of the hashkey). This will mean that I might need to redesign my satellites to store complete search strings. Also the adding of a new search string will be a destructive change to a satellite which we want to prevent as much as possible. Last thing which is making me to look further is that I do not want to store the appropriate search string (in any format) in each satellite which might be of interest to combine with the unstructured data.   — Remember it is Big Data so we do not know upfront where to start. —

Comes in the Shuttle. This is a structure which looks like a satellite but its sole purpose is to store the search string and its hashkey representation. This might be based on the business key, a part of the business key or a just a description or a grouping of context attributes from the EDW perspective. The search string and its hash key representation will be created upon information already available in the Ensemble regardless if there will be a match in the Big Data Environment and regardless where it’s parts are stored (Hub for the business key or to be found in the different Satellites).
The creation of the search string will be no more than just formatting the search string to match the required pattern coming from the Big Data environment. Due to this formatting and the calculations of the hashkey this will also imply that a Shuttle is a BDV construct and will be calculated after loading the Ensemble. Adding a new search string is possible by adding another shuttle to the hub (or link if applicable).  Same for removing a search string which is no longer applicable is just a matter of storing the Shuttle offline.

IMO the Shuttle is the most elegant and non-destructive solution to connect the structured and unstructured world together.

Data Vault2.0 vs Data Vault, a battle?

Main differences between Data Vault and DataVault2.0

I get more and more questions about the difference between Data Vault (DV) and Data Vault2.0 (DV2.0). In this article I try to explain the differences between the two based upon my knowledge and experience.
To get started: Data Vault is a modeling technique which can be used in data warehouse projects almost regardless of the project methodology used where DV2.0 claims to be a data warehousing methodology by itself. Besides this different approach DV2.0 has of course a modeling component where the focus of this article is upon. A DV2.0 model is different from a standard DV model however they are closely related to each other. I will pick the 3 main differences concerning the underlying modeling approach between these two and will describe my observations.

Use of Hashkeys
In DV2.0 the use of hashkeys as the surrogate key is a must. This gives the flexibility to load all data in parallel since there is no dependency between Hubs, its Satellites or the Links. This also seems to pave a way to use unstructured data or use DV2.0 in a NoSQL environment.
In standard DV there is an option to use a hashkey instead of a surrogate key and the decision to use one or the other requires some thoughtful consideration. One of the reasons for this is that there is a benefit to maintaining the keyed dependency of Links and Sats with the Hubs. This benefit is that it maintains the integrity of the Ensemble.

Also the connection to unstructured data on a hashkey, which is a hash representive of the business key (the driver for the Hub) will not work the majority of the time. In fact only in a small number of cases is the business key (or the Hash representation of that business key) also meaningful in the unstructured world. For instance: where sku is the business key for products in retail organizations, this will rarely be used in social media, streams or calls etc. This can be of course resolved with a taxonomy or ontology which matches “white sneakers” to a range of sku’s (for each different size, etc) but this can also be done to match with a surrogate key (if you are looking for the key it doesn’t matter if it is a surrogate key, a hashkey or even the business key itself). If you really want to connect both worlds you need to search for a more elegant way to solve this. This elegant way is underway in a special construct which handles the connection from a DV model (or even DV2.0 model or almost any kind of Ensemble) to the big data environment.

The use and purpose of Links evolved during the last decade. In the past Links were used to represent a relation, event or transaction. This led to Link on Link constructs (or even Link to Link to Link or more Links constructs) to represent grain shifts (order – orderline) and extended Sats on these Links. Where in DV2.0 this still is the case in the current best practice for the DV modeling pattern the choice is made to model only business relations as a Link. Transactions and events have business meaning as their own core business concept (person, place, thing, event) and so become Ensembles complete with a Hub, Sats and Link(s).
This is an evolution based upon modeling experience around the world. Links are solely there to represent the business relations. If we have something with business meaning or we want to add lots of context to it -> this requires a Hub. The business key for this Hub might be a concatenation of the combined business keys of the attached Hubs. For example, a sale modeled as a Link between a customer, employee, store and product has several issues. First there is no way to track an additional identical sale arriving in the same load. Also the grain of the sale is not represented if the sale is for more than one product or there is more than one employee related to the sale. And payments, deliveries, returns, or other subject areas can only connect to this sale via a Link on Link construct. The Data Vault community now recognizes that an event or transaction modeled as a Link is not a part of the core pattern and in fact any transactional Link is in reality not a Link but rather a form of 3NF entity. The sale, as with all events and transactions, is its own business concept thus an Ensemble requiring its own Hub.

Reference tables
The last main model difference between DV and DV2.0 is the modeling of reference tables. Common practice in DV modeling is to go for Hubs and Sats. In fact it is strongly recommended and taught to stick with DV all throughout the model (even when there is no need for history or there is a fixed referenced list). This simplifies a lot. There is no need for discussion which pattern is used in your data warehouse. Sticking with the pattern also means automation throughout the entire model. In DV2.0 there are several ways to go including use of 3NF tables. All depending on need for history, fixed datasets, etc. This means that in DV2.0 the modeler needs to think which kind of reference is being asked for (now and in the future) and model the correct kind of reference table. In my opinion this is a complication which is not needed and also breaks the automation pattern or at the least makes it more complex for automation.

DV2.0 claims to be the new and improved version for Data Vault implementations however IMO I see some points remaining in DV2.0 which were in DV modeling many years ago where based upon best practice and experience of modelers all over the world Data Vault has evolved to what it is now. In fact, I see more new development and improvement in Data Vault modeling then in DV2.0.

The point of this article is to encourage you to be aware of the differences between the current Data Vault pattern and the DV2.0 pattern and pick your own optimal approach. Not based upon a claim that one is new, better and improved (if it could make this claim) but based upon your own analysis and your own needs and requirements.

Note to the reader:
For this article I used my own modeling experience and knowledge on Data Vault and DV2.0. Also I read thoroughly the books “Building a Scalable Data Warehouse with Data Vault 2.0” by Daniel Linstedt & Michael Olschimke and “Modeling the Agile Data Warehouse with Data Vault” by Hans Hultgren. I do have my own preference between the two but I tried to be as objective and precise as possible so the reader can be better informed. The start to be better informed is to read the books, follow training & certification and discuss with experienced modelers.

A Business Key Satelite

Today a short post on a best practice from my own experience in Data Vault modeling.

For those who know Data Vault modeling one of the important things (the first thing to look for) is to find the Unique Enterprise Wide Business Key . If you find the business key (BK) but it is not enterprise wide unique, or not unique at all, there are a couple of options. The preferred one is to create a concatenation to make the BK  unique. The problem occurs when you only need a part, or several parts of the concatenated bk in reports you need to split the bk. This is always a pain because each report builder needs to know how your concatenation is build and which part is the part they need.

It is of course also a possibility to store the parts in the regular satellites on the hub but there are reasons not to do this. When there are multiple satellites, especially grouped on purpose, you need to store the individual parts in each of the satellites. When there is a need to change the concatenation (extending or decreasing) all satellites needs to be changed.

To avoid all that it is wise to create a separate satellite where all the individual parts are stored. This satellite is 1 on 1 with the hub (each record in the hub is one record in the satellite).  Because the bk satellite is a 1 on 1 satellite there is always one and only one valid record where each part can be found – no issue with different time slices in the different satellites.

Business Key too specific

One of the first things learned at any Data Vault training is the importance of getting the right business key. Preferably an enterprise wide unique business key -that business key uniquely identified and used throughout the entire enterprise. If that is not available due to reuse of the business key or receiving the same BK from different source-systems meaning something different, we learn to make it unique. By concatenating fields (like concatenating BK with source-system) or use of a multivalued key (the first is the preferred one).
But what happens if we got a business key from a source system that is too specific???? e.g a policy number with the date attached or a insurance number with a sequence? Especially when we might receive multiple records of the same policy or insurance distinguished by the attached date or sequence. In these cases policy number and insurance number are the enterprise wide business key we need for our hub. This might happen when there are intraday deliveries and this is the only way a source can deliver all the changes within that day.
There are a couple of solutions. One is to just store the business key as it comes. With a SAL (Same As Link) we can then group all business key to the first one arrived at the Data Vault or even attach it to the BK without the date or sequence (if available). This is easy on the way in, just load as if it is a new business key but makes extracting more complex. Also the Hub is not a representing just the business key. It will now store context as well (being the date a policy has changed). Which will lead to solution number 2, being my preference: Detach the date or sequence or… from the actual business key.
This means that in the loading process, most likely to be pre-staging, the one field will be split into the actual business key and the date/sequence/…. This will then be handled as context to be stored in the appropriate satellite. Of course, to be absolute transparent and auditable, the original business key will be stored as context as well.

Of course, there are cases where the date or sequence part really belongs to the BK, This is like Order with the different Orderlines. This will lead to 2 Hubs of course which are linked together. e.g. a policy number with date attached which are really several policies all belonging to one main policy which actually are the different extensions of the one policy.

It is always wise to look at your candidate BK and make sure that it is representing the real BK, not to specific due to delivery from source, nor not unique enough!