by Dave Keeshin
March 13, 2020
A while ago I came across the “data vault” concept. I felt I could “relate” to it since it formalizes many of the design techniques and strategies I have been using for years. It’s a scalable, hybrid approach combining the best of legacy relational database and star schema designs.
The approach has been around for about 20 years. It’s a concept attributed to database author and consultant Dan Linstedt. According to Linstedt:
“The data vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business. It is a hybrid approach encompassing the best of breed between 3rd normal form (3nf) and star schema. The design is flexible, scalable, consistent and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.”
History Repeats Itself
Architecturally, a data vault is similar to a typical “data warehouse”.
Figure 1.- Data Warehouse Design circa 2005
The process has evolved a bit from the previous diagram. Today's data vault architecture might look something like this:
Figure 2.- Today
Today, source data comes from a variety of structured, semi-structured, unstructured or binary sources. In high volume situations it's often routed through a message queue or event hub on its way to staging. This step helps efficiently manage the flow and volume of data making the entire process much more scalable.
Staging is where all “as is” source data initially lands or passes through. Staging tables are modeled after the source data, and the idea is to load that data as is, warts and all. You could say that this is an extract-load-transform (ELT) process rather than the more traditional extract-transform-load (ETL) process.
As noted in one of my previous blogs, it’s always better to have data “pushed” from the original source. That is, the original source should own the process.
The staging database generally maintains no history and applies only “hard” rules such as data types, along with additional metadata used for tracking and auditing the data. Complex business rules and transformations live downstream.
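As a concrete illustration, a staging table for the invoice feed used later in this post might look like the sketch below. The schema name, table name, and columns are my own assumptions for illustration; the point is that only “hard” rules (data types) are enforced, and the extra columns are tracking/audit metadata rather than business logic.

```sql
-- Hypothetical staging table: columns mirror the source feed, typed but
-- otherwise untransformed. LoadDateTime and RecordSource are audit metadata.
CREATE TABLE Stage.Invoice (
    InvoiceNumber   varchar(20)  NOT NULL,  -- "hard" rule: enforce the data type only
    InvoiceDate     date         NULL,
    WorkDate        date         NULL,
    WorkHours       decimal(5,2) NULL,
    WorkDescription varchar(500) NULL,
    LoadDateTime    datetime2    NOT NULL DEFAULT SYSUTCDATETIME(), -- when the row landed
    RecordSource    varchar(50)  NOT NULL   -- which source system supplied the row
);
```

Note there are no foreign keys, lookups, or business-rule checks here — those belong downstream of staging.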
This is the point of departure from the conventional data warehouse star or snowflake design.
You can think of it as an interim approach that makes data, and especially data changes, easier and more efficient to manage. The vault contains the original source data, and subsequent changes are ADDED to the vault as new rows. No UPDATEs (delete-insert) or DELETEs are allowed.
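The insert-only rule can be sketched as a load statement like the one below. This is an illustrative pattern, not the author's exact implementation: the table and column names (including the `HashDiff` column, a hash of the descriptive attributes commonly used in data vault loads to detect change) are assumptions.

```sql
-- Hypothetical insert-only load: a changed set of attributes arrives as a
-- NEW row; nothing is ever updated or deleted. HashDiff (a hash of the
-- descriptive columns) lets the load detect whether anything changed.
INSERT INTO Item.InvoiceDetail
       (InvoiceHashKey, LoadDateTime, RecordSource,
        WorkDate, WorkHours, WorkDescription, HashDiff)
SELECT  s.InvoiceHashKey, SYSUTCDATETIME(), s.RecordSource,
        s.WorkDate, s.WorkHours, s.WorkDescription, s.HashDiff
FROM    Stage.InvoiceDetail AS s
WHERE   NOT EXISTS (              -- insert only rows whose attribute hash is new
            SELECT 1
            FROM   Item.InvoiceDetail AS i
            WHERE  i.InvoiceHashKey = s.InvoiceHashKey
            AND    i.HashDiff       = s.HashDiff);
```

Because history accumulates as rows rather than overwrites, auditing is a simple query over load dates instead of an archaeology project.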
There are three main table constructs, called “Hub”, “Satellite” and “Link”. I use my own naming conventions: CORE, ITEM, and LINK.
Conceptually, a CORE (or Hub) table is a one-column table maintaining a unique set of business keys. In practice, alongside the business key, it also carries metadata columns for the source, effective dates, and surrogate and hash keys.
LINK tables maintain the foreign keys that define the relations between CORE and ITEM tables.
ITEM (or Satellite) tables maintain the attributes that describe the business keys stored in CORE and LINK tables. They often include non-identifying attributes like cost, color, height, weight, description detail, etc. ITEM tables can also be split or partitioned by how frequently their attributes change.
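The three constructs can be sketched in DDL as follows. These are minimal, hypothetical versions — schema names, hash-key widths, and constraints are my assumptions, not a prescribed layout:

```sql
-- CORE (Hub): one row per unique business key, plus tracking metadata.
CREATE TABLE Core.Invoice (
    InvoiceHashKey  char(32)     NOT NULL PRIMARY KEY, -- hash of the business key
    InvoiceNumber   varchar(20)  NOT NULL,             -- the business key itself
    LoadDateTime    datetime2    NOT NULL,
    RecordSource    varchar(50)  NOT NULL
);

-- ITEM (Satellite): descriptive attributes, versioned by load date.
CREATE TABLE Item.InvoiceDetail (
    InvoiceHashKey  char(32)     NOT NULL
        REFERENCES Core.Invoice (InvoiceHashKey),
    LoadDateTime    datetime2    NOT NULL,             -- effective date of this version
    WorkDate        date         NULL,
    WorkHours       decimal(5,2) NULL,
    WorkDescription varchar(500) NULL,
    RecordSource    varchar(50)  NOT NULL,
    PRIMARY KEY (InvoiceHashKey, LoadDateTime)         -- key + version
);

-- LINK: foreign keys only, defining a relation between CORE tables.
CREATE TABLE Link.InvoiceToInvoiceDate (
    LinkHashKey        char(32)   NOT NULL PRIMARY KEY,
    InvoiceHashKey     char(32)   NOT NULL,
    InvoiceDateHashKey char(32)   NOT NULL,
    LoadDateTime       datetime2  NOT NULL,
    RecordSource       varchar(50) NOT NULL
);
```

Separating keys (CORE), relationships (LINK), and descriptions (ITEM) is what lets each piece load independently and in parallel.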
Here’s a section of a sample data vault design for service billing.
Figure 3. Service Billing Sample
Here there are two CORE tables: Core.InvoiceDate and Core.Invoice. Core.Invoice holds the source system invoice number, while Core.InvoiceDate holds the invoice date. I store WorkDate, WorkHours and WorkDescription in the Item.InvoiceDetail table. Link.InvoiceDetailToProjectConsultantRate holds the foreign keys that relate Core.InvoiceDate, Core.Invoice and Item.InvoiceDetail together, as well as other Core tables in the design. Here’s the entire model.
After data lands in the vault, you apply rules and transformations before sending it to its destination.
Ideally, the destination is a platform providing self-serve business intelligence for end users: using the data vault as the source for custom reporting, OLAP star-schema data marts, or master data destinations, and optionally building out operational marts to maintain error, performance and metadata.
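A downstream transformation can be as simple as a view that flattens the vault back into a reportable, mart-style shape. The sketch below follows the sample model above, but the join columns (for example `InvoiceDateHashKey`) are assumptions about that model rather than its actual definitions:

```sql
-- Hypothetical mart view: rejoin hub, link and satellite rows into a flat
-- shape, keeping only the latest version of each satellite row.
CREATE VIEW Mart.InvoiceDetailCurrent AS
SELECT  c.InvoiceNumber,
        d.InvoiceDate,
        i.WorkDate, i.WorkHours, i.WorkDescription
FROM    Core.Invoice AS c
JOIN    Link.InvoiceDetailToProjectConsultantRate AS l
        ON l.InvoiceHashKey = c.InvoiceHashKey
JOIN    Core.InvoiceDate AS d
        ON d.InvoiceDateHashKey = l.InvoiceDateHashKey
JOIN    Item.InvoiceDetail AS i
        ON  i.InvoiceHashKey = c.InvoiceHashKey
        AND i.LoadDateTime = (SELECT MAX(i2.LoadDateTime)   -- current version only
                              FROM   Item.InvoiceDetail AS i2
                              WHERE  i2.InvoiceHashKey = i.InvoiceHashKey);
```

Because all history stays in the vault, a “current state” mart like this can be rebuilt or redefined at any time without touching the source data.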
There are genuine advantages to using a data vault approach:
- Low data redundancy
- Flexibility to create and drop relations “on the fly"
- Emphasis on historical change tracking
- Easier data management by isolating business keys from descriptive attributes
- Easier data management by using metadata like data source and effective dates
- Easier extensibility making an agile project approach practical
- High scalability
- Load processes can easily run in parallel
- Data is easy to audit
A data vault methodology is an evolutionary approach worth looking into. Anything that simplifies what can easily become an overly complex, inefficient process is worth the time. Especially if you can learn something, improve it and add it to your set of tools.
Here are some useful categories of links that will help decipher some of the mysteries of SQL Server and other data technologies.