The brains behind the Data Vault technique, Dan Linstedt, developed the idea in 1990 and released it in 2000 via a series of five articles in The Data Administration Newsletter. Data Vault is a technique for producing data models that are optimised for long-term historical data storage. These models are designed for auditability and fast data loading, and are resilient to change. In essence, it is a hybrid approach that blends design concepts from both third normal form (3NF) and dimensional modelling.
The technique is based upon a number of key concepts:
- It understands where all the data comes from. Every row is accompanied by a record source and a load date. This is great from an audit point of view, as it provides complete traceability and auditability for all data.
- It stores data irrespective of whether it is good or bad. In a traditional data warehouse, bad data (data that doesn't conform to business rules and quality checks) is typically removed or cleansed. The philosophy here is that ALL data is useful, even if it does not conform to existing business rules.
- It separates structural attributes from descriptive attributes. This makes the design resilient to change.
- A Data Vault enables parallel loading of data, so the system can scale out without continuous remediation.
So how does the Data Vault work?
The Data Vault methodology solves the problem of change in the real world by separating the business keys from the more descriptive attributes. The assumption is that business keys uniquely identify a business entity and rarely change; in fact, they change only when the business itself changes. They are therefore the most stable of elements, and can be used as the foundation on which to build the database.
The data model is based around the concept of Hubs. Each Hub contains a business key, metadata (such as the source of the business key and update and extraction dates/times), and a surrogate key (used to connect to other structures within the database).
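A Hub can be sketched as a table holding just the business key, a surrogate key, and load metadata. The example below is a minimal sketch using SQLite; the table and column names (`hub_customer`, `customer_hkey`, and so on) are illustrative assumptions, not prescribed by the article.

```python
import sqlite3

# Illustrative Hub table: business key + surrogate key + load metadata.
# All names here are hypothetical examples, not a fixed standard.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE hub_customer (
        customer_hkey   INTEGER PRIMARY KEY,   -- surrogate key
        customer_number TEXT NOT NULL UNIQUE,  -- business key
        load_date       TEXT NOT NULL,         -- when the key was first seen
        record_source   TEXT NOT NULL          -- originating system
    )
""")
conn.execute(
    "INSERT INTO hub_customer (customer_number, load_date, record_source) "
    "VALUES (?, ?, ?)",
    ("CUST-001", "2024-01-15", "CRM"),
)
row = conn.execute(
    "SELECT customer_number, record_source FROM hub_customer"
).fetchone()
print(row)  # ('CUST-001', 'CRM')
```

Note that the Hub carries no descriptive attributes at all; everything beyond the key and its audit metadata lives elsewhere.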
Hubs are linked to Satellite tables that contain all the descriptive data. Clearly, choosing the correct keys for the Hubs is of prime importance to the stability of your model.
Relationships, associations, transactions and hierarchies are managed via Link tables. These are essentially many-to-many joins with some metadata.
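A Link, then, is a table that holds only the surrogate keys of the Hubs it connects, plus load metadata. Here is a minimal sketch in SQLite; the names (`link_order`, `hub_product`, etc.) are made up for illustration.

```python
import sqlite3

# Illustrative Link table: a many-to-many association between two Hubs,
# carrying only surrogate keys plus load metadata. Names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_hkey   INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE,
        load_date       TEXT,
        record_source   TEXT
    );
    CREATE TABLE hub_product (
        product_hkey    INTEGER PRIMARY KEY,
        product_code    TEXT NOT NULL UNIQUE,
        load_date       TEXT,
        record_source   TEXT
    );
    CREATE TABLE link_order (
        order_hkey    INTEGER PRIMARY KEY,
        customer_hkey INTEGER NOT NULL REFERENCES hub_customer(customer_hkey),
        product_hkey  INTEGER NOT NULL REFERENCES hub_product(product_hkey),
        load_date     TEXT NOT NULL,
        record_source TEXT NOT NULL
    );
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001', '2024-01-15', 'CRM')")
conn.execute("INSERT INTO hub_product VALUES (1, 'PROD-9', '2024-01-15', 'ERP')")
conn.execute(
    "INSERT INTO link_order (customer_hkey, product_hkey, load_date, record_source) "
    "VALUES (1, 1, '2024-01-16', 'ORDERS')"
)
pair = conn.execute("""
    SELECT c.customer_number, p.product_code
    FROM link_order l
    JOIN hub_customer c ON c.customer_hkey = l.customer_hkey
    JOIN hub_product  p ON p.product_hkey  = l.product_hkey
""").fetchone()
print(pair)  # ('CUST-001', 'PROD-9')
```

Because the Link records only key-to-key associations, a new kind of relationship can be added as a new Link table without restructuring the Hubs on either side.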
The Hubs and Links form the structure of the model, but they have no temporal attributes and hold no descriptive attributes. These are stored in separate tables called Satellites. Satellites consist of metadata linking them to their parent Hub or Link, metadata describing the origin of the association and attributes, and a timeline with start and end dates for the attributes. Where the Hubs and Links provide the structure of the model, the Satellites provide the “meat”: the context for the business processes that are captured in Hubs and Links. These attributes are stored with both their details and their timeline, and can range from quite complex (all of the fields describing a client’s complete profile) to quite simple (a Satellite on a Link with only a valid-indicator and a timeline).
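The timeline aspect can be sketched as start/end date columns on the Satellite, where a change closes the old row and inserts a new one instead of overwriting. A minimal SQLite sketch, with illustrative names (`sat_customer_profile`, `load_end_date`) that are assumptions rather than a fixed standard:

```python
import sqlite3

# Illustrative Satellite: descriptive attributes attached to a Hub, with a
# start/end timeline so history is preserved rather than overwritten.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE hub_customer (
        customer_hkey   INTEGER PRIMARY KEY,
        customer_number TEXT NOT NULL UNIQUE
    );
    CREATE TABLE sat_customer_profile (
        customer_hkey INTEGER NOT NULL REFERENCES hub_customer(customer_hkey),
        load_date     TEXT NOT NULL,   -- start of validity
        load_end_date TEXT,            -- NULL means the current version
        record_source TEXT NOT NULL,
        name          TEXT,
        city          TEXT,
        PRIMARY KEY (customer_hkey, load_date)
    );
""")
conn.execute("INSERT INTO hub_customer VALUES (1, 'CUST-001')")
# A change of city closes the old row and adds a new one; both survive.
conn.execute(
    "INSERT INTO sat_customer_profile "
    "VALUES (1, '2024-01-01', '2024-06-01', 'CRM', 'Alice', 'Leeds')"
)
conn.execute(
    "INSERT INTO sat_customer_profile "
    "VALUES (1, '2024-06-01', NULL, 'CRM', 'Alice', 'York')"
)
current = conn.execute(
    "SELECT city FROM sat_customer_profile WHERE load_end_date IS NULL"
).fetchone()
print(current)  # ('York',)
```

Both versions of the profile remain queryable, which is what gives the model its complete history and auditability.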