insights by infactory in-factory GmbH Data Warehouse Architectures

Data Architecture for Business Intelligence Systems

Based on the definition of the data warehouse according to William H. Inmon[1], the hub-and-spoke approach is first presented as the basic architecture in the data warehouse environment. Based on this, the various interpretations of the basic architecture and the authoritative concepts of the “Enterprise Data Warehouse”¹, the “Dimensional Data Warehouse”[2] and the “Data Vault Approach”[3] are shown. Finally, the Data Vault is explained in depth.

Core Data Warehouse with Hub-and-Spoke Architecture

Probably the best known definition of a data warehouse was made by William H. Inmon in 1992 in “Building The Data Warehouse”. Inmon defines the data warehouse as follows:

A data warehouse is a subject-oriented, integrated, time-variant, nonvolatile collection of data in support of management’s decision-making process.” ¹

In order to realize a data warehouse database that meets the above requirements, various architectural and modeling concepts evolved. The basic commonality of the concepts considered in this paper is the hub-and-spoke architecture, which is shown in the following figure:

Hub And Spoke architecture

Hub-and-Spoke architecture

The core data warehouse serves as the “hub” and fulfills the task of integration, quality assurance and data distribution to the data marts. The data marts are the “spokes” and have an application orientation as well as predefined business enrichment and aggregation.

Architecture Overview

The main architectures in the data warehouse environment are based on the hub-and-spoke approach explained above:

Data Warehouse architecture

Essential architectures in the data warehouse environment

The core data warehouse integrates the data and populates the data marts, which act as the basis for data analysis and reporting.

  • Enterprise Data Warehouse: The goal of Inmon’s architecture proposal is to create an enterprise-wide data model for the core data warehouse. The data is organized in third normal form (3NF) and exists atomically. Data marts can be easily built from this data pool. Key figure definitions apply across the entire model. The result is a complex model with a large number of tables and relationships. Further development and maintenance of the system are therefore costly.
  • Dimensional Data Warehouse: Kimball provides for a dimensional data model in star schema modeling for the core data warehouse, i.e., the data is stored in dimension and fact tables. “Confirmed dimensions” can be used across use cases with different fact tables, i.e. in different business contexts. Changes to the sources usually lead to time-consuming development activities. The dimensional modeling aspect is also frequently used in other architectural approaches at the data mart level.
  • Data Vault: Linstedt developed the basic architecture further and not only addresses data modeling aspects in the Data Vault approach, but also proposes suitable process models. The use of only a few modeling patterns results in a high degree of flexibility with respect to changes. The Data Vault method therefore enables the agility often demanded by business departments and simplifies the adaptation of business intelligence solutions due to changing requirements.

Principles of the architecture of the Data Vault

The basic architecture shown below with Raw Vault and Business Vault follows the hub-and-spoke approach.

Basic architecture with Raw Vault and Business Vault

Basic architecture with Raw Vault and Business Vault

The Raw Vault records the data of the Stage Area. Relationships between objects are identified and the descriptive attributes are stored historically. This creates the “single version of facts.” After applying business logic (e.g. aggregation and transformation), the Business Vault represents the “single point of truth”. The BI layer is accessed via an access layer that provides the data from the Business Vault. Special data requirements can also be served by the access layer through the Raw Vault.

The different layers of the architecture can be further explained as follows:

Staging Area

The main task of the staging area is to make the delivered data available for the loading process in an unchanged but optimized form. This optimization can manifest itself in a relational structure on the target system in order to avoid or minimize media breaks and network traffic. However, a landing zone at file system level is also possible (criterion: accessibility by ETL server).

The data structures of a relational staging area map the sources. Here, the use of measures to ensure integrity and content transformations is deliberately avoided to a large extent. Only the speed and functional reliability of the loading process are important. This means that new sources can be connected quickly in line with the agile approach. Existing sources can be managed in parallel and the reorganization of data can be implemented flexibly.

Raw Vault

The content of the data remains unchanged in the Raw Vault. The task of the Raw Vault is rather the integration and historization of the information.

The data is thus assigned to business objects, which in turn are defined as hubs and satellites or links and satellites. Hubs contain keys for the unique identification of an entity, associated satellites contain the descriptive information and perform the necessary historization. Links connect hubs and have their own satellites to describe the relation. Within these structures, information is mapped to target data types. References into relational staging can provide temporary lineage functionality.

All tables in the Raw Vault are managed using uniform “insert only” processes (different hash value comparisons depending on the target type) and are structured uniformly. The managed processes are therefore particularly suitable for automation.

Business Vault

The Business Vault is managed after the Raw Vault and is located in the same schema. Its task is to map the business rules, which map the business requirements and change the content of the data. KPIs are derived from supplied metrics in the Raw Vault. Aggregation and consolidation of data from different systems are performed.

Zugriffsschicht

The access layer, which accesses data from the Business Vault and possibly the Raw Vault, may also be virtually structured. It follows the requirements of the BI infrastructure used, regularly exhibits dimensional features and is organized according to use cases.

Individual use cases (from experience, this concerns the rather atomic reporting, legal reporting or exports) can also directly access the Business Vault or the Raw Vault.

The blog entry „Data Vault Basics“ goes into more detail about the architecture of a DV and illustrates it with a use case.


[1] Building the Data Warehouse (1992)
[2] The Data Warehouse Toolkit 3rd Edition (2013)
[3] Data Vault Series 1 – Data Vault Overview (2002)

Author: Wilfried Decker

ANY QUESTIONS? WE HAVE THE ANSWERS!

Write us. We are looking forward to your message!

MAIL TO