Data warehouse design is the process of structuring a centralized database that consolidates data from multiple sources, such as CRM systems, ERP platforms, and marketing tools, into a single location optimized for analysis and reporting. Unlike transactional (OLTP) databases, which are tuned for fast inserts and updates of individual records, a data warehouse is tuned for querying large volumes of historical data quickly.

The two dominant design philosophies are Inmon’s top-down approach (build a normalized enterprise warehouse first, then create data marts) and Kimball’s bottom-up approach (build subject-oriented data marts first, each using dimensional modeling). Kimball’s approach is more common in practice today because it delivers value faster.

Core Components of a Data Warehouse

Layer | Name | Purpose | Data State
1 | Source Systems | Operational databases, APIs, flat files | Raw, transactional
2 | Staging Area | Temporary landing zone for ingested data | Raw, unvalidated
3 | ODS (Operational Data Store) | Near-real-time cleansed operational data | Cleansed, current
4 | Core DW / Integration Layer | Integrated, historical data store | Transformed, historical
5 | Data Mart | Subject-specific subset for a team or function | Aggregated, analysis-ready
6 | Presentation Layer | BI tools, dashboards, reports | Queried by end users

Inmon vs Kimball: The Two Schools of Thought

Aspect | Inmon (Top-Down) | Kimball (Bottom-Up)
Starting Point | Enterprise-wide normalized warehouse | Individual data marts
Schema Style | 3NF (Third Normal Form) | Dimensional (Star/Snowflake)
Time to Value | Slower (months) | Faster (weeks)
Consistency | High – single source of truth | Can have inconsistencies across marts
Best For | Large enterprises, regulated industries | Agile teams, faster analytics delivery
Complexity | High upfront design cost | Lower upfront, higher integration cost later

Schema Design: Star vs Snowflake vs Data Vault

Star Schema

The star schema places a central fact table surrounded by dimension tables. The fact table stores measurable events (sales amounts, pageviews, transactions) while dimension tables hold descriptive context (customer name, product category, date).

It is fast for querying, simple for analysts to understand, and the most widely used schema in business intelligence. The trade-off is some data redundancy in the dimension tables.
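To make the fact/dimension split concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names (fact_sales, dim_customer, dim_product), columns, and sample rows are hypothetical and chosen only for illustration; a real warehouse would typically add a date dimension and run on a dedicated warehouse engine.

```python
import sqlite3

# In-memory database standing in for a warehouse (illustrative only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context, one row per customer / product.
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY,
    customer_name TEXT,
    city          TEXT
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category     TEXT   -- stored directly on the dimension (denormalized)
);
-- Fact table: one row per measurable event, keyed to the dimensions.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    sale_date    TEXT,
    amount       REAL
);
INSERT INTO dim_customer VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
INSERT INTO dim_product VALUES
    (1, 'Laptop', 'Hardware'),
    (2, 'Mouse', 'Hardware'),
    (3, 'IDE License', 'Software');
INSERT INTO fact_sales VALUES
    (1, 1, '2024-01-05', 1200.0),
    (1, 3, '2024-01-07', 300.0),
    (2, 2, '2024-01-09', 25.0),
    (2, 1, '2024-02-01', 1100.0);
""")


def sales_by_category(connection):
    """Total sales per product category: a single join from fact to dimension."""
    return connection.execute("""
        SELECT p.category, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """).fetchall()


star_result = sales_by_category(conn)
print(star_result)
```

Note that the category value 'Hardware' is repeated on two product rows. That repetition is the redundancy trade-off mentioned above, and it is what lets the query reach the category with a single join.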

Snowflake Schema

The snowflake schema normalizes dimension tables into sub-dimensions, reducing redundancy. A product dimension might link to a category dimension, which links to a department dimension. Storage is more efficient, but queries require more joins and are harder for non-technical analysts to write.
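The product, category, and department chain described above might look like the following sketch, again using Python's sqlite3 module. All table names, keys, and rows are hypothetical. The point is only to show that answering "sales by department" now takes three joins instead of one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Each level of the product hierarchy lives in its own table.
CREATE TABLE dim_department (
    department_key  INTEGER PRIMARY KEY,
    department_name TEXT
);
CREATE TABLE dim_category (
    category_key   INTEGER PRIMARY KEY,
    category_name  TEXT,
    department_key INTEGER REFERENCES dim_department(department_key)
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key INTEGER REFERENCES dim_product(product_key),
    amount      REAL
);
INSERT INTO dim_department VALUES (1, 'Electronics'), (2, 'Home');
INSERT INTO dim_category VALUES (10, 'Laptops', 1), (11, 'Phones', 1), (20, 'Kitchen', 2);
INSERT INTO dim_product VALUES (100, 'UltraBook', 10), (101, 'Phone X', 11), (200, 'Blender', 20);
INSERT INTO fact_sales VALUES (100, 1200.0), (101, 800.0), (200, 60.0), (200, 40.0);
""")


def sales_by_department(connection):
    """Walk fact -> product -> category -> department (three joins)."""
    return connection.execute("""
        SELECT d.department_name, SUM(f.amount)
        FROM fact_sales AS f
        JOIN dim_product    AS p ON p.product_key    = f.product_key
        JOIN dim_category   AS c ON c.category_key   = p.category_key
        JOIN dim_department AS d ON d.department_key = c.department_key
        GROUP BY d.department_name
        ORDER BY d.department_name
    """).fetchall()


snowflake_result = sales_by_department(conn)
print(snowflake_result)
```

Each department name is stored exactly once here, which is the storage benefit; the cost is the longer join path that analysts have to write and the engine has to execute.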

Data Vault

Data Vault splits data into three object types: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with full history). It is highly auditable and handles source system changes gracefully – but it is complex to implement and query.
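A rough sketch of the three object types, using Python's sqlite3 and hashlib modules, is shown below. The table and column names (hub_customer, link_customer_order, sat_customer_details, record_source, and so on) and the MD5-based hash_key helper are assumptions made for illustration; real Data Vault implementations follow stricter naming and loading standards than this.

```python
import hashlib
import sqlite3


def hash_key(*business_keys):
    """Hypothetical helper: derive a deterministic key from business keys."""
    return hashlib.md5("|".join(business_keys).encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per business key.
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT,
                           load_date TEXT, record_source TEXT);
CREATE TABLE hub_order    (order_hk TEXT PRIMARY KEY, order_id TEXT,
                           load_date TEXT, record_source TEXT);
-- Link: a relationship between hubs.
CREATE TABLE link_customer_order (link_hk TEXT PRIMARY KEY, customer_hk TEXT,
                                  order_hk TEXT, load_date TEXT, record_source TEXT);
-- Satellite: descriptive attributes, one row per change (full history).
CREATE TABLE sat_customer_details (customer_hk TEXT, load_date TEXT, city TEXT,
                                   record_source TEXT,
                                   PRIMARY KEY (customer_hk, load_date));
""")

cust_hk = hash_key("C-001")
order_hk = hash_key("O-500")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (cust_hk, "C-001", "2024-01-01", "crm"))
conn.execute("INSERT INTO hub_order VALUES (?, ?, ?, ?)",
             (order_hk, "O-500", "2024-01-02", "erp"))
conn.execute("INSERT INTO link_customer_order VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-001", "O-500"), cust_hk, order_hk, "2024-01-02", "erp"))
# Two satellite rows: the customer moved, and both states are kept.
conn.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    (cust_hk, "2024-01-01", "Paris", "crm"),
    (cust_hk, "2024-06-01", "Berlin", "crm"),
])

# Current view: pick the latest satellite row per hub key.
current_city = conn.execute("""
    SELECT h.customer_id, s.city
    FROM hub_customer AS h
    JOIN sat_customer_details AS s ON s.customer_hk = h.customer_hk
    WHERE s.load_date = (SELECT MAX(s2.load_date)
                         FROM sat_customer_details AS s2
                         WHERE s2.customer_hk = h.customer_hk)
""").fetchall()

history_rows = conn.execute(
    "SELECT COUNT(*) FROM sat_customer_details WHERE customer_hk = ?", (cust_hk,)
).fetchone()[0]
print(current_city, history_rows)
```

Even the simple question "what city is this customer in now?" needs a join plus a latest-row filter, which illustrates why Data Vault models are usually exposed to analysts through a semantic or presentation layer rather than queried directly.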

Schema | Query Speed | Storage Efficiency | Change Flexibility | Analyst-Friendly
Star Schema | Fast | Lower (some redundancy) | Moderate | High
Snowflake Schema | Moderate | Higher | Moderate | Lower
Data Vault | Slower (more joins) | Highest | Very High | Low (needs semantic layer)

Step-by-Step Data Warehouse Design Process

  1. Define business requirements – what questions must the warehouse answer? Involve stakeholders early.
  2. Identify data sources – catalog every system that produces relevant data.
  3. Design the staging layer – a raw landing zone that mirrors source data.
  4. Define the dimensional model – choose fact and dimension tables based on business processes.
  5. Design the ETL/ELT pipelines – how data moves from source to warehouse.
  6. Implement slowly changing dimensions (SCDs) – decide how to handle changes to dimension data over time.
  7. Build the presentation layer – data marts or semantic models for BI tools.
  8. Test data quality and performance – validate accuracy and optimize query speed.
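Step 8 is the easiest one to automate early. The sketch below shows one possible shape for it: a handful of SQL checks that each return a count of offending rows, run here against Python's built-in sqlite3 module. The tables, the sample data (which deliberately contains problems), and the check names are all hypothetical; dedicated testing features in transformation or data quality tools cover the same ground more thoroughly.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT, city TEXT);
CREATE TABLE fact_sales   (customer_key INTEGER, amount REAL);
-- Sample data with deliberate defects for the checks to find.
INSERT INTO dim_customer VALUES
    (1, 'C-001', 'Paris'),
    (2, 'C-002', 'Rome'),
    (3, 'C-002', 'Rome');   -- duplicate business key
INSERT INTO fact_sales VALUES
    (1, 10.0),
    (2, 20.0),
    (99, 5.0),              -- no matching dimension row
    (1, NULL);              -- missing measure
""")

# Each check returns the number of rows that violate a rule; 0 means pass.
CHECKS = {
    "duplicate_business_keys": """
        SELECT COUNT(*) FROM (
            SELECT customer_id FROM dim_customer
            GROUP BY customer_id HAVING COUNT(*) > 1
        ) AS dup
    """,
    "orphan_fact_rows": """
        SELECT COUNT(*) FROM fact_sales AS f
        LEFT JOIN dim_customer AS d ON d.customer_key = f.customer_key
        WHERE d.customer_key IS NULL
    """,
    "null_amounts": "SELECT COUNT(*) FROM fact_sales WHERE amount IS NULL",
}


def run_checks(connection):
    """Run every check and return {check_name: violation_count}."""
    return {name: connection.execute(sql).fetchone()[0] for name, sql in CHECKS.items()}


quality_report = run_checks(conn)
failed = sorted(name for name, count in quality_report.items() if count > 0)
print(quality_report, failed)
```

Keeping each rule as a query that counts violations makes it straightforward to add checks over time and to fail a pipeline run whenever any count is non-zero.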

Slowly Changing Dimensions (SCDs)

One of the trickiest parts of warehouse design is handling attributes that change over time. For example, a customer moves to a new city. Do you overwrite the old city? Keep both? Track the change with dates? The most commonly used SCD types are:

  • Type 1: Overwrite – no history kept. Simple but you lose the past.
  • Type 2: Add a new row – full history preserved with effective dates. Most common.
  • Type 3: Add a column – keeps current and one previous value. Limited history.
  • Type 6: Combination of Types 1, 2, and 3. Flexible but complex.
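Since Type 2 is the most common choice, here is a minimal sketch of how it could be applied to the customer-moves-city example, using Python's sqlite3 module. The column names (valid_from, valid_to, is_current) and the apply_scd2 function are illustrative assumptions, not a standard API; production pipelines usually do this with a MERGE statement or a snapshot feature in the transformation tool.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, new per version
    customer_id  TEXT NOT NULL,                      -- business key, stable
    city         TEXT NOT NULL,
    valid_from   TEXT NOT NULL,
    valid_to     TEXT,                               -- NULL while the row is current
    is_current   INTEGER NOT NULL
)
""")


def apply_scd2(connection, customer_id, city, change_date):
    """Type 2 update: close the current row and add a new one if city changed."""
    current = connection.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()
    if current is not None and current[1] == city:
        return  # Nothing changed, so no new version is needed.
    if current is not None:
        # Expire the old version instead of overwriting it.
        connection.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 WHERE customer_key = ?",
            (change_date, current[0]),
        )
    connection.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )


apply_scd2(conn, "C-001", "Paris", "2023-01-01")   # first load
apply_scd2(conn, "C-001", "Paris", "2023-06-01")   # unchanged, ignored
apply_scd2(conn, "C-001", "Berlin", "2024-03-15")  # customer moved

scd_history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current FROM dim_customer "
    "WHERE customer_id = ? ORDER BY valid_from",
    ("C-001",),
).fetchall()
print(scd_history)
```

A Type 1 version of the same function would simply run an UPDATE on the city column, and a Type 3 version would copy the old value into a previous_city column before updating; the extra rows and date columns are the price Type 2 pays for full history.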

Common Design Mistakes

  • Over-engineering on day one. A star schema serving real users beats a perfect architecture serving no one.
  • Not tracking data lineage – users need to trust where numbers come from.
  • Ignoring data quality at the source. Garbage in, garbage out applies more to warehouses than anywhere else.
  • Treating the warehouse as a backup system instead of an analytical asset.
  • Skipping documentation. Six months later, no one remembers what that column means.

Modern Tools for Data Warehouse Design

Category | Tools | Notes
Cloud Warehouses | Snowflake, BigQuery, Redshift, Synapse | Fully managed, scalable on demand
Transformation (ELT) | dbt (data build tool) | SQL-based, version-controlled transformations
Orchestration | Apache Airflow, Prefect, Dagster | Schedules and monitors pipelines
Data Modeling | Erwin, LucidChart, dbdiagram.io | Visual schema design
BI / Presentation | Tableau, Looker, Power BI, Metabase | End-user reporting layer

A well-designed data warehouse does not just store data – it makes data trustworthy. When every analyst in your organization is working from the same definitions, the same history, and the same source of truth, decisions get better. That is the actual goal of the whole exercise.
