Data warehouse design is the process of structuring a centralized database that consolidates data from multiple sources, such as CRM systems, ERP platforms, and marketing tools, into a single location optimized for analysis and reporting. Unlike transactional (OLTP) databases, which are tuned for fast inserts and updates of individual records, a data warehouse is tuned for querying large volumes of historical data quickly.
The two dominant design philosophies are Inmon’s top-down approach (build a normalized enterprise warehouse first, then create data marts) and Kimball’s bottom-up approach (build subject-oriented data marts first, each using dimensional modeling). Kimball’s approach is more common in practice today because it delivers value faster.
Core Components of a Data Warehouse
| Layer | Name | Purpose | Data State |
|---|---|---|---|
| 1 | Source Systems | Operational databases, APIs, flat files | Raw, transactional |
| 2 | Staging Area | Temporary landing zone for ingested data | Raw, unvalidated |
| 3 | ODS (Operational Data Store) | Near-real-time cleansed operational data | Cleansed, current |
| 4 | Core DW / Integration Layer | Integrated, historical data store | Transformed, historical |
| 5 | Data Mart | Subject-specific subset for a team or function | Aggregated, analysis-ready |
| 6 | Presentation Layer | BI tools, dashboards, reports | Queried by end users |
Inmon vs Kimball: The Two Schools of Thought
| Dimension | Inmon (Top-Down) | Kimball (Bottom-Up) |
|---|---|---|
| Starting Point | Enterprise-wide normalized warehouse | Individual data marts |
| Schema Style | 3NF (Third Normal Form) | Dimensional (Star/Snowflake) |
| Time to Value | Slower (months) | Faster (weeks) |
| Consistency | High – single source of truth | Can have inconsistencies across marts |
| Best For | Large enterprises, regulated industries | Agile teams, faster analytics delivery |
| Complexity | High upfront design cost | Lower upfront, higher integration cost later |
Schema Design: Star vs Snowflake vs Data Vault
Star Schema
The star schema places a central fact table surrounded by dimension tables. The fact table stores measurable events (sales amounts, pageviews, transactions) while dimension tables hold descriptive context (customer name, product category, date).
It is fast for querying, simple for analysts to understand, and the most widely used schema in business intelligence. The trade-off is some data redundancy in the dimension tables.
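As a minimal sketch of the idea, the following builds a tiny star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names (`fact_sales`, `dim_customer`, `dim_product`, `dim_date`), the columns, and the sample rows are all made up for illustration; a real warehouse would run on a dedicated engine with far more attributes.

```python
import sqlite3

# In-memory database so the sketch runs without any setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables: descriptive context, one row per customer/product/date.
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_name TEXT, city TEXT);
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_date     (date_key     INTEGER PRIMARY KEY, full_date TEXT, year INTEGER, month INTEGER);

-- Fact table: one row per measurable event, keyed to every dimension.
CREATE TABLE fact_sales (
    customer_key INTEGER REFERENCES dim_customer(customer_key),
    product_key  INTEGER REFERENCES dim_product(product_key),
    date_key     INTEGER REFERENCES dim_date(date_key),
    quantity     INTEGER,
    sales_amount REAL
);

-- Hypothetical sample data.
INSERT INTO dim_customer VALUES (1, 'Ada', 'London'), (2, 'Grace', 'New York');
INSERT INTO dim_product  VALUES (1, 'Laptop', 'Hardware'), (2, 'Mouse', 'Hardware'),
                                (3, 'IDE License', 'Software');
INSERT INTO dim_date     VALUES (20240115, '2024-01-15', 2024, 1),
                                (20240210, '2024-02-10', 2024, 2);
INSERT INTO fact_sales   VALUES (1, 1, 20240115, 1, 1200.0),
                                (1, 2, 20240115, 2, 50.0),
                                (2, 3, 20240210, 1, 300.0),
                                (2, 2, 20240210, 1, 25.0);
""")


def sales_by_category(connection):
    """Total sales per product category: a single join from fact to dimension."""
    query = """
        SELECT p.category, SUM(f.sales_amount) AS total_sales
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.category
        ORDER BY p.category
    """
    return connection.execute(query).fetchall()


star_result = sales_by_category(conn)
print(star_result)  # [('Hardware', 1275.0), ('Software', 300.0)]
```

The category lives directly on `dim_product`, so it is repeated on every product row. That repetition is the redundancy the star schema accepts in exchange for one-join queries.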
Snowflake Schema
The snowflake schema normalizes dimension tables into sub-dimensions, reducing redundancy. A product dimension might link to a category dimension, which links to a department dimension. Storage is more efficient, but queries require more joins and are harder for non-technical analysts to write.
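The product, category, and department chain described above might look like the sketch below, again using Python's `sqlite3` with an in-memory database. All table names, columns, and rows are hypothetical. Note that a question as simple as "sales by department" now needs three joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The product dimension is normalized into three linked tables.
CREATE TABLE dim_department (department_key INTEGER PRIMARY KEY, department_name TEXT);
CREATE TABLE dim_category (
    category_key   INTEGER PRIMARY KEY,
    category_name  TEXT,
    department_key INTEGER REFERENCES dim_department(department_key)
);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,
    product_name TEXT,
    category_key INTEGER REFERENCES dim_category(category_key)
);
CREATE TABLE fact_sales (
    product_key  INTEGER REFERENCES dim_product(product_key),
    sales_amount REAL
);

-- Hypothetical sample data.
INSERT INTO dim_department VALUES (1, 'Electronics'), (2, 'Home');
INSERT INTO dim_category   VALUES (1, 'Laptops', 1), (2, 'Accessories', 1), (3, 'Kitchen', 2);
INSERT INTO dim_product    VALUES (1, 'Laptop', 1), (2, 'Mouse', 2), (3, 'Kettle', 3);
INSERT INTO fact_sales     VALUES (1, 1200.0), (2, 50.0), (3, 40.0), (2, 25.0);
""")

# Fact -> product -> category -> department: three joins to reach the department name.
snowflake_result = conn.execute("""
    SELECT d.department_name, SUM(f.sales_amount) AS total_sales
    FROM fact_sales AS f
    JOIN dim_product    AS p ON p.product_key    = f.product_key
    JOIN dim_category   AS c ON c.category_key   = p.category_key
    JOIN dim_department AS d ON d.department_key = c.department_key
    GROUP BY d.department_name
    ORDER BY d.department_name
""").fetchall()

print(snowflake_result)  # [('Electronics', 1275.0), ('Home', 40.0)]
```

Each department name is stored exactly once, which is where the storage saving comes from. The cost is the longer join path every analyst has to write.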
Data Vault
Data Vault splits data into three object types: Hubs (business keys), Links (relationships between hubs), and Satellites (descriptive attributes with full history). It is highly auditable and handles source system changes gracefully – but it is complex to implement and query.
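To make the three object types concrete, here is a small sketch with one customer hub, one order hub, a link between them, and a satellite holding the customer's city over time. The table names, the `hash_key` helper, and the choice of SHA-256 for hash keys are assumptions made for this example, not a prescribed standard. The `load_date` and `record_source` columns follow the usual Data Vault practice of recording when and from where each row arrived.

```python
import hashlib
import sqlite3


def hash_key(business_key: str) -> str:
    """Derive a deterministic key from a business key (illustrative convention)."""
    return hashlib.sha256(business_key.encode("utf-8")).hexdigest()


conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs: one row per business key, nothing descriptive.
CREATE TABLE hub_customer (
    customer_hk TEXT PRIMARY KEY, customer_id TEXT NOT NULL UNIQUE,
    load_date TEXT NOT NULL, record_source TEXT NOT NULL
);
CREATE TABLE hub_order (
    order_hk TEXT PRIMARY KEY, order_id TEXT NOT NULL UNIQUE,
    load_date TEXT NOT NULL, record_source TEXT NOT NULL
);
-- Link: a relationship between hubs.
CREATE TABLE link_customer_order (
    link_hk TEXT PRIMARY KEY,
    customer_hk TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    order_hk    TEXT NOT NULL REFERENCES hub_order(order_hk),
    load_date TEXT NOT NULL, record_source TEXT NOT NULL
);
-- Satellite: descriptive attributes, one row per change (insert-only).
CREATE TABLE sat_customer_details (
    customer_hk TEXT NOT NULL REFERENCES hub_customer(customer_hk),
    load_date TEXT NOT NULL, city TEXT, record_source TEXT NOT NULL,
    PRIMARY KEY (customer_hk, load_date)
);
""")

cust_hk, order_hk = hash_key("C-100"), hash_key("O-9001")
conn.execute("INSERT INTO hub_customer VALUES (?, ?, ?, ?)",
             (cust_hk, "C-100", "2024-01-01", "crm"))
conn.execute("INSERT INTO hub_order VALUES (?, ?, ?, ?)",
             (order_hk, "O-9001", "2024-02-01", "erp"))
conn.execute("INSERT INTO link_customer_order VALUES (?, ?, ?, ?, ?)",
             (hash_key("C-100|O-9001"), cust_hk, order_hk, "2024-02-01", "erp"))
# The customer moved: a second satellite row is added, the first is never updated.
conn.executemany("INSERT INTO sat_customer_details VALUES (?, ?, ?, ?)", [
    (cust_hk, "2024-01-01", "Paris", "crm"),
    (cust_hk, "2024-06-01", "Berlin", "crm"),
])

# Current view: pick the latest satellite row for each hub key.
current_city = conn.execute("""
    SELECT h.customer_id, s.city
    FROM hub_customer AS h
    JOIN sat_customer_details AS s ON s.customer_hk = h.customer_hk
    WHERE s.load_date = (SELECT MAX(s2.load_date)
                         FROM sat_customer_details AS s2
                         WHERE s2.customer_hk = h.customer_hk)
""").fetchall()
history_rows = conn.execute(
    "SELECT COUNT(*) FROM sat_customer_details WHERE customer_hk = ?", (cust_hk,)
).fetchone()[0]

print(current_city, history_rows)  # [('C-100', 'Berlin')] 2
```

Even "what city is this customer in now?" requires a join plus a subquery, which is why Data Vault models are usually exposed to analysts through views or marts rather than queried directly.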
| Schema | Query Speed | Storage Efficiency | Change Flexibility | Analyst-Friendly |
|---|---|---|---|---|
| Star Schema | Fast | Lower (some redundancy) | Moderate | High |
| Snowflake Schema | Moderate | Higher | Moderate | Lower |
| Data Vault | Slower (more joins) | Moderate (normalized, but full history adds volume) | Very High | Low (needs semantic layer) |
Step-by-Step Data Warehouse Design Process
- Define business requirements – what questions must the warehouse answer? Involve stakeholders early.
- Identify data sources – catalog every system that produces relevant data.
- Design the staging layer – a raw landing zone that mirrors source data.
- Define the dimensional model – choose fact and dimension tables based on business processes.
- Design the ETL/ELT pipelines – how data moves from source to warehouse.
- Implement slowly changing dimensions (SCDs) – decide how to handle changes to dimension data over time.
- Build the presentation layer – data marts or semantic models for BI tools.
- Test data quality and performance – validate accuracy and optimize query speed.
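The staging, modeling, pipeline, and testing steps above can be sketched end to end in a few lines. This is a toy ELT flow using Python's `sqlite3`, not a production pipeline: the `raw_orders` extract, the table names, and the three checks are hypothetical stand-ins for whatever your sources and requirements dictate.

```python
import sqlite3

# Hypothetical raw extract, standing in for rows pulled from a source system.
raw_orders = [
    {"order_id": "1001", "product": "Laptop", "amount": "1200.00"},
    {"order_id": "1002", "product": "Mouse", "amount": "25.00"},
    {"order_id": "1003", "product": "Laptop", "amount": "1150.50"},
]

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg_orders (order_id TEXT, product TEXT, amount TEXT);
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY AUTOINCREMENT,
    product_name TEXT UNIQUE
);
CREATE TABLE fact_orders (
    order_id    INTEGER PRIMARY KEY,
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    amount      REAL NOT NULL
);
""")

# 1. Staging: land the data unchanged, mirroring the source (everything as text).
conn.executemany(
    "INSERT INTO stg_orders VALUES (:order_id, :product, :amount)", raw_orders
)

# 2. Transform inside the warehouse: build the dimension first, then the fact.
conn.execute(
    "INSERT INTO dim_product (product_name) "
    "SELECT DISTINCT product FROM stg_orders ORDER BY product"
)
conn.execute("""
    INSERT INTO fact_orders (order_id, product_key, amount)
    SELECT CAST(s.order_id AS INTEGER), p.product_key, CAST(s.amount AS REAL)
    FROM stg_orders AS s
    JOIN dim_product AS p ON p.product_name = s.product
""")


def scalar(sql):
    """Run a query that returns a single value."""
    return conn.execute(sql).fetchone()[0]


# 3. Data quality checks: reconcile the warehouse against staging.
checks = {
    "row_count_matches": scalar("SELECT COUNT(*) FROM stg_orders")
    == scalar("SELECT COUNT(*) FROM fact_orders"),
    "amount_total_matches": abs(
        scalar("SELECT SUM(CAST(amount AS REAL)) FROM stg_orders")
        - scalar("SELECT SUM(amount) FROM fact_orders")
    ) < 1e-9,
    "no_orphan_facts": scalar("""
        SELECT COUNT(*) FROM fact_orders AS f
        LEFT JOIN dim_product AS p ON p.product_key = f.product_key
        WHERE p.product_key IS NULL
    """) == 0,
}
print(checks)
```

In practice the transformations would typically live in a tool such as dbt and be scheduled by an orchestrator, but the shape is the same: land raw data, transform it into the model, then reconcile counts and totals before anyone builds a report on it.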
Slowly Changing Dimensions (SCDs)
One of the trickiest parts of warehouse design is handling attributes that change over time. For example, a customer moves to a new city. Do you overwrite the old city? Keep both? Track the change with dates? Kimball's taxonomy numbers the SCD types from 0 to 7; the four you will meet most often are:
- Type 1: Overwrite – no history kept. Simple but you lose the past.
- Type 2: Add a new row – full history preserved with effective dates. Most common.
- Type 3: Add a column – keeps current and one previous value. Limited history.
- Type 6: Combination of Types 1, 2, and 3. Flexible but complex.
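Since Type 2 is the one most teams end up implementing, here is a minimal sketch of it for the customer-moves-city example, using Python's `sqlite3`. The `apply_scd2` function, the column names (`valid_from`, `valid_to`, `is_current`), and the convention that a NULL `valid_to` marks the current row are assumptions for illustration; other conventions, such as a far-future end date, are also common.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key, one per version
        customer_id  TEXT NOT NULL,                      -- business key from the source
        city         TEXT NOT NULL,
        valid_from   TEXT NOT NULL,
        valid_to     TEXT,                               -- NULL means "still current"
        is_current   INTEGER NOT NULL
    )
""")


def apply_scd2(connection, customer_id, city, change_date):
    """Add a new version row only when the tracked attribute actually changed."""
    current = connection.execute(
        "SELECT customer_key, city FROM dim_customer "
        "WHERE customer_id = ? AND is_current = 1",
        (customer_id,),
    ).fetchone()

    if current is not None and current[1] == city:
        return  # Nothing changed, so no new version is needed.

    if current is not None:
        # Close out the old version instead of overwriting it.
        connection.execute(
            "UPDATE dim_customer SET valid_to = ?, is_current = 0 "
            "WHERE customer_key = ?",
            (change_date, current[0]),
        )

    # Insert the new current version.
    connection.execute(
        "INSERT INTO dim_customer (customer_id, city, valid_from, valid_to, is_current) "
        "VALUES (?, ?, ?, NULL, 1)",
        (customer_id, city, change_date),
    )


apply_scd2(conn, "C-100", "Paris", "2023-01-01")   # first load
apply_scd2(conn, "C-100", "Paris", "2023-06-01")   # no change, no new row
apply_scd2(conn, "C-100", "Berlin", "2024-03-15")  # customer moved

history = conn.execute(
    "SELECT city, valid_from, valid_to, is_current "
    "FROM dim_customer ORDER BY customer_key"
).fetchall()
print(history)
# [('Paris', '2023-01-01', '2024-03-15', 0), ('Berlin', '2024-03-15', None, 1)]
```

A Type 1 version of the same function would collapse to a single `UPDATE` of `city`, which is exactly why it is simpler and why it loses the Paris row.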
Common Design Mistakes
- Over-engineering on day one. A star schema serving real users beats a perfect architecture serving no one.
- Not tracking data lineage – users need to trust where numbers come from.
- Ignoring data quality at the source. Garbage in, garbage out applies with particular force to warehouses, because every downstream report inherits the error.
- Treating the warehouse as a backup system instead of an analytical asset.
- Skipping documentation. Six months later, no one remembers what that column means.
Modern Tools for Data Warehouse Design
| Category | Tools | Notes |
|---|---|---|
| Cloud Warehouses | Snowflake, BigQuery, Redshift, Synapse | Fully managed, scalable on demand |
| Transformation (ELT) | dbt (data build tool) | SQL-based, version-controlled transformations |
| Orchestration | Apache Airflow, Prefect, Dagster | Schedules and monitors pipelines |
| Data Modeling | erwin Data Modeler, Lucidchart, dbdiagram.io | Visual schema design |
| BI / Presentation | Tableau, Looker, Power BI, Metabase | End-user reporting layer |
A well-designed data warehouse does not just store data – it makes data trustworthy. When every analyst in your organization is working from the same definitions, the same history, and the same source of truth, decisions get better. That is the actual goal of the whole exercise.
