Skip to content

Data Modeling Standards

Status: 🟢 Active  |  Owner: Data Engineering

How data is modelled is one of the most consequential and least-reversible decisions an engineering team makes. A poorly designed schema generates years of workarounds, performance problems, and migration risk. These standards apply to all persistent data stores owned by the organisation's services.


Operational vs. Analytical Data Models

The single most important architectural distinction in data modeling is the difference between operational (OLTP) data and analytical (OLAP) data. These serve fundamentally different purposes and must be modelled, stored, and governed separately.

Operational Data (OLTP)

Operational data supports the real-time transactions of the business: placing orders, authenticating users, processing payments, managing inventory. It is characterised by:

  • High write frequency with short-lived transactions.
  • Strong consistency requirements — reads must reflect the most recent writes.
  • Row-level access — queries fetch or mutate a small number of specific records.
  • Normalised schemas — data is structured to eliminate redundancy and enforce integrity.
  • Low query complexity — simple lookups, joins across a small number of tables.

Operational data lives in transactional databases (PostgreSQL, MySQL). It must never be queried directly for analytics at any scale — even a well-written analytical query run against the production OLTP database can hold locks, degrade performance, and impact paying customers.

Analytical Data (OLAP)

Analytical data supports reporting, business intelligence, and data science. It is characterised by:

  • High read frequency with complex, long-running queries.
  • Eventual consistency is acceptable — analytics can tolerate minutes or hours of lag.
  • Column-level access — queries scan many rows across a small number of columns.
  • Denormalised schemas — data is structured to reduce join complexity at query time (star schema, wide tables).
  • Aggregation-heavy — GROUP BY, window functions, time-series aggregates are the norm.

Analytical data lives in analytical stores (Snowflake, BigQuery, Redshift, ClickHouse). It is populated from operational stores via an Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) pipeline, with a defined replication lag.

The Hard Separation Rule

┌─────────────────────────────┐     CDC / ETL / ELT     ┌──────────────────────────────┐
│   OPERATIONAL PLANE          │ ─────────────────────►  │   ANALYTICAL PLANE            │
│                              │                          │                               │
│  PostgreSQL (per service)    │  One-way replication     │  Snowflake / BigQuery         │
│  Owned by: Squad             │  Lag: minutes-hours      │  Owned by: Data Engineering   │
│  Access: Service accounts    │                          │  Access: BI / Data Science    │
│  Purpose: Transactions       │                          │  Purpose: Reporting / ML      │
└─────────────────────────────┘                          └──────────────────────────────┘
         ▲                                                          ▲
         │                                                          │
    Application code                                         BI tools / notebooks
    (no analytics queries)                                   (no direct OLTP access)

No analytical queries may run against operational databases. This includes: - Business intelligence tools connected to production PostgreSQL. - Data science notebooks querying service databases directly. - Reporting endpoints in service APIs that aggregate large volumes of data. - Ad-hoc queries by engineers running in production using database credentials.

Violations degrade production services for paying customers. When analytics access to operational data is needed, the correct path is to request that the data be made available in the analytical plane via the Data Engineering team.


Relational Modeling (PostgreSQL)

Naming Conventions

Object Convention Example
Table snake_case, plural nouns orders, order_items, customer_profiles
Column snake_case customer_id, created_at, is_active
Primary key id (UUID preferred) id UUID DEFAULT gen_random_uuid()
Foreign key <referenced_table_singular>_id customer_id, order_id
Index idx_<table>_<columns> idx_orders_customer_status
Unique constraint uq_<table>_<columns> uq_users_email
Check constraint chk_<table>_<description> chk_orders_positive_total

Primary Keys

Use UUIDs (v4) as primary keys for all new tables. Auto-incrementing integers are not approved for new tables because they:

  • Expose record counts to external callers (a security concern).
  • Cause hot-spot contention in distributed inserts.
  • Make cross-environment data migration and merging error-prone.
CREATE TABLE orders (
    id          UUID        DEFAULT gen_random_uuid() PRIMARY KEY,
    customer_id UUID        NOT NULL REFERENCES customers(id),
    status      TEXT        NOT NULL,
    total       NUMERIC(12,2) NOT NULL CHECK (total >= 0),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Timestamps

Every table must have created_at and updated_at columns of type TIMESTAMPTZ (timezone-aware). Store all times in UTC. Never store local time in the database.

Use a trigger or application-layer default for updated_at — do not rely on callers to set it:

CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_orders_updated_at
    BEFORE UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();

Normalisation

Follow Third Normal Form (3NF) for operational schemas. Denormalisation is permitted for specific, justified performance requirements — but it must be documented and the justification reviewed by Data Engineering.

Common violations to avoid:

-- ❌ Storing comma-separated values in a column
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    tag_ids TEXT  -- "tag-1,tag-2,tag-3" — unmaintainable
);

-- ✅ Junction table
CREATE TABLE order_tags (
    order_id UUID REFERENCES orders(id),
    tag_id   UUID REFERENCES tags(id),
    PRIMARY KEY (order_id, tag_id)
);

-- ❌ Repeated group — columns that vary by iteration
CREATE TABLE products (
    id          UUID PRIMARY KEY,
    attribute_1 TEXT,
    attribute_2 TEXT,
    attribute_3 TEXT
);

-- ✅ EAV or JSONB for variable attributes
CREATE TABLE product_attributes (
    product_id UUID REFERENCES products(id),
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (product_id, key)
);

Soft Delete Policy

Soft deletes (an is_deleted or deleted_at column) are discouraged. They complicate every query with an extra filter and make enforcing foreign key integrity impossible. Prefer:

  1. Hard delete — if the data genuinely has no retention requirement.
  2. Status column — for entities with a lifecycle: status = 'CANCELLED', status = 'ARCHIVED'.
  3. Audit table — if the deleted data must be preserved for regulatory or audit purposes.

If soft delete is unavoidable, use a deleted_at TIMESTAMPTZ nullable column, create a partial index excluding deleted rows for all performance-sensitive queries, and use Row-Level Security to exclude soft-deleted rows by default.

JSONB Usage Policy

JSONB is approved for:

  • Truly variable-structure metadata where the schema cannot be known at design time.
  • Event payloads stored as-is for auditability.
  • Configuration objects with deeply nested, rarely queried structure.

JSONB is not approved for:

  • Fields that are queried or filtered on — define proper columns with indexes instead.
  • Relationships that should be normalised into foreign keys.
  • Storing lists that will later need to be queried individually.

Analytical Modeling (Snowflake / BigQuery)

Analytical schemas use a star schema design. Unlike operational schemas, denormalisation is intentional and expected.

Star Schema

                        ┌──────────────────┐
                        │   fact_orders     │
                        │                  │
                        │  order_id (SK)   │
                        │  customer_key ──────► dim_customer
                        │  product_key  ──────► dim_product
                        │  date_key     ──────► dim_date
                        │  channel_key  ──────► dim_channel
                        │                  │
                        │  revenue         │  (measures)
                        │  quantity        │
                        │  discount_amount │
                        └──────────────────┘
  • Fact tables contain measurable events (order placed, page viewed, payment processed). They are narrow on non-key columns and very wide on rows.
  • Dimension tables contain descriptive attributes about the entities in the fact table. They are wide on columns (denormalised) and narrow on rows.
  • Surrogate keys in the analytical layer are integers (not UUIDs) for join performance.
  • The original operational UUID is preserved as a source_id column in every dimension for lineage traceability.

Naming in the Analytical Layer

Object Convention Example
Fact table fact_<event> fact_orders, fact_page_views
Dimension table dim_<entity> dim_customer, dim_product
Staging table stg_<source>_<entity> stg_postgres_orders
Mart / aggregate mart_<domain>_<description> mart_revenue_by_month

PII and Sensitive Data in Models

Data Classification at the Column Level

Every table containing personal or sensitive data must declare the classification of each column in its schema documentation (Confluence / data catalogue).

Column Type Classification Required Controls
Email address Sensitive Encrypted at rest, excluded from logs
Full name Sensitive Encrypted at rest
Payment card data Restricted Must not be stored — use payment processor tokenisation
IP address Sensitive (PII under GDPR) Anonymised before ingestion into analytical plane
User ID (non-personal) Internal No additional controls beyond standard access control
Hashed password Restricted bcrypt/Argon2 only; never reversible
Free-text fields Sensitive (may contain PII) Reviewed for accidental PII; not sent to third parties

Anonymisation in the Analytical Pipeline

PII that flows from operational databases into the analytical plane must be anonymised or pseudonymised at the transformation layer. Raw PII must never land in the analytical store in a form that could be queried directly by BI users.

Approved anonymisation techniques:

  • Pseudonymisation: Replace the real identifier with a consistent, reversible token (keyed HMAC). Allows linking across datasets without exposing the raw value. Key is held in the secrets manager.
  • Generalisation: Replace precise values with ranges (exact age → age bracket, precise location → region/country).
  • Suppression: Remove the field entirely where it is not needed for the analytical purpose.
  • Noise addition (for aggregates): Add statistical noise to small-count aggregates to prevent re-identification.

References


Last reviewed: 2025-Q4  |  Owner: Data Engineering