Data Modeling Standards¶

Status: 🟢 Active | Owner: Data Engineering

How data is modelled is one of the most consequential and least-reversible decisions an engineering team makes. A poorly designed schema generates years of workarounds, performance problems, and migration risk. These standards apply to all persistent data stores owned by the organisation's services.

Operational vs. Analytical Data Models¶

The single most important architectural distinction in data modeling is the difference between operational (OLTP) data and analytical (OLAP) data. These serve fundamentally different purposes and must be modelled, stored, and governed separately.

Operational Data (OLTP)¶

Operational data supports the real-time transactions of the business: placing orders, authenticating users, processing payments, managing inventory. It is characterised by:

High write frequency with short-lived transactions.
Strong consistency requirements — reads must reflect the most recent writes.
Row-level access — queries fetch or mutate a small number of specific records.
Normalised schemas — data is structured to eliminate redundancy and enforce integrity.
Low query complexity — simple lookups, joins across a small number of tables.

Operational data lives in transactional databases (PostgreSQL, MySQL). It must never be queried directly for analytics at any scale — even a well-written analytical query run against the production OLTP database can hold locks, degrade performance, and impact paying customers.

Analytical Data (OLAP)¶

Analytical data supports reporting, business intelligence, and data science. It is characterised by:

High read frequency with complex, long-running queries.
Eventual consistency is acceptable — analytics can tolerate minutes or hours of lag.
Column-level access — queries scan many rows across a small number of columns.
Denormalised schemas — data is structured to reduce join complexity at query time (star schema, wide tables).
Aggregation-heavy — GROUP BY, window functions, time-series aggregates are the norm.

Analytical data lives in analytical stores (Snowflake, BigQuery, Redshift, ClickHouse). It is populated from operational stores via an Extract-Transform-Load (ETL) or Extract-Load-Transform (ELT) pipeline, with a defined replication lag.

The Hard Separation Rule¶

┌─────────────────────────────┐     CDC / ETL / ELT     ┌──────────────────────────────┐
│   OPERATIONAL PLANE          │ ─────────────────────►  │   ANALYTICAL PLANE            │
│                              │                          │                               │
│  PostgreSQL (per service)    │  One-way replication     │  Snowflake / BigQuery         │
│  Owned by: Squad             │  Lag: minutes-hours      │  Owned by: Data Engineering   │
│  Access: Service accounts    │                          │  Access: BI / Data Science    │
│  Purpose: Transactions       │                          │  Purpose: Reporting / ML      │
└─────────────────────────────┘                          └──────────────────────────────┘
         ▲                                                          ▲
         │                                                          │
    Application code                                         BI tools / notebooks
    (no analytics queries)                                   (no direct OLTP access)

No analytical queries may run against operational databases. This includes: - Business intelligence tools connected to production PostgreSQL. - Data science notebooks querying service databases directly. - Reporting endpoints in service APIs that aggregate large volumes of data. - Ad-hoc queries by engineers running in production using database credentials.

Violations degrade production services for paying customers. When analytics access to operational data is needed, the correct path is to request that the data be made available in the analytical plane via the Data Engineering team.

Relational Modeling (PostgreSQL)¶

Naming Conventions¶

Object	Convention	Example
Table	`snake_case`, plural nouns	`orders`, `order_items`, `customer_profiles`
Column	`snake_case`	`customer_id`, `created_at`, `is_active`
Primary key	`id` (UUID preferred)	`id UUID DEFAULT gen_random_uuid()`
Foreign key	`<referenced_table_singular>_id`	`customer_id`, `order_id`
Index	`idx_<table>_<columns>`	`idx_orders_customer_status`
Unique constraint	`uq_<table>_<columns>`	`uq_users_email`
Check constraint	`chk_<table>_<description>`	`chk_orders_positive_total`

Primary Keys¶

Use UUIDs (v4) as primary keys for all new tables. Auto-incrementing integers are not approved for new tables because they:

Expose record counts to external callers (a security concern).
Cause hot-spot contention in distributed inserts.
Make cross-environment data migration and merging error-prone.

CREATE TABLE orders (
    id          UUID        DEFAULT gen_random_uuid() PRIMARY KEY,
    customer_id UUID        NOT NULL REFERENCES customers(id),
    status      TEXT        NOT NULL,
    total       NUMERIC(12,2) NOT NULL CHECK (total >= 0),
    created_at  TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at  TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

Timestamps¶

Every table must have created_at and updated_at columns of type TIMESTAMPTZ (timezone-aware). Store all times in UTC. Never store local time in the database.

Use a trigger or application-layer default for updated_at — do not rely on callers to set it:

CREATE OR REPLACE FUNCTION set_updated_at()
RETURNS TRIGGER AS $$
BEGIN
    NEW.updated_at = NOW();
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_orders_updated_at
    BEFORE UPDATE ON orders
    FOR EACH ROW EXECUTE FUNCTION set_updated_at();

Normalisation¶

Follow Third Normal Form (3NF) for operational schemas. Denormalisation is permitted for specific, justified performance requirements — but it must be documented and the justification reviewed by Data Engineering.

Common violations to avoid:

-- ❌ Storing comma-separated values in a column
CREATE TABLE orders (
    id UUID PRIMARY KEY,
    tag_ids TEXT  -- "tag-1,tag-2,tag-3" — unmaintainable
);

-- ✅ Junction table
CREATE TABLE order_tags (
    order_id UUID REFERENCES orders(id),
    tag_id   UUID REFERENCES tags(id),
    PRIMARY KEY (order_id, tag_id)
);

-- ❌ Repeated group — columns that vary by iteration
CREATE TABLE products (
    id          UUID PRIMARY KEY,
    attribute_1 TEXT,
    attribute_2 TEXT,
    attribute_3 TEXT
);

-- ✅ EAV or JSONB for variable attributes
CREATE TABLE product_attributes (
    product_id UUID REFERENCES products(id),
    key        TEXT NOT NULL,
    value      TEXT NOT NULL,
    PRIMARY KEY (product_id, key)
);

Soft Delete Policy¶

Soft deletes (an is_deleted or deleted_at column) are discouraged. They complicate every query with an extra filter and make enforcing foreign key integrity impossible. Prefer:

Hard delete — if the data genuinely has no retention requirement.
Status column — for entities with a lifecycle: status = 'CANCELLED', status = 'ARCHIVED'.
Audit table — if the deleted data must be preserved for regulatory or audit purposes.

If soft delete is unavoidable, use a deleted_at TIMESTAMPTZ nullable column, create a partial index excluding deleted rows for all performance-sensitive queries, and use Row-Level Security to exclude soft-deleted rows by default.

JSONB Usage Policy¶

JSONB is approved for:

Truly variable-structure metadata where the schema cannot be known at design time.
Event payloads stored as-is for auditability.
Configuration objects with deeply nested, rarely queried structure.

JSONB is not approved for:

Fields that are queried or filtered on — define proper columns with indexes instead.
Relationships that should be normalised into foreign keys.
Storing lists that will later need to be queried individually.

Analytical Modeling (Snowflake / BigQuery)¶

Analytical schemas use a star schema design. Unlike operational schemas, denormalisation is intentional and expected.

Star Schema¶

                        ┌──────────────────┐
                        │   fact_orders     │
                        │                  │
                        │  order_id (SK)   │
                        │  customer_key ──────► dim_customer
                        │  product_key  ──────► dim_product
                        │  date_key     ──────► dim_date
                        │  channel_key  ──────► dim_channel
                        │                  │
                        │  revenue         │  (measures)
                        │  quantity        │
                        │  discount_amount │
                        └──────────────────┘

Fact tables contain measurable events (order placed, page viewed, payment processed). They are narrow on non-key columns and very wide on rows.
Dimension tables contain descriptive attributes about the entities in the fact table. They are wide on columns (denormalised) and narrow on rows.
Surrogate keys in the analytical layer are integers (not UUIDs) for join performance.
The original operational UUID is preserved as a source_id column in every dimension for lineage traceability.

Naming in the Analytical Layer¶

Object	Convention	Example
Fact table	`fact_<event>`	`fact_orders`, `fact_page_views`
Dimension table	`dim_<entity>`	`dim_customer`, `dim_product`
Staging table	`stg_<source>_<entity>`	`stg_postgres_orders`
Mart / aggregate	`mart_<domain>_<description>`	`mart_revenue_by_month`

PII and Sensitive Data in Models¶

Data Classification at the Column Level¶

Every table containing personal or sensitive data must declare the classification of each column in its schema documentation (Confluence / data catalogue).

Column Type	Classification	Required Controls
Email address	Sensitive	Encrypted at rest, excluded from logs
Full name	Sensitive	Encrypted at rest
Payment card data	Restricted	Must not be stored — use payment processor tokenisation
IP address	Sensitive (PII under GDPR)	Anonymised before ingestion into analytical plane
User ID (non-personal)	Internal	No additional controls beyond standard access control
Hashed password	Restricted	bcrypt/Argon2 only; never reversible
Free-text fields	Sensitive (may contain PII)	Reviewed for accidental PII; not sent to third parties

Anonymisation in the Analytical Pipeline¶

PII that flows from operational databases into the analytical plane must be anonymised or pseudonymised at the transformation layer. Raw PII must never land in the analytical store in a form that could be queried directly by BI users.

Approved anonymisation techniques:

Pseudonymisation: Replace the real identifier with a consistent, reversible token (keyed HMAC). Allows linking across datasets without exposing the raw value. Key is held in the secrets manager.
Generalisation: Replace precise values with ranges (exact age → age bracket, precise location → region/country).
Suppression: Remove the field entirely where it is not needed for the analytical purpose.
Noise addition (for aggregates): Add statistical noise to small-count aggregates to prevent re-identification.

References¶

Last reviewed: 2025-Q4 | Owner: Data Engineering