
Data Engineering

Status: 🟢 Active  |  Owner: Data Engineering  |  Last Reviewed: 2025-Q4


Introduction

Data is among the most valuable and most heavily regulated assets the organisation manages. How data is modelled, stored, accessed, migrated, and governed directly affects system performance, operational reliability, regulatory compliance, and the organisation's ability to make sound decisions from accurate, trustworthy information.

The most important structural principle governing all data engineering decisions at this organisation is the hard separation between operational and analytical data. Operational data powers live transactions; analytical data powers decisions. These are different workloads, with different consistency requirements, access patterns, and governance regimes, so they must be managed on separate infrastructure with separate access controls.


Intent

The Operational / Analytical Boundary

One of the most damaging antipatterns in data engineering is running analytical queries against operational databases. A business intelligence tool connected to a production PostgreSQL instance, a data scientist querying the orders table directly, a reporting API running a large aggregation: each of these puts customer-facing transactions at risk. Production incidents caused by analytical queries contending with operational writes for the same database resources are common and entirely preventable.

The standards in this section enforce a hard separation. Operational data lives in squad-owned PostgreSQL instances, accessible only to service application code. Analytical data is replicated, via Change Data Capture pipelines, into Snowflake, where BI tools, data scientists, and analysts can query it freely without putting live customer traffic at risk.

In a SaaS platform, data does not belong to the platform — it belongs to the users and organisations who created it. This principle has engineering consequences. Data must be attributed to its creator at the schema level. Users and organisations must be able to export their data. Consent for data processing must be explicitly captured, recorded in an audit-grade ledger, and checked at the point of processing. When consent is withdrawn, downstream systems must stop processing and clean up within a defined window.

These are not legal formalities — they are engineering requirements with concrete implementation standards.
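As an illustration of checking consent at the point of processing, the following is a minimal sketch, not the organisation's actual Consent Ledger. The names `ConsentEvent`, `ConsentLedger`, and `may_process` are hypothetical, and the in-memory list stands in for an audit-grade store. It shows the two properties described above: every grant and withdrawal is appended rather than overwritten, and an unknown or unreadable consent status is treated as a refusal.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import List, Optional


class ConsentStatus(Enum):
    GRANTED = "granted"
    WITHDRAWN = "withdrawn"


@dataclass(frozen=True)
class ConsentEvent:
    """One immutable ledger entry (hypothetical shape)."""
    subject_id: str
    purpose: str
    status: ConsentStatus
    recorded_at: datetime


class ConsentLedger:
    """Append-only, in-memory stand-in for a real consent ledger."""

    def __init__(self) -> None:
        self._events: List[ConsentEvent] = []

    def record(self, event: ConsentEvent) -> None:
        # Entries are only ever appended; nothing is updated or deleted.
        self._events.append(event)

    def latest_status(self, subject_id: str, purpose: str) -> Optional[ConsentStatus]:
        matching = [
            e for e in self._events
            if e.subject_id == subject_id and e.purpose == purpose
        ]
        if not matching:
            return None  # no record at all: status is unknown
        return max(matching, key=lambda e: e.recorded_at).status


def may_process(ledger, subject_id: str, purpose: str) -> bool:
    """Fail closed: only an explicit, current GRANTED allows processing."""
    try:
        status = ledger.latest_status(subject_id, purpose)
    except Exception:
        return False  # ledger unreachable: treat as unknown, do not process
    return status is ConsentStatus.GRANTED


ledger = ConsentLedger()
t0 = datetime(2025, 1, 1, tzinfo=timezone.utc)

unknown = may_process(ledger, "user-1", "analytics")
ledger.record(ConsentEvent("user-1", "analytics", ConsentStatus.GRANTED, t0))
granted = may_process(ledger, "user-1", "analytics")
ledger.record(
    ConsentEvent("user-1", "analytics", ConsentStatus.WITHDRAWN, t0 + timedelta(days=30))
)
withdrawn = may_process(ledger, "user-1", "analytics")
```

In this sketch the withdrawal does not erase the earlier grant; both remain in the ledger, and the most recent entry decides the outcome.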

Schema Evolution as an Engineering Discipline

Production databases evolve. The migration standards documented here address how to evolve schemas safely in systems that cannot tolerate downtime, that have multiple service versions running simultaneously during deployments, and that contain data that cannot be regenerated if a migration goes wrong. The expand-contract pattern, batch backfill strategies, and concurrent index creation are the tools for doing this safely.
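The expand step and a batched backfill might look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely so the example is self-contained; the `users` table, the `name` and `full_name` columns, and the batch size are hypothetical, and a real migration would run against PostgreSQL through the migration tooling. Concurrent index creation is PostgreSQL-specific (`CREATE INDEX CONCURRENTLY`) and is not shown here.

```python
import sqlite3


def backfill_in_batches(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy the legacy column into the new one a batch at a time.

    Each batch is committed separately so no single transaction
    touches the whole table. Returns the number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE users SET full_name = name "
            "WHERE id IN ("
            "  SELECT id FROM users"
            "  WHERE full_name IS NULL AND name IS NOT NULL"
            "  LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            break  # nothing left to backfill
        total += cur.rowcount
    return total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users (name) VALUES (?)",
    [(f"user-{i}",) for i in range(2500)],
)

# Expand: add the new column as nullable, so code that only knows
# about `name` keeps working against the new schema.
conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")

# Backfill: populate the new column in small batches.
updated = backfill_in_batches(conn, batch_size=1000)
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE full_name IS NULL"
).fetchone()[0]

# Contract (dropping `name`) would happen in a later release,
# once no deployed service version still reads the old column.
```

The contract step is deliberately left out of the executable part: it is only safe after every running service version has stopped using the old column.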

Data Governance as Infrastructure

Data governance is the set of engineering practices that make data trustworthy, findable, and handled in accordance with the rights of its subjects. The governance standards here document the practical implementation: the consent ledger schema, the data catalogue requirements, the lineage tooling, the retention automation, and the data quality tests that make governance enforceable rather than aspirational.
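To make the idea of retention automation concrete, here is a minimal sketch of the decision an automated retention job could make per record. The classifications, retention periods, and the `Record` and `retention_action` names are all assumptions made for illustration; the real values come from the governance standards, not from this example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Hypothetical policy table: classification -> (maximum age, action on expiry).
RETENTION_POLICY = {
    "pii": (timedelta(days=365), "anonymise"),
    "telemetry": (timedelta(days=90), "delete"),
}


@dataclass(frozen=True)
class Record:
    record_id: str
    classification: str
    created_at: datetime


def retention_action(record: Record, now: datetime) -> str:
    """Return 'keep', 'delete', 'anonymise', or 'review' for one record."""
    policy = RETENTION_POLICY.get(record.classification)
    if policy is None:
        # Assumption: unclassified data is flagged for a human, never silently kept.
        return "review"
    max_age, action = policy
    return action if now - record.created_at > max_age else "keep"


now = datetime(2025, 12, 1, tzinfo=timezone.utc)
records = [
    Record("r1", "pii", now - timedelta(days=400)),
    Record("r2", "pii", now - timedelta(days=10)),
    Record("r3", "telemetry", now - timedelta(days=120)),
    Record("r4", "unclassified", now - timedelta(days=5)),
]
actions = {r.record_id: retention_action(r, now) for r in records}
```

Keeping the policy in a data structure rather than in ad hoc scripts is what allows the job to run on a schedule and be tested in CI like any other code.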


What You Will Find Here

Page  |  Intent
Data Modeling Standards  |  Operational vs. analytical modeling, PostgreSQL conventions, PII classification at the column level
Database Selection Framework  |  Structured decision guide by workload type; approved and prohibited technologies
Data Access Patterns & ORM  |  Repository pattern, ORM configuration, connection pooling, access security
Data Migration & Schema Evolution  |  Flyway/Alembic standards, backward-compatible migration patterns, large table strategies
Data Governance, Lineage & Consent  |  Data ownership, classification, consent management, data attribution, lineage, retention

Key Principles at a Glance

Principle  |  Standard
Operational vs. analytical separation  |  Operational data in PostgreSQL (squad-owned); analytical in Snowflake (Data Engineering-owned). No cross-plane queries.
No analytical queries on production databases  |  BI tools, data science, and reporting connect to Snowflake only
Consent is auditable  |  Every consent grant and withdrawal is recorded in the immutable Consent Ledger
Fail closed on consent  |  If consent status is unknown, processing does not proceed
Data is attributed to its creator  |  Every user-generated record carries created_by and organisation_id
Migrations are backward compatible  |  Old and new code must work against the new schema simultaneously
Schema changes are code  |  Migrations are version-controlled, reviewed, and tested in CI
Retention is automated  |  Data is deleted or anonymised by automated jobs, not manual processes
