Data Engineering
Status: 🟢 Active | Owner: Data Engineering | Last Reviewed: 2025-Q4
Introduction
Data is among the most valuable and most regulated assets the organisation manages. How data is modelled, stored, accessed, migrated, and governed has profound implications for system performance, operational reliability, regulatory compliance, and the ability of the organisation to make good decisions based on accurate, trustworthy information.
The most important structural principle governing all data engineering decisions at this organisation is the hard separation between operational and analytical data. Operational data powers live transactions; analytical data powers decisions. The two have different workloads, consistency requirements, access patterns, and governance regimes, so they must be managed on separate infrastructure with separate access controls.
Intent
The Operational / Analytical Boundary
One of the most damaging antipatterns in data engineering is running analytical queries against operational databases. A business intelligence tool connected to a production PostgreSQL instance, a data scientist querying the orders table directly, or a reporting API running a large aggregation all compete with customer-facing transactions for the same connections, I/O, and locks, and so put those transactions at risk. Production incidents caused by analytical queries blocking operational writes are common and entirely preventable.
The standards in this section enforce a hard separation: operational data lives in squad-owned PostgreSQL instances, accessible only by service application code. Analytical data is replicated, via Change Data Capture pipelines, into Snowflake — where BI tools, data scientists, and analysts can query it freely without any risk of impacting live customers.
Consent and Data Attribution
In a SaaS platform, data does not belong to the platform — it belongs to the users and organisations who created it. This principle has engineering consequences. Data must be attributed to its creator at the schema level. Users and organisations must be able to export their data. Consent for data processing must be explicitly captured, recorded in an audit-grade ledger, and checked at the point of processing. When consent is withdrawn, downstream systems must stop processing and clean up within a defined window.
These are not legal formalities — they are engineering requirements with concrete implementation standards.
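As an illustration of checking consent at the point of processing, the following is a minimal Python sketch of a fail-closed check against an append-only ledger. It is not the organisation's Consent Ledger: the `ConsentLedger`, `ConsentEvent`, and `may_process` names, the in-memory storage, and the `purpose` field are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class ConsentStatus(Enum):
    GRANTED = "granted"
    WITHDRAWN = "withdrawn"


@dataclass(frozen=True)
class ConsentEvent:
    # Hypothetical ledger entry; the real ledger schema is defined elsewhere.
    subject_id: str
    purpose: str
    status: ConsentStatus
    recorded_at: datetime


class ConsentLedger:
    """Append-only, in-memory stand-in for an audit-grade ledger."""

    def __init__(self) -> None:
        self._events: list[ConsentEvent] = []

    def record(self, subject_id: str, purpose: str, status: ConsentStatus) -> None:
        # Events are only ever appended, never updated or deleted.
        self._events.append(
            ConsentEvent(subject_id, purpose, status, datetime.now(timezone.utc))
        )

    def latest(self, subject_id: str, purpose: str) -> Optional[ConsentStatus]:
        # The most recently appended event for this subject and purpose wins.
        for event in reversed(self._events):
            if event.subject_id == subject_id and event.purpose == purpose:
                return event.status
        return None  # unknown: no event was ever recorded


def may_process(ledger: ConsentLedger, subject_id: str, purpose: str) -> bool:
    # Fail closed: only an explicit GRANTED allows processing.
    # Both "withdrawn" and "unknown" (None) block it.
    return ledger.latest(subject_id, purpose) is ConsentStatus.GRANTED
```

The design choice worth noting is that `may_process` tests for a positive grant rather than for the absence of a withdrawal, so a missing record blocks processing instead of allowing it.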
Schema Evolution as an Engineering Discipline
Production databases evolve. The migration standards documented here address how to evolve schemas safely in systems that cannot tolerate downtime, that have multiple service versions running simultaneously during deployments, and that contain data that cannot be regenerated if a migration goes wrong. The expand-contract pattern, batch backfill strategies, and concurrent index creation are the tools for doing this safely.
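The expand step and a batched backfill can be sketched as follows. This is a rough illustration only: it uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere, and the `users` table, the `full_name` and `display_name` columns, and the `expand` and `backfill` helpers are hypothetical. In PostgreSQL, the index on the new column would typically be built with `CREATE INDEX CONCURRENTLY`, which SQLite does not offer.

```python
import sqlite3


def expand(conn: sqlite3.Connection) -> None:
    # Expand step: add the new column as nullable so that code which
    # does not know about it keeps working against the new schema.
    conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
    conn.commit()


def backfill(conn: sqlite3.Connection, batch_size: int = 1000) -> int:
    """Copy full_name into display_name in small batches.

    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            """
            UPDATE users
               SET display_name = full_name
             WHERE id IN (
                   SELECT id FROM users
                    WHERE display_name IS NULL
                      AND full_name IS NOT NULL
                    ORDER BY id
                    LIMIT ?)
            """,
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch keeps locks brief
        if cur.rowcount == 0:
            return total
        total += cur.rowcount
```

The contract step, dropping `full_name`, would follow only once every deployed service version reads and writes `display_name`.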
Data Governance as Infrastructure
Data governance is the set of engineering practices that make data trustworthy, findable, and handled in accordance with the rights of its subjects. The governance standards here document the practical implementation: the consent ledger schema, the data catalogue requirements, the lineage tooling, the retention automation, and the data quality tests that make governance enforceable rather than aspirational.
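To make the idea of retention automation concrete, here is a small Python sketch of a job that anonymises records older than a retention window. The `customers` table, its columns, the `anonymise_expired` function, and the use of `sqlite3` in place of the production database are all assumptions for illustration; actual retention periods and anonymisation rules are set by the governance standards, not by this example.

```python
import sqlite3
from datetime import datetime, timedelta, timezone
from typing import Optional


def anonymise_expired(
    conn: sqlite3.Connection,
    retention_days: int,
    now: Optional[datetime] = None,
) -> int:
    """Null out personal fields on rows older than the retention window.

    Returns the number of rows anonymised. Timestamps are stored as
    ISO 8601 UTC strings, which compare correctly as text.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = (now - timedelta(days=retention_days)).isoformat(timespec="seconds")
    cur = conn.execute(
        "UPDATE customers SET email = NULL, anonymised_at = ? "
        "WHERE created_at < ? AND anonymised_at IS NULL",
        (now.isoformat(timespec="seconds"), cutoff),
    )
    conn.commit()
    return cur.rowcount
```

Because already-anonymised rows are excluded by the `anonymised_at IS NULL` filter, a scheduler can run the job repeatedly without it touching the same rows twice.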
What You Will Find Here
| Page | Intent |
|---|---|
| Data Modeling Standards | Operational vs. analytical modeling, PostgreSQL conventions, PII classification at the column level |
| Database Selection Framework | Structured decision guide by workload type; approved and prohibited technologies |
| Data Access Patterns & ORM | Repository pattern, ORM configuration, connection pooling, access security |
| Data Migration & Schema Evolution | Flyway/Alembic standards, backward-compatible migration patterns, large table strategies |
| Data Governance, Lineage & Consent | Data ownership, classification, consent management, data attribution, lineage, retention |
Key Principles at a Glance
| Principle | Standard |
|---|---|
| Operational vs. analytical separation | Operational data in PostgreSQL (squad-owned); analytical in Snowflake (Data Engineering-owned). No cross-plane queries. |
| No analytical queries on production databases | BI tools, data science, and reporting connect to Snowflake only |
| Consent is auditable | Every consent grant and withdrawal is recorded in the immutable Consent Ledger |
| Fail closed on consent | If consent status is unknown, processing does not proceed |
| Data is attributed to its creator | Every user-generated record carries created_by and organisation_id |
| Migrations are backward compatible | Old and new code must work against the new schema simultaneously |
| Schema changes are code | Migrations are version-controlled, reviewed, and tested in CI |
| Retention is automated | Data is deleted or anonymised by automated jobs — not manual processes |