Cell-Based Architecture (Part 5): Multi-Platform Cells — Resilience Across Clouds and On-Prem

“The most dangerous phrase in resilience engineering is ‘our cloud provider would never go fully down.’”

In Part 4 we laid six isolation boundaries on a table — Namespace, Node, Availability Zone, Region, Multi-Platform, and Multi-Account — and asked which one your business actually needs.

Five of those six share a quiet assumption: the cloud provider stays up. Multi-AZ survives a data-center fire. Multi-Region survives a regional outage. But all of them are still one IAM root, one billing account, one provider’s control plane away from a bad day you can’t route around.

The sixth boundary is the only one that breaks that assumption. Multi-platform cells place complete, self-contained cells across different platforms — AWS, your own data center, maybe a second cloud — so that losing an entire provider takes out a slice, not the system.

It is the strongest isolation you can buy. It is also the most expensive way to sleep at night. Let’s earn it.

Where we are in the journey

graph LR
    P1[Part 1<br/>Why cells]:::done --> P2[Part 2<br/>Routing & sharding]:::done
    P2 --> P3[Part 3<br/>Implementation]:::done
    P3 --> P4[Part 4<br/>Selection Matrix]:::done
    P4 --> P5[Part 5<br/>Multi-Platform<br/>◀ you are here]:::now
    P5 --> P6[Part 6<br/>Multi-Account]:::next

    classDef done fill:#dcfce7,stroke:#22c55e,color:#166534
    classDef now  fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a,stroke-width:3px
    classDef next fill:#f1f5f9,stroke:#94a3b8,color:#475569,stroke-dasharray:4 3

Part 4 ended with a ranking of isolation strength and cost. Multi-platform sat at the far end of both axes: strongest isolation, highest cost ($$$$$), highest complexity. This post zooms all the way in on that boundary.

What a multi-platform cell actually is

A reminder of the core idea from Part 1: a cell is a complete, isolated replica of your stack — compute, data, networking — serving a slice of users. One cell failing affects only its slice.

A multi-platform cell adds one more property: cells don’t all live on the same provider.

Single-platform (Multi-Region):   Multi-platform (Hybrid):
   Cell A → AWS us-east-1            Cell A → AWS (EKS + Aurora)
   Cell B → AWS eu-west-1            Cell B → AWS (EKS + Aurora)
   Cell C → AWS ap-south-1           Cell C → On-prem (K8s + PostgreSQL)
   (one provider, three regions)     Cell D → Azure (AKS + PostgreSQL)
                                     (independent platforms)

The blast radius you’re now containing isn’t a server, a zone, or a region. It’s an entire provider — its control plane, its global IAM, its account-wide quotas, its bad deploy. No single-cloud boundary can isolate that. Multi-platform can.

[!NOTE] Multi-platform is not “lift-and-shift to two clouds.” It’s running production cells on each platform, each able to serve real traffic, with users routed across them.

Why you would ever take this on

Be honest: most teams should not reach for this. The legitimate drivers are specific.

Regulation & data residency. A regulator requires certain data to stay on-premises, or inside national borders where a given cloud has no region. The cell boundary becomes a compliance boundary.
Provider independence / concentration risk. A board or risk function decides the business cannot be 100% dependent on a single vendor’s availability or commercial terms.
Sovereignty. Government or critical-infrastructure workloads that mandate operator-controlled infrastructure.
Gradual migration. A large estate moving to cloud over years runs hybrid cells as its steady state for the whole transition, not just during a cutover window.

If your reason isn’t on a list like this, multi-region or multi-account cells will give you most of the resilience for a fraction of the pain. (More on that in Decision, below.)

The architecture

graph TB
    U([Users]):::user --> GLB[Global Load Balancer<br/>geo / weighted routing + health checks]:::route
    subgraph AWS["AWS"]
        C1[Cell A<br/>EKS &middot; Aurora &middot; MSK]:::aws
        C2[Cell B<br/>EKS &middot; Aurora &middot; MSK]:::aws
    end
    subgraph DC["On-Premises DC"]
        C3[Cell C<br/>K8s &middot; PostgreSQL &middot; Kafka]:::onprem
    end
    GLB --> C1
    GLB --> C2
    GLB --> C3
    C1 <-. "CDC / log replication<br/>over Direct Connect" .-> C3
    C2 <-. "Kafka MirrorMaker 2" .-> C3

    classDef user   fill:#ede9fe,stroke:#8b5cf6,color:#5b21b6
    classDef route  fill:#fae8ff,stroke:#d946ef,color:#86198f
    classDef aws    fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    classDef onprem fill:#fef3c7,stroke:#f59e0b,color:#92400e
    style AWS fill:#eff6ff,stroke:#3b82f6,stroke-dasharray:5 4
    style DC  fill:#fffbeb,stroke:#f59e0b,stroke-dasharray:5 4

Three pieces do the heavy lifting:

Global routing. A provider-neutral entry point (DNS-based GSLB or an anycast edge operated outside either platform) sends each user to a healthy cell with sticky routing, and drains a cell that goes unhealthy, regardless of which platform it’s on. AWS Global Accelerator in front of an on-prem load balancer can also work, but Global Accelerator is itself an AWS service, so it does not protect the entry point from an AWS-wide failure.
Homogeneous-enough cells. Each cell is a full stack: on AWS that’s EKS + Aurora; on-prem it’s Kubernetes + PostgreSQL, chosen deliberately to look as alike as possible. The trap is letting them diverge (see Guardrails).
Inter-platform connectivity. Production hybrid runs on AWS Direct Connect (10–50 ms, predictable) — not VPN over the public internet (50–200 ms, jittery). This link is the spine your cross-platform replication rides on.
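
The sticky-routing-with-drain behavior described above can be sketched in a few lines. This is a minimal illustration, not how any particular GSLB product works: it assumes rendezvous (highest-random-weight) hashing as the stickiness mechanism, and the `Cell` type, cell names, and `route` function are hypothetical.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    name: str        # hypothetical cell identifier, e.g. "cell-a"
    platform: str    # "aws" or "onprem"
    healthy: bool = True


def _score(user_id: str, cell_name: str) -> int:
    # Deterministic per (user, cell) score; the highest score wins.
    digest = hashlib.sha256(f"{user_id}:{cell_name}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def route(user_id: str, cells: list[Cell]) -> Cell:
    """Pick the user's cell among healthy cells only (rendezvous hashing)."""
    candidates = [c for c in cells if c.healthy]
    if not candidates:
        raise RuntimeError("no healthy cells available")
    return max(candidates, key=lambda c: _score(user_id, c.name))
```

The useful property is that draining one cell only moves the users who lived on it; everyone else keeps their cell, whichever platform it runs on. Moving a user is only safe if their data has already been replicated to the new cell, which is the subject of the next section.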

The hard part: data across platforms

Inside one cloud, you lean on managed magic: Aurora Multi-AZ gives you synchronous replication and RPO 0. Across platforms, that magic doesn’t exist. There is no managed, synchronous replication between Aurora and a PostgreSQL box in your data center.

So you build it at the application/data layer, and you accept that it’s asynchronous:

| Data store | Cross-platform mechanism | Typical RPO |
| --- | --- | --- |
| Relational | Aurora → PostgreSQL logical replication / Debezium CDC | 5–15 s |
| Event/stream | MSK ↔ on-prem Kafka via MirrorMaker 2 | 1–5 s |
| Object | S3 ↔ on-prem (MinIO) sync | seconds–minutes |

graph LR
    A[(Aurora<br/>AWS)]:::aws -- "logical replication / CDC<br/>RPO 5-15s" --> B[(PostgreSQL<br/>On-prem)]:::onprem
    K1[MSK &middot; AWS]:::aws -- "MirrorMaker 2<br/>RPO 1-5s" --> K2[Kafka &middot; On-prem]:::onprem

    classDef aws    fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
    classDef onprem fill:#fef3c7,stroke:#f59e0b,color:#92400e

[!WARNING] Because replication is async, a cell can fail with a few seconds of writes not yet mirrored. Your design must tolerate eventual consistency across cells — idempotent writes, conflict resolution, and a clear “source of truth per slice.” If your domain truly cannot lose 5 seconds of data across a provider failure, multi-platform is the wrong tool; keep that data on a single-platform synchronous boundary.
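
To make "idempotent writes" and "source of truth per slice" concrete, here is a minimal sketch of how a receiving cell might apply replicated changes. It assumes each change carries a unique event id and a per-key version assigned only by the cell that owns that slice; the `Change` and `ReplicaStore` names are hypothetical, and a real system would persist this state rather than hold it in memory.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    event_id: str   # unique per write; used to drop duplicate deliveries
    key: str
    value: str
    version: int    # increases per key, assigned by the owning cell only


@dataclass
class ReplicaStore:
    rows: dict = field(default_factory=dict)   # key -> (version, value)
    seen: set = field(default_factory=set)     # event ids already applied

    def apply(self, change: Change) -> bool:
        """Apply a replicated change; return True only if state changed."""
        if change.event_id in self.seen:
            return False                        # duplicate delivery: no-op
        self.seen.add(change.event_id)
        current = self.rows.get(change.key)
        if current is not None and current[0] >= change.version:
            return False                        # stale or out-of-order change
        self.rows[change.key] = (change.version, change.value)
        return True
```

Because only the owning cell assigns versions, "highest version wins" is a safe conflict rule here. If two cells could write the same key, this sketch would not be enough and you would need a real merge strategy.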

RTO/RPO reality

This is the uncomfortable trade. Multi-platform gives you the best isolation but not the best recovery numbers:

| Boundary | Isolation from… | RTO | RPO |
| --- | --- | --- | --- |
| Multi-AZ | Data-center failure | < 5 min | 0 (sync) |
| Multi-Region | Regional failure | < 1 min (active-active) | ~0–1 s |
| Multi-Platform | Whole-provider failure | 5–30 min | 1–15 s |

You are trading a tighter RPO for wider failure-domain coverage. That’s the deal: nothing else isolates a full provider, and the price of that isolation is async data and a slower, more manual failover.
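
It helps to turn the RPO column into a number your business can react to. The sketch below is simple arithmetic: writes at risk ≈ write rate × replication lag. The 200 writes/s figure in the usage note is an assumed example, not a measurement; the 1–15 s range is the multi-platform RPO from the table above.

```python
def writes_at_risk(writes_per_second: float, replication_lag_seconds: float) -> float:
    """Estimate acknowledged writes not yet mirrored if the source cell fails now."""
    return writes_per_second * replication_lag_seconds


def exposure_range(
    writes_per_second: float, rpo_range_seconds: tuple[float, float]
) -> tuple[float, float]:
    """Best-case and worst-case writes at risk for an RPO range."""
    low, high = rpo_range_seconds
    return (
        writes_at_risk(writes_per_second, low),
        writes_at_risk(writes_per_second, high),
    )
```

For a cell taking an assumed 200 writes/s, `exposure_range(200, (1, 15))` gives between 200 and 3,000 writes that may need replay or reconciliation after a provider failure. If that number is unacceptable for a given dataset, that dataset belongs on a synchronous boundary.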

Real-world: a regulated bank

A bank under a data-residency mandate runs:

5 cells on AWS (EKS + Aurora + MSK) for scale and elasticity.
3 cells on-premises (Kubernetes + PostgreSQL + Kafka) to satisfy the regulator that core records live on operator-controlled infrastructure.
Direct Connect (10 Gbps) between them; PostgreSQL logical replication and MirrorMaker 2 carrying state across the boundary.
Result: ~99.95% availability, full regulatory sign-off, and survivability of a complete AWS-side outage — at roughly $400K/month.

That last number is the whole point of the next section.

Cost & complexity: the $$$$$ pattern

Multi-platform is the most expensive boundary in the matrix, and the cost isn’t mostly infrastructure — it’s operations:

Two (or three) different operational playbooks, on-call rotations, and upgrade cadences.
Direct Connect circuits, cross-platform egress, and duplicated tooling.
A standing hybrid failover test program — untested failover is theater.
New failure modes the single-cloud teams never see: replication lag spikes, clock skew, certificate trust across domains, split-brain risk.

Rule of thumb: budget for ~2× the operational effort of a single-platform multi-cell deployment, not 2× the servers.

Guardrails

✅ Do

Standardize the stack to minimize divergence — Kubernetes everywhere, PostgreSQL-compatible everywhere. Every difference (EKS vs vanilla K8s, Aurora vs PostgreSQL) is operational tax you pay forever.
Use Direct Connect (or equivalent dedicated link) for production, never VPN-over-internet.
Do replication at the application/data layer (CDC, logical replication, MirrorMaker) and make writes idempotent.
Test hybrid failover on a schedule — drain an AWS-side cell to on-prem and back.
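
A scheduled failover test is easier to run, and to abort, if traffic moves in stages instead of in one jump. As a rough sketch, assuming your global router accepts integer weights per cell, a drain can be planned like this (the `drain_plan` function and the cell names are hypothetical):

```python
def drain_plan(
    weights: dict[str, int], source: str, target: str, steps: int
) -> list[dict[str, int]]:
    """Return successive weight maps that shift all of source's weight to target."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    start = weights[source]
    plan = []
    for i in range(1, steps + 1):
        moved = start * i // steps          # cumulative weight moved so far
        stage = dict(weights)               # copy; other cells stay untouched
        stage[source] = start - moved
        stage[target] = weights[target] + moved
        plan.append(stage)
    return plan
```

Applying the stages one at a time, with a pause to watch error rates and replication lag between each, drains the AWS-side cell to on-prem; walking the same list in reverse brings traffic back.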

❌ Don’t

Mix three database engines because each platform’s “native” option was easiest. Stack sprawl kills you.
Assume cloud and on-prem have the same performance envelope or the same elasticity.
Reach for multi-platform to fix application reliability. Cells isolate infrastructure; circuit breakers, timeouts, and bulkheads (Part 3) handle application failures. Don’t confuse the two.

Decision: is multi-platform right for you?

graph TD
    Q1{Regulatory / sovereignty<br/>mandate for on-prem or<br/>provider independence?}:::q -->|No| ALT[Use Multi-Region or<br/>Multi-Account cells<br/>— cheaper, simpler]:::alt
    Q1 -->|Yes| Q2{Can the domain tolerate<br/>1–15s RPO across the<br/>platform boundary?}:::q
    Q2 -->|No| SPLIT[Keep that data on a<br/>single-platform sync boundary;<br/>hybrid only the rest]:::alt
    Q2 -->|Yes| Q3{Funded for ~2x<br/>operational effort +<br/>Direct Connect?}:::q
    Q3 -->|No| WAIT[Not yet —<br/>close the gap first]:::alt
    Q3 -->|Yes| GO[✅ Multi-platform cells]:::go

    classDef q   fill:#fef3c7,stroke:#f59e0b,color:#92400e
    classDef alt fill:#f1f5f9,stroke:#94a3b8,color:#475569
    classDef go  fill:#dcfce7,stroke:#22c55e,color:#166534,stroke-width:2px

If you answered “no” to the first question, you almost certainly want multi-region (for geographic resilience) or multi-account (for compliance isolation within one cloud) — both deliver strong resilience without crossing the platform chasm.
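
The same decision flow can be written down as a tiny function, which is handy if you want the questions recorded in an architecture review checklist. The function and parameter names are hypothetical; the outcomes mirror the flowchart.

```python
def recommend(has_mandate: bool, tolerates_async_rpo: bool, funded_for_ops: bool) -> str:
    """Walk the three questions in order and return the first matching outcome."""
    if not has_mandate:
        # No regulatory, sovereignty, or provider-independence driver.
        return "multi-region or multi-account cells"
    if not tolerates_async_rpo:
        # The domain cannot accept 1-15 s RPO across the platform boundary.
        return "keep that data on a single-platform sync boundary; hybrid only the rest"
    if not funded_for_ops:
        # No budget for ~2x operational effort plus a dedicated link.
        return "not yet: close the gap first"
    return "multi-platform cells"
```

Note the ordering: funding and RPO tolerance are never consulted if there is no mandate, which matches the point that multi-platform is a business decision first and an engineering one second.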

Key takeaways

Multi-platform is the only cell boundary that isolates a whole provider — its control plane, IAM, and account-wide failures included.
You pay for it in RPO and operations, not just dollars: async cross-platform replication (1–15 s) and ~2× the operational effort.
There is no managed cross-platform replication: build it with CDC, logical replication, and MirrorMaker 2, and design for eventual consistency.
Standardize the stack (Kubernetes + PostgreSQL everywhere) to keep divergence — your biggest long-term cost — under control.
It’s a business/regulatory decision, never a default. No mandate? Use multi-region or multi-account.

What’s next

In Part 6 we stay inside a single cloud but turn isolation into a compliance tool: Multi-Account cells — one AWS account per cell for PCI-DSS-grade blast-radius and billing isolation, without the hybrid tax.

Running hybrid cells, or weighing it for a regulated workload? I’d love to hear how you’re handling cross-platform replication — reply or connect on LinkedIn.
