“The most dangerous phrase in resilience engineering is ‘our cloud provider would never go fully down.’”
In Part 4 we laid six isolation boundaries on a table β Namespace, Node, Availability Zone, Region, Multi-Platform, and Multi-Account β and asked which one your business actually needs.
Five of those six share a quiet assumption: the cloud provider stays up. Multi-AZ survives a data-center fire. Multi-Region survives a regional outage. But all of them are still one IAM root, one billing account, one provider’s control plane away from a bad day you can’t route around.
The sixth boundary is the only one that breaks that assumption. Multi-platform cells place complete, self-contained cells across different platforms β AWS, your own data center, maybe a second cloud β so that losing an entire provider takes out a slice, not the system.
It is the strongest isolation you can buy. It is also the most expensive way to sleep at night. Let’s earn it.
Where we are in the journey
graph LR
P1[Part 1<br/>Why cells]:::done --> P2[Part 2<br/>Routing & sharding]:::done
P2 --> P3[Part 3<br/>Implementation]:::done
P3 --> P4[Part 4<br/>Selection Matrix]:::done
P4 --> P5[Part 5<br/>Multi-Platform<br/>β you are here]:::now
P5 --> P6[Part 6<br/>Multi-Account]:::next
classDef done fill:#dcfce7,stroke:#22c55e,color:#166534
classDef now fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a,stroke-width:3px
classDef next fill:#f1f5f9,stroke:#94a3b8,color:#475569,stroke-dasharray:4 3
Part 4 ended with a ranking of isolation strength and cost. Multi-platform sat at the far end of both axes: strongest isolation, highest cost ($$$$$), highest complexity. This post zooms all the way in on that boundary.
What a multi-platform cell actually is
A reminder of the core idea from Part 1: a cell is a complete, isolated replica of your stack β compute, data, networking β serving a slice of users. One cell failing affects only its slice.
A multi-platform cell adds one more property: cells don’t all live on the same provider.
Single-platform (Multi-Region): Multi-platform (Hybrid):
Cell A β AWS us-east-1 Cell A β AWS (EKS + Aurora)
Cell B β AWS eu-west-1 Cell B β AWS (EKS + Aurora)
Cell C β AWS ap-south-1 Cell C β On-prem (K8s + PostgreSQL)
(one provider, three regions) Cell D β Azure (AKS + PostgreSQL)
(independent platforms)
The blast radius you’re now containing isn’t a server, a zone, or a region. It’s an entire provider β its control plane, its global IAM, its account-wide quotas, its bad deploy. No single-cloud boundary can isolate that. Multi-platform can.
[!NOTE] Multi-platform is not “lift-and-shift to two clouds.” It’s running production cells on each platform, each able to serve real traffic, with users routed across them.
Why you would ever take this on
Be honest: most teams should not reach for this. The legitimate drivers are specific.
- Regulation & data residency. A regulator requires certain data to stay on-premises or inside national borders a given cloud can’t satisfy. The cell boundary becomes a compliance boundary.
- Provider independence / concentration risk. A board or risk function decides the business cannot be 100% dependent on a single vendor’s availability or commercial terms.
- Sovereignty. Government or critical-infrastructure workloads that mandate operator-controlled infrastructure.
- Gradual migration. A large estate moving to cloud over years runs hybrid cells as the steady state during the transition, not just a cutover.
If your reason isn’t on a list like this, multi-region or multi-account cells will give you most of the resilience for a fraction of the pain. (More on that in Decision, below.)
The architecture
graph TB
U([Users]):::user --> GLB[Global Load Balancer<br/>geo / weighted routing + health checks]:::route
subgraph AWS["AWS"]
C1[Cell A<br/>EKS · Aurora · MSK]:::aws
C2[Cell B<br/>EKS · Aurora · MSK]:::aws
end
subgraph DC["On-Premises DC"]
C3[Cell C<br/>K8s · PostgreSQL · Kafka]:::onprem
end
GLB --> C1
GLB --> C2
GLB --> C3
C1 <-. "CDC / log replication<br/>over Direct Connect" .-> C3
C2 <-. "Kafka MirrorMaker 2" .-> C3
classDef user fill:#ede9fe,stroke:#8b5cf6,color:#5b21b6
classDef route fill:#fae8ff,stroke:#d946ef,color:#86198f
classDef aws fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
classDef onprem fill:#fef3c7,stroke:#f59e0b,color:#92400e
style AWS fill:#eff6ff,stroke:#3b82f6,stroke-dasharray:5 4
style DC fill:#fffbeb,stroke:#f59e0b,stroke-dasharray:5 4
Three pieces do the heavy lifting:
- Global routing. A provider-neutral entry point (DNS-based GSLB, Global Accelerator + an on-prem load balancer, or an anycast edge) sends each user to a healthy cell with sticky routing, and drains a cell that goes unhealthy β regardless of which platform it’s on.
- Homogeneous-enough cells. Each cell is a full stack. The trap is letting them diverge (see Guardrails). On AWS that’s EKS + Aurora; on-prem that’s Kubernetes + PostgreSQL β chosen deliberately to look as alike as possible.
- Inter-platform connectivity. Production hybrid runs on AWS Direct Connect (10β50 ms, predictable) β not VPN over the public internet (50β200 ms, jittery). This link is the spine your cross-platform replication rides on.
The hard part: data across platforms
Inside one cloud, you lean on managed magic β Aurora Multi-AZ gives you synchronous replication and RPO 0. Across platforms, that magic doesn’t exist. There is no native cross-platform replication between Aurora and a PostgreSQL box in your data center.
So you build it at the application/data layer, and you accept that it’s asynchronous:
| Data store | Cross-platform mechanism | Typical RPO |
|---|---|---|
| Relational | Aurora β PostgreSQL logical replication / Debezium CDC | 5β15 s |
| Event/stream | MSK β on-prem Kafka via MirrorMaker 2 | 1β5 s |
| Object | S3 β on-prem (MinIO) sync | secondsβminutes |
graph LR
A[(Aurora<br/>AWS)]:::aws -- "logical replication / CDC<br/>RPO 5-15s" --> B[(PostgreSQL<br/>On-prem)]:::onprem
K1[MSK · AWS]:::aws -- "MirrorMaker 2<br/>RPO 1-5s" --> K2[Kafka · On-prem]:::onprem
classDef aws fill:#dbeafe,stroke:#3b82f6,color:#1e3a8a
classDef onprem fill:#fef3c7,stroke:#f59e0b,color:#92400e
[!WARNING] Because replication is async, a cell can fail with a few seconds of writes not yet mirrored. Your design must tolerate eventual consistency across cells β idempotent writes, conflict resolution, and a clear “source of truth per slice.” If your domain truly cannot lose 5 seconds of data across a provider failure, multi-platform is the wrong tool; keep that data on a single-platform synchronous boundary.
RTO/RPO reality
This is the uncomfortable trade. Multi-platform gives you the best isolation but not the best recovery numbers:
| Boundary | Isolation from⦠| RTO | RPO |
|---|---|---|---|
| Multi-AZ | Data-center failure | < 5 min | 0 (sync) |
| Multi-Region | Regional failure | < 1 min (active-active) | ~0β1 s |
| Multi-Platform | Whole-provider failure | 5β30 min | 1β15 s |
You are trading tighter RPO for a wider failure domain coverage. That’s the deal: nothing else isolates a full provider, and the price of that isolation is async data and a slower, more manual failover.
Real-world: a regulated bank
A bank under a data-residency mandate runs:
- 5 cells on AWS (EKS + Aurora + MSK) for scale and elasticity.
- 3 cells on-premises (Kubernetes + PostgreSQL + Kafka) to satisfy the regulator that core records live on operator-controlled infrastructure.
- Direct Connect (10 Gbps) between them; PostgreSQL logical replication and MirrorMaker 2 carrying state across the boundary.
- Result: ~99.95% availability, full regulatory sign-off, and survivability of a complete AWS-side outage β at roughly $400K/month.
That last number is the whole point of the next section.
Cost & complexity: the $$$$$ pattern
Multi-platform is the most expensive boundary in the matrix, and the cost isn’t mostly infrastructure β it’s operations:
- Two (or three) different operational playbooks, on-call rotations, and upgrade cadences.
- Direct Connect circuits, cross-platform egress, and duplicated tooling.
- A standing hybrid failover test program β untested failover is theater.
- New failure modes the single-cloud teams never see: replication lag spikes, clock skew, certificate trust across domains, split-brain risk.
Rule of thumb: budget for ~2Γ the operational effort of a single-platform multi-cell deployment, not 2Γ the servers.
Guardrails
β Do
- Standardize the stack to minimize divergence β Kubernetes everywhere, PostgreSQL-compatible everywhere. Every difference (EKS vs vanilla K8s, Aurora vs PostgreSQL) is operational tax you pay forever.
- Use Direct Connect (or equivalent dedicated link) for production, never VPN-over-internet.
- Do replication at the application/data layer (CDC, logical replication, MirrorMaker) and make writes idempotent.
- Test hybrid failover on a schedule β drain an AWS-side cell to on-prem and back.
β Don’t
- Mix three database engines because each platform’s “native” option was easiest. Stack sprawl kills you.
- Assume cloud and on-prem have the same performance envelope or the same elasticity.
- Reach for multi-platform to fix application reliability. Cells isolate infrastructure; circuit breakers, timeouts, and bulkheads (Part 3) handle application failures. Don’t confuse the two.
Decision: is multi-platform right for you?
graph TD
Q1{Regulatory / sovereignty<br/>mandate for on-prem or<br/>provider independence?}:::q -->|No| ALT[Use Multi-Region or<br/>Multi-Account cells<br/>β cheaper, simpler]:::alt
Q1 -->|Yes| Q2{Can the domain tolerate<br/>1β15s RPO across the<br/>platform boundary?}:::q
Q2 -->|No| SPLIT[Keep that data on a<br/>single-platform sync boundary;<br/>hybrid only the rest]:::alt
Q2 -->|Yes| Q3{Funded for ~2x<br/>operational effort +<br/>Direct Connect?}:::q
Q3 -->|No| WAIT[Not yet β<br/>close the gap first]:::alt
Q3 -->|Yes| GO[β
Multi-platform cells]:::go
classDef q fill:#fef3c7,stroke:#f59e0b,color:#92400e
classDef alt fill:#f1f5f9,stroke:#94a3b8,color:#475569
classDef go fill:#dcfce7,stroke:#22c55e,color:#166534,stroke-width:2px
If you answered “no” to the first question, you almost certainly want multi-region (for geographic resilience) or multi-account (for compliance isolation within one cloud) β both deliver strong resilience without crossing the platform chasm.
Key takeaways
- Multi-platform is the only cell boundary that isolates a whole provider β its control plane, IAM, and account-wide failures included.
- You pay for it in RPO and operations, not just dollars: async cross-platform replication (1β15 s) and ~2Γ the operational effort.
- There is no native cross-platform replication β build it with CDC, logical replication, and MirrorMaker 2, and design for eventual consistency.
- Standardize the stack (Kubernetes + PostgreSQL everywhere) to keep divergence β your biggest long-term cost β under control.
- It’s a business/regulatory decision, never a default. No mandate? Use multi-region or multi-account.
What’s next
In Part 6 we stay inside a single cloud but turn isolation into a compliance tool: Multi-Account cells β one AWS account per cell for PCI-DSS-grade blast-radius and billing isolation, without the hybrid tax.
Running hybrid cells, or weighing it for a regulated workload? I’d love to hear how you’re handling cross-platform replication β reply or connect on LinkedIn.
References
- AWS Builders’ Library β Workload isolation using shuffle-sharding and resilience patterns
- AWS Direct Connect β resiliency recommendations (service docs)
- Amazon Aurora PostgreSQL β logical replication; Debezium β change data capture
- Apache Kafka β MirrorMaker 2 (geo-replication) for cross-cluster replication
- Prior in this series β Part 1 Β· Part 4 (full series linked above)
π¬ Comments