Migrating a Live Payment System With Minimal Downtime

Recently I worked on migrating a live payment system from a legacy production path to a new AWS stack. The constraints were exactly the kind that make migrations interesting: the system handled real money, the old environment had to stay available, and we needed the ability to roll back if things went wrong.

The problem

The original setup had one core weakness, and two goals followed from it:

  • The runtime and database path were tightly coupled to a legacy production environment in GCP
  • We wanted a cleaner, more scalable, and more maintainable runtime model on AWS
  • We wanted to get there without accepting a long maintenance window

The migration goal was simple to state and hard to execute: move the application runtime and database path with minimal downtime, while preserving a fast rollback option.

The migration strategy

We ended up choosing a five-part strategy:

  1. Keep the old production system live.
  2. Build the new runtime in parallel on AWS.
  3. Use full load plus change data capture to continuously replicate the database into the new environment.
  4. Validate the new stack behind a separate hostname before public cutover.
  5. Handle reverse replication prerequisites ahead of cutover so rollback is operationally possible.

That last point matters more than it sounds. A lot of rollback plans quietly assume you can just switch traffic back. In reality, once the new system starts receiving writes, rollback without reverse replication becomes messy very quickly.

In our case, testing exposed one extra prerequisite: reverse CDC depended on database binlogging being enabled on the new cluster, so I had to turn that on during a maintenance window before treating rollback as truly ready.
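That dependency is cheap to check mechanically once you know to look for it. Here is a minimal sketch of such a readiness gate, assuming the values come from `SHOW GLOBAL VARIABLES` on the new cluster; the required settings below mirror common row-based CDC expectations, so align them with whatever your replication tool actually documents:

```python
# Gate sketch: verify a MySQL-compatible cluster exposes the binlog
# settings that row-based CDC typically needs. The required values
# are assumptions reflecting common CDC requirements (ROW-format
# binlogs); confirm them against your replication tool's docs.

REQUIRED_BINLOG_SETTINGS = {
    "log_bin": "ON",             # binary logging enabled at all
    "binlog_format": "ROW",      # row-based events for CDC
    "binlog_row_image": "FULL",  # full before/after row images
}

def cdc_prerequisite_gaps(variables: dict[str, str]) -> list[str]:
    """Given a name -> value mapping from SHOW GLOBAL VARIABLES,
    return a human-readable list of gaps (empty means ready)."""
    gaps = []
    for name, want in REQUIRED_BINLOG_SETTINGS.items():
        got = variables.get(name, "<unset>")
        if got.upper() != want:
            gaps.append(f"{name}={got!r}, expected {want!r}")
    return gaps
```

Running a check like this in pre-production, and again right before treating rollback as ready, turns a hidden prerequisite into a visible go/no-go item.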

Target architecture

At a high level, the final design looked like this:

flowchart LR
  subgraph "Legacy Production"
    OLDAPP[Legacy runtime]
    OLDDB[Legacy MySQL database]
    OLDAPP --> OLDDB
  end

  subgraph "New AWS Production"
    DMSF["Forward replication: full load + CDC"]
    DMSR["Reverse replication: requires binlog-enabled source"]
    ALB[Load balancer]
    ECS[ECS Fargate service]
    AURORA[Aurora database]
    CACHE[Valkey / Redis]

    ALB --> ECS
    ECS --> AURORA
    ECS --> CACHE
  end

  OLDDB --> DMSF --> AURORA
  AURORA -. rollback safety .-> DMSR -.-> OLDDB

The important design choice here is that the database migration is decoupled from the application cutover.

The new database is brought close to real time first. Only after the new runtime is healthy and validated do you move public traffic.

What pre-production taught us

The most valuable part of the project happened before the production migration even started. We have a pre-production (UAT) environment that mirrors the production setup.

While testing in pre-production, we found a chain of issues that would have made the real cutover much riskier:

  • a queue/cache configuration issue that only surfaced under clustered Redis behavior
  • an internal authentication mismatch between the payment system and an upstream booking service
  • a missing database table that had never been created in that environment
  • a few runtime assumptions that were true in one environment and false in another

None of these issues was significant in isolation, but left undiscovered their cumulative effect would have been death by a thousand cuts, turning an orderly migration into a fire drill.

Building the new stack

Most of our platform already runs on AWS, and I wanted familiar, scalable infrastructure for this system too.

The new stack included:

  • ECS on Fargate for the application runtime
  • Aurora for the relational database target
  • Valkey for cache and queue workloads
  • an application load balancer with HTTPS enforced
  • image build and deployment automation
  • managed replication tasks for forward and reverse database sync

The stack itself was not the hard part. The hard part was making sure that all the operational assumptions matched reality once the service was live.

The timeline

This is the shape of the migration rather than the exact schedule:

flowchart TD
  A[Pre-prod fixes] --> B[Build AWS stack]
  B --> C[Deploy runtime]
  C --> D[Run app migrations]
  D --> E[Start forward replication]
  E --> F[Validate on shadow hostname]
  F --> G[Enable binlog during maintenance window]
  G --> H[Verify reverse CDC readiness]
  H --> I[Switch public traffic]
  I --> J[Start reverse replication]
  J --> K[Observe for 24-48h]

That sequencing gave us three useful checkpoints:

  • runtime health independent of public traffic
  • database replication health independent of application cutover
  • rollback safety immediately after cutover rather than hours later

What went wrong during the build

As usual, the interesting work lived in the edge cases.

1. Cache behavior changed under the new runtime

One issue traced back to clustered cache behavior. A prefixing strategy that looked harmless in one environment triggered cross-slot errors in another when queue-related operations hit the cache.

The fix was not dramatic, but it was subtle: align key prefixing with cluster-safe patterns and remove overrides that diverged from the already-working pre-production setup.
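The underlying mechanics are worth spelling out. Redis Cluster maps every key to one of 16384 slots by hashing the key (or only the hash tag between `{` and `}` when one is present), and multi-key operations fail with cross-slot errors unless all keys share a slot. The slot computation is small enough to reimplement, which makes it possible to audit a prefixing scheme offline without a live cluster:

```python
# Redis Cluster slot computation: CRC16 (XMODEM variant) mod 16384,
# applied to the hash tag if the key contains a non-empty {...}.
# Keys that share a hash tag land in the same slot, so multi-key
# queue operations on them stay cluster-safe.

def crc16_xmodem(data: bytes) -> int:
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Hash only the substring between the first "{" and the next "}",
    # if that substring is non-empty -- the hash-tag rule.
    start = key.find("{")
    if start != -1:
        end = key.find("}", start + 1)
        if end > start + 1:
            key = key[start + 1:end]
    return crc16_xmodem(key.encode()) % 16384
```

With a shared hash tag, keys like `queue:{payments}:pending` and `queue:{payments}:done` hash to the same slot; the same keys without the braces are free to land on different nodes, which is exactly how a harmless-looking prefix override produces cross-slot errors.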

2. An upstream dependency was reachable on one hostname but not another

The payment system depended on an upstream booking API. From the new runtime, one environment-specific hostname timed out while another production hostname worked correctly.

This was a useful reminder that application configuration is often a hidden network migration. Changing infrastructure without re-validating upstream paths is how you end up with a healthy service that cannot actually do useful work.
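A cheap way to catch this class of problem is to probe every upstream endpoint from inside the new runtime before cutover. A minimal sketch; the hostnames you feed it would be whatever your application configuration actually resolves, not the placeholders shown in the usage note:

```python
import socket

def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Best-effort reachability: True if the host resolves and the
    port accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failures, refusals, and timeouts
        return False

def unreachable_upstreams(upstreams: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Return the endpoints that fail the probe, for a pre-cutover report."""
    return [(h, p) for h, p in upstreams if not tcp_reachable(h, p)]
```

Running something like `unreachable_upstreams([("booking.internal.example", 443)])` from the new environment (hypothetical hostname) would have surfaced the timing-out path immediately, before any traffic depended on it.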

3. Explicit HTTPS enforcement was missing

Explicit HTTPS enforcement was initially missing on the new public path. To harden it, we configured the load balancer to force HTTP-to-HTTPS redirects, which made the validation environment behave exactly like the live public setup.
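The property we wanted from that redirect is easy to state and worth asserting mechanically: the scheme upgrades to HTTPS and nothing else changes. A small helper sketch (this is the check, not the load balancer configuration itself); you would run it against redirect responses from the validation hostname using an HTTP client that does not follow redirects:

```python
from urllib.parse import urlsplit

def is_https_upgrade(original_url: str, location: str) -> bool:
    """True when a redirect only upgrades http -> https, leaving
    host, path, and query string intact."""
    src, dst = urlsplit(original_url), urlsplit(location)
    return (
        src.scheme == "http"
        and dst.scheme == "https"
        and src.netloc == dst.netloc
        and src.path == dst.path
        and src.query == dst.query
    )
```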

4. Replication hit duplicate-key issues on busy tables

The forward replication task completed its full load, moved into CDC mode, and then reported duplicate-key issues on two high-write tables.

This turned out not to require a redesign. Targeted reloads of the affected tables cleared the issue and the replication task returned to a healthy state.

This is one of the reasons I like staged migrations: transient data movement problems are much easier to fix before public traffic is involved.
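Deciding which tables need that targeted reload can be scripted against the replication task's per-table statistics. In the sketch below, the record keys and state strings are assumptions loosely modeled on DMS-style table statistics, so verify them against what your tooling actually emits:

```python
# Select tables stuck in an error state for an individual reload,
# instead of restarting the whole replication task. The key names
# and the "Table error" label are assumed, not authoritative.

ERROR_STATES = {"Table error"}

def tables_to_reload(table_stats: list[dict]) -> list[tuple[str, str]]:
    """Return (schema, table) pairs that need a targeted reload."""
    return [
        (t["SchemaName"], t["TableName"])
        for t in table_stats
        if t.get("TableState") in ERROR_STATES
    ]
```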

5. The rollback path had a hidden database prerequisite

One of the most useful findings from cutover testing was that reverse CDC was not just a replication-task problem. It depended on the new Aurora cluster emitting binary logs in the right format.

Forward replication into the new database worked, but testing the reverse path exposed the gap. The fix was not complicated, but it was operationally sensitive: binlogging had to be enabled on the cluster during a maintenance window, then verified before relying on reverse replication for rollback safety.

What mattered most was when we found it. Because testing caught the issue early, I could handle the database change in a controlled window instead of discovering it in the middle of the real cutover.

Why the shadow hostname mattered

Before cutting over the public hostname, we exposed the new stack under a separate validation hostname.

That gave us room to verify:

  • the service was healthy over public HTTPS
  • real pages rendered correctly
  • upstream booking lookups succeeded from inside the new runtime
  • application URLs, redirects, and environment values were coherent

It’s tempting to skip this stage when a migration is running behind schedule, but doing so is a critical mistake. Validating under a shadow hostname fundamentally shifts the goal of the launch: you stop asking, “Will the new stack boot?” and start proving, “The new system behaves exactly like production when accessed like production.”

Cutover and rollback design

The cutover plan itself was intentionally small because the riskiest database prerequisite had already been handled earlier in a maintenance window.

  1. Confirm go or no-go ownership.
  2. Verify the binlog change from the maintenance window is active.
  3. Switch the public hostname to the new load balancer.
  4. Update public-facing application URLs if needed.
  5. Start reverse replication immediately.
  6. Monitor closely for the next 24 to 48 hours.

The rollback model looked like this:

flowchart LR
  CUTOVER[Traffic on new stack] --> CHECK{Issue detected?}
  CHECK -- No --> STABLE[Continue stabilization]
  CHECK -- Yes --> DNS[Repoint traffic to legacy stack]
  DNS --> CDC[Let reverse replication catch up]
  CDC --> RESUME[Resume writes on legacy side]

The key idea is simple: rollback is not just a DNS action. It is a data consistency action.

If the new system has already accepted writes, you need a plan for those writes before old production can safely become authoritative again.

One subtle lesson about validation traffic

With the validation hostname live, we could finally run functional smoke tests against the new stack. While highly useful, this introduced an immediate side effect: parity checks against the original database began showing expected drift on tables touched by those tests.

It sounds obvious in hindsight, but it’s a trap worth highlighting. Teams frequently see any mismatch during a soak phase and panic, assuming replication is broken.

The reality? Often the replication is perfectly fine; it's your comparison model that got skewed. If you need strict parity evidence late in the migration, keep validation writes tightly scoped and deliberate.
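One way to keep the comparison honest is to exclude validation-touched tables explicitly rather than eyeballing mismatches. A minimal sketch, assuming you already collect per-table checksums from both sides (for example via per-table CHECKSUM TABLE runs):

```python
# Compare per-table checksums between source and target while
# excluding tables that validation traffic is known to touch, so
# expected drift does not masquerade as replication failure.

def parity_mismatches(source: dict[str, str],
                      target: dict[str, str],
                      touched_by_validation: set[str]) -> list[str]:
    """Return tables whose checksums differ, ignoring tables on the
    validation allowlist."""
    return sorted(
        table
        for table, checksum in source.items()
        if table not in touched_by_validation
        and target.get(table) != checksum
    )
```

An empty result then means "replication is healthy on everything the smoke tests did not touch", which is the claim you actually want evidence for during the soak phase.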

The migration in one picture

gantt
  title Payment system migration shape
  dateFormat  YYYY-MM-DD HH:mm
  axisFormat  %m/%d %H:%M

  section Preparation
  Pre-prod fixes                :done, a1, 2026-03-01 09:00, 8h
  AWS stack build               :done, a2, 2026-03-03 09:00, 24h

  section Migration
  Runtime deployment            :done, b1, 2026-03-05 10:00, 4h
  Forward load + CDC            :done, b2, 2026-03-06 09:00, 8h
  Shadow hostname validation    :active, b3, 2026-03-07 09:00, 24h

  section Cutover
  Public switch                 :crit, c1, 2026-03-08 10:00, 1h
  Reverse replication           :crit, c2, after c1, 24h
  Stabilization window          :c3, after c1, 48h

The dates here are illustrative, but the shape is accurate: a long period of preparation, a short cutover, and a deliberate stabilization window.

What I would keep from this approach

If I had to compress the whole experience into a few practical rules, they would be these:

  • use pre-production to discover failure modes, not just to prove the happy path
  • separate database replication from application cutover whenever possible
  • validate through a real public path before switching the main hostname
  • design rollback as a write-consistency problem, not just a traffic-routing problem
  • test rollback prerequisites early, especially database binlog and CDC requirements
  • expect environment-specific issues in caches, secrets, and upstream connectivity

Sensitive system migrations are won by sequencing, observability, and restraint more than by cleverness.

If I revisit this in a future post, I will probably go deeper on the CDC tradeoffs and the application-level smoke testing strategy. Those two pieces did most of the heavy lifting.