Migration patterns

May 1, 2023

So you want to migrate a legacy system that no one knows in detail and no one wants to touch. The system might be brittle, cause a lot of trouble, and fail to scale with business growth, and your team doesn't want to deal with it anymore.

One day, the team got into a room to discuss fancy ideas for changing the system and using amazing modern technologies. Someone shared their own experience of the best way to achieve the migration without causing too much pain, burden, and trouble for the team.

The system was just a 3-tier monolith: mobile app, API server, and DB. The team decided to move forward with microservices, a fancy horizontally scalable database, a cache server, and a centralized message queue for event-driven applications.

Whoa, that's a lot of changes! Should you do that? Is it even possible? How could we pull it off? Let's find out.

More context about the migration

The API for the mobile app was developed a long time ago. It evolved over time and no longer fit the company's needs.
The new set of services will be developed with completely new technology and new API schemas. You can't reuse the old codebase.
Some features were deprecated or removed from the user perspective but the code still exists. No one actually knew what was used or not. Digging through the code to gain understanding would take a lot of effort.

Things that don't work well with the migration

Before you do any migration, please make sure you have moved away from the following:

DB-controlled IDs (auto-increment IDs, UUIDs generated by the DB, etc.)
Distributed DB transactions (transactions that span multiple DB instances)
Database foreign keys
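One likely reason DB-controlled IDs hurt is that, during the migration, the same record has to be written to two databases, and each database would assign a different ID. A minimal sketch of generating the ID in application code instead, assuming a hypothetical Order record and plain dicts standing in for the two databases:

```python
import uuid
from dataclasses import dataclass, field


def new_id() -> str:
    """Generate the ID in application code so no database has to assign it."""
    return str(uuid.uuid4())


@dataclass
class Order:
    # Hypothetical record type; the ID exists before any database sees it.
    customer_id: str
    id: str = field(default_factory=new_id)


order = Order(customer_id="c-1")

# Both systems receive the same ID, so records can be matched later.
old_db_row = {"id": order.id, "customer_id": order.customer_id}
new_db_row = {"id": order.id, "customer_id": order.customer_id}
```

With this in place, neither database is the authority for identity, which makes comparing and back-filling records much simpler.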

Migration goals

The migration might take months or even years, depending on how complex it is. It's not an easy, boring task. It's full of traps, sweat, and tears. Unfortunately, the system has to be migrated; otherwise it will drag you down until the whole company dies. I will describe some other approaches (both good and bad) at the end of this blog post, so you can learn from them and avoid repeating the same mistakes.

We want the migration process to be:

Low risk – least issues, downtime
Repeatable – easy to follow, able to get the same result
Effective – not much waste, minimum effort, get value fast

We should treat the migration process much like a refactoring process. To achieve the 3 goals above, we must first put some constraints in place. Once the migration is complete, all constraints can be removed, which unlocks the full potential of making any changes to the new system architecture.

There are 2 key constraints to consider here: the feature set and the API specs. Basically, the feature set and API specs of the new system must not exceed those of the existing system. Otherwise, this would be feature development, not system migration. Let's look at some examples.

[NOT OK] Your order processing system has 5 states and the new system has 8 states.
[OK] Your order processing system has 5 states and the new system has 3 states.
[NOT OK] The old user API schema doesn't have an “Age” field but the new schema has it.
[OK] The old user API schema has “full name” field and the new schema has “first name” and “last name” fields that can be extracted from existing “full name” data.
[OK] The old product API has an image URL field as a string but the new schema has an image URL as an array of strings.
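The two [OK] schema changes above share one property: every new field can be derived from existing data. A minimal sketch of such derivations, assuming hypothetical field names (full_name, first_name, last_name, image_url, image_urls) and a deliberately naive name split:

```python
def split_full_name(full_name: str) -> tuple[str, str]:
    """Split on the first space; a simplification that real names often break."""
    first, _, last = full_name.strip().partition(" ")
    return first, last.strip()


def to_new_user(old_user: dict) -> dict:
    """Derive the new user schema purely from fields the old schema has."""
    first, last = split_full_name(old_user["full_name"])
    return {"id": old_user["id"], "first_name": first, "last_name": last}


def to_new_product(old_product: dict) -> dict:
    """Wrap the single image URL string into an array of strings."""
    url = old_product.get("image_url")
    return {"id": old_product["id"], "image_urls": [url] if url else []}
```

An “Age” field could not be written this way, because no input in the old schema carries that information. That is a quick test for whether a change belongs in the migration.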

You get the idea. Keep the feature set and API specs within the scope of the existing system. Once the system has been migrated, the new codebase and architecture should, hopefully, enable your team to make changes easily.

Migration steps

[Figure: migration stages]

There are 6 stages.

Initial stage
Translation stage
Dual-write stage
Read-compare stage
Cut-off stage
Clean-up stage

There isn't much to say about the Initial stage, so we'll go into detail on the remaining stages.

Translation stage

The goal of this stage is to test the new API schema with the client. You create a new service with the technology you want. You implement the new API spec and translate each call into the existing API. Then, migrate the mobile app to use the new API, test it, and release it.

In this stage, no database or persistence layer is developed. You just translate the ugly existing API into the new, clean API you want. Please keep in mind that this is only possible because you don't add any features.

The value you get from this stage is:
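A translation layer like this could be sketched as follows. It is only an illustration: the legacy operation name getOrderDetail, the field names, and the status values are all hypothetical, and the legacy API is replaced by a fake function so the sketch runs on its own.

```python
from typing import Callable

# Hypothetical mapping from five legacy statuses to three new states.
LEGACY_STATUS_TO_STATE = {
    "NEW": "open",
    "PAID": "open",
    "SHIPPED": "open",
    "DELIVERED": "completed",
    "CANCELLED": "cancelled",
}


class OrderTranslator:
    """New-style endpoint that stores nothing and delegates to the old API."""

    def __init__(self, call_old_api: Callable[[str, dict], dict]):
        self._call_old_api = call_old_api

    def get_order(self, order_id: str) -> dict:
        # Translate the new request into the legacy request shape.
        legacy = self._call_old_api("getOrderDetail", {"orderNo": order_id})
        # Translate the legacy response into the new schema.
        return {
            "id": legacy["orderNo"],
            "state": LEGACY_STATUS_TO_STATE[legacy["status"]],
        }


def fake_old_api(operation: str, params: dict) -> dict:
    # Stand-in for the legacy system, used only to make the sketch runnable.
    return {"orderNo": params["orderNo"], "status": "SHIPPED"}


translator = OrderTranslator(fake_old_api)
new_response = translator.get_order("A-100")
```

Because the translator holds no state, a bad API design can be thrown away and redone at this stage without touching any data.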

You test your new API design to see if that fits well with the client
You may adjust some UI in mobile to match the new experiences you want. You can test the new UI design with your customer and get feedback. If anything goes wrong, you can easily change it without a lot of wasted effort.

Dual-write stage

Once everything is going well, you can start implementing the writing side (insert, update, delete). Start by setting up the database you want, then write code to insert, update, and delete data in that database when the API is invoked.

You shouldn't implement every table and deploy all of it at once. Split the work into multiple chunks, each focused on a small set of tables, and keep deploying to production continuously. This allows you to test whether the DB is well suited to your needs. Also, use feature flags to enable and disable writes to the new DB at runtime (see the code sample in the Read-compare stage section).

Inserts should work in all cases. Updates and deletes, however, may fail for old records that were created before the new system started writing, because those records don't exist in the new DB yet. You can ignore those errors for now and focus on the correctness of newly created data.

Do not return data from the new DB to the client yet, because it might not be complete or correct. Let it run in production for a couple of days, monitor for issues, fix them, and then you're ready for the next step.
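One way to tolerate those missing records is to treat "not found" as an expected outcome of the write to the new DB rather than an error. A minimal sketch, assuming a hypothetical in-memory NewStore in place of the real database:

```python
import logging

logger = logging.getLogger("migration.dual_write")


class NewStore:
    """In-memory stand-in for the new database (hypothetical)."""

    def __init__(self):
        self.rows = {}

    def insert(self, row_id, data):
        self.rows[row_id] = dict(data)

    def update(self, row_id, data):
        if row_id not in self.rows:
            raise KeyError(row_id)
        self.rows[row_id].update(data)

    def delete(self, row_id):
        del self.rows[row_id]  # raises KeyError when the row was never written


def shadow_write(store, operation, row_id, data=None):
    """Apply a write to the new store; return False when the row is not there yet."""
    try:
        if operation == "insert":
            store.insert(row_id, data)
        elif operation == "update":
            store.update(row_id, data)
        elif operation == "delete":
            store.delete(row_id)
        else:
            raise ValueError(f"unknown operation: {operation}")
        return True
    except KeyError:
        # Expected before back-fill: the record predates the dual-write stage.
        logger.info("skipped %s for %r: not in new store yet", operation, row_id)
        return False
```

Logging the skipped writes, instead of silently dropping them, also gives a rough count of how much data still has to be back-filled later.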

If anything goes wrong with your design or your data in the new system, then you can simply reset the DB and redesign it again. It's low-risk because the data hasn't been used by the client.

A common mistake is to start working on the reading side first instead of the writing side. This will hold you back because you have to share the DB, keep it compatible with the existing system, and do a massive, time-consuming DB migration, only to find out that it's very hard to shut the old DB down during the clean-up stage.

Another common mistake is coupling the new system with the existing one. For example, after you insert a record into the new DB, you use the result to update the existing system via its API. This is not what you want, because you need to be able to remove the existing system without changing the implementation of the new one. Make sure those 2 systems don't know about each other.

Read-compare stage

The prerequisite for this stage is back-filling data into the new system. After the previous stage, not all of the data is available in the new system, so you must now find some way to migrate it. For example, you could write a DB export and import, or write a script that invokes an endpoint on the new system to set up the data.

Then, compare every response from the new system with the response the client actually receives from the existing system. Log all the records that don't match. Fix the mismatched data and fix the bugs that caused it. Wait until the system is stable and has almost no issues.
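A back-fill script along these lines should be safe to run more than once. A minimal sketch, assuming every row carries an "id", the new store is a plain dict, and transform is a hypothetical function that maps an old row to the new schema:

```python
from typing import Callable, Iterable


def backfill(old_rows: Iterable[dict], new_store: dict,
             transform: Callable[[dict], dict]) -> dict:
    """Copy legacy rows into the new store, skipping rows that already exist."""
    stats = {"copied": 0, "skipped": 0}
    for old_row in old_rows:
        row_id = old_row["id"]
        if row_id in new_store:
            # Already written by the dual-write path; do not overwrite it
            # with a possibly stale export.
            stats["skipped"] += 1
            continue
        new_store[row_id] = transform(old_row)
        stats["copied"] += 1
    return stats


legacy_rows = [
    {"id": "u1", "full_name": "Ada Lovelace"},
    {"id": "u2", "full_name": "Alan Turing"},
]
new_store = {"u2": {"id": "u2", "name": "Alan Turing"}}
stats = backfill(legacy_rows, new_store,
                 lambda row: {"id": row["id"], "name": row["full_name"]})
```

Skipping existing rows is a design choice in this sketch: it assumes data written during the dual-write stage is at least as fresh as the export.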

All of the comparisons must be controlled under the same feature flag.

You could write the feature flag and flow control like the pseudo-code below:

def someApiHandler(request, response):
    # The flag holds the current migration stage for this endpoint.
    stage = featureFlag.get("migration.someApi", default="translation")

    if stage == "cut-off":
        # After cut-off, only the new system serves the request.
        return callNewApi(request)

    # Before cut-off (or for an unknown stage), the old system stays the source of truth.
    result = callOldApi(request)

    if stage in ("dual-write", "read-compare"):
        # Send the same request to the new system, but never return its result yet.
        newResult = callNewApi(request)
        if stage == "read-compare":
            compareResponseAndLogIfNotMatched(result, newResult)

    return result
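The compareResponseAndLogIfNotMatched helper is left undefined above. One possible sketch, assuming responses are JSON-like dicts and lists, is shown below; the function names diff_responses and compare_and_log are illustrative, not part of any library.

```python
import logging

logger = logging.getLogger("migration.compare")


def diff_responses(old, new, path=""):
    """Return a list of (path, old_value, new_value) for every mismatch."""
    if isinstance(old, dict) and isinstance(new, dict):
        mismatches = []
        for key in sorted(set(old) | set(new)):
            child = f"{path}.{key}" if path else str(key)
            if key not in old or key not in new:
                mismatches.append((child, old.get(key), new.get(key)))
            else:
                mismatches.extend(diff_responses(old[key], new[key], child))
        return mismatches
    if isinstance(old, list) and isinstance(new, list) and len(old) == len(new):
        mismatches = []
        for index, (a, b) in enumerate(zip(old, new)):
            mismatches.extend(diff_responses(a, b, f"{path}[{index}]"))
        return mismatches
    return [] if old == new else [(path or "<root>", old, new)]


def compare_and_log(old_result, new_result):
    """Log every differing field; return True when the responses match."""
    mismatches = diff_responses(old_result, new_result)
    for field_path, old_value, new_value in mismatches:
        logger.warning("mismatch at %s: old=%r new=%r",
                       field_path, old_value, new_value)
    return not mismatches
```

Logging the path of each differing field, rather than the two whole responses, makes it easier to group mismatches by cause when you go through the logs.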

Cut-off stage

At this point, all bugs should be fixed, all response data should match, and performance, availability, and reliability should meet your expectations. You can simply switch the feature flag to the “cut-off” stage. Be careful: from this point on, you can't switch back, because new writes no longer reach the existing system. If anything breaks, you have to ship a hotfix.

You should take a look at the existing system to see if there's any traffic going there. Does it have any load? Where is the source of the traffic?

Clean-up stage

You're free to destroy the existing system and remove the migration code from the new system. Now you are in a better world with a bright future. You can add as many features as you want and, hopefully, the new code and architecture will enable you to reach your business goals faster.

As you can see, the migration code is quite generic, so you could build it as a generic migration proxy that routes each API call to different versions of the backend. This also eliminates the code clean-up effort in the last stage. I don't know of any open-source project that does this, but I hope someone will develop one.
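Such a proxy might look roughly like the sketch below. It is not an existing library: the MigrationProxy class, its method names, and the callback signatures are assumptions for illustration, and the backends are plain functions.

```python
from typing import Callable, Dict, Tuple

Backend = Callable[[dict], dict]


class MigrationProxy:
    """Routes each named API to the old or new backend based on its stage."""

    def __init__(self, get_stage: Callable[[str], str],
                 on_mismatch: Callable[[str, dict, dict], None]):
        self._get_stage = get_stage      # e.g. a feature-flag lookup
        self._on_mismatch = on_mismatch  # e.g. a function that logs the mismatch
        self._routes: Dict[str, Tuple[Backend, Backend]] = {}

    def register(self, name: str, old_backend: Backend, new_backend: Backend):
        self._routes[name] = (old_backend, new_backend)

    def call(self, name: str, request: dict) -> dict:
        old_backend, new_backend = self._routes[name]
        stage = self._get_stage(name)
        if stage == "cut-off":
            return new_backend(request)
        result = old_backend(request)
        if stage in ("dual-write", "read-compare"):
            shadow = new_backend(request)
            if stage == "read-compare" and shadow != result:
                self._on_mismatch(name, result, shadow)
        return result


stages = {"getUser": "read-compare"}
mismatches = []
proxy = MigrationProxy(
    get_stage=lambda name: stages.get(name, "translation"),
    on_mismatch=lambda name, old, new: mismatches.append((name, old, new)),
)
proxy.register(
    "getUser",
    old_backend=lambda req: {"id": req["id"], "name": "Ada"},
    new_backend=lambda req: {"id": req["id"], "name": "ada"},
)
result = proxy.call("getUser", {"id": 1})
```

Each endpoint moves through the stages independently, and removing an endpoint from the proxy after cut-off is the only clean-up left.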

Other approaches that may work or may not

I've seen system migration attempts that used the patterns below. Some of them work in some contexts. I've sorted them from the most feasible approach to the least feasible. I still recommend the pattern above, so please don't use the ones below unless you know what you are doing.

1. DB synchronization with some down-time during the cut-off

You can use a tool like AWS DMS to synchronize DB changes. When the cut-off date arrives, you simply shut down all the services and bring them back up against the new DB.

This can work for a simple migration such as splitting a database. This pattern also has only a couple of steps, so it requires little effort.

2. Modularize the codebase by refactoring, then splitting the deployment

If you can't make the code highly modular with very low coupling, it might be impossible to split it into services. The idea is to allow modules to communicate with each other only through basic data types such as DTOs and primitives. Do not allow them to share the same ORM model or use shared memory.

After the code has been refactored, you can make a copy of its deployment and split the traffic. You may need an API gateway for routing and for aggregating responses from different services.
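As a rough illustration of that boundary, the sketch below has one module expose only an immutable DTO while its storage format stays private. The module and field names are hypothetical, and a dict stands in for the ORM model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerDTO:
    """Plain, immutable data that is safe to pass across a module boundary."""
    id: str
    email: str


class CustomerModule:
    """Owns customer storage; its internal row format never leaves the module."""

    def __init__(self):
        self._rows = {"c-1": {"id": "c-1", "email": "ada@example.com", "pw_hash": "x"}}

    def get_customer(self, customer_id: str) -> CustomerDTO:
        row = self._rows[customer_id]  # stand-in for an ORM model
        return CustomerDTO(id=row["id"], email=row["email"])


class BillingModule:
    """Depends only on a DTO-returning function, not on customer internals."""

    def __init__(self, get_customer):
        self._get_customer = get_customer

    def receipt_address(self, customer_id: str) -> str:
        return self._get_customer(customer_id).email


billing = BillingModule(CustomerModule().get_customer)
```

Because BillingModule only sees a function that returns a DTO, that function can later be swapped for a network call to a separate service without changing the billing code.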

The downside of this approach is that it requires you to refactor all the code (both used and unused) which takes a huge amount of effort to understand all corner cases of the existing system.

3. DB synchronization with reading endpoints first

It's similar to #1 except the new API service will be implemented with reading endpoints first. This may sound OK to you but it turns out to cause problems later.

Tools like DMS will cause write conflicts once you start implementing writing endpoints.
DMS might cause issues from time to time, such as synchronization lag and lost writes.
The dependency between the systems is cyclic, which makes it very hard to fully migrate the system.

4. Rewrite from the ground up and migrate later

This approach starts by developing the new system with your desired architecture, data modeling, APIs, etc. Then, try to find a way to migrate both DB data and API calls later.

It creates an illusion of progress, because you can say that the new system is being developed and features are being implemented, so you must be approaching the target. In reality, you aren't delivering any value to users at all.

It is also very hard to plan a data migration from a messy database to the new one when the new database also requires additional data that users haven't entered yet.