Data migration on a live system!
Every system starts small on some random day. You design the system with few assumptions only to learn later that they don't make sense as per the future requirements. There is no issue with this thought process. You can never design a system which will serve well for the eternity. On the flip side, the best design is something which can be altered later without collapsing the entire system. We need to embrace the change and build a system which will adapt the never ending changes.
Let us apply this philosophy to software systems also. Any software system has two main parts to it. One, the software or the code and the other one is the data that the system generates to maintain the state of the application. Generally, the software is comparatively is easy to get adapted with the changes. If you think the software need to be built in different way, just go and change the code. It should be as simple as that. On the other hand, the data that the software stores in the databases becomes increasingly hard for changes. These changes can be application level requirements, for example "configuration" of the application no more belongs to user but the project. So, you need to delink it from user and link it to "project". This can be adding a column, or updating the data in existing columns etc. At times the reason for the data changes could be low level. For example, mongodb has a cap on the document size ie. 5mb. So, if you have designed the data in a way which will potentially hit this limit, you might have to completely restructure the data! Unfortunately, this is one of the main reasons behind the data migration that we had gone through recently.
A little back story; we are building a next generation loan management system at work. We had to store a stream of events for each loan. Every event carries few common details like the timestamp of occurrence and a bunch of other details corresponding to the type of the event. In the favour of the tech stack we follow, we built it on mongodb. While designing the system in early days, we were in fact confused about the structure of the data being stored in the database. Were were not sure if to store each event as an individual record in the events collection, or to group all the events belong to one loan as nested list under one record. After some brainstorming we went ahead with the later approach.
It was absolutely fine, we did not have any major issue with this approach until it did. Because of the change in the business, we could see the events per loan increasing rapidly, from the orders of hundreds to thousands. This is a significant change. Separately, we have different data warehouses for analytics purposes and data pipelines to mirror the data. We could see these pipelines started timing out, few of our systems also started taking more time to churn the data for the outputs we wanted. We quickly understood the situation and went back to the whiteboard to redesign the events collection. We also learnt about the per record memory cap on mongodb in the process. After weeks of discussions we decided to move the data to the first approach that I mentioned above, each event as a record in the collection. We were just worried about the mongo read performance as the number of records in the collection will shoot up to multiple hundreds of folds. We finally gave more weight to mongodb's capabilities than our doubts. At the end of all these discussions, we concluded that
If number of items (in a list) can be determined and fixed it can be a nested list in a mongo document. If the number of items cannot be determined, it should be a record itself in the collection
In our case, we cannot control the number of events per loan ourself. It can grow out of our control. On the other hand, let's say _address_es per user can be determined upfront (lets say you will never allow more than 10 addresses per user). So, the later one makes sense to be a nested list under user document.
Deciding to migrate the data is the simplest thing. The hardest part is actually moving the data itself. Main problem is it is not simple data migration from one collection to another collection. We need to unwrap the nested list into individual records. It would have been a click of button on UI if it is just data migration from one collection to another. On top of this, this is one of our heaviest collections. It stores tonnes of data. Remember that this data is realtime and data is being updated every second! We needed to make sure that gets migrated consistently and completely and we had to prove it that it actually did!
As mentioned earlier, this is one of our critical __data__s we store. Messing it up means losing the sleep for months. To keep the system healthy and sane from ourself, we had to come up with few rules. No matter what
- Never stop writing the existing collection until (3)
- Don't change anything about existing data
- We need to prove with numbers that
- Data is moved completely
- Data is moved consistently ie X event exists in new collection and it is exactly X not more, not less
They sound simple and straight forward. Isn't it? It all depends on the type of the data. As mentioned earlier, this is very critical data. We had to do it this way. No way we just blindly run few scripts and click few buttons and consider it is done. Never!
As mentioned above, we did not want to do any, I really mean any changes both in terms of read and write to the old collection until we prove that the data is properly copied and being mirrored on the new collection. Following is the quickest strategy came to our mind but we ruled it out quickly.
0. Move at one shot - Ruled out
The first thing that came to our mind is to copy the data one night and move the rights to the new collection. There are so many issues with this approach.
- The data is always floating. There is no one such moment where there are no writes happening. You really want to deal with the data getting written while you run the migration scripts. There will always be edge cases even if you use timestamps to figure out the events that getting added in delta time and add them later.
- You cannot simply shift the writes. You need to prove that it is complete! So, that means you will take some more time to smartly show that there is no difference. You will have to fill the delta data again! It never ends. But this says that there are two process, backfill, and separately mirror realtime writes.
Even though we understood that this approach does not work for us, we understood a very important point as mentioned above. There are two steps to it
- Mirror realtime writes to new collection
As learnt above, we had to mirror the new writes and updates on the new collection apart from backfill (covered next). We implemented it as a side effect because we did not want to break the main flow if mirroring on the new collection fails because of any reason. We deal with the failed ones in the backfill step safely. This is the first step before backfilling because we don't want to miss out any data because of the time gap between these two steps.
This is the primary step of the entire process. As already mentioned, it is not just a data migration from one collection to another. There is some logic to unwrap the list and make an individual mongo record out of it. There is nothing interesting in this step, it is the most simplest one, yet, we went wrong here as well, more on this later. It is a simple python script which takes a record from old collection, reads the events list, for each nested event, transforms and constructs a dict which gets inserted as individual mongo record.
This step is idempotent, which means, if some event already exists in the new collection, it does not insert it again. This is achieved by indexes. This makes this step to run multiple times without corrupting the data on the new collection.
3. Writes vs Reads
This is not an action per say but something we understood in the process that we need to deal with writes and reads to the new collections separately. We had to focus on reads in the first phase completely. We had to build good tools (more later) to measure the health of above two steps and make sure the data is being mirrored properly. After getting enough confidence, if at all everything else is in place, we could think about moving the reads to new collection. This thought process gave us more freedom and confidence. This made us to just scrap the entire new collection if something goes wrong and start from beginning which we actually did!
At some point we had some bug/issue in the transformation part of the back fill. We started mirroring also. Only after a week, in the process of checking the new data, we identified that the data is not in correct form. We never came to know about it because no one is reading from new collection! But because we did not change anything in the old collection, we just deleted the new collection and started from step 0.
To the point "prove that data is complete and consistent" we had to write few python scripts that prove it. They are like litmus tests. One main script was to go through the old collection and check every event presents in the new one and the content also matches. There are nuances for this step also. One such thing was that, from the time it reads from old collection to the time it reads the new collection, there would have been changes in the data and it would say data is not 100% match. When we go and check the particular event that the script says not matching, they would match! But most of the times it used to tell 100% of the data is mirrored.
5. Primary and side effects
Once all of the above steps are done properly and the health scripts are green for a week, we started building confidence in the new collection. But we are not done yet! We just wrote the data to new collection. We still need to start reading from new collection. This is the second phase. Reading from new collection does not mean that we stop writing to old collection. We decided to still write to old collection but as a side effect now. That means, the new collection becomes the primary one. This is the first place where we are altering the write/read behavior on old collection. This was the scariest part I would say. Any mismatch in the data results in so many other things. The data corruption spreads to other data stores.
We did not want to take the risk at that level. We wanted to change the reads in a full controlled manner. We wanted to roll it out incrementally. Kind of A/B. We had put custom logic to roll it out to selected loans so that we can see if there are any issues. We started with one loan, 10%, 50% and later 100%. We kept checking the health scripts still, through out this process to make sure that everything is still in place.
If I am not wrong this step itself took a good 10 days to couple of weeks. We had to take time to make sure that nothing is going wrong.
6. Stop writing to old collection
This is the third phase where we completely stop writing to old collection. I think its been around a month since we finished the last step and as I write this article, we have not stopped writing to old collection yet! We still run the health scripts every now and then and they all seem fine. We are pretty confident of stopping writes to old collection very soon now though. We can still take some time on it as both the reads and writes have already moved to the new collection.
Initially we felt that we are actually doing over engineering with all side effects, health scripts, controlled roll outs, etc. but they are absolutely required when you are dealing with critical data. Without them, you don't know when you screw up the data and it will be nightmare to recover it back. I think in total it took us around 2 months to fully complete this process. Pretty much worth it. This gave us room for risk also!
I need to give credits to Mansha who took over this entire process from scratch across development and monitoring. Hope you enjoyed reading this. Cheers!