Migrate on-prem infrastructure to the cloud

All roads lead to AWS 🙂

OK before I tell you this story, why would you wanna do this in the first place?

Well, there are several reasons why you might want to migrate to the cloud. The truth, however, is that you need to carefully evaluate your needs and make sure you’re not migrating just for the sake of it (i.e. “Well, everybody else is doing it, so it has to be something we need to do as well, right?”), but because you’ll actually gain something you can’t achieve with on-prem infrastructure.

In our case we were paying for 11 pretty expensive dedicated servers that were really underutilized all the time, so it was really bad bang for the buck. And while we were basically overpaying for servers, we still couldn’t scale up when needed: the infrastructure was good old Gentoo Linux with LXC containers, so even though you can spin up multiple containers on a single machine, you still have finite hardware at hand. Our business started growing, new things like SOC2 compliance became relevant, and we needed the best uptime possible without relying on a single switch in the data center, etc. So we decided to bite the bullet and go to the cloud, and in hindsight I think it was a good decision, considering all the pros and cons of both approaches.

Plan carefully

The first step was, of course, planning all the steps we needed to take for the migration to happen. We needed to do several things:

  1. Make sure all apps can run in containers (make any necessary changes to the Java apps so they can run properly without any external dependencies)
  2. Create Docker containers for all services we need (make sure to have a standardized build process that takes care of the DB, Java apps, any external dependencies etc.)
  3. Migrate files
  4. Migrate data (DB)
  5. Make sure everything works as expected before pulling the plug on the on-prem infrastructure

This is obviously a somewhat simplified list of steps, but it captures the most important ones. An important thing to note here is that in an ideal world we would probably have taken 6-9 months to migrate everything properly, since we were learning new things and needed to make sure nothing was left behind. However, we had to set a tighter deadline of just 4 months (and we actually made it all happen in those 4 months!).

Challenges

When you look at the list of things we needed to do, it’s relatively easy to spot the most challenging parts of the migration: files and data, for sure (yes… believe it or not, these were the most challenging to do, even though AWS tells you they will provide you with awesome tools to migrate everything).

Files migration

Because of what we do (a custom apparel website where you can create your own design using our design tool), our platform generates a lot of files every day. At the moment of migration we had to move around 21TB of files, and we needed to make sure nothing was missing! Another problem was that we couldn’t use S3 and had to go for EFS instead, because our app wouldn’t know what to do with object storage or how to find files there, so we had to (at least initially) reproduce the same file structure we had on the on-prem infrastructure. With that we lost the ability to use AWS Snowball, as it supports import only to S3 (their words: we did want to use this service, but their support told us it’s not possible if we’re importing files to EFS), so we actually needed to use AWS DataSync.

We started looking into DataSync and installed the agent on our infrastructure so it could start the transfer. However, it was extremely slow and unreliable: the agent would simply die for no obvious reason, so we constantly needed to monitor it and start it again when it went down… After almost a month of going back and forth with their support, we decided to do things our way – again, the old-fashioned Linux way 🙂

What we ended up doing was making an SSH tunnel to an EC2 instance on AWS that had the EFS endpoint mounted locally, so we were able to initiate rsync from our side and move files to EFS… The problem with this approach was that those 21TB actually consist of many, many millions of small files, so the data transfer was extremely slow. What we ended up doing instead was squashing our unionfs filesystem locally and migrating 7 huge files to EFS, where we were able to unpack them using another EC2 instance running the same Gentoo version we used on-prem. All that was done in just 4 days (well, the migration was; unpacking and restoring the files took another 5-6 days). Once all this was done, we ran rsync again to pick up the delta of files created during those 10-ish days, and before you knew it all files were transferred to AWS without using any AWS proprietary tools. So again, in hindsight, we would probably have saved 20 days if we had taken this approach from the start instead of chasing our own tail with DataSync and AWS’ support.
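For illustration, here is a minimal sketch of the "few big archives first, rsync for the delta" idea. It is not the exact script from the migration: plain tar is used as a stand-in for the squashed filesystem images, and the function names, paths and user@EC2_HOST are hypothetical placeholders.

```shell
#!/usr/bin/env bash
set -eu

# Pack a directory tree into one archive, so the network sees a single
# large file instead of millions of small ones.
pack_tree() {
  local src_dir="$1" archive="$2"
  tar -cf "$archive" -C "$src_dir" .
}

# Unpack an archive into a destination directory (for example the EFS mount).
unpack_tree() {
  local archive="$1" dest_dir="$2"
  mkdir -p "$dest_dir"
  tar -xf "$archive" -C "$dest_dir"
}

# Delta pass: copy only what changed since the archives were taken.
# "remote" is a hypothetical SSH target such as user@EC2_HOST.
sync_delta() {
  local src_dir="$1" remote="$2" dest_dir="$3"
  rsync -a --partial -e ssh "$src_dir/" "$remote:$dest_dir/"
}

# Usage (hypothetical paths and host):
#   pack_tree /data/files/part1 /backup/part1.tar
#   scp /backup/part1.tar user@EC2_HOST:/mnt/efs/incoming/
#   ...unpack on the EC2 side with unpack_tree, then:
#   sync_delta /data/files/part1 user@EC2_HOST /mnt/efs/files/part1
```

The point of the split is that the bulk transfer moves a handful of large files at full speed, while rsync only has to walk the tree once afterwards to catch whatever was created in the meantime.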

Data migration

Our data migration was another interesting story. We are using PostgreSQL 12 and the choice of migration tool seemed like a no-brainer: we knew we needed to use AWS DMS (Database Migration Service), as that was exactly what it was built for – migrating your database to RDS.

We had around 2TB of data across several DBs, and we started the AWS DMS agents/jobs a month ahead of time so they would have enough time to migrate everything before the switch. Everything seemed to be working nicely. Then, around the 5th day in, we saw that one DB had an error in migration. We resumed it and everything seemed fine: AWS DMS didn’t report any issues after the restart and it continued working properly. Then again, and again, and again, one by one, our DBs were hitting some random error while migrating data. We found out that the errors were caused by some triggers running on our local DBs, which interfered with the transfer. We added those triggers and some materialized views to the ignore list and everything started working – or so we thought.

The day came when we decided to switch our application’s EFS mounts and RDS endpoints so that it would start using the AWS infrastructure rather than our local on-prem infrastructure (we were aware that we would have more lag until we switched the services to AWS as well, but that was fine). And then the scary thing happened – we realized that we were missing data… a lot of it, actually. By analyzing our data we concluded that whenever DMS reported an error in migration and we restarted the transfer, all the data “in between”, i.e. written between the moment the error happened and the moment we restarted the process, was lost. We tried to rebuild the data manually, as we still had the “original” DB on our on-prem DB server, but that turned out to be a hard-to-impossible task, with many people reporting issues and everything on fire. So we made the decision to go back to our local DB server, which meant that we lost around 4h worth of data: the data that was written to RDS after we did the switch. We could have somehow exported that data and imported it into our on-prem DB, but that would have been super hard because of the many dependencies (FKs, new campaigns that were created on RDS and missing on the on-prem DB server, etc.).

After all this we had to take a step back. We did some more analysis and found many holes in our data going 2-3 years back, so we decided to drop the DBs on RDS and take a different approach.
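One simple way to spot gaps like these is to compare per-table row counts on both sides. The sketch below is an assumption about how such a check could look, not the analysis described above; the function names and the table names in the usage comment are hypothetical, and equal counts don't prove the rows are identical, they only catch the obvious holes.

```shell
#!/usr/bin/env bash
set -eu

# Print "table count" lines for the given tables on one server.
# Arguments: HOST USERNAME DB_NAME TABLE...
table_counts() {
  local host="$1" username="$2" db_name="$3" table
  shift 3
  for table in "$@"; do
    # -A: unaligned output, -t: rows only (no headers or footers)
    printf '%s %s\n' "$table" \
      "$(psql -h "$host" -U "$username" -d "$db_name" -At -c "SELECT count(*) FROM $table")"
  done
}

# Compare two "table count" files (source first, target second) and
# print every table whose count differs or that is missing on the source.
diff_counts() {
  awk '
    FILENAME == ARGV[1] { src[$1] = $2; next }
    !($1 in src)        { print $1, "missing", $2; next }
    src[$1] + 0 != $2 + 0 { print $1, src[$1], $2 }
  ' "$1" "$2"
}

# Usage (hypothetical hosts and tables):
#   table_counts ONPREM_HOST USERNAME DB_NAME orders users > onprem.txt
#   table_counts RDS_ENDPOINT_URL USERNAME DB_NAME orders users > rds.txt
#   diff_counts onprem.txt rds.txt
```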

We planned 3h of overnight downtime for our main service, during which we would export the data using pg_dump, import that dump into RDS using psql and start the service again… We got ready, stopped the service, generated the DB dump, imported it into RDS and started the service again. And the best thing was – we managed to do all this in just 2h, so 1h faster than we initially planned. We had the DB with all indexes, materialized views, and most importantly all data!

We did all this from our on-prem DB server, over SSH, connecting to an EC2 instance that had a connection to RDS up and running.

We exported data like:

dump pre-data only: pg_dump -U USERNAME -d DB_NAME --section=pre-data > db-name-pre-data.sql

dump data only: pg_dump -U USERNAME -d DB_NAME --verbose --data-only > db-name-data.sql

dump post-data only: pg_dump -U USERNAME -d DB_NAME --section=post-data > db-name-post-data.sql

We could have exported everything at once, but since the main DB was pretty big we decided to do things in steps and verify every step along the way.

After each file was created we would transfer it to the EC2 instance using scp and then we would issue:

psql -h RDS_ENDPOINT_URL -U USERNAME -d DB_NAME -f FILENAME.sql
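The import order matters: pre-data (schema) first, then the data, then post-data (indexes, constraints, triggers). As a hedged sketch, the three imports could be wrapped like this so that a failing step stops the whole run; the function names are hypothetical, the file naming just mirrors the dump commands above, and ON_ERROR_STOP is a standard psql variable that was not part of the original command.

```shell
#!/usr/bin/env bash
set -eu

# Print the three dump files for a database in the order they must be
# imported: schema, then rows, then indexes/constraints/triggers.
section_files() {
  local db_name="$1"
  printf '%s\n' \
    "${db_name}-pre-data.sql" \
    "${db_name}-data.sql" \
    "${db_name}-post-data.sql"
}

# Import each file in order; ON_ERROR_STOP=1 makes psql exit with a
# non-zero status on the first SQL error instead of carrying on.
# Arguments: RDS_ENDPOINT_URL USERNAME DB_NAME
restore_all() {
  local rds_endpoint="$1" username="$2" db_name="$3" file
  section_files "$db_name" | while IFS= read -r file; do
    echo "importing $file" >&2
    psql -h "$rds_endpoint" -U "$username" -d "$db_name" \
         -v ON_ERROR_STOP=1 -f "$file" < /dev/null
  done
}

# Usage (placeholders as in the commands above):
#   restore_all RDS_ENDPOINT_URL USERNAME db-name
```

Without ON_ERROR_STOP, psql keeps going after an error and still exits successfully, which makes a half-imported dump easy to miss.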

Once again, pretty similar to the files migration story, we concluded that with a relatively big DB it’s most reliable to use old, battle-tested tools rather than AWS’ proprietary tools.

Files and data are on AWS, now what?

Well, those were the 2 biggest obstacles we faced during the migration; everything else was pretty straightforward. We needed to make sure that our apps could run properly as containers, and once that was the case we created a Jenkins pipeline that fetches the code that needs to be deployed, builds everything, creates a Docker image and sends it to AWS ECR. From there we’re able to run each image from ECR automatically using K8S (AWS EKS).
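To make the pipeline a bit more concrete, here is a rough sketch of what the build, push and deploy steps could look like as a shell stage. This is an assumption for illustration, not our actual Jenkins job: the function names, account ID, region, repository and deployment names are all placeholders, and it assumes the container inside the deployment is named after the repository.

```shell
#!/usr/bin/env bash
set -eu

# Build the full ECR image reference from its parts.
ecr_image_uri() {
  local account_id="$1" region="$2" repo="$3" tag="$4"
  printf '%s.dkr.ecr.%s.amazonaws.com/%s:%s\n' "$account_id" "$region" "$repo" "$tag"
}

# Build the image, push it to ECR and roll it out on the cluster.
# Arguments: ACCOUNT_ID REGION REPO TAG DEPLOYMENT (all placeholders)
build_push_deploy() {
  local account_id="$1" region="$2" repo="$3" tag="$4" deployment="$5"
  local image registry
  image="$(ecr_image_uri "$account_id" "$region" "$repo" "$tag")"
  registry="${image%%/*}"   # everything before the first slash

  # Log Docker in to the ECR registry.
  aws ecr get-login-password --region "$region" |
    docker login --username AWS --password-stdin "$registry"

  docker build -t "$image" .
  docker push "$image"

  # Assumption: the container in the deployment has the same name as the repo.
  kubectl set image "deployment/$deployment" "$repo=$image"
  kubectl rollout status "deployment/$deployment"
}

# Usage (hypothetical values):
#   build_push_deploy 123456789012 us-east-1 shop "$GIT_COMMIT" shop
```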

In our case K8S was probably (well, is, not probably) overkill, but we ended up with some nice bonuses like the ability to easily scale services up/down, automatic recovery if there’s an issue with some service, etc. So overall it wasn’t a huge overhead, but if I needed to do everything again I’m not sure I would opt for K8S.

The conclusion

The thing with projects like this is that they usually happen “once in a company’s lifetime”, so I’m not sure I’ll ever be able to use this experience on a similar project, but at least it’s documented here, so someone might find it useful when they start thinking about a migration. One thing I would say to my “past self” is: don’t rely on proprietary software so much!

As you can see, we actually did the whole migration in a relatively short time frame, but it would have been even shorter, and less stressful (how stressful is lost data!?), if we had started out with the good old tools that everyone is familiar with.

So if you’re thinking about a migration, my 2c would be: rsync over DataSync (in the case of EFS; not sure about S3 as we never needed it) and pg_dump/psql over DMS!
