
Implementing a major refactoring of a data-intensive IoT product’s cloud architecture

Wednesday, October 8, 2025
Johan Haest
Engineering manager

As technology evolves, so do our products. However, improving existing architectures often means working within the constraints of previous choices and technologies. In this article, we’ll share our approach to tackling key architectural changes in a project we completed for one of our clients, detailing our journey toward a more efficient system and how we achieved a zero-downtime migration. The product at hand combines hardware and software: the hardware captures real-world data in a complex setting, and that data is uploaded to and processed in the cloud.

Identifying weaknesses in the current architecture

When an application has been operational for some time, developers often have a good sense of its weaknesses. However, it's important to occasionally step back and reassess the architecture from the ground up.

In this particular application, we faced highly dynamic workloads, with enormous spikes at times that were impossible to predict upfront. These surges put significant strain on various microservices and databases, resulting in occasional hiccups in the system.

As a result, the primary goals of this migration were to:

  • Increase the processing speed of the application’s data ingestion
  • Process our measurements in batches
  • Reduce database stress at peak moments
  • Increase the reliability of the system and reduce downtime

The journey

After exploring RabbitMQ as a message broker for the measurements coming from all the IoT devices, we ran into a major disadvantage: consumers process messages one at a time.

With millions of measurements needing to be stored in a database every day, processing each individually places a heavy load not only on the database but also on the microservice responsible for managing this storage. Additionally, RabbitMQ has a memory limit on the number of messages it can store. If messages aren't processed quickly enough, it risks running into memory issues, further complicating the system’s performance.

Amazon SQS

Due to the observed patterns in the sensor message arrival times, we required a messaging system that could:

  • Process messages in batch
  • Be serverless, meaning it would be managed by the cloud provider

Since our application was already hosted on AWS, we aimed to stay within that ecosystem, which led us to Amazon Simple Queue Service (SQS). SQS allows for sending, storing, and receiving messages at any scale, without risking message loss or requiring other services to be available. Being serverless, the infrastructure is fully managed by AWS, offering virtually limitless scalability with cost as the only practical constraint.

Migrating from RabbitMQ to SQS

Our IoT devices are tightly integrated with RabbitMQ, making it difficult to eliminate. However, instead of processing messages in RabbitMQ, we now simply forward them directly to SQS without any intermediate handling. Because these RabbitMQ consumers are lightweight, we can run them in parallel on spot pricing instances, enabling fast and cost-efficient transfers to SQS.
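
To give an idea of how thin this forwarding layer is, the sketch below shows what such a consumer could look like, assuming a pika-based RabbitMQ consumer and boto3 for SQS. The host, queue name and queue URL are placeholders, not our actual configuration.

    # Minimal sketch of a RabbitMQ-to-SQS forwarder (host, queue name and
    # queue URL are placeholders).
    import boto3
    import pika

    SQS_QUEUE_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/measurements"  # hypothetical

    sqs = boto3.client("sqs")

    def forward(channel, method, properties, body):
        # No processing happens here: the payload is passed on to SQS as-is,
        # and the RabbitMQ message is acknowledged only once SQS has accepted it.
        sqs.send_message(QueueUrl=SQS_QUEUE_URL, MessageBody=body.decode("utf-8"))
        channel.basic_ack(delivery_tag=method.delivery_tag)

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()
    channel.basic_qos(prefetch_count=100)  # keep the lightweight consumer busy
    channel.basic_consume(queue="measurements", on_message_callback=forward)
    channel.start_consuming()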

AWS Lambda

Now that the messages are in SQS, we have much more flexibility in processing them. This is where AWS Lambda comes in: a serverless way to run functions on AWS. An important feature of Lambda is that it can be triggered by SQS, with support for batch processing.

A Lambda function is essentially a lightweight, Docker-like container that can be instantiated almost instantly. It has a “cold start” phase during which it connects to the database and other services, but once started, it remains active until it has been idle for about five minutes.

To avoid overloading the database by inserting records one at a time, we configured Lambda to trigger only when 1,000 measurements are ready or when 30 seconds have passed. Additionally, we run up to six Lambda functions in parallel, significantly improving processing efficiency.
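
In Lambda’s SQS event source settings, this boils down to a batch size, a batching window and a maximum concurrency. A sketch of that wiring with boto3, where the queue ARN and function name are made up for illustration:

    # Sketch: connecting the SQS queue to the Lambda function with the batching
    # and concurrency settings described above (ARN and function name are hypothetical).
    import boto3

    lambda_client = boto3.client("lambda")

    lambda_client.create_event_source_mapping(
        EventSourceArn="arn:aws:sqs:eu-west-1:123456789012:measurements",
        FunctionName="store-measurements",
        BatchSize=1000,                           # invoke once 1,000 messages are ready...
        MaximumBatchingWindowInSeconds=30,        # ...or after 30 seconds, whichever comes first
        ScalingConfig={"MaximumConcurrency": 6},  # cap at six concurrent invocations
    )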

The pricing model is another benefit: you only pay for actual usage. If a Lambda function isn’t triggered, there’s no cost, making it an efficient and cost-effective solution. As a result, the new architecture can handle extreme peak loads while minimizing the strain on the databases by inserting measurements in bulk.
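
The handler itself then receives the whole batch in a single event and can write it to the database with one bulk statement. Below is a simplified sketch, assuming JSON measurement payloads, a PostgreSQL-compatible database and psycopg2; the table and column names are invented for the example.

    # Sketch of a batch-inserting handler (table and column names are illustrative).
    import json
    import os

    import psycopg2
    from psycopg2.extras import execute_values

    # The connection is created outside the handler so that warm invocations reuse it.
    conn = psycopg2.connect(os.environ["DATABASE_URL"])

    def handler(event, context):
        rows = []
        for record in event["Records"]:  # up to 1,000 SQS messages per invocation
            measurement = json.loads(record["body"])
            rows.append((measurement["device_id"], measurement["timestamp"], measurement["value"]))

        with conn.cursor() as cur:
            # One bulk INSERT instead of a thousand individual statements.
            execute_values(
                cur,
                "INSERT INTO measurements (device_id, measured_at, value) VALUES %s",
                rows,
            )
        conn.commit()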

Leveraging Amazon Aurora

To enhance the reliability and robustness of our database infrastructure, we transitioned to Amazon Aurora. Aurora is a fully managed, cloud-based relational database engine optimized for high performance and availability. One of Aurora’s key advantages is its approach to replication at the disk level, allowing us to write data through a dedicated writer instance, while all read operations are handled by separate reader (replica) instances.

In traditional failover systems, the secondary database often remains idle until needed in an emergency. With Aurora, we can actively distribute workloads by separating write and read operations across different instances, improving overall performance and efficiency.
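
In practice, this separation largely comes down to sending writes to the cluster’s writer endpoint and reads to the reader endpoint, which load-balances across the replicas. A minimal sketch, assuming a PostgreSQL-compatible Aurora cluster; the endpoint host names and credentials are placeholders.

    # Sketch: splitting traffic between Aurora's writer and reader endpoints
    # (host names and credentials are placeholders).
    import os

    import psycopg2

    WRITER = "my-cluster.cluster-abc123.eu-west-1.rds.amazonaws.com"     # cluster (writer) endpoint
    READER = "my-cluster.cluster-ro-abc123.eu-west-1.rds.amazonaws.com"  # reader endpoint

    def write_connection():
        # All INSERT/UPDATE traffic goes through the single writer instance.
        return psycopg2.connect(host=WRITER, dbname="measurements",
                                user="app", password=os.environ["DB_PASSWORD"])

    def read_connection():
        # Reads are load-balanced by Aurora across the replica instances.
        return psycopg2.connect(host=READER, dbname="measurements",
                                user="app", password=os.environ["DB_PASSWORD"])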

The validation

After restructuring our application's processing engine into optimized microservices, the next critical step was validation. This was no minor task, as we had overhauled the core of the system. All data flowing through the application would now be handled by the new architecture, so ensuring everything worked as expected was critical.

Setting up a dual environment for testing

To ensure a thorough validation, we created an entirely new environment for testing that was connected to the same live data as the old system. With this approach, all incoming measurements from the devices were mirrored to and processed simultaneously by both the old and the new system. This one-to-one comparison gave us visibility into whether the new system was functioning correctly.
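
The mirroring itself can be set up in several ways; one straightforward option with RabbitMQ, shown here purely as an illustration, is a fanout exchange bound to one queue per environment (the exchange and queue names are assumptions, not our actual setup).

    # Illustrative sketch: duplicating incoming messages to both environments
    # via a RabbitMQ fanout exchange (exchange and queue names are assumptions).
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters(host="rabbitmq"))
    channel = connection.channel()

    channel.exchange_declare(exchange="measurements.mirror", exchange_type="fanout")
    for queue in ("measurements.old", "measurements.new"):
        channel.queue_declare(queue=queue, durable=True)
        channel.queue_bind(exchange="measurements.mirror", queue=queue)

    # Every message published to the exchange now lands in both queues, so the
    # old and the new pipeline process identical live data.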

On top of that, we gained valuable insights into how the new environment impacted performance and cost at the infrastructure level. Essentially, we were able to test under real-world conditions without risking our live service, making it an ideal setup for validation.

Database migration with AWS DMS

An essential part of this migration was moving both our measurement and structural databases to new instances running on different architectures. For this, we leveraged AWS Database Migration Service (DMS), a tool that made it easy to migrate databases while keeping source and target in sync via the operational logs. This streamlined the migration, minimizing downtime and reducing risk, since we could revert if needed.
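
In DMS terms, this “migrate and keep in sync” setup corresponds to a full-load-and-CDC replication task: an initial copy of the data followed by ongoing replication from the logs. The sketch below shows roughly what creating such a task looks like with boto3; all ARNs and identifiers are placeholders.

    # Sketch: a DMS task that copies the existing data and then keeps the target
    # in sync via change data capture (all ARNs and identifiers are placeholders).
    import json

    import boto3

    dms = boto3.client("dms")

    dms.create_replication_task(
        ReplicationTaskIdentifier="measurements-to-aurora",
        SourceEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:source",
        TargetEndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:target",
        ReplicationInstanceArn="arn:aws:dms:eu-west-1:123456789012:rep:instance",
        MigrationType="full-load-and-cdc",  # initial copy, then continuous replication from the logs
        TableMappings=json.dumps({
            "rules": [{
                "rule-type": "selection",
                "rule-id": "1",
                "rule-name": "all-tables",
                "object-locator": {"schema-name": "%", "table-name": "%"},
                "rule-action": "include",
            }]
        }),
    )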

Stress testing: preparing for the worst

One of the key questions we needed to answer was how resilient the new system would be under extreme conditions. What if the system went down for a few days, causing messages and data to pile up? Could it handle the backlog without crashing under pressure?

To simulate this, we deliberately stopped processing in the new environment for three days. When we resumed, the initial result was concerning – performance bottlenecks became immediately apparent. However, this visibility into the problem allowed us to pinpoint the issue and apply the necessary fixes.

Our second stress test was a great success. The system processed three days’ worth of data in just a few hours, a backlog that would have been catastrophic under the old architecture. The bottleneck was eliminated, and the system demonstrated its ability to recover swiftly from potential outages.

The big day: go live

By continuously validating the new system throughout development, we were well-prepared for the go-live moment. We had a detailed plan of action that anticipated potential issues, leaving very little to chance. On the day of the switchover, the most significant task was switching the DNS over to the new platform, a process that went smoothly thanks to our rigorous preparation.
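
Assuming the zone is hosted in Route 53 (used here purely for illustration; the hosted zone ID, record name and target are placeholders), such a switchover is a single record update that can be reverted just as easily by pointing the record back at the old platform.

    # Illustrative sketch: pointing the public hostname at the new platform
    # (hosted zone ID, record name and target are placeholders).
    import boto3

    route53 = boto3.client("route53")

    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000000000000",  # hypothetical hosted zone
        ChangeBatch={
            "Comment": "Switch traffic to the new platform",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "TTL": 60,  # short TTL keeps the flip quickly reversible
                    "ResourceRecords": [{"Value": "new-platform.example.com"}],
                },
            }],
        },
    )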

Built-in disaster recovery

One of the most comforting aspects of this migration was that we always had a fallback plan. The old system continued to process data in parallel, meaning if anything went wrong with the new platform, we could instantly switch back. While we were confident that a rollback wouldn’t be necessary, having that safety net in place was invaluable.

Conclusion: a successful, scalable platform

Our approach to this migration and validation project paid off. Running the new and old systems side by side gave us confidence at every stage of the process, making the go-live event almost seamless. The only final step was a DNS flip, which was reversible at any point in time. 

Additionally, by integrating serverless components wherever possible, we built a platform designed to scale effortlessly. The result? A highly efficient, usage-based architecture that’s ready for global expansion. 

