
Cloud-Native Message Routing

As we exited the summer of 2019, we had some clouds on our horizon at ShippingEasy. We had been acquired by Stamps.com, our customer base and volume had grown, and some systems that had served us well were starting to reach their limits of scale. The holiday season was approaching, and with it a new high in usage for our product that lets e-tailers serve their customers.


One of those systems was an integration with SendGrid that we used for email marketing on behalf of our customers to their customers. Campaigns and the actual sending of email were delegated to SendGrid, which in turn sent us a fairly large volume of messages with statistics about delivery, bounce rates, and so on via webhook. It was this webhook return path into our system that was at risk.


Customer Sharding

To scale our monolith to our growing customer base, we had divided our customers among multiple instances of the application that we called shards. All of a customer's OLTP data was stored in the database associated with their shard, and we had tooling to migrate customers between shards as operational needs dictated. Thus, today customer A might be on shard 1, customer B on shard 2, and customer C on shard 3; tomorrow, customer A might be moved to shard 2.



Original Webhook Architecture & Infrastructure

The original email infrastructure had an easiest-path design and implementation that worked well for quite some time. We had a Serverless application that received webhook calls from SendGrid, each carrying a message with statistics about email delivery relevant to one of our distinct customers. The Lambda receiving that event would publish it to an SNS topic. Subscriptions connected that single SNS topic directly to the three SQS queues serving as inputs to each application shard. Messages were multicast to every shard, and if the customer a message related to didn't currently reside in the receiving shard, we simply threw the message away.
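To make the fan-out concrete, here is a minimal sketch of what the webhook-receiving Lambda might look like. The topic ARN, environment variable, and attribute names are illustrative assumptions rather than our actual implementation; the key point is that the function only publishes to SNS and has no idea which queues are subscribed to the topic.

import json
import os

import boto3

sns = boto3.client("sns")
TOPIC_ARN = os.environ["EMAIL_EVENTS_TOPIC_ARN"]  # assumed environment variable name


def handler(event, context):
    # SendGrid POSTs an array of event records in the request body.
    for record in json.loads(event["body"]):
        sns.publish(
            TopicArn=TOPIC_ARN,
            Message=json.dumps(record),
            # Carry the customer identifier as a message attribute so downstream
            # consumers can route or filter without parsing the body.
            MessageAttributes={
                "customer_id": {
                    "DataType": "String",
                    "StringValue": str(record["customer_id"]),
                }
            },
        )
    return {"statusCode": 200}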


As our scale grew, however, we ran into issues with this multicast design. Our queues were getting saturated with messages for customers that didn't even reside in the shard the queue fed, and we started to see a growing delay in each shard processing its input queue. With the upcoming holiday shopping season, we knew we were going to hit a wall, with processing delays on the order of days. We had hit the limits of our current design, and something new would be needed to handle our next stage of growth.



Enter the Shard Router

We ultimately decided to introduce a piece of infrastructure that we named “Shard Router”. Its purpose was to intelligently route messages intended for a specific customer to the input queue for their shard. This would eliminate the load of uninteresting messages on each shard, allowing each to deal only with the messaging load for its own subset of customers.


The design was fairly simple. Shards of our monolith would emit “CustomerSharding” events whenever the following happened:

  • A customer was explicitly assigned a shard (as with onboarding)

  • A customer started migrating between shards

  • A customer successfully finished migrating between shards

  • A customer failed to migrate between shards


For each of these events, we tracked three key pieces of information: the customer identifier, their current shard, and whether they were actively migrating to a new shard. These were stored in a DynamoDB table and used in a loosely event-sourced way -- the most recent event contained everything needed to know which shard a message for the customer should be delivered to and whether they were currently migrating, so that the message could be held and delivered later.
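A rough sketch of how such an event might be recorded and looked up follows. The table name, key schema, and attribute names here are assumptions for illustration, not our actual schema; the idea is simply that the newest item per customer is the source of truth for routing.

import time

import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("customer_sharding_events")


def record_sharding_event(customer_id, event_type, shard, migrating):
    # Append a sharding event; the newest item wins when routing decisions are made.
    table.put_item(
        Item={
            "customer_id": str(customer_id),
            "occurred_at": int(time.time() * 1000),  # sort key, newest first
            "event_type": event_type,  # e.g. "assigned", "migration_started",
                                       # "migration_finished", "migration_failed"
            "shard": shard,            # shard whose input queue should receive messages
            "migrating": migrating,    # True only while a migration is in flight
        }
    )


def latest_sharding_event(customer_id):
    # Fetch the most recent event for a customer (query newest-first, limit 1).
    result = table.query(
        KeyConditionExpression=Key("customer_id").eq(str(customer_id)),
        ScanIndexForward=False,
        Limit=1,
    )
    return result["Items"][0] if result["Items"] else None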


A shard-routing Lambda function was tied to the email webhook's output SNS topic. Whenever it was invoked, it would read the customer identifier from the message's metadata, look up that customer's most recent CustomerSharding event, and then make a decision. If the customer was not actively migrating between shards, it would enqueue the message into the SQS input queue for the shard the customer resided in. If the customer was actively migrating, the message would be placed in a delay queue that retriggered the routing Lambda five minutes later, whereupon the decision making would be repeated. If the most recent event was by then a successful or failed migration event, the message would be delivered appropriately; otherwise, if it was still the migration-start event, the message would be re-enqueued into the delay queue for a subsequent delivery attempt.
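The decision logic might look something like the sketch below, reusing the hypothetical latest_sharding_event lookup from the previous snippet. The queue URLs, environment variables, and attribute names are again assumptions, and for brevity this version only handles the SNS-triggered path, not the re-invocation from the delay queue.

import json
import os

import boto3

sqs = boto3.client("sqs")

# Map of shard name -> SQS input queue URL, e.g. {"shard1": "https://sqs...."}.
SHARD_QUEUE_URLS = json.loads(os.environ["SHARD_QUEUE_URLS"])
DELAY_QUEUE_URL = os.environ["DELAY_QUEUE_URL"]  # retriggers this function later


def handler(event, context):
    for record in event["Records"]:  # SNS-triggered invocation
        sns_message = record["Sns"]
        customer_id = sns_message["MessageAttributes"]["customer_id"]["Value"]
        body = sns_message["Message"]

        sharding = latest_sharding_event(customer_id)  # see previous sketch

        if sharding is not None and sharding["migrating"]:
            # Migration in flight: park the message on the delay queue so the
            # routing decision is retried roughly five minutes from now.
            sqs.send_message(
                QueueUrl=DELAY_QUEUE_URL,
                MessageBody=body,
                DelaySeconds=300,
                MessageAttributes={
                    "customer_id": {"DataType": "String", "StringValue": customer_id},
                },
            )
        elif sharding is not None and sharding["shard"] in SHARD_QUEUE_URLS:
            # Settled shard assignment: deliver straight to that shard's input queue.
            sqs.send_message(
                QueueUrl=SHARD_QUEUE_URLS[sharding["shard"]],
                MessageBody=body,
            )
        # If we know nothing about the customer, the message is simply dropped here.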



This design worked extremely well. The delays we had been experiencing disappeared, and we were able to hold up under our new holiday email load without having to stand up new shards and further subdivide customers between them. Additionally, we had designed Shard Router in such a way that multiple routing functions could be added for different use cases. Eventually, all of our webhook-receiving pieces leveraged it to direct messages only to the shard where the message's intended customer resided.



Wrapping Up

One of the key takeaways for me from this project was how nimble and agile you can be with a cloud-native architecture and infrastructure. As long as you take care to keep your seams loosely coupled (the webhook recipient just published messages to an SNS topic with no notion of who received them; the shards just read from an SQS queue with no notion of what filled it), you can change exactly how messages get from point A to point B with relative ease and impunity. In our case, that allowed us to design and scale our infrastructure iteratively as we hit each new level of scale.


 
 
 