Getting a 1000% Cost Improvement in DynamoDB

Posted on May 21, 2025 by Michael Keane Galloway

A couple of years ago, my team was given a target to cut our cloud costs. When the director I work under told me about the goal, I gave him immediate feedback that we did not have everything we needed to reach it. We were being asked to engineer for cost, but not provided with the cost data to see whether or not we were actually improving that metric.

He appreciated that feedback and immediately set to work getting the senior engineers on his staff access to the Cost Management tooling in the appropriate production AWS account. Once we had that, I was able to find some troublesome resources that were likely costing us too much.

One of these resources was a DynamoDB table that mapped targeted user data to an appropriate ad creative. The original team that built this part of the system had a mandate to provide a dashboard tracking the performance of the ad targeting in near real time. They opted to use a single DynamoDB table to house the data required for the core ad-targeting business logic, the impression data, and the aggregates derived from that impression data.
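To make that concrete, here is a rough sketch (in TypeScript) of the three kinds of items that ended up living side by side in the one table. The names and key shapes are invented for illustration; I'm not reproducing the real schema.

```typescript
// Hypothetical item shapes for the single table. All names are illustrative.
type TargetingItem = {
  pk: string;           // e.g. "TARGET#<segmentId>"
  sk: string;           // e.g. "CREATIVE#<creativeId>"
  creativeId: string;   // the creative to serve for this targeting segment
};

type ImpressionItem = {
  pk: string;           // e.g. "IMPRESSION#<date>"
  sk: string;           // e.g. "<timestamp>#<requestId>"
  creativeId: string;
  userId: string;
};

type AggregateItem = {
  pk: string;           // e.g. "AGG#<date>"
  sk: string;           // e.g. "CREATIVE#<creativeId>"
  impressionCount: number;  // rolled up from the impression items above
};
```

Three very different workloads (serving, event capture, and analytics) all hang off the same table and the same provisioned capacity.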

The key problem with this approach is that it does analytics on top of DynamoDB and feeds the results back into the same table. DynamoDB is an Online Transaction Processing (OLTP) database. It shines for workloads that need to read and write in a transactional fashion; for example, it would be great for housing the shopping carts for your e-commerce site (which, if I'm not mistaken, was the original use case for Dynamo within Amazon). It is not an Online Analytical Processing (OLAP) database. Databases like BigQuery, Redshift, etc. are built to churn through large quantities of data for analytics, so you can answer questions like how many of a certain product land in a cart and are subsequently removed.
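To illustrate the difference in access patterns, here is a sketch using the AWS SDK for JavaScript v3. The table name, key shape, and attribute names are invented; the point is the contrast between the keyed read DynamoDB is built for and the scan-and-aggregate loop that belongs in an OLAP system.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand, ScanCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// OLTP-shaped access: fetch one item by its key. This is what DynamoDB is built for.
const targeting = await doc.send(new GetCommand({
  TableName: "ad-targeting",                              // hypothetical table name
  Key: { pk: "TARGET#segment-123", sk: "CREATIVE#abc" },  // hypothetical key shape
}));

// OLAP-shaped access: walk the whole table and aggregate in application code.
// Every page consumes read capacity; this is the work that belongs in BigQuery or Redshift.
const counts: Record<string, number> = {};
let lastKey: Record<string, unknown> | undefined;
do {
  const page = await doc.send(new ScanCommand({
    TableName: "ad-targeting",
    FilterExpression: "begins_with(pk, :p)",
    ExpressionAttributeValues: { ":p": "IMPRESSION#" },
    ExclusiveStartKey: lastKey,
  }));
  for (const item of page.Items ?? []) {
    const id = String(item.creativeId ?? "unknown");
    counts[id] = (counts[id] ?? 0) + 1;
  }
  lastKey = page.LastEvaluatedKey;
} while (lastKey);
```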

Suffice it to say that looking at the costs, the read/write capacity provisioned, and how the code was using the table, this single DynamoDB table was over-provisioned. I then double-checked the traffic for the application serving the dashboard populated from the aggregate data in DynamoDB by looking at how many log lines were produced for the root route in Kibana. There was no activity other than me poking around to investigate the application.
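If I were doing that capacity check in code today rather than eyeballing the console, it would look something like the sketch below: pull the consumed read capacity from CloudWatch and compare the hourly average against what the table has provisioned. The table name is hypothetical, and the seven-day window is just an example.

```typescript
import { CloudWatchClient, GetMetricStatisticsCommand } from "@aws-sdk/client-cloudwatch";

const cw = new CloudWatchClient({});
const now = new Date();
const weekAgo = new Date(now.getTime() - 7 * 24 * 60 * 60 * 1000);

// DynamoDB reports ConsumedReadCapacityUnits as a sum per period, so dividing by the
// period length gives an average RCU/sec to hold up against the provisioned RCUs.
const consumed = await cw.send(new GetMetricStatisticsCommand({
  Namespace: "AWS/DynamoDB",
  MetricName: "ConsumedReadCapacityUnits",
  Dimensions: [{ Name: "TableName", Value: "ad-targeting" }],  // hypothetical table name
  StartTime: weekAgo,
  EndTime: now,
  Period: 3600,            // one-hour buckets
  Statistics: ["Sum"],
}));

for (const point of consumed.Datapoints ?? []) {
  console.log(point.Timestamp, ((point.Sum ?? 0) / 3600).toFixed(2), "avg RCU/sec");
}
```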

This was a great result. We could plan on getting rid of the dashboard that no one was using and tear down the process that was aggregating data within DynamoDB. I then turned my attention to all of the impression data. I thought it was strange that we were storing it in DynamoDB, since I had built a means of sending this impression data to BigQuery for the Business Intelligence team and to power some dashboards for the marketing team. I thought that we might have a process in place to send the data from DynamoDB to my API. That was not the case.

After examining more of the code, I found that the impressions written to the DynamoDB table existed only to serve the aggregation for the dashboard. The flow to BigQuery came from the original logging of the events to S3. Part of the process driving the aggregation orchestrated the transfer of the impression data from the logs in S3 to DynamoDB, and then aggregated that data and wrote it back into DynamoDB for the defunct dashboard.

That meant that I could get rid of all of the impression data that we were writing, all of the aggregated data, and almost all of the orchestration logic (some of it was still needed for the transfer to BigQuery), and then re-provision the table with lower read/write capacities. Unfortunately, the state of the Terraform code proved that plan to be incredibly naive. The last developer had been working on making the table auto-scale as an attempt to fix some of the over-provisioning problem. He was working with a DevOps engineer to put in an auto-scaling policy that was not supported by the version of the AWS Terraform provider they had at the time. Since the feature was not supported, they opted to invoke the AWS CLI via PowerShell during the deployment of the resource. That unfortunately meant that we couldn't properly change the provisioning of the table, since the auto-scaling policy wasn't present in the Terraform state.

My team and I discussed some of our options for how to solve this problem. We opted to refactor using some infrastructure that we had built for other projects (including an intern project that I think I'll have to turn into another article). We refactored the code to use a pair of tables: one would allow us to look up the targeting, and the other would map the targeting data to the appropriate creative. The change to the code was relatively simple, and since we were leveraging other existing infrastructure, we made quick work of it.
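Here is roughly how that two-table read path might look, again as a sketch with the AWS SDK for JavaScript v3. The table names, key attributes, and the resolveCreative helper are all hypothetical; the point is how simple the lookup becomes once each table serves one purpose.

```typescript
import { DynamoDBClient } from "@aws-sdk/client-dynamodb";
import { DynamoDBDocumentClient, GetCommand } from "@aws-sdk/lib-dynamodb";

const doc = DynamoDBDocumentClient.from(new DynamoDBClient({}));

// Hypothetical helper: resolve a user to the creative we should serve.
async function resolveCreative(userId: string): Promise<string | undefined> {
  // Table 1: look up the targeting segment for this user.
  const targeting = await doc.send(new GetCommand({
    TableName: "targeting-lookup",       // hypothetical table name
    Key: { userId },
  }));
  const segmentId = targeting.Item?.segmentId;
  if (!segmentId) return undefined;

  // Table 2: map the targeting segment to the appropriate creative.
  const mapping = await doc.send(new GetCommand({
    TableName: "segment-to-creative",    // hypothetical table name
    Key: { segmentId },
  }));
  return mapping.Item?.creativeId;
}
```

Two keyed reads per request, each against a small, dedicated table, is exactly the access pattern DynamoDB handles well.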

However, there were concerns about performance. We had this massively over-provisioned table that we were trying to replace, and we didn't quite know how much read capacity we needed to serve the requests for this flow on the new infrastructure. That's when we chose to set up a load test in k6. I had already load tested some portion of this infrastructure for a previous project, so I helped another engineer work out an estimate for how much traffic he needed to send to our test API to simulate a production load. With that in hand, we were able to demonstrate that the read capacity already provisioned on the newer tables was sufficient to handle the production load.
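For anyone curious, here is a minimal sketch of the kind of k6 scenario that fits this job (k6 scripts are JavaScript, and recent versions also accept TypeScript). The endpoint and the arrival rate are placeholders rather than our real figures; the useful part is the constant-arrival-rate executor, which holds a steady requests-per-second figure derived from your production traffic estimate.

```typescript
import http from "k6/http";
import { check } from "k6";

export const options = {
  scenarios: {
    production_like: {
      executor: "constant-arrival-rate",
      rate: 100,               // placeholder: estimated production requests per second
      timeUnit: "1s",
      duration: "10m",
      preAllocatedVUs: 50,     // VUs k6 keeps on hand to sustain the arrival rate
    },
  },
};

export default function () {
  // Placeholder URL for the test API that fronts the new tables.
  const res = http.get("https://test-api.example.com/targeting?user=abc123");
  check(res, { "status was 200": (r) => r.status === 200 });
}
```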

We shipped the new code leveraging our two-table approach and ended up cutting the per-month cost of this flow in DynamoDB by over 1000%: we got rid of workloads that were redundant or unused, created more dedicated, streamlined tables, and leveraged flexible infrastructure choices so that those tables could support more than one workflow.