In a recent migration from a standard RDS DB instance to Aurora PostgreSQL Serverless v2, we encountered an unexpected issue: a significant and unexplained increase in database connections. This anomaly led to spikes in resource utilization, specifically memory, and caused a noticeable rise in operational costs. After a thorough investigation, we identified RDS Proxy as the root cause.
This blog details our troubleshooting process, the steps we took to investigate the issue, and how we ultimately resolved it.
Shortly after migrating to Aurora RDS PostgreSQL Serverless V2, we observed memory utilization spikes that were not aligned with our application’s transaction load, query execution, or routine maintenance operations like autovacuum. Despite database optimizations, the issue persisted, and our RDS costs continued to escalate.
Our first step was to optimize the database to rule out common causes of increased memory usage:
ANALYZE
VACUUM
pg_repack (a PostgreSQL extension), run against the proxy endpoint:
pg_repack -h rds-forward.proxy-xxxxxxxxx.ap-southeast-1.rds.amazonaws.com -U db_user -d db_name -t schema.table_name -k
Despite these efforts, the memory spike persisted, and our RDS costs were soaring like a rocket.
To pinpoint the issue, we analyzed RDS metrics over several weeks. We noticed an unusually high number of database connections during periods when application load was low. This observation shifted our focus to connection management.
Our database thought it was at a never-ending party, with connections multiplying like confetti. Time to turn down the music and clean up this digital disco.
The investigation revealed that RDS Proxy was at the heart of the issue. We found that RDS Proxy was maintaining a large number of idle connections, which were being reused inefficiently. These connections were only timing out after the default 24-hour period, which led to excessive resource consumption.
We identified that the issue stemmed from the MaxIdleConnectionsPercent parameter in AWS RDS Proxy. This parameter isn’t readily accessible through the console, so we used the AWS CLI to make the necessary adjustments.
Here’s how we tweaked the parameters to fit our workloads.
First, we listed the RDS Proxy configuration with the following AWS CLI command:
aws rds describe-db-proxy-target-groups --db-proxy-name dsr-forward --region ap-southeast-1
Output:
{
"TargetGroups": [
{
"DBProxyName": "dsr-forward",
"TargetGroupName": "default",
"TargetGroupArn": "arn:aws:rds:ap-southeast-1:863547483732:target-group:prx-tg-0a926g7b3ddyg8j0f",
"IsDefault": true,
"Status": "available",
"ConnectionPoolConfig": {
"MaxConnectionsPercent": 100,
"MaxIdleConnectionsPercent": 50,
"ConnectionBorrowTimeout": 120,
"SessionPinningFilters": []
},
"CreatedDate": "2021-10-25T12:07:51.836000+00:00",
"UpdatedDate": "2023-06-24T09:02:25.222000+00:00"
}
]
}
See the AWS CLI reference for describe-db-proxy-target-groups.
Here’s what each parameter generally means:
MaxConnectionsPercent – the maximum number of connections the proxy can open, expressed as a percentage of the database’s max_connections setting. The default value is 100 for the PostgreSQL engine.
MaxIdleConnectionsPercent – controls how many idle connections the proxy keeps open, again as a percentage of max_connections. With a high value, the proxy leaves a high percentage of idle database connections open; a low value causes the proxy to close more idle connections and return them to the database. If you specify this parameter, you must also include a value for MaxConnectionsPercent. The default value is half of the value of MaxConnectionsPercent.
For example, say you have a database that can handle up to 500 connections at a time. To find out how many of those connections can be idle (not in use) at any given time, you apply the percentage.
Here’s how you calculate it:
Formula:
MaxIdleConnections = (MaxIdleConnectionsPercent / 100) × MaxConnections
Input:
MaxIdleConnectionsPercent: 50%
So, if your database allows 500 connections in total and you set MaxIdleConnectionsPercent to 50%, the number of idle connections allowed would be:
Max Idle Connections = (50 / 100) × 500 = 250
This means out of 500 connections, 250 can be idle.
Now suppose you have a database that allows 2000 connections at a time and MaxIdleConnectionsPercent is 10%.
Now, using the formula:
Max Idle Connections = (10 / 100) × 2000 = 200
This means out of 2000 connections, 200 can be idle.
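The calculation above can be wrapped in a small shell helper. This is just an illustration of the formula; the max_idle_connections function is our own naming, not an AWS tool:

```shell
#!/bin/sh
# Compute the proxy's idle-connection ceiling from the database's
# max_connections and the MaxIdleConnectionsPercent setting.
max_idle_connections() {
  percent=$1          # MaxIdleConnectionsPercent
  max_connections=$2  # the database's max_connections
  echo $(( percent * max_connections / 100 ))
}

max_idle_connections 50 500    # first example  → 250
max_idle_connections 10 2000   # second example → 200
```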
We fine-tuned the MaxConnectionsPercent and MaxIdleConnectionsPercent parameters in AWS RDS Proxy to better suit our needs.
Here’s how we updated them:
Input:
aws rds modify-db-proxy-target-group \
--target-group-name default \
--db-proxy-name dsr-forward \
--region ap-southeast-1 \
--connection-pool-config MaxConnectionsPercent=50,MaxIdleConnectionsPercent=10
Output:
{
"DBProxyTargetGroup": {
"DBProxyName": "dsr-forward",
"TargetGroupName": "default",
"TargetGroupArn": "arn:aws:rds:ap-southeast-1:863547483732:target-group:prx-tg-0a926g7b3ddyg8j0f",
"IsDefault": true,
"Status": "available",
"ConnectionPoolConfig": {
"MaxConnectionsPercent": 50,
"MaxIdleConnectionsPercent": 10,
"ConnectionBorrowTimeout": 120,
"SessionPinningFilters": []
},
"CreatedDate": "2021-10-25T12:07:51.836000+00:00",
"UpdatedDate": "2023-09-14T10:37:09.421000+00:00"
}
}
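To confirm the change took effect without reading the whole JSON by eye, the pool-size fields can be pulled out with plain grep. This is only a sketch: the describe output is stubbed into proxy.json here, whereas in practice you would redirect the output of the describe-db-proxy-target-groups command shown earlier:

```shell
# Stub of the describe output; in practice:
#   aws rds describe-db-proxy-target-groups ... > proxy.json
cat > proxy.json <<'EOF'
{ "ConnectionPoolConfig": { "MaxConnectionsPercent": 50, "MaxIdleConnectionsPercent": 10 } }
EOF

# Extract just the two pool-size percentages.
grep -o '"Max[A-Za-z]*Percent": [0-9]*' proxy.json
```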
By setting these parameters to lower values, we restored balance to our database environment, and everything began running smoothly again. Our RDS Serverless costs also dropped by roughly 50%, thanks to significantly lower ACU consumption from fewer active connections.
And so, our database party came to an end, leaving us with a valuable lesson: keep an eye on those resources, or you might find yourself footing the bill for a tech-tacular blowout.
Our journey dealing with high connections in Aurora Serverless V2 was both challenging and rewarding. We learned the crucial importance of meticulous monitoring and analysis, and the value of keeping up with the latest features and parameters. By sharing our experience, we aim to help other engineers avoid similar pitfalls and optimize their Aurora Serverless deployments.