How we troubleshot an RDS Aurora problem as SREs
I’m Nhan, an SRE at Classi corp — an education service for teachers and students in Japan. Recently we had an incident with Aurora MySQL. It took us a lot of time to troubleshoot, so in this article I will show you how we reacted to and troubleshot this incident.
Firstly, you need to know a bit about our services. We use a micro-services architecture, and each service runs on an ECS cluster. We also have a proxy cluster: all requests come here first and, based on the host, they are forwarded to the other app clusters.
As you may know, the Japanese Yen has been weakening, and since my company uses AWS and every bill is in USD, our infrastructure cost has increased a lot. The SRE team’s mission in this period is cost-cutting without affecting the activities of the system.
It was a nice Wednesday, and my boss wanted to size down the main DB’s reader instance. By our estimation, we could halve the DB instance size (from 8xlarge → 4xlarge), and this action would not cause any downtime because we create a new reader instance and switch the reader endpoint to it.
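The flow looks roughly like the sketch below — a minimal boto3 sketch, where the identifiers, the instance class, and the use of a custom reader endpoint are my assumptions for illustration, not our exact setup:

```python
import boto3

rds = boto3.client("rds")

# Add a smaller reader instance to the Aurora cluster.
# Identifiers and instance class are hypothetical examples.
rds.create_db_instance(
    DBInstanceIdentifier="main-db-reader-4xl",
    DBInstanceClass="db.r5.4xlarge",
    Engine="aurora-mysql",
    DBClusterIdentifier="main-db-cluster",
)

# Wait until the new reader is available before switching traffic.
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier="main-db-reader-4xl"
)

# Point a custom reader endpoint at the new instance only,
# so applications keep using the same endpoint DNS name.
rds.modify_db_cluster_endpoint(
    DBClusterEndpointIdentifier="main-db-reader-endpoint",
    StaticMembers=["main-db-reader-4xl"],
)
```

Once traffic has moved to the new reader, the old 8xlarge reader can be deleted.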
Everything was quite OK after the size-down: the new instance ran very well, and the DB’s CPU usage doubled but stayed within the safe range. But after 3 hours, we received tons of alerts from the proxy cluster. Requests from users could not connect to the server and raised 500 errors. The requests came from many services, so we guessed the failure point was in the proxy or the database (all services use the proxy and the DB, so either could be the problem).
After checking the monitoring of the DB instance, we realized that the instance’s freeable memory was zero (and we didn’t have any warning/alert set up on this metric). As an immediate reaction, we scaled the instance back up to 8xlarge (reverting to the state before the incident), but nothing changed. The memory still went down every hour and ran out after a dozen hours. When changing the DB instance size, we didn’t change any RDS setting or MySQL setting, so no one knew what was happening.
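This is exactly the kind of alert we were missing. A minimal sketch of a CloudWatch alarm on FreeableMemory — the alarm name, instance identifier, threshold, and SNS topic below are assumptions:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alert when the reader has under ~2 GiB of freeable memory for 15 minutes.
# All names and the threshold are hypothetical.
cloudwatch.put_metric_alarm(
    AlarmName="main-db-reader-low-freeable-memory",
    Namespace="AWS/RDS",
    MetricName="FreeableMemory",
    Dimensions=[{"Name": "DBInstanceIdentifier", "Value": "main-db-reader-4xl"}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=3,
    Threshold=2 * 1024 ** 3,  # bytes
    ComparisonOperator="LessThanThreshold",
    AlarmActions=["arn:aws:sns:ap-northeast-1:123456789012:sre-alerts"],
)
```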
To keep services running without downtime, we had only one option: increase the size of the instance to 12xlarge and add 2 more reader instances. The memory still ran out, but more slowly, which gave us time to troubleshoot the root cause. The SRE team had to reboot the DB instance every day to free the memory. We realized that when we started the new instance, the latest security patch was also applied to it automatically. Because the other instances without that patch were running well, the root cause of the incident might be in this patch. We contacted AWS Support about this case, but after 1 day we only received a guide on how to check the memory allocation (to be honest, it didn’t help at all, and the AWS supporter also couldn’t troubleshoot the problem).
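One quick way to spot this difference is to compare engine versions across the cluster’s instances; a small sketch, where the cluster identifier is hypothetical:

```python
import boto3

rds = boto3.client("rds")

# List every instance in the cluster with its engine version,
# to spot the instance that picked up a newer patch level.
resp = rds.describe_db_instances(
    Filters=[{"Name": "db-cluster-id", "Values": ["main-db-cluster"]}]
)
for db in resp["DBInstances"]:
    print(db["DBInstanceIdentifier"], db["EngineVersion"])
```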
The highest priority was troubleshooting and solving this incident (because 12xlarge is very expensive). While waiting for a response from AWS Support, we connected to the MySQL server and checked all the setting variables (especially the variables related to caching, since we were running out of memory), and we found something “fishy”: the ‘table_open_cache’ variable was set to 524288. This is the maximum value in the range allowed by AWS (from 0 to 524288).
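Checking those variables is straightforward once you can connect to the reader; a small sketch using PyMySQL, with the host and credentials as placeholders:

```python
import pymysql

# Connect to the problematic reader; host/credentials are placeholders.
conn = pymysql.connect(host="main-db-reader.example.internal",
                       user="sre", password="***")
with conn.cursor() as cur:
    # Dump the cache-related variables we were suspicious of.
    cur.execute(
        "SHOW GLOBAL VARIABLES WHERE Variable_name IN "
        "('table_open_cache', 'table_definition_cache', 'max_connections')"
    )
    for name, value in cur.fetchall():
        print(name, value)  # e.g. table_open_cache 524288
conn.close()
```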
According to the MySQL documentation, table_open_cache is related to max_connections: “For example, for 200 concurrent running connections, specify a table cache size of at least 200 * N, where N is the maximum number of tables per join in any of the queries which you execute. You must also reserve some extra file descriptors for temporary tables and files.”
So a value set too high may be the root cause of this incident.
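As a rough, back-of-the-envelope illustration (the numbers below are assumptions, not our real workload), the guideline gives a value far below 524288:

```python
# Hypothetical numbers just to illustrate the MySQL guideline.
max_connections = 2000   # assumed concurrent connections
tables_per_join = 16     # assumed maximum tables per join (N)

recommended = max_connections * tables_per_join
print(recommended)       # 32000 -- still far below 524288
```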
The interesting thing is that all instances of the DB use the same cluster parameter group and instance parameter group, and the ‘table_open_cache’ variable is set to 524288, but on the instances that were running well the effective value was only 58525. Something is missing here (maybe an AWS bug, because ‘table_open_cache’ is a dynamic parameter and can be applied without a reboot), but after reducing the value, the DB instance works stably.
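Since ‘table_open_cache’ is dynamic, lowering it does not need a reboot. A minimal sketch of the change through the DB parameter group — the group name and target value are written here as assumptions:

```python
import boto3

rds = boto3.client("rds")

# Lower table_open_cache in the instance parameter group.
# Because the parameter is dynamic, "immediate" applies it without a reboot.
rds.modify_db_parameter_group(
    DBParameterGroupName="main-db-instance-params",
    Parameters=[{
        "ParameterName": "table_open_cache",
        "ParameterValue": "58525",
        "ApplyMethod": "immediate",
    }],
)
```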
After this incident, we held a postmortem to review the whole flow of the incident and come up with ways to avoid it. Here are a few key ideas:
- An immediate reaction is important to reduce damage and buy time to troubleshoot
- Out-of-memory errors usually relate to caches
- Don’t expect too much from AWS Support
- Don’t trust vendors 100% (maybe they have bugs somewhere)
- Be careful with your settings
- Make sure the four golden signals are set up on the critical components
Thank you for reading this far. I’m a newbie in SRE, but I want to share my experience, so if you have any questions, please comment below and I will reply as soon as possible.