Eric Workman

Recovering from "no active leader"

I needed to resize a PostgreSQL instance on Fly.io in the last few days, but I hit a problem. This post is a recounting of what happened and how I fixed it. I've changed the names of apps, machines, and IP addresses so that it's easier to follow, and I only included command output where it was relevant.

Context

For my project, I have an Elixir application called myapp. It uses a multi-node setup, but the details don't matter for this problem. All nodes were experiencing identical symptoms.

This app uses a PostgreSQL database called db. The database was a clustered, high-availability (HA) setup of three machines, each with 2 CPUs, 4096 MB of memory, and a 40 GB volume attached for storage. The HA setup used one primary and two replicas following the primary.

The postgres instances were oversized for the app's intended use and current scale, so I needed to right-size them to save some bucks. The plan was to reduce the memory to 1024 MB.

Triggering the problem

Resizing postgres instances for memory is a two-step process. The first step is to set shared_buffers to something less than the total system memory; a good starting point is 25% of memory. The current setting was 131072 blocks, or 1 GB (131072 * 8 kB per block), which was fine for 4096 MB but way too much for 1024 MB. The second step would be to resize the machines to 1024 MB of memory.
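
As a quick sanity check (assuming psql can still connect at this point), the current value and its size in MB can be read from pg_settings, which reports shared_buffers in 8 kB blocks

SHOW shared_buffers;
-- shared_buffers is reported in 8 kB blocks; convert to MB
SELECT setting::bigint * 8 / 1024 AS shared_buffers_mb
  FROM pg_settings
 WHERE name = 'shared_buffers';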

So, starting with shared_buffers, I took a backup of the database and then ran the following command to set shared_buffers to a quarter of 1 GB (32768 blocks, or 256 MB)

fly postgres config update --shared-buffers 32768 --app db

The configuration change took effect, and then Fly started the automatic rolling restart of the database machines. This was expected. However, the machines never came back healthy.

Only 2 out of the 3 health checks passed, and the checks for the cluster were spitting out

Health check for your postgres role has failed. Your cluster's membership is inconsistent.
Health check for your postgres vm is warning. Your instance might be hitting resource limits.
Health check for your postgres database is warning. Your database might be malfunctioning.
Health check for your postgres database has failed. Your database is malfunctioning.
[] cluster-locks: `zombie.lock` detected (7.9µs)

The db app was generating logs at 50x the normal rate, and messages similar to these were repeated ad nauseam

replica1: FATAL: database "repmgr" does not exist
replica2: FATAL: database "repmgr" does not exist
primary: FATAL: database "repmgr" does not exist

There were even more log messages, but these are the meaningful lines.

Diving in

As a general rule, I dislike leaving unexplored problems. Especially for stateful services like databases. Especially when they could pop up in pretty routine situations.

I had a fresh backup of the database, so I could have restored and attached myapp to a fresh database easily. But I'd likely end up stuck in the exact same situation when resizing it, or maybe in a worse spot if I pre-sized the new database. Because this wasn't a critical service, I chose to dive in. I started by gathering as much information about the state of the database as I could.

Client behavior

myapp was returning 500 errors on any page that called out to the database, all centered around the connection to db closing early or unexpectedly. fly -a db postgres connect and all other postgres cli actions returned nothing but Error: no active leader found. Proxying with fly -a db postgres proxy 15432:5432 and connecting with psql locally would only give

psql: error: connection to server at "localhost" (::1), port 15432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (127.0.0.1), port 15432 failed: server closed the connection unexpectedly
	This probably means the server terminated abnormally
	before or while processing the request.

If I couldn't connect to the database remotely, maybe I could from the host. Using fly -a db ssh console would only return instantly, as if the connection were being closed as soon as it was established. This was a dead end.

Leaderless

The cluster was stuck without an agreed-upon leader. This was a tough spot to be in. Through some more sleuthing in the logs, I realized that the db nodes were stuck in crash loops, restarting over and over. Being unable to connect to the database to diagnose further left me with few options. First, I tried to restart the primary node. A quick fly -a db machine restart primarymachineid later, and I saw no changes to the logs coming out of the primary or replicas.

Next, I spun down the replicas with fly -a db machine stop replicamachineid, checked the logs, and restarted the primary machine. This set of actions had no effect.

Finally, updating the app image was the last Hail Mary before a rebuild. This could potentially resolve underlying bugs in the image or maybe trigger some change in the infrastructure, but it's really hard to know with confidence what changed. I issued fly -a db image update --skip-health-checks so that I didn't have to wait for health checks I expected to keep failing, and... a change!

primary: WARNING:  database "repmgr" has a collation version mismatch
primary: DETAIL:  The database was created using collation version 2.31, but the operating system provides version 2.36.
primary: HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE repmgr REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.

A new error! The database appeared to be getting further along, but it was restarting constantly on the primary. I didn't know what changed, but that's a different rabbit hole I can explore if this happens again.

Collation

Since the logs contained database "repmgr" has a collation version mismatch, I needed to figure out what this meant. In postgres, collations are schema objects that control how string (aka "text") fields are sorted and compared. The postgres docs on Managing Collations give a big hint, bold added:

A collation is an SQL schema object that maps an SQL name to locales provided by libraries installed in the operating system.

The docs go on to say that libc is one of the standard providers. This is often packaged as glibc, and that library can change when the image changes. Since this error popped up right after updating the image, which jumped several versions, we should be able to figure out how to recover.
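
On PostgreSQL 15, the mismatch can also be confirmed per database by comparing the collation version recorded when each database was created against what the operating system currently provides. A minimal check, assuming a working connection

-- recorded vs. currently-provided collation version for each connectable database
SELECT datname,
       datcollversion AS recorded_version,
       pg_database_collation_actual_version(oid) AS os_version
  FROM pg_database
 WHERE datallowconn;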

What is repmgr anyway?

repmgr is a command line tool for managing replication and failover for PostgreSQL clusters. Fly chose this tool over Stolon earlier this year, and it is what powers the cluster automation and cli. Since postgres on the primary was getting further along, it was worth trying to ssh again.

fly -a db ssh console worked. With a session, this is what I ran

root@primarymachineid:/# su postgres
postgres@primarymachineid:/$ repmgr -f data/repmgr.conf cluster show
WARNING:  database "repmgr" has a collation version mismatch
DETAIL:  The database was created using collation version 2.31, but the operating system provides version 2.36.
HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE repmgr REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
WARNING:  database "repmgr" has a collation version mismatch
DETAIL:  The database was created using collation version 2.31, but the operating system provides version 2.36.
HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE repmgr REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
 ID         | Name                             | Role    | Status        | Upstream                          | Location | Priority | Timeline | Connection string
------------+----------------------------------+---------+---------------+-----------------------------------+----------+----------+----------+---------------------------------------------------------------------------------------------
 replica1id | fdaa:1:30c1:a7b:xxxx:yyyy:zzzz:a | standby | ? unreachable | ? fdaa:1:30c1:a7b:bbb:cccc:dddd:e | iad      | 100      |          | host=fdaa:1:30c1:a7b:xxxx:yyyy:zzzz:a port=5433 user=repmgr dbname=repmgr connect_timeout=5
 replica2id | fdaa:1:30c1:a7b:bbb:cccc:dddd:e  | standby | ? unreachable | ? fdaa:1:30c1:a7b:bbb:cccc:dddd:e | iad      | 100      |          | host=fdaa:1:30c1:a7b:bbb:cccc:dddd:e port=5433 user=repmgr dbname=repmgr connect_timeout=5
 primaryid  | fdaa:1:30c1:a7b:ffff:gggg:hhhh:i | primary | * running     |                                   | iad      | 100      | 1        | host=fdaa:1:30c1:a7b:ffff:gggg:hhhh:i port=5433 user=repmgr dbname=repmgr connect_timeout=5

Since the replicas were stopped, it made sense that their status was unreachable.

I unregistered the replicas with repmgr -f data/repmgr.conf standby unregister --node-id=replica1id, thinking that maybe postgres would then start and accept connections. In retrospect, this was probably wrong and unnecessary. The repmgr command was outputting the same collation error seen in the logs, so it was clearly connecting to postgres well enough to pull data from the repmgr schema and tables.

Recovery

Knowing that the database was queryable, I used myapp's postgres connection string and psql to connect to the database

postgres@primarymachineid:/$ psql postgres://postgres:...
WARNING:  database "myappdb" has a collation version mismatch
DETAIL:  The database was created using collation version 2.31, but the operating system provides version 2.36.
HINT:  Rebuild all objects in this database that use the default collation and run ALTER DATABASE myappdb REFRESH COLLATION VERSION, or build PostgreSQL with the right library version.
psql (15.3 (Debian 15.3-1.pgdg120+1))
Type "help" for help.

myappdb=# |

And there it is: we can repair the databases! It's reasonable to assume that all of the databases need the same repair.

\l lists all the databases, including postgres, repmgr, and myappdb. Iterating through the databases, I ran

myappdb=# \c thedatabase
thedatabase=# REINDEX DATABASE thedatabase; ALTER DATABASE thedatabase REFRESH COLLATION VERSION;
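
With more than a couple of databases, the ALTER DATABASE statements can be generated instead of typed by hand. This is just a convenience sketch; the REINDEX still has to be run while connected to each database in turn, since REINDEX DATABASE only operates on the current database

-- emit a REFRESH COLLATION VERSION statement for every connectable database
SELECT format('ALTER DATABASE %I REFRESH COLLATION VERSION;', datname)
  FROM pg_database
 WHERE datallowconn;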

And with that, or maybe earlier, myapp started to work again. The postgres logs revealed no issues with the primary. I could connect again with the fly postgres commands. Success!

Cleanup

With a valid primary, a functioning repmgr, and a better understanding, I completed the resize, removed the broken replicas, cloned the primary to make two new replicas, and cleaned up the old replica volumes

fly -a db machine update primarymachineid --memory 1024
fly -a db machines remove replica1machineid --force
fly -a db machines remove replica2machineid --force
fly -a db machines clone primarymachineid
fly -a db machines clone primarymachineid

fly --app db volumes list
# ID                STATE          	NAME   	SIZE	REGION	ZONE	ENCRYPTED	ATTACHED VM   	    CREATED AT
# vol_primary	    created        	pg_data	40GB	iad   	e172	true     	primarymachineid	1 month ago
# vol_newreplica1	created        	pg_data	40GB	iad   	7adf	true     	replica1machineid	10 minutes ago
# vol_newreplica2	created        	pg_data	40GB	iad   	7e7d	true     	replica2machineid	10 minutes ago
# vol_oldreplica1	created        	pg_data	40GB	iad   	7adf	true     	                	1 month ago
# vol_oldreplica2	created        	pg_data	40GB	iad   	7e7d	true     	                	1 month ago

fly volumes destroy vol_oldreplica1
fly volumes destroy vol_oldreplica2

Conclusion

I couldn't find much on Fly's community forums that directly helped. This thread about no active leader didn't resolve my problem. This thread about collation mismatches was helpful though.

I still don't know exactly what went wrong during the first restart when changing shared_buffers. Next time I will skip the automatic rolling restart and manually restart one of the replicas first; perhaps that would reveal more information.

Lastly, I've been very happy with Fly's services. I had a tough time figuring out this problem, but I think this was ultimately my fault; I should have controlled the cluster restart myself. I understand why they go out of their way to clarify that this is not managed postgres. I think their cli is top-notch. The only stumbling block I've experienced was specifying resource sizes and clearly understanding their costs. They have been improving their billing interfaces, and their docs are thorough. I think the user experience is in a sweet spot: clearer than Heroku, more customizable than Digital Ocean's App Platform, and more reasonable in cost and less complex than the big 3 cloud providers.