Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays.

A couple of days ago, someone posted on /0 (the meta community for the Divisions by zero) that the incoming federation from lemmy.world (the largest lemmy instance by an order of magnitude) was malfunctioning. Alarmed, I started digging in, since a federation problem with lemmy.world would massively affect the content my community can see.

As always my first stop was the Lemmy General Chat on Matrix where I asked the lemmy.world admins if this appears to be something on their end. To their credit both their lead infra admin and the owner himself jumped in to assist me, changing their sync settings, adding custom DNS entries and so on. Nothing seemed to help.

But the problem must still be somewhere in lemmy.world, I thought. It’s the only instance where this is happening and they upgraded to 0.19.3 recently, so something must have broken. But wait, this didn’t start immediately after the upgrade. Someone pointed out this very useful federation status page, which kinda pointed to the problem being only on lemmy.world.

Not quite. Other big instances like lemmy.ml and lemm.ee were not having any issues with federation with lemmy.world (even though 2 dozen others like lemmy.pt were), and they are as big if not bigger than lemmy.dbzer0.com. A problem originating from lemmy.world cannot possibly be affecting only some specific instances. To make matters worse, both lemmy.ml and I are using the same host (OVH), so I couldn’t even blame my hosting provider somehow.

So obviously the main culprit is somewhere in my backend, right? Well, maybe. Problem is, none of the components of my infrastructure were overloaded, everything was sitting between 5% and 15% utilization. Nothing to even worry about.

OK, so first I need to make sure it’s not a network issue somehow specifically between me and lemmy.world. I know OVH gave me a bum floating IP in the past and were completely useless at even understanding that their floating IP was faulty, so I had to stop using it. Maybe there’s some problem with my loadbalancers.

Still, I’m using haproxy, which is nothing if not fast and rock solid. So I didn’t really suspect the software. Rather, maybe it’s a network issue with the LB itself. So the first thing I did was double the number of loadbalancers in play, by setting my DNS record to point to my secondary LB at the same time. This should lessen the amount of traffic hitting each LB and even land some of it on a completely different VM, and thus show whether the problem is on the haproxy side. Sadly, this didn’t improve things at all.

OK so next step, I checked how long a request takes to return from the backend after haproxy sends it over. The results were not good.

I don’t blame you if you cannot read this, but what this basically says is that a POST request to my /inbox took between 0.8 and 1.2 seconds. This is bad! This is supposed to be a tiny payload to tell you an event happened on another instance, it should be practically instant.

Even weirder, this is affecting all instances, not just lemmy.world. So this is clearly a problem on my end, but it also confused me. Why am I not having trouble with other instances? The answer came when I was informed that 0.19.3 added a brand new federation queue.

You see, the old versions of lemmy used to send all federation actions over as soon as they received them. Fire and forget style. This naturally led to federation events being dropped due to a myriad of issues, like network, downtime, gremlins etc. So you would lose posts, comments and votes, and you would (probably) never realize.

The new queue added order to this madness, by making each instance send its requests serially. A request would be sent again and again until it succeeded. And the next one would only be sent once the previous one was done. This is great for instances not experiencing issues like mine. You see, at this point, I was processing approximately 1 incoming federation request per second, while lemmy.world was sending around 3. Even worse, I would occasionally time out as well by exceeding 10 seconds to process, causing 2 more seconds of wait time.

Unlike lemmy.world, the other instances federating to mine didn’t have nearly as much activity, so 1 per second was enough to stay in sync with them. This explained why I was seemingly only affected by lemmy.world and nobody else. I was somewhat slow, but only slow enough to notice if the source had too much traffic.
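To make the rates concrete, here is a toy model of a strictly serial queue. It is only a sketch and not how lemmy implements its federation queue: the function name is made up, the 3/s and 1/s figures come from the numbers above, and the 0.5/s rate for a quiet instance is an assumption picked purely for illustration.

```python
def backlog_after(seconds: int, arrivals_per_sec: float, processed_per_sec: float) -> float:
    """Return how many activities are still waiting at the sender after `seconds`.

    Toy model: every second the sender produces `arrivals_per_sec` new
    activities and the receiver can acknowledge at most `processed_per_sec`.
    """
    backlog = 0.0
    for _ in range(seconds):
        backlog += arrivals_per_sec
        backlog -= min(backlog, processed_per_sec)
    return backlog


# Busy sender (about 3/s, as lemmy.world was) against a receiver doing 1/s.
busy = backlog_after(3600, arrivals_per_sec=3.0, processed_per_sec=1.0)

# Quiet sender (0.5/s is an assumed figure) against the same receiver.
quiet = backlog_after(3600, arrivals_per_sec=0.5, processed_per_sec=1.0)

print(f"busy sender backlog after 1h:  {busy:.0f}")
print(f"quiet sender backlog after 1h: {quiet:.0f}")
```

With these inputs the busy sender falls behind by 2 activities every second, while the quiet one never builds a backlog at all, which matches what I was seeing.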

OK, we know the “what”, now we needed to know the “why”.

At this point I’m starting to suspect something is going on with my database. So I have to start digging into stuff I’m really not that familiar with. This is where the story gets quite frustrating, because there’s just not a lot of admins in the chat who know much about the DB side of lemmy internals. So I would ask a question, or provide logs, and then have to wait sometimes hours for a reply. Fortunately both sunaurus from lemm.ee and phiresky were around, who could review some of my queries.

Still, I had to know enough sql to craft and finetune those queries myself and how to enable things like pg_stat_activity etc.

Through trial and error we did discover that some insert/update queries were taking a bit too much time to do their thing, which could mean that we were I/O bound. Easy fix, disable synchronous_commit, sacrificing some safety for speed. Those slow queries went away, but the problem remained the same. WTF?!

There was nothing else clearly slow in the DB, so there was nothing more we could do there. So my next thought was, maybe it’s a networking issue between my loadbalancers and my backend. OK so I needed to remove that from the equation. I set up a haproxy directly on top of my backend which would allow me to go through the loopback interface and have 0 latency. For this I had to ask the lemmy.world admins to kindly add lemmy.dbzer0.com directly to their /etc/hosts file so they alone would hit my local haproxy.

No change whatsoever!

At this point I’m starting to lose my mind. It’s not networking between my LB and my backend, and it’s not the DB. It has to be the backend. But it’s not under any load and there’s no errors. Well, not quite. There’s some “INFO” logs which refer to lost connections, or unexpected errors, but nobody in the chat seems to worry about them.

Right, that must mean the problem is networking between my backend and my database, right? Unlike most lemmy instances, I keep my lemmy DB and my backend separated. Also, the DB has a limited number of connections and the lemmy backend limits itself to a small pool of connections. Maybe I run out of connections because of slow queries?

OK let’s increase that to a couple thousand and see what happens.

Nothing happens, that’s what happens. Same 1 per second requests.

As I’m spiraling more and more towards madness, and the chat is running out of suggestions, sunaurus suggests that he adds some extra debugging to lemmy and I will run that to try and figure out which DB query is losing time. Great idea. Problem is, I have to compile lemmy from scratch to do that. I’ve never done that before. Not only that, I barely know how to use docker in the first place!

Alright, nothing else I can do, got to bite that bullet. So I clone the lemmy backend and while waiting for sunaurus to come online, I start hacking at it to figure out how to make it compile a docker lemmy backend from scratch. I run into immediate crashes and despair. Fortunately nutomic (one of the core devs) walked by and told me the git commands to run to fix it, so I could proceed in cooking my very first lemmy container. Then nutomic helped me realize I don’t need to set up a whole online repo to transfer my docker container. The more you know…

Alright, so I cooked a container and plugged it into a whole separate docker infra, which is only connected to the lemmy.world loadbalancer, so I can filter out all logs except those from federation requests. So far so good.

Well, not quite, unfortunately I forgot that the “main” branch of lemmy is actually the development branch and has untested code in there. So when I was testing my custom docker deployment, I migrated my DB to whatever the experimental schema is on main. Whoops!

OK, nothing seemingly broke. Problem for a different day? No, just foreshadowing.

Finally sunaurus comes back online and gives me a debug fork. I eagerly compile and deploy it on prod and then send some logs to sunaurus. We were expecting we’d see 1 or 2 queries that were struggling, so maybe a bad lock situation somewhere. We did not expect we’d see ALL queries, including the simplest ones such as looking up a language, take 100ms or more! That can’t be good!

Sunaurus connects the dots and asks the pertinent question: “Is your DB close to the Backend, geographically?”

Well, “Yes”, I reply, “I got them in the same datacenter”. “Can you ping?” he asks.

OK, I ping. 25ms. That’s good right? Well, in isolation, that’s great. Where it’s not so great is when talking about backend-to-DB communication! This is like 1000s of km of distance.
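If ICMP ping is blocked, or you want to measure from inside the backend container, timing a TCP handshake to the database port gives a similar number. The following is a minimal sketch using only the Python standard library; the function name is my own, and the handshake time is only an approximation of a network round trip, not a measurement of query latency.

```python
import socket
import statistics
import time


def tcp_connect_ms(host: str, port: int, samples: int = 5, timeout: float = 2.0) -> float:
    """Return the median time in milliseconds needed to open a TCP connection.

    A TCP handshake costs roughly one network round trip, so this is a
    rough stand-in for ping between two machines.
    """
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        # Open the connection and close it again straight away.
        with socket.create_connection((host, port), timeout=timeout):
            pass
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.median(timings)
```

Run from the backend host, something like `tcp_connect_ms("db.example.internal", 5432)` would do, where the hostname is a placeholder and 5432 is the default PostgreSQL port. Between two machines in the same datacenter you would expect a result well under a couple of milliseconds.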

You see, typically a loadbalancer just makes one request to the backend and gets one reply, so a 25ms roundtrip is nothing. However a backend is talking to the DB a lot. In this instance, for every incoming federation action the backend does like 20 database calls, to verify and submit. Multiply each of these by a 25+25ms roundtrip and you get 1000ms extra before any actual processing on the DB!
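The same arithmetic as a tiny sketch, using the figures from this post (about 20 sequential database calls per federation action, 25+25ms per call). The function names are made up for illustration, and the model assumes the calls run strictly one after another.

```python
def network_overhead_ms(db_calls: int, round_trip_ms: float) -> float:
    """Time spent purely waiting on the network when DB calls run one after another."""
    return db_calls * round_trip_ms


def max_actions_per_sec(per_action_ms: float) -> float:
    """Upper bound on throughput when actions are processed strictly serially."""
    return 1000 / per_action_ms


# About 20 DB calls per action, each paying 25ms there and 25ms back.
overhead = network_overhead_ms(db_calls=20, round_trip_ms=25 + 25)

print(f"network overhead per action: {overhead:.0f} ms")
print(f"best case throughput: {max_actions_per_sec(overhead):.1f} actions/s")
```

That lines up with the roughly 1 request per second I was stuck at, before the database has even done any real work.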

But how did this happen? I’m convinced all my servers are in the same geographic area. So I go to my provider panel and check. Nope, all my servers BUT the backend are in the same geographic area. My backend happens to be around 2000 km away. Whoops!

Turns out, when I migrated my backend back in the day after running into performance issues, I failed to pay attention to that little geographic detail. Nevertheless it all worked perfectly well until this specific set of circumstances, where the biggest lemmy instance upgraded to 0.19.3, which made federation serial, which my slow-ass connection couldn’t keep up with. In the past, I would just get flooded by sync requests from lemmy.world as they came. I would be slow, but I’d process them eventually. Now, the problem became obvious.

Alright, it’s time to roll up my sleeves and migrate servers! Thank fuck I have everything written in Ansible as code, so the migration was relatively painless (other than slapping Debian 12 around to let me do fucking docker-compose operations with python, goddamnit!)

A couple of hours later, I had migrated my backend to the same DC as the database, and as expected, suddenly my ingestion time per federation action was on the order of 50ms, instead of 1000ms. This means I could ingest closer to 20 actions per second from lemmy.world, while it was getting just 3 new ones per second from its userbase. Finally we started catching up!
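For a rough idea of how long catching up takes, here is a small sketch. The 50ms and 3/s figures are the ones above; the backlog size is a made-up number purely for illustration, since I never measured the real one, and the function name is hypothetical.

```python
def seconds_to_catch_up(backlog: int, ingest_per_sec: float, incoming_per_sec: float) -> float:
    """Return how long a queued backlog takes to drain while new activity keeps arriving."""
    net_rate = ingest_per_sec - incoming_per_sec
    if net_rate <= 0:
        raise ValueError("ingestion is not faster than incoming traffic, backlog never drains")
    return backlog / net_rate


ingest_per_sec = 1000 / 50   # about 50ms per action after the migration
incoming_per_sec = 3         # roughly what lemmy.world generates
assumed_backlog = 170_000    # hypothetical backlog size, not a measured value

seconds = seconds_to_catch_up(assumed_backlog, ingest_per_sec, incoming_per_sec)
print(f"catch-up time: {seconds / 3600:.1f} hours")
```

The useful bit is the net rate: at 20 in and 3 new per second the backlog shrinks by about 17 actions every second, whereas before the migration the net rate was negative and the function would refuse to give an answer at all.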

All in all, this has been a fairly frustrating experience and I can’t imagine anyone who’s not doing IT Infrastructure as their day job being able to solve this. As helpful as the other lemmy admins were, they were relying a lot on me knowing my shit around Linux, networking, docker and postgresql at the same time. I had to do extended DB analysis, fork repositories, compile docker containers from scratch and deploy them ad-hoc etc. Someone who just wants to host a lemmy server would give up way earlier than this.

For me, a very stressful component was the lack of replies in the chat. I would sometimes write pages of debug logs, and there was no reply from anyone for 6 hours or more. It gave me the impression that nobody had any clue what to do to help me and I was on my own. In fact, if it wasn’t for sunaurus specifically, who had enough Infrastructure, Rust and DB chops to get an insight into where it was all going wrong, I would probably still be out there, pulling my hair out.

As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community into a few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see another option.

Fortunately this saga ended and we’re now fully in sync with lemmy.world. Ended? Not quite. You see, today I realized I couldn’t upload images on my instance anymore. Remember when I started the development version of lemmy by mistake from main? Welp, that broke them. So I also had to learn how to downgrade a lemmy instance. Fortunately sunaurus had my back on this as well!

To spare some people the pain, I’ve sent a PR to the lemmy docs to expand the documentation for building docker containers and doing troubleshooting. My pain is your gain.

This also gave me an insight into how the federation of lemmy will eventually break when a single server (say, lemmy.world) grows big enough to start overwhelming even servers which are not badly set up like mine was. I have some ideas to work around some of this, so I plan to write up a suggestion on how to become more future proof, which would incidentally prevent the same issue which happened to me in the first place.

In the meantime, enjoy the Divisions by zero, which as a result of the migration should now feel massively faster as well!