Year Two of the AI Horde!

The AI Horde has turned two years old. I take a look back at all that’s happened since.

Can you believe I blogged about the first birthday of the AI Horde approximately one year ago? If you can, go ahead and read that one first to see the first chapter of its existence.

Since we started recording stats, we’ve generated 113M images and 145M texts, which goes to show how explosively the FOSS LLM scene has embraced the AI Horde since last year, completely outpacing the lifetime image generations within one year!

This year has been the first one since we received funding from NLNet, so let’s take a look at what we achieved:

Overall, development has continued throughout the last year and we’ve been trying to onboard as many new features as possible with 2 core devs. Sadly our donation income has completely collapsed since the same time last year, to the point where the money is just barely covering our infrastructure costs.

If you see value in what the AI Horde does, please consider supporting our infrastructure through Patreon or GitHub, or consider onboarding your PC as a Dreamer or Scribe worker.

What was your favorite new addition to the AI Horde from the past year? Let me know if there’s any event I forgot to mention.

OCTGN Android:Netrunner sound effects

Recently a maintainer from jinteki.net contacted me about getting the license for the A:NR sound effects I had used in the OCTGN implementation to reuse in jinteki, and casually mentioned that the Archer ICE noise was the coolest one. It had until now never occurred to me that people might appreciate the various sound effects I had inserted into the game back then for the flavour, so I did a quick search and ran into this cute video about it (you can hear Archer at the 13:00 mark).

Fascinating! I always like to make my games as flavorful as possible, and especially given the limitations of OCTGN, some flavour was sorely needed. So I had added custom fonts, little flavour blurbs in significant actions and finally I scoured the internet for hours and hours to find the sound effects which fit the cyberpunk theme of the various actions.

These were always meant to be just little things in an obscure game, so I’m kinda pleasantly surprised that some of them have received this sort of cult status in the netrunner community. Very cool. Hopefully these sound effects will find a second life in jinteki.net.

If you want to check what the OCTGN game looked like, I have a tutorial video here, and I also have a bunch of videos about it on my youtube channel.

Transparent Generations

We have another new feature available for people to use from the AI Horde. This is the capacity to use Layer Diffuse to generate images with a transparent background directly (as opposed to stripping the image background with a post-processor).

As someone who’s dabbled in video game development in the past (which was in fact the reason I started the AI Horde), being able to generate sprites, icons and other assets can be quite useful, so once I saw this breakthrough, it immediately became something I wanted to support.

To use this feature, you simply need to flip on the transparent switch if your UI supports it, and the Horde will do the rest. If you’re an integrator, simply send “transparent: true” in your payload.
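For integrators, a request body with this flag might look like the minimal sketch below. Only the `transparent: true` key comes from the text above. The nesting under `params`, the `models` field and the example model name are assumptions made for illustration, so check the API documentation before relying on them.

```python
import json


def build_transparent_payload(prompt: str, model: str) -> dict:
    """Build a generation request body asking for a transparent background.

    Only the `transparent` key is taken from the post; the surrounding
    structure is an assumed layout for illustration.
    """
    return {
        "prompt": prompt,
        "models": [model],  # assumed field name
        "params": {  # assumed nesting for generation parameters
            "transparent": True,
        },
    }


payload = build_transparent_payload(
    "a pixel art treasure chest, game sprite",
    "example-model",  # placeholder, not a real model name
)
print(json.dumps(payload, indent=2))
```

Building the body in one small function keeps the flag easy to toggle from a UI switch.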

Take note that the images generated by this feature will not match the image you get with the same seed when transparency is not used! Don’t expect to take an image you like and remove the background this way. For that you need to use the post-processor approach.

Also keep in mind, not every prompt will work well for a transparent image generation. Experiment and find what works for you.

As part of making this update work, Tazlin and I also discovered and fixed a number of other issues and bugs.

What would be most interesting for you is a slight change in how hires-fix works. I discovered that the implementation we were using ran the same number of steps for the upscaled denoising pass, which was completely unnecessary and wasted compute. So we now use a smart system which dynamically determines how many steps to use for the hires-fix based on the denoising strength you used for hires-fix and the steps for the main generation. We also exposed a new key on the API where you can directly pass a hires-fix denoising strength.
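To give a feel for the idea, here is a minimal sketch of one way such a step count could be derived. The exact formula the Horde uses is not given here, so the function name `hires_fix_steps` and the "main steps multiplied by denoising strength" rule are assumptions, not the actual implementation.

```python
import math


def hires_fix_steps(main_steps: int, denoising_strength: float, minimum: int = 1) -> int:
    """Hypothetical heuristic: scale the second-pass steps by denoising strength.

    A low denoising strength only lightly changes the upscaled image, so it
    should need fewer steps than the main generation.
    """
    if main_steps < 1:
        raise ValueError("main_steps must be at least 1")
    if not 0.0 <= denoising_strength <= 1.0:
        raise ValueError("denoising_strength must be between 0 and 1")
    # Round up so a small strength still gets at least `minimum` steps.
    return max(minimum, math.ceil(main_steps * denoising_strength))


for strength in (0.25, 0.5, 1.0):
    print(f"strength={strength}: {hires_fix_steps(30, strength)} steps")
```

With a rule like this, only a denoising strength of 1.0 costs as many steps as the main generation, which is where the compute saving comes from.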

The second fix is allowing hires-fix on SDXL models, so now you can try to generate larger SDXL images at the optimal resolution.

Finally there were a lot of other minor tweaks and fixes, primarily in the horde-engine. You can read further for more development details on this feature.

This update required a significant amount of work as it required that we onboard a new comfyUI node. Normally this isn’t difficult, but it turns out this node was automatically downloading its own LoRA models on startup, and those were not handled properly for either storage or memory. Due to the efficiency of the AI Horde worker, we do a lot of model preloading along with some fancy footwork in regards to RAM/VRAM usage.

So to make the new nodes work as expected, I had to reach in and modify the methods which were downloading models so that they use our internal mechanisms such as the model manager. Sadly the model manager wasn’t aware of strange models like layer diffuse, so it required me adding a new catch-all class of the model manager for all future utility models like these.

While waiting for Tazlin to be happy with the stability of the code, we discovered another major problem: The face-fixer post-processors we were using until now had started malfunctioning, and generating faces with a weird gray sheen. After some significant troubleshooting and investigation, we discovered that ComfyUI itself on the latest version had switched to a different internal library which didn’t play well with the custom nodes doing the face-fixing.

First I decided to update the code of the face-fixer nodes we were using, which is harder than it sounds, as it also downloads models automatically on startup, which again needs to be handled properly. Updating the custom nodes fixed the codeformer face-fixer, but gfpgan remained broken and the comfyUI devs mentioned that someone would have to fix it. Unfortunately those nodes didn’t seem to be actively maintained anymore so there was little hope to just wait for a quick fix.

Fortunately another custom node developer had run into the same problems, and created a bespoke solution for gfpgan licensed liberally, which I could copy. I love FOSS!

In the meantime, through our usual beta testing process, we discovered that there was still some funkiness in the new hires-fix approach, and Tazlin along with some power users of the community were able to tweak things so that they could work more optimally.

All in all, quite a bit of effort in the past month for this feature, but now we provide something which along with the embedded QR Code generation, I’ve seen very few other GenAI services provide, if at all.

Will you use the new transparent image generation? If so, let us know how! And remember, if you have a decent GPU, you can help others generate images by adding your PC onto the horde!

Everything Haidra touches

The second fediverse canvas event just concluded and I’m very happy with how this turned out. In case you don’t know what this is, check out this post and then take your time to go and explore the second canvas in depth before it’s taken down, and look for all the interesting and sometimes even hidden pieces of pixel art.

This time I had a more interesting idea to participate. I decided to draw the Haidra Org logo. I didn’t expect massive support, but was pleasantly surprised with how many people joined in to help create it after my initial post about it and my announcement on the AI Horde discord server. Some frontends like horde-ng even linked to it with an announcement.

Almost as soon as it started, we ended up conflicting in our placement with someone who was drawing a little forest just below and to our left. I decided that they could have the foreground since we had plenty of space available, which avoided any fighting over pixels. All in all, we managed to complete it within half a day or so, which is pretty cool I like to think, and we even got a small “garden” so to speak.

The final form of the Haidra drawing, including the little forest below and two Stus

Afterwards I thought it would be interesting to have the Haidra tendrils “touch” various points of importance or sprites that I like. I decided to extend out as if we’re made of water and a lot of other “canvaseers” joined in to help which I found really sweet.

First we extended towards the (then) center of the canvas (top left on the featured image above), passing next to the Godot logo, below OSU and finally reached the explosion of the beams. That took most of the first day but people were still pretty active, even though the infrastructure of the event had already started buckling under its own success.

Fortunately as we could “flow” like water and even “go under” other pixelart, we didn’t encounter any resistance in our journey, and a lot of people gave us a helping hand as well.

Once this was achieved, on a whim I decided to double down on the “river” similarity, drew a little 17px pirate ship to show our roots and went to bed. When I woke up next morning, I was surprised to discover a Kraken was attacking it, making a really cool little display of collaborative minimalistic art.

Haidra pirate ship fighting a Kraken

This kind of thing is why I love events like these. I love emergent stuff like this, and seeing people putting their own little touches on what others started is awesome!

The next day the canvas had extended to be double in size and so a whole new area to the right was available. I had already noticed someone had created a little pirate banner towards the new canvas center, but it was alone and sad. So I decided we should try to give it a little bit of that Haidra embrace. So a long journey started with a new tendril to reach it. I had a rough idea of the path to follow as the direct route was blocked, but as soon as others started adding to it, it almost took on a life of its own on its journey.

Eventually, towards the middle of the second day, we reached it, passing under Belgium, through some letters and across the big under-construction trans flag before going over piracy. I then spawned yet another pirate ship before waterfalling down onto the mushroom house.

The path to piracy

At this point, the whole event took a dramatic turn as the performance problems had become so severe, that the admin decided to take the whole thing down to fix them, rather than let people get frustrated. This took half a dozen hours or so, and even though the event was extended by 24 hours to make up for it, the event momentum was kneecapped as well.

Once the canvas was back up for the third day, the next objective I had was a much longer journey to try and touch The Void that was extending from the top right. When I started, the path was still mostly empty, but as we moved towards it, the canvas became more and more congested, forcing us to take some creative detours to avoid messing with other art.

All in all, we flowed over the Factorio cog, creating a little lake and spawning a rubber duckie in the process. Then through the second half of the trans flag, which caused a minor edit war, as the canvaseers thought we were vandalizing. Then the way up and over the massive English flag was sorta blocked, so we had to take a detour and slither between the Pokemon to its left first.

Until finally we reached the top of the English flag, where I took a little creative detour to draw a little naval battle. My plan was to have an English brigantine fighting with two pirate sloops, but as soon as I finished it, others jumped in with their own plans. First one of my pirate ships revealed itself as a Spanish privateer instead (which I suspect was a reference to the recent football events). And then over the course of the next two days, the three ships kept changing allegiances every couple of hours. Quite the little mini-story to see unfold.

Finally we were almost at our final objective, only to discover that our final objective was not there anymore. The Void had been thoroughly contained and blocked by a massive cat butler (catler?). The only thing left to touch, was a single solitary void tendril on the top. Surprisingly, as soon as we reached it, it livened and flourished into life, which was certainly not my original idea, but I went with it happily.

Having achieved all I wanted to do, and with the event (and the day) drawing to a close, I decided there was no point setting any more goals and just let those interested extend Haidra on a whim. You can see my final post here, which also links to all my previous posts. Those contain some historic canvas images, showing the actual state of the board at the time of the posting.

All in all, I had a lot of fun, and enjoyed this way more than Reddit /r/place which is botted to hell and back, making contributions by individual humans practically meaningless. Due to the lack of significant botting, not only were one’s own pixels more impactful, but humans tended to mostly collaborate instead of having scripts mindlessly enforcing a template. This ended with a much more creative canvas, as people worked off others’ ideas and themes, and where there was conflict, a lot of the time a compromise solution was discovered where both pieces of art could co-exist.

The conflict points tended to be political, as so often happens. For example the Hexbears constantly trying to make the Nato flag into a swastika, or some people effectively rehashing the conflict around the Israel colonization of Palestine in pixel conflict form.

Some other things of interest:

  • I mentioned that the Spanish seem to have boarded and overtaken my pirate ship, and someone drew a little vertical ship coming up the stream for reinforcements. ❤️
  • Stus and AmongUs everywhere, sometimes in negative space, or only visible in the heatmap. Can you find them all?
  • The Void getting absolutely bodied when it tried to be destructive, but being allowed to extend a lot more when they actually played nice with other creations.
  • The amount of My Little Pony art is too damn high!
  • Pleasantly little national flag jingoism on display!
  • A very healthy amount of anarchist art and concepts and symbols. Well done mates! Ⓐ

See you next year!

Embedded QR Codes via the AI Horde

Around the same time last year, the first controlnet for generating QR codes with Stable Diffusion was released. I was immediately enamored with the idea and wanted to have it ASAP as an option on the AI Horde. Unfortunately due to a lot of extenuating circumstances [gesticulates wildly] I had neither the time, nor the skills to do it myself, nor the people who could help us onboard it. So this fell by the wayside while way more pressing things were being developed.

Today I’m very excited to announce that I have finally achieved and deployed it to production! QR code generation via the AI Horde is here!

Using it is fairly simple, assuming your front-end of choice supports it. You simply provide the text that you want represented as a QR code and the AI Horde will generate a QR code, and then using controlnet, will generate an image where the QR code is embedded into it, as if it’s part of the drawing. You can scan the examples below to see it in action.

You’ll notice that unlike some of the examples you’ll find online elsewhere, the QR code we generate is still fairly noticeable as a QR code, especially when zoomed out, or at a distance. The reason for this is that the more you blend the QR code into the image, the less likely it is to be scannable. The implementation I followed to achieve this result is specifically tailored to sacrifice “embedding” for the purpose of scannability.

So when you want to generate QR codes, you need to keep in mind that this is a very finicky workflow. The diffusion process can easily “eat” or modify some components of the QR code so that the final image is not readable anymore. The subject matter and model used matter surprisingly much. Subjects which are somewhat noisy (such as the brain prompt in the featured image above) tend to give the model enough to work with to reshape that area in a way that creates a QR code. Whereas no matter how hard I tried, I couldn’t get it to generate a QR code with an anime model and an anime woman in the subject.

Along with the basic option to provide the QR Code text, you can also customize some more areas from it. For example you can choose where the QR code will be placed in the image. By default we’ll always display it in the center, but sometimes the composition might be easier if you choose to place it on the side, or to the bottom. You can choose a different prompt for the anchor squares, increase or decrease the border thickness, and more. Your front-end should hopefully be explaining these options to you.

If you want to try and make some yourselves right now, I’ve added the necessary functionality to my Lucid Creations front-end already, so feel free to give it a try right now.

Continue reading further to get some development details.

The road leading to me making this feature available was fairly long. Other than all the other priorities I had for the horde, we also had the misfortune that one of our core contributors on the backend/comfyUI side went suddenly missing at the end of summer. As I am still more focused on the middleware/api and infrastructure (plus so much more, halp!) and Tazlin is focused on efficiency, and code maintenance & quality, we didn’t have the necessary skills to add something as complex as QR code generation.

Once it was clear that our contributor wasn’t coming back and nobody else was stepping up to help, I finally accepted that if I want it done, I have to learn to do that part myself as well. So in the past few months I embarked on a journey to start adding more and more complex comfyUI workflows. First came Stable Cascade which required me to build code which can load 2 different model files at the same time. Then Stable Cascade Remix which required that I wrangle up to 5 source images together.

Note that I’m mostly re-using existing fairly straightforward ComfyUI workflows which do these tasks. I don’t have the bandwidth to learn ComfyUI itself that much. But making said workflows function within the horde-engine with payloads that are sent via the AI Horde REST API is quite a complex amount of work on top of those. As I hadn’t built this “translation layer”, I was avoiding that area of the code until now, and this work helped me build up enough knowledge and confidence to be able to pull off translating a much much more complex ComfyUI workflow like the QR codes.

So after many months, I decided it was finally the time to tackle this problem. The first issue is getting an actually good QR Code ComfyUI workflow. Unlike the previous workflows I used, it’s surprisingly difficult to find something that works immediately. Most simple QR Code workflows both required that one generates the QR image externally and generated mostly unscannable images.

I was fortunate enough to run into this excellent devlog by Corey Hanson who not only provided instructions on what works and what doesn’t for QR codes, but even provided a whole repository with prebuilt ComfyUI workflows and a custom node which would also generate a QR code as part of the workflow. Perfect!

Well, almost perfect. Turns out the provided ComfyUI workflows were fairly old, and at the rate GenerativeAI progresses even a couple of months means something can easily be too stale to use. On top of that they were using a lot of extra custom nodes in their examples that didn’t parse, which a ComfyUI newbie like me had to untangle. Finally those workflows were great, especially for local use, but a bit overkill for the horde usage.

So first order of business was to understand, then simplify the workflow to just do the bare minimum needed to get a QR code. Honestly it took me a bit of time to simply get the workflow running in ComfyUI itself and half-way understand what all the nodes were doing. After that I had to translate it to the horde-engine format, which by itself required me to refactor how I parse all comfyUI workflows to make it more maintainable in the future.

Finally QR codes require a lot more potential text inputs, which I didn’t want to start explicitly storing in the DB as new columns as they’re used only for this specific purpose. So I had to come up with a new protocol for sending an open ended amount of extra text values. Fortunately I had already the extra_source_images code deployed so I just copied part of the same logic to speed things up.
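A minimal sketch of what an open-ended list of extra text values could look like follows. The key names `reference` and `text`, the helper names and the example values are all hypothetical, chosen only to illustrate the idea of sending named values without adding a DB column per value.

```python
from typing import Dict, List


def pack_extra_texts(values: Dict[str, str]) -> List[Dict[str, str]]:
    """Turn named text values into a flat list suitable for a JSON payload."""
    return [{"reference": name, "text": text} for name, text in values.items()]


def unpack_extra_texts(entries: List[Dict[str, str]]) -> Dict[str, str]:
    """Rebuild the name -> text mapping on the receiving side."""
    result: Dict[str, str] = {}
    for entry in entries:
        reference = entry["reference"]
        if reference in result:
            # Two entries with the same name would be ambiguous.
            raise ValueError(f"duplicate reference: {reference}")
        result[reference] = entry["text"]
    return result


# Hypothetical QR code options; the real option names may differ.
options = {"qr_code": "https://example.invalid", "placement": "center"}
packed = pack_extra_texts(options)
print(packed)
print(unpack_extra_texts(packed) == options)
```

Because the list is generic, a new workflow can introduce new named values without any schema change on the server side.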

And then it was time for unit tests and the public beta and all the potential bugs to fix. Which is when I realized that the results on SD 1.5 models were a bit…sucky, so I went back to ComfyUI itself and actually figured out how to make the workflow work with SDXL as well. The results were way more promising.

Unfortunately while the SDXL QR Codes are way nicer, the requirements to generate them are almost tripled compared to SD 1.5. Not only does one need to run SDXL models, but SDXL controlnets are almost as big as the models themselves. The QR code controlnet is 5GB on its own, and all that needs to be loaded in VRAM at the same time as the model. All this means that even middle-range GPUs struggle to generate SDXL QR codes in a reasonable amount of time. This meant that I also had to adjust the worker to give people serving SDXL models the option to skip SDXL controlnet, and also properly route this switch via the AI Horde.
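The routing side of such a switch can be sketched roughly as below. This is an illustration under assumptions rather than the Horde's actual code: the `Worker` and `Job` classes, their field names and the `can_take` function are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Worker:
    name: str
    models: FrozenSet[str]
    allow_sdxl_controlnet: bool  # hypothetical opt-in switch


@dataclass(frozen=True)
class Job:
    model: str
    is_sdxl: bool
    uses_controlnet: bool


def can_take(worker: Worker, job: Job) -> bool:
    """Return True if this worker should be offered this job."""
    if job.model not in worker.models:
        return False
    # Workers that opted out never receive SDXL controlnet jobs.
    if job.is_sdxl and job.uses_controlnet and not worker.allow_sdxl_controlnet:
        return False
    return True


small = Worker("mid-range", frozenset({"sdxl-example"}), allow_sdxl_controlnet=False)
big = Worker("high-end", frozenset({"sdxl-example"}), allow_sdxl_controlnet=True)
qr_job = Job(model="sdxl-example", is_sdxl=True, uses_controlnet=True)
plain_job = Job(model="sdxl-example", is_sdxl=True, uses_controlnet=False)

print([w.name for w in (small, big) if can_take(w, qr_job)])
print([w.name for w in (small, big) if can_take(w, plain_job)])
```

The point of the opt-out is that a mid-range worker keeps serving regular SDXL jobs while the heavier QR code jobs only go to workers that can afford the extra VRAM.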

Nevertheless, this is an area that makes the AI Horde shine, as those with the necessary power can support those who need it. Most people will find it really hard or frustrating to generate even a single QR code, never mind an SDXL one, only to discover that it’s unscannable, but through the horde they can easily generate dozens with very little expertise needed and find the one that works for them.

So it’s been a long journey, but it’s finally here, and the expertise I gained by achieving it also means that I now have enough knowledge to start adding more features via ComfyUI. So stay tuned to see more awesome workflows on the AI Horde!

Eudaimonia community

I thought it might be interesting to point out that I opened a new community in the Divisions by zero lemmy to post things about content living as I couldn’t find any other fitting space. There’s just not a lot of locations one can share articles and discuss about such topics that also don’t devolve into spiritualism or self-help guru grifts, both of which I intensely dislike.

So that community is for posting about such things in a materialistic context, with a preference for empiricism and scientific thinking, but more squishy secular philosophy is also encouraged for topics which don’t work too well empirically.

If you’ve been around, you probably know I’m going to be posting some Epicurus sooner or later 😀

Take a look and post some relevant stuff you run into.

Image Remix on the AI Horde

The initial deployment of the Stable Cascade (SC) on the AI Horde supported just text2image workflows, but that was just a subset of what this model can do. We still needed to onboard the rest of its capabilities.

One such capability was the “image variations” option, which allows you to send an image to the model, and get a variation of that image, perhaps with extra stuff added in, using the unClip technology. This required quite a bit of work on hordelib so that it uses a completely different ComfyUI workflow but ultimately this was not so much harder than just adding the img2img capabilities to SC.

The larger difficulty came when I wanted to add the feature to remix multiple images together. The problem being that until now the AI Horde only supported sending a single source image and a single source mask, so a varying amount of images was not possible at all.

So to support this, I needed to touch all areas of the AI Horde. The AI Horde had to accept and upload each of them on my R2 bucket and provide individual download links. The SDK had to know to expect them and provide methods to download those images in parallel to avoid delays. The reGen worker had to be able to receive those images and send them to hordelib, which should know how to dynamically adjust a comfyUI pipeline on-the-fly to add as many extra nodes as required.
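To illustrate why parallel downloads matter, here is a minimal sketch using Python's `asyncio`. It is not the SDK's code: `fetch_image` is a stand-in that sleeps instead of doing a real HTTP request, and the URLs are placeholders.

```python
import asyncio
import time
from typing import List


async def fetch_image(url: str) -> bytes:
    """Stand-in for a real HTTP download; sleeps to simulate latency."""
    await asyncio.sleep(0.1)
    return f"image-bytes-from:{url}".encode()


async def download_all(urls: List[str]) -> List[bytes]:
    # gather runs every download concurrently and keeps the input order.
    return list(await asyncio.gather(*(fetch_image(url) for url in urls)))


urls = [f"https://example.invalid/source/{i}.webp" for i in range(5)]
started = time.perf_counter()
images = asyncio.run(download_all(urls))
elapsed = time.perf_counter() - started
print(f"downloaded {len(images)} images in {elapsed:.2f}s")
```

Downloaded one after the other, five images with this simulated latency would take about half a second. Run concurrently, the total stays close to the time of a single download.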

So after 2 weeks of developing and testing, we finally have this feature available. If your Horde front-end supports the “remix” feature, you can send 1 to 6 images to this workflow along with a prompt, and it will try its best to “squash” them all together into one composition. Note that the more images you send, and the larger the prompt, the harder it will be for the model to “retain” all of them in the composition. But it will try its best.

As an example, here’s how the model remixes my own avatar. You’ll notice that the result can understand the general concepts of the image, but can’t follow it exactly as it’s not doing img2img. The blur is probably caused by the need to upscale my original image, which is something I’d like to fix on the next pass.

Likewise, this is the Haidra logo

And finally, here’s a remix of both logo and avatar together

Pretty neat, huh?

This ability to send extra source images also lays the groundwork for the Horde to support things like InstantID, which I hope I’ll be able to work on supporting soon enough.

The playground schematic analogy for designing a fediverse service.

In recent days, the discussion around Lemmy has become a bit…spicy. There’s a few points of impact here. To list some examples:

This is not an exhaustive list. There’s significant grumbling about lemmy under the mastodon hashtag too.

On the flipside, there’s also been positive reinforcement towards lemmy and its dev, as can be seen by the admin of lemm.ee and many of the lemmy ecosystem in that thread.

You’ll notice all of these are frustrations about the lack of sufficient moderation features in the tool-set of lemmy. This typically comes from a lemmy admin’s perspective and concerns the things that are very important for protecting themselves and their communities.

In the discussions around these issues, a few common arguments have been made, which while sounding reasonable at face value, I think are the wrong thing to say to the situation at hand. The problem is somewhat that the one making these arguments feels like they’re being more than fair, while the ones receiving them feel dismissed or disrespected.

Before I go on, I want to make clear that I am writing this out of a place of support. I have been supporting lemmy since years before the big lemmy exodus, and after I made my own lemmy instance, I have created dozens of third party tools to help the ecosystem, because I want lemmy to succeed. That is to say, I’m not a random hater. I am just dismayed that the community is splintering like this, out of what seems to me like primarily a communication issue.

So one of the analogies made in the sunaurus thread likened lemmy development to designing a playground. It strikes me that this analogy is perfect, but not for the reason the one making it expects. Rather, it is perfect for exemplifying how someone coming from wholeheartedly supporting FOSS developers might still misjudge the situation and escalate it through miscommunication.

In this analogy, the commenter likened lemmy development to building a playground, with external people asking for some completely unrelated feature, like a bird-watching tower, and expecting the developers to give it priority. The problem here is that the analogy is flawed. The developers are not building a playground for themselves. They’re building a playground schematic, which they expect people would and should deploy in many other locations.

Some people might indeed ask for “bird-watching observation posts” in such a schematic and it would be more than fair to ask them to build it themselves, but it is fallacious to liken any and all requests to something as out of scope as this. Some people might request safety features on the playground and those should absolutely be given more priority. We already know what can happen if you design a shitty playground, even if you give it for free!

To extend this analogy, the other lemmy admins are not asking for luxury features. They are asking for improvements in the safety of the playground. Some people point out that metal slides become dangerous based on the weather. Others point out that the playgrounds might be built in very unsafe areas, so a fence to protect the children from predators should be mandatory.

And here is where the disconnect in communication happens. The overworked developer is already busy designing the next slide which can get them paid, or making sure things don’t break down as fast etc, and they perceive the safety requests as “luxury” items someone should deal with themselves. However for the people who have to deal with upset parents and missing children, this dismissive attitude comes across as downright malicious.

And thus you have a situation where both sides see the other as unreasonable. The devs see the people asking for the safety features as entitled, while the people who are suffering through the lack of those safety features perceive the developers as out-of-touch and dismissive.

Leave such a situation to fester long enough, and you start to get the exact situation that we have now: the Lemmy software is starting to get a bad reputation in the areas most concerned about safety, while forks and rebuilds are popping up.

All of this hurts the whole FOSS ecosystem by splintering development effort into multiple projects instead of collaborating on a single one. It turns our strength into a weakness!

This also brings me to another argument I see lemmy devs making somewhat too often. That they don’t have anything to gain from a larger community and just get more headaches. I always felt this was a patently absurd statement!

The lemmy devs are making more than 3K a month from lemmy. Enough that they are claiming they’re working on lemmy full-time. These funds don’t come because they’re running or developing a single forum for themselves. They come because they provide the “playground schematic”. If the community splinters into other software than lemmy, naturally the funds going towards lemmy development will likewise dry up.

This statement is completely upside down. The more people there are using and hosting lemmy, the more the lemmy developers benefit.

I would argue, the people lemmy developers should be listening to most are exactly the people hosting lemmy instances. These are the people putting incalculable hours into running and maintaining the servers and often paying out of pocket per month, for giving a service to others. Each admin is basically free value to the lemmy developers.

My position is in fact that this scales down like layers. Lemmy developers need to listen to instance admins most. Instance admins need to listen to community mods most. And finally community mods should listen to their users most. In this way you create a bottom-up feedback mechanism that doesn’t easily overwhelm any single person and where everyone has a chance to be heard.

In the AI Horde, I follow a similar approach. The segment of my community I listen to the most is in fact not the ones who are giving me money. It’s the ones who are providing their free time and idle compute for no other benefit than their own internal drive: The workers. They are effectively using their time for the benefit of the whole ecosystem, which indirectly benefits me most. Without the workers, there would practically be no AI Horde, even if I were the only one remaining. Likewise, without lemmy admins, lemmy as a software would be dead, even if the lemmy.ml people kept hosting forever.

So what can be done here? I think an important aspect is to make sure we are talking on the same wavelength. Cutting down on miscommunication is very important to avoid exacerbating an already precarious situation.

Secondly, it’s totally understandable that lemmy devs don’t have enough time for everything. But likewise, there’s a ton of people who need safety features but can’t get them. As such, in my opinion priority should be put into making the frameworks that more easily allow people to extend lemmy functionality, even if it doesn’t match the lemmy developers’ vision or immediate roadmap.

For this reason I strongly suggest that effort should be put into developing a plugin framework for lemmy. It’s ironic to suggest a completely different feature when the problem is too many things to do, but this specific feature is meant to empower the larger community to solve their own problems more easily. So in the long term, it will massively reduce the incoming demands to the lemmy developers.

In the meantime, I do urge people to always consider that there’s always a human behind the monitor on the other side. A lot of the time people don’t have the skills to effectively communicate what they mean, which is even worse in text form. We all need to be a bit more charitable about what the other side is trying to say, especially when we’re trying to collaborate on a common FOSS project.

Post-Mortem: The massive lemmy.world -> lemmy.dbzer0.com federation delays.

A couple days ago, someone posted on /0 (the meta community for the Divisions by zero) that the incoming federation from lemmy.world (the largest lemmy instance by an order of magnitude) is malfunctioning. Alarmed, I started digging in, since a federation problem with lemmy.world will massively affect the content my community can see.

As always my first stop was the Lemmy General Chat on Matrix where I asked the lemmy.world admins if this appears to be something on their end. To their credit both their lead infra admin and the owner himself jumped in to assist me, changing their sync settings, adding custom DNS entries and so on. Nothing seemed to help.

But the problem must still be somewhere in lemmy.world, I thought. It’s the only instance where this is happening and they upgraded to 0.19.3 recently, so something must have broken. But wait, this didn’t start immediately after the upgrade. Someone pointed out this very useful federation status page, which kind of pointed to the problem being only on lemmy.world.

Not quite. Other big instances like lemmy.ml and lemm.ee were not having any issues with federation with lemmy.world (even though 2 dozen others like lemmy.pt were), and they are as big if not bigger than lemmy.dbzer0.com. A problem originating from lemmy.world cannot possibly be affecting only some specific instances. To make matters worse, both me and lemmy.ml are using the same host (OVH), so I couldn’t even blame my hosting provider somehow.

So obviously the main culprit is somewhere in my backend, right? Well, maybe. Problem is, none of the components of my infrastructure were overloaded, everything sitting between 5-15% utilization. Nothing to even worry about.

OK, so first I need to make sure it’s not a network issue somehow specifically between me and lemmy.world. I know OVH gave me a bum floating IP in the past and were completely useless at even understanding that their floating IP was faulty, so I had to stop using it. Maybe there’s some problem with my loadbalancers.

Still, I’m using haproxy, which is nothing if not fast and rock solid. So I didn’t really suspect the software. Rather, maybe it’s a network issue with the LB itself. So the first thing I did was double the number of loadbalancers in play, by setting my DNS record to point to my secondary LB at the same time. This should lessen the amount of traffic hitting each LB and even route some of it through a completely different VM, and thus indicate whether the problem is on the haproxy side. Sadly, this didn’t improve things at all.

OK so next step, I checked how long a request takes to return from the backend after haproxy sends it over. The results were not good.

I don’t blame you if you cannot read this, but what this basically says is that a request hitting a POST on my /inbox, took between 0.8 and 1.2 seconds. This is bad! This is supposed to be a tiny payload to tell you an event happened on another instance, it should be practically instant.

Even weirder, this is affecting all instances, not just lemmy.world. So this is clearly a problem on my end, but it also confused me. Why am I not having troubles with other instances? The answer came when I was informed that 0.19.3 added a brand new federation queue.

You see, the old versions of lemmy used to send all federation actions over as soon as they received them. Fire and forget style. This naturally led to federation events being dropped due to a myriad of issues, like network, downtimes, gremlins etc. So you would lose posts, comments and votes, and you would (probably) never realize.

The new queue added order to this madness, by making each instance send its requests serially. A request would be sent again and again until it succeeded. And the next one would only be sent if the previous one was done. This is great for instances not experiencing issues like mine. You see, at this point, I was processing approximately 1 incoming federation request per second, while lemmy.world was sending around 3. Even worse, I would occasionally time out as well by exceeding 10 seconds to process, causing 2 more seconds of wait time.
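To make the mismatch concrete, here is a minimal sketch of how a backlog grows when a strictly serial sender produces events faster than the receiver can acknowledge them. It is only an illustration using the rough rates from this post (about 3 events per second produced, about 1 second per request on my side); the function name is made up, and it ignores retries and timeouts entirely.

```python
def simulate_backlog(incoming_per_sec: float,
                     seconds_per_request: float,
                     duration_sec: int) -> list:
    """Return the sender's backlog size at the end of each simulated second.

    Assumes a strictly serial queue: the next request is only sent once the
    previous one has completed, so the receiver's per-request latency caps
    the throughput. Retries and timeouts are not modelled.
    """
    processed_per_sec = 1.0 / seconds_per_request
    backlog = 0.0
    history = []
    for _ in range(duration_sec):
        backlog += incoming_per_sec                       # new events queued by the sender
        backlog = max(0.0, backlog - processed_per_sec)   # events the receiver managed to ingest
        history.append(backlog)
    return history


# Roughly the situation described above: ~3 events/s in, ~1 s per request.
slow = simulate_backlog(incoming_per_sec=3, seconds_per_request=1.0, duration_sec=3600)
print(f"Backlog after one hour: {slow[-1]:.0f} events")
```

Under these assumptions the queue grows by about 2 events every second, which is why the delay kept getting worse rather than levelling off.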

Unlike lemmy.world, other instances federating to mine didn’t have nearly as much activity, so 1 per second was enough to keep in sync with them. This explained why I seemingly was only affected by lemmy.world and nobody else. I was somewhat slow, but only slow enough to notice if the source had too much traffic.

OK, we know the “what”, now we needed to know the “why”.

At this point I’m starting to suspect something is going on with my database. So I have to start digging into stuff I’m really not that familiar with. This is where the story gets quite frustrating, because there’s just not a lot of admins in the chat who know much about the DB side of lemmy internals. So I would ask a question, or provide logs, and then had to wait sometimes hours for a reply. Fortunately both sunaurus from lemm.ee and phiresky were around, who could review some of my queries.

Still, I had to know enough SQL to craft and finetune those queries myself, and how to enable things like pg_stat_activity etc.

Through trial and error we did discover that some insert/update queries were taking a bit too much time to do their thing, which could mean that we were I/O bound. Easy fix, disable synchronous_commit, sacrificing some safety for speed. Those slow queries went away, but the problem remained the same. WTF?!

There was nothing else clearly slow in the DB, so there was nothing more we could do there. So my next thought was, maybe it’s a networking issue between my loadbalancers and my backend. OK so I needed to remove that from the equation. I set up a haproxy directly on top of my backend which would allow me to go through the loopback interface and have 0 latency. For this I had to ask the lemmy.world admins to kindly add lemmy.dbzer0.com directly to their /etc/hosts file so they alone would hit my local haproxy.

No change whatsoever!

At this point I’m starting to lose my mind. It’s not networking between my LB and my backend, and it’s not the DB. It has to be the backend. But it’s not under any load and there’s no errors. Well, not quite. There’s some “INFO” logs which refer to lost connections, or unexpected errors, but nobody in the chat seems to worry about them.

Right, that must mean the problem is networking between my backend and my database, right? Unlike most lemmy instances, I keep my lemmy DB and my backend separated. Also, the DB has a limited amount of connections and lemmy backend itself limits itself to a small pool of connections. Maybe I run out of connections because of slow queries?

OK let’s increase that to a couple thousand and see what happens.

Nothing happens, that’s what happens. Same 1 per second requests.

As I’m spiraling more and more towards madness, and the chat is running out of suggestions, sunaurus suggests that he adds some extra debugging to lemmy and I will run that to try and figure out which DB query is losing time. Great idea. Problem is, I have to compile lemmy from scratch to do that. I’ve never done that before. Not only that, I barely know how to use docker in the first place!

Alright, nothing else I can do, got to bite that bullet. So I clone the lemmy backend and while waiting for sunaurus to come online, I start hacking at it to figure out how to make it compile a docker lemmy backend from scratch. I run into immediate crashes and despair. Fortunately nutomic (one of the core devs) walked by and told me the git commands to run to fix it, so I could proceed in cooking my very first lemmy container. Then nutomic helped me realize I don’t need to set up a whole online repo to transfer my docker container. The more you know…

Alright, so I cooked a container and plugged it onto a whole separate docker infra, which is only connected to the lemmy.world loadbalancer, so I can remove all other logs from anything but federation requests. So far so good.

Well, not quite, unfortunately I forgot that the “main” branch of lemmy is actually the development branch and has untested code in there. So when I was testing my custom docker deployment, I migrated my DB to whatever the experimental schema is on main. Whoops!

OK, nothing seemingly broke. Problem for a different day? No, just foreshadowing.

Finally sunaurus comes back online and gives me a debug fork. I eagerly compile and deploy it on prod and then send some logs to sunaurus. We were expecting we’d see 1 or 2 queries that were struggling, so maybe a bad lock situation somewhere. We did not expect we’d see ALL queries, including the most simple query such as looking up a language, take 100ms or more! That can’t be good!

Sunaurus connects the dots and asks the pertinent question: “Is your DB close to the Backend, geographically?”

Well, “Yes”, I reply, “I got them in the same datacenter”. “Can you ping?” he asks.

OK, I ping. 25ms. That’s good, right? Well, in isolation, that’s great. When it’s not so great is when talking about backend-to-DB communication! This is like 1000s of km of distance.

You see, typically a loadbalancer just makes one request to the backend and gets one reply, so a 25ms roundtrip is nothing. However a backend is talking to the DB a lot. In this instance, for every incoming federation action the backend does like 20 database calls, to verify and submit. Multiply each of these by a 25+25ms roundtrip and you’ve got 1000ms extra before any actual processing on the DB!
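The arithmetic above can be written out as a tiny sketch. The numbers are the rough ones from this post (about 20 database calls per federation action, 25+25ms per call); the function names are made up for illustration, and the model assumes the calls happen one after the other.

```python
def added_latency_ms(db_calls: int, ms_per_call: float) -> float:
    """Network wait added to one federation action by sequential DB calls."""
    return db_calls * ms_per_call


def max_actions_per_sec(latency_ms: float) -> float:
    """Upper bound on serially processed actions per second at this latency."""
    return 1000.0 / latency_ms


# Approximate figures from the post: ~20 DB calls, 25+25 ms per call.
latency = added_latency_ms(db_calls=20, ms_per_call=25 + 25)
print(f"{latency:.0f} ms of pure network wait per action")
print(f"at most {max_actions_per_sec(latency):.1f} actions per second")
```

That works out to roughly one action per second, which matches the ingestion rate I was seeing.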

But how did this happen? I’m convinced all my servers are in the same geographic area. So I go to my provider panel and check. Nope, all my servers BUT the backend are in the same geographic area. My backend happens to be around 2000 km away. Whoops!

Turns out, when I was migrating my backend back in the day after I ran into performance issues, I failed to pay attention to that little geographic detail. Nevertheless, it all worked perfectly well until this specific set of circumstances, where the biggest lemmy instance upgraded to 0.19.3, which caused serial federation that my slow-ass connection couldn’t keep up with. In the past, I would just get flooded by sync requests from lemmy.world as they came. I would be slow, but I’d process them eventually. Now, the problem became obvious.

Alright, it’s time to roll up my sleeves and migrate servers! Thank fuck I have everything written in Ansible as code, so the migration was relatively painless (other than slapping Debian 12 around to let me do fucking docker-compose operations with python, goddamnit!)

A couple of hours later, I had migrated my backend to the same DC as the database, and as expected, suddenly my ingestion time for federation actions was on the order of 50ms each, instead of 1000ms. This means I could ingest closer to 20 actions per second from lemmy.world, while it was only getting around 3 new ones per second from its userbase. Finally we started catching up!
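For a feel of how long catching up takes, here is a rough sketch. The ingest and incoming rates are the approximate ones from this post; the backlog size is a purely hypothetical number picked for illustration, since I am not quoting the real one here.

```python
import math


def catch_up_seconds(backlog: float,
                     ingest_per_sec: float,
                     incoming_per_sec: float) -> float:
    """Seconds until a backlog is drained, or infinity if it can never drain."""
    drain_rate = ingest_per_sec - incoming_per_sec  # net events cleared per second
    if drain_rate <= 0:
        return math.inf
    return backlog / drain_rate


# Hypothetical backlog of 100,000 queued actions (NOT a figure from this post).
seconds = catch_up_seconds(backlog=100_000, ingest_per_sec=20, incoming_per_sec=3)
print(f"Caught up after roughly {seconds / 3600:.1f} hours")
```

With the old rate of about 1 action per second against 3 incoming, the same function returns infinity: the backlog could never have drained on its own.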

All in all, this has been a fairly frustrating experience and I can’t imagine anyone who’s not doing IT Infrastructure as their day job being able to solve this. As helpful as the other lemmy admins were, they were relying a lot on me knowing my shit around Linux, networking, docker and postgresql at the same time. I had to do extended DB analysis, fork repositories, compile docker containers from scratch and deploy them ad-hoc etc. Someone who just wants to host a lemmy server would give up way earlier than this.

For me, a very stressful component was the lack of replies in the chat. I would sometimes write pages of debug logs, and there was no reply from anyone for 6 hours or more. It gave me the impression that nobody had any clue what to do to help me and I was on my own. In fact, if it wasn’t for sunaurus specifically, who had enough Infrastructure, Rust and DB chops to get an insight into where it was all going wrong, I would probably still be out there, pulling my hair out.

As someone hosting a service like this, especially when it has 12K people in it, this is very scary! While 2 lemmy core developers were in the chat, the help they provided was very limited overall and this session mostly relied on my own skills to troubleshoot.

This reinforced in my mind that as much as I like the idea of lemmy (or any of the other threadiverse SW), this is only something experts should try hosting. Sadly, this will lead to more centralization of the lemmy community into a few big servers instead of many small ones, but given the nature of problems one can encounter and the lack of support to fix them if they’re not experts, I don’t see another option.

Fortunately this saga ended and we’re now fully in sync with lemmy.world. Ended? Not quite. You see, today I realized I couldn’t upload images on my instance anymore. Remember when I started the development version of lemmy by mistake from main? Welp, that broke them. So I had to also learn how to downgrade a lemmy instance. Fortunately sunaurus had my back on this as well!

To spare some people the pain, I’ve sent a PR to the lemmy docs to expand the documentation for building docker containers and doing troubleshooting. My pain is your gain.

This also gave me an insight into how the federation of lemmy will eventually break when a single server (say, lemmy.world) grows big enough to start overwhelming even servers which are not badly set up like mine was. I have some ideas to work around some of this, so I plan to write a suggestion on how to become more future proof, which would incidentally prevent the same issue which happened to me in the first place.

In the meantime, enjoy the Divisions by zero, which as a result of the migration should now feel massively faster as well!

Stable Cascade on the AI Horde!

A while ago Stability.ai released a new model on a different architecture, that seems to provide very promising results and very fast training: Stable Cascade. I really wished to offer it on the AI Horde so after getting explicit permission from Emad in Reddit PMs (due to its more restrictive license for APIs), I set out to implement it.

Unfortunately the Stable Cascade model and ComfyUI workflow require the use of two different checkpoints, which went against the AI Horde worker paradigm at the time, which expected one file per model, so I had to make multiple changes in a lot of packages which expected this paradigm. The Worker, hordelib, the model reference and its SDK, all of them required tweaking to avoid crashing.

Fortunately, while the changes were complicated, I managed to implement them without much debugging. I did initially run into some troubles with the image quality being garbage, which, it turned out, required ComfyAnon tweaking the implementation on ComfyUI a bit. But once that was done, everything fell into place, and now you can use the AI Horde to request Stable Cascade images and therefore check the capability of this model, even if you don’t have 20G VRAM to spare.

You can try it out on Artbot

Alongside Stable Cascade, I thought it’s high time we start expanding our SDXL model selection, so the following models have also been onboarded.

  • Juggernaut XL
  • Anime Illust Diffusion XL
  • Pony Diffusion XL
  • Animagine XL
  • DreamShaper XL (Lightning version)

We quickly realized that we also need to expand our model reference to better inform people of the requirements for some of these models. For example Pony Diffusion XL doesn’t work unless you set clip_skip to 2, and DreamShaper requires low steps, cfg and specific samplers. If you know to set those settings correctly, you’ll get amazing images, else you get hot garbage. Soon the horde will be warning you when trying to use a model outside its specifications.
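As a rough idea of what such a warning could look like, here is a minimal sketch. It is not the actual horde model reference or its schema: the class, the field names and the request parameter keys are all made up for illustration. The clip_skip value for Pony Diffusion XL comes from the paragraph above, while the DreamShaper limits and sampler name are placeholder values only, not the real requirements.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModelRequirements:
    """Hypothetical per-model constraints; not the real model reference schema."""
    clip_skip: Optional[int] = None
    max_steps: Optional[int] = None
    max_cfg_scale: Optional[float] = None
    samplers: frozenset = frozenset()


REQUIREMENTS = {
    # clip_skip 2 is the requirement mentioned in the post.
    "Pony Diffusion XL": ModelRequirements(clip_skip=2),
    # PLACEHOLDER numbers and sampler name, chosen only to make the sketch runnable.
    "DreamShaper XL": ModelRequirements(
        max_steps=8, max_cfg_scale=2.0, samplers=frozenset({"k_dpmpp_sde"})
    ),
}


def check_request(model: str, params: dict) -> list:
    """Return human-readable warnings for settings outside a model's requirements."""
    req = REQUIREMENTS.get(model)
    if req is None:
        return []  # nothing known about this model, so nothing to warn about
    warnings = []
    if req.clip_skip is not None and params.get("clip_skip") != req.clip_skip:
        warnings.append(f"{model} expects clip_skip={req.clip_skip}")
    if req.max_steps is not None and params.get("steps", 0) > req.max_steps:
        warnings.append(f"{model} expects at most {req.max_steps} steps")
    if req.max_cfg_scale is not None and params.get("cfg_scale", 0.0) > req.max_cfg_scale:
        warnings.append(f"{model} expects cfg_scale <= {req.max_cfg_scale}")
    if req.samplers and params.get("sampler_name") not in req.samplers:
        warnings.append(f"{model} expects one of these samplers: {sorted(req.samplers)}")
    return warnings


print(check_request("Pony Diffusion XL", {"clip_skip": 1}))
```

Returning a list of warnings rather than rejecting the request keeps the behaviour in line with "warning you" instead of blocking the generation.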

Other than that, we haven’t been completely idle. Some other notable achievements in the previous weeks are:

Firstly, the AI Horde now supports an educator role for accounts. If you are an educational institution and you want to use one of the AI Horde free tools for the classroom, you can request your account to be set as an educator, which will force all your requests to be SFW and increase your account’s concurrency.

I also spent some time improving the AI Generation of the Mastodon bot @dungeons, so that it gets nicer images for each campaign protagonist. I will admit I had a lot more fun than I should have improving the versatility and variability of the generations and tweaking the results for each model. You can see (or follow) the results in the dedicated account replying with those images.

On the worker side, Tazlin has also been very busy improving the efficiency of our generations. We have now added some improvements such as downloading the loras for the next job while performing the inference for the previous one, or adding more efficiency for those people with more powerful machines.
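The idea of fetching the next job’s loras while the current job is running can be sketched as below. This is only an illustration of the overlap pattern, not the actual worker code: `run_jobs`, `download` and `infer` are hypothetical names, and the real worker’s process model may look quite different.

```python
from concurrent.futures import ThreadPoolExecutor


def run_jobs(jobs, download, infer):
    """Run jobs in order, downloading the next job's assets during inference.

    `download(job)` and `infer(job, assets)` are caller-supplied callables
    (hypothetical stand-ins for lora fetching and image generation).
    """
    results = []
    if not jobs:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(download, jobs[0])
        for index, job in enumerate(jobs):
            assets = pending.result()  # block until this job's downloads are done
            if index + 1 < len(jobs):
                # Kick off the next job's downloads in the background...
                pending = pool.submit(download, jobs[index + 1])
            # ...while this job's inference runs in the foreground.
            results.append(infer(job, assets))
    return results


# Toy usage with stand-in callables.
print(run_jobs(["job-a", "job-b"],
               download=lambda job: f"loras-for-{job}",
               infer=lambda job, assets: (job, assets)))
```

The point of the pattern is that download time for job N+1 hides behind the inference time of job N, instead of the two being added together.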

I’m now hard at work trying to onboard more Stable Cascade capabilities as they are added to ComfyUI and to add support for more advanced workflow capabilities.