AI-powered anti-CSAM filter for Stable Diffusion

One of the big problems we’ve been fighting against since I created the AI Horde has been attempts to use it to generate CSAM. While this technology is very new and there are a lot of questions to answer about whether generating CSAM for personal use is even illegal, I erred on the side of caution and made it a rule from the start: the one thing that is banned on the AI Horde, without exception, is such generated content.

It is not an overstatement to say I’ve spent weeks of work-hours on this problem: from adding capabilities for workers to set their own comfort level through a blacklist, a censorlist and a bunch of other variables, to blocking VPN access, to the massive horde-regex filter that sits in front of every request and tries to ascertain from the prompt whether it intends to generate CSAM or not.

However, the biggest problem is not just pedos, it’s stupid but cunning pedos! Stupid because they keep trying to use, without a VPN, a free service which records all their failed attempts. Cunning because they keep looking for ways to bypass our filters.

And that’s where the biggest problem lay until now. The regex filter is based on language, which is not only flexible about expressing the same concept, but, very frustratingly, the AI understands multiple typos of various words, and other languages, perfectly well. This strains what I can achieve with regex to the breaking point, and led to a cat-and-mouse game where dedicated pedos kept trying to bypass the filter using typos and translations, and I kept expanding the regex.

But it was inherently a losing game which was wasting an incredible amount of my time, so I needed a more robust approach. My new solution was to onboard image interrogation capability into the worker code, using image2text, AKA image interrogation. It’s basically an AI model to which you feed an image and a number of words or sentences, and it tells you how well each of those words is represented in that image.
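To make the idea concrete, here’s a toy sketch of what interrogation boils down to: CLIP embeds the image and each word into the same vector space and compares them by cosine similarity. The embeddings below are made-up three-dimensional toys (real CLIP vectors have hundreds of dimensions, and the real worker uses an actual CLIP model), but the scoring logic is the same shape.

```python
import numpy as np

def interrogate(image_embedding: np.ndarray,
                word_embeddings: dict[str, np.ndarray]) -> dict[str, float]:
    """Score how well each word matches the image via cosine similarity,
    the same measure CLIP uses to compare image and text embeddings."""
    def cosine(a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return {word: cosine(image_embedding, emb)
            for word, emb in word_embeddings.items()}

# Toy, made-up embeddings purely for illustration:
image = np.array([0.9, 0.1, 0.2])
words = {
    "child": np.array([0.8, 0.2, 0.1]),
    "adult": np.array([0.1, 0.9, 0.3]),
}
scores = interrogate(image, words)
# An image "closer" to a word's embedding gets a higher score for it.
```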

So what we’ve started doing is that every AI Horde Worker will now automatically scan every image it generates with CLIP, looking for a number of words. Some of them look for underage context, while others look for lewd context. The trick is that detecting one context or the other on its own is OK: you’re allowed to draw children, and you’re allowed to draw porn. It’s when the two combine that the filter goes into effect and censors the image!
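The decision rule itself is simple. Here’s a minimal sketch of the "both contexts must fire" logic; the word lists and threshold are illustrative stand-ins, not the actual lists or tuned values from the worker code.

```python
# Illustrative word lists and threshold; the real lists and tuned
# values live in the AI Horde Worker code.
UNDERAGE_WORDS = ["child", "infant", "teen"]
LEWD_WORDS = ["nude", "explicit"]

def should_censor(scores: dict[str, float], threshold: float = 0.2) -> bool:
    """Censor only when BOTH an underage word and a lewd word score highly.
    Either context alone is allowed."""
    underage = any(scores.get(w, 0.0) >= threshold for w in UNDERAGE_WORDS)
    lewd = any(scores.get(w, 0.0) >= threshold for w in LEWD_WORDS)
    return underage and lewd

# Only the combination trips the filter:
# {"child": 0.25, "nude": 0.23} -> censored
# {"child": 0.25, "nude": 0.05} -> allowed
```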

But this is not even the whole plan. While the CLIP scanning on its own is fairly accurate, I further tweaked my approach by taking into account the values of other interrogated words. For example, I noticed that when looking for “infant” in a generated image, pregnant women would also rate very highly for it, causing the CSAM filter to consistently censor naked pregnant women. My solution was to also interrogate for “pregnant”, and if its likelihood is very high, raise the threshold “infant” needs to hit.
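That confounder correction can be sketched like so. Every number here is an illustrative assumption, not the Horde’s tuned value; the point is only the mechanism of raising one word’s bar when a related word scores highly.

```python
def infant_threshold(scores: dict[str, float],
                     base: float = 0.2,
                     bump: float = 0.06,
                     pregnant_cutoff: float = 0.25) -> float:
    """Raise the bar "infant" must clear when "pregnant" also scores
    highly, so naked pregnant women stop being falsely censored.
    All numbers are illustrative, not the Horde's tuned values."""
    if scores.get("pregnant", 0.0) >= pregnant_cutoff:
        return base + bump
    return base
```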

The second trick was to also utilize the prompt. A lot of pedos were trying to bypass my filters (which were looking for words like “young”, “child” etc.) by avoiding those words and instead specifying “old”, “mature” etc. in the negative prompt, effectively going the long way around to make Stable Diffusion draw children without explicitly asking for them. This was downright impossible to block with pure regex without causing a lot of false positives or an incredible amount of regex crafting.

So I implemented a little judo trick instead. My new CSAM filter now also scans the prompt and negative prompt for certain words using regex, and if they exist, slightly adjusts the interrogated word weights based on the author’s apparent intent. So let’s say the author used “old” in the negative prompt: this automatically causes the “child” weight to increase by 0.05. This may not sound like a lot, but most words tend to vary from 0.13 to 0.22, so it actually has a significant chance of pushing a borderline word (which it would be in a genuine CSAM attempt) over the top. This converts the true/false result of a regex query into a fine-grained approach, where each regex hit reduces the detection threshold only slightly, leaving non-CSAM images unaffected (since the weight of the interrogated word starts low) while making it more likely to catch the intended results.
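The mechanism reads roughly like this sketch. The regex-to-word pairing and the 0.05 boost mirror the example above, but the actual tables and patterns live in the Horde code, so treat these as placeholders.

```python
import re

# Illustrative mapping: a regex hit in the negative prompt boosts the
# weight of a related interrogation word. The 0.05 boost matches the
# example in the text; the pattern and pairings are placeholders.
NEGATIVE_PROMPT_BOOSTS = [
    (re.compile(r"\b(old|mature|adult)\b", re.IGNORECASE), "child", 0.05),
]

def adjust_for_intent(scores: dict[str, float],
                      negative_prompt: str) -> dict[str, float]:
    """Nudge interrogated weights up when the negative prompt betrays
    intent, e.g. negating "old" to push the model toward younger subjects."""
    adjusted = dict(scores)
    for pattern, word, boost in NEGATIVE_PROMPT_BOOSTS:
        if pattern.search(negative_prompt):
            adjusted[word] = adjusted.get(word, 0.0) + boost
    return adjusted
```

A borderline “child” weight of 0.16 becomes 0.21 when “old” appears in the negative prompt, while images whose prompts show no such intent keep their original, low weights.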

Now, the above is not a perfect description of what I’m doing, in the interest of keeping things understandable for the layperson, but if you want to see the exact implementation you can always look at my code directly (and suggest improvements 😉 ).

In my tests, the new filter has fairly good accuracy with very few false positives, mostly around anime models, which tend to make every woman look extraordinarily young. But in any case, with the amount of images the horde generates, I’ll have plenty of opportunity to continue tweaking, and maybe craft more specific filters for each model type (realistic, anime, furry etc.).

Of course I can never expect this to be perfect, but that was never the idea. No such filter can catch everything; my hope is that this filter, along with my other countermeasures like the regex filter, will have a high enough detection rate to frustrate even the most dedicated pedos off the platform.