Recently my provider send me a nastygram about my Database VPS using too much bandwidth, 150Mbit/s or more, over 10 days, and how they have already throttled it to 100Mbit/s to avoid affecting other customers.
This caught me by surprise as I know that my Database is the central location where all my nodes converge to pull data, but the transfer between them should be just text. 150 Mbit/s would be insane quantities of text.
Fortunately my provider has also crashed my DB just a day before on the weekend, and their lack of response outside working hours forced me to urgently deploy a new VM with a new postgres DB until they recovered and I had switched all my nodes to use that already. Nevertheless, on checking the new DB, I discovered that it too was using the same incredible amount of bandwidth constantly. This meant that my new DB VM was also on a timer as Contabo throttles you, if your VM takes too much bandwidth for 10 days in a row. I had to resolve this fast.
First order of business was to swap the code so that all source images and source masks used for img2img are also stored in my cloudflare r2 CDN. The img2img requests are about 1/6 of the total stable horde traffic, but until now they were stored as base64 strings inside the database, which means that whenever those requests were retrieved, say for a worker to pick one, they transferred all that data back and forth.
This made a small dent in the traffic, but not nearly enough. I was still at 150Mbit/s outside rush hours and 200Mbit/s during peak demand.
I started getting a bit desperate as I expected that was a big part of this. At this point I decided to open a discord thread to ask my community for help in debugging this as I was getting out of my depth. The best suggestion came from someone who told me to enable pg_stat_statements.
That, along with the query to retrieve the most used queries, lead me to find a few areas where I could do some improvements through caching. One was the worker retrieving the models from the DB all the time for example, which I instead made it cache on redis.
Unfortunately none of my tweak seemed to do much difference in the bandwidth. I was starting to lose my mind. Fortunately someone mentioned the amount of rows retrieved so I decided to sort my postgres statements by amount of rows retrieved and that finally got me the thing I was looking for. There was one statement which retrieved a number of rows two whole orders of magnitude more than every other statement!
That statement was an innocent looking query to retrieve all the performance statistics for all workers, which I then created an average for. Given hundreds of workers and 20 rows per worker, and this statement being retrieved once per second per status check on a request, you can imagine the amount of traffic it generated!
My initial stop was to cache it on redis as well. That just shifted the problem because while there wasn’t a load on the DB, the amount of traffic stayed the same, just on a different VM. So the next step was to try and cache it locally. I initially turned to python cachetools and their TTL function caching. That seemed to work but it was a big mistake. The cachetools are absolutely not thread safe and my code relies on python waitress WSGI server, and that one spawns dozens of threads all over the place.
So one day later, I get reports of random 500 errors and other problems. I look in to find my logs spewing hundreds of logs about missing key errors on the cache. Whoops!
I did make an attempt to see how I can make cachetools thread safe, but that involved adding python locks which would delay everything too much, on ultimately something that is not needed. So instead I just created my own simple cache using global vars and built-in types which are thread-safe by default. That solved the crashing issue.
But then I remembered that I’m stupid and instead of pulling thousands of rows of integers so I can make an average using python, I can just ask PostgreSQL to calculate the average and simply return that final number to me. Duh! This is what happens when my head is not yet tuned into using Databases correctly.
So finally, I wiped my internal cache and switched to that approach. And fortunately that was it! My Mbit/s on my database server dropped from 150 average, to 15! a 10x reduction!