The 150Mbit/s problem

Recently my provider sent me a nastygram about my database VPS using too much bandwidth, 150Mbit/s or more, for 10 days straight, informing me that they had already throttled it to 100Mbit/s to avoid affecting other customers.

This caught me by surprise. I know that my database is the central location where all my nodes converge to pull data, but the transfer between them should be just text, and 150Mbit/s would be an insane quantity of text.

Fortunately, my provider had also crashed my DB just a day earlier, on the weekend, and their lack of response outside working hours had forced me to urgently deploy a new VM with a new PostgreSQL DB; by the time they recovered, I had already switched all my nodes to use it. Nevertheless, on checking the new DB, I discovered that it too was constantly using the same incredible amount of bandwidth. This meant that my new DB VM was also on a timer, as Contabo throttles you if your VM uses too much bandwidth for 10 days in a row. I had to resolve this fast.

The first order of business was to change the code so that all source images and source masks used for img2img are also stored in my Cloudflare R2 CDN. img2img requests are about 1/6 of the total Stable Horde traffic, but until now their images were stored as base64 strings inside the database, which meant that whenever those requests were retrieved, say for a worker to pick one, all that data was transferred back and forth.
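For context, R2 exposes an S3-compatible API, so the change boils down to something like the sketch below. This is not the actual horde code; the bucket name, key layout and environment variable names are placeholders. The idea is simply to decode the base64 payload once on submission, upload it, and keep only the resulting object key in the request row.

```python
# Rough sketch, not the actual horde code: decode the base64 payload the client
# sent, push it to R2 via its S3-compatible API, and store only the object key
# in the database. Bucket name, key layout and env var names are placeholders.
import base64
import os
import uuid

import boto3

r2 = boto3.client(
    "s3",
    endpoint_url=os.environ["R2_ENDPOINT"],  # e.g. https://<account_id>.r2.cloudflarestorage.com
    aws_access_key_id=os.environ["R2_ACCESS_KEY"],
    aws_secret_access_key=os.environ["R2_SECRET_KEY"],
)

def store_source_image(b64_image: str) -> str:
    """Upload an img2img source image to R2 and return the key to store in the DB."""
    key = f"source_images/{uuid.uuid4()}.webp"
    r2.put_object(
        Bucket="stable-horde-sources",  # hypothetical bucket name
        Key=key,
        Body=base64.b64decode(b64_image),
        ContentType="image/webp",
    )
    return key  # the DB row now holds this short string instead of megabytes of base64
```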

This made a small dent in the traffic, but not nearly enough. I was still at 150Mbit/s outside rush hours and 200Mbit/s during peak demand.

I started getting a bit desperate, as I had expected that to be a big part of the problem. At this point I decided to open a Discord thread to ask my community for help in debugging this, as I was getting out of my depth. The best suggestion came from someone who told me to enable pg_stat_statements.
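For anyone who hasn't used it: pg_stat_statements is a Postgres extension that records per-statement execution statistics. Roughly, you preload it, create the extension, and then query the view; a minimal sketch with a placeholder connection string:

```python
# Minimal sketch of reading pg_stat_statements with psycopg2 (placeholder DSN).
# The extension also needs `shared_preload_libraries = 'pg_stat_statements'`
# in postgresql.conf and a server restart before it starts collecting stats.
import psycopg2

conn = psycopg2.connect("dbname=horde user=postgres")  # placeholder connection string
with conn, conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS pg_stat_statements;")
    # The most used queries: statements sorted by how often they are called.
    cur.execute(
        """
        SELECT calls, rows, query
        FROM pg_stat_statements
        ORDER BY calls DESC
        LIMIT 20;
        """
    )
    for calls, rows, query in cur.fetchall():
        print(f"{calls:>10} calls  {rows:>12} rows  {query[:80]}")
```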

That, along with the query to retrieve the most-used queries, led me to find a few areas where I could make some improvements through caching. One example was workers retrieving the list of models from the DB all the time, which I now cache on Redis instead.
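The model-list change was essentially the cache-aside pattern below. The key name, TTL and DB helper are made-up placeholders rather than the real horde code: check Redis first, fall back to the DB, and store the result with a short expiry.

```python
# Cache-aside sketch for the model list using redis-py. The key name, TTL and
# the DB helper below are illustrative placeholders, not the real horde code.
import json

import redis

redis_client = redis.Redis(host="localhost", port=6379, db=0)
MODELS_CACHE_KEY = "horde:models"  # hypothetical key
MODELS_CACHE_TTL = 30              # seconds

def query_models_from_db():
    # Placeholder for the original expensive query against the DB.
    return []

def get_models():
    cached = redis_client.get(MODELS_CACHE_KEY)
    if cached is not None:
        return json.loads(cached)
    models = query_models_from_db()
    redis_client.setex(MODELS_CACHE_KEY, MODELS_CACHE_TTL, json.dumps(models))
    return models
```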

Unfortunately none of my tweaks seemed to make much difference in the bandwidth. I was starting to lose my mind. Fortunately someone mentioned the number of rows retrieved, so I decided to sort my Postgres statements by the number of rows they retrieved, and that finally got me the thing I was looking for. There was one statement which retrieved a number of rows two whole orders of magnitude higher than every other statement!
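If you want to reproduce that sort, it's just a different ORDER BY on the same view; another small sketch with a placeholder DSN:

```python
# Same pg_stat_statements view as before, sorted by total rows returned rather
# than by call count; this is the kind of sort that surfaced the culprit.
import psycopg2

with psycopg2.connect("dbname=horde user=postgres") as conn, conn.cursor() as cur:
    cur.execute(
        "SELECT rows, calls, query FROM pg_stat_statements ORDER BY rows DESC LIMIT 20;"
    )
    for total_rows, calls, query in cur.fetchall():
        print(f"{total_rows:>14} rows  {calls:>10} calls  {query[:80]}")
```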

That statement was an innocent-looking query to retrieve all the performance statistics for all workers, which I then averaged in Python. Given hundreds of workers, 20 rows per worker, and this statement being executed once per second per status check on a request, you can imagine the amount of traffic it generated!
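The shape of the problem was essentially "pull every stat row over the wire and average it in Python", something like this hypothetical reconstruction (table and column names are invented for illustration):

```python
# Hypothetical reconstruction of the wasteful pattern: every status check pulled
# every performance row for every worker, only to average them in Python.
# Table and column names are invented for illustration.
def get_average_performance(cur):
    cur.execute("SELECT performance FROM worker_performance;")  # thousands of rows over the wire
    samples = [row[0] for row in cur.fetchall()]
    return sum(samples) / len(samples) if samples else 0
```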

My initial stop was to cache it on Redis as well. That just shifted the problem: while there wasn't a load on the DB any more, the amount of traffic stayed the same, just on a different VM. So the next step was to try to cache it locally. I initially turned to Python cachetools and their TTL function caching. That seemed to work, but it was a big mistake: cachetools is absolutely not thread-safe, and my code relies on the waitress WSGI server, which spawns dozens of threads all over the place.
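The attempt looked roughly like the snippet below (the TTL and the function body are illustrative). cachetools does no locking on its own, so under a threaded WSGI server like waitress, concurrent expiry and insertion on the shared cache can blow up.

```python
# Roughly what the first local-cache attempt looked like: memoize the expensive
# lookup with a TTL. The TTL and the function body are illustrative.
from cachetools import TTLCache, cached

def fetch_stats_from_db():
    # Placeholder for the real query pulling thousands of stat rows.
    return []

@cached(cache=TTLCache(maxsize=1, ttl=30))
def get_worker_performance_stats():
    return fetch_stats_from_db()
```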

So one day later, I got reports of random 500 errors and other problems. I looked in to find my logs spewing hundreds of missing-key errors from the cache. Whoops!

I did make an attempt to see how I could make cachetools thread-safe, but that involved adding Python locks, which would delay everything too much for something that ultimately isn't needed. So instead I just created my own simple cache using global vars and built-in types, which are thread-safe by default. That solved the crashing issue.
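A minimal sketch of that approach, with illustrative names and TTL: a module-level dict (shared by all waitress threads) holding the value and its expiry. Individual dict reads and writes are atomic under CPython's GIL, so the worst case in a race is two threads recomputing the value, which is harmless here.

```python
# Minimal hand-rolled TTL cache sketch using only built-in types; names and TTL
# are illustrative. Each dict read/write is a single atomic operation.
import time

_CACHE_TTL = 30.0  # seconds, illustrative
_stats_cache = {}  # module-level, shared by all waitress threads

def _fetch_stats_from_db():
    # Placeholder for the real (expensive) DB call.
    return []

def get_cached_stats():
    entry = _stats_cache.get("stats")  # one atomic dict read; no KeyError possible
    if entry is not None and entry[1] > time.monotonic():
        return entry[0]
    value = _fetch_stats_from_db()
    _stats_cache["stats"] = (value, time.monotonic() + _CACHE_TTL)
    return value
```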

But then I remembered that I'm stupid: instead of pulling thousands of rows of integers so I can compute an average in Python, I can just ask PostgreSQL to calculate the average and return only that final number to me. Duh! This is what happens when my head is not yet tuned into using databases correctly.
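Conceptually, the fix is just to aggregate in the database and transfer a single row (names still invented for illustration):

```python
# The fix, conceptually: let Postgres do the aggregation and return one number.
# Table and column names are invented for illustration.
def get_average_performance(cur):
    cur.execute("SELECT AVG(performance) FROM worker_performance;")
    return cur.fetchone()[0] or 0  # AVG returns NULL on an empty table
```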

So finally, I wiped my internal cache and switched to that approach. And fortunately that was it! The bandwidth on my database server dropped from an average of 150Mbit/s to 15Mbit/s. A 10x reduction!

I made an itch.io page

I always knew I had to set up CI/CD for my development, but I just hadn't given it any priority yet. As such, until now I was still manually exporting Hypnagonia whenever I had a stable release, then manually uploading the HTML to my own site through scp, ssh-ing to the site, and copying the files to the web directory. All so that I could link people to a bland page where they could play the game online.

And for more feature-packed releases, I would also export the versions for Linux, Windows and Mac, create a new GitHub release and upload them there.

It was an unnecessary manual procedure and it always bothered me, but I had other priorities and kept pushing back any improvement on that front.

That is, until yesterday, when I ran into this video introducing DevOps practices to game development:

And that kinda shamed me into going through with it.

Unfortunately, interacting with my own webhost was always hacky and felt amateurish. So I decided to onboard Hypnagonia to itch.io. This provides me some benefits: as part of the community there, I can start getting more organic discovery; it provides some interesting metrics; and, very importantly, it's an established platform, which means there's plenty of guides and products to integrate it easily into my Continuous Delivery pipeline.

Unfortunately, I realized that embedding the game into itch.io doesn’t play well with my current viewport size and the “static” way I had designed it. And that was a rabbit hole…

First I adjusted my main menu screens to resize automatically according to the viewport size. That worked OK-ish for resolutions that aren't ridiculously small. It's fine, we can look at that later.

But then I discovered that even small reductions in the viewport make the cards look massive, as they are never scaled. So I needed to shrink cards according to the viewport size somehow…

Long story short: I spent around 6 hours yesterday just giving the Godot Card Game Framework the functionality to handle smaller resolutions somewhat gracefully. It's not perfect, but it's good enough to allow the itch.io embed.

And since I was at it, I made some quality-of-life improvements, and wrote about them in an itch.io devlog.

I’m going to be writing some devlogs there now and then. Not sure how I will handle the integration with this blog. Maybe just post a link to them here when I write them? What do you think?