Update from the development department

Bubbles, FC Sunnyvale Developer 25 October 2019, 15:33

Phew, we've had our work cut out for us this past month. Let me tell you a little about what we have been up to.

The Season update October 20th

As you may have noticed, there were issues with the season update on Sunday.

This happened because we were running a new fixtures creation algorithm, which aims to speed up the season update.

Since moving to our new server host, the season update has been somewhat slow. The server that executes it has the fastest CPU available from our new host, and on paper it is faster than the old one. But, when the season update is runing, we can see that neither the server nor our database are consuming very many of their resources, so our assumption is that it may be due to slightly higher latency on the network. And since the season update is a process that performs a huge amount of small database reads and writes in just one single thread, just a slightly higher latency on the network might be enough to cause the two servers to waste a lot of time waiting. A few weeks ago we also fixed a bug which had meant that bot clubs hadn't been topped up for a long time (the English league was completely empty of bots). Once all these missing bots were created, the season update became significantly slower.

That's why we implemented a multi-threaded process for creating the fixtures. The new process runs 1 thread per country, so the fixtures are created concurrently, instead of one country at a time.

However, multi-threaded programming can be risky and a tiny mistake can have crazy consequences.

What happened was that the correct number of matches were set up against the right opponents, but the meta-data that makes a connection between a club and their matches, and between a match and its participants' tactics, was created 14 times per match instead of 1.

2x14 extra rows of data per match may not sound like much, but it nevertheless caused the match pages to die completely and take the rest of the site down with it. Loading the fixtures page resulted in the creation of objects for:

30 matches x 14 club / match links x 14 match / tactics links x 2 clubs (you and your opponent) x 11 players + all sorts of other data = hundreds of thousands of objects that swallowed up all the server's RAM. This caused the entire site to run very slowly Sunday night.

In addition, even though the matches were set up with the correct divisions and group, they were in random countries!

In the end, we had to delete all the matches and create entirely new fixtures using the old method.

When we implement risky or critical things, such as changes to the season update, we also write a number of automatic tests. Automatic testing is a small bit of code that sets up data for a specific scenario, runs it through part of our code, and then checks to see if the result is correct. Today we have approx. 1600 tests, with over 5000 individual checkpoints that we run whenever we implement anything new.

But, of course, tests don't write themselves, so we can only test for things that we're able to imagine going wrong. Our tests checked if the correct number of matches were created against the right opponents - which they were - but they didn't look at the metadata where the problem arose.

So this week we fixed the error and implemented a number of extra measures against anything similar happening again. For starters, we wrote a number of extra tests that check the things that went wrong last Sunday.

In addition, we put a unique index on the two database tables where all that extra data was erroneously created. This ensures that if we ever again happen to write any code that tries to insert more than 1 match-to-club link for the same club and match, the database will block it completely, and our code will crash immediately. And, it is always best if flawed code crashes immediately instead of running on and writing 10 million rows of rubbish data, which is only discovered later when it causes problems elsewhere in the system.

There was some good news though! The creation of the fixtures was actually really fast - despite the fact made it also created almost 10 million extra rows of rubbish data.

503 Error / Load balancer

In recent weeks, we have also managed to fix a problem that we experienced after moving to the new server host.

In our new setup we have 3 web servers that sit behind a load balancer. The load balancer receives all the requests coming in and distributes them among the 3 web servers. Periodically, it checks each server's "health" by sending it a message and seeing if it responds. If this health check fails, then the load balancer takes that server out of the rotation. If all 3 fail the health check, then a user will see a white page with a "503" error.

What we experienced with the new setup was that the health checks sometimes started to fail, although we could see that the servers were working fine and had very little load. It was a total mystery, because it came and went, and we could see in our server logs that the servers responded to all the health checks they received from the load balancer. That was all we were able to see, because we do not have direct access to the load balancer and its logs.

It took a lot of going back and forth with the supporters at our new host before we found out it was due to a network problem. Our web servers are tucked away behind a firewall that rejects all outside traffic, and only lets traffic through that comes from the local area network.

What we figured out was that our host could not actually guarantee that the load balancer's health check always came from the local area network. The health checks come from many different IPs, and at times we were unlucky that they all came from an IP that was blocked by the firewall, which was the cause of our problem. The load balancer would then think that the servers were down, and then didn't send any traffic through until health checks came through from other IPs.

Eventually, however, the supporter dug up a newly implemented and undocumented setting that meant that the firewall would always let traffic through from the load balancer, and since we got that set up, there have been no more 503s (except on Sunday, where the servers were actually in pretty poor health :P).

Upgrade to Rails 4.2

Upgrading to Rails 4.2 and Ruby 2.4.9 is also complete and all servers have been running exclusively on Rails 4 for the past 2 weeks. It was a really big job, but also necessary since our previous versions of Rails and Ruby were no longer getting security updates.

Very old bug in the tactics editor

This week we also uploaded a fix to a very old issue in the tactics editor, which has been a bit of a mystery. It was hard to find because it only happened sporadically and because we couldn't recreate it ourselves. And, when you can't see the problem yourself and may not even be sure that it exists, then it's obviously quite impossible to fix it.

We have occasionally been receiving reports that changes in lineups didn't get saved and that the manager didn't think that they had sold any players while the editor was open. But, of course, these reports only come several days later, when they discover a gap in their lineup, making it nearly impossible to investigate. We could never reproduce it, no matter how many times we dragged players around the lineup.

But some time ago we were finally able to recreate the problem, and once the details became came to light, it was clear why it was so difficult to find.

The problem could only arise in 2 very specific situations:

  1. When dragging a player from the reserves to an empty slot on the pitch.
  2. When dragging a player from the field to an empty slot in the reserves.

Any other action, such as dragging players back and forth between the pitch and the player list, or between the reserves and the player list, or dragging him to a place where there was already a player were not affected.

When moving a player from the pitch to the reserves, two things must happen: first we have to remove him from the pitch, and only then can we add him to the reserve list.

In these 2 very specific situations, the tactics editor would send 2 separate messages to the server:

  1. Remove player X from the reserve list
  2. Insert player X on the field

...and the other way around

The problem was that due to the way the internet works, the messages cannot be guaranteed to arrive at the server in the order they were sent.

So what could happen was that the message saying to add the player to the pitch would arrive BEFORE the message saying to remove him from the reserve list.

But a player cannot exist in both places at the same time - not even for a split second - so the server would refuse to put the player on the pitch because he was still in the reserves. Once the first message then finally arrived, the player would be removed from the reserve list, but it would be too late and you'd be left with a gap in our lineup.

For all of the other ways you can move a player, only 1 message is sent. For example, if you drag a player into a space that is not empty, then the editor would instead send 1 message saying that "player X and Y should trade places", and then timing isn't an issue.

This also became the solution to the problem. We have implemented a way in which moving a player from the reserves to the pitch - or vice versa - requires only 1 message to the server. It then removes the player from one place and puts him in the other in the right order.

But why were we suddenly able to reproduce the problem ourselves when we had never been able to before?

We have a theory that this is due to the server move. Previously, the site was hosted in Denmark (where we live), and we only had 1 web server that processed all requests.

The new server setup is located in Frankfurt, and requests first go through Cloudflare, then our load balancer, and then on to 1 of our 3 web servers. The longer route to the servers makes it more likely that the Internet will send the 2 messages on different routes, which may result in message number 2 arriving first. So, moving the game to new servers much further away meant that we could finally see the problem for ourselves.

Although we have now fixed this specific issue, you still need to pay attention to the cases where a player is sold in the background while you have the tactics editor open. In those situations we cannot guarantee that everything will work when you drag a player around who has actually left the club. However, we have a few ideas for how we might make the editor more robust against this type of incident in the future.

Have a great weekend!