This is a full report on the License Server outage that occurred on Nov 10th 2024, approximately between 00:00 and 10:00 UTC, which affected many live SmartFoxServer instances (with the exception of those running in Overcast or using an IP-based license).
The incident was mainly caused by a scheduled AWS maintenance activity that was supposed to update the SSL Root CA for the database cluster but failed to replace the certificate, breaking the channel between the frontend and the backend. Additionally, the client side of the License System (in SFS2X) did not handle the situation correctly: instead of entering the “grace period”, during which multiple recovery attempts are performed, it reverted to Community Edition with a 100 CCU limit.
We then enabled a temporary license that, after a hiccup, allowed everyone to return to normal activity, while another part of the team manually applied the SSL patch and restored database connectivity. We finally brought the system back up at around 11:00 a.m. CET (Nov 10th), with the global temporary license still active. We spent the rest of the Sunday testing the transition from the temporary license back to the normal one in our staging environment, and kept answering the many emails and forum posts that were still arriving.
Yesterday, Nov 11th, after sending several email notifications in advance, we restored the default system, which allows SmartFoxServer instances to transition back to normal after a manual restart (which can be done at any time). While this has worked for the vast majority of our customers, there have been a few cases in which an extra restart was necessary, due to how the SmartFox instances are deployed (e.g. in a Docker container or similar): such instances do not retain any “memory” of their state before the incident, as they are simply shut down and replaced with new instances instead of being restarted.
Our plans moving forward
We would like to renew our sincere apologies for the disruption caused: this is the first such incident in our 20+ years of activity, and it is our highest priority to avoid any similar situation for at least another 20 years (or ever!). To that end, we have planned a number of activities that will take precedence over our current projects in the coming weeks:
Manage AWS critical updates manually: the misunderstanding around the AWS Root CA update could have been avoided if AWS had communicated more clearly. We don’t want to put all the blame on them, but this case was particularly bad, and from now on any critical update will be managed and supervised manually by members of our team to avoid any service disruption.
Improve the LS client side: we are going to fix the handling of server error codes to make sure no exception can disrupt the standard grace period, and we’ll also double the grace period from 48 to 96 hours, to give ample margin for recovery (a short illustrative sketch of this behavior follows after this list). The update will be released as a patch later this month.
Activate a status service: we’ll set up a public website showing the detailed status of the License Server for the current day and the previous weeks. This will be backed by multiple levels of monitoring that send alerts to us at any time of the day or night, allowing for prompt intervention in case of a failure.
Disaster Recovery: we’re planning to launch a secondary/backup service in a different AWS region that could be used for a fast License Server switchover in case of a catastrophic event bringing down the whole system (AWS region failure, flood, earthquake, etc.).
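To clarify what “no exception can disrupt the grace period” means in practice, here is a minimal, purely illustrative sketch of the intended client-side behavior. All class and method names (LicenseClientSketch, LicenseServerClient, checkLicense, etc.) are hypothetical and are not the actual SFS2X code: the point is simply that both server error codes and unexpected exceptions route into the same grace-period logic, and the fallback to Community Edition only happens once the (now 96-hour) grace period has expired.

import java.time.Duration;
import java.time.Instant;

public class LicenseClientSketch
{
    // Grace period doubled from 48 to 96 hours, as per the plan above
    private static final Duration GRACE_PERIOD = Duration.ofHours(96);

    private Instant graceStart = null;          // null = not in grace period
    private LicenseMode currentMode = LicenseMode.FULL;

    enum LicenseMode { FULL, GRACE, COMMUNITY }

    /**
     * Called periodically to validate the license against the License Server.
     * Errors and exceptions never cause an immediate downgrade; they only
     * start or extend the grace period.
     */
    public synchronized LicenseMode validate(LicenseServerClient client)
    {
        try
        {
            LicenseResponse resp = client.checkLicense();

            if (resp.isValid())
            {
                // Successful validation: clear any grace state
                graceStart = null;
                currentMode = LicenseMode.FULL;
                return currentMode;
            }

            // Server returned an error code: treat it as a temporary failure
            return enterOrContinueGrace();
        }
        catch (Exception e)
        {
            // Network/SSL/unexpected errors must not bypass the grace period
            return enterOrContinueGrace();
        }
    }

    private LicenseMode enterOrContinueGrace()
    {
        if (graceStart == null)
            graceStart = Instant.now();

        Duration elapsed = Duration.between(graceStart, Instant.now());

        if (elapsed.compareTo(GRACE_PERIOD) < 0)
            currentMode = LicenseMode.GRACE;      // keep full CCU, retry later
        else
            currentMode = LicenseMode.COMMUNITY;  // 100 CCU limit, last resort

        return currentMode;
    }

    // Hypothetical collaborators, defined here only to keep the sketch self-contained
    interface LicenseServerClient { LicenseResponse checkLicense() throws Exception; }
    interface LicenseResponse { boolean isValid(); }
}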
Finally, we plan to keep everyone updated by posting more details about each of these activities as they become available, and to provide a final report on the work done once our plan is complete.
If you have any questions, please get in touch via our support forums or the support@… email box.
The SmartFoxServer Team