VMs resize to improve CPU load in US-3
Scheduled Maintenance Report for zCloud
Completed
The scheduled maintenance has been completed.

We were able to do everything we had planned.

- Created new bare-metal servers (almost doubling our capacity at US-3)
- Created new VMs and resized old VMs
- Balanced the VMs to make sure all of them have similar loads
- Created new dashboards for our internal monitoring of these VMs, focused on CPU load over 1-minute and 5-minute periods
- Created new Alerts to let us know when CPU Load is above 60% in one VM for 1 minute (low severity) and when CPU Load is above 90% in one VM for 5 minutes (high severity)
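
As an illustration only, the two alert levels above could be evaluated roughly like this. This is a minimal sketch, not our actual monitoring configuration: the thresholds and windows come from the list above, while the `AlertRule` type, the `evaluate` function, and the 10-second sampling interval are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class AlertRule:
    severity: str
    threshold_pct: float  # CPU load must stay strictly above this value
    window_s: int         # ...for this many seconds


# Thresholds and windows are taken from the list above.
# Rules are ordered from highest to lowest severity.
RULES = (
    AlertRule("high", 90.0, 300),
    AlertRule("low", 60.0, 60),
)


def evaluate(samples: Sequence[float], interval_s: int = 10) -> Optional[str]:
    """Return the highest severity whose rule is met by the latest samples.

    `samples` holds CPU load percentages for one VM, oldest first, taken
    every `interval_s` seconds (a hypothetical sampling interval).
    """
    for rule in RULES:
        needed = max(1, rule.window_s // interval_s)
        recent = samples[-needed:]
        # The rule fires only if the whole window is covered and every
        # sample in it is above the threshold.
        if len(samples) >= needed and all(s > rule.threshold_pct for s in recent):
            return rule.severity
    return None
```

With 10-second samples, a VM needs 6 consecutive samples above 60% to raise the low-severity alert and 30 consecutive samples above 90% to raise the high-severity one.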

Our goal is to avoid problems like the ones we had this week, where a single VM used so much CPU that the whole bare-metal machine suffered.

Why do we believe these changes will avoid that?

The new setup splits each bare-metal machine into at least 2 VMs, and we are pinning the CPU threads that each VM can use, so even if one VM starts to use a lot of CPU, that will not cause the whole machine to suffer.
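
To make the idea concrete, here is a minimal sketch of how a pinning layout like this could be planned. It is an assumption-laden illustration, not our provisioning code: the function name `plan_pinning`, the VM names, and the default of 2 host-reserved threads are all hypothetical.

```python
from typing import Dict, List


def plan_pinning(
    total_threads: int, vm_names: List[str], host_reserved: int = 2
) -> Dict[str, List[int]]:
    """Split host CPU threads into disjoint sets, one per VM.

    The first `host_reserved` thread IDs are left without any VM so the
    bare-metal host can use them; the rest is divided evenly across VMs.
    Any remainder threads are left unassigned.
    """
    if len(vm_names) < 2:
        raise ValueError("the setup described above uses at least 2 VMs per machine")
    if host_reserved < 0:
        raise ValueError("host_reserved must not be negative")
    per_vm = (total_threads - host_reserved) // len(vm_names)
    if per_vm < 1:
        raise ValueError("not enough threads for the requested layout")

    plan = {"host": list(range(host_reserved))}
    for i, name in enumerate(vm_names):
        start = host_reserved + i * per_vm
        plan[name] = list(range(start, start + per_vm))
    return plan
```

For example, `plan_pinning(32, ["vm-a", "vm-b"])` reserves threads 0 and 1 for the host, gives threads 2 to 16 to `vm-a`, and threads 17 to 31 to `vm-b`. Because the sets never overlap, a busy VM can only saturate its own threads. A real layout would also take NUMA nodes and sibling hardware threads into account, which this sketch ignores.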

We also leave a few threads without any VMs, so they are free for the bare-metal host system to use as needed.

Why did this happen this week and not previously? Was it an attack?

No, it was not an attack.

What happened was that several client databases and apps increased their CPU demand at the same time. Because the affected VM caused the whole bare-metal machine to suffer, the performance of other clients was affected as well.

In every case, we were able to check the process ID (PID), identify the container it belonged to, and confirm in our metrics that the corresponding app env or db env was consuming above-average CPU.
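
For readers curious about how a PID can be traced back to a container, one common approach on Linux is to read the process's cgroup file. The sketch below assumes a Linux host and a runtime that embeds a 64-character hex container ID in the cgroup path (as Docker commonly does). It is an illustration of the general technique, not necessarily the exact tooling our operators used, and the function names are hypothetical.

```python
import re
from typing import Optional

# Assumption: the runtime puts a 64-character hex container ID in the
# cgroup path of every process that runs inside the container.
_CONTAINER_ID = re.compile(r"[0-9a-f]{64}")


def container_id_from_cgroup(cgroup_text: str) -> Optional[str]:
    """Return the first container ID found in /proc/<pid>/cgroup content."""
    for line in cgroup_text.splitlines():
        # Each line has the form "hierarchy-ID:controller-list:cgroup-path".
        parts = line.split(":", 2)
        if len(parts) != 3:
            continue
        match = _CONTAINER_ID.search(parts[2])
        if match:
            return match.group(0)
    return None


def container_id_for_pid(pid: int) -> Optional[str]:
    """Read /proc/<pid>/cgroup on a Linux host and extract the container ID."""
    with open(f"/proc/{pid}/cgroup", encoding="utf-8") as fh:
        return container_id_from_cgroup(fh.read())
```

Once the container ID is known, it can be matched against the platform's own records to find the app env or db env that owns the container.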

If you see anything unusual in your logs, especially in the database logs, don’t worry. Our operators were restarting your databases to get them running at full capacity again. Nothing happened that was not expected or controlled by us.

How were you able to address the issues in minutes?

We were able to address the issues in minutes because we had extra capacity. The problem was not a lack of capacity. The problem was CPU usage concentrated in specific VMs, so we are confident the new setup will help.

The new bare-metal machines and VMs were created today because we are growing, and we are also using this maintenance window to prepare more capacity for the coming months.

Thank you for your patience during this intense week, and if you have any further questions, feel free to contact us.
Posted Jan 28, 2024 - 04:31 UTC
In progress
Scheduled maintenance is currently in progress. We will provide updates as necessary.
Posted Jan 28, 2024 - 03:30 UTC
Scheduled
We will resize our VMs to improve CPU load distribution in US-3 to avoid cases where a specific CPU thread or core is overloaded, affecting many resources.

To do that, we will need to restart the containers. The process will be done in a way that keeps even apps with a single container online the whole time, but we always recommend running production workloads with 2 or more containers. 3 is the ideal number because we spread the containers across different physical machines, so your app should be affected only in a drastic case where more than 3 machines are down.

We will also add new bare-metal machines during this process and create new VMs on them.

With the resized VMs and the new VMs, we will have more IPs. Find the new IPs below and check our IPs docs page for the full list: https://docs.zcloud.ws/docs/deploy/ips
Posted Jan 26, 2024 - 02:31 UTC
This scheduled maintenance affected: Web App (App (app.zcloud.ws)), App Servers (United States 3 (St. Louis, MO)), and DB Servers (United States 3 (St. Louis, MO)).