Cloud service brought down by human error

During the night of Tuesday, February 28 to Wednesday, March 1, a number of web services became unavailable after some servers were shut down. On Thursday, March 2, AWS disclosed the cause of the Amazon S3 outage: human error.



Last Tuesday night, a failure at an AWS data center in Virginia took down sites, applications, connected devices and services that depend on the platform. The outage likely meant lost revenue for users of the web giant's cloud service, who had to wait more than three hours before services were restored.

Naturally, both customers and Amazon wanted to find the source of the failure. Amazon S3 (Simple Storage Service) fell victim to human error: while trying to fix a problem with the billing system, the maintenance team spread the outage to other services. The cause? An unfortunate typo.

The typo that disrupted the services

The mistyped command, which was "supposed to remove a small number of servers in one of the S3 subsystems used in the S3 billing process", also affected two other subsystems, according to AWS's report on the Amazon S3 incident.

The first manages the "metadata and location information of all objects in the region". The second handles the allocation of new storage and requires the first, the index subsystem, to be working properly in order to do its job.

The requests handled by these subsystems keep the Amazon S3 APIs working, so when the subsystems fail, the services become unavailable. The solution was to restart all the affected subsystems, which impacted the entire US-EAST-1 region.
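The chain of dependencies described above can be illustrated with a small sketch. It is only a toy model, not AWS's actual architecture: the `Subsystem` class, the names `index` and `placement`, and the `api_available` function are all hypothetical and chosen for illustration.

```python
# Toy model of the dependency between the two affected subsystems.
# All names here are illustrative assumptions, not AWS internals.
from dataclasses import dataclass, field
from typing import List


@dataclass
class Subsystem:
    name: str
    healthy: bool = True
    depends_on: List["Subsystem"] = field(default_factory=list)

    def operational(self) -> bool:
        # A subsystem works only if it is healthy and so is everything it relies on.
        return self.healthy and all(dep.operational() for dep in self.depends_on)


# The first subsystem tracks metadata and object locations.
index = Subsystem("index")
# The second allocates new storage and needs the first one to function.
placement = Subsystem("placement", depends_on=[index])


def api_available() -> bool:
    # In this simplified model, the API is up only when both subsystems are operational.
    return index.operational() and placement.operational()


# Taking the index subsystem offline also takes down placement and the API.
index.healthy = False
print(placement.operational(), api_available())

# Once the index subsystem is back, everything that depends on it recovers.
index.healthy = True
print(placement.operational(), api_available())
```

In this model, removing capacity from one subsystem is enough to make everything that depends on it unavailable, which is why the outage reached far beyond the billing process.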

Amazon S3: enhanced maintenance

As a result, the repair took longer than expected and left merchants without service for almost three and a half hours. It must be said that a full restart of these subsystems had not been carried out for many years. The Amazon S3 team is learning from its mistake and plans to partition the subsystems.

While AWS has apologized unreservedly for the unavailability of the service, the incident is above all a wake-up call for service providers such as Instagram, Slack or American Airlines, whose services were directly affected. The same goes for owners of connected devices, who ran into serious problems: smart TVs unavailable, connected locks stuck open, and so on. Being the market leader does not prevent problems from occurring, and some customers may follow Instagram's next move.
