Alibaba Cloud has shared more information about a technology it uses to improve failure prediction and detection on its servers, claiming a 10% improvement compared to existing models.
The Chinese company’s latest tool, Time-Aware Attention-Based Transformer (TAAT), addresses a limitation of existing machine learning tools, which overlook the importance of log timestamps.
A new research paper co-authored by Alibaba Cloud employees and a researcher at Huazhong University of Science and Technology in Wuhan details how TAAT uses timestamps to make failure predictions more accurate.
Alibaba Cloud improves server failure prediction by 10%
The authors of the paper highlight growing concerns about server reliability and stability in light of “broad applications of cloud computing,” noting that server failures affect the availability of virtual machines.
Because past failures can help predict future ones, the company uses the timestamps in server logs to improve accuracy.
TAAT integrates semantic and temporal data. It builds on the Bidirectional Encoder Representations from Transformers (BERT) language model developed by Google, which Alibaba claims is good at analyzing log data, and extends it with a time-aware attention mechanism.
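To give a rough sense of what "time-aware attention" can mean, here is a minimal Python sketch. It is not the TAAT implementation from the paper: the function name `time_aware_attention`, the `decay` parameter, and the linear penalty on time gaps are all assumptions made for illustration, and short hand-written vectors stand in for BERT embeddings of log lines. The idea shown is simply that the attention score between two log entries combines their semantic similarity with how far apart in time they were recorded.

```python
import math


def time_aware_attention(embeddings, timestamps, decay=0.01):
    """Self-attention over log entries with a penalty on time gaps.

    embeddings: list of equal-length float vectors, one per log line
                (illustrative stand-ins for BERT-style embeddings).
    timestamps: list of times in seconds, one per log line.
    decay:      hypothetical hyperparameter; larger values make log
                lines that are far apart in time contribute less.
    Returns (outputs, weights), both lists with one row per log line.
    """
    dim = len(embeddings[0])
    outputs, all_weights = [], []
    for q, t_q in zip(embeddings, timestamps):
        scores = []
        for k, t_k in zip(embeddings, timestamps):
            # Semantic part: scaled dot product between the two entries.
            semantic = sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            # Temporal part (assumed form): penalize large time gaps.
            temporal = -decay * abs(t_q - t_k)
            scores.append(semantic + temporal)
        # Numerically stable softmax over the combined scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the entry vectors.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, embeddings)) for i in range(dim)]
        )
        all_weights.append(weights)
    return outputs, all_weights


# Three log lines with identical content, recorded at 0 s, 5 s and 1 hour.
logs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
times = [0.0, 5.0, 3600.0]
_, weights = time_aware_attention(logs, times)
print(weights[0])  # attention paid by the first entry to each entry
```

In this toy example the three entries are semantically identical, so any difference in the attention weights comes only from the timestamps: the entry recorded an hour later receives far less weight than the one recorded five seconds later. A model that ignores timestamps would treat all three the same.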
Alibaba Cloud now uses TAAT in its daily operations to improve predictions. The company has also released the real-world cloud computing failure prediction dataset used in its study to help the research community build on the work. The dataset contains approximately 2.7 billion records from around 300,000 servers, collected over a four-month period, and is believed to be the largest dataset of its kind.
With TAAT, Alibaba hopes to make cloud infrastructure more reliable as more workloads move to the cloud. The tool is not yet available for public download.