The dataset is about Dynamic Random Access Memory (DRAM) errors and server failures due to DRAM errors. It includes DRAM error logs and trouble tickets due to DRAM errors collected from more than 250K servers. The dataset is provided by Alibaba.
Large-scale Dataset for Prediction of Server Failures due to DRAM Errors
DRAM is typically used as main memory in modern data centers. However, DRAM errors are prevalent in large-scale production environments and, worse, correlate with server failures. To encourage researchers to explore the characteristics of DRAM errors and the correlation between DRAM errors and server failures, we release a dataset that includes more than 70 million DRAM errors, thousands of trouble tickets describing server failures caused by DRAM errors, and hardware configuration inventory logs. The dataset was collected from more than 250K servers and 3 million DIMMs over an eight-month span at Alibaba.
The dataset has three files:
Field | Type | Description |
---|---|---|
sid | string | The server ID |
memoryid | integer | The DIMM ID, ranging from 0 to 23; a server has at most 24 DIMMs attached |
rankid | integer | The rank ID, ranging from 0 to 1; each DIMM has 1 or 2 ranks |
bankid | integer | The bank ID, ranging from 0 to 15; each rank has 16 banks |
row | integer | The row ID, range from 0 to |
col | integer | The column ID, range from 0 to |
error_type | integer | The error type: 1 for read error, 2 for scrubbing error, 3 for write error |
error_time | string | The time when the error was detected, in the format YYYY-MM-DD hh:mm:ss |
Field | Type | Description |
---|---|---|
sid | string | The server ID |
server_manufacturter | string | The server manufacturer, in anonymized format |
DRAM_model | string | The DRAM model, in anonymized format |
DIMM_num | string | The number of DIMMs attached to the server: one of 8, 12, 16, or 24 |
Field | Type | Description |
---|---|---|
sid | string | The server ID |
failure_type | integer | The server failure type, 1 for UE-driven failures, 2 for CE-driven failures, and 3 for miscellaneous failures. |
failure_time | string | The time when the server failure occurred, in the format "YYYY-MM-DD hh:mm:ss" |
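The three tables share the `sid` field, so error records can be related to trouble tickets per server. The following is a minimal sketch of that join, not the authors' method: it assumes the files are comma-separated with a header row matching the field names above, which the description does not state, and it uses hypothetical inline sample rows (`ERROR_LOG`, `TICKETS`) in place of the real files. The helper name `errors_before_failure` is illustrative only.

```python
import csv
import io
from collections import Counter
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"

# Hypothetical sample rows. The real files may use a different
# delimiter or header layout; adjust the readers accordingly.
ERROR_LOG = """sid,memoryid,rankid,bankid,row,col,error_type,error_time
server_1,3,0,5,1024,17,1,0001-01-02 08:00:00
server_1,3,0,5,1024,18,1,0001-01-03 09:30:00
server_1,4,1,2,77,5,3,0001-01-09 12:00:00
server_2,0,0,0,12,3,1,0001-01-04 10:00:00
"""

TICKETS = """sid,failure_type,failure_time
server_1,2,0001-01-05 00:00:00
"""


def parse_time(text):
    """Parse a timestamp in the documented YYYY-MM-DD hh:mm:ss format."""
    return datetime.strptime(text, TIME_FORMAT)


def errors_before_failure(error_csv, ticket_csv):
    """Count, per failed server, the errors logged before its first failure."""
    first_failure = {}
    for row in csv.DictReader(io.StringIO(ticket_csv)):
        sid = row["sid"]
        when = parse_time(row["failure_time"])
        if sid not in first_failure or when < first_failure[sid]:
            first_failure[sid] = when

    counts = Counter()
    for row in csv.DictReader(io.StringIO(error_csv)):
        sid = row["sid"]
        # Servers without a trouble ticket are skipped.
        if sid in first_failure and parse_time(row["error_time"]) < first_failure[sid]:
            counts[sid] += 1
    return dict(counts)


result = errors_before_failure(ERROR_LOG, TICKETS)
print(result)
```

In this sample, only the two `server_1` errors logged before its ticket are counted; the later error and the errors on `server_2`, which has no ticket, are ignored. For the full dataset, the same logic would be applied while streaming the files from disk rather than holding them in strings.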
Note that we have anonymized the exact dates, the server manufacturer, and the DRAM model to prevent sensitive information from being inferred. Specifically, dates start from year 0001, month 01, day 01. For the manufacturer, we use M1, M2, M3, and M4 to represent the four server manufacturers. For the DRAM model, we use A1, A2, B1, B2, B3, C1, and C2 to represent the seven DRAM models, where A, B, and C denote the three main DRAM vendors and the numbers 1, 2, and 3 denote different models from the same vendor.
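Because the anonymized dates begin in year 0001, they fall outside the range of nanosecond-resolution timestamp types used by some libraries (for example, `datetime64[ns]`), while Python's standard `datetime` supports years 1 through 9999. The sketch below shows one way to work with the anonymized values. It assumes timestamps are zero-padded as in `0001-01-01 00:00:00`, and the helper names `days_since_start` and `split_dram_model` are hypothetical.

```python
from datetime import datetime

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
DATASET_START = datetime(1, 1, 1)  # anonymized origin: year 0001, month 01, day 01
DRAM_MODELS = {"A1", "A2", "B1", "B2", "B3", "C1", "C2"}


def days_since_start(timestamp):
    """Convert an anonymized timestamp into days elapsed since the origin."""
    moment = datetime.strptime(timestamp, TIME_FORMAT)
    return (moment - DATASET_START).total_seconds() / 86400


def split_dram_model(label):
    """Split a label such as 'B3' into (vendor letter, model number)."""
    if label not in DRAM_MODELS:
        raise ValueError(f"unknown DRAM model label: {label!r}")
    return label[0], int(label[1:])


print(days_since_start("0001-02-01 12:00:00"))
print(split_dram_model("B3"))
```

Working with relative offsets such as days since the origin avoids any dependence on the absolute (anonymized) calendar dates, which carry no meaning on their own.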
Please cite our paper if you use this dataset.
@inproceedings {cheng2022,
title = {An In-Depth Correlative Study Between DRAM Errors and Server Failures in Production Data Centers},
author = {Cheng, Zhinan and Han, Shujie and Lee, Patrick PC and Li, Xin and Liu, Jiongzhou and Li, Zhan},
booktitle = {41st International Symposium on Reliable Distributed Systems ({SRDS} 2022)},
year = {2022}
}
The dataset is distributed under the CC BY-SA 4.0 license.