SCHEDULE: NOV 11-17, 2006
The Computer Failure Data Repository (CFDR): Collecting, Sharing and Analyzing Failure Data
Session: Poster Reception
Event Type: Poster
Time: 5:15pm - 7:15pm
Author(s): Bianca Schroeder, Garth Gibson
Location: Ballroom Corridor
Abstract:
With the large component count of coming PetaFLOPS systems, component failures will be frequent. The success of those systems will therefore depend on their ability to avoid, cope with, and recover from failures. Unfortunately, research in this area is hampered by a limited understanding of what failures in real, large-scale systems look like, owing to the lack of publicly available failure data. This paper introduces the Computer Failure Data Repository (CFDR), an effort under way at CMU to collect, share, and analyze failure data. The CFDR was stimulated by the recent public release of a large set of failure data collected over the past 9 years at Los Alamos National Laboratory, which includes 23,000 failures on more than 20 different systems, mostly large clusters. We present a high-level overview of some of our results from analyzing this data and describe the ongoing efforts and long-term plans of the CFDR.
Chair/Author Details:
Bianca Schroeder, Carnegie Mellon University
Garth Gibson, Carnegie Mellon University