SC|06 Powerful Beyond Imagination
SC06 is the International Conference for High Performance Computing, Networking and Storage





SCHEDULE: NOV 11-17, 2006




The Computer Failure Data Repository (CFDR): Collecting, Sharing and Analyzing Failure Data

Session: Poster Reception

Event Type: Poster

Time: 5:15pm - 7:15pm

Author(s): Bianca Schroeder, Garth Gibson

Location: Ballroom Corridor

Abstract:
With the large component count of coming PetaFLOPS systems, component failures will be frequent. The success of those systems will therefore depend on their ability to avoid, cope with, and recover from failures. Unfortunately, research in this area is hampered by a limited understanding of what failures in real, large-scale systems look like, owing to the lack of publicly available failure data. This paper introduces the Computer Failure Data Repository (CFDR), an effort under way at CMU to collect, share, and analyze failure data. The CFDR was stimulated by the recent public release of a large set of failure data collected over the past 9 years at Los Alamos National Laboratory, covering 23,000 failures on more than 20 different systems, mostly large clusters. We present a high-level overview of some of our results from analyzing this data and describe the ongoing efforts and long-term plans of the CFDR.




Chair/Author Details:

Bianca Schroeder
Carnegie Mellon University

Garth Gibson
Carnegie Mellon University





