SC|06 Powerful Beyond Imagination
SC06 is the International Conference for High Performance Computing, Networking and Storage




SCHEDULE: NOV 11-17, 2006




Improving Fault Resilience of High Performance Applications

Session: Poster Reception

Event Type: Poster

Time: 5:15pm - 7:15pm

Author(s): Yawei Li, Zhiling Lan

Location: Ballroom Corridor

Abstract:
In large-scale systems with hundreds to thousands of nodes, failures become more frequent because system reliability decreases exponentially as the component count grows. Many parallel applications that span a large number of nodes are designed to run for days or weeks before completing. Application-level fault resilience is therefore critical to the continued scaling of high performance computing (HPC). In this poster, we present and evaluate FT-Pro, an adaptive fault resilience framework for HPC applications that selects an optimal corrective or preventive action at runtime based on failure predictions. The framework is implemented with a production-level MPI package and assessed with a variety of real-world parallel applications on production HPC systems. The experimental results show promising performance improvements for FT-Pro over traditional checkpointing/recovery schemes under a wide range of prediction accuracies and application characteristics.
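
The exponential drop in reliability mentioned in the abstract can be illustrated with a small sketch. It assumes a simple series model that the abstract does not spell out: every node fails independently, all nodes have the same reliability, and the application fails if any single node fails. The function names and the sample values are hypothetical and chosen only for illustration; they are not taken from the poster.

```python
def system_reliability(node_reliability: float, node_count: int) -> float:
    """Probability that every node survives the run.

    Assumption (not from the poster): nodes fail independently and the
    job fails as soon as any one node fails, so reliabilities multiply.
    """
    if not 0.0 <= node_reliability <= 1.0:
        raise ValueError("node_reliability must be in [0, 1]")
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    return node_reliability ** node_count


def system_mtbf(node_mtbf_hours: float, node_count: int) -> float:
    """Mean time between failures of the whole system.

    Assumption: each node has exponentially distributed failure times,
    so the failure rates add and the system MTBF shrinks as 1/node_count.
    """
    if node_mtbf_hours <= 0 or node_count <= 0:
        raise ValueError("node_mtbf_hours and node_count must be positive")
    return node_mtbf_hours / node_count


if __name__ == "__main__":
    # Hypothetical values: a node that survives a given run with
    # probability 0.999, replicated across growing node counts.
    for n in (1, 10, 100, 1000):
        print(n, round(system_reliability(0.999, n), 4))
```

Under these assumptions, a per-node reliability of 0.999 leaves a 1000-node job with only about a 37% chance of finishing without any failure, which is why long-running jobs at this scale need some form of application-level fault resilience.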




Chair/Author Details:

Yawei Li
Illinois Institute of Technology

Zhiling Lan
Illinois Institute of Technology






IEEE Computer Society, ACM