SCHEDULE: NOV 11-17, 2006
Improving Fault Resilience of High Performance Applications
Session: Poster Reception
Event Type: Poster
Time: 5:15pm - 7:15pm
Author(s): Yawei Li, Zhiling Lan
Location: Ballroom Corridor
Abstract:
In large-scale systems with hundreds to thousands of nodes, failures are likely to become more frequent, because system reliability decreases exponentially as the number of components grows. Many parallel applications that span a large number of nodes are designed to run for days or weeks until completion. Hence, application-level fault resilience is critically important to the continued scaling of high performance computing (HPC). In this poster, we present and evaluate FT-Pro, an adaptive fault resilience framework for HPC applications that selects an optimal corrective or preventive action at runtime based on failure predictions. The framework is implemented with a production-level MPI package and assessed with a variety of real-world parallel applications on production HPC systems. The experimental results demonstrate a promising performance improvement of FT-Pro over traditional checkpointing/recovery schemes under a wide range of prediction accuracies and application characteristics.
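The exponential drop in reliability mentioned in the abstract can be illustrated with a minimal sketch. It is not taken from the poster: it assumes that node failures are independent and exponentially distributed, and the function names and the numbers in the usage note are hypothetical.

```python
import math


def system_reliability(node_mtbf_hours: float, nodes: int, run_hours: float) -> float:
    """Probability that every node survives a run of the given length.

    Assumption (not from the poster): nodes fail independently with
    exponentially distributed lifetimes, so a single node survives with
    probability exp(-t / MTBF) and all N nodes survive with that value
    raised to the power N.
    """
    return math.exp(-nodes * run_hours / node_mtbf_hours)


def system_mtbf(node_mtbf_hours: float, nodes: int) -> float:
    """Mean time between failures of the whole system under the same assumption."""
    return node_mtbf_hours / nodes
```

Under these assumptions, a hypothetical 1,000-node system whose nodes each have an MTBF of 10,000 hours would see a failure roughly every 10 hours on average, which is far shorter than a run lasting days or weeks.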
Chair/Author Details:
Yawei Li, Illinois Institute of Technology
Zhiling Lan, Illinois Institute of Technology