- Checkpointing strategies with prediction windows
  This paper deals with the impact of fault prediction techniques on checkpointing strategies. We suppose that the fault-prediction system provides prediction windows...
- Combining Process Replication and Checkpointing for Resilience on Exascale Sy...
  Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel...
- Multilevel communication optimal LU and QR factorizations for hierarchical pl...
  This study focuses on the performance of two classical dense linear algebra algorithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We...
- Comments on "Improving the computing efficiency of HPC systems using a combi...
  In this short note, we provide some comments on the recent paper "Improving the computing efficiency of HPC systems using a combination of proactive and preventive...
- Using group replication for resilience on exascale systems
  High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional...
- Checkpointing strategies with prediction windows
  International audience
- Checkpointing algorithms and fault prediction
  Accepted to be published in JPDC
- Checkpointing algorithms and fault prediction
  International audience
