- Transparent Message-Passing Parallel Applications Checkpointing in Kerrighed
Nowadays, clusters are widely used to execute scientific applications. These applications are often message-passing parallel applications with long execution time....
- Ghost Process: a Sound Basis to Implement Process Duplication, Migration and ...
Today, clusters are widely used to execute numerical applications. Mechanisms are needed to ease cluster use and to take advantage of the cluster distributed...
- A Fault Tolerance protocol for ASP calculus: Design and Proof
This research report first details a communication induced checkpointing fault tolerance protocol adapted to ProActive, a Java library that implements the ASP model....
- Energy-aware checkpointing of divisible tasks with soft or hard deadlines
In this paper, we aim at minimizing the energy consumption when executing a divisible workload under a bound on the total execution time, while resilience is provided...
- Checkpointing strategies with prediction windows
This paper deals with the impact of fault prediction techniques on checkpointing strategies. We suppose that the fault-prediction system provides prediction windows...
- Enabling user-driven Checkpointing strategies in Reverse-mode Automatic Diffe...
This paper presents a new functionality of the Automatic Differentiation (AD) tool Tapenade. Tapenade generates adjoint codes which are widely used for optimization or...
- Combining Process Replication and Checkpointing for Resilience on Exascale Sy...
Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel...
- Comments on ''Improving the computing efficiency of HPC systems using a combi...
In this short note, we provide some comments on the recent paper ''Improving the computing efficiency of HPC systems using a combination of proactive and preventive...
- On the Combination of Silent Error Detection and Checkpointing
In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Contrarily to fail-stop failures,...
- Using group replication for resilience on exascale systems
High performance computing applications must be resilient to faults, which are common occurrences especially in post-petascale settings. The traditional...
- On the Combination of Silent Error Detection and Checkpointing
International audience
- Checkpointing strategies with prediction windows
International audience
- Checkpointing algorithms and fault prediction
Accepted to be published in JPDC
- Using group replication for resilience on exascale systems
International audience
