Combining Process Replication and Checkpointing for Resilience on Exascale Systems

Processor failures in post-petascale settings are common occurrences. The traditional fault-tolerance solution, checkpoint-rollback, severely limits parallel efficiency. One solution is to replicate application processes so that a processor failure does not necessarily imply an application failure. Process replication, combined with checkpoint-rollback, has recently been advocated by Ferreira et al. We first identify an incorrect analogy made in their work between process replication and the birthday problem, and derive correct values for the Mean Number of Failures To Interruption and the Mean Time To Interruption for exponentially distributed failures. We then extend these results to arbitrary failure distributions, including closed-form solutions for Weibull distributions. Finally, we evaluate process replication using both synthetic and real-world failure traces. Our main findings are: (i) replication is less beneficial than claimed by Ferreira et al.; (ii) although the choice of the checkpointing period can have a high impact on application execution in the no-replication case, with process replication this choice is no longer critical.
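
The difference between counting failures over replica pairs and counting birthday-problem draws can be explored numerically. The following is a minimal Monte Carlo sketch, not the derivation from the report. It assumes that each application process has exactly two replicas, that every failure strikes a uniformly random processor that is still running, and that the application is interrupted as soon as both replicas of one process are dead. The birthday-style count instead draws a pair uniformly at random with replacement and stops at the first repeat. All function names and modelling choices here are illustrative assumptions.

```python
import random


def failures_to_interruption(n_pairs, rng):
    """Count failures until both replicas of some process are dead.

    Assumption (not from the report): each failure strikes a uniformly
    random processor among those still running.
    """
    alive = list(range(2 * n_pairs))  # processor p belongs to pair p // 2
    damaged = set()                   # pairs that already lost one replica
    count = 0
    while True:
        count += 1
        # Remove a uniformly random running processor in O(1).
        idx = rng.randrange(len(alive))
        alive[idx], alive[-1] = alive[-1], alive[idx]
        victim = alive.pop()
        pair = victim // 2
        if pair in damaged:
            return count              # second replica lost: interruption
        damaged.add(pair)


def birthday_draws(n_pairs, rng):
    """Count uniform draws (with replacement) from n_pairs bins
    until some bin is drawn a second time."""
    seen = set()
    count = 0
    while True:
        count += 1
        b = rng.randrange(n_pairs)
        if b in seen:
            return count
        seen.add(b)


def mean_count(simulate, n_pairs, trials, seed=0):
    """Average a simulated count over independent trials."""
    rng = random.Random(seed)
    return sum(simulate(n_pairs, rng) for _ in range(trials)) / trials


if __name__ == "__main__":
    n = 50  # hypothetical number of replica pairs
    print("replica-pair model:", mean_count(failures_to_interruption, n, 5000))
    print("birthday-style count:", mean_count(birthday_draws, n, 5000))
```

Comparing the two means for a given number of pairs shows whether the birthday-style count matches the replica-pair count under these assumptions. Changing the failure model (for example, allowing failures to strike processors that are already dead) changes the first function only.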

Data and Resources

Combining Process Replication and...HTML

Additional Info

Field	Value
Source	https://inria.hal.science/hal-00697180
Author	Casanova, Henri; Robert, Yves; Vivien, Frédéric; Zaidouni, Dounia
Maintainer	CCSD
Last Updated	May 12, 2026, 22:00 (UTC)
Created	May 12, 2026, 22:00 (UTC)
Identifier	Report N°: RR-7951
Language	en
Rights	https://about.hal.science/hal-authorisation-v1/
contributor	Concurrency Research Group (CoRG) ; University of Hawai‘i [Mānoa] (UHM)
creator	Casanova, Henri
date	2012-05-14T00:00:00
harvest_object_id	6f227ea4-fee5-47bf-ba89-8e0f0f004ef4
harvest_source_id	3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title	test moissonnage SELUNE
metadata_modified	2025-10-13T00:00:00
set_spec	type:REPORT