On the Combination of Silent Error Detection and Checkpointing

In this paper, we revisit traditional checkpointing and rollback recovery strategies, with a focus on silent data corruption errors. Unlike fail-stop failures, such latent errors cannot be detected immediately, and a mechanism to detect them must be provided. We consider two models: (i) errors are detected after a delay that follows a probability distribution (typically, an Exponential distribution); (ii) errors are detected through some verification mechanism. In both cases, we compute the optimal checkpointing period that minimizes the waste, i.e., the fraction of time during which nodes do not perform useful computations. In practice, only a fixed number of checkpoints can be kept in memory, and the first model may lead to an irrecoverable failure. In this case, we compute the minimum period required for an acceptable risk. For the second model, there is no risk of irrecoverable failure, owing to the verification mechanism, but the corresponding overhead is included in the waste. Finally, both models are instantiated using realistic scenarios and application/architecture parameters.
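
To make the notions of waste, optimal period, and acceptable risk concrete, here is a minimal Python sketch. It does not reproduce the report's formulas, which the abstract does not state. It assumes instead the classical first-order (Young-style) approximation for fail-stop failures, waste ≈ C/T + T/(2μ), and a deliberately simplified risk model in which an error becomes irrecoverable when its Exponential detection delay exceeds the window k·T covered by the k checkpoints kept in memory. All function and parameter names are hypothetical.

```python
import math


def first_order_waste(period, checkpoint_cost, mtbf):
    """First-order waste for fail-stop failures (assumed baseline model).

    checkpoint_cost / period : fraction of time spent writing checkpoints
    period / (2 * mtbf)      : expected fraction of time spent re-executing
    Downtime and recovery costs are ignored in this approximation.
    """
    return checkpoint_cost / period + period / (2.0 * mtbf)


def young_period(checkpoint_cost, mtbf):
    """Period minimizing first_order_waste: T = sqrt(2 * C * mu)."""
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def min_period_for_risk(mean_detection_delay, kept_checkpoints, risk):
    """Smallest period T with P(delay > k * T) <= risk.

    Simplifying assumption (not the report's exact expression): the
    detection delay is Exponential with the given mean, and an error is
    irrecoverable once the delay exceeds kept_checkpoints * T.
    Solving exp(-k * T / mean) = risk for T gives the bound below.
    """
    return mean_detection_delay * math.log(1.0 / risk) / kept_checkpoints


if __name__ == "__main__":
    C, mu = 60.0, 86400.0  # illustrative values, in seconds
    T = young_period(C, mu)
    print(f"period = {T:.1f} s, waste = {first_order_waste(T, C, mu):.4f}")
    print(f"min period for risk 1e-3 with k=3: "
          f"{min_period_for_risk(3600.0, 3, 1e-3):.1f} s")
```

Under these assumptions the two quantities pull in opposite directions: the waste-optimal period may be shorter than the minimum period that keeps the risk acceptable, in which case the larger of the two would be used.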

Additional Info

Field              Value
Source             https://inria.hal.science/hal-00836871
Author             Aupy, Guillaume; Benoit, Anne; Herault, Thomas; Robert, Yves; Vivien, Frédéric; Zaidouni, Dounia
Maintainer         CCSD
Last Updated       May 10, 2026, 15:10 (UTC)
Created            May 10, 2026, 15:10 (UTC)
Identifier         Report N°: RR-8319
Language           en
Rights             https://about.hal.science/hal-authorisation-v1/
contributor        Laboratoire de l'Informatique du Parallélisme (LIP) ; École normale supérieure de Lyon (ENS de Lyon) ; Université de Lyon-Université de Lyon-Université Claude Bernard Lyon 1 (UCBL) ; Université de Lyon-Institut National de Recherche en Informatique et en Automatique (Inria)-Centre National de la Recherche Scientifique (CNRS)
creator            Aupy, Guillaume
date               2013-06-21T00:00:00
harvest_object_id  7e7af8fa-3ebf-445c-9a5a-c8ce04dc37c3
harvest_source_id  3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title  test moissonnage SELUNE
metadata_modified  2025-10-13T00:00:00
set_spec           type:REPORT