DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

Type
Article
Publication date
2024
Journal
Computers and Geosciences
Citations (Scopus)
0
Authors
Santana C.
Araujo R.C.F.
Sardina I.M.
Assis I.A.S.
Barros T.
Bianchini C.P.
Oliveira A.D.D.S.
de Araujo J.M.
Chauris H.
Tadonki C.
Xavier-de-Souza S.
Abstract
© 2024 The Authors. Many geophysical imaging applications, such as full-waveform inversion, often rely on high-performance computing to meet their demanding computational requirements. The failure of a subset of computer nodes during the execution of such applications can have a significant impact, as it may take several days or even weeks to recover the lost computation. To mitigate the consequences of these failures, it is crucial to employ effective fault tolerance techniques that do not introduce substantial overhead or hinder code optimization efforts. This paper addresses the primary research challenge of developing fault tolerance techniques with minimal impact on execution and optimization. To achieve this, we propose DeLIA, a Dependability Library for Iterative Applications designed for parallel programs that require data synchronization among all processes to maintain a globally consistent state after each iteration. DeLIA efficiently performs checkpointing and rollback of both the application's global state and each process's local state. Furthermore, DeLIA incorporates interruption detection mechanisms. One of the key advantages of DeLIA is its flexibility, allowing users to configure various parameters such as checkpointing frequency, selection of data to be saved, and the specific fault tolerance techniques to be applied. To validate the effectiveness of DeLIA, we applied it to a 3D full-waveform inversion code and conducted experiments to measure its overhead under different configurations using two workload schedulers. We also analyzed its behavior in preemptive circumstances. Our experiments revealed a maximum overhead of 8.8%, and DeLIA demonstrated its capability to detect termination signals and save the state of nodes in preemptive scenarios. Overall, the results of our study demonstrate the suitability of DeLIA to provide fault tolerance for iterative parallel applications.
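The checkpoint, rollback, and termination-signal ideas summarized in the abstract can be illustrated with a minimal single-process sketch. This is not DeLIA's API: the class `CheckpointedLoop`, its methods, and the `every` parameter are hypothetical names chosen for illustration, and the sketch omits the inter-process synchronization that a parallel application would need before saving its global state.

```python
import os
import pickle
import signal
import tempfile


class CheckpointedLoop:
    """Hypothetical helper (not DeLIA's API) for an iterative computation."""

    def __init__(self, path, every=2):
        self.path = path              # checkpoint file location
        self.every = every            # checkpoint frequency, in iterations
        self.stop_requested = False   # set when a termination signal arrives

    def install(self):
        # Register the handler for SIGTERM (call from the main thread).
        signal.signal(signal.SIGTERM, self.on_signal)

    def on_signal(self, signum, frame):
        # Only record the request; the loop saves at a consistent point.
        self.stop_requested = True

    def save(self, iteration, state):
        # Write to a temporary file, then rename atomically, so that a
        # crash during the write never corrupts the previous checkpoint.
        tmp = self.path + ".tmp"
        with open(tmp, "wb") as f:
            pickle.dump({"iteration": iteration, "state": state}, f)
        os.replace(tmp, self.path)

    def load(self):
        # Return (completed_iterations, state), or None if no checkpoint.
        if not os.path.exists(self.path):
            return None
        with open(self.path, "rb") as f:
            data = pickle.load(f)
        return data["iteration"], data["state"]

    def run(self, step, initial_state, n_iterations):
        # Roll back to the last checkpoint if one exists.
        restored = self.load()
        done, state = restored if restored is not None else (0, initial_state)
        for i in range(done, n_iterations):
            state = step(state)
            completed = i + 1
            if self.stop_requested:
                # Interruption detected: save and stop cleanly.
                self.save(completed, state)
                return completed, state
            if completed % self.every == 0:
                self.save(completed, state)
        return n_iterations, state
```

In this sketch, a second call to `run` with the same checkpoint path resumes from the last saved iteration instead of restarting from `initial_state`.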
Scopus subjects
Check pointing, Computational requirements, Fault tolerance techniques, Faults detection, Full-waveform inversion, Geophysical imaging, Heartbeat monitoring, High-performance computing, Imaging applications, Performance computing