DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

dc.contributor.author: Santana C.
dc.contributor.author: Araujo R.C.F.
dc.contributor.author: Sardina I.M.
dc.contributor.author: Assis I.A.S.
dc.contributor.author: Barros T.
dc.contributor.author: Bianchini C.P.
dc.contributor.author: Oliveira A.D.D.S.
dc.contributor.author: de Araujo J.M.
dc.contributor.author: Chauris H.
dc.contributor.author: Tadonki C.
dc.contributor.author: Xavier-de-Souza S.
dc.date.accessioned: 2024-08-01T06:16:24Z
dc.date.available: 2024-08-01T06:16:24Z
dc.date.issued: 2024
dc.description.abstract: © 2024 The Authors. Many geophysical imaging applications, such as full-waveform inversion, often rely on high-performance computing to meet their demanding computational requirements. The failure of a subset of computer nodes during the execution of such applications can have a significant impact, as it may take several days or even weeks to recover the lost computation. To mitigate the consequences of these failures, it is crucial to employ effective fault tolerance techniques that do not introduce substantial overhead or hinder code optimization efforts. This paper addresses the primary research challenge of developing fault tolerance techniques with minimal impact on execution and optimization. To achieve this, we propose DeLIA, a Dependability Library for Iterative Applications designed for parallel programs that require data synchronization among all processes to maintain a globally consistent state after each iteration. DeLIA efficiently performs checkpointing and rollback of both the application's global state and each process's local state. Furthermore, DeLIA incorporates interruption detection mechanisms. One of the key advantages of DeLIA is its flexibility, allowing users to configure various parameters such as checkpointing frequency, selection of data to be saved, and the specific fault tolerance techniques to be applied. To validate the effectiveness of DeLIA, we applied it to a 3D full-waveform inversion code and conducted experiments to measure its overhead under different configurations using two workload schedulers. We also analyzed its behavior in preemptive circumstances. Our experiments revealed a maximum overhead of 8.8%, and DeLIA demonstrated its capability to detect termination signals and save the state of nodes in preemptive scenarios. Overall, the results of our study demonstrate the suitability of DeLIA to provide fault tolerance for iterative parallel applications.
dc.description.volume: 191
dc.identifier.doi: 10.1016/j.cageo.2024.105662
dc.identifier.issn: None
dc.identifier.uri: https://dspace.mackenzie.br/handle/10899/39036
dc.relation.ispartof: Computers and Geosciences
dc.rights: Open Access
dc.subject.otherlanguage: Checkpointing
dc.subject.otherlanguage: Fault detection
dc.subject.otherlanguage: Fault tolerance
dc.subject.otherlanguage: Full-waveform inversion
dc.subject.otherlanguage: Heartbeat monitoring
dc.subject.otherlanguage: High-performance computing
dc.title: DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems
dc.type: Article
local.scopus.citations: 0
local.scopus.eid: 2-s2.0-85197022978
local.scopus.subject: Check pointing
local.scopus.subject: Computational requirements
local.scopus.subject: Fault tolerance techniques
local.scopus.subject: Faults detection
local.scopus.subject: Full-waveform inversion
local.scopus.subject: Geophysical imaging
local.scopus.subject: Heartbeat monitoring
local.scopus.subject: High-performance computing
local.scopus.subject: Imaging applications
local.scopus.subject: Performance computing
local.scopus.updated: 2024-12-01
local.scopus.url: https://www.scopus.com/inward/record.uri?partnerID=HzOxMe3b&scp=85197022978&origin=inward
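
Illustrative note on the abstract: the record does not describe DeLIA's interface, so the following is only a minimal single-process sketch of the general pattern the abstract names (periodic checkpointing of an iterative loop, rollback on restart, and saving state when a termination signal is detected). It is not DeLIA's API. The class `CheckpointedLoop`, the function `run`, the parameter `every`, and the pickle file layout are all hypothetical, and the sketch ignores the multi-process global/local state synchronization that the paper addresses.

```python
import os
import pickle
import signal
import tempfile


class CheckpointedLoop:
    """Hypothetical helper (not DeLIA's API): checkpoint/rollback for one process."""

    def __init__(self, path, every=1):
        self.path = path              # checkpoint file location (assumed layout)
        self.every = every            # checkpoint frequency, in iterations
        self.stop_requested = False   # set when a termination signal arrives

    def install_signal_handler(self, signum=signal.SIGTERM):
        # Interruption detection: remember that the scheduler asked us to stop.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        self.stop_requested = True

    def save(self, iteration, state):
        # Write to a temporary file first, then rename atomically, so an
        # interruption during the write cannot corrupt the last good checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"iteration": iteration, "state": state}, f)
        os.replace(tmp, self.path)

    def load(self):
        # Rollback: return (completed_iterations, state), or (0, None) if no
        # checkpoint exists yet.
        if not os.path.exists(self.path):
            return 0, None
        with open(self.path, "rb") as f:
            data = pickle.load(f)
        return data["iteration"], data["state"]


def run(loop, step, initial_state, n_iterations):
    """Run `step` repeatedly, resuming from the last checkpoint if one exists."""
    done, state = loop.load()
    if state is None:
        state = initial_state
    for it in range(done, n_iterations):
        state = step(state)
        # Save on the configured period, or immediately if a stop was requested.
        if (it + 1) % loop.every == 0 or loop.stop_requested:
            loop.save(it + 1, state)
        if loop.stop_requested:
            break
    return state
```

In a real job one would call `loop.install_signal_handler()` from the main thread before `run(...)`, so that a scheduler's SIGTERM triggers a final save at the end of the current iteration; rerunning the same command then resumes from that checkpoint.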