SCOPUS Information Search Platform

Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS

Volume , Issue , 2010, Pages 524-533

Hybrid checkpointing for MPI jobs in HPC environments

(4) Wang, Chao (a); Mueller, Frank (a); Engelmann, Christian (b); Scott, Stephen L. (b)

a North Carolina State University (United States)

b Oak Ridge National Laboratory (United States)

Author keywords

Checkpoint restart; Fault tolerance; High performance computing

Indexed keywords

CHECK POINTING; CHECKPOINT/RESTART; HIGH PERFORMANCE APPLICATIONS; HIGH PERFORMANCE COMPUTING SYSTEMS; HIGH-PERFORMANCE COMPUTING; OPTIMAL BALANCE; ORDER OF MAGNITUDE;

COST ACCOUNTING; FAULT TOLERANCE; FAULT TOLERANT COMPUTER SYSTEMS;

QUALITY ASSURANCE;

EID: 79951790076 PISSN: 15219097 EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/ICPADS.2010.48 Document Type: Conference Paper

Times cited : (44)

References (28)

1
- 33845420448
- A power-aware run-time system for high-performance computing
- C.-H. Hsu and W.-C. Feng, "A power-aware run-time system for high-performance computing," in Supercomputing, 2005.
- (2005) Supercomputing
- Hsu, C.-H.¹ Feng, W.-C.²

2
- 79951791884
- National center for computational sciences
- Oak Ridge National Laboratory, "National center for computational sciences," http://info.nccs.gov/resources/jaguar, Jun. 2007.
- (2007) National Center for Computational Sciences

3
- 77951478277
- Software failures and the road to a petaflop machine
- I. Philp, "Software failures and the road to a petaflop machine," in Workshop on High Performance Computing Reliability Issues, 2005.
- (2005) Workshop on High Performance Computing Reliability Issues
- Philp, I.¹

4
- 79951775997
- Application MTTFE vs. Platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale
- J. T. Daly, L. A. Pritchett-Sheats, and S. E. Michalak, "Application MTTFE vs. platform MTTF: A fresh perspective on system reliability and application throughput for computations at scale," in Workshop on Resiliency in High Performance Computing, May 2008, pp. 19-22.
- (2008) Workshop on Resiliency in High Performance Computing , pp. 19-22
- Daly, J.T.¹ Pritchett-Sheats, L.A.² Michalak, S.E.³

5
- 34548768671
- A job pause service under LAM/MPI+BLCR for transparent fault tolerance
- C. Wang, F. Mueller, C. Engelmann, and S. Scott, "A job pause service under LAM/MPI+BLCR for transparent fault tolerance," in International Parallel and Distributed Processing Symposium, Apr. 2007.
- (2007) International Parallel and Distributed Processing Symposium
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.⁴

6
- 34548771116
- Dejavu: Transparent user-level checkpointing, migration, and recovery for distributed systems
- J. Ruscio, M. Heffner, and S. Varadarajan, "Dejavu: Transparent user-level checkpointing, migration, and recovery for distributed systems," in International Parallel and Distributed Processing Symposium, 2007.
- (2007) International Parallel and Distributed Processing Symposium
- Ruscio, J.¹ Heffner, M.² Varadarajan, S.³

7
- 35248827046
- A Component architecture for LAM/MPI
- ser. Lecture Notes in Computer Science, no. 2840, Sep.
- J. M. Squyres and A. Lumsdaine, "A Component Architecture for LAM/MPI," in European PVM/MPI Users' Group Meeting, ser. Lecture Notes in Computer Science, no. 2840, Sep. 2003, pp. 379-387.
- (2003) European PVM/MPI Users' Group Meeting , pp. 379-387
- Squyres, J.M.¹ Lumsdaine, A.²

8
- 12344277946
- Lawrence Berkeley National Laboratory, TR
- J. Duell, "The design and implementation of Berkeley Lab's Linux checkpoint/restart," Lawrence Berkeley National Laboratory, TR, 2000.
- (2000) The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart
- Duell, J.¹

9
- 20444444457
- The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- S. Sankaran, J. M. Squyres, B. Barrett, A. Lumsdaine, J. Duell, P. Hargrove, and E. Roman, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in LACSI Symposium, Oct. 2003.
- (2003) LACSI Symposium
- Sankaran, S.¹ Squyres, J.M.² Barrett, B.³ Lumsdaine, A.⁴ Duell, J.⁵ Hargrove, P.⁶ Roman, E.⁷

10
- 34548789748
- The design and implementation of checkpoint/restart process fault tolerance for Open MPI
- J. Hursey, J. M. Squyres, T. I. Mattox, and A. Lumsdaine, "The design and implementation of checkpoint/restart process fault tolerance for Open MPI," in Workshop on Dependable Parallel, Distributed and Network-Centric Systems, Mar. 2007.
- (2007) Workshop on Dependable Parallel, Distributed and Network-Centric Systems
- Hursey, J.¹ Squyres, J.M.² Mattox, T.I.³ Lumsdaine, A.⁴

11
- 0026812659
- The design and implementation of a log-structured file system
- M. Rosenblum and J. K. Ousterhout, "The design and implementation of a log-structured file system," in ACM Trans. on Computer Systems, Vol. 10, No. 1, Feb. 1992.
- (1992) ACM Trans. on Computer Systems , vol.10 , Issue.1
- Rosenblum, M.¹ Ousterhout, J.K.²

12
- 70350755748
- Proactive process-level live migration in HPC environments
- C. Wang, F. Mueller, C. Engelmann, and S. Scott, "Proactive process-level live migration in HPC environments," in Supercomputing, 2008.
- (2008) Supercomputing
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.⁴

13
- 85014969248
- Architectural requirements and scalability of the NAS parallel benchmarks
- F. Wong, R. Martin, R. Arpaci-Dusseau, and D. Culler, "Architectural requirements and scalability of the NAS parallel benchmarks," in Supercomputing, 1999.
- (1999) Supercomputing
- Wong, F.¹ Martin, R.² Arpaci-Dusseau, R.³ Culler, D.⁴

14
- 33746047855
- The design, implementation, and evaluation of mpiBLAST
- A. Darling, L. Carey, and W. Feng, "The design, implementation, and evaluation of mpiBLAST," in ClusterWorld Conference and Expo, 2003.
- (2003) ClusterWorld Conference and Expo
- Darling, A.¹ Carey, L.² Feng, W.³

15
- 50649087527
- Reliability-aware approach: An incremental checkpoint/restart model in HPC environments
- N. Naksinehaboon, Y. Liu, C. B. Leangsuksun, R. Nassar, M. Paun, and S. Scott, "Reliability-aware approach: An incremental checkpoint/restart model in HPC environments," in Symposium on Cluster Computing and the Grid, 2008, pp. 783-788.
- (2008) Symposium on Cluster Computing and the Grid , pp. 783-788
- Naksinehaboon, N.¹ Liu, Y.² Leangsuksun, C.B.³ Nassar, R.⁴ Paun, M.⁵ Scott, S.⁶

16
- 0029713612
- CoCheck: Checkpointing and process migration for MPI
- Honolulu, HI, USA, 15-19 April, IEEE, Ed. 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA: IEEE Computer Society Press, 1996
- G. Stellner, "CoCheck: checkpointing and process migration for MPI," in Proceedings of IPPS'96. The 10th International Parallel Processing Symposium: Honolulu, HI, USA, 15-19 April 1996, IEEE, Ed. 1109 Spring Street, Suite 300, Silver Spring, MD 20910, USA: IEEE Computer Society Press, 1996, pp. 526-531.
- (1996) Proceedings of IPPS'96. the 10th International Parallel Processing Symposium , pp. 526-531
- Stellner, G.¹

17
- 0038194608
- MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes
- G. Bosilca, A. Bouteiller, and F. Cappello, "MPICH-V: Toward a scalable fault tolerant MPI for volatile nodes," in Supercomputing, Nov. 2002.
- (2002) Supercomputing
- Bosilca, G.¹ Bouteiller, A.² Cappello, F.³

18
- 60449096682
- MPICH-V2: A fault tolerant MPI for volatile nodes based on pessimistic sender based message logging
- A. Bouteiller, F. Cappello, T. Herault, K. Krawezik, P. Lemarinier, and M. Magniette, "MPICH-V2: a fault tolerant MPI for volatile nodes based on pessimistic sender based message logging," in Supercomputing, 2003.
- (2003) Supercomputing
- Bouteiller, A.¹ Cappello, F.² Herault, T.³ Krawezik, K.⁴ Lemarinier, P.⁵ Magniette, M.⁶

19
- 34250708320
- Analysis of the component architecture overhead in Open MPI
- B. Barrett, J. M. Squyres, A. Lumsdaine, R. L. Graham, and G. Bosilca, "Analysis of the component architecture overhead in Open MPI," in European PVM/MPI Users' Group Meeting, September 2005.
- (2005) European PVM/MPI Users' Group Meeting
- Barrett, B.¹ Squyres, J.M.² Lumsdaine, A.³ Graham, R.L.⁴ Bosilca, G.⁵

20
- 33845434226
- Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
- R. Gioiosa, J. C. Sancho, S. Jiang, and F. Petrini, "Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers," in Supercomputing, 2005.
- (2005) Supercomputing
- Gioiosa, R.¹ Sancho, J.C.² Jiang, S.³ Petrini, F.⁴

21
- 33644504479
- Space-efficient page-level incremental checkpointing
- J. Heo, S. Yi, Y. Cho, J. Hong, and S. Y. Shin, "Space-efficient page-level incremental checkpointing," in ACM Symposium on Applied Computing, 2005, pp. 1558-1562.
- (2005) ACM Symposium on Applied Computing , pp. 1558-1562
- Heo, J.¹ Yi, S.² Cho, Y.³ Hong, J.⁴ Shin, S.Y.⁵

22
- 0031224013
- Continuous checkpointing: Joining the checkpointing with virtual memory paging
- S.-T. Hsu and R.-C. Chang, "Continuous checkpointing: joining the checkpointing with virtual memory paging," Softw. Pract. Exper., vol. 27, no. 9, pp. 1103-1120, 1997. (Pubitemid 127582120)
- (1997) Software - Practice and Experience , vol.27 , Issue.9 , pp. 1103-1120
- Hsu, S.-T.¹ Chang, R.-C.²

23
- 33751065156
- Adaptive page-level incremental checkpointing based on expected recovery time
- Applied Computing 2006 - The 21st Annual ACM Symposium on Applied Computing - Proceedings of the 2006 ACM Symposium on Applied Computing
- S. Yi, J. Heo, Y. Cho, and J. Hong, "Adaptive page-level incremental checkpointing based on expected recovery time," in ACM Symposium on Applied Computing, 2006, pp. 1472-1476. (Pubitemid 44759028)
- (2006) Proceedings of the ACM Symposium on Applied Computing , vol.2 , pp. 1472-1476
- Yi, S.¹ Heo, J.² Cho, Y.³ Hong, J.⁴

24
- 8344232253
- Adaptive incremental checkpointing for massively parallel systems
- New York, NY, USA: ACM
- S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira, "Adaptive incremental checkpointing for massively parallel systems," in International Conference on Supercomputing. New York, NY, USA: ACM, 2004, pp. 277-286.
- (2004) International Conference on Supercomputing , pp. 277-286
- Agarwal, S.¹ Garg, R.² Gupta, M.S.³ Moreira, J.E.⁴

25
- 79951788489
- Incremental checkpointing for grids
- J. Mehnert-Spahn, E. Feller, and M. Schoettner, "Incremental checkpointing for grids," in Linux Symposium, Jul. 2009.
- (2009) Linux Symposium
- Mehnert-Spahn, J.¹ Feller, E.² Schoettner, M.³

26
- 84976846528
- A first order approximation to the optimum checkpoint interval
- J. W. Young, "A first order approximation to the optimum checkpoint interval," Commun. ACM, vol. 17, no. 9, pp. 530-531, 1974.
- (1974) Commun. ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.W.¹

27
- 28044460018
- A higher order estimate of the optimum checkpoint interval for restart dumps
- DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
- J. T. Daly, "A higher order estimate of the optimum checkpoint interval for restart dumps," Future Gener. Comput. Syst., vol. 22, no. 3, pp. 303-312, 2006. (Pubitemid 41689812)
- (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
- Daly, J.T.¹

28
- 51049108820
- An optimal checkpoint/restart model for a large scale high performance computing system
- Y. Liu, R. Nassar, C. B. Leangsuksun, N. Naksinehaboon, M. Paun, and S. Scott, "An optimal checkpoint/restart model for a large scale high performance computing system," in International Parallel and Distributed Processing Symposium, Apr. 2008.
- (2008) International Parallel and Distributed Processing Symposium
- Liu, Y.¹ Nassar, R.² Leangsuksun, C.B.³ Naksinehaboon, N.⁴ Paun, M.⁵ Scott, S.⁶

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.