SCOPUS Information Retrieval Platform

Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Volume , Issue , 2011, Pages

Checkpointing strategies for parallel jobs

Authors (5): Bougeret, Marin (a); Casanova, Henri (b); Rabie, Mikael (a); Robert, Yves (a); Vivien, Frédéric (c)

a UNIVERSITÉ DE LYON (France)

b UNIVERSITY OF HAWAII (United States)

c INRIA (France)

Author keywords

Checkpointing; Fault tolerance; Parallel job; Sequential job

Indexed keywords

CHECK POINTING; DYNAMIC PROGRAMMING ALGORITHM; EXPECTED EXECUTION TIME; EXTENSIVE SIMULATIONS; INTER-ARRIVAL TIME; JOB EXECUTION; JOB PARALLELISM; OPTIMAL SOLUTIONS; PARALLEL JOB; PARALLEL JOBS; PERIODIC CHECKPOINTING; PROCESSOR FAILURES; REAL-WORLD SYSTEM; SEQUENTIAL JOB; SIMULATION EXPERIMENTS; WEIBULL;

CLUSTERING ALGORITHMS; COMPUTER SOFTWARE SELECTION AND EVALUATION; EXPERIMENTS; FAULT TOLERANCE; OPTIMIZATION; WEIBULL DISTRIBUTION;

DYNAMIC PROGRAMMING;

EID: 83155184556 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2063384.2063428 Document Type: Conference Paper

Times cited : (67)

References (31)

1
- 85060036181
- The validity of the single processor approach to achieving large scale computing capabilities
- AFIPS Press
- G. Amdahl. The validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Conference Proceedings, volume 30, pages 483-485. AFIPS Press, 1967.
- (1967) AFIPS Conference Proceedings , vol.30 , pp. 483-485
- Amdahl, G.¹

2
- 83155195319
- L. Bautista Gomez, A. Nukada, N. Maruyama, F. Cappello, and S. Matsuoka. Transparent low-overhead checkpoint for GPU-accelerated clusters. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-lbautista.pdf?version=1&modificationDate=1290470402000.
- Transparent Low-overhead Checkpoint for GPU-accelerated Clusters
- Gomez, L.B.¹ Nukada, A.² Maruyama, N.³ Cappello, F.⁴ Matsuoka, S.⁵

3
- 0003615167
- SIAM
- L. S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley. ScaLAPACK Users' Guide. SIAM, 1997.
- (1997) ScaLAPACK Users' Guide
- Blackford, L.S.¹ Choi, J.² Cleary, A.³ D'Azevedo, E.⁴ Demmel, J.⁵ Dhillon, I.⁶ Dongarra, J.⁷ Hammarling, S.⁸ Henry, G.⁹ Petitet, A.¹⁰ Stanley, K.¹¹ Walker, D.¹² Whaley, R.C.¹³

4
- 83155171268
- Jaguar: The world's most powerful computer
- A. Bland, R. Kendall, D. Kothe, J. Rogers, and G. Shipman. Jaguar: The World's Most Powerful Computer. In GUC'2009, 2009.
- (2009) GUC'2009
- Bland, A.¹ Kendall, R.² Kothe, D.³ Rogers, J.⁴ Shipman, G.⁵

5
- 83155195316
- Checkpointing strategies for parallel jobs
- France, Jan. Available at
- M. Bougeret, H. Casanova, M. Rabie, Y. Robert, and F. Vivien. Checkpointing strategies for parallel jobs. Research Report 7520, INRIA, France, Jan. 2011. Available at http://graal.ens-lyon.fr/~fvivien/.
- (2011) Research Report 7520, INRIA
- Bougeret, M.¹ Casanova, H.² Rabie, M.³ Robert, Y.⁴ Vivien, F.⁵

6
- 77955097389
- A flexible checkpoint/restart model in distributed systems
- volume 6067 of LNCS
- M.-S. Bouguerra, T. Gautier, D. Trystram, and J.-M. Vincent. A flexible checkpoint/restart model in distributed systems. In PPAM, volume 6067 of LNCS, pages 206-215, 2010.
- (2010) PPAM , pp. 206-215
- Bouguerra, M.-S.¹ Gautier, T.² Trystram, D.³ Vincent, J.-M.⁴

7
- 83455190312
- An optimal algorithm for scheduling checkpoints with variable costs
- Oct.
- M. S. Bouguerra, D. Trystram, and F. Wagner. An optimal algorithm for scheduling checkpoints with variable costs. Technical report, INRIA, Oct. 2010.
- (2010) Technical Report, INRIA
- Bouguerra, M.S.¹ Trystram, D.² Wagner, F.³

8
- 78649559128
- Checkpointing vs. Migration for post-petascale supercomputers
- IEEE Computer Society Press
- F. Cappello, H. Casanova, and Y. Robert. Checkpointing vs. migration for post-petascale supercomputers. In ICPP'2010. IEEE Computer Society Press, 2010.
- (2010) ICPP'2010
- Cappello, F.¹ Casanova, H.² Robert, Y.³

9
- 0035266102
- Proactive management of software aging
- V. Castelli, R. E. Harper, P. Heidelberger, S. W. Hunter, K. S. Trivedi, K. Vaidyanathan, and W. P. Zeggert. Proactive management of software aging. IBM J. Res. Dev., 45(2):311-332, 2001. (Pubitemid 32736915)
- (2001) IBM Journal of Research and Development , vol.45 , Issue.2 , pp. 311-332
- Castelli, V.¹ Harper, R.E.² Heidelberger, P.³ Hunter, S.W.⁴ Trivedi, K.S.⁵ Vaidyanathan, K.⁶ Zeggert, W.P.⁷

10
- 28044460018
- A higher order estimate of the optimum checkpoint interval for restart dumps
- DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
- J. T. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22(3):303-312, 2004. (Pubitemid 41689812)
- (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
- Daly, J.T.¹

11
- 70450159193
- The international exascale software project: A call to cooperative action by the global high-performance community
- J. Dongarra, P. Beckman, P. Aerts, F. Cappello, T. Lippert, S. Matsuoka, P. Messina, T. Moore, R. Stevens, A. Trefethen, and M. Valero. The international exascale software project: a call to cooperative action by the global high-performance community. Int. J. High Perform. Comput. Appl., 23(4):309-322, 2009.
- (2009) Int. J. High Perform. Comput. Appl. , vol.23 , Issue.4 , pp. 309-322
- Dongarra, J.¹ Beckman, P.² Aerts, P.³ Cappello, F.⁴ Lippert, T.⁵ Matsuoka, S.⁶ Messina, P.⁷ Moore, T.⁸ Stevens, R.⁹ Trefethen, A.¹⁰ Valero, M.¹¹

12
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Survey, 34:375-408, 2002.
- (2002) ACM Computing Survey , vol.34 , pp. 375-408
- Elnozahy, E.N.M.¹ Alvisi, L.² Wang, Y.-M.³ Johnson, D.B.⁴

13
- 0036041277
- Improving cluster availability using workstation validation
- T. Heath, R. P. Martin, and T. D. Nguyen. Improving cluster availability using workstation validation. SIGMETRICS Perf. Eval. Rev., 30(1):217-227, 2002. (Pubitemid 35009524)
- (2002) Performance Evaluation Review , vol.30 , Issue.1 , pp. 217-227
- Heath, T.¹ Martin, R.P.² Nguyen, T.D.³

14
- 51049086184
- Scalable group-based checkpoint/restart for large-scale message-passing systems
- IEEE
- J. Ho, C. Wang, and F. Lau. Scalable group-based checkpoint/restart for large-scale message-passing systems. In Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on, pages 1-12. IEEE, 2008.
- (2008) Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on , pp. 1-12
- Ho, J.¹ Wang, C.² Lau, F.³

15
- 78650009816
- Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters
- ACM
- W. Jones, J. Daly, and N. DeBardeleben. Impact of sub-optimal checkpoint intervals on application efficiency in computational clusters. In HPDC'10, pages 276-279. ACM, 2010.
- (2010) HPDC'10 , pp. 276-279
- Jones, W.¹ Daly, J.² DeBardeleben, N.³

16
- 0028994247
- Software rejuvenation: Analysis, module and applications
- Washington, DC, USA, IEEE CS
- N. Kolettis and N. D. Fulton. Software rejuvenation: Analysis, module and applications. In FTCS'95, page 381, Washington, DC, USA, 1995. IEEE CS.
- (1995) FTCS'95 , pp. 381
- Kolettis, N.¹ Fulton, N.D.²

17
- 77954903245
- The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems
- D. Kondo, B. Javadi, A. Iosup, and D. Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. Cluster Computing and the Grid, IEEE International Symposium on, 0:398-407, 2010.
- (2010) Cluster Computing and the Grid, IEEE International Symposium on , pp. 398-407
- Kondo, D.¹ Javadi, B.² Iosup, A.³ Epema, D.⁴

18
- 0023995854
- Computing optimal checkpointing strategies for rollback and recovery systems
- P. L'Ecuyer and J. Malenfant. Computing optimal checkpointing strategies for rollback and recovery systems. IEEE Transactions on computers, 37(4):491-496, 1988.
- (1988) IEEE Transactions on Computers , vol.37 , Issue.4 , pp. 491-496
- L'Ecuyer, P.¹ Malenfant, J.²

19
- 0035390088
- A variational calculus approach to optimal checkpoint placement
- DOI 10.1109/12.936236
- Y. Ling, J. Mi, and X. Lin. A variational calculus approach to optimal checkpoint placement. IEEE Transactions on computers, pages 699-708, 2001. (Pubitemid 32720123)
- (2001) IEEE Transactions on Computers , vol.50 , Issue.7 , pp. 699-708
- Ling, Y.¹ Mi, J.² Lin, X.³

20
- 51049108820
- An optimal checkpoint/restart model for a large scale high performance computing system
- IEEE
- Y. Liu, R. Nassar, C. Leangsuksun, N. Naksinehaboon, M. Paun, and S. Scott. An optimal checkpoint/restart model for a large scale high performance computing system. In IPDPS 2008, pages 1-9. IEEE, 2008.
- (2008) IPDPS 2008 , pp. 1-9
- Liu, Y.¹ Nassar, R.² Leangsuksun, C.³ Naksinehaboon, N.⁴ Paun, M.⁵ Scott, S.⁶

21
- 83155174134
- E. Meneses. Clustering Parallel Applications to Enhance Message Logging Protocols. https://wiki.ncsa.illinois.edu/download/attachments/17630761/INRIA-UIUC-WS4-emenese.pdf?version=1&modificationDate=1290466786000.
- Clustering Parallel Applications to Enhance Message Logging Protocols
- Meneses, E.¹

22
- 78650831692
- Design, modeling, and evaluation of a scalable multi-level checkpointing system
- A. Moody, G. Bronevetsky, K. Mohror, and B. R. d. Supinski. Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System. In Proceedings of the ACM/IEEE SC Conference, pages 1-11, 2010.
- (2010) Proceedings of the ACM/IEEE SC Conference , pp. 1-11
- Moody, A.¹ Bronevetsky, G.² Mohror, K.³ Supinski, B.R.D.⁴

23
- 33646721605
- Distribution-free checkpoint placement algorithms based on min-max principle
- T. Ozaki, T. Dohi, H. Okamura, and N. Kaio. Distribution-free checkpoint placement algorithms based on min-max principle. IEEE TDSC, pages 130-140, 2006.
- (2006) IEEE TDSC , pp. 130-140
- Ozaki, T.¹ Dohi, T.² Okamura, H.³ Kaio, N.⁴

24
- 85102627959
- Wiley
- M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, 2005.
- (2005) Markov Decision Processes: Discrete Stochastic Dynamic Programming
- Puterman, M.L.¹

25
- 77954734639
- White paper available at
- V. Sarkar and others. Exascale software study: Software challenges in extreme scale systems, 2009. White paper available at: http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ECSS%20report%20101909.pdf.
- (2009) Exascale Software Study: Software Challenges in Extreme Scale Systems
- Sarkar, V.¹

26
- 33845593340
- A large-scale study of failures in high-performance computing systems
- DOI 10.1109/DSN.2006.5, 1633514, Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks
- B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proc. of DSN, pages 249-258, 2006. (Pubitemid 44930426)
- (2006) Proceedings of the International Conference on Dependable Systems and Networks , vol.2006 , pp. 249-258
- Schroeder, B.¹ Gibson, G.A.²

27
- 84976696875
- Performance analysis of checkpointing strategies
- A. Tantawi and M. Ruschitzka. Performance analysis of checkpointing strategies. ACM TOCS, 2(2):123-144, 1984.
- (1984) ACM TOCS , vol.2 , Issue.2 , pp. 123-144
- Tantawi, A.¹ Ruschitzka, M.²

28
- 0021473687
- On the optimum checkpoint selection problem
- S. Toueg and O. Babaoglu. On the optimum checkpoint selection problem. SIAM J. Computing, 13(3):630-649, 1984.
- (1984) SIAM J. Computing , vol.13 , Issue.3 , pp. 630-649
- Toueg, S.¹ Babaoglu, O.²

29
- 83155195315
- Analysis of dependencies of checkpoint cost and checkpoint interval of fault tolerant MPI applications
- K. Venkatesh. Analysis of Dependencies of Checkpoint Cost and Checkpoint Interval of Fault Tolerant MPI Applications. Analysis, 2(08):2690-2697, 2010.
- (2010) Analysis , vol.2 , Issue.8 , pp. 2690-2697
- Venkatesh, K.¹

30
- 27544513113
- Modeling coordinated checkpointing for large-scale supercomputers
- Proceedings - 2005 International Conference on Dependable Systems and Networks
- L. Wang, P. Karthik, Z. Kalbarczyk, R. Iyer, L. Votta, C. Vick, and A. Wood. Modeling Coordinated Checkpointing for Large-Scale Supercomputers. In Proc. of the International Conference on Dependable Systems and Networks, pages 812-821, June 2005. (Pubitemid 41538294)
- (2005) Proceedings of the International Conference on Dependable Systems and Networks , pp. 812-821
- Wang, L.¹ Pattabiraman, K.² Kalbarczyk, Z.³ Iyer, R.K.⁴ Votta, L.⁵ Vick, C.⁶ Wood, A.⁷

31
- 84976846528
- A first order approximation to the optimum checkpoint interval
- J. W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530-531, 1974.
- (1974) Communications of the ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.W.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.