SCOPUS Information Search Platform

Proceedings of the International Conference on Dependable Systems and Networks

2008, Pages 217-226

A fast restart mechanism for checkpoint/recovery protocols in networked environments

Authors (2): Li, Yawei (a); Lan, Zhiling (a)

(a) Illinois Institute of Technology (United States)

Author keywords

[No Author keywords available]

Indexed keywords

BENCHMARKING; MECHANISMS; NETWORK PROTOCOLS; REAL TIME SYSTEMS; SENSOR NETWORKS; SYSTEMS ENGINEERING

CHECK-POINTING; CHECKPOINT/RECOVERY; DEPENDABLE SYSTEMS; INTERNATIONAL CONFERENCES; LINUX SYSTEMS; MEMORY FOOTPRINTS; NETWORKED ENVIRONMENTS; NON-TRIVIAL; OPTIMIZATION TECHNIQUES; PROCESS DATA; RESEARCH EFFORTS; RESTART MECHANISM; SYSTEM FAILURES

COMPUTER NETWORKS

EID: 53349121135 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1109/DSN.2008.4630090 Document Type: Conference Paper

Times cited: 17

References (27)

1
- 84976789801
- The recovery box: Using fast recovery to provide high availability in the UNIX environment
- M. Baker and M. Sullivan, "The recovery box: Using fast recovery to provide high availability in the UNIX environment," in Proceedings of Summer USENIX Technical Conference, 1992.
- (1992) Proceedings of Summer USENIX Technical Conference
- Baker, M.; Sullivan, M.

2
- 33746779994
- MPICH-V: A multiprotocol automatic fault tolerant MPI
- A. Bouteiller, T. Herault, G. Krawezik, P. Lemarinier, F. Cappello, "MPICH-V: A multiprotocol automatic fault tolerant MPI," International Journal of High Performance Computing and Applications, vol. 20(3), pp. 319-333, 2005.
- (2005) International Journal of High Performance Computing and Applications, vol. 20, Issue 3, pp. 319-333
- Bouteiller, A.; Herault, T.; Krawezik, G.; Lemarinier, P.; Cappello, F.

3
- 27544461132
- A model for predicting the optimum checkpoint interval for restart dumps
- J. Daly, "A model for predicting the optimum checkpoint interval for restart dumps," in Proceedings of International Conference on Computational Science, 2003.
- (2003) Proceedings of International Conference on Computational Science
- Daly, J.

4
- 9144223280
- Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
- E. Elnozahy and J. Plank, "Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery," IEEE Trans. on Dependable and Secure Computing, vol. 1(2), pp. 97-108, 2004.
- (2004) IEEE Trans. on Dependable and Secure Computing, vol. 1, Issue 2, pp. 97-108
- Elnozahy, E.; Plank, J.

5
- 84976813771
- IGOR: A system for program debugging via reversible execution
- S. Feldman and C. Brown, "IGOR: A system for program debugging via reversible execution," in Proceedings of ACM SIGPLAN and SIGOPS workshop on parallel and distributed debugging, 1989.
- (1989) Proceedings of ACM SIGPLAN and SIGOPS workshop on parallel and distributed debugging
- Feldman, S.; Brown, C.

6
- 48049114689
- Berkeley lab checkpoint/restart (BLCR) for Linux clusters
- P. Hargrove and J. Duell, "Berkeley lab checkpoint/restart (BLCR) for Linux clusters," in Proceedings of SciDAC, 2006.
- (2006) Proceedings of SciDAC
- Hargrove, P.; Duell, J.

7
- 85160681664
- Transparent checkpoint-restart of multiple processes on commodity operating systems
- O. Laadan and J. Nieh, "Transparent checkpoint-restart of multiple processes on commodity operating systems," in Proceedings of USENIX Annual Technical Conference, 2007.
- (2007) Proceedings of USENIX Annual Technical Conference
- Laadan, O.; Nieh, J.

8
- 53349100980
- Adaptive fault management of parallel applications for high performance computing
- Z. Lan and Y. Li, "Adaptive fault management of parallel applications for high performance computing," IEEE Trans. on Computers, in press.
- (in press) IEEE Trans. on Computers
- Lan, Z.; Li, Y.

9
- 0028485392
- Low-latency, concurrent checkpointing for parallel programs
- K. Li, J. Naughton and J. Plank, "Low-latency, concurrent checkpointing for parallel programs," IEEE Trans. Parallel and Distributed Systems, vol. 5(8), pp. 874-879, 1994.
- (1994) IEEE Trans. Parallel and Distributed Systems, vol. 5, Issue 8, pp. 874-879
- Li, K.; Naughton, J.; Plank, J.

10
- 0035390088
- A variational calculus approach to optimal checkpoint placement
- Y. Ling, J. Mi and X. Lin, "A variational calculus approach to optimal checkpoint placement," IEEE Trans. Computers, vol. 50(7), pp. 699-708, 2001.
- (2001) IEEE Trans. Computers, vol. 50, Issue 7, pp. 699-708
- Ling, Y.; Mi, J.; Lin, X.

11
- 0345044000
- Process migration
- D. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler and S. Zhou, "Process migration," ACM Comput. Surv., vol. 32(3), pp. 241-299, 2000.
- (2000) ACM Comput. Surv., vol. 32, Issue 3, pp. 241-299
- Milojičić, D.; Douglis, F.; Paindaveine, Y.; Wheeler, R.; Zhou, S.

12
- 53349117725
- NCSA web site
- NCSA web site, http://teragrid.ncsa.uiuc.edu.

13
- 34547424386
- Cooperative checkpointing: A robust approach to large-scale systems reliability
- A. Oliner, L. Rudolph and R. Sahoo, "Cooperative checkpointing: A robust approach to large-scale systems reliability," in Proceedings of International Conference on Supercomputing, 2006.
- (2006) Proceedings of International Conference on Supercomputing
- Oliner, A.; Rudolph, L.; Sahoo, R.

14
- 53349171078
- Oracle high availability document website
- Oracle high availability document website, http://www.oracle.com/technology/deploy/availability/htdocs/fs_on-demand_rollback.htm.

15
- 0004015896
- Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies
- D. Patterson et al., "Recovery-oriented computing (ROC): Motivation, definition, techniques, and case studies," UC Berkeley Computer Science Technical Report UCB//CSD-02-1175, 2002.
- (2002) UC Berkeley Computer Science Technical Report UCB//CSD-02-1175
- Patterson, D.

16
- 0033077475
- Memory exclusion: Optimizing the performance of checkpointing systems
- J. Plank, Y. Chen, K. Li, M. Beck and G. Kingsley, "Memory exclusion: Optimizing the performance of checkpointing systems," Software: Practice and Experience, vol. 29(2), pp. 125-142, 1999.
- (1999) Software: Practice and Experience, vol. 29, Issue 2, pp. 125-142
- Plank, J.; Chen, Y.; Li, K.; Beck, M.; Kingsley, G.

17
- 0032179680
- Diskless checkpointing
- J. Plank, K. Li and M. Puening, "Diskless checkpointing," IEEE Trans. Parallel and Distributed Systems, vol. 9(10), pp. 972-986, 1998.
- (1998) IEEE Trans. Parallel and Distributed Systems, vol. 9, Issue 10, pp. 972-986
- Plank, J.; Li, K.; Puening, M.

18
- 0035201417
- Processor allocation and checkpoint interval selection in cluster computing systems
- J. Plank and M. Thomason, "Processor allocation and checkpoint interval selection in cluster computing systems," Journal of Parallel and Distributed Computing, vol. 61(11), pp. 1570-1590, 2001.
- (2001) Journal of Parallel and Distributed Computing, vol. 61, Issue 11, pp. 1570-1590
- Plank, J.; Thomason, M.

19
- 0033721199
- The cost of recovery in message logging protocols
- S. Rao, L. Alvisi and H. Vin, "The cost of recovery in message logging protocols," IEEE Trans. on Knowledge and Data Engineering, vol. 12(2), pp. 160-173, 2000.
- (2000) IEEE Trans. on Knowledge and Data Engineering, vol. 12, Issue 2, pp. 160-173
- Rao, S.; Alvisi, L.; Vin, H.

20
- 12444268355
- On the feasibility of incremental checkpointing for scientific computing
- J. Sancho, F. Petrini, G. Johnson, J. Fernandez and E. Frachtenberg, "On the feasibility of incremental checkpointing for scientific computing," in Proceedings of International Parallel and Distributed Processing Symposium, 2004.
- (2004) Proceedings of International Parallel and Distributed Processing Symposium
- Sancho, J.; Petrini, F.; Johnson, G.; Fernandez, J.; Frachtenberg, E.

21
- 53349127182
- SPEC CPU 2006 benchmark website
- SPEC CPU 2006 benchmark website, http://www.spec.org/cpu2006/.

22
- 13944251545
- A component architecture for LAM/MPI
- J. Squyres and A. Lumsdaine, "A component architecture for LAM/MPI," in Proceedings of European PVM/MPI Users' Group Meeting, 2003.
- (2003) Proceedings of European PVM/MPI Users' Group Meeting
- Squyres, J.; Lumsdaine, A.

23
- 0004120131
- Operating Systems: Design and Implementation
- A. Tanenbaum and A. Woodhull, "Operating Systems: Design and Implementation," 2nd ed., New Jersey: Prentice-Hall, 1997.
- (1997) Operating Systems: Design and Implementation
- Tanenbaum, A.; Woodhull, A.

24
- 0029251277
- The Condor distributed processing system
- T. Tannenbaum and M. Litzkow, "The Condor distributed processing system," Dr. Dobb's Journal, vol. 227, pp. 40-48, 1995.
- (1995) Dr. Dobb's Journal, vol. 227, pp. 40-48
- Tannenbaum, T.; Litzkow, M.

25
- 0031388399
- Impact of checkpoint latency on overhead ratio of a checkpointing scheme
- N. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme," IEEE Trans. on Computers, vol. 46(8), pp. 942-947, 1997.
- (1997) IEEE Trans. on Computers, vol. 46, Issue 8, pp. 942-947
- Vaidya, N.

26
- 84976846528
- A first order approximation to the optimal checkpoint interval
- J. Young, "A first order approximation to the optimal checkpoint interval," Comm. ACM, vol. 17(9), pp. 530-531, 1974.
- (1974) Comm. ACM, vol. 17, Issue 9, pp. 530-531
- Young, J.

27
- 12844271066
- Dynamic tracking of page miss ratio curve for memory management
- P. Zhou, V. Pandey, J. Sundaresan, A. Raghuraman, Y. Zhou and S. Kumar, "Dynamic tracking of page miss ratio curve for memory management," in Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems, 2004.
- (2004) Proceedings of International Conference on Architectural Support for Programming Languages and Operating Systems
- Zhou, P.; Pandey, V.; Sundaresan, J.; Raghuraman, A.; Zhou, Y.; Kumar, S.

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.