SCOPUS Information Retrieval Platform

IEEE Transactions on Computers

Volume 60, Issue 5, 2011, Pages 639-652

FREM: A fast restart mechanism for general checkpoint/restart

(2) Li, Yawei (a); Lan, Zhiling (b)

(a) Google Inc. (United States)

(b) Illinois Institute of Technology (United States)

Author keywords

Fast restart; fault tolerance; high performance computing; Linux; operating system

Indexed keywords

APPLICATION RECOVERY; CHECKPOINT/RESTART; FAILURE RATE; FAST RESTART; HIGH PERFORMANCE COMPUTING; LARGE SYSTEM; LATENCY PROBLEM; LINUX; LINUX ENVIRONMENT; OPERATING SYSTEM; PROCESS DATA; PROTOTYPE SYSTEM; REAL APPLICATIONS; RESTART MECHANISM; SYSTEM DEPENDABILITY;

COMPUTER OPERATING SYSTEMS; COMPUTER SOFTWARE SELECTION AND EVALUATION; FAULT TOLERANCE; FAULT TOLERANT COMPUTER SYSTEMS;

QUALITY ASSURANCE;

EID: 79953201544 PISSN: 00189340 EISSN: None Source Type: Journal
DOI: 10.1109/TC.2010.129 Document Type: Article

Times cited: (15)

References (43)

1
- 72049130706
- Technical Report UTEP-CS-08-24
- S. Arunagiri, J. Daly, P. Teller, S. Seelam, R. Oldfield, M. Varela, and R. Riesen, "Opportunistic Checkpoint Intervals to Improve System Performance," Technical Report UTEP-CS-08-24, 2008.
- (2008) Opportunistic Checkpoint Intervals to Improve System Performance
- Arunagiri, S.¹ Daly, J.² Teller, P.³ Seelam, S.⁴ Oldfield, R.⁵ Varela, M.⁶ Riesen, R.⁷

2
- 84976789801
- The recovery box: Using fast recovery to provide high availability in the UNIX environment
- M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proc. Summer USENIX Technical Conf., 1992.
- (1992) Proc. Summer USENIX Technical Conf.
- Baker, M.¹ Sullivan, M.²

3
- 85059766484
- Live migration of virtual machines
- C. Clark, K. Fraser, S. Hand, J. Hansen, E. Jul, C. Limpach, I. Pratt, and A. Warfield, "Live Migration of Virtual Machines," Proc. ACM/USENIX Symp. Networked Systems Design and Implementation, 2005.
- (2005) Proc. ACM/USENIX Symp. Networked Systems Design and Implementation
- Clark, C.¹ Fraser, K.² Hand, S.³ Hansen, J.⁴ Jul, E.⁵ Limpach, C.⁶ Pratt, I.⁷ Warfield, A.⁸

4
- 27544461132
- A model for predicting the optimum checkpoint interval for restart dumps
- J. Daly, "A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps," Proc. Int'l Conf. Computational Science, 2003.
- (2003) Proc. Int'l Conf. Computational Science
- Daly, J.¹

5
- 0033364817
- An evaluation of linear models for host load prediction
- P. Dinda and D. O'Hallaron, "An Evaluation of Linear Models for Host Load Prediction," Proc. IEEE Int'l Symp. High Performance Distributed Computing, 1999.
- (1999) Proc. IEEE Int'l Symp. High Performance Distributed Computing
- Dinda, P.¹ O'Hallaron, D.²

6
- 9144223280
- Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
- Apr.-June
- E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look Into the Future of Practical Rollback-Recovery," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, Apr.-June 2004.
- (2004) IEEE Trans. Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
- Elnozahy, E.¹ Plank, J.²

7
- 79953195047
- EPCKPT: A Checkpoint Utility for Linux Kernel, http://www.research.rutgers.edu/~edpin/epckpt, 2010.
- (2010) EPCKPT: A Checkpoint Utility for Linux Kernel

8
- 84976813771
- IGOR: A system for program debugging via reversible execution
- S. Feldman and C. Brown, "IGOR: A System for Program Debugging via Reversible Execution," Proc. ACM SIGPLAN and SIGOPS Workshop Parallel and Distributed Debugging, 1989.
- (1989) Proc. ACM SIGPLAN and SIGOPS Workshop Parallel and Distributed Debugging
- Feldman, S.¹ Brown, C.²

9
- 31344436964
- On designing direct dependency-based fast recovery algorithms for distributed systems
- DOI 10.1145/974104.974110
- B. Gupta, Z. Liu, and Z. Liang, "On Designing Direct Dependency-Based Fast Recovery Algorithms for Distributed Systems," ACM SIGOPS Operating Systems Rev., vol. 38, no. 1, pp. 58-73, 2004. (Pubitemid 46746979)
- (2004) Operating Systems Review (ACM) , vol.38 , Issue.1 , pp. 58-73
- Gupta, B.¹ Liu, Z.² Liang, Z.³

10
- 48049114689
- Berkeley Lab checkpoint/restart (BLCR) for Linux clusters
- P. Hargrove and J. Duell, "Berkeley Lab Checkpoint/Restart (BLCR) for Linux Clusters," Proc. Scientific Discovery through Advanced Computing (SciDAC), 2006.
- (2006) Proc. Scientific Discovery through Advanced Computing (SciDAC)
- Hargrove, P.¹ Duell, J.²

11
- 77950594233
- SPEC CPU2000 memory footprint
- J. Henning, "SPEC CPU2000 Memory Footprint," ACM SIGARCH Computer Architecture News, vol. 35, no. 1, pp. 84-89, 2007.
- (2007) ACM SIGARCH Computer Architecture News , vol.35 , Issue.1 , pp. 84-89
- Henning, J.¹

12
- 0032095071
- Virtual memory: Issues of implementation
- B. Jacob and T. Mudge, "Virtual Memory: Issues of Implementation," Computer, vol. 31, no. 6, pp. 33-43, June 1998. (Pubitemid 128550816)
- (1998) Computer , vol.31 , Issue.6 , pp. 33-43
- Jacob, B.¹ Mudge, T.²

13
- 85160681664
- Transparent checkpoint-restart of multiple processes on commodity operating systems
- O. Laadan and J. Nieh, "Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems," Proc. USENIX Ann. Technical Conf., 2007.
- (2007) Proc. USENIX Ann. Technical Conf.
- Laadan, O.¹ Nieh, J.²

14
- 57049111494
- Adaptive fault management of parallel applications for high performance computing
- Dec.
- Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
- (2008) IEEE Trans. Computers , vol.57 , Issue.12 , pp. 1647-1660
- Lan, Z.¹ Li, Y.²

15
- 78649627101
- A fast recovery mechanism for checkpointing in networked environments
- Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments," Proc. Int'l Conf. Dependable Systems and Networks (DSN), 2008.
- (2008) Proc. Int'l Conf. Dependable Systems and Networks (DSN)
- Li, Y.¹ Lan, Z.²

16
- 67649883517
- Fault-aware runtime strategies for high-performance computing
- Apr.
- Y. Li, Z. Lan, P. Gujrati, and X. Sun, "Fault-Aware Runtime Strategies for High-Performance Computing," IEEE Trans. Parallel and Distributed Systems, vol. 20, no. 4, pp. 460-473, Apr. 2009.
- (2009) IEEE Trans. Parallel and Distributed Systems , vol.20 , Issue.4 , pp. 460-473
- Li, Y.¹ Lan, Z.² Gujrati, P.³ Sun, X.⁴

17
- 0028485392
- Low-latency, concurrent checkpointing for parallel programs
- Aug.
- K. Li, J. Naughton, and J.S. Plank, "Low-Latency, Concurrent Checkpointing for Parallel Programs," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 8, pp. 874-879, Aug. 1994.
- (1994) IEEE Trans. Parallel and Distributed Systems , vol.5 , Issue.8 , pp. 874-879
- Li, K.¹ Naughton, J.² Plank, J.S.³

18
- 0035390088
- A variational calculus approach to optimal checkpoint placement
- DOI 10.1109/12.936236
- Y. Ling, J. Mi, and X. Lin, "A Variational Calculus Approach to Optimal Checkpoint Placement," IEEE Trans. Computers, vol. 50, no. 7, pp. 699-708, July 2001. (Pubitemid 32720123)
- (2001) IEEE Transactions on Computers , vol.50 , Issue.7 , pp. 699-708
- Ling, Y.¹ Mi, J.² Lin, X.³

19
- 36949009638
- PhD thesis, Univ. of Illinois at Urbana-Champaign
- C. Lu, "Scalable Diskless Checkpointing for Large Parallel Systems," PhD thesis, Univ. of Illinois at Urbana-Champaign, 2005.
- (2005) Scalable Diskless Checkpointing for Large Parallel Systems
- Lu, C.¹

20
- 0345044000
- Process migration
- D. Milojičić, F. Douglis, Y. Paindaveine, R. Wheeler, and S. Zhou, "Process Migration," ACM Computing Surveys, vol. 32, no. 3, pp. 241-299, 2000.
- (2000) ACM Computing Surveys , vol.32 , Issue.3 , pp. 241-299
- Milojičić, D.¹ Douglis, F.² Paindaveine, Y.³ Wheeler, R.⁴ Zhou, S.⁵

21
- 79953179921
- NCSA web site
- NCSA web site, http://teragrid.ncsa.uiuc.edu, 2009.
- (2009)

22
- 34547424386
- Cooperative checkpointing: A robust approach to large-scale systems reliability
- A. Oliner, L. Rudolph, and R. Sahoo, "Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability," Proc. Int'l Conf. Supercomputing, 2006.
- (2006) Proc. Int'l Conf. Supercomputing
- Oliner, A.¹ Rudolph, L.² Sahoo, R.³

23
- 79953221715
- OpenSolaris
- OpenSolaris, http://hub.opensolaris.org, 2010.
- (2010)

24
- 79953192410
- Oracle high availability document
- Oracle high availability document, http://www.oracle.com/technology/deploy/availability/htdocs/fs-on-demand-rollback.htm, 2010.
- (2010)

25
- 0004015896
- Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science
- D. Patterson et al., "Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies," Technical Report UCB//CSD-02-1175, UC Berkeley Computer Science, 2002.
- (2002) Recovery-Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies
- Patterson, D.¹

26
- 0033077475
- Memory exclusion: Optimizing the performance of checkpointing systems
- J. Plank, Y. Chen, K. Li, M. Beck, and G. Kingsley, "Memory Exclusion: Optimizing the Performance of Checkpointing Systems," Software-Practice and Experience, vol. 29, no. 2, pp. 125-142, 1999.
- (1999) Software-Practice and Experience , vol.29 , Issue.2 , pp. 125-142
- Plank, J.¹ Chen, Y.² Li, K.³ Beck, M.⁴ Kingsley, G.⁵

27
- 0032179680
- Diskless checkpointing
- J. Plank, K. Li, and M. Puening, "Diskless Checkpointing," IEEE Trans. Parallel and Distributed Systems, vol. 9, no. 10, pp. 972-986, Oct. 1998. (Pubitemid 128747893)
- (1998) IEEE Transactions on Parallel and Distributed Systems , vol.9 , Issue.10 , pp. 972-986
- Plank, J.S.¹ Li, K.² Puening, M.A.³

28
- 0035201417
- Processor allocation and checkpoint interval selection in cluster computing systems
- DOI 10.1006/jpdc.2001.1757
- J. Plank and M.G. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, 2001. (Pubitemid 33119054)
- (2001) Journal of Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
- Plank, J.S.¹ Thomason, M.G.²

29
- 0033721199
- The cost of recovery in message logging protocols
- Mar./Apr.
- S. Rao, L. Alvisi, and H. Vin, "The Cost of Recovery in Message Logging Protocols," IEEE Trans. Knowledge and Data Eng., vol. 12, no. 2, pp. 160-173, Mar./Apr. 2000.
- (2000) IEEE Trans. Knowledge and Data Eng. , vol.12 , Issue.2 , pp. 160-173
- Rao, S.¹ Alvisi, L.² Vin, H.³

30
- 0029703004
- Fast dynamic process migration
- E. Roush and R. Campbell, "Fast Dynamic Process Migration," Proc. Int'l Conf. Distributed Computing Systems, 1996.
- (1996) Proc. Int'l Conf. Distributed Computing Systems
- Roush, E.¹ Campbell, R.²

31
- 79953216957
- SPEC CPU benchmark
- SPEC CPU benchmark, http://www.spec.org/cpu2006/, 2006.
- (2006)

32
- 12444268355
- On the feasibility of incremental checkpointing for scientific computing
- J. Sancho, F. Petrini, G. Johnson, J. Fernández, and E. Frachtenberg, "On the Feasibility of Incremental Checkpointing for Scientific Computing," Proc. Int'l Parallel and Distributed Processing Symp., 2004.
- (2004) Proc. Int'l Parallel and Distributed Processing Symp.
- Sancho, J.¹ Petrini, F.² Johnson, G.³ Fernández, J.⁴ Frachtenberg, E.⁵

33
- 33845593340
- A large scale study of failures in high-performance-computing systems
- B. Schroeder and G. Gibson, "A Large Scale Study of Failures in High-Performance-Computing Systems," Proc. Int'l Symp. Dependable Systems and Networks, 2006.
- (2006) Proc. Int'l Symp. Dependable Systems and Networks
- Schroeder, B.¹ Gibson, G.²

34
- 13944251545
- A component architecture for LAM/MPI
- J. Squyres and A. Lumsdaine, "A Component Architecture for LAM/MPI," Proc. European PVM/MPI Users' Group Meeting, 2003.
- (2003) Proc. European PVM/MPI Users' Group Meeting
- Squyres, J.¹ Lumsdaine, A.²

35
- 39449084838
- Managing disruptions to supply chains
- L. Snyder and Z. Shen, "Managing Disruptions to Supply Chains," The Bridge, vol. 36, no. 4, pp. 39-45, 2006.
- (2006) The Bridge , vol.36 , Issue.4 , pp. 39-45
- Snyder, L.¹ Shen, Z.²

36
- 0029251277
- The condor distributed processing system
- T. Tannenbaum and M. Litzkow, "The Condor Distributed Processing System," Dr. Dobb's J., vol. 227, pp. 40-48, 1995.
- (1995) Dr. Dobb's J. , vol.227 , pp. 40-48
- Tannenbaum, T.¹ Litzkow, M.²

37
- 0004120131
- second ed., Prentice-Hall
- A. Tanenbaum and A. Woodhull, Operating Systems: Design and Implementation, second ed., Prentice-Hall, 1997.
- (1997) Operating Systems: Design and Implementation
- Tanenbaum, A.¹ Woodhull, A.²

38
- 79953200370
- The FreeBSD Project
- The FreeBSD Project, http://www.freebsd.org, 2010.
- (2010)

39
- 0031388399
- Impact of checkpoint latency on overhead ratio of a checkpointing scheme
- N. Vaidya, "Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme," IEEE Trans. Computers, vol. 46, no. 8, pp. 942-947, 1997. (Pubitemid 127760644)
- (1997) IEEE Transactions on Computers , vol.46 , Issue.8 , pp. 942-947
- Vaidya, N.H.¹

40
- 77952260024
- On the design of a new Linux readahead framework
- F. Wu, H. Xi, and C. Xu, "On the Design of a New Linux Readahead Framework," ACM SIGOPS Operating Systems Rev., vol. 42, no. 5, pp. 75-84, 2008.
- (2008) ACM SIGOPS Operating Systems Rev. , vol.42 , Issue.5 , pp. 75-84
- Wu, F.¹ Xi, H.² Xu, C.³

41
- 85130634439
- Dynamically forecasting network performance using the network weather service
- R. Wolski, "Dynamically Forecasting Network Performance Using the Network Weather Service," J. Cluster Computing, vol. 1, no. 1, pp. 119-132, 1998.
- (1998) J. Cluster Computing , vol.1 , Issue.1 , pp. 119-132
- Wolski, R.¹

42
- 84976846528
- A first order approximation to the optimal checkpoint interval
- J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.
- (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.¹

43
- 20444463494
- FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
- G. Zheng, L. Shi, and L. Kale, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," Proc. IEEE Cluster Computing, 2004.
- (2004) Proc. IEEE Cluster Computing
- Zheng, G.¹ Shi, L.² Kale, L.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.