Volume 60, Issue 5, 2011, Pages 639-652

FREM: A fast restart mechanism for general checkpoint/restart

Author keywords

Fast restart; fault tolerance; high performance computing; Linux; operating system

Indexed keywords

APPLICATION RECOVERY; CHECKPOINT/RESTART; FAILURE RATE; FAST RESTART; HIGH PERFORMANCE COMPUTING; LARGE SYSTEM; LATENCY PROBLEM; LINUX; LINUX ENVIRONMENT; OPERATING SYSTEM; PROCESS DATA; PROTOTYPE SYSTEM; REAL APPLICATIONS; RESTART MECHANISM; SYSTEM DEPENDABILITY;

EID: 79953201544     PISSN: 00189340     EISSN: None     Source Type: Journal    
DOI: 10.1109/TC.2010.129     Document Type: Article
Times cited: 15

References (43)
  • 2
    • 84976789801
    • The recovery box: Using fast recovery to provide high availability in the UNIX environment
    • M. Baker and M. Sullivan, "The Recovery Box: Using Fast Recovery to Provide High Availability in the UNIX Environment," Proc. Summer USENIX Technical Conf., 1992.
    • (1992) Proc. Summer USENIX Technical Conf.
    • Baker, M.1    Sullivan, M.2
  • 4
    • 27544461132
    • A model for predicting the optimum checkpoint interval for restart dumps
    • J. Daly, "A Model for Predicting the Optimum Checkpoint Interval for Restart Dumps," Proc. Int'l Conf. Computational Science, 2003.
    • (2003) Proc. Int'l Conf. Computational Science
    • Daly, J.1
  • 6
    • 9144223280
    • Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
    • Apr.-June
    • E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look Into the Future of Practical Rollback-Recovery," IEEE Trans. Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, Apr.-June 2004.
    • (2004) IEEE Trans. Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
    • Elnozahy, E.1    Plank, J.2
  • 9
    • 31344436964
    • On designing direct dependency-based fast recovery algorithms for distributed systems
    • DOI 10.1145/974104.974110
    • B. Gupta, Z. Liu, and Z. Liang, "On Designing Direct Dependency-Based Fast Recovery Algorithms for Distributed Systems," ACM SIGOPS Operating Systems Rev., vol. 38, no. 1, pp. 58-73, 2004. (Pubitemid 46746979)
    • (2004) Operating Systems Review (ACM) , vol.38 , Issue.1 , pp. 58-73
    • Gupta, B.1    Liu, Z.2    Liang, Z.3
  • 11
  • 12
    • 0032095071
    • Virtual memory: Issues of implementation
    • B. Jacob and T. Mudge, "Virtual Memory: Issues of Implementation," Computer, vol. 31, no. 6, pp. 33-43, June 1998. (Pubitemid 128550816)
    • (1998) Computer , vol.31 , Issue.6 , pp. 33-43
    • Jacob, B.1    Mudge, T.2
  • 13
    • 85160681664
    • Transparent checkpoint-restart of multiple processes on commodity operating systems
    • O. Laadan and J. Nieh, "Transparent Checkpoint-Restart of Multiple Processes on Commodity Operating Systems," Proc. USENIX Ann. Technical Conf., 2007.
    • (2007) Proc. USENIX Ann. Technical Conf.
    • Laadan, O.1    Nieh, J.2
  • 14
    • 57049111494
    • Adaptive fault management of parallel applications for high performance computing
    • Dec.
    • Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, vol. 57, no. 12, pp. 1647-1660, Dec. 2008.
    • (2008) IEEE Trans. Computers , vol.57 , Issue.12 , pp. 1647-1660
    • Lan, Z.1    Li, Y.2
  • 16
    • 67649883517
    • Fault-aware runtime strategies for high-performance computing
    • Apr.
    • Y. Li, Z. Lan, P. Gujrati, and X. Sun, "Fault-Aware Runtime Strategies for High-Performance Computing," IEEE Trans. Parallel and Distributed Systems, vol. 20, no. 4, pp. 460-473, Apr. 2009.
    • (2009) IEEE Trans. Parallel and Distributed Systems , vol.20 , Issue.4 , pp. 460-473
    • Li, Y.1    Lan, Z.2    Gujrati, P.3    Sun, X.4
  • 17
    • 0028485392
    • Low-latency, concurrent checkpointing for parallel programs
    • Aug.
    • K. Li, J. Naughton, and J.S. Plank, "Low-Latency, Concurrent Checkpointing for Parallel Programs," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 8, pp. 874-879, Aug. 1994.
    • (1994) IEEE Trans. Parallel and Distributed Systems , vol.5 , Issue.8 , pp. 874-879
    • Li, K.1    Naughton, J.2    Plank, J.S.3
  • 18
    • 0035390088
    • A variational calculus approach to optimal checkpoint placement
    • DOI 10.1109/12.936236
    • Y. Ling, J. Mi, and X. Lin, "A Variational Calculus Approach to Optimal Checkpoint Placement," IEEE Trans. Computers, vol. 50, no. 7, pp. 699-708, July 2001. (Pubitemid 32720123)
    • (2001) IEEE Transactions on Computers , vol.50 , Issue.7 , pp. 699-708
    • Ling, Y.1    Mi, J.2    Lin, X.3
  • 21
    • 79953179921
    • NCSA web site
    • NCSA web site, http://teragrid.ncsa.uiuc.edu, 2009.
    • (2009)
  • 22
    • 34547424386
    • Cooperative checkpointing: A robust approach to large-scale systems reliability
    • A. Oliner, L. Rudolph, and R. Sahoo, "Cooperative Checkpointing: A Robust Approach to Large-Scale Systems Reliability," Proc. Int'l Conf. Supercomputing, 2006.
    • (2006) Proc. Int'l Conf. Supercomputing
    • Oliner, A.1    Rudolph, L.2    Sahoo, R.3
  • 23
    • 79953221715
    • OpenSolaris
    • OpenSolaris, http://hub.opensolaris.org, 2010.
    • (2010)
  • 24
    • 79953192410
    • Oracle high availability document
    • Oracle high availability document, http://www.oracle.com/technology/deploy/availability/htdocs/fs-on-demand-rollback.htm, 2010.
    • (2010)
  • 26
    • 0033077475
    • Memory exclusion: Optimizing the performance of checkpointing systems
    • J. Plank, Y. Chen, K. Li, M. Beck, and G. Kingsley, "Memory Exclusion: Optimizing the Performance of Checkpointing Systems," Software-Practice and Experience, vol. 29, no. 2, pp. 125-142, 1999.
    • (1999) Software-Practice and Experience , vol.29 , Issue.2 , pp. 125-142
    • Plank, J.1    Chen, Y.2    Li, K.3    Beck, M.4    Kingsley, G.5
  • 28
    • 0035201417
    • Processor allocation and checkpoint interval selection in cluster computing systems
    • DOI 10.1006/jpdc.2001.1757
    • J. Plank and M.G. Thomason, "Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems," J. Parallel and Distributed Computing, vol. 61, no. 11, pp. 1570-1590, 2001. (Pubitemid 33119054)
    • (2001) Journal of Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
    • Plank, J.S.1    Thomason, M.G.2
  • 29
    • 0033721199
    • The cost of recovery in message logging protocols
    • Mar./Apr.
    • S. Rao, L. Alvisi, and H. Vin, "The Cost of Recovery in Message Logging Protocols," IEEE Trans. Knowledge and Data Eng., vol. 12, no. 2, pp. 160-173, Mar./Apr. 2000.
    • (2000) IEEE Trans. Knowledge and Data Eng. , vol.12 , Issue.2 , pp. 160-173
    • Rao, S.1    Alvisi, L.2    Vin, H.3
  • 31
    • 79953216957
    • SPEC CPU benchmark
    • SPEC CPU benchmark, http://www.spec.org/cpu2006/, 2006.
    • (2006)
  • 35
    • 39449084838
    • Managing disruptions to supply chains
    • L. Snyder and Z. Shen, "Managing Disruptions to Supply Chains," The Bridge, vol. 36, no. 4, pp. 39-45, 2006.
    • (2006) The Bridge , vol.36 , Issue.4 , pp. 39-45
    • Snyder, L.1    Shen, Z.2
  • 36
    • 0029251277
    • The Condor distributed processing system
    • T. Tannenbaum and M. Litzkow, "The Condor Distributed Processing System," Dr. Dobb's J., vol. 227, pp. 40-48, 1995.
    • (1995) Dr. Dobb's J. , vol.227 , pp. 40-48
    • Tannenbaum, T.1    Litzkow, M.2
  • 38
    • 79953200370
    • The FreeBSD Project
    • The FreeBSD Project, http://www.freebsd.org, 2010.
    • (2010)
  • 39
    • 0031388399
    • Impact of checkpoint latency on overhead ratio of a checkpointing scheme
    • N. Vaidya, "Impact of Checkpoint Latency on Overhead Ratio of a Checkpointing Scheme," IEEE Trans. Computers, vol. 46, no. 8, pp. 942-947, 1997. (Pubitemid 127760644)
    • (1997) IEEE Transactions on Computers , vol.46 , Issue.8 , pp. 942-947
    • Vaidya, N.H.1
  • 40
    • 77952260024
    • On the design of a new Linux readahead framework
    • F. Wu, H. Xi, and C. Xu, "On the Design of a New Linux Readahead Framework," ACM SIGOPS Operating Systems Rev., vol. 42, no. 5, pp. 75-84, 2008.
    • (2008) ACM SIGOPS Operating Systems Rev. , vol.42 , Issue.5 , pp. 75-84
    • Wu, F.1    Xi, H.2    Xu, C.3
  • 41
    • 85130634439
    • Dynamically forecasting network performance using the network weather service
    • R. Wolski, "Dynamically Forecasting Network Performance Using the Network Weather Service," J. Cluster Computing, vol. 1, no. 1, pp. 119-132, 1998.
    • (1998) J. Cluster Computing , vol.1 , Issue.1 , pp. 119-132
    • Wolski, R.1
  • 42
    • 84976846528
    • A first order approximation to the optimal checkpoint interval
    • J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, vol. 17, no. 9, pp. 530-531, 1974.
    • (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
    • Young, J.1
  • 43
    • 20444463494
    • FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
    • G. Zheng, L. Shi, and L. Kale, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," Proc. IEEE Cluster Computing, 2004.
    • (2004) Proc. IEEE Cluster Computing
    • Zheng, G.1    Shi, L.2    Kale, L.3


* This information was extracted by KISTI through analysis of Elsevier's SCOPUS database.