Volume , Issue , 2011, Pages

Evaluating the viability of process replication reliability for exascale systems

Author keywords

[No Author keywords available]

Indexed keywords

APPLICATION SCALABILITY; CHECK POINTS; COMPUTING TECHNIQUES; COSTS AND BENEFITS; EMPIRICAL ANALYSIS; FAILURE DISTRIBUTIONS; FAULT TOLERANCE MECHANISMS; HIGH-END COMPUTING; I/O BANDWIDTH; MEAN TIME TO FAILURE; MISSION CRITICAL SYSTEMS; STATE MACHINE; STATE MACHINE REPLICATION; TIME-TO-SOLUTION;

EID: 83155188951     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/2063384.2063443     Document Type: Conference Paper
Times cited : (159)

References (41)
  • 5
    • 78149257903
    • Transparent redundant computing with MPI
    • R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds., vol. 6305 of Lecture Notes in Computer Science, Springer
    • BRIGHTWELL, R., FERREIRA, K. B., AND RIESEN, R. Transparent redundant computing with MPI. In EuroMPI (2010), R. Keller, E. Gabriel, M. M. Resch, and J. Dongarra, Eds., vol. 6305 of Lecture Notes in Computer Science, Springer, pp. 208-218.
    • (2010) EuroMPI, pp. 208-218
    • Brightwell, R.; Ferreira, K.B.; Riesen, R.
  • 6
    • 68249127079
    • Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities
    • CAPPELLO, F. Fault tolerance in petascale/exascale systems: Current knowledge, challenges and research opportunities. IJHPCA 23, 3 (2009), 212-226.
    • (2009) IJHPCA, vol. 23, Issue 3, pp. 212-226
    • Cappello, F.
  • 7
    • 0345757358
    • Practical Byzantine Fault Tolerance and Proactive Recovery
    • DOI 10.1145/571637.571640
    • CASTRO, M., AND LISKOV, B. Practical Byzantine fault tolerance and proactive recovery. ACM Transactions on Computer Systems (TOCS) 20, 4 (Nov. 2002), 398-461. (Pubitemid 135702591)
    • (2002) ACM Transactions on Computer Systems, vol. 20, Issue 4, pp. 398-461
    • Castro, M.; Liskov, B.
  • 10
    • 28044460018
    • A higher order estimate of the optimum checkpoint interval for restart dumps
    • DOI 10.1016/j.future.2004.11.016, PII S0167739X04002213
    • DALY, J. T. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Gener. Comput. Syst. 22, 3 (2006), 303-312. (Pubitemid 41689812)
    • (2006) Future Generation Computer Systems, vol. 22, Issue 3, pp. 303-312
    • Daly, J.T.
  • 12
    • 9144223280
    • Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
    • ELNOZAHY, E., AND PLANK, J. Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery. Dependable and Secure Computing, IEEE Transactions on 1, 2 (Apr. 2004), 97-108.
    • (2004) Dependable and Secure Computing, IEEE Transactions on, vol. 1, Issue 2, pp. 97-108
    • Elnozahy, E.; Plank, J.
  • 13
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • ELNOZAHY, E. N. M., ALVISI, L., WANG, Y.-M., AND JOHNSON, D. B. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3 (2002), 375-408.
    • (2002) ACM Comput. Surv., vol. 34, Issue 3, pp. 375-408
    • Elnozahy, E.N.M.; Alvisi, L.; Wang, Y.-M.; Johnson, D.B.
  • 16
    • 0345415768
    • Fundamentals of fault-tolerant distributed computing in asynchronous environments
    • GÄRTNER, F. C. Fundamentals of fault-tolerant distributed computing in asynchronous environments. ACM Computing Surveys 31, 1 (March 1999), 1-26.
    • (1999) ACM Computing Surveys, vol. 31, Issue 1, pp. 1-26
    • Gärtner, F.C.
  • 18
    • 67349271621
    • An analysis of clustered failures on large supercomputing systems
    • HACKER, T. J., ROMERO, F., AND CAROTHERS, C. D. An analysis of clustered failures on large supercomputing systems. J. Parallel Distrib. Comput. 69 (July 2009), 652-665.
    • (2009) J. Parallel Distrib. Comput., vol. 69, pp. 652-665
    • Hacker, T.J.; Romero, F.; Carothers, C.D.
  • 23
    • 0017996760
    • Time, clocks, and the ordering of events in a distributed system
    • DOI 10.1145/359545.359563
    • LAMPORT, L. Time, clocks, and the ordering of events in a distributed system. Commun. ACM 21, 7 (1978), 558-565. (Pubitemid 8615486)
    • (1978) Communications of the ACM, vol. 21, Issue 7, pp. 558-565
    • Lamport, L.
  • 24
    • 0344850282
    • A generalized birthday problem
    • MATHIS, F. H. A generalized birthday problem. SIAM Review 33, 2 (1991), 265-270.
    • (1991) SIAM Review, vol. 33, Issue 2, pp. 265-270
    • Mathis, F.H.
  • 25
    • 0003321148
    • An overview of the Intel TFLOPS supercomputer
    • MATTSON, T. G., AND HENRY, G. An overview of the Intel TFLOPS supercomputer. Intel Technology Journal, Q1 (1998), 12.
    • (1998) Intel Technology Journal, vol. Q1, pp. 12
    • Mattson, T.G.; Henry, G.
  • 26
    • 15044360879
    • The architecture of Tandem's NonStop system
    • New York, NY, USA, ACM
    • MCEVOY, D. The architecture of Tandem's NonStop system. In ACM'81: Proceedings of the ACM'81 conference (New York, NY, USA, 1981), ACM, p. 245.
    • (1981) ACM'81: Proceedings of the ACM'81 Conference, p. 245
    • McEvoy, D.
  • 32
    • 0028994249
    • Algorithm-based diskless checkpointing for fault tolerant matrix operations
    • Pasadena, CA, USA, June 1995, Los Alamitos, CA, USA: IEEE Comput. Soc. Press
    • PLANK, J. S., KIM, Y. B., AND DONGARRA, J. J. Algorithm-based diskless checkpointing for fault tolerant matrix operations. In Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers (Pasadena, CA, USA, June 1995), Los Alamitos, CA, USA: IEEE Comput. Soc. Press, 1995, pp. 351-360.
    • (1995) Twenty-Fifth International Symposium on Fault-Tolerant Computing. Digest of Papers, pp. 351-360
    • Plank, J.S.; Kim, Y.B.; Dongarra, J.J.
  • 33
    • 0002467378
    • Fast parallel algorithms for short-range molecular dynamics
    • PLIMPTON, S. J. Fast parallel algorithms for short-range molecular dynamics. J Comp Phys 117, 1 (1995), 1-19.
    • (1995) J Comp Phys, vol. 117, Issue 1, pp. 1-19
    • Plimpton, S.J.
  • 35
    • 83155177911
    • Mantevo project home page
    • Sandia National Laboratory. Mantevo project home page. https://software.sandia.gov/mantevo, Apr. 10, 2010.
    • (2010) Mantevo Project
  • 36
    • 0025564050
    • Implementing fault-tolerant services using the state machine approach: A tutorial
    • SCHNEIDER, F. Implementing fault-tolerant services using the state machine approach: A tutorial. ACM Computing Surveys 22, 4 (1990), 299-319.
    • (1990) ACM Computing Surveys, vol. 22, Issue 4, pp. 299-319
    • Schneider, F.
  • 39
    • 84864756973
    • An experimental study about diskless checkpointing
    • Vasteras, Sweden, August, IEEE Computer Society Press
    • SILVA, L. M., AND SILVA, J. G. An experimental study about diskless checkpointing. In 24th EUROMICRO Conference (Vasteras, Sweden, August 1998), IEEE Computer Society Press, pp. 395-402.
    • (1998) 24th EUROMICRO Conference, pp. 395-402
    • Silva, L.M.; Silva, J.G.
  • 40
    • 46049083585
    • Joshua: Symmetric active/active replication for highly available HPC job and resource management
    • Los Alamitos, CA, USA, IEEE Computer Society
    • UHLEMANN, K., ENGELMANN, C., AND SCOTT, S. Joshua: Symmetric active/active replication for highly available HPC job and resource management. In Proceedings of the 2006 IEEE International Conference on Cluster Computing (Los Alamitos, CA, USA, 2006), IEEE Computer Society.
    • (2006) Proceedings of the 2006 IEEE International Conference on Cluster Computing
    • Uhlemann, K.; Engelmann, C.; Scott, S.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.