2011, Pages 24-31

A redundant communication approach to scalable fault tolerance in PGAS programming models

Author keywords

Computational chemistry; Fault tolerance; Global Arrays; NWChem

Indexed keywords

CHECKPOINT/RESTART; COST OF FAILURE; FAULT TOLERANCE MECHANISMS; GLOBAL ARRAYS; HIGH-PERFORMANCE COMPUTING; LARGE MACHINES; LONG-RUNNING APPLICATIONS; MEAN TIME BETWEEN FAILURES; MEMORY BANDWIDTHS; NWCHEM; PERFORMANCE ISSUES; PROGRAMMING MODELS; RECENT TRENDS; REMOTE MEMORY ACCESS;

EID: 79955038373     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1109/PDP.2011.72     Document Type: Conference Paper
Times cited: 18

References (34)
  • 2
    • 21244491597
    • Soft errors in advanced computer systems
    • DOI 10.1109/MDT.2005.69
    • R. Baumann, "Soft errors in advanced computer systems," IEEE Design & Test of Computers, vol. 22, no. 3, pp. 258-266, 2005. (Pubitemid 40889826)
    • (2005) IEEE Design and Test of Computers , vol.22 , Issue.3 , pp. 258-266
    • Baumann, R.1
  • 3
    • 79955046559
    • "Roadrunner," http://www.lanl.gov/roadrunner.
    • Roadrunner
  • 4
    • 79955008050
    • "Jaguar," http://www.nccs.gov/jaguar.
    • Jaguar
  • 6
    • 9144223280
    • Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
    • E. N. Elnozahy and J. S. Plank, "Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 2, pp. 97-108, 2004.
    • (2004) IEEE Transactions on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
    • Elnozahy, E.N.1    Plank, J.S.2
  • 7
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson, "A survey of rollback-recovery protocols in message-passing systems," ACM Computing Surveys, vol. 34, no. 3, pp. 375-408, 2002.
    • (2002) ACM Computing Surveys , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 8
    • 0031570635
    • Application level fault tolerance in heterogeneous networks of workstations
    • DOI 10.1006/jpdc.1997.1338, PII S0743731597913381
    • A. Beguelin, E. Seligman, and P. Stephan, "Application level fault tolerance in heterogeneous networks of workstations," Journal of Parallel and Distributed Computing, vol. 43, no. 2, pp. 147-155, 1997. (Pubitemid 127171411)
    • (1997) Journal of Parallel and Distributed Computing , vol.43 , Issue.2 , pp. 147-155
    • Beguelin, A.1    Seligman, E.2    Stephan, P.3
  • 11
    • 33749067567
    • Berkeley lab checkpoint/restart (BLCR) for Linux clusters
    • DOI 10.1088/1742-6596/46/1/067, 067
    • P. H. Hargrove and J. C. Duell, "Berkeley lab checkpoint/restart (BLCR) for Linux clusters," Journal of Physics: Conference Series, vol. 46, no. 1, pp. 494-499, 2006. (Pubitemid 44461038)
    • (2006) Journal of Physics: Conference Series , vol.46 , Issue.1 , pp. 494-499
    • Hargrove, P.H.1    Duell, J.C.2
  • 14
    • 0022020346
    • Distributed snapshots: Determining global states of distributed systems
    • DOI 10.1145/214451.214456
    • K. M. Chandy and L. Lamport, "Distributed snapshots: Determining global states of distributed systems," ACM Transactions on Computer Systems, vol. 3, no. 1, pp. 63-75, 1985. (Pubitemid 15597765)
    • (1985) ACM Transactions on Computer Systems , vol.3 , Issue.1 , pp. 63-75
    • Chandy, K.M.1    Lamport, L.2
  • 17
    • 0021439162
    • Algorithm-based fault tolerance for matrix operations
    • K.-H. Huang and J. A. Abraham, "Algorithm-based fault tolerance for matrix operations," IEEE Transactions on Computers, vol. 33, no. 6, pp. 518-528, 1984.
    • (1984) IEEE Transactions on Computers , vol.33 , Issue.6 , pp. 518-528
    • Huang, K.-H.1    Abraham, J.A.2
  • 19
    • 33847240498
    • Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources
    • Z. Chen and J. Dongarra, "Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources," in IEEE International Parallel & Distributed Processing Symposium, Apr. 2006.
    • (2006) IEEE International Parallel & Distributed Processing Symposium
    • Chen, Z.1    Dongarra, J.2
  • 21
    • 20444463494
    • FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
    • 2004 IEEE International Conference on Cluster Computing, ICCC 2004
    • G. Zheng, L. Shi, and L. V. Kale, "FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI," in IEEE International Conference on Cluster Computing, Sep. 2004, pp. 93-103. (Pubitemid 40822360)
    • (2004) Proceedings - IEEE International Conference on Cluster Computing, ICCC , pp. 93-103
    • Zheng, G.1    Shi, L.2    Kale, L.V.3
  • 23
    • 77951481809
    • CIFTS: A coordinated infrastructure for fault-tolerant systems
    • R. Gupta et al., "CIFTS: A coordinated infrastructure for fault-tolerant systems," in International Conference on Parallel Processing, 2009, pp. 237-245.
    • (2009) International Conference on Parallel Processing , pp. 237-245
    • Gupta, R.1
  • 27
    • 84994456017
    • "Global Arrays," http://www.emsl.pnl.gov/docs/global.
    • Global Arrays
  • 28
    • 77955309392
    • NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations
    • M. Valiev et al., "NWChem: A comprehensive and scalable open-source solution for large scale molecular simulations," Computer Physics Communications, vol. 181, no. 9, pp. 1477-1489, 2010.
    • (2010) Computer Physics Communications , vol.181 , Issue.9 , pp. 1477-1489
    • Valiev, M.1
  • 29
    • 77953931510
    • Utilizing high performance computing for chemistry: Parallel computational chemistry
    • W. A. Jong et al., "Utilizing high performance computing for chemistry: parallel computational chemistry," Physical Chemistry Chemical Physics, vol. 12, no. 26, pp. 6896-6920, 2010.
    • (2010) Physical Chemistry Chemical Physics , vol.12 , Issue.26 , pp. 6896-6920
    • Jong, W.A.1
  • 31
    • 33746091677
    • ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis
    • DOI 10.1109/TPDS.2006.112
    • C. Oehmen and J. Nieplocha, "ScalaBLAST: A scalable implementation of BLAST for high-performance data-intensive bioinformatics analysis," IEEE Transactions on Parallel and Distributed Systems, vol. 17, no. 8, pp. 740-749, 2006. (Pubitemid 44070144)
    • (2006) IEEE Transactions on Parallel and Distributed Systems , vol.17 , Issue.8 , pp. 740-749
    • Oehmen, C.1    Nieplocha, J.2


* This information was extracted by KISTI through its analysis of Elsevier's SCOPUS database.