



2011, Pages 31-38

Redundant execution of HPC applications with MR-MPI

Author keywords

Fault tolerance; High performance computing; Message Passing Interface; Redundancy; Resilience

Indexed keywords

CHECKPOINT/RESTART; COMMUNICATION CONTENTION; EXTREME SCALE; HIGH-PERFORMANCE COMPUTING; MESSAGE PASSING INTERFACE; MPI PROCESS; MULTI-CORE SYSTEMS; NEGATIVE IMPACTS; PARALLEL APPLICATION; PARALLEL FILE SYSTEM; PARTIAL REPLICATION; PERFORMANCE TOOLS; POINT-TO-POINT BENCHMARK; REDUNDANCY CONFIGURATION; REDUNDANT NODES; RESILIENCE

EID: 79958180996     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.2316/P.2011.719-031     Document Type: Conference Paper
Times cited : (43)

References (18)
  • 2
    • 78650807026
    • Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O
    • Livermore, CA, USA, Aug. URL http://dx.doi.org/10.2172/964079
    • G. Bronevetsky and A. Moody. Scalable I/O systems via node-local storage: Approaching 1 TB/sec file I/O. Technical Report TR-JLPC-09-01, Lawrence Livermore National Laboratory, Livermore, CA, USA, Aug. 2009. URL http://dx.doi.org/10.2172/964079.
    • (2009) Technical Report TR-JLPC-09-01, Lawrence Livermore National Laboratory
    • Bronevetsky, G.1    Moody, A.2
  • 4
    • 0030129232
    • The Transis approach to high availability cluster communication
    • D. Dolev and D. Malki. The Transis approach to high availability cluster communication. Communications of the ACM, 39(4):64-70, 1996. ISSN 0001-0782. URL http://doi.acm.org/10.1145/227210.227227. (Pubitemid 126428118)
    • (1996) Communications of the ACM , vol.39 , Issue.4 , pp. 64-70
    • Dolev, D.1    Malki, D.2
  • 8
    • 58149131807
    • DDMR: Dynamic and scalable dual modular redundancy with short validation intervals
    • URL http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.12
    • A. Golander, S. Weiss, and R. Ronen. DDMR: Dynamic and scalable dual modular redundancy with short validation intervals. IEEE Computer Architecture Letters, 7(2):65-68, 2008. URL http://doi.ieeecomputersociety.org/10.1109/L-CA.2008.12.
    • (2008) IEEE Computer Architecture Letters , vol.7 , Issue.2 , pp. 65-68
    • Golander, A.1    Weiss, S.2    Ronen, R.3
  • 9
    • 33749067567
    • Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters
    • DOI 10.1088/1742-6596/46/1/067, 067
    • P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Proceedings of the Scientific Discovery through Advanced Computing Program (SciDAC) Conference 2006, volume 46, pages 494-499, Denver, CO, USA, June 25-29, 2006. Institute of Physics Publishing, Bristol, UK. URL http://www.iop.org/EJ/article/1742-6596/46/1/067/jpconf646067.pdf. (Pubitemid 44461038)
    • (2006) Journal of Physics: Conference Series , vol.46 , Issue.1 , pp. 494-499
    • Hargrove, P.H.1    Duell, J.C.2
  • 10
    • 11244269589
    • Configurable Fault-Tolerant Processor (CFTP) for spacecraft onboard processing
    • 1097, 2004 IEEE Aerospace Conference Proceedings
    • C. A. Hulme, H. H. Loomis, A. A. Ross, and R. Yuan. Configurable fault-tolerant processor (CFTP) for spacecraft onboard processing. In Proceedings of the IEEE Aerospace Conference 2004, volume 4, pages 2269-2276, Big Sky, MT, USA, Mar. 6-13, 2004. IEEE Computer Society. ISBN 0-7803-8155-6. URL http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1368020. (Pubitemid 40057225)
    • (2004) IEEE Aerospace Conference Proceedings , vol.4 , pp. 2269-2276
    • Hulme, C.A.1    Loomis, H.H.2    Ross, A.A.3    Yuan, R.4
  • 12
    • 70350469329
    • VolpexMPI: An MPI library for execution of parallel applications on volatile nodes
    • Espoo, Finland, Sept. 7-10 Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. URL http://dx.doi.org/10.1007/978-3-642-03770-2_19
    • th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2009, volume 5759, pages 124-133, Espoo, Finland, Sept. 7-10, 2009. Springer Verlag, Berlin, Germany. ISBN 978-3-540-75415-2. URL http://dx.doi.org/10.1007/978-3-642-03770-2_19.
    • (2009) th European PVM/MPI Users' Group Meeting (EuroPVM/MPI) 2009 , vol.5759 , pp. 124-133
    • LeBlanc, T.1    Anand, R.2    Gabriel, E.3    Subhlok, J.4
  • 13
    • 34548212768
    • Power efficient approaches to redundant multithreading
    • DOI 10.1109/TPDS.2007.1090
    • N. Madan and R. Balasubramonian. Power efficient approaches to redundant multithreading. IEEE Transactions on Parallel and Distributed Systems (TPDS), 18(8):1066-1079, 2007. ISSN 1045-9219. URL http://doi.ieeecomputersociety.org/10.1109/TPDS.2007.1090. (Pubitemid 47315989)
    • (2007) IEEE Transactions on Parallel and Distributed Systems , vol.18 , Issue.8 , pp. 1066-1079
    • Madan, N.1    Balasubramonian, R.2
  • 15
    • 67649255075
    • PLR: A software approach to transient fault tolerance for multicore architectures
    • ISSN 1545-5971. URL http://doi.ieeecomputersociety.org/10.1109/TDSC.2008.62
    • A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi, and D. A. Connors. PLR: A software approach to transient fault tolerance for multicore architectures. IEEE Transactions on Dependable and Secure Computing (TDSC), 6(2):135-148, 2009. ISSN 1545-5971. URL http://doi.ieeecomputersociety.org/10.1109/TDSC.2008.62.
    • (2009) IEEE Transactions on Dependable and Secure Computing (TDSC) , vol.6 , Issue.2 , pp. 135-148
    • Shye, A.1    Blomstedt, J.2    Moseley, T.3    Reddi, V.J.4    Connors, D.A.5
  • 16
    • 0026404704
    • Architecture of fault-tolerant computers: An historical perspective
    • ISSN 0018-9219. URL http://dx.doi.org/10.1109/5.119549
    • D. P. Siewiorek. Architecture of fault-tolerant computers: An historical perspective. Proceedings of the IEEE, 79(12):1710-1734, 1991. ISSN 0018-9219. URL http://dx.doi.org/10.1109/5.119549.
    • (1991) Proceedings of the IEEE , vol.79 , Issue.12 , pp. 1710-1734
    • Siewiorek, D.P.1
  • 18
    • 78249259344
    • MMPI: A scalable fault tolerance mechanism for MPI large scale parallel computing
    • Bradford, UK, June 29 - July 1 IEEE Computer Society. ISBN 978-0-7695-4108-2. URL http://doi.ieeecomputersociety.org/10.1109/CIT.2010.226
    • th IEEE International Conference on Computer and Information Technology (CIT) 2010, pages 1251-1256, Bradford, UK, June 29 - July 1, 2010. IEEE Computer Society. ISBN 978-0-7695-4108-2. URL http://doi.ieeecomputersociety.org/10.1109/CIT.2010.226.
    • (2010) th IEEE International Conference on Computer and Information Technology (CIT) 2010 , pp. 1251-1256
    • Yang, X.1    Wang, Z.2    Zhou, Y.3


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.