SCOPUS Information Retrieval Platform

Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis

Volume , Issue , 2011, Pages

Modeling and tolerating heterogeneous failures in large parallel systems

(6) Heien, Eric (a); Kondo, Derrick (a); Gainaru, Ana (b, c); Lapine, Dan (b); Kramer, Bill (b); Cappello, Franck (a)

a INRIA (France)

b UNIVERSITY OF ILLINOIS AT URBANA CHAMPAIGN (United States)

c UNIVERSITY POLITEHNICA OF BUCHAREST (Romania)

Author keywords

[No Author keywords available]

Indexed keywords

APPLICATION-CENTRIC; CHECK POINTING; COMPONENT FAILURES; FAILURE MODEL; FAILURE RATE; FAULT-TOLERANT ALGORITHMS; GENERAL MODEL; HARDWARE COMPONENTS; HARDWARE FAILURES; HIGH PERFORMANCE COMPUTING SYSTEMS; OR-NETWORKS; PARALLEL SYSTEM; SPACE AND TIME; SPECIFIC COMPONENT; SUPERCOMPUTING APPLICATIONS; SYSTEM FAILURES;

COMPUTER SOFTWARE SELECTION AND EVALUATION; FAULT TOLERANT COMPUTER SYSTEMS; SUPERCOMPUTERS; SYSTEMS ENGINEERING;

COMPUTATION THEORY;

EID: 83155160934 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/2063384.2063444 Document Type: Conference Paper

Times cited : (72)

References (27)

1
- 83155186191
- Personal communication, May
- William D. Gropp. Personal communication, May 2010.
- (2010)
- Gropp, W.D.¹

2
- 70450206305
- Toward exascale resilience
- November
- Franck Cappello, Al Geist, Bill Gropp, Laxmikant Kale, Bill Kramer, and Marc Snir. Toward exascale resilience. Int. J. High Perform. Comput. Appl., 23:374-388, November 2009.
- (2009) Int. J. High Perform. Comput. Appl. , vol.23 , pp. 374-388
- Cappello, F.¹ Geist, A.² Gropp, B.³ Kale, L.⁴ Kramer, B.⁵ Snir, M.⁶

3
- 83155195268
- Hierarchical event log organizer
- Sep
- Ana Gainaru, Franck Cappello, Stefan Trausan-Matu, and Bill Kramer. Hierarchical event log organizer. Technical Report of the INRIA-Illinois Joint Laboratory on PetaScale Computing, pages 1-24, Sep 2010.
- (2010) Technical Report of the INRIA-Illinois Joint Laboratory on PetaScale Computing , pp. 1-24
- Gainaru, A.¹ Cappello, F.² Trausan-Matu, S.³ Kramer, B.⁴

4
- 85076902294
- Availability in globally distributed storage systems
- Daniel Ford, Francois Labelle, Florentina Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. Availability in globally distributed storage systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation, 2010.
- (2010) Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation
- Ford, D.¹ Labelle, F.² Popovici, F.³ Stokely, M.⁴ Truong, V.-A.⁵ Barroso, L.⁶ Grimes, C.⁷ Quinlan, S.⁸

5
- 36049041275
- Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you?
- Oct
- Bianca Schroeder and Garth Gibson. Understanding disk failure rates: What does an MTTF of 1,000,000 hours mean to you? Transactions on Storage (TOS), 3(3), Oct 2007.
- (2007) Transactions on Storage (TOS) , vol.3 , Issue.3
- Schroeder, B.¹ Gibson, G.²

6
- 33845593340
- A large-scale study of failures in high-performance computing systems
- DOI 10.1109/DSN.2006.5, 1633514, Proceedings - DSN 2006: 2006 International Conference on Dependable Systems and Networks
- Bianca Schroeder and Garth A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, pages 249-258, Washington, DC, USA, 2006. IEEE Computer Society. (Pubitemid 44930426)
- (2006) Proceedings of the International Conference on Dependable Systems and Networks , vol.2006 , pp. 249-258
- Schroeder, B.¹ Gibson, G.A.²

7
- 0003594381
- Duxbury
- G. Casella and R. Berger. Statistical Inference. Duxbury, 2002.
- (2002) Statistical Inference
- Casella, G.¹ Berger, R.²

8
- 38049182471
- How are real grids used? the analysis of four grid traces and its implications
- A. Iosup, C. Dumitrescu, D. H. J. Epema, H. Li, and L. Wolters. How are real grids used? the analysis of four grid traces and its implications. In GRID, pages 262-269, 2006.
- (2006) GRID , pp. 262-269
- Iosup, A.¹ Dumitrescu, C.² Epema, D.H.J.³ Li, H.⁴ Wolters, L.⁵

9
- 38049172300
- Catalog of BOINC projects. http://www.boinc-wiki.info/Catalog-of-BOINC-Powered-Projects.
- Catalog of Boinc Projects

10
- 38149135341
- D. Baker. ROSETTA@home. http://boinc.bakerlab.org/rosetta/.
- ROSETTA@home
- Baker, D.¹

11
- 84900592671
- EINSTEIN@home. http://einstein.phys.uwm.edu.
- EINSTEIN@home

12
- 27344436659
- Scalable molecular dynamics with NAMD
- DOI 10.1002/jcc.20289
- James C. Phillips, Rosemary Braun, Wei Wang, James Gumbart, Emad Tajkhorshid, Elizabeth Villa, Christophe Chipot, Robert D. Skeel, Laxmikant V. Kalé, and Klaus Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26(16):1781-1802, 2005. (Pubitemid 43078511)
- (2005) Journal of Computational Chemistry , vol.26 , Issue.16 , pp. 1781-1802
- Phillips, J.C.¹ Braun, R.² Wang, W.³ Gumbart, J.⁴ Tajkhorshid, E.⁵ Villa, E.⁶ Chipot, C.⁷ Skeel, R.D.⁸ Kale, L.⁹ Schulten, K.¹⁰

13
- 84976846528
- A first order approximation to the optimum checkpoint interval
- September
- John W. Young. A first order approximation to the optimum checkpoint interval. Commun. ACM, 17:530-531, September 1974.
- (1974) Commun. ACM , vol.17 , pp. 530-531
- Young, J.W.¹

14
- 79961165170
- On the scheduling of checkpoints in desktop grids
- M. S. Bouguerra, D. Kondo, and D. Trystram. On the scheduling of checkpoints in desktop grids. In Proceedings of the 11th IEEE International Symposium on Cluster Computing and Grid (CCGrid), 2011.
- (2011) Proceedings of the 11th IEEE International Symposium on Cluster Computing and Grid (CCGrid)
- Bouguerra, M.S.¹ Kondo, D.² Trystram, D.³

15
- 67049096648
- Alert detection in system logs
- A Oliner, A Aiken, and J Stearley. Alert detection in system logs. Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on, pages 959-964, 2008.
- (2008) Data Mining, 2008. ICDM'08. Eighth IEEE International Conference on , pp. 959-964
- Oliner, A.¹ Aiken, A.² Stearley, J.³

16
- 20444471122
- Towards informatic analysis of syslogs
- 2004 IEEE International Conference on Cluster Computing, ICCC 2004
- J. Stearley. Towards informatic analysis of syslogs. In Proceedings of the 2004 IEEE International Conference on Cluster Computing, pages 309-318, Washington, DC, USA, 2004. IEEE Computer Society. (Pubitemid 40822381)
- (2004) Proceedings - IEEE International Conference on Cluster Computing, ICCC , pp. 309-318
- Stearley, J.¹

17
- 77954903245
- The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems
- Derrick Kondo, Bahman Javadi, Alexandru Iosup, and Dick Epema. The failure trace archive: Enabling comparative analysis of failures in diverse distributed systems. Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, pages 398-407, 2010.
- (2010) Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on , pp. 398-407
- Kondo, D.¹ Javadi, B.² Iosup, A.³ Epema, D.⁴

18
- 2442427625
- Understanding availability
- R. Bhagwan, S. Savage, and G. Voelker. Understanding Availability. In Proceedings of IPTPS'03, 2003.
- (2003) Proceedings of IPTPS'03
- Bhagwan, R.¹ Savage, S.² Voelker, G.³

19
- 76349120592
- Mining for statistical availability models in large-scale distributed systems: An empirical study of SETI@home
- September
- B. Javadi, D. Kondo, J.M. Vincent, and D.P. Anderson. Mining for statistical availability models in large-scale distributed systems: An empirical study of SETI@home. In 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), September 2009.
- (2009) 17th IEEE/ACM International Symposium on Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)
- Javadi, B.¹ Kondo, D.² Vincent, J.M.³ Anderson, D.P.⁴

20
- 33244467640
- Is remote host availability governed by a universal law?
- John R. Douceur. Is remote host availability governed by a universal law? SIGMETRICS Performance Evaluation Review, 31(3):25-29, 2003.
- (2003) SIGMETRICS Performance Evaluation Review , vol.31 , Issue.3 , pp. 25-29
- Douceur, J.R.¹

21
- 38449113154
- Quantifying machine availability in networked and desktop grid systems
- University of California at Santa Barbara, November
- J. Brevik, D. Nurmi, and R. Wolski. Quantifying Machine Availability in Networked and Desktop Grid Systems. Technical Report CS2003-37, Dept. of Computer Science and Engineering, University of California at Santa Barbara, November 2003.
- (2003) Technical Report CS2003-37, Dept. of Computer Science and Engineering
- Brevik, J.¹ Nurmi, D.² Wolski, R.³

22
- 21844470195
- On correlated failures in survivable storage systems
- Mehmet Bakkaloglu, Jay J. Wylie, Chenxi Wang, and Gregory R. Ganger. On correlated failures in survivable storage systems. Technical Report CMU-CS-02-129, Carnegie Mellon University, 2002.
- (2002) Technical Report CMU-CS-02-129, Carnegie Mellon University
- Bakkaloglu, M.¹ Wylie, J.J.² Wang, C.³ Ganger, G.R.⁴

23
- 38049145912
- Characterizing result errors in internet desktop grids
- D Kondo, F Araujo, P Malecot, P Domingues, LM Silva, G Fedak, and F Cappello. Characterizing result errors in internet desktop grids. Lecture Notes in Computer Science, 4641:361, 2007.
- (2007) Lecture Notes in Computer Science , vol.4641 , pp. 361
- Kondo, D.¹ Araujo, F.² Malecot, P.³ Domingues, P.⁴ Silva, L.M.⁵ Fedak, G.⁶ Cappello, F.⁷

24
- 70449657893
- DRAM errors in the wild: A large-scale field study
- Jun
- Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. DRAM errors in the wild: A large-scale field study. SIGMETRICS'09: Proceedings of the eleventh international joint conference on Measurement and modeling of computer systems, Jun 2009.
- (2009) SIGMETRICS'09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems
- Schroeder, B.¹ Pinheiro, E.² Weber, W.-D.³

25
- 85077125713
- A realistic evaluation of memory hardware errors and software system susceptibility
- Jun
- Xin Li, Michael Huang, Kai Shen, and Lingkun Chu. A realistic evaluation of memory hardware errors and software system susceptibility. USENIXATC'10: Proceedings of the 2010 USENIX conference on USENIX annual technical conference, Jun 2010.
- (2010) USENIXATC'10: Proceedings of the 2010 USENIX Conference on USENIX Annual Technical Conference
- Li, X.¹ Huang, M.² Shen, K.³ Chu, L.⁴

26
- 70449844295
- DMTCP: Transparent checkpointing for cluster computations and the desktop
- 0
- Jason Ansel, Kapil Arya, and Gene Cooperman. DMTCP: Transparent checkpointing for cluster computations and the desktop. Parallel and Distributed Processing Symposium, International, 0:1-12, 2009.
- (2009) Parallel and Distributed Processing Symposium, International , pp. 1-12
- Ansel, J.¹ Arya, K.² Cooperman, G.³

27
- 34548282622
- Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI
- Camille Coti, Thomas Herault, Pierre Lemarinier, Laurence Pilard, Ala Rezmerita, Eric Rodriguez, and Franck Cappello. Blocking vs. non-blocking coordinated checkpointing for large-scale fault tolerant MPI. In SC 2006 Conference, Proceedings of the ACM/IEEE, page 18, 2006.
- (2006) SC 2006 Conference, Proceedings of the ACM/IEEE , pp. 18
- Coti, C.¹ Herault, T.² Lemarinier, P.³ Pilard, L.⁴ Rezmerita, A.⁵ Rodriguez, E.⁶ Cappello, F.⁷

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS DB.