SCOPUS Information Search Platform

Proceedings - IEEE International Conference on Cluster Computing, ICCC

Volume , Issue , 2009, Pages

Reliability-aware scalability models for high performance computing

(2) Zheng, Ziming ᵃ; Lan, Zhiling ᵃ

ᵃ Illinois Institute of Technology (United States)

Author keywords

[No Author keywords available]

Indexed keywords

ANALYTICAL TOOL; APPLICATION PERFORMANCE; APPLICATION SCALABILITY; DEVELOPED MODEL; FAULT TOLERANCE TECHNIQUES; HIGH PERFORMANCE COMPUTING; PARALLEL APPLICATION; TRACE-BASED SIMULATION;

CLUSTER COMPUTING; COMPUTER SCIENCE; QUALITY ASSURANCE; SCALABILITY; SIMULATORS; TECHNICAL PRESENTATIONS;

FAULT TOLERANCE;

EID: 72049124295
PISSN: 15525244
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1109/CLUSTR.2009.5289177
Document Type: Conference Paper

Times cited : (27)

References (38)

1
- 36049013419
- What supercomputers say: A study of five system logs
- A. Oliner and J. Stearly, "What Supercomputers Say: A Study of Five System Logs," Proc. of DSN, 2007.
- (2007) Proc. of DSN
- Oliner, A.¹ Stearly, J.²

2
- 33845593340
- A large-scale study of failures in high-performance-computing systems
- B. Schroeder and G. Gibson, "A Large-scale Study of Failures in High-performance-computing Systems," Proc. of DSN, 2006.
- (2006) Proc. of DSN
- Schroeder, B.¹ Gibson, G.²

3
- 85060036181
- Validity of the single processor approach to achieving large-scale computing capabilities
- G. Amdahl, "Validity of the Single Processor Approach to Achieving Large-Scale Computing Capabilities," Proc. of AFIPS Spring Joint Computer Conference, 1967.
- (1967) Proc. of AFIPS Spring Joint Computer Conference
- Amdahl, G.¹

4
- 0024012163
- Reevaluating Amdahl's law
- J. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, 31(5):532-533, 1988.
- (1988) Communications of the ACM , vol.31 , Issue.5 , pp. 532-533
- Gustafson, J.¹

5
- 72049112957
- Cluster survivability with ByzwATCH: A byzantine hardware fault detector for parallel machines with charm++
- D. Mogilevsky, G. Koenig, and W. Yurcik, "Cluster Survivability with ByzwATCH: A Byzantine Hardware Fault Detector for Parallel Machines with Charm++", Proc. of the 2nd Workshop on High Performance Computing Reliability Issues, 2006.
- (2006) Proc. of the 2nd Workshop on High Performance Computing Reliability Issues
- Mogilevsky, D.¹ Koenig, G.² Yurcik, W.³

6
- 72049101354
- Adaptive grid-enabled SIMOX simulation on Japan-US grid testbed
- Y. Tanaka, H. Takemiya, S. Sekiguchi, S. Ogata, A. Nakano, R. Kalia, and P. Vashishta, "Adaptive Grid-enabled SIMOX Simulation on Japan-US Grid Testbed", Proc. of TeraGrid, 2006.
- (2006) Proc. of TeraGrid
- Tanaka, Y.¹ Takemiya, H.² Sekiguchi, S.³ Ogata, S.⁴ Nakano, A.⁵ Kalia, R.⁶ Vashishta, P.⁷

7
- 84870399830
- Top500 supercomputing sites. http://top500.org/.
- Top500 Supercomputing Sites

8
- 51049111944
- Big systems and big reliability challenges
- D. Reed, C. Lu, and C. Mendes, "Big systems and big reliability challenges," Proc. of Parallel Computing, 2003.
- (2003) Proc. of Parallel Computing
- Reed, D.¹ Lu, C.² Mendes, C.³

9
- 0025502686
- Error log analysis: Statistical modeling and heuristic trend analysis
- T. Lin and D. Siewiorek, "Error log analysis: statistical modeling and heuristic trend analysis," IEEE Trans. on Reliability, 39(4):419-432, 1990.
- (1990) IEEE Trans. on Reliability , vol.39 , Issue.4 , pp. 419-432
- Lin, T.¹ Siewiorek, D.²

10
- 0025556948
- Another view on parallel speedup
- X. Sun and L. Ni, "Another View on Parallel Speedup," Proc. of Supercomputing, 1990.
- (1990) Proc. of Supercomputing
- Sun, X.¹ Ni, L.²

11
- 52949107193
- Algorithm-system scalability of heterogeneous computing
- Y. Chen, X. Sun, and M. Wu, "Algorithm-System Scalability of Heterogeneous Computing," Journal of Parallel and Distributed Computing, 68(11):1403-1412, 2008.
- (2008) Journal of Parallel and Distributed Computing , vol.68 , Issue.11 , pp. 1403-1412
- Chen, Y.¹ Sun, X.² Wu, M.³

12
- 33745170068
- Scalability of heterogeneous computing
- X. Sun, Y. Chen, and M. Wu, "Scalability of Heterogeneous Computing," Proc. of ICPP, 2005.
- (2005) Proc. of ICPP
- Sun, X.¹ Chen, Y.² Wu, M.³

13
- 34548800708
- Power-aware speedup
- R. Ge and K. Cameron, "Power-Aware Speedup," Proc. of IPDPS, 2007.
- (2007) Proc. of IPDPS
- Ge, R.¹ Cameron, K.²

14
- 56749158844
- Performance under failure of high-end computing
- M. Wu, X. Sun, and H. Jin, "Performance under Failure of High-End Computing," Proc. of SuperComputing, 2007.
- (2007) Proc. of SuperComputing
- Wu, M.¹ Sun, X.² Jin, H.³

15
- 28044460018
- A higher order estimate of the optimum checkpoint interval for restart dumps
- J. Daly, "A Higher Order Estimate of the Optimum Checkpoint Interval for Restart Dumps," Future Generation Computer Systems, 22(3): 303-312, 2006.
- (2006) Future Generation Computer Systems , vol.22 , Issue.3 , pp. 303-312
- Daly, J.¹

16
- 0012237782
- Minimizing completion time of a program by checkpointing and rejuvenation
- S. Garg, Y. Huang, C. Kintala, and K. Trivedi, "Minimizing Completion Time of a Program by Checkpointing and Rejuvenation," Proc. of SIGMETRICS, 1996.
- (1996) Proc. of SIGMETRICS
- Garg, S.¹ Huang, Y.² Kintala, C.³ Trivedi, K.⁴

17
- 0035201417
- Processor allocation and checkpoint interval selection in cluster computing systems
- J. Plank and M. Thomason, "Processor allocation and checkpoint interval selection in cluster computing systems," Journal of Parallel and Distributed Computing, 61(11): 1570-1590, 2001.
- (2001) Journal of Parallel and Distributed Computing , vol.61 , Issue.11 , pp. 1570-1590
- Plank, J.¹ Thomason, M.²

18
- 85014175705
- Experimental assessment of workstation failures and their impact on checkpointing systems
- J. Plank and W. Elwasif, "Experimental Assessment of Workstation Failures and Their Impact on Checkpointing Systems," Proc. of FTCS, 1998.
- (1998) Proc. of FTCS
- Plank, J.¹ Elwasif, W.²

19
- 9144223280
- Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
- E. Elnozahy and J. Plank, "Checkpointing for Peta-Scale Systems: A Look into the Future of Practical Rollback-Recovery," IEEE Trans. on Dependable and Secure Computing, 1(2):97-108, 2004.
- (2004) IEEE Trans. on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
- Elnozahy, E.¹ Plank, J.²

20
- 27544513113
- Modeling coordinated checkpointing for large-scale supercomputers
- L. Wang, K. Pattabiraman, Z. Kalbarczyk, and R. Iyer, "Modeling Coordinated Checkpointing for Large-Scale Supercomputers," Proc. of DSN, 2005.
- (2005) Proc. of DSN
- Wang, L.¹ Pattabiraman, K.² Kalbarczyk, Z.³ Iyer, R.⁴

21
- 57049111494
- Adaptive fault management of parallel applications for high performance computing
- Z. Lan and Y. Li, "Adaptive Fault Management of Parallel Applications for High Performance Computing," IEEE Trans. Computers, 57(12): 1647-1660, 2008.
- (2008) IEEE Trans. Computers , vol.57 , Issue.12 , pp. 1647-1660
- Lan, Z.¹ Li, Y.²

22
- 55849147399
- Dynamic meta-learning for failure prediction in large-scale systems: A case study
- J. Gu, Z. Zheng, Z. Lan, J. White, E. Hocks, and B-H. Park, "Dynamic Meta-Learning for Failure Prediction in Large-scale Systems: A Case Study", Proc. of ICPP, 2008.
- (2008) Proc. of ICPP
- Gu, J.¹ Zheng, Z.² Lan, Z.³ White, J.⁴ Hocks, E.⁵ Park, B.-H.⁶

23
- 72049113723
- Reliability aware optimal K node of parallel applications in large scale HPC systems
- N. Gottumukkala, C. Leangsuksun, R. Nassar, M. Paun, D. Sule, and S. Scott, "Reliability Aware Optimal K Node of Parallel applications in Large Scale HPC Systems," Proc. of High Availability and Performance Computing Workshop, 2008.
- (2008) Proc. of High Availability and Performance Computing Workshop
- Gottumukkala, N.¹ Leangsuksun, C.² Nassar, R.³ Paun, M.⁴ Sule, D.⁵ Scott, S.⁶

24
- 84976846528
- A first order approximation to the optimal checkpoint interval
- J. Young, "A First Order Approximation to the Optimal Checkpoint Interval," Comm. ACM, 17(9): 530-531, 1974.
- (1974) Comm. ACM , vol.17 , Issue.9 , pp. 530-531
- Young, J.¹

25
- 33845595513
- Performance implications of failures in large-scale cluster scheduling
- Y. Zhang, M. Squillante, A. Sivasubramaniam, and R. Sahoo, "Performance implications of failures in large-scale cluster scheduling," Proc. of Workshop on JSSPP, SIGMETRICS, 2004.
- (2004) Proc. of Workshop on JSSPP, SIGMETRICS
- Zhang, Y.¹ Squillante, M.² Sivasubramaniam, A.³ Sahoo, R.⁴

26
- 33746286070
- Performance implications of periodic checkpointing on large-scale cluster systems
- A. Oliner, R. Sahoo, J. Moreira, and M. Gupta, "Performance Implications of Periodic Checkpointing on Large-scale Cluster Systems," Proc. of IPDPS, 2005.
- (2005) Proc. of IPDPS
- Oliner, A.¹ Sahoo, R.² Moreira, J.³ Gupta, M.⁴

27
- 72049130706
- Opportunistic checkpoint intervals to improve system performance
- S. Arunagiri, J. Daly, P. Teller, S. Seelam, R. Oldfield, M. Varela, and R. Riesen, "Opportunistic Checkpoint Intervals to Improve System Performance," Technical Report UTEP-CS-08-24, 2008.
- (2008) Technical Report UTEP-CS-08-24
- Arunagiri, S.¹ Daly, J.² Teller, P.³ Seelam, S.⁴ Oldfield, R.⁵ Varela, M.⁶ Riesen, R.⁷

28
- 72049129021
- Improved message logging versus improved coordinated checkpointing for fault tolerant MPI
- A. Bouteiller, P. Lemarinier, G. Krawezik, and F. Cappello, "Improved message logging versus improved coordinated checkpointing for fault tolerant MPI," Proc. of Cluster, 2003.
- (2003) Proc. of Cluster
- Bouteiller, A.¹ Lemarinier, P.² Krawezik, G.³ Cappello, F.⁴

29
- 85027617648
- Analysis of scalability of parallel algorithms and architectures: A survey
- V. Kumar and A. Gupta, "Analysis of scalability of parallel algorithms and architectures: a survey," Proc. of ICS, 1991.
- (1991) Proc. of ICS
- Kumar, V.¹ Gupta, A.²

30
- 64049097304
- Extending Amdahl's law for energy-efficient computing in the many-core era
- D. Woo and H. Lee, "Extending Amdahl's law for energy-efficient computing in the many-core era," IEEE Computer, 41(12):24-31, 2008.
- (2008) IEEE Computer , vol.41 , Issue.12 , pp. 24-31
- Woo, D.¹ Lee, H.²

31
- 34547424386
- Cooperative checkpointing: A robust approach to large-scale systems reliability
- A. Oliner, L. Rudolph, and R. Sahoo, "Cooperative checkpointing: A robust approach to large-scale systems reliability," Proc. of ICS, 2006.
- (2006) Proc. of ICS
- Oliner, A.¹ Rudolph, L.² Sahoo, R.³

32
- 34548100442
- Investigating lightweight storage and overlay networks for fault tolerance
- R. Oldfield, "Investigating lightweight storage and overlay networks for fault tolerance," Proc. of High Availability and Performance Computing Workshop, 2006.
- (2006) Proc. of High Availability and Performance Computing Workshop
- Oldfield, R.¹

33
- 12444268325
- System-level fault-tolerance in large-scale parallel machines with buffered coscheduling
- F. Petrini, K. Davis, and J. Sancho, "System-level fault-tolerance in large-scale parallel machines with buffered coscheduling," Proc. of IPDPS, 2004.
- (2004) Proc. of IPDPS
- Petrini, F.¹ Davis, K.² Sancho, J.³

34
- 0004244684
- Checkpointing and modelling of program execution time
- John Wiley and Sons
- V. Nicola, "Checkpointing and modelling of program execution time," Software Fault Tolerance, John Wiley and Sons, 1995.
- (1995) Software Fault Tolerance
- Nicola, V.¹

35
- 78649627101
- A fast recovery mechanism for checkpointing in networked environments
- Y. Li and Z. Lan, "A Fast Recovery Mechanism for Checkpointing in Networked Environments," Proc. of DSN, 2008.
- (2008) Proc. of DSN
- Li, Y.¹ Lan, Z.²

36
- 78449285638
- Proactive process-level live migration in HPC environments
- C. Wang, F. Mueller, C. Engelmann, and S. Scott, "Proactive process-level live migration in HPC environments," Proc. of Supercomputing, 2008.
- (2008) Proc. of Supercomputing
- Wang, C.¹ Mueller, F.² Engelmann, C.³ Scott, S.⁴

37
- 55849086811
- Los Alamos National Laboratory
- Los Alamos National Laboratory, Operational Data to Support and Enable Computer Science Research, http://institute.lanl.gov/data/lanldata.shtml.
- Operational Data to Support and Enable Computer Science Research

38
- 50649107313
- Application MTFE vs platform MTBF: A fresh perspective on system reliability and application throughput for computations at scale
- J. Daly, L. Pritchett-Sheats, and S. Michala, "Application MTFE vs Platform MTBF: A Fresh Perspective on System Reliability and Application Throughput for Computations at Scale," Proc. of CCGRID, 2008.
- (2008) Proc. of CCGRID
- Daly, J.¹ Pritchett-Sheats, L.² Michala, S.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.