SCOPUS Information Search Platform

Proceedings of the 16th International Symposium on High Performance Distributed Computing 2007, HPDC'07

2007, Pages 43-54

Using queue structures to improve job reliability

(2) Hacker, Thomas J. (a); Meglicki, Zdzislaw (b)

a PURDUE UNIVERSITY (United States)

b INDIANA UNIVERSITY (United States)

Author keywords

Cluster design and architecture; Reliability

Indexed keywords

COMPUTER ARCHITECTURE; PROGRAM PROCESSORS; RELIABILITY ANALYSIS; SERVERS;

CLUSTER DESIGN; HIGH PERFORMANCE COMPUTING SYSTEMS; SMALL SCALE HPC SYSTEMS; STAND ALONE SERVER SYSTEMS;

DISTRIBUTED COMPUTER SYSTEMS;

EID: 34548092060 PISSN: None EISSN: None Source Type: Conference Proceeding
DOI: 10.1145/1272366.1272373 Document Type: Conference Paper

Times cited : (12)

References (27)

1
- 29344435659
- A strategy for running large scale applications based on a model that optimizes the checkpoint interval for restart dumps
- J. T. Daly, "A strategy for running large scale applications based on a model that optimizes the checkpoint interval for restart dumps," International Workshop on Software Engineering for High Performance Computing System Applications, 2004.
- (2004) International Workshop on Software Engineering for High Performance Computing System Applications
- Daly, J.T.¹

2
- 27844542760
- The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
- S. Sankaran, J. M. Squyres, B. Barrett, V. Sahay, and A. Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," International Journal of High Performance Computing Applications, vol. 19, no. 4, pp. 479-493, 2005.
- (2005) International Journal of High Performance Computing Applications , vol.19 , Issue.4 , pp. 479-493
- Sankaran, S.¹ Squyres, J.M.² Barrett, B.³ Sahay, V.⁴ Lumsdaine, A.⁵

3
- 12444292740
- Computation-at-risk: Assessing job portfolio management risk on clusters
- IEEE Computer Society
- S. D. Kleban and S. H. Clearwater, "Computation-at-risk: Assessing job portfolio management risk on clusters," in IPDPS. IEEE Computer Society, 2004.
- (2004) IPDPS
- Kleban, S.D.¹ Clearwater, S.H.²

4
- 0012910285
- Los Alamos National Laboratory, Tech. Rep. LA-UR-00-4201
- K. J. Ryan and C. S. Reese, "Estimating reliability trends for the world's fastest computer," Los Alamos National Laboratory, Tech. Rep. LA-UR-00-4201, 2000.
- (2000) Estimating reliability trends for the world's fastest computer
- Ryan, K.J.¹ Reese, C.S.²

5
- 0036041277
- Improving cluster availability using workstation validation
- ACM
- T. Heath, R. P. Martin, and T. D. Nguyen, "Improving cluster availability using workstation validation," in SIGMETRICS. ACM, 2002, pp. 217-227.
- (2002) SIGMETRICS , pp. 217-227
- Heath, T.¹ Martin, R.P.² Nguyen, T.D.³

6
- 34548056878
- D. Nurmi, J. Brevik, and R. Wolski, Quantifying machine availability in networked and desktop grid systems, University of California, Santa Barbara, Computer Science, Tech. Rep. ucsb.cs:TR-2003-37, Nov. 2003.
- D. Nurmi, J. Brevik, and R. Wolski, "Quantifying machine availability in networked and desktop grid systems," University of California, Santa Barbara, Computer Science, Tech. Rep. ucsb.cs:TR-2003-37, Nov. 2003.

7
- 4544337911
- Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems
- IEEE Computer Society
- J. Brevik, D. Nurmi, and R. Wolski, "Automatic methods for predicting machine availability in desktop grid and peer-to-peer systems," in CCGRID. IEEE Computer Society, 2004, pp. 190-199.
- (2004) CCGRID , pp. 190-199
- Brevik, J.¹ Nurmi, D.² Wolski, R.³

8
- 27144534020
- Modeling machine availability in enterprise and wide-area distributed computing environments
- Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30 - September 2, 2005, Proceedings, Springer
- D. Nurmi, J. Brevik, and R. Wolski, "Modeling machine availability in enterprise and wide-area distributed computing environments," in Euro-Par 2005, Parallel Processing, 11th International Euro-Par Conference, Lisbon, Portugal, August 30 - September 2, 2005, Proceedings, ser. Lecture Notes in Computer Science, vol. 3648. Springer, 2005, pp. 432-441.
- (2005) ser. Lecture Notes in Computer Science , vol.3648 , pp. 432-441
- Nurmi, D.¹ Brevik, J.² Wolski, R.³

9
- 23944448107
- Performance implications of failures in large-scale cluster scheduling
- JSSPP, Springer
- Y. Zhang, M. S. Squillante, A. Sivasubramaniam, and R. K. Sahoo, "Performance implications of failures in large-scale cluster scheduling," in JSSPP, ser. Lecture Notes in Computer Science, vol. 3277. Springer, 2004, pp. 233-252.
- (2004) ser. Lecture Notes in Computer Science , vol.3277 , pp. 233-252
- Zhang, Y.¹ Squillante, M.S.² Sivasubramaniam, A.³ Sahoo, R.K.⁴

10
- 33845593340
- A large-scale study of failures in high-performance computing systems
- IEEE Computer Society
- B. Schroeder and G. A. Gibson, "A large-scale study of failures in high-performance computing systems," in Proceedings of International Symposium on Dependable Systems and Networks (DSN). IEEE Computer Society, 2006, pp. 249-258.
- (2006) Proceedings of International Symposium on Dependable Systems and Networks (DSN) , pp. 249-258
- Schroeder, B.¹ Gibson, G.A.²

11
- 0003740665
- Boston, MA: McGraw-Hill
- C. Ebeling, An Introduction to Reliability and Maintainability Engineering. Boston, MA: McGraw-Hill, 1997.
- (1997) An Introduction to Reliability and Maintainability Engineering
- Ebeling, C.¹

12
- 0004271645
- New York, NY: John Wiley
- D. L. Grosh, Primer of Reliability Theory. New York, NY: John Wiley, 1989.
- (1989) Primer of Reliability Theory
- Grosh, D.L.¹

13
- 34548108278
- Los Alamos National Laboratory, data on system failures, Online, Available
- Los Alamos National Laboratory. (2006) Raw operational data on system failures. [Online]. Available: http://www.lanl.gov/projects/computerscience/data/
- (2006) Raw operational

14
- 34548074590
- EasyFit Statistical Package, "http://www.mathwave.com/products/easyfit.html."
- EasyFit Statistical Package

15
- 34548088531
- Reliability analysis in HPC clusters
- N. R. Gottumukkala, Y. Liu, C. B. Leangsuksun, R. Nassar, and S. Scott, "Reliability analysis in HPC clusters," Proceedings of the High Availability and Performance Computing Workshop, 2006.
- (2006) Proceedings of the High Availability and Performance Computing Workshop
- Gottumukkala, N.R.¹ Liu, Y.² Leangsuksun, C.B.³ Nassar, R.⁴ Scott, S.⁵

16
- 0033344278
- Failure data analysis of a LAN of Windows NT based computers
- Washington, Brussels, Tokyo: IEEE, Oct
- M. Kalyanakrishnam, Z. Kalbarczyk, and R. Iyer, "Failure data analysis of a LAN of Windows NT based computers," in Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99). Washington - Brussels - Tokyo: IEEE, Oct. 1999, pp. 178-189.
- (1999) Proceedings of the 18th IEEE Symposium on Reliable Distributed Systems (SRDS '99) , pp. 178-189
- Kalyanakrishnam, M.¹ Kalbarczyk, Z.² Iyer, R.³

17
- 85084160707
- Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
- USENIX, Feb. 13-16
- B. Schroeder and G. Gibson, "Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?" in Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). USENIX, Feb. 13-16 2007.
- (2007) Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007)
- Schroeder, B.¹ Gibson, G.²

18
- 84947200665
- Failure trends in a large disk drive population
- USENIX, Feb. 13-16
- E. Pinheiro, W.-D. Weber, and L. A. Barroso, "Failure trends in a large disk drive population," in Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007). USENIX, Feb. 13-16 2007.
- (2007) Proceedings of the 5th USENIX Conference on File and Storage Technologies (FAST 2007)
- Pinheiro, E.¹ Weber, W.-D.² Barroso, L.A.³

19
- 2142685609
- Weibull Models
- Wiley-Interscience
- D. N. P. Murthy, M. Xie, and R. Jiang, Weibull Models. Wiley Series in Probability and Statistics, Wiley-Interscience, 2003.
- (2003) Wiley Series in Probability and Statistics
- Murthy, D.N.P.¹ Xie, M.² Jiang, R.³

20
- 10644279458
- Second Edition. Wiley-Interscience
- M. Rausand and A. Høyland, System Reliability Theory: Models, Statistical Methods and Applications Second Edition. Wiley-Interscience, 2003.
- (2003) System Reliability Theory: Models, Statistical Methods and Applications
- Rausand, M.¹ Høyland, A.²

21
- 84898046897
- Scaling to Thousands of Processors with Buffer Coscheduling
- Pittsburgh, PA, Aug
- F. Petrini, "Scaling to Thousands of Processors with Buffer Coscheduling," in Scaling to New Heights Workshop, Pittsburgh, PA, Aug 2002.
- (2002) Scaling to New Heights Workshop
- Petrini, F.¹

22
- 0345446547
- The workload on parallel supercomputers: Modeling the characteristics of rigid jobs
- Lublin and Feitelson, "The workload on parallel supercomputers: Modeling the characteristics of rigid jobs," JPDC: Journal of Parallel and Distributed Computing, vol. 63, 2003.
- (2003) JPDC: Journal of Parallel and Distributed Computing , vol.63
- Lublin¹ Feitelson²

23
- 0031388399
- Impact of checkpoint latency on overhead ratio of a checkpointing scheme
- N. H. Vaidya, "Impact of checkpoint latency on overhead ratio of a checkpointing scheme," IEEE Trans. Computers, vol. 46, no. 8, pp. 942-947, 1997.
- (1997) IEEE Trans. Computers , vol.46 , Issue.8 , pp. 942-947
- Vaidya, N.H.¹

24
- 33847764225
- University of California, Santa Barbara, Computer Science, Tech. Rep. TR-2004-25, Nov. 6
- D. Nurmi, R. Wolski, and J. Brevik, "Model-based checkpoint scheduling for volatile resource environments," University of California, Santa Barbara, Computer Science, Tech. Rep. TR-2004-25, Nov. 6 2004.
- (2004) Model-based checkpoint scheduling for volatile resource environments
- Nurmi, D.¹ Wolski, R.² Brevik, J.³

25
- 34548105831
- N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizinok, A checkpoint and recovery system for the Pittsburgh supercomputing center terascale computing system, Pittsburgh Supercomputer Center, Tech. Rep. CMU-PSC-TR-2001-0002, 2001.
- N. Stone, J. Kochmar, R. Reddy, J. R. Scott, J. Sommerfield, and C. Vizinok, "A checkpoint and recovery system for the Pittsburgh supercomputing center terascale computing system," Pittsburgh Supercomputer Center, Tech. Rep. CMU-PSC-TR-2001-0002, 2001.

26
- 84978437474
- Pastiche: Making backup cheap and easy
- Proceedings of the 5th ACM Symposium on Operating System Design and Implementation OSDI-02, New York: ACM Press, Dec. 9-11
- L. P. Cox, C. D. Murray, and B. Noble, "Pastiche: Making backup cheap and easy," in Proceedings of the 5th ACM Symposium on Operating System Design and Implementation (OSDI-02), ser. Operating Systems Review. New York: ACM Press, Dec. 9-11 2002, pp. 285-298.
- (2002) ser. Operating Systems Review , pp. 285-298
- Cox, L.P.¹ Murray, C.D.² Noble, B.³

27
- 34548100442
- Investigating lightweight storage and overlay network for fault tolerance
- R. A. Oldfield, "Investigating lightweight storage and overlay network for fault tolerance," Proceedings of the High Availability and Performance Computing Workshop, 2006.
- (2006) Proceedings of the High Availability and Performance Computing Workshop
- Oldfield, R.A.¹

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.