SCOPUS Information Retrieval Platform

International Journal of Parallel, Emergent and Distributed Systems

Volume 29, Issue 4, 2014, Pages 363-378

Cost-oriented proactive fault tolerance approach to high performance computing (HPC) in the cloud

(5) Egwutuoha, Ifeanyi P. a; Chen, Shiping b; Levy, David a; Selic, Bran a; Calvo, Rafael a

a UNIVERSITY OF SYDNEY (Australia)

b CSIRO (Australia)

Author keywords

Cloud computing; computation-intensive; HaaS; HPC; proactive fault tolerance

Indexed keywords

ALGORITHMS; CLOCKS; CLOUD COMPUTING; FAULT TOLERANCE;

COMPUTATION-INTENSIVE; COMPUTING PARADIGM; ELECTRONIC COMPONENT; EXECUTION ENVIRONMENTS; HAAS; HIGH PERFORMANCE COMPUTING (HPC); HPC; PROACTIVE FAULT;

COST REDUCTION;

EID: 84898914916 PISSN: 17445760 EISSN: 17445779 Source Type: Journal
DOI: 10.1080/17445760.2013.803686 Document Type: Article

Times cited : (8)

References (33)

1
- 84898855433
- [Online] Available at
- Amazon. [Online]. Available at. http://aws.amazon.com/ec2/
- Amazon¹

2
- 84898923825
- [Online] Available at
- Baremetalcloud. [Online]. Available at. http://baremetalcloud.com/index.php/en/
- Baremetalcloud

3
- 73549113094
- A taxonomy and survey of cloud computing systems
- Washington, DC, USA IEEE Comp. Society
- B.P. Rimal, E. Choi, and I. Lumb, A taxonomy and survey of cloud computing systems, in NCM '09: Proceedings of the 2009 Fifth International Joint Conference on INC, IMS and IDC. Washington, DC, USA, IEEE Comp. Society, 2009, pp. 44-51.
- NCM '09: Proceedings of the 2009 Fifth International Joint Conference on INC, IMS and IDC , vol.2009 , pp. 44-51
- Rimal, B.P.¹ Choi, E.² Lumb, I.³

4
- 84898916874
- Nicholas Carr. [Online] Available at
- Nicholas Carr. [Online]. Available at. http://www.roughtype.com/p279

5
- 84898837506
- Available at
- CFDR. Available at. http://cfdr.usenix.org (2012).
- (2012)

6
- 78149470110
- A large-scale study of failures in high performance computing systems
- B. Schroeder and G.A. Gibson, A Large-Scale Study of Failures in High Performance Computing Systems, Dependable and Secure Computing, IEEE Transactions 7(4) (2010), pp. 337-351.
- (2010) IEEE Transactions on Dependable and Secure Computing , vol.7 , Issue.4 , pp. 337-351
- Schroeder, B.¹ Gibson, G.A.²

7
- 12444258147
- Development of naturally fault tolerant algorithms for computing on 100,000 processors
- Available at
- Al Geist and Christian Engelmann, Development of naturally fault tolerant algorithms for computing on 100,000 processors, J. Parallel Distributed Comput. (2002). Available at www.csm.ornl.gov/~geist
- (2002) J Parallel Distributed Comput
- Geist, A.¹ Engelmann, C.²

8
- 77956584397
- See applications run and throughput jump: The case for redundant computing in HPC
- Washington, DC, USA IEEE Computer Society
- R. Riesen, K. Ferreira, and J. Stearley, See Applications Run and Throughput Jump: The Case for Redundant Computing in HPC, in Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), DSNW '10, Washington, DC, USA, IEEE Computer Society, 2010, pp. 29-34.
- (2010) Proceedings of the 2010 International Conference on Dependable Systems and Networks Workshops (DSN-W), DSNW '10 , pp. 29-34
- Riesen, R.¹ Ferreira, K.² Stearley, J.³

9
- 67349137506
- Cost-oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability
- Huajun Hu, Suchang Guo, and Bo Yang, Cost-oriented task allocation and hardware redundancy policies in heterogeneous distributed computing systems considering software reliability, Comput. Ind. Eng. 56(4) (2009), pp. 1687-1696.
- (2009) Comput. Ind. Eng. , vol.56 , Issue.4 , pp. 1687-1696
- Hu, H.¹ Guo, S.² Yang, B.³

10
- 0037410519
- Optimal task allocation and hardware redundancy policies in distributed computing systems
- C.-C. Hsieh, Optimal task allocation and hardware redundancy policies in distributed computing systems, Eur. J. Operational Res. 147(2) (2003), pp. 430-447.
- (2003) Eur. J. Operational Res. , vol.147 , Issue.2 , pp. 430-447
- Hsieh, C.-C.¹

11
- 68249127079
- Fault Tolerance in Petascale/exascale systems: Current knowledge, challenges and research opportunities
- F. Cappello, Fault Tolerance in Petascale/exascale systems: Current knowledge, challenges and research opportunities, Int. J. High Perform. Comput. Appl. 23(3) (2009), pp. 212-226.
- (2009) Int. J. High Perform. Comput. Appl. , vol.23 , Issue.3 , pp. 212-226
- Cappello, F.¹

12
- 84898850445
- The MPI forum
- Available at
- The MPI Forum, The MPI message-passing interface standard, 1995. Available at: http://www.mcs.anl.gov/mpi/standard.html
- (1995) The MPI Message-passing Interface Standard

13
- 78649985381
- Cloud computing for parallel scientific HPC applications: Feasibility of running coupled atmosphere-ocean climate models on amazon's EC2
- (CCA-08), October 2008, Chicago, IL ACM
- C. Evangelinos and C.N. Hill, Cloud computing for parallel scientific HPC Applications: Feasibility of running coupled Atmosphere-Ocean climate models on Amazon's EC2, in Cloud Computing and Its Applications 2008 (CCA-08), October 2008, Chicago, IL, ACM.
- (2008) Cloud Computing and Its Applications
- Evangelinos, C.¹ Hill, C.N.²

14
- 84863649355
- A fault tolerance framework for high performance computing in cloud
- Ottawa, Canada, IEEE
- I.P. Egwutuoha, S. Chen, D. Levy, and B. Selic, A fault tolerance framework for high performance computing in cloud, in Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium, Ottawa, Canada, IEEE, 2012, pp. 709-710.
- (2012) 2012 12th IEEE/ACM International Symposium , pp. 709-710
- Egwutuoha, I.P.¹ Chen, S.² Levy, D.³ Selic, B.⁴

15
- 37249022917
- System-level dynamic thermal management for high-performance microprocessors
- A. Kumar, L. Shang, L. Peh, and N. Jha, System-level dynamic thermal management for high-performance microprocessors, IEEE Trans. Computer-Aided Design Integr. Circuits Syst. 27(1) (2008), pp. 96-108.
- (2008) IEEE Trans. Computer-Aided Design Integr. Circuits Syst. , vol.27 , Issue.1 , pp. 96-108
- Kumar, A.¹ Shang, L.² Peh, L.³ Jha, N.⁴

16
- 36049013419
- What supercomputers say: A study of five system logs
- Washington, DC, USA
- J. Stearley and A. Oliner, What Supercomputers Say: A Study of Five System Logs, in DSN 07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, Washington, DC, USA, 2007, pp. 575-584.
- (2007) DSN 07: Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks , pp. 575-584
- Stearley, J.¹ Oliner, A.²

17
- 34548046749
- Proactive fault tolerance for HPC with xen virtualization
- Seattle, Washington
- Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, and Stephen L. Scott, Proactive Fault Tolerance for HPC with Xen Virtualization, in Proceedings of the 21st Annual International Conference on Supercomputing, Seattle, Washington, 2007, pp. 23-32.
- (2007) Proceedings of the 21st Annual International Conference on Supercomputing , pp. 23-32
- Babu Nagarajan, A.¹ Mueller, F.² Engelmann, C.³ Scott, S.L.⁴

18
- 84898908341
- [Online]. Available at
- Lm-sensors. [Online]. Available at. http://lm-sensors.org/wiki/Documentation

19
- 70350746349
- The cost of doing science on the cloud: The montage example
- Piscataway, NJ, USA
- E. Deelman, G. Singh, M. Livny, B. Berriman, and J. Good, The Cost of Doing Science on the Cloud: The Montage Example, in SC 08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing, Piscataway, NJ, USA, 2008, pp. 1-12.
- (2008) SC 08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing , pp. 1-12
- Deelman, E.¹ Singh, G.² Livny, M.³ Berriman, B.⁴ Good, J.⁵

20
- 77950347409
- A view of cloud computing
- M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, A view of cloud computing. Commun. ACM. 53(4) (2010), pp. 50-58.
- (2010) Commun. ACM. , vol.53 , Issue.4 , pp. 50-58
- Armbrust, M.¹ Fox, A.² Griffith, R.³ Joseph, A.⁴ Katz, R.⁵ Konwinski, A.⁶ Lee, G.⁷ Patterson, D.⁸ Rabkin, A.⁹ Stoica, I.¹⁰ Zaharia, M.¹¹

21
- 84898918714
- Available at
- Open-iscsi, 2013. Available at: http://www.open-iscsi.org
- (2013)

22
- 84898845581
- Xen Hypervisor. [Online]. Available at
- Xen, Xen hypervisor. [Online]. Available at. http://www.xen.org/products/xenhyp.html

23
- 84898888450
- HPL. [Online]. Available at
- A. Petitet, R.C. Whaley, J. Dongarra, and A. Cleary, HPL. [Online]. Available at. http://www.netlib.org/benchmark/hpl/ (2008).
- (2008)
- Petitet, A.¹ Whaley, R.C.² Dongarra, J.³ Cleary, A.⁴

24
- 0042078549
- A survey of rollback-recovery protocols in message-passing systems
- E.N.M. Elnozahy, L. Alvisi, Y.M. Wang, and D.B. Johnson, A survey of rollback-recovery protocols in message-passing systems, ACM Comput. Surv. (CSUR) 34(3) (2002), pp. 375-408.
- (2002) ACM Comput. Surv. (CSUR) , vol.34 , Issue.3 , pp. 375-408
- Elnozahy, E.N.M.¹ Alvisi, L.² Wang, Y.M.³ Johnson, D.B.⁴

25
- 84898921532
- [Online]. Available at
- Checkpointing.org, [Online]. Available at. http://checkpointing.org/

26
- 84897731347
- A brief review of cloud computing, challenges and potential solutions
- I.P. Egwutuoha, D. Schragl, and R. Calvo, A Brief Review of Cloud Computing, Challenges and Potential Solutions, J. Parallel Cloud Comput. 2(1) (2013).
- (2013) J Parallel Cloud Comput. , vol.2 , Issue.1
- Egwutuoha, I.P.¹ Schragl, D.² Calvo, R.³

27
- 0029633168
- GROMACS: A message-passing parallel molecular dynamics implementation
- H.J. Berendsen, D. van der Spoel, and R. van Drunen, GROMACS: A message-passing parallel molecular dynamics implementation, Comput. Phys. Commun. 91(1) (1995), pp. 43-56.
- (1995) Comput. Phys. Commun. , vol.91 , Issue.1 , pp. 43-56
- Berendsen, H.J.¹ Van Der Spoel, D.² Van Drunen, R.³

28
- 84874623366
- A proactive fault tolerance approach to high performance computing (HPC) in the cloud
- Xiangtan, Hunan, China IEEE
- I.P. Egwutuoha, S. Chen, D. Levy, B. Selic, and R. Calvo, A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud, in 2012 Second International Conference on Cloud and Green Computing (CGC), Xiangtan, Hunan, China, IEEE, 2012, pp. 268-273.
- (2012) 2012 Second International Conference on Cloud and Green Computing (CGC) , pp. 268-273
- Egwutuoha, I.P.¹ Chen, S.² Levy, D.³ Selic, B.⁴ Calvo, R.⁵

29
- 62949146437
- Toward a unified ontology of cloud computing
- Nov
- L. Youseff, M. Butrico, and D.D. Silva, Toward a unified ontology of cloud computing, in Proc. of the Grid Computing Environments Workshop (GCE08), Nov 2008, pp. 1-10.
- (2008) Proc. of the Grid Computing Environments Workshop (GCE08)
- Youseff, L.¹ Butrico, M.² Silva, D.D.³

30
- 84881374819
- A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
- 10.1007/s11227-013-0884-0
- I.P. Egwutuoha, D. Levy, B. Selic, and S. Chen, A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems, J. Supercomput. (2013). 10.1007/s11227-013-0884-0
- (2013) J. Supercomput.
- Egwutuoha, I.P.¹ Levy, D.² Selic, B.³ Chen, S.⁴

31
- 28044460018
- A higher order estimate of the optimum checkpoint interval for restart dumps
- J.T. Daly, A higher order estimate of the optimum checkpoint interval for restart dumps, Future Generation Comput. Syst. 22 (2006), pp. 303-312.
- (2006) Future Generation Comput. Syst. , vol.22 , pp. 303-312
- Daly, J.T.¹

32
- 85059766484
- Live migration of virtual machines
- USENIX Association
- C. Clark, K. Fraser, S. Hand et al. Live migration of virtual machines, in Proceedings of the 2nd conference on Symposium on Networked Systems Design & Implementation-Volume 2, USENIX Association 2005, pp. 273-286.
- (2005) Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation , vol.2 , pp. 273-286
- Clark, C.¹ Fraser, K.² Hand, S.³

33
- 0028485392
- Low-latency, concurrent checkpointing for parallel programs
- K. Li, J.F. Naughton, and J.S. Plank, Low-latency, concurrent checkpointing for parallel programs, IEEE Transactions on Parallel and Distributed Systems 5(8) (1994), pp. 874-879.
- (1994) IEEE Transactions on Parallel and Distributed Systems , vol.5 , Issue.8 , pp. 874-879
- Li, K.¹ Naughton, J.F.² Plank, J.S.³

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.