SCOPUS Information Retrieval Platform

IEEE Transactions on Services Computing

Volume 10, Issue 6, 2017, Pages 969-983

Reliable computing service in massive-scale systems through rapid low-cost failover

(8) Yang, Renyu a; Zhang, Yang b; Garraghan, Peter c; Feng, Yihui b; Ouyang, Jin b; Xu, Jie d; Zhang, Zhuo b; Li, Chao b

a BEIHANG UNIVERSITY (China)

b ALIBABA GROUP (China)

c LANCASTER UNIVERSITY (United Kingdom)

d UNIVERSITY OF LEEDS (United Kingdom)

Author keywords

Cloud computing; Failover; Reliability; Resource management; Services

Indexed keywords

CLOUD COMPUTING; COST EFFECTIVENESS; LARGE SCALE SYSTEMS; NATURAL RESOURCES MANAGEMENT; RELIABILITY; RESOURCE ALLOCATION;

BUSINESS REQUIREMENT; FAILOVER; LARGE-SCALE DISTRIBUTED SYSTEM; RELIABLE COMPUTING; RESOURCE MANAGEMENT; RESOURCE MANAGEMENT SYSTEMS; SERVICES; SOFTWARE AND HARDWARES;

COSTS;

EID: 85027256100 PISSN: 19391374 EISSN: None Source Type: Journal
DOI: 10.1109/TSC.2016.2544313 Document Type: Article

Times cited : (18)

References (48)

1
- 79956268427
- Intercloud: Utility-oriented federation of cloud computing environments for scaling of application services
- R. Buyya, R. Ranjan, and R. N. Calheiros, "Intercloud: Utility-oriented federation of cloud computing environments for scaling of application services," in Proc. 10th Int. Conf. Algorithms Archit. Parallel Process., 2010, pp. 13-31.
- (2010) Proc. 10th Int. Conf. Algorithms Archit. Parallel Process. , pp. 13-31
- Buyya, R.¹ Ranjan, R.² Calheiros, R.N.³

2
- 84975146878
- Cisco Systems Inc Cisco, White paper
- Cisco Systems Inc, "Cisco Global Cloud Index: Forecast and Methodology, 2014-2019," Cisco, White paper, 2015.
- (2015) Cisco Global Cloud Index: Forecast and Methodology, 2014-2019

3
- 84893305113
- Mesos: A platform for fine-grained resource sharing in the data center
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. H. Katz, S. Shenker, and I. Stoica, "Mesos: A platform for fine-grained resource sharing in the data center," in Proc. 8th USENIX Conf. Netw. Syst. Des. Implementation, 2011, pp. 295-308.
- (2011) Proc. 8th USENIX Conf. Netw. Syst. Des. Implementation , pp. 295-308
- Hindman, B.¹ Konwinski, A.² Zaharia, M.³ Ghodsi, A.⁴ Joseph, A.D.⁵ Katz, R.H.⁶ Shenker, S.⁷ Stoica, I.⁸

4
- 84893249524
- Apache hadoop yarn: Yet another resource negotiator
- Art. no. 5
- V. K. Vavilapalli, A. C. Murthy, C. Douglas, S. Agarwal, M. Konar, R. Evans, et al., "Apache hadoop yarn: Yet another resource negotiator," in Proc. 4th Annu. Symp. Cloud Comput., 2013, Art. no. 5.
- (2013) Proc. 4th Annu. Symp. Cloud Comput.
- Vavilapalli, V.K.¹ Murthy, A.C.² Douglas, C.³ Agarwal, S.⁴ Konar, M.⁵ Evans, R.⁶

5
- 84905826576
- Fuxi: A fault-tolerant resource management and job scheduling system at internet scale
- Z. Zhang, C. Li, Y. Tao, R. Yang, H. Tang, and J. Xu, "Fuxi: A fault-tolerant resource management and job scheduling system at internet scale," in Proc. Int. Conf. Very Large Databases, 2014, pp. 1393-1404.
- (2014) Proc. Int. Conf. Very Large Databases , pp. 1393-1404
- Zhang, Z.¹ Li, C.² Tao, Y.³ Yang, R.⁴ Tang, H.⁵ Xu, J.⁶

6
- 77950489272
- San Rafael, CA, USA: Morgan & Claypool Publishers
- L. A. Barroso, J. Clidaras, and U. Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. San Rafael, CA, USA: Morgan & Claypool Publishers, 2013.
- (2013) The Datacenter As A Computer: An Introduction to the Design of Warehouse-Scale Machines
- Barroso, L.A.¹ Clidaras, J.² Hölzle, U.³

7
- 79951839892
- Fault tolerance and scaling in e-science cloud applications: Observations from the continuing development of modisazure
- J. Li, M. Humphrey, Y.-W. Cheah, Y. Ryu, D. Agarwal, K. Jackson, and C. van Ingen, "Fault tolerance and scaling in e-science cloud applications: Observations from the continuing development of modisazure," in Proc. IEEE 6th Int. Conf. e-Sci., 2010, pp. 246-253.
- (2010) Proc. IEEE 6th Int. Conf. E-Sci. , pp. 246-253
- Li, J.¹ Humphrey, M.² Cheah, Y.-W.³ Ryu, Y.⁴ Agarwal, D.⁵ Jackson, K.⁶ Van Ingen, C.⁷

8
- 84978952293
- Computing at massive scale: Scalability and dependability challenges
- Oxford, U.K.
- R. Yang and J. Xu, "Computing at massive scale: Scalability and dependability challenges," presented at the IEEE 10th Int. Symp. Service Oriented System Engineering, Oxford, U.K., 2016.
- (2016) IEEE 10th Int. Symp. Service Oriented System Engineering
- Yang, R.¹ Xu, J.²

9
- 84979068789
- (2008). Amazon suffers U.S. outage on Friday internet [Online]. Available: http://news.cnet.com/
- (2008) Amazon Suffers U.S. Outage on Friday Internet

10
- 0001314414
- The evolution of the recovery block concept
- New York, NY, USA: Wiley
- B. Randell and J. Xu, "The evolution of the recovery block concept," in Softw. Fault Tolerance, New York, NY, USA: Wiley, 1995.
- (1995) Softw. Fault Tolerance
- Randell, B.¹ Xu, J.²

11
- 0003943408
- Englewood Cliffs, NJ, USA: Prentice-Hall
- P. Jalote, Fault Tolerance in Distributed Systems. Englewood Cliffs, NJ, USA: Prentice-Hall, 1994.
- (1994) Fault Tolerance in Distributed Systems
- Jalote, P.¹

12
- 1542747971
- Fast transparent failover for reliable web service
- N. Aghdaie and Y. Tamir, "Fast transparent failover for reliable web service," in Proc. Int. Conf. Parallel Distrib. Comput. Syst., 2003, pp. 757-762.
- (2003) Proc. Int. Conf. Parallel Distrib. Comput. Syst. , pp. 757-762
- Aghdaie, N.¹ Tamir, Y.²

13
- 85076887355
- Apollo: Scalable and coordinated scheduling for cloud-scale computing
- E. Boutin, J. Ekanayake, W. Lin, B. Shi, J. Zhou, Z. Qian, M. Wu, and L. Zhou, "Apollo: Scalable and coordinated scheduling for cloud-scale computing," in Proc. 11th USENIX Conf. Operating Syst. Des. Implementation, 2014, pp. 285-300.
- (2014) Proc. 11th USENIX Conf. Operating Syst. Des. Implementation , pp. 285-300
- Boutin, E.¹ Ekanayake, J.² Lin, W.³ Shi, B.⁴ Zhou, J.⁵ Qian, Z.⁶ Wu, M.⁷ Zhou, L.⁸

14
- 85044264755
- (2013). [Online]. Available: https://issues.apache.org/jira/browse/YARN-556
- (2013)

15
- 85044300760
- (2013). [Online]. Available: https://issues.apache.org/jira/browse/YARN-1336
- (2013)

16
- 12344308304
- Basic concepts and taxonomy of dependable and secure computing
- Jan.-Mar.
- A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Trans. Dependable Secure Comput., vol. 1, no. 1, pp. 11-33, Jan.-Mar. 2004.
- (2004) IEEE Trans. Dependable Secure Comput. , vol.1 , Issue.1 , pp. 11-33
- Avizienis, A.¹ Laprie, J.-C.² Randell, B.³ Landwehr, C.⁴

17
- 84898609036
- An empirical failure-analysis of a large-scale cloud computing environment
- P. Garraghan, P. Townend, and J. Xu, "An empirical failure-analysis of a large-scale cloud computing environment," in Proc. IEEE 15th Int. Symp. High-Assurance Syst. Eng., 2014, pp. 113-120.
- (2014) Proc. IEEE 15th Int. Symp. High-Assurance Syst. Eng. , pp. 113-120
- Garraghan, P.¹ Townend, P.² Xu, J.³

18
- 84870524514
- Heterogeneity and dynamicity of clouds at scale: Google trace analysis
- Art. no. 7
- C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch, "Heterogeneity and dynamicity of clouds at scale: Google trace analysis," in Proc. 3rd ACM Symp. Cloud Comput., 2012, Art. no. 7.
- (2012) Proc. 3rd ACM Symp. Cloud Comput.
- Reiss, C.¹ Tumanov, A.² Ganger, G.R.³ Katz, R.H.⁴ Kozuch, M.A.⁵

19
- 78649459815
- Cdrm: A cost-effective dynamic replication management scheme for cloud storage cluster
- Q. Wei, B. Veeravalli, B. Gong, L. Zeng, and D. Feng, "Cdrm: A cost-effective dynamic replication management scheme for cloud storage cluster," in Proc. IEEE Int. Conf. Cluster Comput., 2010, pp. 188-196.
- (2010) Proc. IEEE Int. Conf. Cluster Comput. , pp. 188-196
- Wei, Q.¹ Veeravalli, B.² Gong, B.³ Zeng, L.⁴ Feng, D.⁵

20
- 80053400298
- Adaptive fault tolerance in real time cloud computing
- S. Malik and F. Huet, "Adaptive fault tolerance in real time cloud computing," in Proc. IEEE World Congr. Serv., 2011, pp. 280-287.
- (2011) Proc. IEEE World Congr. Serv. , pp. 280-287
- Malik, S.¹ Huet, F.²

21
- 84962844401
- D2ps: A dependable data provisioning service in multi-tenants cloud environments
- R. Yang, T. Wo, C. Hu, J. Xu, and M. Zhang, "D2ps: A dependable data provisioning service in multi-tenants cloud environments," in Proc. IEEE 17th Int. Symp. High Assurance Syst. Eng., 2016, pp. 252-259.
- (2016) Proc. IEEE 17th Int. Symp. High Assurance Syst. Eng. , pp. 252-259
- Yang, R.¹ Wo, T.² Hu, C.³ Xu, J.⁴ Zhang, M.⁵

22
- 77954077554
- Improving MapReduce fault tolerance in the cloud
- Q. Zheng, "Improving MapReduce fault tolerance in the cloud," in Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops Phd Forum, 2010, pp. 1-6.
- (2010) Proc. IEEE Int. Symp. Parallel Distrib. Process. Workshops Phd Forum , pp. 1-6
- Zheng, Q.¹

23
- 76849100508
- Failure-aware resource management for high-availability computing clusters with distributed virtual machines
- S. Fu, "Failure-aware resource management for high-availability computing clusters with distributed virtual machines," J. Parallel Distrib. Comput., vol. 70, no. 4, pp. 384-393, 2010.
- (2010) J. Parallel Distrib. Comput. , vol.70 , Issue.4 , pp. 384-393
- Fu, S.¹

24
- 85044273960
- (2013). Amazon web services suffers outage [Online]. Available: http://www.zdnet.com/article/amazon-web-services-suffers-outage-takes-down-vine-instagram-others-with-it/
- (2013) Amazon Web Services Suffers Outage

25
- 0003217728
- The methodology of n-version programming
- Hoboken, NJ, USA: Wiley
- A. Avizienis, "The methodology of n-version programming," in Software Fault Tolerance, Hoboken, NJ, USA: Wiley, 1995.
- (1995) Software Fault Tolerance
- Avizienis, A.¹

26
- 0003533985
- New York, NY, USA: McGraw-Hill
- M. R. Lyu et al., Handbook of Software Reliability Engineering. New York, NY, USA: McGraw-Hill, 1996.
- (1996) Handbook of Software Reliability Engineering
- Lyu, M.R.¹

27
- 80051928903
- A scalable availability model for infrastructure-as-a-service cloud
- F. Longo, R. Ghosh, V. K. Naik, and K. S. Trivedi, "A scalable availability model for infrastructure-as-a-service cloud," in Proc. IEEE/IFIP 41st Int. Conf. Dependable Syst. Netw., 2011, pp. 335-346.
- (2011) Proc. IEEE/IFIP 41st Int. Conf. Dependable Syst. Netw. , pp. 335-346
- Longo, F.¹ Ghosh, R.² Naik, V.K.³ Trivedi, K.S.⁴

28
- 0029212717
- Reliability analysis of a complex standby redundant systems
- R. Subramanian and V. Anantharaman, "Reliability analysis of a complex standby redundant systems," Rel. Eng. Syst. Safety, vol. 48, no. 1, pp. 57-70, 1995.
- (1995) Rel. Eng. Syst. Safety , vol.48 , Issue.1 , pp. 57-70
- Subramanian, R.¹ Anantharaman, V.²

29
- 79951761350
- Zookeeper: Wait-free coordination for internet-scale systems
- P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, "Zookeeper: Wait-free coordination for internet-scale systems," in Proc. USENIX Conf. USENIX Annu. Tech. Conf., 2010, p. 11.
- (2010) Proc. USENIX Conf. USENIX Annu. Tech. Conf. , p. 11
- Hunt, P.¹ Konar, M.² Junqueira, F.P.³ Reed, B.⁴

30
- 85065181066
- The chubby lock service for loosely-coupled distributed systems
- M. Burrows, "The chubby lock service for loosely-coupled distributed systems," in Proc. 7th Symp. Operating Syst. Des. Implementation, 2006, pp. 335-350.
- (2006) Proc. 7th Symp. Operating Syst. Des. Implementation , pp. 335-350
- Burrows, M.¹

31
- 84930247783
- An analysis of failure-related energy waste in a large-scale cloud environment
- Jun.
- P. Garraghan, I. S. Moreno, P. Townend, and J. Xu, "An analysis of failure-related energy waste in a large-scale cloud environment," IEEE Trans. Emerging Topics Comput., vol. 2, no. 2, pp. 166-180, Jun. 2014.
- (2014) IEEE Trans. Emerging Topics Comput. , vol.2 , Issue.2 , pp. 166-180
- Garraghan, P.¹ Moreno, I.S.² Townend, P.³ Xu, J.⁴

32
- 0023090161
- Checkpointing and rollback-recovery for distributed systems
- Jan.
- R. Koo and S. Toueg, "Checkpointing and rollback-recovery for distributed systems," IEEE Trans. Softw. Eng., vol. SE-13, no. 1, pp. 23-31, Jan. 1987.
- (1987) IEEE Trans. Softw. Eng. , vol.SE-13 , Issue.1 , pp. 23-31
- Koo, R.¹ Toueg, S.²

33
- 34548207637
- Torque resource manager
- Art. no. 8
- G. Staples, "Torque resource manager," in Proc. ACM/IEEE Conf. Supercomput., 2006, Art. no. 8.
- (2006) Proc. ACM/IEEE Conf. Supercomput.
- Staples, G.¹

34
- 84946125131
- Service-oriented computing: Concepts, characteristics and directions
- M. P. Papazoglou, "Service-oriented computing: Concepts, characteristics and directions," in Proc. 4th Int. Conf. Web Inform. Syst. Eng., 2003, pp. 3-12.
- (2003) Proc. 4th Int. Conf. Web Inform. Syst. Eng. , pp. 3-12
- Papazoglou, M.P.¹

35
- 0036601844
- Grid services for distributed system integration
- Jun.
- I. Foster, C. Kesselman, J. M. Nick, and S. Tuecke, "Grid services for distributed system integration," IEEE Comput., vol. 35, no. 6, pp. 37-46, Jun. 2002.
- (2002) IEEE Comput. , vol.35 , Issue.6 , pp. 37-46
- Foster, I.¹ Kesselman, C.² Nick, J.M.³ Tuecke, S.⁴

36
- 84995887987
- Hotrestore: A fast restore system for virtual machine cluster
- L. Cui, J. Li, T. Wo, B. Li, R. Yang, Y. Cao, and J. Huai, "Hotrestore: A fast restore system for virtual machine cluster," in Proc. 28th USENIX Conf. Large Installation Syst. Admin., 2014, pp. 1-16.
- (2014) Proc. 28th USENIX Conf. Large Installation Syst. Admin. , pp. 1-16
- Cui, L.¹ Li, J.² Wo, T.³ Li, B.⁴ Yang, R.⁵ Cao, Y.⁶ Huai, J.⁷

37
- 84929574917
- Large-scale cluster management at google with borg
- Art. no. 18
- A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes, "Large-scale cluster management at google with borg," in Proc. 10th Eur. Conf. Comput. Syst., 2015, Art. no. 18.
- (2015) Proc. 10th Eur. Conf. Comput. Syst.
- Verma, A.¹ Pedrosa, L.² Korupolu, M.³ Oppenheimer, D.⁴ Tune, E.⁵ Wilkes, J.⁶

38
- 78650831692
- Design, modeling, and evaluation of a scalable multi-level checkpointing system
- A. Moody, G. Bronevetsky, K. Mohror, and B. R. De Supinski, "Design, modeling, and evaluation of a scalable multi-level checkpointing system," in Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal., 2010, pp. 1-11.
- (2010) Proc. Int. Conf. High Perform. Comput. Netw. Storage Anal. , pp. 1-11
- Moody, A.¹ Bronevetsky, G.² Mohror, K.³ De Supinski, B.R.⁴

39
- 84903598167
- VMCSnap: Taking snapshots of virtual machine cluster with memory deduplication
- Y. Huang, R. Yang, L. Cui, T. Wo, C. Hu, and B. Li, "VMCSnap: Taking snapshots of virtual machine cluster with memory deduplication," in Proc. IEEE 8th Int. Symp. Serv. Oriented Syst. Eng., 2014, pp. 314-319.
- (2014) Proc. IEEE 8th Int. Symp. Serv. Oriented Syst. Eng. , pp. 314-319
- Huang, Y.¹ Yang, R.² Cui, L.³ Wo, T.⁴ Hu, C.⁵ Li, B.⁶

40
- 84988273398
- Consnap: Taking continuous snapshots for running state protection of virtual machines
- J. Li, J. Zheng, L. Cui, and R. Yang, "Consnap: Taking continuous snapshots for running state protection of virtual machines," in Proc. IEEE 20th Int. Conf. Parallel Distrib. Syst., 2014, pp. 677-684.
- (2014) Proc. IEEE 20th Int. Conf. Parallel Distrib. Syst. , pp. 677-684
- Li, J.¹ Zheng, J.² Cui, L.³ Yang, R.⁴

41
- 84870488163
- EECS Dept. Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2012-17
- Y. Chen, S. Alspaugh, and R. H. Katz, "Design insights for Map-Reduce from diverse production workloads," EECS Dept. Univ. California, Berkeley, CA, USA, Tech. Rep. UCB/EECS-2012-17, 2012.
- (2012) Design Insights for Map-Reduce from Diverse Production Workloads
- Chen, Y.¹ Alspaugh, S.² Katz, R.H.³

42
- 77954901315
- An analysis of traces from a production MapReduce cluster
- S. Kavulya, J. Tan, R. Gandhi, and P. Narasimhan, "An analysis of traces from a production MapReduce cluster," in Proc. IEEE/ACM 10th Int. Conf. Cluster, Cloud Grid Comput., 2010, pp. 94-103.
- (2010) Proc. IEEE/ACM 10th Int. Conf. Cluster, Cloud Grid Comput. , pp. 94-103
- Kavulya, S.¹ Tan, J.² Gandhi, R.³ Narasimhan, P.⁴

43
- 85031898917
- Towards characterizing cloud backend workloads: Insights from google compute clusters
- A. K. Mishra, J. L. Hellerstein, W. Cirne, and C. R. Das, "Towards characterizing cloud backend workloads: Insights from google compute clusters," ACM SIGMETRICS Perform. Eval. Rev., vol. 37, no. 4, pp. 34-41, 2010.
- (2010) ACM SIGMETRICS Perform. Eval. Rev. , vol.37 , Issue.4 , pp. 34-41
- Mishra, A.K.¹ Hellerstein, J.L.² Cirne, W.³ Das, C.R.⁴

44
- 84965042403
- Analysis, modeling and simulation of workload patterns in a large-scale utility cloud
- Apr.-Jun.
- I. Solis Moreno, P. Garraghan, P. Townend, and J. Xu, "Analysis, modeling and simulation of workload patterns in a large-scale utility cloud," IEEE Trans. Cloud Comput., vol. 2, no. 2, pp. 208-221, Apr.-Jun. 2014.
- (2014) IEEE Trans. Cloud Comput. , vol.2 , Issue.2 , pp. 208-221
- Solis Moreno, I.¹ Garraghan, P.² Townend, P.³ Xu, J.⁴

45
- 84881145178
- An analysis of the server characteristics and resource utilization in Google cloud
- P. Garraghan, P. Townend, and J. Xu, "An analysis of the server characteristics and resource utilization in Google cloud," in Proc. IEEE Int. Conf. Cloud Eng., 2013, pp. 124-131.
- (2013) Proc. IEEE Int. Conf. Cloud Eng. , pp. 124-131
- Garraghan, P.¹ Townend, P.² Xu, J.³

46
- 84889640333
- Sparrow: Distributed, low latency scheduling
- K. Ousterhout, P. Wendell, M. Zaharia, and I. Stoica, "Sparrow: Distributed, low latency scheduling," in Proc. 24th ACM Symp. Operating Syst. Principles, 2013, pp. 69-84.
- (2013) Proc. 24th ACM Symp. Operating Syst. Principles , pp. 69-84
- Ousterhout, K.¹ Wendell, P.² Zaharia, M.³ Stoica, I.⁴

47
- 84873622276
- The tail at scale
- J. Dean and L. A. Barroso, "The tail at scale," Commun. ACM, vol. 56, no. 2, pp. 74-80, 2013.
- (2013) Commun. ACM , vol.56 , Issue.2 , pp. 74-80
- Dean, J.¹ Barroso, L.A.²

48
- 84962886155
- Timely long tail identification through agent based monitoring and analytics
- P. Garraghan, X. Ouyang, P. Townend, and J. Xu, "Timely long tail identification through agent based monitoring and analytics," in Proc. IEEE 18th Int. Symp. Real-Time Distrib. Comput., 2015, pp. 19-26.
- (2015) Proc. IEEE 18th Int. Symp. Real-Time Distrib. Comput. , pp. 19-26
- Garraghan, P.¹ Ouyang, X.² Townend, P.³ Xu, J.⁴

* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.