Volume , Issue , 2012, Pages 79-89

Data-driven fault tolerance for work stealing computations

Author keywords

Fault tolerance; Load balancing; Work stealing

Indexed keywords

ADDRESS SPACE; DATA OPERATIONS; DATA STORE; DISTRIBUTED MEMORY; ENERGY CONSTRAINT; EXECUTION ENVIRONMENTS; FAULT TOLERANCE MECHANISMS; RE-EXECUTION; RECOVERY SCHEME; SPACE AND TIME; SYSTEM NOISE; TASK PARALLEL; WORK STEALING;

EID: 84864069036     PISSN: None     EISSN: None     Source Type: Conference Proceeding    
DOI: 10.1145/2304576.2304589     Document Type: Conference Paper
Times cited : (20)

References (30)
  • 5
    • 50649108554
    • Proactive fault tolerance in MPI applications via task migration
    • S. Chakravorty, C. Mendes, and L. Kalé. Proactive Fault Tolerance in MPI Applications via Task Migration. In High Performance Computing - HiPC 2006, volume 4297, pages 485-496. 2006.
    • (2006) High Performance Computing - HiPC 2006 , vol.4297 , pp. 485-496
    • Chakravorty, S.1    Mendes, C.2    Kalé, L.3
  • 8
    • 37549003336
    • MapReduce: Simplified data processing on large clusters
    • J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, Jan. 2008.
    • (2008) Commun. ACM , vol.51 , Issue.1 , pp. 107-113
    • Dean, J.1    Ghemawat, S.2
  • 11
    • 9144223280
    • Checkpointing for peta-scale systems: A look into the future of practical rollback-recovery
    • E. Elnozahy and J. Plank. Checkpointing for peta-scale systems: a look into the future of practical rollback-recovery. Dependable and Secure Computing, IEEE Transactions on, 1(2):97-108, April-June 2004.
    • (2004) IEEE Transactions on Dependable and Secure Computing , vol.1 , Issue.2 , pp. 97-108
    • Elnozahy, E.1    Plank, J.2
  • 12
    • 0042078549
    • A survey of rollback-recovery protocols in message-passing systems
    • E. N. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375-408, 2002.
    • (2002) ACM Comput. Surv. , vol.34 , Issue.3 , pp. 375-408
    • Elnozahy, E.N.1    Alvisi, L.2    Wang, Y.-M.3    Johnson, D.B.4
  • 17
    • 33845434226
    • Transparent, incremental checkpointing at kernel level: A foundation for fault tolerance for parallel computers
    • R. Gioiosa, J. C. Sancho, S. Jiang, F. Petrini, and K. Davis. Transparent, incremental checkpointing at kernel level: a foundation for fault tolerance for parallel computers. In Proceedings of the 2005 ACM/IEEE conference on Supercomputing, SC '05, pages 9-, Washington, DC, USA, 2005. IEEE Computer Society.
    • (2005) Proceedings of the 2005 ACM/IEEE Conference on Supercomputing, SC '05 , pp. 9
    • Gioiosa, R.1    Sancho, J.C.2    Jiang, S.3    Petrini, F.4    Davis, K.5
  • 18
    • 33749067567
    • Berkeley lab checkpoint/restart (BLCR) for Linux clusters
    • DOI 10.1088/1742-6596/46/1/067
    • P. H. Hargrove and J. C. Duell. Berkeley Lab Checkpoint/Restart (BLCR) for Linux clusters. In Journal of Physics: Conf. Series (SciDAC), volume 46, pages 494-499, June 2006. (Pubitemid 44461038)
    • (2006) Journal of Physics: Conference Series , vol.46 , Issue.1 , pp. 494-499
    • Hargrove, P.H.1    Duell, J.C.2
  • 19
    • 0021439162
    • Algorithm-based fault tolerance for matrix operations
    • K.-H. Huang and J. Abraham. Algorithm-based fault tolerance for matrix operations. Computers, IEEE Transactions on, C-33(6):518-528, June 1984. (Pubitemid 14584528)
    • (1984) IEEE Transactions on Computers , vol.C-33 , Issue.6 , pp. 518-528
    • Huang, K.-H.1    Abraham, J.A.2
  • 20
    • 85160681664
    • Transparent checkpoint-restart of multiple processes on commodity operating systems
    • O. Laadan and J. Nieh. Transparent checkpoint-restart of multiple processes on commodity operating systems. In USENIX Annual Technical Conference, 2007.
    • (2007) USENIX Annual Technical Conference
    • Laadan, O.1    Nieh, J.2
  • 21
    • 77955933052
    • Cassandra: A decentralized structured storage system
    • A. Lakshman and P. Malik. Cassandra: a decentralized structured storage system. SIGOPS Oper. Syst. Rev., 44:35-40, April 2010.
    • (2010) SIGOPS Oper. Syst. Rev. , vol.44 , pp. 35-40
    • Lakshman, A.1    Malik, P.2
  • 22
    • 34548046749
    • Proactive fault tolerance for HPC with Xen virtualization
    • DOI 10.1145/1274971.1274978, Proceedings of ICS07: 21st ACM International Conference on Supercomputing
    • A. B. Nagarajan, F. Mueller, C. Engelmann, and S. L. Scott. Proactive fault tolerance for HPC with Xen virtualization. In Proceedings of the 21st annual international conference on Supercomputing, ICS '07, pages 23-32, New York, NY, USA, 2007. ACM. (Pubitemid 47281603)
    • (2007) Proceedings of the International Conference on Supercomputing , pp. 23-32
    • Nagarajan, A.B.1    Mueller, F.2    Engelmann, C.3    Scott, S.L.4
  • 23
    • 34547424386
    • Cooperative checkpointing: A robust approach to large-scale systems reliability
    • DOI 10.1145/1183401.1183406, Proceedings of the 20th Annual International Conference on Supercomputing, ICS 2006
    • A. J. Oliner, L. Rudolph, and R. K. Sahoo. Cooperative checkpointing: a robust approach to large-scale systems reliability. In Proceedings of the 20th annual international conference on Supercomputing, ICS '06, pages 14-23, New York, NY, USA, 2006. ACM. (Pubitemid 47168488)
    • (2006) Proceedings of the International Conference on Supercomputing , pp. 14-23
    • Oliner, A.J.1    Rudolph, L.2    Sahoo, R.K.3
  • 26
    • 0026812659
    • Design and implementation of a log-structured file system
    • DOI 10.1145/146941.146943
    • M. Rosenblum and J. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems (TOCS), 10(1):26-52, 1992. (Pubitemid 23598979)
    • (1992) ACM Transactions on Computer Systems , vol.10 , Issue.1 , pp. 26-52
    • Rosenblum, M.1    Ousterhout, J.K.2
  • 29
    • 34548768671
    • A job pause service under LAM/MPI+BLCR for transparent fault tolerance
    • C. Wang, F. Mueller, C. Engelmann, and S. L. Scott. A Job Pause Service under LAM/MPI+BLCR for Transparent Fault Tolerance. In IPDPS, pages 1-10, 2007.
    • (2007) IPDPS , pp. 1-10
    • Wang, C.1    Mueller, F.2    Engelmann, C.3    Scott, S.L.4
  • 30
    • 0028465953
    • Algorithm-based fault tolerance for FFT networks
    • S.-J. Wang and N. Jha. Algorithm-based fault tolerance for FFT networks. Computers, IEEE Transactions on, 43(7):849-854, Jul 1994.
    • (1994) IEEE Transactions on Computers , vol.43 , Issue.7 , pp. 849-854
    • Wang, S.-J.1    Jha, N.2


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.