-
1
-
-
8344232253
-
Adaptive incremental checkpointing for massively parallel systems
-
New York, NY, ACM Press
-
S. Agarwal, R. Garg, M. S. Gupta, and J. E. Moreira. Adaptive incremental checkpointing for massively parallel systems. In Proceedings of the 18th Annual International Conference on Supercomputing, pages 277-286, New York, NY, 2004. ACM Press.
-
(2004)
Proceedings of the 18th Annual International Conference on Supercomputing
, pp. 277-286
-
-
Agarwal, S.1
Garg, R.2
Gupta, M.S.3
Moreira, J.E.4
-
2
-
-
47249097978
-
An analysis of the consequences of a reduction in checkpoint latency for periodic checkpointing systems
-
Technical Report SAND2007-xxxx, Sandia National Laboratories
-
S. Arunagiri, S. Seelam, R. A. Oldfield, M. R. Varela, P. J. Teller, and R. Riesen. An analysis of the consequences of a reduction in checkpoint latency for periodic checkpointing systems. Technical Report SAND2007-xxxx, Sandia National Laboratories, 2007.
-
(2007)
-
-
Arunagiri, S.1
Seelam, S.2
Oldfield, R.A.3
Varela, M.R.4
Teller, P.J.5
Riesen, R.6
-
5
-
-
23244450120
-
Architectural specification for massively parallel computers: An experience and measurement-based approach
-
March
-
R. Brightwell, W. Camp, B. Cole, E. DeBenedictis, R. Leland, J. Tomkins, and A. B. Maccabe. Architectural specification for massively parallel computers: an experience and measurement-based approach. Concurrency and Computation: Practice and Experience, 17(10):1271-1316, March 2005.
-
(2005)
Concurrency and Computation: Practice and Experience
, vol.17
, Issue.10
, pp. 1271-1316
-
-
Brightwell, R.1
Camp, W.2
Cole, B.3
DeBenedictis, E.4
Leland, R.5
Tomkins, J.6
Maccabe, A.B.7
-
6
-
-
46049083336
-
The Red Storm computer architecture and its implementation
-
Salishan Lodge, Gleneden Beach, Oregon, April
-
W. J. Camp and J. L. Tomkins. The Red Storm computer architecture and its implementation. In The Conference on High-Speed Computing: LANL/LLNL/SNL, Salishan Lodge, Gleneden Beach, Oregon, April 2003.
-
(2003)
The Conference on High-Speed Computing: LANL/LLNL/SNL
-
-
Camp, W.J.1
Tomkins, J.L.2
-
7
-
-
0029715009
-
Evaluation of checkpoint mechanisms for massively parallel machines
-
Sendai, Japan, June, IEEE Computer Society Press
-
T.-C. Chiueh and P. Deng. Evaluation of checkpoint mechanisms for massively parallel machines. In Proceedings of the Annual Symposium on Fault Tolerant Computing, pages 370-379, Sendai, Japan, June 1996. IEEE Computer Society Press.
-
(1996)
Proceedings of the Annual Symposium on Fault Tolerant Computing
, pp. 370-379
-
-
Chiueh, T.-C.1
Deng, P.2
-
8
-
-
28044438299
-
A model for predicting the optimum checkpoint interval for restart dumps
-
August
-
J. Daly. A model for predicting the optimum checkpoint interval for restart dumps. Lecture Notes in Computer Science, 2660:3-12, August 2003.
-
(2003)
Lecture Notes in Computer Science
, vol.2660
, pp. 3-12
-
-
Daly, J.1
-
9
-
-
29344435659
-
A strategy for running large scale applications based on a model that optimizes the checkpoint interval for restart dumps
-
Edinburgh, Scotland, UK, May
-
J. Daly. A strategy for running large scale applications based on a model that optimizes the checkpoint interval for restart dumps. In Proceedings of the 26th International Conference on Software Engineering, pages 70-74, Edinburgh, Scotland, UK, May 2004.
-
(2004)
Proceedings of the 26th International Conference on Software Engineering
, pp. 70-74
-
-
Daly, J.1
-
10
-
-
28044460018
-
A higher order estimate of the optimum checkpoint interval for restart dumps
-
J. Daly. A higher order estimate of the optimum checkpoint interval for restart dumps. Future Generation Computer Systems, 22:303-312, 2006.
-
(2006)
Future Generation Computer Systems
, vol.22
, pp. 303-312
-
-
Daly, J.1
-
11
-
-
0042078549
-
A survey of rollback-recovery protocols in message-passing systems
-
September
-
E. N. Elnozahy, L. Alvisi, Y. M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Computing Surveys, 34(3):375-408, September 2002.
-
(2002)
ACM Computing Surveys
, vol.34
, Issue.3
, pp. 375-408
-
-
Elnozahy, E.N.1
Alvisi, L.2
Wang, Y.M.3
Johnson, D.B.4
-
12
-
-
84871146551
-
The performance of consistent checkpointing
-
Houston, TX, October, IEEE Computer Society Press
-
E. N. Elnozahy, D. B. Johnson, and W. Zwaenepoel. The performance of consistent checkpointing. In Proceedings of the 11th Symposium on Reliable Distributed Systems, pages 39-47, Houston, TX, October 1992. IEEE Computer Society Press.
-
(1992)
Proceedings of the 11th Symposium on Reliable Distributed Systems
, pp. 39-47
-
-
Elnozahy, E.N.1
Johnson, D.B.2
Zwaenepoel, W.3
-
13
-
-
9144223280
-
Checkpointing for petascale systems: A look into the future of practical rollback-recovery
-
April-June
-
E. N. Elnozahy and J. S. Plank. Checkpointing for petascale systems: A look into the future of practical rollback-recovery. IEEE Transactions on Dependable and Secure Computing, 1(2):97-108, April-June 2004.
-
(2004)
IEEE Transactions on Dependable and Secure Computing
, vol.1
, Issue.2
, pp. 97-108
-
-
Elnozahy, E.N.1
Plank, J.S.2
-
14
-
-
0033362477
-
Reducing data distribution bottlenecks by employing data visualization filters
-
Redondo Beach, CA, August, IEEE Computer Society Press
-
E. Franke and M. Magee. Reducing data distribution bottlenecks by employing data visualization filters. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 255-262, Redondo Beach, CA, August 1999. IEEE Computer Society Press.
-
(1999)
Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing
, pp. 255-262
-
-
Franke, E.1
Magee, M.2
-
15
-
-
84944067663
-
Network processors as building blocks in overlay networks
-
August
-
A. Gavrilovska, K. Schwan, O. Nordstrom, and H. Seifu. Network processors as building blocks in overlay networks. In Proceedings of the 11th Symposium on High Performance Interconnects (HOTI03), pages 83-88, August 2003.
-
(2003)
Proceedings of the 11th Symposium on High Performance Interconnects (HOTI03)
, pp. 83-88
-
-
Gavrilovska, A.1
Schwan, K.2
Nordstrom, O.3
Seifu, H.4
-
16
-
-
0029481201
-
Expanding the potential for disk-directed I/O
-
San Antonio, TX, October, IEEE Computer Society Press
-
D. Kotz. Expanding the potential for disk-directed I/O. In Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing, pages 490-495, San Antonio, TX, October 1995. IEEE Computer Society Press.
-
(1995)
Proceedings of the 1995 IEEE Symposium on Parallel and Distributed Processing
, pp. 490-495
-
-
Kotz, D.1
-
17
-
-
3042648680
-
Querying very large multi-dimensional datasets in ADR
-
Portland, OR, November, ACM Press and IEEE Computer Society Press
-
T. Kurc, C. Chang, R. Ferreira, and A. Sussman. Querying very large multi-dimensional datasets in ADR. In Proceedings of SC99: High Performance Networking and Computing, Portland, OR, November 1999. ACM Press and IEEE Computer Society Press.
-
(1999)
Proceedings of SC99: High Performance Networking and Computing
-
-
Kurc, T.1
Chang, C.2
Ferreira, R.3
Sussman, A.4
-
18
-
-
0028485392
-
Low-latency, concurrent checkpointing for parallel programs
-
August
-
K. Li, J. S. Naughton, and J. S. Plank. Low-latency, concurrent checkpointing for parallel programs. IEEE Transactions on Parallel and Distributed Systems, 5(8):874-879, August 1994.
-
(1994)
IEEE Transactions on Parallel and Distributed Systems
, vol.5
, Issue.8
, pp. 874-879
-
-
Li, K.1
Naughton, J.S.2
Plank, J.S.3
-
20
-
-
46049092568
-
Lightweight I/O for scientific applications
-
Barcelona, Spain, September
-
R. A. Oldfield, A. B. Maccabe, S. Arunagiri, T. Kordenbrock, R. Riesen, L. Ward, and P. Widener. Lightweight I/O for scientific applications. In Proceedings of the IEEE International Conference on Cluster Computing, Barcelona, Spain, September 2006.
-
(2006)
Proceedings of the IEEE International Conference on Cluster Computing
-
-
Oldfield, R.A.1
Maccabe, A.B.2
Arunagiri, S.3
Kordenbrock, T.4
Riesen, R.5
Ward, L.6
Widener, P.7
-
21
-
-
0032164545
-
Efficient parallel I/O in seismic imaging
-
Fall
-
R. A. Oldfield, D. E. Womble, and C. C. Ober. Efficient parallel I/O in seismic imaging. The International Journal of High Performance Computing Applications, 12(3):333-344, Fall 1998.
-
(1998)
The International Journal of High Performance Computing Applications
, vol.12
, Issue.3
, pp. 333-344
-
-
Oldfield, R.A.1
Womble, D.E.2
Ober, C.C.3
-
22
-
-
27544513113
-
Modeling coordinated checkpointing for large-scale supercomputers
-
Washington, DC, IEEE Computer Society
-
K. Pattabiraman, C. Vick, and A. Wood. Modeling coordinated checkpointing for large-scale supercomputers. In Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN'05), pages 812-821, Washington, DC, 2005. IEEE Computer Society.
-
(2005)
Proceedings of the 2005 International Conference on Dependable Systems and Networks (DSN'05)
, pp. 812-821
-
-
Pattabiraman, K.1
Vick, C.2
Wood, A.3
-
25
-
-
0030392072
-
Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques
-
J. S. Plank. Improving the performance of coordinated checkpointers on networks of workstations using RAID techniques. In Proceedings of the Symposium on Reliable Distributed Systems, pages 76-85, 1996.
-
(1996)
Proceedings of the Symposium on Reliable Distributed Systems
, pp. 76-85
-
-
Plank, J.S.1
-
26
-
-
0031570636
-
Fault-tolerant matrix operations for networks of workstations using diskless checkpointing
-
June
-
J. S. Plank, Y. Kim, and J. J. Dongarra. Fault-tolerant matrix operations for networks of workstations using diskless checkpointing. Journal of Parallel and Distributed Computing, 43(2):125-138, June 1997.
-
(1997)
Journal of Parallel and Distributed Computing
, vol.43
, Issue.2
, pp. 125-138
-
-
Plank, J.S.1
Kim, Y.2
Dongarra, J.J.3
-
29
-
-
84877034501
-
MRNet: A software-based multicast/reduction network for scalable tools
-
Phoenix, AZ, Nov
-
P. C. Roth, D. C. Arnold, and B. P. Miller. MRNet: A software-based multicast/reduction network for scalable tools. In Proceedings of SC2003: High Performance Networking and Computing, Phoenix, AZ, Nov. 2003.
-
(2003)
Proceedings of SC2003: High Performance Networking and Computing
-
-
Roth, P.C.1
Arnold, D.C.2
Miller, B.P.3
-
30
-
-
33845593340
-
A large-scale study of failures in high-performance computing systems
-
Philadelphia, PA, June, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA
-
B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks (DSN2006), Philadelphia, PA, June 2006. School of Computer Science, Carnegie Mellon University, Pittsburgh, PA.
-
(2006)
Proceedings of the International Conference on Dependable Systems and Networks (DSN2006)
-
-
Schroeder, B.1
Gibson, G.A.2
-
31
-
-
84864756973
-
An experimental study about diskless checkpointing
-
Vasteras, Sweden, August, IEEE Computer Society Press
-
L. M. Silva and J. G. Silva. An experimental study about diskless checkpointing. In Proceedings of the 24th EUROMICRO Conference, pages 395-402, Vasteras, Sweden, August 1998. IEEE Computer Society Press.
-
(1998)
Proceedings of the 24th EUROMICRO Conference
, pp. 395-402
-
-
Silva, L.M.1
Silva, J.G.2
-
33
-
-
47249146088
-
-
The BlueGene/L Team. An overview of the BlueGene/L supercomputer. In Proceedings of SC2002: High Performance Networking and Computing, Baltimore, MD, November 2002.
-
-
-
-
-
34
-
-
47249142897
-
A conservative path to petaflop computing: The Red Storm architecture scaled to a petaflop and beyond
-
October
-
J. Tomkins. A conservative path to petaflop computing: The Red Storm architecture scaled to a petaflop and beyond. 4th Annual Workshop on Linux Clusters for Supercomputing, October 2003.
-
(2003)
4th Annual Workshop on Linux Clusters for Supercomputing
-
-
Tomkins, J.1
-
35
-
-
84877699694
-
A case for two-level distributed recovery schemes
-
N. H. Vaidya. A case for two-level distributed recovery schemes. SIGMETRICS Perform. Eval. Rev., 23(1):64-73, 1995.
-
(1995)
SIGMETRICS Perform. Eval. Rev
, vol.23
, Issue.1
, pp. 64-73
-
-
Vaidya, N.H.1
-
36
-
-
0031388399
-
Impact of checkpoint latency on overhead ratio of a checkpointing scheme
-
N. H. Vaidya. Impact of checkpoint latency on overhead ratio of a checkpointing scheme. IEEE Transactions on Computers, 46(8):942-947, 1997.
-
(1997)
IEEE Transactions on Computers
, vol.46
, Issue.8
, pp. 942-947
-
-
Vaidya, N.H.1
-
37
-
-
33847216540
-
Early evaluation of the Cray XT3
-
Oak Ridge National Laboratory, April
-
J. S. Vetter, S. R. Alam, T. H. Dunigan Jr., M. R. Fahey, P. C. Roth, and P. H. Worley. Early evaluation of the Cray XT3. In Proceedings of the International Parallel and Distributed Processing Symposium. Oak Ridge National Laboratory, April 2006.
-
(2006)
Proceedings of the International Parallel and Distributed Processing Symposium
-
-
Vetter, J.S.1
Alam, S.R.2
Dunigan, T.H., Jr.3
Fahey, M.R.4
Roth, P.C.5
Worley, P.H.6
-
39
-
-
84976846528
-
A first order approximation to the optimum checkpoint interval
-
J. W. Young. A first order approximation to the optimum checkpoint interval. Communications of the ACM, 17(9):530-531, 1974.
-
(1974)
Communications of the ACM
, vol.17
, Issue.9
, pp. 530-531
-
-
Young, J.W.1