-
1
-
-
36148941068
-
Understanding failures in petascale computers
-
B. Schroeder and G. A. Gibson, "Understanding Failures in Petascale Computers," Journal of Physics: Conference Series, vol. 78, no. 1, p. 012022, Jul. 2007. [Online]. Available: http://dx.doi.org/10.1088/1742-6596/78/1/012022
-
(2007)
Journal of Physics: Conference Series
, vol.78
, Issue.1
, pp. 012022
-
-
Schroeder, B.1
Gibson, G.A.2
-
4
-
-
78650831692
-
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
-
A. Moody, G. Bronevetsky, K. Mohror, and B. R. de Supinski, "Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System," in Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC '10. Washington, DC, USA: IEEE Computer Society, Nov. 2010, pp. 1-11. [Online]. Available: http://dx.doi.org/10.1109/sc.2010.18
-
(2010)
Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Ser. SC '10
, pp. 1-11
-
-
Moody, A.1
Bronevetsky, G.2
Mohror, K.3
De Supinski, B.R.4
-
5
-
-
84884918986
-
-
"MPI Forum." [Online]. Available: http://www.mpi-forum.org/
-
MPI Forum
-
-
-
6
-
-
67649211140
-
ZOID: I/O forwarding infrastructure for petascale architectures
-
K. Iskra, J. W. Romein, K. Yoshii, and P. Beckman, "ZOID: I/O Forwarding Infrastructure for Petascale Architectures," in PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 153-162.
-
(2008)
PPoPP '08: Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
, pp. 153-162
-
-
Iskra, K.1
Romein, J.W.2
Yoshii, K.3
Beckman, P.4
-
7
-
-
78650808492
-
-
R. Ross, J. Moreira, K. Cupps, and W. Pfeiffer, "Parallel I/O on the IBM Blue Gene/L System," Blue Gene/L Consortium Quarterly Newsletter, Tech. Rep., First Quarter, 2006.
-
(2006)
Parallel I/O on the IBM Blue Gene/L System
-
-
Ross, R.1
Moreira, J.2
Cupps, K.3
Pfeiffer, W.4
-
8
-
-
85084160707
-
Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?
-
B. Schroeder and G. A. Gibson, "Disk failures in the real world: what does an MTTF of 1,000,000 hours mean to you?" in Proceedings of the 5th USENIX conference on File and Storage Technologies, ser. FAST '07. Berkeley, CA, USA: USENIX Association, 2007. [Online]. Available: http://dl.acm.org/citation.cfm?id=1267903.1267904
-
(2007)
Proceedings of the 5th USENIX Conference on File and Storage Technologies, Ser. FAST '07
-
-
Schroeder, B.1
Gibson, G.A.2
-
9
-
-
0021392066
-
Error-correcting codes for semiconductor memory applications: A state-of-the-art review
-
C. L. Chen and M. Y. Hsiao, "Error-correcting codes for semiconductor memory applications: a state-of-the-art review," IBM J. Res. Dev., vol. 28, no. 2, pp. 124-134, Mar. 1984. [Online]. Available: http://dx.doi.org/10.1147/rd.282.0124
-
(1984)
IBM J. Res. Dev.
, vol.28
, Issue.2
, pp. 124-134
-
-
Chen, C.L.1
Hsiao, M.Y.2
-
10
-
-
84944041103
-
A Case for Redundant Arrays of Inexpensive Disks (RAID)
-
D. A. Patterson, G. Gibson, and R. H. Katz, "A Case for Redundant Arrays of Inexpensive Disks (RAID)," in Proceedings of the 1988 ACM SIGMOD international conference on Management of data, ser. SIGMOD '88. New York, NY, USA: ACM, 1988, pp. 109-116. [Online]. Available: http://dx.doi.org/10.1145/50202.50214
-
(1988)
Proceedings of the 1988 ACM SIGMOD International Conference on Management of Data, Ser. SIGMOD '88
, pp. 109-116
-
-
Patterson, D.A.1
Gibson, G.2
Katz, R.H.3
-
11
-
-
84877700680
-
Design and Modeling of a Non-Blocking Checkpointing System
-
K. Sato, N. Maruyama, K. Mohror, A. Moody, T. Gamblin, B. R. de Supinski, and S. Matsuoka, "Design and Modeling of a Non-Blocking Checkpointing System," in Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '12. Salt Lake City, Utah: IEEE Computer Society Press, 2012. [Online]. Available: http://portal.acm.org/citation.cfm?id=2389022
-
(2012)
Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, Ser. SC '12
-
-
Sato, K.1
Maruyama, N.2
Mohror, K.3
Moody, A.4
Gamblin, T.5
De Supinski, B.R.6
Matsuoka, S.7
-
12
-
-
83155160949
-
FTI: High performance Fault Tolerance Interface for hybrid systems
-
L. Bautista-Gomez, D. Komatitsch, N. Maruyama, S. Tsuboi, F. Cappello, and S. Matsuoka, "FTI: high performance Fault Tolerance Interface for hybrid systems," in Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, Seattle, WA, USA, 2011.
-
(2011)
Proceedings of the 2011 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
-
-
Bautista-Gomez, L.1
Komatitsch, D.2
Maruyama, N.3
Tsuboi, S.4
Cappello, F.5
Matsuoka, S.6
-
13
-
-
0004381167
-
-
N. H. Vaidya, "On Checkpoint Latency," College Station, TX, USA, Tech. Rep., 1995. [Online]. Available: http://portal.acm.org/citation.cfm?id=892900
-
(1995)
On Checkpoint Latency
-
-
Vaidya, N.H.1
-
14
-
-
84868384848
-
-
R. L. Graham, R. Brightwell, B. Barrett, G. Bosilca, and Pjesivac-Grbović, "An Evaluation of Open MPI's Matching Transport Layer on the Cray XT," Oct 2007.
-
(2007)
An Evaluation of Open MPI's Matching Transport Layer on the Cray XT
-
-
Graham, R.L.1
Brightwell, R.2
Barrett, B.3
Bosilca, G.4
Pjesivac-Grbović5
-
15
-
-
84879817446
-
-
"PMGR COLLECTIVE." [Online]. Available: http://sourceforge.net/projects/pmgrcollective/
-
PMGR Collective
-
-
-
16
-
-
84906712886
-
Design and Modeling of a Non-Blocking Checkpoint System
-
K. Sato, A. Moody, K. Mohror, T. Gamblin, B. R. de Supinski, N. Maruyama, and S. Matsuoka, "Design and Modeling of a Non-Blocking Checkpoint System," in ATIP - A*CRC Workshop on Accelerator Technologies in High Performance Computing, May 2012.
-
(2012)
ATIP - A*CRC Workshop on Accelerator Technologies in High Performance Computing
-
-
Sato, K.1
Moody, A.2
Mohror, K.3
Gamblin, T.4
De Supinski, B.R.5
Maruyama, N.6
Matsuoka, S.7
-
17
-
-
0242571753
-
Slurm: Simple Linux utility for resource management
-
A. Yoo, M. Jette, and M. Grondona, "Slurm: Simple Linux utility for resource management," in Job Scheduling Strategies for Parallel Processing, ser. Lecture Notes in Computer Science, D. Feitelson, L. Rudolph, and U. Schwiegelshohn, Eds. Springer Berlin Heidelberg, 2003, vol. 2862, pp. 44-60. [Online]. Available: http://dx.doi.org/10.1007/10968987_3
-
(2003)
Job Scheduling Strategies for Parallel Processing, Ser. Lecture Notes in Computer Science
, vol.2862
, pp. 44-60
-
-
Yoo, A.1
Jette, M.2
Grondona, M.3
-
19
-
-
84940567900
-
FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World
-
G. E. Fagg and J. Dongarra, "FT-MPI: Fault Tolerant MPI, Supporting Dynamic Applications in a Dynamic World," in Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface. London, UK: Springer-Verlag, 2000, pp. 346-353. [Online]. Available: http://portal.acm.org/citation.cfm?id=746632
-
(2000)
Proceedings of the 7th European PVM/MPI Users' Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
, pp. 346-353
-
-
Fagg, G.E.1
Dongarra, J.2
-
20
-
-
84867646266
-
An evaluation of user-level failure mitigation support in MPI
-
W. Bland, A. Bouteiller, T. Herault, J. Hursey, G. Bosilca, and J. J. Dongarra, "An evaluation of user-level failure mitigation support in MPI," in Proceedings of the 19th European conference on Recent Advances in the Message Passing Interface, ser. EuroMPI'12. Berlin, Heidelberg: Springer-Verlag, 2012, pp. 193-203. [Online]. Available: http://dx.doi.org/10.1007/978-3-642-33518-1_24
-
(2012)
Proceedings of the 19th European Conference on Recent Advances in the Message Passing Interface, Ser. EuroMPI'12
, pp. 193-203
-
-
Bland, W.1
Bouteiller, A.2
Herault, T.3
Hursey, J.4
Bosilca, G.5
Dongarra, J.J.6
-
21
-
-
34547424834
-
Application-transparent checkpoint/restart for MPI programs over InfiniBand
-
Q. Gao, W. Yu, W. Huang, and D. K. Panda, "Application-transparent checkpoint/restart for MPI programs over InfiniBand," in ICPP'06: Proceedings of the 35th International Conference on Parallel Processing. IEEE Computer Society, 2006, pp. 471-478.
-
(2006)
ICPP'06: Proceedings of the 35th International Conference on Parallel Processing
, pp. 471-478
-
-
Gao, Q.1
Yu, W.2
Huang, W.3
Panda, D.K.4
-
22
-
-
20444444457
-
The LAM/MPI checkpoint/restart framework: System-initiated checkpointing
-
S. Sankaran, J. M. Squyres, B. Barrett, and A. Lumsdaine, "The LAM/MPI checkpoint/restart framework: System-initiated checkpointing," in Proceedings, LACSI Symposium, Santa Fe, 2003, pp. 479-493.
-
(2003)
Proceedings, LACSI Symposium
, pp. 479-493
-
-
Sankaran, S.1
Squyres, J.M.2
Barrett, B.3
Lumsdaine, A.4
-
23
-
-
35048812506
-
Adaptive MPI
-
C. Huang, O. Lawlor, and L. V. Kale, "Adaptive MPI," in Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 03), 2003, pp. 306-322.
-
(2003)
Proceedings of the 16th International Workshop on Languages and Compilers for Parallel Computing (LCPC 03)
, pp. 306-322
-
-
Huang, C.1
Lawlor, O.2
Kale, L.V.3
-
24
-
-
20444463494
-
FTC-Charm++: An in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
-
G. Zheng, L. Shi, and L. V. Kale, "FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI," in Proceedings of the 2004 IEEE International Conference on Cluster Computing, ser. CLUSTER '04. Washington, DC, USA: IEEE Computer Society, 2004, pp. 93-103. [Online]. Available: http://portal.acm.org/citation.cfm?id=1111712
-
(2004)
Proceedings of the 2004 IEEE International Conference on Cluster Computing, Ser. CLUSTER '04
, pp. 93-103
-
-
Zheng, G.1
Shi, L.2
Kale, L.V.3
-
25
-
-
77954904463
-
Distributed diskless checkpoint for large scale systems
-
L. A. Gomez, N. Maruyama, F. Cappello, and S. Matsuoka, "Distributed Diskless Checkpoint for Large Scale Systems," in Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on. IEEE, May 2010, pp. 63-72. [Online]. Available: http://dx.doi.org/10.1109/ccgrid.2010.40
-
(2010)
Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on
, pp. 63-72
-
-
Gomez, L.A.1
Maruyama, N.2
Cappello, F.3
Matsuoka, S.4
-
26
-
-
0032179680
-
Diskless checkpointing
-
J. S. Plank, K. Li, and M. A. Puening, "Diskless Checkpointing," IEEE Trans. Parallel Distrib. Syst., vol. 9, no. 10, pp. 972-986, Oct. 1998. [Online]. Available: http://dx.doi.org/10.1109/71.730527
-
(1998)
IEEE Trans. Parallel Distrib. Syst.
, vol.9
, Issue.10
, pp. 972-986
-
-
Plank, J.S.1
Li, K.2
Puening, M.A.3
-
27
-
-
0034782005
-
Chord: A scalable peer-to-peer lookup service for internet applications
-
I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrishnan, "Chord: A scalable peer-to-peer lookup service for internet applications," in Proceedings of the 2001 conference on Applications, technologies, architectures, and protocols for computer communications, ser. SIGCOMM '01. New York, NY, USA: ACM, 2001, pp. 149-160. [Online]. Available: http://doi.acm.org/10.1145/383059.383071
-
(2001)
Proceedings of the 2001 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communications, Ser. SIGCOMM '01
, pp. 149-160
-
-
Stoica, I.1
Morris, R.2
Karger, D.3
Kaashoek, M.F.4
Balakrishnan, H.5