2015, Pages 538-550

Scaling distributed cache hierarchies through computation and data co-scheduling

Author keywords

cache; NUCA; partitioning; thread scheduling

Indexed keywords

SCHEDULING

EID: 84934297423
PISSN: None
EISSN: None
Source Type: Conference Proceeding
DOI: 10.1109/HPCA.2015.7056061
Document Type: Conference Paper
Times cited: 33

References (64)
  • [1] A. Alameldeen and D. Wood, "IPC considered harmful for multiprocessor workloads," IEEE Micro, vol. 26, no. 4, 2006.
  • [2] B. Beckmann, M. Marty, and D. Wood, "ASR: Adaptive selective replication for CMP caches," in Proc. MICRO-39, 2006.
  • [3] B. Beckmann and D. Wood, "Managing wire delay in large chip-multiprocessor caches," in Proc. MICRO-37, 2004.
  • [4] N. Beckmann and D. Sanchez, "Jigsaw: Scalable Software-Defined Caches," in Proc. PACT-22, 2013.
  • [5] N. Beckmann and D. Sanchez, "Talus: A Simple Way to Remove Cliffs in Cache Performance," in Proc. HPCA-21, 2015.
  • [6] S. Bell, B. Edwards, J. Amann et al., "TILE64 processor: A 64-core SoC with mesh interconnect," in Proc. ISSCC, 2008.
  • [8] J. Chang and G. Sohi, "Cooperative caching for chip multiprocessors," in Proc. ISCA-33, 2006.
  • [9] D. Chiou, P. Jain, L. Rudolph, and S. Devadas, "Application-specific memory management for embedded systems using software-controlled caches," in Proc. DAC-37, 2000.
  • [10] Z. Chishti, M. Powell, and T. Vijaykumar, "Optimizing replication, communication, and capacity allocation in CMPs," in Proc. ISCA-32, 2005.
  • [11] S. Cho and L. Jin, "Managing distributed, shared L2 caches through OS-level page allocation," in Proc. MICRO-39, 2006.
  • [12] H. Cook, M. Moreto, S. Bird et al., "A hardware evaluation of cache partitioning to improve utilization and energy-efficiency while preserving responsiveness," in Proc. ISCA-40, 2013.
  • [13] W. J. Dally, "GPU Computing: To Exascale and Beyond," SC Plenary Talk, 2010.
  • [14] R. Das, R. Ausavarungnirun, O. Mutlu et al., "Application-to-core mapping policies to reduce memory system interference in multi-core systems," in Proc. HPCA-19, 2013.
  • [15] M. Dashti, A. Fedorova, J. Funston et al., "Traffic management: A holistic approach to memory placement on NUMA systems," in Proc. ASPLOS-18, 2013.
  • [16] J. Dean and L. Barroso, "The Tail at Scale," CACM, vol. 56, 2013.
  • [18] F. Guo, Y. Solihin, L. Zhao, and R. Iyer, "A framework for providing quality of service in chip multi-processors," in Proc. MICRO-40, 2007.
  • [20] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Reactive NUCA: Near-optimal block placement and replication in distributed caches," in Proc. ISCA-36, 2009.
  • [22] N. Hardavellas, I. Pandis, R. Johnson, and N. Mancheril, "Database Servers on Chip Multiprocessors: Limitations and Opportunities," in Proc. CIDR, 2007.
  • [24] H. J. Herrmann, "Geometrical cluster growth models and kinetic gelation," Physics Reports, vol. 136, no. 3, pp. 153-224, 1986.
  • [25] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, 2008.
  • [26] A. Hilton, N. Eswaran, and A. Roth, "FIESTA: A sample-balanced multi-program workload methodology," in Proc. MoBS, 2009.
  • [27] Intel, "Knights Landing: Next Generation Intel Xeon Phi," SC Presentation, 2013.
  • [31] G. Karypis and V. Kumar, "A fast and high quality multilevel scheme for partitioning irregular graphs," SIAM J. Sci. Comput., vol. 20, 1998.
  • [32] H. Kasture and D. Sanchez, "Ubik: Efficient Cache Sharing with Strict QoS for Latency-Critical Workloads," in Proc. ASPLOS-19, 2014.
  • [33] R. Kessler, M. Hill, and D. Wood, "A comparison of trace-sampling techniques for multi-megabyte caches," IEEE T. Comput., vol. 43, 1994.
  • [34] C. Kim, D. Burger, and S. Keckler, "An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches," in Proc. ASPLOS, 2002.
  • [36] H. Lee, S. Cho, and B. R. Childers, "CloudCache: Expanding and shrinking private caches," in Proc. HPCA-17, 2011.
  • [37] S. Li, J. H. Ahn, R. Strong et al., "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in Proc. MICRO-42, 2009.
  • [38] J. Lin, Q. Lu, X. Ding et al., "Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems," in Proc. HPCA-14, 2008.
  • [39] Z. Majo and T. R. Gross, "Memory system performance in a NUMA multicore multiprocessor," in Proc. ISMM, 2011.
  • [41] M. Marty and M. Hill, "Virtual hierarchies to support server consolidation," in Proc. ISCA-34, 2007.
  • [42] J. Merino, V. Puente, and J. Gregorio, "ESP-NUCA: A low-cost adaptive non-uniform cache architecture," in Proc. HPCA-16, 2010.
  • [45] T. Nowatzki, M. Tarm, L. Carli et al., "A general constraint-centric scheduling framework for spatial architectures," in Proc. PLDI-34, 2013.
  • [46] J. Ousterhout, P. Agrawal, D. Erickson et al., "The case for RAMClouds: Scalable high-performance storage entirely in DRAM," ACM SIGOPS Operating Systems Review, vol. 43, no. 4, 2010.
  • [47] D. Page, "Partitioned cache architecture as a side-channel defence mechanism," IACR Cryptology ePrint Archive, no. 2005/280, 2005.
  • [49] J. Park and W. Dally, "Buffer-space efficient and deadlock-free scheduling of stream applications on multi-core architectures," in Proc. SPAA-22, 2010.
  • [50] F. Pellegrini and J. Roman, "SCOTCH: A software package for static mapping by dual recursive bipartitioning of process and architecture graphs," in Proc. HPCN, 1996.
  • [51] M. Qureshi, "Adaptive Spill-Receive for Robust High-Performance Caching in CMPs," in Proc. HPCA-15, 2009.
  • [52] M. Qureshi and Y. Patt, "Utility-based cache partitioning: A low-overhead, high-performance, runtime mechanism to partition shared caches," in Proc. MICRO-39, 2006.
  • [53] D. Sanchez and C. Kozyrakis, "Vantage: Scalable and Efficient Fine-Grain Cache Partitioning," in Proc. ISCA-38, 2011.
  • [54] D. Sanchez and C. Kozyrakis, "ZSim: Fast and Accurate Microarchitectural Simulation of Thousand-Core Systems," in Proc. ISCA-40, 2013.
  • [55] A. Snavely and D. M. Tullsen, "Symbiotic jobscheduling for a simultaneous multithreading processor," in Proc. ASPLOS-9, 2000.
  • [56] D. Tam, R. Azimi, L. Soares, and M. Stumm, "Managing shared L2 caches on multicore systems in software," in Proc. WIOSCA, 2007.
  • [57] D. Tam, R. Azimi, and M. Stumm, "Thread clustering: Sharing-aware scheduling on SMP-CMP-SMT multiprocessors," in Proc. EuroSys, 2007.
  • [59] A. Tumanov, J. Wise, O. Mutlu, and G. R. Ganger, "Asymmetry-aware execution placement on manycore chips," in Proc. SFMA-3, 2013.
  • [60] B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, "Operating system support for improving data locality on CC-NUMA compute servers," in Proc. ASPLOS, 1996.
  • [62] C. Wu and M. Martonosi, "A Comparison of Capacity Management Schemes for Shared CMP Caches," in Proc. WDDD-7, 2008.
  • [63] M. Zhang and K. Asanovic, "Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors," in Proc. ISCA, 2005.
  • [64] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "Addressing shared resource contention in multicore processors via scheduling," in Proc. ASPLOS, 2010.


* This information was analyzed and extracted by KISTI from Elsevier's SCOPUS database.