Customizable Computing. Yu-Ting Chen
4.4.2 Run-Time Mapping
4.4.3 CHARM
4.4.4 Using Composable Accelerators
5 On-Chip Memory Customization
5.1.1 Caches and Buffers (Scratchpads)
5.1.2 On-Chip Memory System Customizations
5.2 CPU Cache Customizations
5.2.1 Coarse-Grain Customization Strategies
5.2.2 Fine-Grain Customization Strategies
5.3 Buffers for Accelerator-Rich Architectures
5.3.1 Shared Buffer System Design for Accelerators
5.3.2 Customization of Buffers Inside an Accelerator
5.4 Providing Buffers in Caches for CPUs and Accelerators
5.4.1 Providing Software-Managed Scratchpads for CPUs
5.4.2 Providing Buffers for Accelerators
5.5 Caches with Disparate Memory Technologies
5.5.1 Coarse-Grain Customization Strategies
5.5.2 Fine-Grain Customization Strategies
6.2 Topology Customization
6.2.1 Application-Specific Topology Synthesis
6.2.2 Reconfigurable Shortcut Insertion
6.2.3 Partial Crossbar Synthesis and Reconfiguration
6.3 Routing Customization
6.3.1 Application-Aware Deadlock-Free Routing
6.3.2 Data Flow Synthesis
6.4 Customization Enabled by New Device/Circuit Technologies
6.4.1 Optical Interconnects
6.4.2 Radio-Frequency Interconnects
6.4.3 RRAM-Based Interconnects
Acknowledgments
This research is supported by the NSF Expeditions in Computing Award CCF-0926127, by C-FAR (one of six centers of STARnet, an SRC program sponsored by MARCO and DARPA), and by the NSF Graduate Research Fellowship Grant #DGE-0707424.
Yu-Ting Chen, Jason Cong, Michael Gill, Glenn Reinman, and Bingjun Xiao
June 2015
CHAPTER 1
Introduction
Since the introduction of the microprocessor in 1971, the improvement of processor performance in its first thirty years was largely driven by the Dennard scaling of transistors [45]. This scaling calls for a reduction of transistor dimensions by 30% every generation (roughly every two years) while keeping electric fields constant everywhere in the transistor to maintain reliability (which implies that the supply voltage needs to be reduced by 30% as well in each generation). Such scaling doubles the transistor density each generation, reduces the transistor delay by 30%, and at the same time reduces power by 50% and energy by 65% [7]. The increased transistor count also leads to more architecture design innovations, such as better memory hierarchy designs and more sophisticated instruction scheduling and pipelining support. These factors combined led to an over 1,000-fold performance improvement of Intel processors in 20 years (from the 1.5 μm generation down to the 65 nm generation), as shown in [7].
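The per-generation figures quoted above follow from simple arithmetic once a single scaling factor is fixed. The sketch below is a minimal illustration, not taken from [45] or [7]: it assumes the textbook idealization in which linear dimensions, supply voltage, capacitance, and gate delay all scale by the same factor of 0.7, and the function and key names are hypothetical.

```python
def dennard_generation(s: float = 0.7) -> dict:
    """Relative change of per-transistor metrics after one generation.

    Assumption (idealized Dennard scaling): linear dimensions, supply
    voltage, capacitance, and gate delay all scale by the same factor s.
    """
    area = s * s                          # each transistor occupies s^2 of its old area
    density = 1.0 / area                  # transistors per unit area
    delay = s                             # gate delay shrinks with dimensions
    frequency = 1.0 / delay               # so the clock can run 1/s times faster
    capacitance = s
    voltage = s
    energy = capacitance * voltage ** 2   # switching energy, proportional to C * V^2
    power = energy * frequency            # per-transistor power, C * V^2 * f
    power_density = power * density       # power per unit area
    return {
        "density": density,
        "delay": delay,
        "energy": energy,
        "power": power,
        "power_density": power_density,
    }


if __name__ == "__main__":
    for name, value in dennard_generation().items():
        print(f"{name:>14}: {value:.3f}x")
```

Under these assumptions density roughly doubles (about 2.04x), delay falls to 0.7x, energy falls to about 0.34x, and power falls to about 0.49x, in line with the 30%, 65%, and 50% figures in the text. Power density stays at 1.0x, which is why a fixed power budget was sustainable while this scaling held.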
Unfortunately, Dennard scaling came to an end in the early 2000s. Although the transistor dimension reduction by 30% per generation continues to follow Moore’s law, supply voltage scaling has almost come to a halt due to the rapid increase of leakage power. In this case, transistor density can continue to increase, but so can the power density. As a result, in order to continue meeting ever-increasing computing needs while maintaining a constant power budget, in the past ten years the computing industry stopped simple processor frequency scaling and entered the era of parallelization, with tens to hundreds of computing cores integrated in a single processor, and hundreds to thousands of computing servers connected in a warehouse-scale data center. However, such highly parallel, general-purpose computing systems now face serious challenges in terms of performance, power, heat dissipation, space, and cost, as pointed out by a number of researchers. The term “utilization wall” was introduced in [128], which shows that if a chip is filled with 64-bit adders (with input and output registers) designed in a 45 nm TSMC process technology and running at the maximum operating frequency (5.2 GHz in their experiment), only 6.5% of 300 mm² of silicon can be active at the same time. This utilization ratio drops further to less than 3.5% in the 32 nm fabrication technology, falling roughly by a factor of two in each technology generation following their leakage-limited scaling model [128].
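The factor-of-two drop per generation can be reproduced with a back-of-the-envelope model. The sketch below is a simplified illustration and not the leakage-limited model of [128]: it assumes that dimensions still shrink by a factor of 0.7 and transistors still switch faster, but that the supply voltage stays fixed. The function names are hypothetical, and the only measured input is the 6.5% utilization at 45 nm quoted above.

```python
def post_dennard_power_density(s: float = 0.7) -> float:
    """Relative power density after one generation with a fixed supply voltage.

    Assumptions: capacitance scales by s, frequency by 1/s, voltage stays
    at 1.0 (no scaling), and transistor density grows by 1/s^2.
    """
    capacitance = s
    voltage = 1.0                                     # voltage scaling has stalled
    frequency = 1.0 / s
    power_per_transistor = capacitance * voltage ** 2 * frequency  # stays near 1.0
    density = 1.0 / (s * s)
    return power_per_transistor * density             # about 2.04x per generation


def project_utilization(start: float, generations: int, s: float = 0.7) -> list:
    """Fraction of the chip that can be active under a constant power budget."""
    growth = post_dennard_power_density(s)
    fractions = [start]
    for _ in range(generations):
        # With a fixed budget, the active fraction shrinks as power density grows.
        fractions.append(fractions[-1] / growth)
    return fractions


if __name__ == "__main__":
    # Start from the 6.5% figure reported for 45 nm and step one generation.
    for step, fraction in enumerate(project_utilization(0.065, 1)):
        print(f"generation +{step}: {fraction:.2%} of the chip active")
```

Starting from 6.5%, one generation of this model yields about 3.2%, which is consistent with the "less than 3.5%" reported for 32 nm. Stepping further would be pure extrapolation of the simplified model rather than a result from [128].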
A similar but more detailed and realistic study on dark silicon projection was carried out in [51]. It uses a set of 20 representative Intel and AMD cores to build empirical models that capture the relationship between area and performance and the relationship between power and performance. These models, together with device-scaling models, are used to project core area, performance, and power in various technology generations. The study also considers real parallel application workloads, as represented by the PARSEC benchmark suite [9]. It further considers different multicore models, including symmetric multicores, asymmetric multicores (consisting of both large and small cores), dynamic multicores (either large or small cores, depending on whether the power or area constraint is imposed), and composable multicores (where small cores can be fused into large cores). The study concludes that at 22 nm, 21% of a fixed-size chip must be powered off, and at 8 nm, this dark silicon ratio grows to more than 50% [51]. It also points to the end of simple core scaling.
Given the limitation of core scaling, the computing industry and research community are actively