The performance gap between a fully customized solution using an application-specific integrated circuit (ASIC) and a general-purpose processor can be very large, as documented in several studies. An early case study of the 128-bit key AES encryption algorithm was presented in [116]. An ASIC implementation of this algorithm in a 0.18 μm CMOS technology achieves a 3.86 Gbits/second processing rate at 350 mW of power, while the same algorithm coded in assembly language yields a 31 Mbits/second processing rate at 240 mW on a StrongARM processor, and a 648 Mbits/second processing rate at 41.4 W on a Pentium III processor. This results in a performance/energy efficiency gap (measured in Gbits/second/W) of 85X and 800X, respectively, when compared with the ASIC implementation. In an extreme case, when the same algorithm is coded in Java and executed on an embedded SPARC processor, it yields 450 bits/second at 120 mW, resulting in a performance/energy efficiency gap as large as a factor of three million (!) compared to the ASIC solution.
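Since performance/energy efficiency here is simply throughput divided by power, these ratios can be checked directly from the quoted figures. Below is a minimal C sketch that recomputes them; note that with these numbers the Pentium III gap comes out near 700X (quoted as 800X in [116]), while the StrongARM and Java/SPARC gaps match the quoted 85X and roughly three million.

    #include <stdio.h>

    int main(void) {
        /* Throughput (Gbits/second) divided by power (W), using the
           figures quoted above from [116]. */
        double asic_eff = 3.86   / 0.350;  /* ASIC, 0.18 um CMOS    */
        double arm_eff  = 0.031  / 0.240;  /* StrongARM, assembly   */
        double p3_eff   = 0.648  / 41.4;   /* Pentium III, assembly */
        double java_eff = 450e-9 / 0.120;  /* embedded SPARC, Java  */

        printf("ASIC vs StrongARM:   %.0fX\n", asic_eff / arm_eff);  /* ~85X    */
        printf("ASIC vs Pentium III: %.0fX\n", asic_eff / p3_eff);   /* ~705X   */
        printf("ASIC vs Java/SPARC:  %.1eX\n", asic_eff / java_eff); /* ~2.9e6X */
        return 0;
    }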
A more recent study examined a much more complex application for such a gap analysis [67]. It uses a 720p high-definition H.264 encoder as the application driver, and a four-core CMP system built from Tensilica extensible RISC cores [119] as the baseline processor configuration. Compared to an optimized ASIC implementation, the baseline CMP is 250X slower and consumes 500X more energy. Adding 16-wide SIMD execution units to the baseline cores improves performance by 10X and energy efficiency by 7X. Adding custom fused instructions improves performance and energy efficiency by an additional 1.4X. Despite these enhancements, the resulting enhanced CMP is still 50X less energy efficient than the ASIC.
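The residual 50X energy efficiency gap is consistent with simply compounding the reported gains, as the following C sketch shows (the variable names are ours; only the factors come from [67]):

    #include <stdio.h>

    int main(void) {
        double baseline_gap = 500.0; /* baseline CMP uses 500X more energy  */
        double simd_gain    = 7.0;   /* energy efficiency gain from SIMD    */
        double fused_gain   = 1.4;   /* gain from custom fused instructions */

        /* 500 / (7 * 1.4) = ~51, matching the ~50X residual gap. */
        printf("Remaining gap: %.0fX\n", baseline_gap / (simd_gain * fused_gain));
        return 0;
    }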
The large energy efficiency gap between ASICs and general-purpose processors is the main motivation for architecture customization, which is the focus of this lecture. In particular, one way to significantly improve energy efficiency is to introduce many special-purpose on-chip accelerators implemented in ASIC technology and share them among multiple processor cores, so that as much computation as possible is carried out on accelerators instead of on general-purpose cores. This leads to accelerator-rich architectures, which have received growing interest in recent years [26, 28, 89]. Such architectures will be discussed in detail in Chapter 4.
There are two major concerns about using accelerators: their low utilization and their narrow workload coverage. However, given the utilization wall [128] and the dark silicon problem [51] discussed earlier, low accelerator utilization is no longer a serious problem, as only a fraction of on-chip computing resources can be activated at any one time in future technology generations, given the tight power and thermal budgets. So, it is perfectly fine to populate the chip with many accelerators, knowing that many of them will be inactive at any given time. But once an accelerator is in use, it can deliver one to two orders of magnitude improvement in energy efficiency over general-purpose cores.
The problem of narrow workload coverage can be addressed by introducing reconfigurability and using composable accelerators. Examples include the use of fine-grain field-programmable gate arrays (FPGAs), coarse-grain reconfigurable arrays [61, 62, 91, 94, 118], or dynamically composable accelerator building blocks [26, 27]. These approaches will be discussed in more detail in Section 4.4.
Given the significant energy efficiency advantage of accelerators and the promising progress in widening accelerator workload coverage, we increasingly believe that the future of processor architecture should be rich in accelerators, as opposed to having many general-purpose cores. To some extent, such accelerator-rich architectures are more like a human brain, which has many specialized neural microcircuits (accelerators), each dedicated to a different function (such as navigation, speech, vision, etc.). The high degree of customization in the human brain leads to a great deal of efficiency; the brain can perform various highly sophisticated cognitive functions while consuming only about 20W, an inspiring and challenging performance for computer architects to match.
Not only can the compute engine be customized, but so can the memory system and on-chip interconnects. For example, instead of only using a general-purpose cache, one may use program-managed or accelerator-managed buffers (or scratchpad memories). Customization is needed to flexibly partition these two types of on-chip memories. Memory customization will be discussed in Chapter 5. Also, instead of using a general-purpose mesh-based network-on-chip (NoC) for packet switching, one may prefer a customized circuit-switching topology between accelerators and the memory system. Customization of on-chip interconnects will be discussed in Chapter 6.
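To make the cache-versus-scratchpad distinction concrete, here is a minimal C sketch under an assumed memory map: SCRATCHPAD_BASE and TILE_BYTES are invented for illustration, and real platforms define their own addresses and transfer mechanisms (often DMA rather than memcpy). With a cache, on-chip residence is implicit; with a scratchpad, the program stages data explicitly.

    #include <string.h>
    #include <stdint.h>

    /* Assumed address of a software-managed on-chip buffer. */
    #define SCRATCHPAD_BASE ((uint8_t *)(uintptr_t)0x20000000u)
    #define TILE_BYTES      4096

    /* Cache version: touch DRAM directly and let the hardware cache
       decide what stays on chip. */
    void process_cached(uint8_t *dram_tile) {
        for (int i = 0; i < TILE_BYTES; i++)
            dram_tile[i] += 1;              /* residence is implicit     */
    }

    /* Scratchpad version: the program explicitly stages the tile into
       the on-chip buffer, works on it, and writes it back. */
    void process_scratchpad(uint8_t *dram_tile) {
        uint8_t *spm = SCRATCHPAD_BASE;
        memcpy(spm, dram_tile, TILE_BYTES); /* explicit fill             */
        for (int i = 0; i < TILE_BYTES; i++)
            spm[i] += 1;                    /* guaranteed on-chip access */
        memcpy(dram_tile, spm, TILE_BYTES); /* explicit write-back       */
    }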
The remainder of this lecture is organized as follows. Chapter 2 gives a broad overview of the trajectory of customization in computing. Customization of compute cores, such as custom instructions, is covered in Chapter 3. Loosely coupled compute engines are discussed in Chapter 4. Chapter 5 discusses customizations to the memory system, and Chapter 6 discusses custom on-chip interconnect designs. Finally, Chapter 7 concludes the lecture with a discussion of industrial trends and future research topics.
CHAPTER 2
Road Map
Customized computing involves the specialization of hardware for a particular domain, and it often includes a software component to fully leverage this specialization in hardware. In this chapter, we lay the foundation for customized computing, enumerating the design trade-offs and defining the vocabulary.
2.1 CUSTOMIZABLE SYSTEM-ON-CHIP DESIGN
To provide efficient support for customized computing, the general-purpose CMP (chip multiprocessor) widely used today needs to be replaced or transformed into a Customizable System-on-a-Chip (CSoC), also called a customizable heterogeneous platform (CHP) in some other publications [39]. A CSoC can be customized for a particular domain through the specialization of four major components of the computing platform: (1) processor cores, (2) accelerators and co-processors, (3) on-chip memory components, and (4) the network-on-chip (NoC) that connects the various components. We will explore each of these in detail individually, as well as in concert with the other CSoC components.
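As a concrete (and purely illustrative) way to think about these four axes, the following C sketch defines a hypothetical configuration record for one CSoC instance; the field names and the idea of a single flat record are our own assumptions, not a format from the literature.

    /* A hypothetical descriptor for one domain-customized CSoC instance.
       Fields map onto the four component classes named above. */
    typedef struct {
        int core_count;           /* (1) processor cores                */
        int accelerator_count;    /* (2) accelerators and co-processors */
        int cache_kb;             /* (3) on-chip memory: cache portion  */
        int scratchpad_kb;        /* (3) on-chip memory: buffer portion */
        const char *noc_topology; /* (4) NoC, e.g., "mesh" or custom    */
    } csoc_config;

    /* Example: a platform with many accelerators and a partitioned
       on-chip memory, as discussed in Chapters 4-6. */
    static const csoc_config example = { 4, 32, 512, 512, "mesh" };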
2.1.1 COMPUTE RESOURCES
Compute components such as processor cores handle the actual processing demands of the CSoC. There is a wide array of design choices for the compute components of the CSoC. When looking at customized compute units, there are three major factors to consider, all of which are largely independent of one another:
• Programmability
• Specialization
• Reconfigurability
Programmability
A fixed-function compute unit can perform one operation on incoming data, and nothing else. For example, a compute unit that is designed to perform an FFT operation can do only that; it cannot be redirected to any other computation, as the sketch below illustrates.
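As a minimal sketch of what "fixed function" means at the hardware/software boundary, consider a hypothetical memory-mapped FFT unit; all register addresses and names below are invented for illustration. The unit exposes no opcode: streaming data in and starting it is the only thing software can ask it to do.

    #include <stdint.h>

    /* Hypothetical MMIO registers of a fixed-function 1024-point FFT
       unit; addresses and layout are invented for illustration. */
    #define FFT_INPUT  ((volatile uint32_t *)(uintptr_t)0x40000000u) /* sample FIFO       */
    #define FFT_START  ((volatile uint32_t *)(uintptr_t)0x40000100u) /* write 1 to run    */
    #define FFT_DONE   ((volatile uint32_t *)(uintptr_t)0x40000104u) /* reads 1 when done */
    #define FFT_OUTPUT ((volatile uint32_t *)(uintptr_t)0x40000200u) /* result FIFO       */

    /* The only operation the unit supports: there is no way to request
       anything other than this fixed-size FFT. */
    void fft_1024(const uint32_t in[1024], uint32_t out[1024]) {
        for (int i = 0; i < 1024; i++)
            *FFT_INPUT = in[i];    /* stream samples into the unit  */
        *FFT_START = 1;            /* kick off the (only) operation */
        while (*FFT_DONE == 0)     /* busy-wait for completion      */
            ;
        for (int i = 0; i < 1024; i++)
            out[i] = *FFT_OUTPUT;  /* drain the transformed results */
    }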