
      A compute unit that performs a single, predetermined operation on incoming data is fixed function. This inflexibility limits how much a compute unit may be leveraged, but it streamlines the design of the unit such that it may be highly optimized for that particular task. The number of bits used within the datapath of the unit and the types of mathematical operators included, for example, can be precisely tuned to the particular operation the compute unit will perform.

      Contrasting this, a programmable compute unit executes sequences of instructions to define the tasks it is to perform. The instructions understood by the programmable compute unit constitute its instruction set architecture (ISA). The ISA is the interface for use of the programmable compute unit. Software that makes use of the programmable compute unit will consist of these instructions, which are typically chosen to make the ISA expressive enough to describe the computation desired of the programmable unit. The hardware of the programmable unit will handle these instructions in a generally more flexible datapath than that of the fixed function compute unit. The fetching, decoding, and sequencing of instructions leads to performance and power overhead that is not required in a fixed function design. But the programmable compute unit is capable of executing different sequences of instructions to handle a wider array of functions than a fixed function pipeline.

      There exists a broad spectrum of design choices between these two alternatives. Programmable units may support a large number of instructions or only a handful, for example. A pure fixed function compute unit can be thought of as a programmable compute unit that only has a single implicit instruction (i.e., perform an FFT). The more instructions supported by the compute unit, the more compact the software expressing the desired functionality can be. The fewer instructions supported by the compute unit, the simpler the hardware required to implement these instructions and the more potential for an optimized and streamlined implementation. Thus the programmability of the compute unit refers to the degree to which it may be controlled via a sequence of instructions, from fixed function compute units that require no instructions at all to complex, expressive programmable designs with a large number of instructions.
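      To make this contrast concrete, the following C++ sketch is illustrative only; the three-instruction ISA, the opcode names, and the multiply-accumulate task are assumptions, not anything defined in this lecture. The fixed-function unit hardwires a single operation with a datapath width chosen for that task, while the programmable unit fetches, decodes, and executes instructions; a fixed-function unit then corresponds to the degenerate case of an ISA with a single implicit instruction.

#include <cstddef>
#include <cstdint>
#include <vector>

// Fixed-function unit: the operation is hardwired; only the data varies.
// The datapath width (16-bit inputs, 32-bit accumulator here) is tuned to the task.
std::int32_t fixed_mac(const std::vector<std::int16_t>& a,
                       const std::vector<std::int16_t>& b) {
    std::int32_t acc = 0;
    for (std::size_t i = 0; i < a.size() && i < b.size(); ++i)
        acc += static_cast<std::int32_t>(a[i]) * static_cast<std::int32_t>(b[i]);
    return acc;
}

// Programmable unit: behavior is defined by a sequence of instructions drawn
// from a hypothetical three-instruction ISA.
enum class Op { LOAD_IMM, ADD, MUL };
struct Instr { Op op; int dst, src1, src2; std::int32_t imm; };

std::int32_t run_program(const std::vector<Instr>& program) {
    std::int32_t reg[8] = {0};
    for (const Instr& ins : program) {            // fetch and sequence
        switch (ins.op) {                         // decode
        case Op::LOAD_IMM: reg[ins.dst] = ins.imm; break;                 // execute
        case Op::ADD:      reg[ins.dst] = reg[ins.src1] + reg[ins.src2]; break;
        case Op::MUL:      reg[ins.dst] = reg[ins.src1] * reg[ins.src2]; break;
        }
    }
    return reg[0];                                // result convention: register 0
}

      The interpreter loop makes the fetch, decode, and sequencing overhead explicit: it is paid on every instruction, but the same loop can run any program expressible in the ISA, whereas fixed_mac can only ever multiply-accumulate.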

       Specialization

      Customized computing targets a smaller set of applications and algorithms within a domain to improve performance and reduce power requirements. The degree to which components are customized to a particular domain is the specialization of those components. There are a large number of different specializations that a hardware designer may utilize, from the datapath width of the compute unit, to the number and type of functional units, to the amount of cache, and more.

      This is distinct from a general purpose design, which attempts to cover all applications rather than providing a customized architecture for a particular domain. General purpose designs may use a set of benchmarks from a target performance suite, but the idea is not to optimize specifically for those benchmarks. Rather, that performance suite may simply be used to gauge performance.

      There is again a broad spectrum of design choices between specialized and general purpose designs. One may consider general purpose designs to be those specialized for the domain of all applications. In some cases, general purpose designs are more cost effective since the design time may be amortized over more possible uses—an ALU that can be designed once and then used in a variety of compute units may amortize the cost of the design of the ALU, for example.
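      As a rough software analogy for datapath-width specialization, the parameterized kernel below is a sketch; the dot-product task and the particular type choices are assumptions. It shows how one description can be instantiated with exactly the precision a domain needs, much as a hardware generator would emit differently sized datapaths from one parameterized design.

#include <cstddef>
#include <cstdint>

// Width-specialized dot product: Elem and Acc are chosen to match the
// precision a target domain actually needs, and nothing more.
template <typename Elem, typename Acc>
Acc dot(const Elem* a, const Elem* b, std::size_t n) {
    Acc acc = 0;
    for (std::size_t i = 0; i < n; ++i)
        acc += static_cast<Acc>(a[i]) * static_cast<Acc>(b[i]);
    return acc;
}

// Narrow instantiation, e.g., for 8-bit pixel data...
std::int32_t dot_pixels(const std::uint8_t* a, const std::uint8_t* b, std::size_t n) {
    return dot<std::uint8_t, std::int32_t>(a, b, n);
}

// ...and a wide instantiation for 32-bit data. In hardware, these would be two
// differently sized datapaths produced from the same parameterized description.
std::int64_t dot_wide(const std::int32_t* a, const std::int32_t* b, std::size_t n) {
    return dot<std::int32_t, std::int64_t>(a, b, n);
}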

       Reconfigurability

      Once a design has been implemented, it can be beneficial to allow further adaptation to continue to customize the hardware to react to (1) changes in data usage patterns, (2) algorithmic changes or advancements, and (3) domain expansion or unintended use. For example, a compute unit may have been optimized to perform a particular algorithm for FFT, but a new approach may be faster. Hardware that can flexibly adapt even after tape out is reconfigurable hardware. The degree to which hardware may be reconfigured depends on the granularity of reconfiguration. While finer granularity reconfiguration can allow greater flexibility, the overhead of reconfiguration can mean that a reconfigurable design will perform worse and/or be less energy efficient than a static (i.e., non-reconfigurable) alternative. One example of a fine-grain reconfigurable platform is an FPGA, which can be used to implement a wide array of different compute units, from fixed function to programmable units, with all levels of specialization. But an FPGA implementation of a particular compute unit is less efficient than an ASIC implementation of the same compute unit. However, the ASIC implementation is static, and cannot adapt after design tape out. We will examine more coarse-grain alternatives for reconfigurable compute units in Section 4.4.
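      The following sketch, with hypothetical types and names not drawn from any particular FPGA or coarse-grain architecture, models the two ends of the reconfiguration-granularity spectrum in software: a 4-input lookup table, the basic cell of an FPGA, whose configuration bits define an arbitrary Boolean function, and a coarse-grain functional unit whose configuration merely selects among a few word-level operations.

#include <array>
#include <cstdint>

// Fine-grain reconfiguration: a 4-input lookup table, the basic FPGA cell.
// Rewriting its 16 configuration bits changes the logic function it computes.
struct Lut4 {
    std::array<bool, 16> truth_table{};           // the configuration bits
    bool eval(bool a, bool b, bool c, bool d) const {
        return truth_table[(a << 3) | (b << 2) | (c << 1) | d];
    }
};

// Coarse-grain reconfiguration: a word-level functional unit whose operation
// is selected by a small configuration field rather than per-bit LUTs.
struct CoarseFu {
    enum class Cfg { ADD, MUL, MAX } cfg = Cfg::ADD;
    std::int32_t eval(std::int32_t x, std::int32_t y) const {
        switch (cfg) {
        case Cfg::ADD: return x + y;
        case Cfg::MUL: return x * y;
        case Cfg::MAX: return x > y ? x : y;
        }
        return 0;                                 // unreachable; keeps compilers happy
    }
};

      The LUT can realize any 4-input function but pays for that generality in configuration storage and routing; the coarse-grain unit can only switch among a few word-level operations, which is far cheaper to reconfigure but far less flexible, mirroring the FPGA-versus-coarse-grain tradeoff discussed above.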

       Examples

      • Accelerators—early GPUs, MPEG/media decoders, crypto accelerators

      • Programmable Cores—modern GPUs, general purpose cores, ASIPs

      • Future designs may feature accelerators in a primary computational role

      • Some programmable cores and/or programmable fabric are still included for generality/longevity

      Section 3 covers the customization of processor cores and Section 4 covers coprocessors and accelerators. We split compute components into two sections to better manage the diversity of the design space for these components.

       On-Chip Memory

      Chips are fundamentally pin-limited, which impacts the amount of bandwidth that can be supplied to the compute units described in the previous section. This is further exacerbated by limitations in DRAM scaling. On-chip memory is one technique to mitigate this. On-chip memory can be used in a variety of ways, from providing data buffering for streaming data from off-chip to providing a place to store data that will be reused multiple times for computation. Once again, different applications will have different on-chip memory requirements. As with compute units, there are a wide array of design choices for hardware architects to consider in the design of the memory hierarchy.
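      As one illustration of the buffering use of on-chip memory, the sketch below uses invented names: fetch_tile stands in for a DMA transfer, and the overlap of fetch and compute that real hardware would provide is only indicated in comments. It shows the classic double-buffering pattern, in which the next tile is staged into one on-chip buffer while the current tile is consumed from the other.

#include <cstddef>
#include <vector>

// Hypothetical off-chip fetch: in a real system this would be a DMA transfer
// into an on-chip buffer; here it just copies a tile from an array standing in
// for DRAM. In hardware this fetch would overlap with the compute below.
void fetch_tile(const std::vector<float>& dram, std::size_t offset,
                std::vector<float>& onchip) {
    for (std::size_t i = 0; i < onchip.size(); ++i)
        onchip[i] = dram[offset + i];
}

// Streaming use of on-chip memory: double buffering stages tile t+1 into one
// buffer while tile t is consumed from the other. This sequential sketch only
// shows the access pattern, not the overlap itself.
float process_stream(const std::vector<float>& dram, std::size_t tile_elems) {
    if (tile_elems == 0 || dram.size() < tile_elems) return 0.0f;
    std::size_t tiles = dram.size() / tile_elems;
    std::vector<float> buf[2] = {std::vector<float>(tile_elems),
                                 std::vector<float>(tile_elems)};
    float sum = 0.0f;
    fetch_tile(dram, 0, buf[0]);                       // prime the first buffer
    for (std::size_t t = 0; t < tiles; ++t) {
        if (t + 1 < tiles)                             // prefetch the next tile
            fetch_tile(dram, (t + 1) * tile_elems, buf[(t + 1) % 2]);
        for (float v : buf[t % 2]) sum += v;           // consume the current tile
    }
    return sum;
}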

       Transparency to Software

      A cache is a relatively small but fast memory that leverages the principle of locality to reduce the latency of memory accesses. Data that will be reused in the near future is kept in the cache to avoid accesses to longer-latency memory. There are two primary approaches to managing a cache (i.e., orchestrating what data comes into the cache and what data leaves it): purely hardware approaches and software-managed caches. In this book, we will use the term scratchpad to refer to software-managed caches, where the application writer or the compiler is responsible for explicitly bringing data into and out of the cache through special instructions. Hardware caches, where hardware control circuits orchestrate data movement without software intervention, will simply be referred to as caches in this book. Scratchpads have tremendous potential for application-specific customization, since cache management can be tuned to a particular application, but they also come with coding overhead, as the programmer or compiler must explicitly map out this orchestration. Conventional caches are more flexible, as they can handle a wider array of applications without requiring explicit management, and may be preferable when the access pattern is unpredictable and therefore requires dynamic adaptation.
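      A minimal sketch of the difference, assuming a simple blockwise scaling kernel (the function names and the 256-element scratchpad capacity are invented for illustration): the cached version simply touches main-memory data and lets the hardware decide what stays on chip, while the scratchpad version explicitly stages each block into a local buffer and writes it back, standing in for the special data-movement instructions mentioned above.

#include <cstddef>
#include <cstring>

// Cache version: the code simply touches main-memory data; a hardware cache
// decides, invisibly to software, which of it stays on chip.
void scale_cached(float* data, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= s;
}

// Scratchpad version: software explicitly stages each block into a local
// buffer, computes on it, and writes it back. The staging is the programmer's
// or the compiler's job, and the 256-element capacity is an assumption.
void scale_scratchpad(float* data, std::size_t n, float s) {
    constexpr std::size_t BLOCK = 256;
    float spad[BLOCK];
    for (std::size_t base = 0; base < n; base += BLOCK) {
        std::size_t len = (n - base < BLOCK) ? (n - base) : BLOCK;
        std::memcpy(spad, data + base, len * sizeof(float));     // explicit load to scratchpad
        for (std::size_t i = 0; i < len; ++i)
            spad[i] *= s;
        std::memcpy(data + base, spad, len * sizeof(float));     // explicit write-back
    }
}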

       Sharing

      On-chip memory may be kept private to a particular compute unit or may be shared among multiple compute units. Private on-chip memory means that the application will not need to contend for space with another application, and will get the full benefit of the on-chip memory. Shared on-chip memory can amortize the cost of on-chip memory over several compute units, providing a potentially larger pool of space that can be leveraged by these compute units than if the space were partitioned among the units as private memory. For example, four compute units can each have 1MB of on-chip memory dedicated to them. Each compute unit will always have 1MB regardless of the demand from other compute units. However, if four compute units share a single 4MB memory, a unit with higher demand can draw on more than 1MB whenever the other units are using less, at the cost of contending with them for the shared space.
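      A small sketch of this tradeoff, using invented names and the 1MB/4MB figures from the example above: with private partitions each unit's capacity is fixed, while a shared pool lets a unit with higher demand take more than its 1MB share as long as the others are using less, at the price of possible contention.

#include <cstddef>

// Private partitions: each of the four compute units always owns 1MB,
// whether it currently needs it or not.
constexpr std::size_t kPrivateBytesPerUnit = 1u << 20;   // 1MB per unit

// Shared alternative: the same 4MB forms one pool. A unit with high demand
// can hold more than 1MB as long as the other units are using less, but an
// allocation can now fail because another unit got there first.
struct SharedPool {
    std::size_t capacity = 4u << 20;                      // 4MB total
    std::size_t used = 0;
    bool allocate(std::size_t bytes) {
        if (used + bytes > capacity) return false;        // contention: pool exhausted
        used += bytes;
        return true;
    }
    void release(std::size_t bytes) { used -= bytes; }
};

      The shared pool captures both the upside (a busy unit can exceed its nominal 1MB share) and the downside (an allocation can fail when other units are holding the space), which is exactly the contention the private scheme avoids.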
