parallelism or out-of-order scheduling, CoolFetch also observed a carry-over effect of reduced power consumption in other processor structures that normally operate in parallel, and it also reduces the energy spent on squashed instructions by reducing the number of instructions stalling at retirement. In all, CoolFetch reported an average energy savings of 8%, with a relatively trivial architectural modification and a negligible performance penalty.

      The efficiency of a processing core, in terms of both energy consumption and compute per unit area, tends to decrease as the potential performance of the core increases. The primary cause is a shift of resources away from compute engines, as in small cores, and toward aggressive scheduling mechanisms, as in big cores. In modern out-of-order processors, this scheduling machinery constitutes the overwhelming majority of both core area and energy consumption.

      An obvious conclusion, then, is that large, sophisticated cores are not worth including in a system, since a sea of weak cores offers greater potential system-wide throughput than a small group of powerful cores. The problem with this conclusion, however, is that parallelizing software is difficult: parallel code is prone to errors such as race conditions, and many algorithms are limited by sequential components that are more difficult to parallelize. Some code cannot reasonably be parallelized at all. In fact, the majority of software is not parallelized, and thus cannot make use of a large number of cores. In these situations a single powerful core is preferable, since it offers high single-thread throughput at the cost of a limited ability to exploit thread-level parallelism. It follows that the best design depends on the number of threads exposed by the software: a large number of threads can run on a large number of cores, enabling higher system-wide throughput, while a few threads may be better served by a few powerful cores, since additional cores would go unused.
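
      The limiting effect of those sequential components is captured by Amdahl's law, which the text does not state explicitly but which underlies this argument. The C sketch below is purely illustrative: the parallel fractions and core counts are assumed numbers, not measurements from the book.

#include <stdio.h>

/* Amdahl's law: the ideal speedup of a program with parallel fraction p
 * on n identical cores is 1 / ((1 - p) + p / n). */
static double amdahl_speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void) {
    /* Assumed, illustrative parallel fractions and core counts. */
    double fractions[] = { 0.50, 0.90, 0.99 };
    int cores[] = { 4, 16, 64 };

    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            printf("p = %.2f, n = %2d cores -> speedup = %5.2fx\n",
                   fractions[i], cores[j],
                   amdahl_speedup(fractions[i], cores[j]));
    return 0;
}

      Even with 90% of the work parallelized, 64 weak cores yield less than a 9x speedup under this model, which is why powerful single cores remain attractive for code with significant sequential portions.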

      This observation gave rise to a number of academic works exploring heterogeneous systems that feature a small group of very powerful cores on the same die as a large group of very efficient cores [64, 71, 84]. Beyond these academic proposals, industry has begun to adopt the trend as well, as in the ARM big.LITTLE design [64]. While these designs are interesting, they still allocate compute resources statically, and thus cannot react to variation in the degree of parallelism present in software. To address this rigidity, core fusion [74] and other related work [31, 108, 115, 123] propose mechanisms for building powerful cores out of collaborative collections of weak cores. This allows the system to grow and shrink so that there are as many “cores” as there are threads, with each core scaled to maximize performance for the current amount of parallelism in the system.

      Core fusion [74] accomplishes this scaling by splitting a core into two halves: a narrow-issue conventional core with its fetch engine stripped off, and an additional component that acts as a modular fetch/decode/commit engine. This added component either performs fetches for each core individually, from separate program sequences, or performs a single wide fetch to feed all of the cores. Similar to how a line buffer reads in multiple instructions in a single effort, the wide fetch engine reads an entire block of instructions and issues them across the different cores. Decode and register renaming are also performed collectively, with the physical registers resident in the individual cores. A crossbar is added to move register values from one core to another when necessary. At the end of the pipeline, a reordering step is introduced to guarantee correct commit and exception handling. A diagram of this architecture is shown in Figure 3.3. Two additional instructions allow the operating system to merge and split core collections, thus adjusting the number of virtual cores available for scheduling.
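
      To illustrate how an operating system or runtime might use such a merge/split interface, the C sketch below wraps two placeholder operations and sizes fused groups to the number of runnable threads. The function names, the printed behavior, and the grouping policy are all assumptions made for illustration; they are not the actual core fusion instructions or their semantics.

#include <stdio.h>

/* Placeholder wrappers standing in for the two instructions core fusion
 * adds for merging and splitting core collections; the names and this
 * software-style interface are invented for illustration only. */
static void fuse_group(int first_core, int count) {
    printf("fuse cores %d..%d into one wide virtual core\n",
           first_core, first_core + count - 1);
}

static void split_group(int first_core, int count) {
    printf("split cores %d..%d into %d independent cores\n",
           first_core, first_core + count - 1, count);
}

/* Match the number of virtual cores to the number of runnable threads:
 * few threads -> fuse base cores into wide virtual cores (latency mode);
 * many threads -> run the base cores independently (throughput mode). */
static void reconfigure(int total_cores, int runnable_threads) {
    if (runnable_threads <= 0)
        return;
    if (runnable_threads >= total_cores) {
        split_group(0, total_cores);
    } else {
        int per_group = total_cores / runnable_threads;
        for (int t = 0; t < runnable_threads; t++)
            fuse_group(t * per_group, per_group);
    }
}

int main(void) {
    reconfigure(8, 2);   /* two heavy threads: two 4-wide virtual cores */
    reconfigure(8, 8);   /* fully parallel phase: eight narrow cores    */
    return 0;
}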

      As shown in Figure 3.4, fused core fusion cores perform only slightly worse than a monolithic processor of the same issue width, achieving performance within 20% of the monolithic design at an equivalent effective issue width. The main reason is that the infrastructure that enables fusion comes with a performance cost. The strength of this system, however, is its adaptability, not its performance relative to a processor designed for one particular software configuration. Furthermore, the structures needed for wide out-of-order scheduling do not need to be powered when cores are not fused. In effect, core fusion surrenders a portion of the area used to implement the out-of-order scheduler, and about 20% of performance when fused to emulate a larger core, in exchange for run-time customization of core width and core count that is binary compatible and therefore completely transparent to software. For systems that have no a priori knowledge of the workload that will run on the processor, or that expect software to transition between sequential and parallel phases, the ability to adjust to varying workloads is a great benefit.

      Figure 3.3: A 4-core core fusion processor bundle with components added to support merging of cores. Adapted from [74].

      Figure 3.4: Comparison of performance between processors of various issue widths and a 6-issue merged core fusion processor. Taken from [74].

      In a conventional, general-purpose processor design, each instruction that is executed must pass through a number of pipeline stages, and each stage incurs a cost that depends on the type of processor. Figure 3.1 showed the energy consumed in the various stages of the processor pipeline. Relative to the core computational requirement of an application, the energy spent in the execute stage is energy spent doing productive compute work; everything else (i.e., instruction fetch, renaming, instruction window allocation, wakeup and select logic) is overhead required to support and accelerate general-purpose instruction processing for a particular architecture. Execution constitutes such a small portion of the energy consumed because most instructions each perform only a small amount of work.
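
      To make the overhead argument concrete, the short C calculation below uses assumed per-stage energies; the numbers are illustrative placeholders, not the measured values behind Figure 3.1.

#include <stdio.h>

int main(void) {
    /* Assumed, illustrative per-instruction energies in picojoules;
     * the real breakdown is design- and workload-specific (Figure 3.1). */
    double fetch = 10.0, rename = 6.0, window = 8.0,
           wakeup_select = 9.0, execute = 4.0, commit = 3.0;

    double total = fetch + rename + window + wakeup_select + execute + commit;
    printf("execute stage:     %4.0f%% of per-instruction energy\n",
           100.0 * execute / total);
    printf("pipeline overhead: %4.0f%% of per-instruction energy\n",
           100.0 * (total - execute) / total);
    return 0;
}

      With these assumed numbers, a simple ALU operation accounts for only about a tenth of the energy of processing the instruction that carries it; this imbalance is what instruction set customization tries to correct.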

      Extending the instruction set of an otherwise conventional compute core to increase the amount of work done per instruction is one way of improving both performance and energy efficiency for particular tasks. This is accomplished by merging the work that would otherwise have been performed by multiple instructions into a single instruction. This is valuable because the single large instruction still requires only one pass through the fetch, decode, and commit phases, and thus less bookkeeping must be maintained to perform the same task. In addition to reducing the overhead associated with processing an instruction, ISA extensions provide access to custom compute engines that implement these composite operations more efficiently than would otherwise be possible.
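
      A familiar, small-scale instance of this idea is the fused multiply-add. The C sketch below uses the standard C99 fmaf() routine, which on ISAs with an FMA extension typically compiles to a single instruction doing the work of a multiply and an add; the dot product around it is just an illustrative use.

#include <math.h>
#include <stdio.h>

/* Dot product written so each iteration maps naturally onto one fused
 * multiply-add: two operations merged into one instruction that is
 * fetched, decoded, and committed only once. */
static float dot(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++)
        acc = fmaf(a[i], b[i], acc);   /* acc = a[i] * b[i] + acc */
    return acc;
}

int main(void) {
    float a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("%f\n", dot(a, b, 4));      /* prints 70.000000 */
    return 0;
}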

      Instruction set customization strategies range from very simple (e.g., [6, 95, 111]) to complex (e.g., [63, 66]). Simple but effective instruction set extensions are now common in commodity processors in the form of specialized vector instructions such as SSE and AVX. Section 3.4.1 discusses vector instructions, which allow simple operations, mostly floating-point operations, to be packed into a single instruction that operates over a large volume of data, potentially simultaneously. While these vector instructions are restricted to regular, compute-dense code, they provide a large enough performance advantage that processor manufacturers continue to push toward more feature-rich vector extensions [55].
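
      As a concrete illustration of this style of extension, the C snippet below uses AVX intrinsics to add eight single-precision values with one packed instruction. The intrinsics shown are standard, but the required compiler flag (e.g., -mavx for gcc/clang) and the availability of AVX on the target machine are assumptions.

#include <immintrin.h>   /* AVX intrinsics */
#include <stdio.h>

int main(void) {
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);       /* load eight floats */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);    /* eight additions, one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++)
        printf("%.0f ", c[i]);            /* 11 22 33 44 55 66 77 88 */
    printf("\n");
    return 0;
}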

      In addition to vector instructions, there has also been work proposed by both industry [95] and academia [63] that ties multiple operations together into a single compute engine that operates over a single element of data. These custom
