Customizable Computing. Yu-Ting Chen

Чтение книги онлайн.

Читать онлайн книгу Customizable Computing - Yu-Ting Chen страница 6

Customizable Computing - Yu-Ting Chen Synthesis Lectures on Computer Architecture

Скачать книгу

on-chip memory, and if the different compute units use different amounts of memory, one compute unit may, for example, use more than 1MB of space at a particular time since there is a large pool of memory available. Sharing works particularly well when compute units use different amounts of memory at different times. Sharing is also extremely effective when compute units make use of the same memory locations. For example, if compute units are all working on an image in parallel, storing the image in a single memory shared among the units allows the compute units to more effectively cooperate on the shared data.

      On-chip memory stores the data needed by the compute units, but an important part of the overall CSoC is the communication infrastructure that allows this stored data to be distributed to the compute units, that allows the data to be delivered to/from the on-chip memory from/to the memory interfaces that communicate off-chip, and that allows compute units to synchronize and communicate with one another. In many applications there is a considerable amount of data that must be communicated to the compute units used to accelerate application performance. And with multiple compute units often employed to maximize data level parallelism, there are often multiple data streams being communicated around the CSoC. These requirements transcend the conventional bus-based design of older multicore designs, with designers instead choosing network-on-chip (NoC) designs. NoC designs enable the communication of more data between more CSoC components.

      Components interfacing with an NoC typically bundle transmitted data into packets, which contain at least address information as to the desired communication destination and the payload itself, which is some portion of the data to be transmitted to a particular destination. NoCs transmit messages via packets to enable flexible and reliable data transport—packets may be buffered at intermediate nodes within the network or reordered in some situations. Packet-based communication also avoids long latency arbitration that is associated with communication in a single hop over an entire chip. Each hop through a packet-based NoC performs local arbitration instead.

      The creation of an NoC involves a rich set of design decisions that may be highly customized for a set of applications in a particular domain. Most design decisions impact the latency or bandwidth of the NoC. In simple terms, the latency of the NoC is how long it takes a given piece of data to pass through the NoC. The bandwidth of the NoC is how much data can be communicated in the NoC at a particular time. Lower latency may be more important for synchronizing communication, like locks or barriers that impact multiple computational threads in an application. Higher bandwidth is more important for applications with streaming computation (i.e., low data locality) for example.

      One example of a design decision is the topology of an NoC. This refers to the pattern of links that connect particular components of the NoC. A simple topology is a ring, where each component in the NoC is connected to two neighboring components, forming a chain of components. More complex communication patterns may be realized by more highly connected topologies that allow more simultaneous communication or a shorter communication distance.

      Another example is the bandwidth of an individual link in the topology—wire that is traversed in one cycle of the network’s clock. Larger links can improve bandwidth but require more buffering space at intermediate network nodes, which can increase power cost.

      An NoC is typically designed with a particular level of utilization in mind, where decisions like topology or link bandwidth are chosen based on an expected level of service. For example, NoCs may be designed for worst case behavior, where the bandwidth of individual links is sized for peak traffic requirements, and every path in the network is capable of sustaining that peak bandwidth requirement. This is a flexible design in that the worst case behavior can manifest on any particular communication path in the NoC, and there will be sufficient bandwidth to handle it. But it can mean overprovisioning the NoC if worst case behavior is infrequent or sparsely exhibited in the NoC. In other words, the larger bandwidth components can mean wasted power (if only static power) or area in most cases. NoCs may also be designed for average case behavior, where the bandwidth is sized according to the average traffic requirement, but in such cases performance can suffer when worst case behavior is exhibited.

       Topological Customization

      Customized designs can specialize different parts of the NoC for different communication patterns seen in applications within a domain. For example, an architecture may specialize the NoC such that there is a high bandwidth connection between a memory interface and a particular compute unit that performs workload balancing and sorting for particular tasks, and then there is a lower bandwidth connection between that compute unit for workload balancing and the remainder of the compute units that perform the actual computation (i.e., work). More sophisticated designs can adapt bandwidth to the dynamic requirements of the application in execution. Customized designs may also adapt the topology of the NoC to the specific requirements of the application in execution. Section 6.2 will explore such flexible designs, along with some of the complexity in implementing NoC designs that are specialized for particular communication patterns.

       Routing Customization

      Another approach to specialization is to change the routing of packets in the NoC. Packets may be scheduled in different ways to avoid congestion in the NoC, for example. Another example would be circuit switching, where a particular route through the NoC is reserved for a particular communication, allowing packets in that communication to be expedited through the NoC without intermediate arbitration. This is useful in bursty communication where the cost of arbitration may be amortized over the communication of many packets.

       Physical Design Customization

      Some designs leverage different types of wires (i.e., different physical trade-offs) to provide a heterogeneous NoC with specialized communication paths. And there are also a number of exciting alternative interconnects that are emerging for use in NoC design. These alternative interconnects typically improve interconnect bandwidth and reduce communication latency, but may require some overhead (such as upconversion to an analog signal to make use of the alternative interconnect). These interconnects have some physical design and architectural challenges, but also provide some interesting options for customized computing, as we will discuss in Section 6.4.

      Customization is often a holistic process that involves both hardware customization and software orchestration. Application writers (i.e., domain experts) may have intimate knowledge of their applications which may not be expressed easily or at all in traditional programming languages. Such information could include knowledge of data value ranges or error tolerance, for example. Software layers should provide a sufficiently expressive language for programmers to communicate their knowledge of the applications in a particular domain to further customize the use of specialized hardware.

      There are a number of approaches to programming domain-specific hardware. A common approach is to create multiple layers of abstraction between the application programmer and the domain-specific hardware. The application programmer writes code in a relatively high level language that is expressive enough to capture domain-specific information. The high level language uses library routines implemented in the lower levels of abstraction as much as possible to cover the majority of computational tasks. The library routines may be implemented through further levels of abstraction, but ultimately lead to a set of primitives that directly leverage domain-specific hardware. As an example, library routines to do FFTs may leverage hardware accelerators specifically designed for FFT. This provides some portability of the higher level application programmer code, while still providing domain-specific specialization at the lower abstraction levels that directly leverages customized hardware. This also hides the complexity of customized hardware from the application writer.

      Another question is how much of the process of software mapping may

Скачать книгу