Heterogeneous Computing. Mohamed Zahran
Dynamic voltage and frequency scaling (DVFS) is done at several layers of the computing stack. It is done at the hardware level and per core. This means that even if we think we are writing an application for a homogeneous multicore, the machine is actually heterogeneous, because each core may deliver different performance depending on the processes running on it. DVFS is also done at the operating system (OS) level. With Intel processors, for example, the OS requests a particular level of performance, known as a performance state (P-state), from the processor. The processor then uses DVFS to try to meet the requested P-state. Up to that point, the programmer has no control and all this happens under the hood. There are also techniques that involve application-directed DVFS, on the grounds that the programmer knows best when high performance is needed and when the program can tolerate lower performance for better power saving. However, this direction from the application can be overridden by the OS or the hardware.
1.2.3 Memory System
Beginning programmers see the memory as just a big array that is, usually, byte addressable. As programmers gain more knowledge, the concept of virtual memory will arise and they will learn that each process has its own virtual address space that is mapped to physical memory that they see as a big array that is, usually, byte addressable! Depending on the background of the programmers, the concept of cache memory may be known to them. But what programmers usually do not know is that the access time for the memory system and large caches is no longer fixed. To overcome complexity and power dissipation, both memory and large caches are divided into banks. Depending on the address accessed, the bank may be near to, or far from, the requesting core, resulting in nonuniform memory access (NUMA for memory) [Braithwaite et al. 2012] and nonuniform cache access (NUCA for cache) [Chishti et al. 2003]. This is one source of heterogeneous performance in the memory hierarchy.
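To make the near-bank and far-bank effect concrete, here is a minimal sketch of a toy latency model. Everything in it is an assumption made for illustration, not a description of any real processor: the banks sit in a single row, core i sits next to bank i, consecutive cache lines are interleaved across banks, and the cycle counts are invented.

```python
# Toy model of nonuniform access latency. All parameters are hypothetical.

def bank_of(address, num_banks=8, line_bytes=64):
    """Map a byte address to a bank, interleaving consecutive lines across banks."""
    return (address // line_bytes) % num_banks


def access_latency(core, address, num_banks=8, line_bytes=64,
                   base_cycles=10, cycles_per_hop=2):
    """Latency in cycles: a fixed bank access time plus a per-hop wire delay.

    Assumes core i is adjacent to bank i, so the distance is |core - bank|.
    """
    hops = abs(core - bank_of(address, num_banks, line_bytes))
    return base_cycles + cycles_per_hop * hops


if __name__ == "__main__":
    # The same core sees different latencies for different addresses.
    for address in (0, 64, 448):
        print(address, access_latency(core=0, address=address))
```

Under these assumptions, core 0 reaches address 0 in the base 10 cycles, while address 448 (line 7, hence bank 7) costs 24 cycles. The point is only that latency becomes a function of the address, not a constant.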
Another factor that contributes to the heterogeneity in memory systems is the pattern of cache hits and misses. Professional programmers, and optimizing compilers to some extent, know how to write cache-friendly code. However, the multiprogramming environment, where several processes are running simultaneously, the virtual memory system, and nondeterminism in parallel code make the response time of the memory hierarchy almost unpredictable. This is a kind of temporal heterogeneity.
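What "cache-friendly" means can be sketched with a tiny cache simulator. The code below is an illustration under stated assumptions, not a model of any real cache: it simulates a single direct-mapped cache with 64 lines of 64 bytes, and it compares a row-by-row traversal of a 64 x 64 array of 8-byte elements (stored row by row) with a column-by-column traversal of the same array. The function names are hypothetical.

```python
# Toy direct-mapped cache simulator. Sizes are assumptions for illustration.

def count_misses(addresses, num_lines=64, line_bytes=64):
    """Return the number of misses for a sequence of byte addresses."""
    tags = [None] * num_lines          # which memory block each cache line holds
    misses = 0
    for address in addresses:
        block = address // line_bytes  # memory block containing this address
        index = block % num_lines      # the one cache line it can occupy
        if tags[index] != block:
            tags[index] = block        # miss: fetch the block, evict the old one
            misses += 1
    return misses


def row_major(n, elem_bytes=8):
    """Addresses touched when walking an n x n array row by row."""
    return [(i * n + j) * elem_bytes for i in range(n) for j in range(n)]


def column_major(n, elem_bytes=8):
    """Addresses touched when walking the same array column by column."""
    return [(i * n + j) * elem_bytes for j in range(n) for i in range(n)]


if __name__ == "__main__":
    print("row-major misses:   ", count_misses(row_major(64)))
    print("column-major misses:", count_misses(column_major(64)))
```

Both traversals touch the same 4096 elements, yet in this model the row-by-row walk misses once per cache line (512 misses) while the column-by-column walk misses on every access (4096 misses). Real caches are set associative and shared with other processes, which is exactly why the actual response time is harder to predict than this sketch suggests.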
Another form of heterogeneity in memory systems is the technology. For the last several decades, the de facto technology used in the memory hierarchy has been dynamic RAM (DRAM) for the system memory and, in the last decade, embedded DRAM (eDRAM) for the last-level cache of some processors, notably IBM POWER processors. For the rest of the cache hierarchy, static RAM (SRAM) is the main choice. DRAM has higher density but higher latency, due in part to its refresh cycle. Despite many architecture tricks, DRAM is becoming a limiting factor for performance. This does not mean it will disappear from machines, at least not very soon, but it will need to be complemented with something else. SRAM has shorter latency and lower density. This is why it is used for caches, which need to be fast but not as big as the main system memory. Caches are also a big source of static power dissipation, especially leakage [Zhang et al. 2005]. With more cores on chip and the larger datasets of the big-data era, we need larger caches and bigger memory. But DRAM and SRAM are giving us diminishing returns from different angles: size, access latency, and power dissipation/consumption. A new technology is needed, and this adds a third element of heterogeneity.
The last few years have seen several emerging technologies that are candidates for caches and system memory. These technologies have the high density of DRAM, the low latency of SRAM, and, on top of that, they are nonvolatile [Boukhobza et al. 2017]. These technologies are not yet mainstream, but some of them are very close, pending solutions to challenges related to cost, power, and data consistency.
Table 1.1 shows a comparison between the current (volatile) memory technologies used for caches and main memory, namely, DRAM and SRAM, and the new nonvolatile memory (NVM) technologies. The numbers in the table are approximate and collected from different sources but for the most part are from Boukhobza et al. [2017]. Many of the nonvolatile memory technologies have much higher density than DRAM and SRAM; look at the cell size. They also have comparable read latency and even lower read power in most cases. There are several challenges in using NVM that need to be solved and are shown in the table. For instance, write endurance is much lower than that of DRAM and SRAM, which causes a reliability problem. The power needed for a write is relatively high in NVM. Consistency is also a big issue. When there is a power outage, we know that the data in DRAM and SRAM are gone. But for NVM we do not know whether the data stored are stale or updated, because the power may have gone off in the middle of a data update. A lot of research is needed to address these challenges. NVM can be used in the memory hierarchy at a level by itself, for example, as a last-level cache (LLC) or as main memory, which is a vertical integration. NVM can also be used in tandem with traditional DRAM or SRAM, which is a horizontal integration. The integration of NVM in the memory hierarchy can be managed by the hardware, managed by the operating system, or left to the programmer to decide where to place the data. The first two cases are beyond the programmer’s control. In the near future, the memory hierarchy is expected to include both volatile and nonvolatile memories, adding to the heterogeneity of the memory system.
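The consistency problem can be illustrated with a small simulation. The sketch below is not how any particular NVM product or library works; it is a toy, with hypothetical class and method names, that models persistent cells surviving a simulated power failure. It contrasts an unprotected multi-word update with one protected by an undo log, which is one well-known family of techniques for this problem. It assumes, unrealistically, that appending to the log is itself atomic and persistent.

```python
# Toy simulation of a torn update in nonvolatile memory. Names are hypothetical.

class SimulatedCrash(Exception):
    """Raised to model power failing in the middle of an update."""


class ToyNVM:
    def __init__(self, size):
        self.cells = [0] * size          # persistent data: survives a crash
        self.undo_log = []               # persistent log of (address, old value)
        self.stores_before_crash = None  # None means the power never fails

    def _store(self, address, value):
        """Write one cell, unless the simulated power fails first."""
        if self.stores_before_crash is not None:
            if self.stores_before_crash == 0:
                raise SimulatedCrash()
            self.stores_before_crash -= 1
        self.cells[address] = value

    def update_unsafe(self, updates):
        """Apply several writes with no protection against a crash."""
        for address, value in updates:
            self._store(address, value)

    def update_logged(self, updates):
        """Record each old value before overwriting it, then commit."""
        for address, value in updates:
            self.undo_log.append((address, self.cells[address]))
            self._store(address, value)
        self.undo_log.clear()            # commit: the update is complete

    def recover(self):
        """After a restart, roll back any update that did not commit."""
        while self.undo_log:
            address, old_value = self.undo_log.pop()
            self.cells[address] = old_value
```

If the power fails after the first of two writes, `update_unsafe` leaves one cell new and one cell old, and nothing in the memory says which is which. With `update_logged` followed by `recover`, the cells return to their values from before the update. The price is extra writes, which matters given the write power and endurance limits mentioned above.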
Table 1.1: Comparison of Several Memory Technologies
Figure 1.2 shows a summary of the factors that we have just discussed.
Figure 1.2: Factors Introducing Heterogeneity in Memory
1.3 Heterogeneity Within Our Control
In the previous section we explored what happens under the hood that makes the system heterogeneous in nature. In this section we explore factors that are under our control and that let us exploit the heterogeneity of the system. There is a big debate about how much control to give the programmer. More control can yield better performance and power efficiency, depending of course on the expertise of the programmer, but it lowers productivity. We discuss this issue later in the book. For this section we explore, from a programmer's perspective, what we can control.
1.3.1 The Algorithm and The Language
When you want to solve a problem, you can usually find several algorithms for it. For instance, look at how many sorting algorithms we have. You decide which algorithm to pick. We have to be very careful here. In the good old days of sequential programming, our main concern was big-O complexity; that is, we optimized for the amount of computation done. In parallel computing, computation is no longer the most expensive operation. Communication among computing nodes (or cores) and memory access are more expensive than computation. Therefore, it is sometimes wiser to pick an algorithm that is worse in terms of computation if it has a better communication pattern (i.e., less communication) and a better memory access pattern (i.e., better locality). You can even find algorithms with the same big-O where one is an order of magnitude slower than the other.
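The trade-off can be sketched with a back-of-the-envelope cost model. The model and all of its numbers are assumptions chosen only to illustrate the argument: each operation, each message, and each byte moved has a fixed cost in arbitrary time units, and the two algorithms, A and B, are hypothetical.

```python
# Toy cost model: time = computation + per-message latency + per-byte transfer.
# All costs are in arbitrary units and are invented for illustration.

def estimated_time(compute_ops, messages, bytes_moved,
                   op_cost=1.0, message_latency=1000.0, byte_cost=0.5):
    """Estimate running time from operation, message, and byte counts."""
    return (compute_ops * op_cost
            + messages * message_latency
            + bytes_moved * byte_cost)


# Hypothetical algorithm A: less computation, but chatty.
time_a = estimated_time(compute_ops=1_000_000, messages=10_000,
                        bytes_moved=1_000_000)

# Hypothetical algorithm B: twice the computation, but far less communication.
time_b = estimated_time(compute_ops=2_000_000, messages=100,
                        bytes_moved=100_000)

if __name__ == "__main__":
    print("A:", time_a)
    print("B:", time_b)
```

With these made-up costs, algorithm B does twice the computation yet finishes in less than a fifth of the estimated time of algorithm A, because a single message costs as much as a thousand operations. On a real machine the constants must be measured, and they differ from one system to another.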
Once you pick your algorithm, or set of algorithms in the case of more sophisticated applications, you need to translate it into a program using one of the many parallel programming languages available (and counting!). Here, too, you are in control of which language to pick. There are several issues to take into account when picking a programming language for your project. The first is how suitable the language is for the algorithm at hand. Any language can implement anything; this applies to both sequential and parallel languages. But some languages make some tasks much easier than others do. For example, if you want to count the number of times a specific pattern of characters appears in a text file, you can write a C program to do it, but a small Perl or Python script will do the job in far fewer lines. If you want less control but higher productivity, you can pick a language with a higher level of abstraction (like Java, Scala,