Heterogeneous Computing. Mohamed Zahran
The solution is to stop increasing the clock frequency and instead increase the number of cores per chip, mostly at lower frequency. We can no longer increase frequency; otherwise power density becomes unbearable. With simpler cores and lower frequency, we reduce power dissipation and consumption. With multiple cores, we hope to maintain higher performance. Figure 1.1 [Rupp 2018] tells the whole story and shows the trends of several aspects of microprocessors throughout the years. As the figure shows, from around 2004 the number of logical cores started to increase beyond a single core. The word logical includes physical cores with simultaneous multithreading (SMT) capability, also known as hyperthreading technology [Tullsen et al. 1995, Lo et al. 1997]. So a single core with four-way hyperthreading is counted as four logical cores. With SMT and the increase in the number of physical cores, we can see a sharp increase in the number of logical cores (note the logarithmic scale). If we look at the power metric, the 1990s was not a very friendly decade in terms of power: we see a steady increase. After we moved to multicore, the increase slowed down a bit, thanks to the move itself as well as the rise of dark-silicon techniques [Allred et al. 2012, Bose 2013, Esmaeilzadeh et al. 2011] and some VLSI tricks. “Dark silicon” refers to the parts of the processor that must be turned off (hence dark) so that the heat generated does not exceed the maximum that the cooling system can dissipate (called thermal design point, or TDP). How do we manage dark silicon while trying to increase performance? This question has driven many research papers in the second decade of the twenty-first century. We can think of the dark-silicon strategy as a way to continue increasing the number of cores per chip while keeping the power and temperature at a manageable level. The figure also shows that we stopped, or substantially slowed, increasing clock frequency.
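The counting rule for logical cores mentioned above can be made concrete with a minimal sketch. The helper name `logical_cores` and its parameters are hypothetical, introduced only for illustration; `os.cpu_count()` is the standard-library call that reports the number of logical CPUs the operating system sees.

```python
import os


def logical_cores(physical_cores: int, smt_ways: int) -> int:
    """Hypothetical helper: logical cores = physical cores x hardware threads per core."""
    if physical_cores < 1 or smt_ways < 1:
        raise ValueError("physical_cores and smt_ways must be positive")
    return physical_cores * smt_ways


# A single core with four-way SMT is counted as four logical cores.
print(logical_cores(physical_cores=1, smt_ways=4))  # 4

# What the operating system reports on this machine (value varies; may be None).
print(os.cpu_count())
```

On a machine with SMT enabled, the value reported by the operating system is typically larger than the number of physical cores, which is why the curve in the figure counts logical rather than physical cores.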
With this bag of tricks, we have sustained, so far, a steady increase in the number of transistors, as the top curve of the figure shows. There is one interesting curve remaining in the figure: single-thread performance (i.e., the performance of sequential programs). Single-thread performance increased steadily until almost 2010. The major reason is Moore’s law, which allowed computer architects to make use of these transistors to add more features (from pipelining to superscalar to speculative execution, etc.). Another reason is the increase in clock frequency that was maintained until around 2004. There are also some minor factors that make single-thread performance a bit better with multicore. One of them is that a single-thread program has a higher chance of executing on a core by itself without sharing resources with another program. The other is thread migration. If a single-thread program is running on a core and that core becomes too hot, the frequency and voltage will be scaled down, slowing down the program. If the program is running on a multicore, and thread migration is supported, the program may migrate to another core, losing some performance in the migration process but continuing at full speed afterwards.
Figure 1.1 Trend of different aspects of microprocessors. (Karl Rupp. 2018. 42 years of microprocessor trend data. Courtesy of Karl Rupp https://github.com/karlrupp/microprocessor-trend-data; last accessed March 2018)
The story outlined above describes the hardware tricks developed to manage power and temperature near the end of Moore’s law and the end of Dennard scaling. Those tricks gave some relief to the hardware community but created a very difficult problem for software folks.
Now that we have multicore processors all over the place, single thread programs are no longer an option.
The free lunch is over [Sutter 2005]! In the good old days, you could write a sequential program and expect that your program would become faster with every new generation of processors. Now, unless you write parallel code, don’t expect to get that much of a performance boost anymore. Take another look at the single-thread performance in Figure 1.1. We moved from single core to multicore not because the software community was ready for concurrency but because the hardware community could not afford to neglect the power issue. The problem is getting even harder because this multicore or parallel machine is no longer homogeneous. You are no longer writing code for a machine made of similar computing nodes, but for one made of different ones. So now we need heterogeneous parallel programming.
We saw how we moved from single core to multiple homogeneous cores. How did heterogeneity arise? It is again a question of power, as we will see. But before we go deeper into heterogeneity, it is useful to categorize it into two types from a programmer’s perspective.
A machine is only as useful as the programs written for it. So let’s look at heterogeneity from a programmer’s perspective. There is heterogeneity that is beyond a programmer’s control. Surprisingly, this type has been around for several years now, and many programmers don’t know it exists! There is also heterogeneity within a programmer’s control. What is the difference? And how come we have been dealing with heterogeneity without knowing it?
1.2 Heterogeneity Beyond Our Control
Multicore processors have been around now for more than a decade, and a lot of programs were written for them using different parallel programming paradigms and languages. However, almost everybody thinks they are writing programs for a homogeneous machine, unless of course there is an explicit accelerator like a GPU or FPGA involved. In this section we show that we have not been programming a purely homogeneous machine even if we thought so!
1.2.1 Process Technology
Everybody, software programmers included, knows that we are using CMOS electronics in our design of digital circuits, and to put them on integrated circuits we use process technology that is based on silicon. This has been the norm for decades. This is true. But even in process technology, there is heterogeneity.
Instead of bulk silicon, some semiconductor manufacturing uses a silicon-insulator-silicon structure. The main reason for using silicon on insulator (SOI) is to reduce parasitic device capacitance, which causes the circuit elements to behave in nonideal ways. Reducing this capacitance results in performance enhancement.
Instead of traditional planar CMOS transistors, many manufacturers use what is called a fin field-effect transistor (FinFET). Without going into a lot of electronics details, a transistor, which is the main building block of processors, is composed of a gate, a drain, and a source. Depending on the voltage at the gate, the current flows from source to drain or is cut off. Switching speed (i.e., how quickly the transistor goes from on to off) affects the overall performance. FinFET transistors are found to switch much faster than traditional planar transistors. An example of a FinFET transistor is Intel’s tri-gate transistor, which was used in 2012 in the Ivy Bridge CPU.
Those small details are usually not known, or not well known, to the software community, making it harder to reason about the expected performance of a chip, or, even worse, of several chips in a multisocket system (i.e., several processors sharing the memory).
1.2.2 Voltage and Frequency
The dynamic power consumed and dissipated by the digital circuits of all our processors is defined by this equation: $P = C \times V_{cc}^{2} \times F \times N$, where $C$ is the capacitance, $V_{cc}$ is the supply voltage, $F$ is the frequency, and $N$ is the number of bits switching. As we can see, dynamic power is quadratic in the supply voltage and linear in the frequency, so scaling both together gives a roughly cubic relationship. Reducing the frequency and reducing the supply voltage (up to a limit to avoid switching error) greatly reduces the