Heterogeneous Computing. Mohamed Zahran

Heterogeneous Computing - Mohamed Zahran ACM Books

of Algorithm 2.1. Why do we have several blocks then? There are several reasons. First, threads assigned to different execution units within the same block can exchange data and synchronize with one another. Doing that across all the execution units of the chip would be extremely expensive, as there are hundreds of them in small GPUs and thousands in high-end GPUs, so this distributed design keeps the cost manageable. Second, it gives some flexibility: you can execute different SIMD-friendly applications on different blocks. This is why the figure shows high-level scheduling. Execution units of different blocks can still communicate, albeit slowly, through the memory shared among all the blocks. It is labeled “memory hierarchy” in the figure because some designs have cache levels above the global memory as well as specialized memories like texture memory.
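The two communication levels described above can be modeled in a few lines of plain Python. This is only a sketch of the idea, not GPU code: `block_sum`, `grid_sum`, and `block_size` are hypothetical names introduced for illustration, and the list `global_memory` merely stands in for the memory shared among all blocks. Each block reduces its own chunk using cheap local cooperation, and the blocks exchange only one partial result each through the shared memory.

```python
def block_sum(chunk):
    """Model one block: its threads cooperate locally to reduce their chunk."""
    values = list(chunk)
    # Tree reduction: at each step, "thread" i adds the element `stride` away.
    # On a real GPU, a block-wide synchronization would separate the steps.
    stride = 1
    while stride < len(values):
        for i in range(0, len(values) - stride, 2 * stride):
            values[i] += values[i + stride]
        stride *= 2
    return values[0] if values else 0


def grid_sum(data, block_size):
    """Model the whole chip: blocks meet only through the shared memory."""
    global_memory = []  # stand-in for the memory shared among all blocks
    for start in range(0, len(data), block_size):
        # Each block writes a single partial result to the shared memory.
        global_memory.append(block_sum(data[start:start + block_size]))
    # One final pass combines the per-block partial results.
    return sum(global_memory)


if __name__ == "__main__":
    print(grid_sum(list(range(10)), block_size=4))
```

The point of the design is visible in the data movement: with N elements and blocks of size B, only about N/B values cross the slow path between blocks, while all other exchanges stay inside a block.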

      The confusing thing about GPUs is that each vendor has its own naming convention. In NVIDIA parlance, the blocks are called streaming multiprocessors (SMs, or SMX in some later versions) and the execution units are called streaming processors (SPs) or CUDA cores. In AMD parlance, the blocks are called shader engines and the execution units are called compute units. In Intel parlance, the blocks are called slices (or sub-slices) and the execution units are called exactly that: execution units. The designs differ in small ways, but the main idea is almost the same.

      GPUs can be discrete, that is, stand-alone chips connected to the other processors through links like PCIe or NVLink, or they can be embedded with the multicore processor in the same chip. Discrete GPUs are more powerful because they have more real estate, but they pay a communication overhead for moving data back and forth between the GPU’s memory and the system’s memory [Jablin et al. 2011], even when the programmer sees a single virtual address space. Embedded GPUs, like Intel GPUs and many AMD APUs, are tightly coupled with the multicore and do not suffer from this communication overhead. However, they share the chip area with the processor and hence are weaker in terms of performance.
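The trade-off can be made concrete with a rough back-of-the-envelope model. The sketch below assumes the simplest possible cost model (transfer time equals bytes divided by link bandwidth, with no overlap between transfers and computation), and every number in the usage part is hypothetical rather than a measurement of any real GPU or link.

```python
def transfer_seconds(num_bytes, link_bytes_per_second):
    """Time to move data over the link, assuming a simple bandwidth model."""
    return num_bytes / link_bytes_per_second


def discrete_total_seconds(kernel_seconds, bytes_in, bytes_out,
                           link_bytes_per_second):
    """Copy inputs to the GPU, compute, then copy the results back."""
    return (transfer_seconds(bytes_in, link_bytes_per_second)
            + kernel_seconds
            + transfer_seconds(bytes_out, link_bytes_per_second))


if __name__ == "__main__":
    # Hypothetical workload: 4 GB in, 4 GB out, over a 16 GB/s link.
    discrete = discrete_total_seconds(kernel_seconds=0.1,
                                      bytes_in=4e9, bytes_out=4e9,
                                      link_bytes_per_second=16e9)
    # Hypothetical embedded GPU: four times slower kernel, no copies.
    embedded = 0.4
    print(f"discrete: {discrete:.2f} s, embedded: {embedded:.2f} s")
```

Under these made-up numbers the weaker embedded GPU finishes first, because the discrete GPU spends most of its time on transfers. With more computation per byte moved, the balance tips the other way.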

      If you have a discrete GPU in your system, there is a high chance you also have an embedded GPU in your multicore chip. You can then make use of a multicore processor, an embedded GPU, and a discrete GPU together, which is a nice exercise in heterogeneous programming!

      Let’s look at an example of a recent GPU: the Volta architecture V100 from NVIDIA [2017]. Figure 2.4 shows the block diagram of the V100. The GigaThread engine at the top of the figure is what we called high-level scheduling in our generic GPU of Figure 2.3. Its main purpose is to schedule blocks to SMs. A block, in NVIDIA parlance, is a group of threads that perform the same operations on different data and are assigned to the same SM, so that they can share data and synchronize more easily. An L2 cache is shared by all the SMs; it is the last-level cache (LLC) before going off-chip to the GPU global memory, which is not shown in the figure. NVIDIA packs several SMs together in what are called GPU processing clusters (GPCs). In Volta there are six GPCs, each with 14 SMs. You can think of a GPC as a small full-fledged GPU, with its SMs, raster engines, etc. The main players, the ones that actually do the computations, are the SMs.
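The numbers above, and the job of the high-level scheduler, can be sketched as follows. The GPC and SM counts come from the text; the round-robin placement in `assign_blocks` is purely an assumption made for illustration, since the text does not describe the actual policy used by the hardware, and the function names are hypothetical.

```python
GPCS = 6          # GPU processing clusters in Volta, as stated in the text
SMS_PER_GPC = 14  # SMs in each GPC, as stated in the text


def total_sms(gpcs=GPCS, sms_per_gpc=SMS_PER_GPC):
    """Number of SMs in the full block diagram."""
    return gpcs * sms_per_gpc


def assign_blocks(num_blocks, num_sms):
    """Place thread blocks on SMs round-robin (a hypothetical policy)."""
    placement = {sm: [] for sm in range(num_sms)}
    for block in range(num_blocks):
        # Every thread of a block lands on the same SM, which is what
        # lets the threads of that block share data and synchronize.
        placement[block % num_sms].append(block)
    return placement


if __name__ == "__main__":
    sms = total_sms()
    placement = assign_blocks(num_blocks=100, num_sms=sms)
    busiest = max(len(blocks) for blocks in placement.values())
    print(f"{sms} SMs, at most {busiest} blocks per SM")
```

With 100 blocks and 6 × 14 = 84 SMs, the first 16 SMs each receive two blocks and the remaining 68 receive one each.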

      Figure 2.5 shows the internal configuration of a single SM. Each SM is equipped with an L1 data cache and a shared memory. The main difference between the two is that the cache is totally transparent to the programmer, whereas the shared memory is controlled by the programmer and can be used to share data among the block of threads assigned to that SM. This is why a block of threads assigned to the same SM can share data faster. There is also an L1 instruction cache, plus an L0 instruction cache for each warp scheduler. Warp? What is a warp?
