With this in mind, let us construct two high-end cases: the first uses the CA extension alone, and the second combines CA with multiple MIMO data layers. Calculating these, we obtain 10.52 Gb/s and 84.16 Gb/s, respectively. Such rates could serve, for example, large file transfers. The gap between the low-end corner and the two high-end throughput corners is approximately 5 × 10⁴× and 4 × 10⁵×, respectively. The system therefore has to handle vastly varying data processing loads during operation, which highlights the need for a flexible compute engine.
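As a quick sanity check on these ratios, the minimal sketch below reproduces the arithmetic, taking the low-end LTE legacy rate as 0.2 Mb/s (as listed in Table 1.1); the unit conversion is the only assumption made here.

```python
# Throughput spread between the corner cases (values from the text and Table 1.1).
low_end   = 0.2e-3   # Gb/s (0.2 Mb/s, LTE legacy 1.4 MHz)
ca_high   = 10.52    # Gb/s (4xCA, mu = 3, BW = 400 MHz)
mimo_high = 84.16    # Gb/s (8x8 MIMO on top of 4xCA)

print(ca_high / low_end)    # ~5.3e4  ->  "approximately 5 x 10^4"
print(mimo_high / low_end)  # ~4.2e5  ->  "approximately 4 x 10^5"
```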
1.2.2.3. Specification summary
We can now combine the latency and data throughput requirements into low- and high-end corners on a 2D plane. The low-end use case is clear: LTE legacy 1.4 MHz. For the high-end, however, there are two options: either the highest data rate (μ = 3, BW = 400 MHz) or the shortest deadline (μ = 4). We choose the former as the combined high-end corner because the number of operations per data point scales nonlinearly in the algorithms. For example, computing the single 4096-point inverse discrete Fourier transform (IDFT) used for one OFDM symbol at μ = 3, BW = 400 MHz takes more operations than computing the 2 × 2048-point IDFTs for the two symbols that fit in the same duration at μ = 4, BW = 400 MHz; that is, over the same time span, μ = 3, BW = 400 MHz requires more operations. The corner cases are presented in Table 1.1, with throughput also expressed in other units for a more reader-friendly overview.
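To make the nonlinear-scaling argument concrete, the sketch below compares the two options under a textbook radix-2 cost model of (N/2)·log₂N butterflies per N-point IDFT; this cost model is an illustrative assumption, not a cycle-accurate figure from the chapter.

```python
import math

def radix2_butterflies(n_points: int) -> int:
    """Rough radix-2 IDFT cost model: (N/2) * log2(N) butterfly operations."""
    return (n_points // 2) * int(math.log2(n_points))

ops_mu3 = radix2_butterflies(4096)       # one 4096-point IDFT per OFDM symbol (mu = 3)
ops_mu4 = 2 * radix2_butterflies(2048)   # two 2048-point IDFTs over the same duration (mu = 4)
print(ops_mu3, ops_mu4)                  # 24576 vs. 22528 -> mu = 3 needs more operations
```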
Table 1.1. Processing requirement corners as per standard specification
| Use Case | Throughput | | | | [Gb/s] | TTI [μs] |
| --- | --- | --- | --- | --- | --- | --- |
| Low-end LTE legacy (3GPP 2019a, b) | 72 | 6 | 6 | 6 | 0.2 m | 1,000 |
| CA high-end FR2 (3GPP 2019d, f), 4×CA, μ = 3, 400 MHz | 3,168 | 264 | 1,056 | 8,448 | 10.52 | 125 |
| MIMO CA high-end FR2 (3GPP 2019d, e, f), 8×8 MIMO, 4×CA, μ = 3, 400 MHz | 3,168 | 264 | 8,448 | 67,584 | 84.16 | 125 |
1.2.3. Outcome of workloads
We see that the 3GPP specifications follow the trend and vision of 5G laid out in section 1.2.1, incorporating the variability of workloads as the central paradigm.
With throughput requirements varying by several orders of magnitude, a homogeneous HW solution would be very inefficient for either the high-end or the low-end use cases. Rather, a heterogeneous HW architecture is called for: a mixture of HW accelerator engines, banks of programmable processing elements and supporting memory systems. Accelerator engines, such as dedicated (application-specific) HW accelerators and ASIPs, are ideal for the extreme high-end use cases and for easy-to-scale, low-variability algorithms or processing steps, thanks to their speed and low energy consumption per data point. Banks of programmable processing elements, such as vDSPs (SIMD cores with a signal processing-oriented instruction set architecture) and generic scalar reduced instruction set computer (RISC) cores, are ideal for moderate-to-high through low-end use cases and for processing steps that require flexibility, for example, choosing which algorithm from a set to execute based on the device’s situational parameters and environmental conditions. Such HW is well suited to highly variable loads because HW modules can be powered on and off according to the current load. For example, if enough compute resources are available on the vDSPs, i.e. idle cycles, we could run the communication kernels on the vDSPs in a time-multiplexed manner and keep the HW accelerators off, as sketched below.
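The following sketch only illustrates that load-dependent dispatch decision; the cycle budget, threshold and engine names are invented for illustration and are not figures from the chapter.

```python
# Minimal sketch of load-dependent dispatch: run a kernel on the vDSP bank when
# idle cycles suffice, otherwise wake the dedicated HW accelerator.
VDSP_CYCLES_PER_TTI = 100_000          # assumed vDSP cycle budget per TTI (illustrative)

def choose_engine(kernel_cycles: int, cycles_in_use: int) -> str:
    """Decide which engine should run a communication kernel in this TTI."""
    idle_cycles = VDSP_CYCLES_PER_TTI - cycles_in_use
    if kernel_cycles <= idle_cycles:
        return "vDSP (time-multiplexed, accelerator stays power-gated)"
    return "HW accelerator (vDSP lacks idle cycles)"

# Example: a low-end kernel fits on the vDSPs, a high-end kernel does not.
print(choose_engine(kernel_cycles=5_000, cycles_in_use=20_000))
print(choose_engine(kernel_cycles=95_000, cycles_in_use=20_000))
```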
Figure 1.5. Tiled “Kachel” MPSoC with decentralized tightly coupled memories
When it comes to MPSoC architecture in the sense of module arrangement and layout, we have recently published works with both decentralized (e.g. Figure 1.5, (Fettweis et al. 2019)) and centralized (e.g. Figure 1.6, (Damjancevic et al. 2019)) memory in mind. Although this chapter closely follows the latter work, the programmable vector processor is a common processing element in both, and the lessons learned are universal13.
Figure 1.6. Heterogeneous MPSoC with a central shared memory architecture
Without discussing the layouts further, as both have their advantages, let us focus on the common thread: analyzing the combined effect of workloads and algorithms on HW provisioning requirements, to possibly confirm our hypothesis that a heterogeneous MPSoC is required for an efficient, future-proof solution.
1.3. GFDM algorithm breakdown
Knowledge of algorithm kernel14 requirements with respect to workload is key to determining whether or not the vDSP has enough available compute resources to execute the kernel. This section covers the GFDM algorithm and the pseudo-code extrapolated from its processing graph.
GFDM processing in the literature is divided into three categories (a direct reference implementation of the modulation equation is sketched after this list):
1) frequency-domain processing (Michailow et al. 2014);
2) time-domain processing (Farhang et al. 2016; Matthé et al. 2016);
3) Radix-2 cascade processing (Nimr et al. 2018).
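As a concrete reference point for the discussion that follows, the sketch below implements the GFDM modulation equation directly as a double sum over subcarriers and subsymbols. This naive superposition is not one of the optimized frequency-domain, time-domain or radix-2 schemes from the cited works, and the K, M values and rectangular prototype filter in the example are illustrative assumptions only.

```python
import numpy as np

def gfdm_modulate(d, g):
    """Direct GFDM modulation: x[n] = sum_{k,m} d[k,m] * g[(n - mK) mod N] * e^{j2πkn/K}.

    d : (K, M) complex data symbols (K subcarriers, M subsymbols)
    g : (N,) prototype filter with N = K * M samples
    """
    K, M = d.shape
    N = K * M
    assert g.shape == (N,), "prototype filter must have K*M samples"
    n = np.arange(N)
    x = np.zeros(N, dtype=complex)
    for k in range(K):
        subcarrier = np.exp(2j * np.pi * k * n / K)       # k-th subcarrier tone
        for m in range(M):
            # circularly shifted prototype filter for subsymbol m
            x += d[k, m] * np.roll(g, m * K) * subcarrier
    return x

# Example: K = 4 subcarriers, M = 3 subsymbols, rectangular prototype for illustration
K, M = 4, 3
rng = np.random.default_rng(0)
d = (rng.standard_normal((K, M)) + 1j * rng.standard_normal((K, M))) / np.sqrt(2)
g = np.ones(K * M) / np.sqrt(K * M)
x = gfdm_modulate(d, g)
print(x.shape)   # (12,)
```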