Vitruvius+: An Area-Efficient RISC-V Decoupled Vector Coprocessor for High Performance Computing Applications
dl.acm.org, Mar. 01, 2023
The maturity level of RISC-V and the availability of domain-specific instruction set extensions, like vector processing, make RISC-V a good candidate for supporting the integration of specialized hardware in processor cores for the High Performance Computing (HPC) application domain. In this article, we present Vitruvius+, the vector processing acceleration engine that represents the core of vector instruction execution in the HPC challenge that comes within the EuroHPC initiative. It implements the RISC-V vector extension (RVV) 0.7.1 and can be easily connected to a scalar core using the Open Vector Interface standard. Vitruvius+ natively supports long vectors: 256 double precision floating-point elements in a single vector register. It is composed of a set of identical vector pipelines (lanes), each containing a slice of the Vector Register File and functional units (one integer, one floating point). The vector instruction execution scheme is hybrid in-order/out-of-order and is supported by register renaming and arithmetic/memory instruction decoupling. On a stand-alone synthesis, Vitruvius+ reaches a maximum frequency of 1.4 GHz in typical conditions (TT/0.80V/25°C) using GlobalFoundries 22FDX FD-SOI. The silicon implementation has a total area of 1.3 mm² and maximum estimated power of ∼920 mW for one instance of Vitruvius+ equipped with eight vector lanes.
1 INTRODUCTION
The Covid-19 pandemic underscored the importance of scientific research. The large amount of computation needed to characterize the SARS-CoV-2 virus' genome [33] proves that there is a tangible need for investing in High Performance Computing (HPC) technologies to fit the computation requirements of the "race to Exascale" [18]. Generally speaking, Exascale computing refers to the capability of a machine to execute at least 10^18 operations per second [16]. Among the initiatives committed to these objectives [14, 16, 17, 22], the European Processor Initiative (EPI) aims to create a sustainable hardware/software ecosystem that could mark Europe's independence in computing systems [15]. Nonetheless, the challenge of building Exascale machines within a 20-MW power envelope has shifted the focus away from peak performance and toward energy-efficient performance. For instance, the 59th edition of the TOP500 list [44] revealed the Frontier system at the Oak Ridge National Laboratory (ORNL) to be the first true Exascale machine, yet it ranks second on the Green500 list [19]. This shows that energy efficiency is becoming a top priority for HPC facilities [1, 20, 26]. The renewed interest in vector architectures, which efficiently exploit Data-Level Parallelism (DLP), fits the requirements of the Exascale challenges well.
Historically, vector processing has always been associated with supercomputing. The golden era of vector processors started with the introduction of the CRAY-1 [35] in 1976, which broke with the memory-to-memory philosophy of preceding machines like TI-ASC [45] and STAR-100 [7], instead introducing a Vector Register File (VRF) and interconnect to allow data movement between the functional units and the vector registers [13]. Vector machines dominated the supercomputing market for about 15 years, until they were displaced by parallel machines based on multiple out-of-order microprocessors, as the advances in CMOS VLSI technology allowed more transistors to fit on a die. Although multicore architectures represent a valid approach to data-parallel problems, they still have efficiency issues due to their high instruction fetch and decode overheads. The renaissance of vector processing is a direct consequence of the slowdown of Moore's law and the limitations on energy efficiency imposed by the physics of CMOS circuit scaling [9, 11].
Vector processors operate on arrays of data, where a single datum of the array is referred to as a vector element [10]. A dedicated Instruction Set Architecture (ISA) defines the vector architectural parameters, such as the number of vector registers and the Maximum Vector Length (MVL). Particular features like reductions use common arithmetic operations to reduce a vector register to a scalar value. Vector processors are also characterized by unique memory operations: strided loads and stores, where the stride defines the increment, expressed in bytes, between the memory locations at which consecutive vector elements begin, and gather-scatter operations, which locate vector elements by accessing memory through a set of indices held in the elements of another vector. When compared to Single Instruction Multiple Data (SIMD) architectures, vector processors offer a higher level of abstraction. SIMD architectures, like the ARM Neon [32] or the Intel AVX-512 [8], pack multiple elements in the same register, where they are processed by the available functional units. To exploit DLP and produce effective code, the software needs to know how many functional units, also called SIMD lanes, are available. Additionally, the maximum number of elements that can be processed in parallel is limited by the size of the registers. Any attempt to increase the size of the registers and/or the number of functional units implies the introduction of new dedicated instructions, reducing the portability of the ISA. Ottavi et al. [29] solve this limitation by encapsulating the number of elements to process in the instruction encoding and controlling it through a Control and Status Register (CSR). Although this solution is feasible for specific Machine Learning (ML) workloads, the maximum number of elements within one operation is still limited by the size of the scalar registers.
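The strided load, gather, and reduction operations described in this paragraph can be illustrated with a small software model. This is only a sketch of the semantics, not of any particular ISA encoding or of the Vitruvius+ hardware: the function names (`strided_load`, `gather`, `reduce_sum`) are hypothetical, elements are assumed to be little-endian double-precision values, and gather indices are assumed to be byte offsets from a base address.

```python
import struct

ELEM_FMT = "<d"  # assumption: 64-bit little-endian double-precision elements


def strided_load(mem: bytes, base: int, stride: int, vl: int) -> list:
    """Load vl elements; element i begins at byte address base + i * stride."""
    return [struct.unpack_from(ELEM_FMT, mem, base + i * stride)[0]
            for i in range(vl)]


def gather(mem: bytes, base: int, indices: list) -> list:
    """Indexed load; each element of the index vector is a byte offset from base."""
    return [struct.unpack_from(ELEM_FMT, mem, base + off)[0] for off in indices]


def reduce_sum(vec: list) -> float:
    """Reduce a vector to a single scalar using ordinary addition."""
    acc = 0.0
    for x in vec:
        acc += x
    return acc


# Memory image holding the doubles 0.0, 1.0, ..., 7.0 back to back (8 bytes each).
mem = struct.pack("<8d", *[float(i) for i in range(8)])

every_other = strided_load(mem, base=0, stride=16, vl=4)  # 16-byte stride skips one element
picked = gather(mem, base=0, indices=[56, 0, 24])         # byte offsets of elements 7, 0, 3
total = reduce_sum(every_other)
```

With a stride of 16 bytes and 8-byte elements, the strided load picks every second element, while the gather visits memory in whatever order the index vector dictates.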
If the size of the scalar registers increases, new combinations of mixed-width operations are possible, and the ISA needs to be modified at least to specify the new setting of the CSR that holds the SIMD width. In contrast, vector ISAs are agnostic of the number of available functional units, and the number of elements to be processed is only limited by the defined MVL. Advances in ISA offer vector architectures the opportunity to expand beyond HPC to other market segments such as Digital Signal Processing (DSP) and multimedia applications. Examples are the vector extensions for NEC [27], the ARM's Scalable Vector Extension (SVE) [42], and the RISC-V vector extension (RVV) [34]. The latter is currently gaining importance both in the academic and the industrial world [24]. RVV declares two implementation-specific parameters [34]: the maximum size in bits of a vector element (ELEN), with ELEN ≥ 8, and the number of bits in a single vector register (VLEN). Additionally, it includes CSRs that can be modified through specific instructions to change the operational vector length, vl, the Selected Element Width (SEW), and the vector register group multiplier (LMUL), which defines the number of vector registers that are combined to form a wider vector register group.
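As a rough illustration of how VLEN, SEW, LMUL, and vl interact, the sketch below models the maximum number of elements in a register group and a strip-mined loop that requests a new vl on every iteration. It is a simplified software model under stated assumptions: the helper names (`vlmax`, `set_vl`, `strip_mine`, `axpy`) are hypothetical, LMUL is taken to be a positive integer, and `set_vl` simply returns the smaller of the requested length and the maximum, which is one simple policy rather than a full description of the RVV rules. The VLEN value is derived from the 256 double-precision elements per vector register stated for Vitruvius+.

```python
VLEN = 256 * 64  # bits per vector register: 256 elements of 64 bits each


def vlmax(vlen: int, sew: int, lmul: int = 1) -> int:
    """Maximum number of SEW-bit elements in a group of lmul vector registers."""
    return (vlen // sew) * lmul


def set_vl(avl: int, vlen: int, sew: int, lmul: int = 1) -> int:
    """Simplified model: grant the requested length avl, capped at vlmax."""
    return min(avl, vlmax(vlen, sew, lmul))


def strip_mine(n: int, vlen: int = VLEN, sew: int = 64, lmul: int = 1) -> list:
    """Return the vl chosen on each iteration when processing n elements."""
    chunks = []
    remaining = n
    while remaining > 0:
        vl = set_vl(remaining, vlen, sew, lmul)
        chunks.append(vl)
        remaining -= vl
    return chunks


def axpy(a: float, x: list, y: list, vlen: int = VLEN, sew: int = 64) -> list:
    """Compute a * x + y chunk by chunk, one 'vector instruction' per chunk."""
    out = list(y)
    start = 0
    for vl in strip_mine(len(x), vlen, sew):
        for j in range(start, start + vl):  # models one vector operation over vl elements
            out[j] = a * x[j] + y[j]
        start += vl
    return out


chunks = strip_mine(600)
result = axpy(2.0, [1.0] * 600, [3.0] * 600)
```

The same loop runs unchanged if `vlen` is altered, which is the portability property the text attributes to vector ISAs: the code never hard-codes how many elements the hardware handles per instruction.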