# Architecture Overview
MIAOW implements a subset of the Southern Islands (SI) ISA released by AMD. Included with the project are instruction traces and memory values for benchmarks provided by the AMD APP SDK, all of which are supported on MIAOW. What follows is an overview of MIAOW's design, based largely on AMD's public documentation of its Graphics Core Next (GCN) micro-architecture, as well as a discussion of some of the micro-architectural decisions made during the design and implementation.
The following terminology is used throughout the documentation; we feel it is only courteous to introduce it here to avoid confusion:
- Compute Unit (CU): The module that actually performs the computation.
- Dispatcher: The module that schedules workloads on CUs. The Vertical Research Group has produced several variants over the course of development, ranging from pure software to pure hardware. The dispatcher is intended to receive workloads from the main processor, much as commercial GPUs receive kernels and shaders in PCs.
- Wavefront: Known as a 'warp' in NVidia documentation. The group of 64 threads scheduled to run together on a compute unit; this is the unit of execution that the dispatcher schedules on CUs.
- Workitem: Known as a 'thread' in NVidia documentation.
- Workgroup: Known as a 'thread-block' in NVidia documentation. A collection of workitems that can run together, possibly across different wavefronts but all mapped to a single CU (a worked example follows this list).
- Local Data Store: Known as 'shared memory' in NVidia documentation. Used for communication and coordination between workitems in a workgroup.
- Global Data Store: Known as 'global memory' in NVidia documentation. Used for sharing data between multiple workgroups.
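
As a concrete illustration of how these terms relate, the sketch below (sizes and the module name are ours, purely for illustration) shows the arithmetic by which a 256-workitem workgroup rounds up to four 64-thread wavefronts, all resident on a single CU:

```verilog
// Worked example of the workgroup-to-wavefront mapping; runs in any
// Verilog simulator. WG_SIZE is an arbitrary example, not a MIAOW limit.
module wf_mapping_example;
  localparam WF_SIZE = 64;   // workitems per wavefront
  localparam WG_SIZE = 256;  // example workgroup size
  // Round up: a 256-workitem workgroup needs 4 wavefronts.
  localparam NUM_WF  = (WG_SIZE + WF_SIZE - 1) / WF_SIZE;
  initial $display("workgroup of %0d workitems -> %0d wavefronts",
                   WG_SIZE, NUM_WF);
endmodule
```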
A complete MIAOW instantiation is intended to be composed of all the components needed to create a highly parallel compute accelerator. These include not only the CUs and the dispatchers that control them but also the memory controller that mediates between the CUs and device memory. There are also dedicated L1 caches for scalar data and instructions, and a unified L2 cache.
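
The sketch below is a purely structural picture of that composition; every module and instance name is hypothetical and does not reflect the actual MIAOW source hierarchy:

```verilog
// Empty stubs standing in for the real components.
module cu_stub(); endmodule
module dispatcher_stub(); endmodule
module l1_scalar_cache_stub(); endmodule
module l1_instr_cache_stub(); endmodule
module l2_cache_stub(); endmodule
module mem_ctrl_stub(); endmodule

// How the pieces relate (one CU shown; a real instantiation has several).
module miaow_system_sketch;
  dispatcher_stub      u_dispatcher ();  // schedules wavefronts onto the CUs
  cu_stub              u_cu0 ();         // performs the actual computation
  l1_scalar_cache_stub u_l1_scalar ();   // dedicated L1 for scalar data
  l1_instr_cache_stub  u_l1_instr ();    // dedicated L1 for instructions
  l2_cache_stub        u_l2 ();          // unified L2 behind both L1s
  mem_ctrl_stub        u_memctrl ();     // mediates access to device memory
endmodule
```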
From a high-level perspective MIAOW is a fairly faithful implementation of the GCN architecture's compute unit, as shown in the two diagrams below. Note that the primary differences are in memory organization, which is highly specific to the manufacturing process anyway, and in graphics-related functionality that was not part of the original design objectives for MIAOW but could be added.
An overview of MIAOW's compute unit modules.
An overview of Kaveri's compute unit module, adapted from AMD's presentation at Hot Chips 26.
The CU is composed of the modules needed to perform both scalar and vector arithmetic operations. A fetch unit acts as the interface between the CU and the dispatcher, receiving the information needed to execute a wavefront when one is scheduled to the CU. A wavepool acts as a queue for all instructions that have been fetched; it is capable of tracking and supporting up to 40 different wavefronts at a time. The decode unit handles instruction decoding, including the collation of 64-bit instructions. It also determines which ALU will carry out each instruction and performs address translation for the registers. The issue unit tracks all in-flight instructions and the resources they are using, resolving dependencies between them, and ensures that all the resources an instruction needs are available before allowing it to execute.
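
A much-simplified sketch of the decode unit's collation step is shown below. The signal names and the encoding test are placeholders of ours, not the actual SI opcode check:

```verilog
// Collate two 32-bit fetch words into one 64-bit instruction when the
// first word marks itself as the start of a 64-bit encoding.
module decode_collate (
  input             clk,
  input             rst,
  input      [31:0] fetch_word,    // one 32-bit word per cycle from fetch
  input             fetch_valid,
  output reg [63:0] instr,
  output reg        instr_valid,
  output reg        instr_is_64b
);
  reg        have_first;           // holding the first half of a 64-bit op
  reg [31:0] first_word;
  // Placeholder test; the real decoder inspects specific SI opcode fields.
  wire is_64b = (fetch_word[31:26] == 6'b111111);

  always @(posedge clk) begin
    if (rst) begin
      have_first  <= 1'b0;
      instr_valid <= 1'b0;
    end else begin
      instr_valid <= 1'b0;
      if (fetch_valid) begin
        if (have_first) begin
          instr        <= {first_word, fetch_word}; // second half arrived
          instr_is_64b <= 1'b1;
          instr_valid  <= 1'b1;
          have_first   <= 1'b0;
        end else if (is_64b) begin
          first_word   <= fetch_word;               // wait for second half
          have_first   <= 1'b1;
        end else begin
          instr        <= {32'b0, fetch_word};      // plain 32-bit op
          instr_is_64b <= 1'b0;
          instr_valid  <= 1'b1;
        end
      end
    end
  end
endmodule
```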
The CU supports both vector and scalar operations. In a full instantiation of the CU, the vector ALUs are organized into banks of sixteen, with four banks each for integer and floating-point operations. This provides 64 ALUs of each type, enough to support simultaneous execution of all 64 threads in a wavefront assuming no conflicts over other resources.
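
The banked organization can be pictured as below: one 16-lane bank built with a generate loop, of which a full CU instantiates four for integer and four for floating-point work. The per-lane datapath here is a stand-in, and all names are illustrative:

```verilog
// Stand-in for the real per-lane datapath, which implements the SI vector
// instruction set rather than a single add.
module valu_lane (
  input  [31:0] a,
  input  [31:0] b,
  output [31:0] y
);
  assign y = a + b;
endmodule

// One bank of sixteen lanes operating in lockstep.
module valu_bank #(
  parameter LANES = 16
) (
  input  [LANES*32-1:0] a_bus,
  input  [LANES*32-1:0] b_bus,
  output [LANES*32-1:0] y_bus
);
  genvar i;
  generate
    for (i = 0; i < LANES; i = i + 1) begin : lane
      valu_lane u_lane (
        .a(a_bus[32*i +: 32]),
        .b(b_bus[32*i +: 32]),
        .y(y_bus[32*i +: 32])
      );
    end
  endgenerate
endmodule
```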
Memory resources in the CU are organized into registers, a local scratch pad, and an interface to device memory. Registers are separated into dedicated register files for scalar and vector operations with the ports of the vector register file appropriately widened to support simultaneous access from an entire bank of vector ALUs.
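
The widened-port idea can be sketched as follows: a single read address returns the selected register for every lane of a bank at once (16 lanes x 32 bits = 512 bits per port). Sizes and names are our assumptions, not MIAOW's actual port widths:

```verilog
// One read port of a vector register file, widened across all lanes.
module vgpr_read_sketch #(
  parameter LANES = 16,
  parameter REGS  = 256
) (
  input                         clk,
  input      [$clog2(REGS)-1:0] raddr,  // one architectural VGPR index
  output reg [LANES*32-1:0]     rdata   // that register, for all lanes
);
  // A 32-bit register per lane per architectural register.
  reg [31:0] vgpr [0:LANES-1][0:REGS-1];
  integer l;
  always @(posedge clk)
    for (l = 0; l < LANES; l = l + 1)
      rdata[32*l +: 32] <= vgpr[l][raddr];
endmodule
```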
Several dispatchers have been developed for testing MIAOW. The original one was a combination of Verilog and C code that acted as the testbench for running the unit tests and larger benchmarks. Another software dispatcher was implemented as an embedded C program for the MicroBlaze processor on the Virtex-7 FPGA.
A hardware dispatcher was also implemented for MIAOW. It keeps track of allocated resources as well as workgroups that the CPU has requested it run. When a CU with sufficient resources becomes available, the dispatcher divides the workgroup into wavefronts and hands each wavefront to the CU.
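
A hypothetical version of that admission check is sketched below; the particular resources tracked and all signal names are our assumptions rather than the actual dispatcher interface:

```verilog
// A workgroup fits on a CU only if every resource it needs is free.
module dispatch_fits_sketch (
  input  [5:0]  free_wf_slots,   // out of the 40 wavefronts a CU tracks
  input  [9:0]  free_vgprs,
  input  [9:0]  free_sgprs,
  input  [15:0] free_lds_bytes,
  input  [5:0]  wg_wavefronts,   // e.g. 4 for a 256-workitem workgroup
  input  [9:0]  wg_vgprs,
  input  [9:0]  wg_sgprs,
  input  [15:0] wg_lds_bytes,
  output        fits
);
  assign fits = (free_wf_slots  >= wg_wavefronts) &&
                (free_vgprs     >= wg_vgprs)      &&
                (free_sgprs     >= wg_sgprs)      &&
                (free_lds_bytes >= wg_lds_bytes);
endmodule
```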
One question many people ask is whether MIAOW is a realistic implementation of a GPU. Anecdotal statements from industry engineers we have spoken with indicate that they do not see anything severely out of place, and that any differences they do notice they attribute to differing priorities and goals rather than problems with the design itself. The most contentious feedback from non-industry sources has fixated on the instruction fetch width and issue rate, primarily because MIAOW's fetch bandwidth is narrower than that specified by GCN.
## Fetch bandwidth

MIAOW's fetch unit currently fetches a single instruction at a time, whereas the GCN documentation specifies a fetch of 16 or 32 instructions, which we speculate is due to cache-line alignment. The bandwidth of the fetch unit is highly dependent on the memory used to store instructions and is therefore subject to change. For this reason we focused on making the fetch unit easily modifiable instead of trying to settle on a fixed bandwidth.
## Wavepool slots

Based on a back-of-the-envelope analysis of load balance, we decided on 6 wavepool slots. Our design evaluations show that all 6 slots of the wavepool are filled 50% of the time, suggesting that this is a reasonable and balanced choice given our fetch bandwidth. We expect the GCN design to have many more slots to accommodate its wider fetch. The number of queue slots is parameterized and can be easily changed; since this pipeline stage is small, changing the slot count has little impact on area and power.
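
The sketch below shows what the parameterization might look like, with the wavepool reduced to a bare occupancy counter; the real wavepool stores instructions per wavefront, and the names here are illustrative:

```verilog
// Occupancy tracking for a wavepool with a parameterized slot count.
module wavepool_occupancy #(
  parameter SLOTS = 6                       // resize the pool here
) (
  input                            clk,
  input                            rst,
  input                            enq,     // instruction arrives from fetch
  input                            deq,     // instruction leaves for decode
  output                           full,
  output reg [$clog2(SLOTS+1)-1:0] count
);
  assign full = (count == SLOTS);
  always @(posedge clk)
    if (rst)
      count <= 0;
    else case ({enq && !full, deq && (count != 0)})
      2'b10:   count <= count + 1;
      2'b01:   count <= count - 1;
      default: ;                            // both or neither: no change
    endcase
endmodule
```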
## Issue rate

The issue rate of one instruction per cycle was designed to match the fetch bandwidth. Increasing it would require additional read ports on the register files, since stalling on register file reads effectively negates any advantage gained from issuing multiple instructions in a single cycle. The GCN documentation referenced by the team indicates an issue width of five instructions, implying a combination of one scalar and four vector instructions.
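
One way to picture the single-issue constraint is as a fixed-priority pick over per-slot ready bits, as in the sketch below. This is illustrative only; the actual issue unit asserts readiness only after its dependency and resource checks pass:

```verilog
// Grant at most one slot per cycle; the lowest-numbered ready slot wins.
module issue_pick #(
  parameter SLOTS = 6
) (
  input      [SLOTS-1:0] ready,
  output reg [SLOTS-1:0] grant
);
  integer i;
  always @* begin
    grant = {SLOTS{1'b0}};
    // Descending loop: the final assignment is the lowest ready index.
    for (i = SLOTS-1; i >= 0; i = i - 1)
      if (ready[i]) grant = 1 << i;
  end
endmodule
```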
## Vector ALUs

By default MIAOW incorporates four integer and four floating-point vector ALUs, with each vector ALU supporting 16 operations simultaneously. This number was chosen based on analysis of various GPGPU benchmarks but is parameterizable.
## Register files

The register file implementation is complicated by its heavy dependence on the selected technology process. Synopsys, for example, provides an SRAM-based register file that is single ported and exhibits low contention. The Synopsys register file is proprietary and cannot be distributed, however, so the team created a five-port register implementation using flip-flops for general VCS-based simulation. A third implementation was created using block RAMs for use with Xilinx FPGAs, which necessitated another set of design compromises. The team therefore made it relatively easy to swap out register file implementations so that the most appropriate one can be used in whatever situation MIAOW is deployed.
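
The flip-flop variant is straightforward to sketch because flip-flop storage can be read combinationally from any number of ports. The version below assumes a four-read/one-write split for the five ports, which is our guess rather than a detail taken from the source:

```verilog
// A flip-flop register file with four read ports and one write port.
module ff_regfile #(
  parameter WIDTH = 32,
  parameter DEPTH = 256,
  parameter AW    = $clog2(DEPTH)
) (
  input              clk,
  input              wen,
  input  [AW-1:0]    waddr,
  input  [WIDTH-1:0] wdata,
  input  [AW-1:0]    raddr0, raddr1, raddr2, raddr3,
  output [WIDTH-1:0] rdata0, rdata1, rdata2, rdata3
);
  reg [WIDTH-1:0] mem [0:DEPTH-1];
  always @(posedge clk)
    if (wen) mem[waddr] <= wdata;
  // Reads are combinational; adding ports costs muxing, not extra storage.
  assign rdata0 = mem[raddr0];
  assign rdata1 = mem[raddr1];
  assign rdata2 = mem[raddr2];
  assign rdata3 = mem[raddr3];
endmodule
```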
MIAOW currently has two major limitations in its implementation, both of which would need to be resolved to put MIAOW into production as an actual graphics card. The first also has an impact on its usage for GPGPU purposes: MIAOW's memory interface is still in a state of flux. During work to bring MIAOW up on an FPGA, it was noted that the existing memory interface was too wide and deep, and that attempting to abstract over it would have incurred significant overhead. The decision was made to rework the compute unit's pipeline, particularly the issue unit, to allow for a more practical memory interface. This work has not yet been completed.
The second limitation concerns MIAOW's usage as a graphics unit. As MIAOW was implemented for experimenting with GPGPU workloads, it is missing some instructions and functionality that are purely graphics-related, such as texture support. It also has nothing with which to actually output graphics. The set of supported instructions would need to be expanded, and the auxiliary logic for graphics output would need to be implemented, though the latter is more an electrical design problem whereas the extra instructions are a matter of digital logic.