Groq has deconstructed the conventional processor and designed a chip in which software takes control of the hardware.
The Groq Tensor Streaming Processor architecture follows a growing trend of software controlling system functions, as already seen in self-driving cars, networks, and other hardware.
The architecture passes hardware control from the chip to the compiler. The chip has software control units integrated at strategic points to optimize data movement and processing.
The units are organized to match the data flow typically found in machine learning models.
“Determinism enables this software-defined hardware approach. It’s not about ignoring the details. Our goal is to control the material underneath,” said Dennis Abts, lead architect at Groq.
Abts shared the architectural design of the Groq Tensor Streaming Processor at this week’s Hot Chips conference. Hardware-software co-design isn’t new, but the concept saw a revival at the conference, with Intel CEO Pat Gelsinger in a keynote referring to the concept as central to the future of chips.
Groq is one of many companies designing chips specifically for AI. AI chips are built to determine outcomes from probabilities and discovered associations among patterns, an approach that also underlies software-based control of the hardware architecture.
“What we’ve done is try to avoid some of this waste and fraud and abuse that happens at the system level,” Abts said.
System-level complexity often grows when tens or even thousands of processing units, such as CPUs, GPUs, and SmartNICs, are combined in heterogeneous computing environments with differing performance, power, and failure profiles.
“As a result, you end up with a lot of variation in performance in terms of response time, latency, variation, for example. And that variation in latency ultimately slows down an internet-scale application,” Abts said.
Groq re-examined the on-chip hardware-software interface to make processing deterministic. The company had to make its design choices from scratch, uprooting conventional chip design.
“It allows… an ISA that strengthens our software stack. We explicitly pass control to the software, especially the compiler, so that it can reason about correctness and program instructions on hardware from a principle-first standpoint,” Abts said.
At the top, the chip has a dynamic static interface, which gives the compiler a complete view of the system at any given time. This replaces the runtime interfaces found on conventional processors.
The dynamic static interface ensures that the hardware is fully controllable by the compiler, without abstracting hardware details. The compiler has a “miraculous view of what the hardware is doing at any given cycle,” Abts said.
Transferring hardware control to software frees the hardware to perform other functions. The architecture differs from traditional systems, which rely on out-of-order execution, speculative execution and other techniques to extract parallelism and memory concurrency, Abts said.
The system has 220 MB of “scratchpad” memory, with “tensors” allocated so that the compiler can determine which calculations run, where they go on the chip and how data moves each cycle. The chip design makes memory concurrency available throughout the entire system.
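The idea of a compiler deciding ahead of time what runs where, and on which cycle, can be illustrated with a toy static scheduler. This is only a minimal sketch, not Groq's compiler: the unit names, the latencies in `LATENCY`, and the `Op` and `static_schedule` names are all hypothetical. The point it shows is that when every latency is fixed, the complete timeline is known before anything executes.

```python
from dataclasses import dataclass

# Hypothetical fixed latencies (in cycles) per functional unit; illustrative only.
LATENCY = {"mem_read": 2, "matmul": 4, "vector": 1, "mem_write": 2}


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # functional unit that runs the op
    deps: tuple = ()   # names of ops that must finish first


def static_schedule(ops):
    """Assign every op a start cycle at compile time.

    Ops must be listed in dependency order. Because each latency is fixed,
    the finish cycle of every op is known before anything runs.
    """
    unit_free = {}   # unit -> first cycle at which it is idle
    finish = {}      # op name -> cycle at which its result is ready
    schedule = []
    for op in ops:
        ready = max((finish[d] for d in op.deps), default=0)
        start = max(ready, unit_free.get(op.unit, 0))
        end = start + LATENCY[op.unit]
        unit_free[op.unit] = end
        finish[op.name] = end
        schedule.append((start, op.unit, op.name))
    return schedule, max(finish.values())


program = [
    Op("load_x", "mem_read"),
    Op("load_w", "mem_read"),
    Op("mm", "matmul", ("load_x", "load_w")),
    Op("relu", "vector", ("mm",)),
    Op("store", "mem_write", ("relu",)),
]

schedule, total_cycles = static_schedule(program)
for start, unit, name in schedule:
    print(f"cycle {start:2d}: {unit:9s} {name}")
print("total cycles:", total_cycles)
```

Running the scheduler twice on the same program yields the same timeline, which is the property the article describes as determinism.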
Groq also disaggregated functional elements typically found in a conventional processor, such as integer and vector units, and regrouped them into separate clusters. This is like having memory or storage in a single enclosure, with proximity providing performance benefits, which is particularly advantageous for AI applications.
The chip’s design is different from conventional CPUs and “it allows us to execute in the same way that conventional CPUs break larger instructions down into micro-operations. Similarly, we break deep learning operations into smaller micro-operations and run them as a whole that together achieve a larger goal,” Abts said.
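The decomposition Abts describes can be sketched with a matrix multiplication split into tile-sized steps. The code below is an illustrative assumption, not Groq's instruction set: `matmul_micro_ops`, `run_micro_ops` and the `tile` parameter are hypothetical names, and plain Python lists stand in for on-chip tensors.

```python
def matmul_micro_ops(m, k, n, tile):
    """List the tile-sized multiply-accumulate steps for an (m x k) @ (k x n) product."""
    ops = []
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                # Output tile at (i, j) accumulates A[i.., p..] @ B[p.., j..]
                ops.append((i, j, p))
    return ops


def run_micro_ops(a, b, tile):
    """Execute the micro-operations one by one and return the full product."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0] * n for _ in range(m)]
    for i0, j0, p0 in matmul_micro_ops(m, k, n, tile):
        for i in range(i0, min(i0 + tile, m)):
            for j in range(j0, min(j0 + tile, n)):
                c[i][j] += sum(
                    a[i][p] * b[p][j] for p in range(p0, min(p0 + tile, k))
                )
    return c


a = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
identity = [[1 if row == col else 0 for col in range(4)] for row in range(4)]

print(len(matmul_micro_ops(4, 4, 4, 2)), "micro-operations")
print(run_micro_ops(a, identity, tile=2) == a)
```

Each tuple in the list is a small, independent unit of work, and the full result emerges only once all of them have run.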
The chip design features matrix multiplication units, which Abts called the “workhorse” of the chip. They contain storage for 409,600 “weights,” which provides the parallelism needed to make AI applications faster.
The chip’s building blocks also include SRAM, programmable vector units, 480 Gb/s networking units, and data switches. These are all connected to 144 on-chip instruction control units, which control the dispatch of tasks to associated functional units.
“This allows us to keep the hardware overhead associated with dispatching very low. Less than 3% of the area is used for decoding and dispatching instructions,” Abts said.
Groq has also taken a software-defined approach to reducing network congestion.
“The compiler can literally program network links as it would program ALU (arithmetic logic unit) or matrix. This alleviates some of the more conventional problems [hardware-based] approaches,” Abts said, referring specifically to adaptive routing.
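A software-scheduled network can be pictured as a reservation table filled in before the program runs. The sketch below is a simplified assumption rather than Groq's actual routing scheme: the link names, the one-cycle hop latency and the `schedule_transfers` function are hypothetical. It contrasts with adaptive routing in that contention is resolved at compile time by delaying a transfer, not at run time by the hardware.

```python
def schedule_transfers(transfers, hop_latency=1):
    """Reserve (link, cycle) slots for each transfer ahead of time.

    transfers: list of (name, path, earliest_cycle); path is a list of link names.
    Returns a dict of name -> start cycle. A transfer is delayed until all of
    its slots are free, so no two transfers use the same link in the same cycle.
    """
    reserved = set()
    starts = {}
    for name, path, earliest in transfers:
        start = earliest
        while True:
            slots = [
                (link, start + hop * hop_latency) for hop, link in enumerate(path)
            ]
            if not any(slot in reserved for slot in slots):
                break
            start += 1  # push the whole transfer back by one cycle and retry
        reserved.update(slots)
        starts[name] = start
    return starts


transfers = [
    ("t0", ["A-B", "B-C"], 0),
    ("t1", ["A-B", "B-D"], 0),  # contends with t0 for link A-B at cycle 0
    ("t2", ["E-B", "B-C"], 0),  # would reach link B-C at cycle 1, same as t0
]
print(schedule_transfers(transfers))
```

Because every slot is fixed in advance, the arrival time of each transfer is known before execution, which is what makes latency predictable in this toy model.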
“What we’re trying to achieve is predictable, repeatable performance that delivers low latency and high throughput across the entire system,” Abts said.