Game-changing breakthrough: language models accelerated by up to 300x

The method centres on "fast feedforward" (FFF) layers, which engage only a handful of a model's neurons per input.
The illustration shows animated language model speaking to human. — Freepik

In a pioneering development, researchers at ETH Zurich have unveiled a technique that promises a substantial speed boost for neural networks. By restructuring the inference process, they cut the computation performed by feedforward layers by more than 99%, opening exciting possibilities for large language models such as GPT-3.

This breakthrough method revolves around the integration of "fast feedforward" layers (FFF), which replace the conventional dense matrix multiplication (DMM) of a feedforward layer with conditional matrix multiplication (CMM).

Unlike the computationally intensive DMM, FFF significantly lightens the computational load by processing each input with only a handful of neurons, leading to faster and more efficient language models.
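To make the idea concrete, here is a minimal sketch of how a fast-feedforward-style layer can serve one input: descend a binary tree of decision neurons, then evaluate a single leaf neuron. The function name, weight layout, and hard-threshold branching rule below are illustrative assumptions for exposition, not the authors' implementation (the paper additionally softens the branching during training so gradients reach all paths).

```python
import numpy as np

def fff_inference(x, node_w, leaf_w_in, leaf_w_out, depth):
    """One input through a fast-feedforward-style tree of `depth` levels.

    Instead of multiplying x against every neuron (dense matrix
    multiplication), we descend a binary tree of decision neurons and
    evaluate a single leaf neuron: `depth` decisions + 1 leaf in total.
    """
    node = 0  # index into the flattened binary tree of decision neurons
    for _ in range(depth):
        # Hard branching at inference time: go right if the node fires.
        go_right = float(x @ node_w[node]) > 0.0
        node = 2 * node + 1 + int(go_right)
    leaf = node - (2 ** depth - 1)  # index among the 2**depth leaves
    # Only this one leaf neuron's input/output weights are ever touched.
    activation = max(0.0, float(x @ leaf_w_in[leaf]))  # ReLU
    return activation * leaf_w_out[leaf]
```

With `depth = 3`, the tree stores 8 leaf neurons, yet any single input touches only 3 decision neurons and 1 leaf; this is the source of the savings over a dense layer, which would evaluate all 8.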

The researchers note, "If trainable, this network could be replaced with a fast feedforward network of maximum depth 15, which would contain 65536 neurons but use only 16 for inference. This amounts to about 0.03% of GPT-3's neurons."
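The arithmetic behind those figures checks out under a binary-tree reading: a tree of depth 15 holds on the order of 2^16 ≈ 65,536 neurons, while a single inference pass descends one root-to-leaf path of 16. The per-layer GPT-3 feedforward width used below (4 × 12,288 = 49,152 neurons) is our assumption for the sanity check:

```python
depth = 15
total_neurons = 2 ** (depth + 1)   # full binary tree: ~65,536 neurons
used_per_pass = depth + 1          # one root-to-leaf path: 16 neurons
gpt3_ffw_width = 4 * 12288         # assumed GPT-3 feedforward layer width
print(round(used_per_pass / gpt3_ffw_width * 100, 2))  # -> 0.03 (%)
```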

The researchers' modified model, UltraFastBERT, showcases the technique's potential. By replacing the intermediate feedforward layers with FFFs, UltraFastBERT retained at least 96% of the original BERT model's downstream performance while engaging a mere 0.3% of its feedforward neurons during inference.

“With a theoretical speedup promise of 341x at the scale of BERT-base models, we hope that our work will inspire an effort to implement primitives for conditional neural execution as a part of device programming interfaces,” the researchers write.

This advancement could revolutionise AI systems by addressing significant computational bottlenecks and ushering in a new era of efficiency and speed in language processing models.