Researchers upend AI status quo by eliminating matrix multiplication in LLMs

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 7 months ago

cygnus@lemmy.ca · 7 months ago

Finally some good “AI” news. Those things aren’t going away, so I’m happy to see any improvements to their energy efficiency.

theshatterstone54 · edit-2 7 months ago

Why are people downvoting? This is huge and should make LLMs more power efficient and memory efficient.

☆ Yσɠƚԋσʂ ☆@lemmy.ml · 7 months ago

Indeed, this seems like a big step forward, and here’s a link to the model https://github.com/ridgerchu/matmulfreellm

AutoTL;DR@lemmings.world · 7 months ago

This is the best summary I could come up with:

The researchers’ approach involves two main innovations: first, they created a custom LLM and constrained it to use only ternary values (-1, 0, 1) instead of traditional floating-point numbers, which allows for simpler computations.
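To make the "simpler computations" point concrete, here is a minimal sketch in plain Python, not the researchers' code. When every weight is -1, 0, or 1, each output of a matrix-vector product can be accumulated with additions and subtractions only. The function name `ternary_matvec` and the sample values are hypothetical, chosen for illustration.

```python
def ternary_matvec(weights, x):
    """Compute W @ x for a ternary matrix W using only add/subtract.

    `weights` is a list of rows whose entries are -1, 0, or 1
    (an assumption of this sketch; other values raise an error).
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi      # weight +1: add the input
            elif w == -1:
                acc -= xi      # weight -1: subtract the input
            elif w != 0:
                raise ValueError("weights must be -1, 0, or 1")
            # weight 0: the input is skipped entirely
        out.append(acc)
    return out


W = [[1, 0, -1],
     [-1, 1, 1]]
x = [0.5, 2.0, -1.5]
y = ternary_matvec(W, x)

# Reference result using ordinary multiplication, for comparison.
y_ref = [sum(w * xi for w, xi in zip(row, x)) for row in W]
```

Both `y` and `y_ref` hold the same values, but the ternary version never multiplies, which is the property the custom hardware can exploit.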

Second, the researchers replaced the computationally expensive self-attention mechanism of traditional language models with a simpler, more efficient unit (which they called a MatMul-free Linear Gated Recurrent Unit, or MLGRU) that processes words sequentially using basic arithmetic operations instead of matrix multiplications.
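As a rough illustration of "processes words sequentially using basic arithmetic", the sketch below shows a generic element-wise gated recurrence. It is a simplified stand-in and not the MLGRU as defined by the researchers: the function `gated_recurrence`, its inputs, and the choice of a sigmoid forget gate are assumptions for this example. It also assumes the gate pre-activations and candidate states have already been computed for each token.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def gated_recurrence(forget_pre, candidates, h0=None):
    """Run an element-wise gated recurrence over a token sequence.

    forget_pre: per-token vectors of forget-gate pre-activations (assumed given)
    candidates: per-token vectors of candidate hidden states (assumed given)
    Returns the hidden state after each token.
    """
    dim = len(candidates[0])
    h = list(h0) if h0 is not None else [0.0] * dim
    states = []
    for f_pre, c in zip(forget_pre, candidates):
        f = [sigmoid(v) for v in f_pre]
        # Each dimension blends the old state and the new candidate;
        # there is no matrix product across dimensions or across tokens.
        h = [fi * hi + (1.0 - fi) * ci for fi, hi, ci in zip(f, h, c)]
        states.append(h)
    return states


# Two tokens, one hidden dimension; a pre-activation of 0 gives a gate of 0.5.
states = gated_recurrence([[0.0], [0.0]], [[2.0], [4.0]])
```

The update touches each token once and each hidden dimension independently, in contrast to self-attention, which compares every token with every other token.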

These changes, combined with a custom hardware implementation that accelerates ternary operations on an FPGA chip, allowed the researchers to achieve what they claim is performance comparable to state-of-the-art models while reducing energy use.

Researchers claim the MatMul-free LM achieved competitive performance against the Llama 2 baseline on several benchmark tasks, including answering questions, commonsense reasoning, and physical understanding.

The researchers project that their approach could theoretically intersect with and surpass the performance of standard LLMs at scales around 10²³ FLOPs (total floating-point operations), which is roughly equivalent to the training compute required for models like Meta’s Llama-3 8B or Llama-2 70B.

The article was updated on June 26, 2024 at 9:20 AM to remove the author’s inaccurate power estimate for running an LLM locally on an RTX 3060.

The original article contains 570 words, the summary contains 206 words. Saved 64%. I’m a bot and I’m open source!