Google has published a research paper on a new technology called Infini-attention that allows large language models to process massive amounts of data with “infinitely long contexts” and that can be easily inserted into other models to vastly improve their capabilities.
That last part should interest anyone who follows Google’s algorithm. Infini-attention is plug-and-play, which means it’s relatively easy to insert into other models, including those in use by Google’s core algorithm. The part about “infinitely long contexts” may have implications for how some of Google’s search systems can be updated.
The name of the research paper is: Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention
Memory Is Computationally Expensive For LLMs
Large Language Models (LLMs) are limited in how much data they can process at one time because computational complexity and memory usage climb steeply as the input grows. Infini-attention gives the LLM the ability to handle longer contexts while keeping down the memory and processing power needed.
The research paper explains:
“Memory serves as a cornerstone of intelligence, as it enables efficient computations tailored to specific contexts. However, Transformers …and Transformer-based LLMs …have a constrained context-dependent memory, due to the nature of the attention mechanism.
Indeed, scaling LLMs to longer sequences (i.e. 1M tokens) is challenging with the standard Transformer architectures and serving longer and longer context models becomes costly financially.”
And elsewhere the research paper explains:
“Current transformer models are limited in their ability to process long sequences due to quadratic increases in computational and memory costs. Infini-attention aims to address this scalability issue.”
The researchers hypothesized that Infini-attention can scale to handle extremely long sequences with Transformers without the usual increases in computational and memory resources.
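To make the quadratic cost concrete, the back-of-the-envelope sketch below counts attention score entries for full attention versus a scheme that processes the input one segment at a time. The segment length of 2,048 and the function names are assumptions chosen for illustration, not figures taken from the paper.

```python
def full_attention_scores(seq_len: int) -> int:
    """Every token attends to every token: seq_len x seq_len score entries."""
    return seq_len * seq_len


def segmented_attention_scores(seq_len: int, segment_len: int) -> int:
    """Each segment attends only within itself: one segment_len x segment_len
    block per segment, so the total grows linearly with seq_len."""
    num_segments = -(-seq_len // segment_len)  # ceiling division
    return num_segments * segment_len * segment_len


seq_len = 1_000_000   # the 1M-token scale mentioned in the paper
segment_len = 2_048   # assumed segment size, for illustration only

full = full_attention_scores(seq_len)
segmented = segmented_attention_scores(seq_len, segment_len)
print(f"full attention:      {full:,} scores")
print(f"segmented attention: {segmented:,} scores")
print(f"ratio:               {full / segmented:.0f}x")
```

Doubling the sequence length quadruples the first number but only doubles the second, which is the scaling gap the researchers are trying to close.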
Three Important Features
Google’s Infini-attention addresses the shortcomings of transformer models by incorporating three features. Together they enable transformer-based LLMs to handle longer sequences without running into memory issues, and to connect context from early in a sequence with context much further along, toward the end of the sequence.
The features of Infini-attention:
- Compressive Memory System
- Long-term Linear Attention
- Local Masked Attention
Compressive Memory System
Infini-attention uses what’s called a compressive memory system. As more data is input (as part of a long sequence of data), the compressive memory system compresses some of the older information in order to reduce the amount of space needed to store the data.
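A minimal sketch of how such a memory can stay the same size no matter how much data passes through it. It follows the additive update described in the paper (keys are passed through an ELU + 1 feature map and accumulated into a fixed matrix), but the function names, dimensions and toy numbers are assumptions for illustration only.

```python
import math


def elu_plus_one(x: float) -> float:
    """Non-negative feature map (ELU + 1) applied to keys before storing them."""
    return x + 1.0 if x > 0 else math.exp(x)


def update_memory(memory, norm, keys, values):
    """Fold one segment's keys/values into a fixed-size memory.

    memory is a d_key x d_value matrix, norm is a length-d_key vector.
    Update rule: memory += sigma(K)^T V, norm += sum of sigma(K) rows.
    """
    for key, value in zip(keys, values):
        feat = [elu_plus_one(k) for k in key]
        for i, f in enumerate(feat):
            norm[i] += f
            for j, v in enumerate(value):
                memory[i][j] += f * v
    return memory, norm


d_key, d_value = 2, 3
memory = [[0.0] * d_value for _ in range(d_key)]
norm = [0.0] * d_key

# Two segments of different lengths; the memory shape never changes.
segment_a = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
segment_b = ([[0.5, -0.5]], [[7.0, 8.0, 9.0]])

for keys, values in (segment_a, segment_b):
    memory, norm = update_memory(memory, norm, keys, values)

print(len(memory), "x", len(memory[0]))  # dimensions of the memory matrix
```

The older segments are not kept around after the update; only their summed contribution to the matrix survives, which is where the space saving comes from.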
Long-term Linear Attention
Infini-attention also uses what the researchers call “long-term linear attention mechanisms,” which enable the LLM to draw on data from earlier in the sequence.
This is important for tasks where the relevant context is spread across a large span of data. It’s like being able to discuss an entire book with all of its chapters in view and explain how the first chapter relates to another chapter in the middle of the book.
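The sketch below illustrates the idea of reading earlier information back out of a fixed-size memory with linear attention. It uses the retrieval form given in the paper (the query's feature vector times the memory matrix, divided by a normalization term), while the helper names and the toy keys and values are assumptions made up for this example.

```python
import math


def feature_map(vec):
    """ELU + 1 applied element-wise, keeping every feature positive."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]


def store(memory, norm, key, value):
    """Add one key/value pair to the fixed-size memory."""
    feat = feature_map(key)
    for i, f in enumerate(feat):
        norm[i] += f
        for j, v in enumerate(value):
            memory[i][j] += f * v


def retrieve(memory, norm, query):
    """Linear-attention read: sigma(q) M / (sigma(q) . z)."""
    feat = feature_map(query)
    denom = sum(f * n for f, n in zip(feat, norm))
    return [
        sum(feat[i] * memory[i][j] for i in range(len(feat))) / denom
        for j in range(len(memory[0]))
    ]


memory = [[0.0, 0.0], [0.0, 0.0]]
norm = [0.0, 0.0]

# "Chapter one" is stored first, "a middle chapter" is stored later.
store(memory, norm, key=[4.0, -4.0], value=[1.0, 0.0])
store(memory, norm, key=[-4.0, 4.0], value=[0.0, 1.0])

# A later query that resembles the first key pulls back (mostly) the first value.
recalled = retrieve(memory, norm, query=[4.0, -4.0])
print([round(x, 3) for x in recalled])
```

The read is approximate because everything shares one compressed matrix, but its cost does not depend on how long ago the information was stored.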
Local Masked Attention
In addition to the long-term attention, Infini-attention also uses what’s called local masked attention. This kind of attention processes nearby (localized) parts of the input data, which is useful for responses that depend on the closer parts of the data.
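As a rough illustration of local masked attention, the sketch below runs standard scaled dot-product attention over a single short segment, with a causal mask so each token only sees itself and the tokens before it. The function name and the three-token segment are hypothetical.

```python
import math


def causal_attention(queries, keys, values):
    """Scaled dot-product attention where token t only sees tokens 0..t."""
    d_key = len(keys[0])
    outputs = []
    for t, q in enumerate(queries):
        # Scores for positions up to and including t; later positions are masked out.
        scores = [
            sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d_key)
            for s in range(t + 1)
        ]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ])
    return outputs


segment_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segment_k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
segment_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]

outputs = causal_attention(segment_q, segment_k, segment_v)
for row in outputs:
    print([round(x, 3) for x in row])
```

Because the first token can only attend to itself, its output is simply its own value vector; later tokens blend in progressively more of the segment.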
Combining long-term and local attention helps solve the problem of transformers being limited in how much input data they can remember and use for context.
The researchers explain:
“The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block.”
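One way to picture how the two attention outputs end up in a single block: the paper blends them with a learned gate, a sigmoid applied to a scalar. The sketch below shows only that blending step, with made-up output vectors and gate values, and the name `combine` is hypothetical.

```python
import math


def combine(memory_out, local_out, beta):
    """Blend long-term and local attention outputs with a sigmoid gate.

    gate = sigmoid(beta); result = gate * memory_out + (1 - gate) * local_out.
    In a trained model beta would be learned; here it is set by hand.
    """
    gate = 1.0 / (1.0 + math.exp(-beta))
    return [gate * m + (1.0 - gate) * l for m, l in zip(memory_out, local_out)]


memory_out = [0.2, 0.8]  # hypothetical output of the long-term memory read
local_out = [0.6, 0.4]   # hypothetical output of local masked attention

balanced = combine(memory_out, local_out, beta=0.0)       # equal weighting
mostly_memory = combine(memory_out, local_out, beta=6.0)  # leans on long-term memory
mostly_local = combine(memory_out, local_out, beta=-6.0)  # leans on nearby tokens

print([round(x, 3) for x in balanced])
print([round(x, 3) for x in mostly_memory])
print([round(x, 3) for x in mostly_local])
```

A gate like this lets the model decide how much to rely on distant context versus nearby context rather than fixing the balance in advance.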
Results Of Experiments And Testing
Infini-attention was tested against baseline models across multiple benchmarks involving long input sequences: long-context language modeling, passkey retrieval, and book summarization. Passkey retrieval is a test where the language model has to retrieve specific data from within an extremely long text sequence.
List of the three tests:
- Long-context Language Modeling
- Passkey Test
- Book Summary
Long-Context Language Modeling And The Perplexity Score
The researchers write that the models with Infini-attention outperformed the baseline models and that increasing the training sequence length brought even further improvements in the Perplexity score. The Perplexity score is a metric that measures language model performance, with lower scores indicating better performance.
The researchers shared their findings:
“Infini-Transformer outperforms both Transformer-XL …and Memorizing Transformers baselines while maintaining 114x less memory parameters than the Memorizing Transformer model with a vector retrieval-based KV memory with length of 65K at its 9th layer. Infini-Transformer outperforms memorizing transformers with memory length of 65K and achieves 114x compression ratio.
We further increased the training sequence length to 100K from 32K and trained the models on Arxiv-math dataset. 100K training further decreased the perplexity score to 2.21 and 2.20 for Linear and Linear + Delta models.”
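For readers unfamiliar with the metric, perplexity is the exponential of the average negative log-probability a model assigns to the correct next token. The sketch below computes it for a few invented probability lists; the numbers are illustrative and are not drawn from the paper's experiments.

```python
import math


def perplexity(token_probs):
    """exp of the average negative log-probability assigned to the true tokens."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


confident_model = [0.9, 0.8, 0.95, 0.85]  # invented probabilities for the correct tokens
unsure_model = [0.3, 0.2, 0.4, 0.25]

print(round(perplexity(confident_model), 3))
print(round(perplexity(unsure_model), 3))
print(round(perplexity([0.5, 0.5, 0.5]), 3))
```

Read this way, a perplexity near 2.2 means the model is, on average, about as uncertain as if it were choosing between slightly more than two equally likely tokens at each step.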
Passkey Test
In the passkey test, a random number is hidden within a long text sequence and the model must retrieve it. The passkey is hidden near the beginning, the middle or the end of the long text. The model was able to solve the passkey test at context lengths of up to 1 million tokens.
“A 1B LLM naturally scales to 1M sequence length and solves the passkey retrieval task when injected with Infini-attention. Infini-Transformers solved the passkey task with up to 1M context length when fine-tuned on 5K length inputs. We report token-level retrieval accuracy for passkeys hidden in a different part (start/middle/end) of long inputs with lengths 32K to 1M.”
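A simplified sketch of how a passkey test input might be assembled and scored. The filler sentences, the prompt wording, the function names and the digit-by-digit scoring are all assumptions for illustration; the paper's exact prompt and its token-level metric may differ.

```python
import random


def build_passkey_prompt(passkey: int, filler_sentences: int, position: str) -> str:
    """Hide a passkey at the start, middle or end of repeated distractor text."""
    filler = "The grass is green. The sky is blue. "  # hypothetical distractor text
    needle = f"The pass key is {passkey}. Remember it. "
    question = "What is the pass key?"
    offsets = {"start": 0, "middle": filler_sentences // 2, "end": filler_sentences}
    before = offsets[position]
    after = filler_sentences - before
    return filler * before + needle + filler * after + question


def digit_accuracy(predicted: str, expected: str) -> float:
    """Fraction of passkey digits reproduced in the right position
    (a stand-in for the token-level accuracy reported in the paper)."""
    hits = sum(1 for p, e in zip(predicted, expected) if p == e)
    return hits / len(expected)


rng = random.Random(0)
passkey = rng.randint(10000, 99999)

for position in ("start", "middle", "end"):
    prompt = build_passkey_prompt(passkey, filler_sentences=1000, position=position)
    print(position, len(prompt), prompt.index(str(passkey)))

# Scoring a hypothetical model answer that gets the last digit wrong.
expected = str(passkey)
wrong_last = expected[:-1] + ("0" if expected[-1] != "0" else "1")
print(digit_accuracy(wrong_last, expected))
```

Scaling `filler_sentences` up is what turns this from a trivial lookup into a long-context test, since the model has to carry the passkey across all of the distractor text before answering.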
Book Summary Test
Infini-attention also excelled at the book summary test, outperforming the previous best results and achieving new state-of-the-art (SOTA) performance.
The results are described:
“Finally, we show that a 8B model with Infini-attention reaches a new SOTA result on a 500K length book summarization task after continual pre-training and task fine-tuning.
…We further scaled our approach by continuously pre-training a 8B LLM model with 8K input length for 30K steps. We then fine-tuned on a book summarization task, BookSum (Kryściński et al., 2021) where the goal is to generate a summary of an entire book text.
Our model outperforms the previous best results and achieves a new SOTA on BookSum by processing the entire text from book. …There is a clear trend showing that with more text provided as input from books, our Infini-Transformers improves its summarization performance metric.”
Implications Of Infini-Attention For SEO
Infini-attention is a breakthrough in modeling both long- and short-range attention more efficiently than previous models. It also supports “plug-and-play continual pre-training and long-context adaptation by design,” which means that it can easily be integrated into existing models.
Lastly, the “continual pre-training and long-context adaptation” makes it well suited to scenarios where a stream of new data constantly needs to be added to train a model. That last part is especially interesting because it may make Infini-attention useful for applications on the back end of Google’s search systems, particularly where it is necessary to analyze long sequences of information and understand how one part near the beginning of the sequence is relevant to another part closer to the end.
The researchers’ claim of “infinitely long inputs” is striking, but what matters most for SEO is the mechanism’s ability to handle long sequences of data in order to “Leave No Context Behind,” along with its plug-and-play design. It gives an idea of how some of Google’s systems could be improved if Google adapted Infini-attention to systems within its core algorithm.
Read the research paper:
Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention