Exploring LongNet: Microsoft's Breakthrough in Language Models
LongNet, a new transformer model from Microsoft, addresses a key limitation of existing language models, whose context lengths are typically capped at 2048, 4096, or at most 32768 tokens. Released on July 19, a pivotal paper outlines how LongNet can theoretically handle a context of one billion tokens, removing a constraint that hinders many practical applications of large language models (LLMs).
This article will cover the following topics:
- Overview of Large Language Models (LLMs)
- Importance of Context in Language Processing
- Strategies for Extending Context Length
- Current Architectures for LLMs
- Challenges in Scaling Models
- Microsoft's Innovative Approach: LongNet
- The Role of Distributed Training
- Results Demonstrating 1 Billion Token Capacity
- Concluding Remarks
Let's delve into these subjects.
Large Language Models (LLMs)
Large Language Models are advanced deep learning constructs characterized by millions to billions of parameters. They are typically trained on vast text datasets scraped from the internet, which may contain up to a trillion tokens.
Picture a large matrix in which every token in a sequence is scored against every other token; this is the self-attention mechanism. The model gives more weight to tokens with stronger connections, because they are more useful for predicting what comes next. These relationships are computed at every layer, but the key point is that self-attention drives the prediction of the next token, where a token is a word or part of a word that serves as the model's basic unit of text.
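To make the intuition concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. The sizes and random weights are illustrative stand-ins, not taken from any real model.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X of shape (N, d)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (N, N) pairwise token affinities
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V                        # context-aware token representations

# Illustrative sizes: 8 tokens, 16-dimensional embeddings.
rng = np.random.default_rng(0)
N, d = 8, 16
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (8, 16)
```

Each row of `weights` holds how much attention one token pays to every other token; the output blends the token representations according to those weights.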
In this way, LLMs build intricate maps of language and generate outputs from given inputs. The complexity of these maps is reflected in their parameter counts; for instance, Google's PaLM has 540 billion parameters, while OpenAI's GPT-4 is reported to be a mixture of eight models with roughly 220 billion parameters each.
To effectively respond to prompts or conversations, LLMs must comprehend the context, often referred to as the input or instruction. For example, prompts can include specific guidelines, such as "You are a playful bot that enjoys puns" or "Respond using only the text from this excerpt of the Affordable Care Act." These inputs form the context for the model's responses.
Contextual Relevance
When applied, LLMs must grasp context, which can vary in length and complexity. Context shapes the nature of responses. For instance, in a chatbot scenario, initial context may be minimal, gradually expanding as the conversation unfolds.
Humans engage in deeper dialogues by recalling previous exchanges; similarly, we expect advanced LLMs to replicate this capacity, remembering conversational context to enhance their responses.
Achieving Extended Context Length
While LLMs can technically generate outputs regardless of context length, they typically operate within the latest N tokens, discarding older ones. To improve performance with longer contexts, models must be trained on more extensive sequences, enabling better predictions based on comprehensive input.
For a sequence of length N, the number of pairwise token relationships grows quadratically: the attention matrix has N × N entries. As the context lengthens, this matrix becomes increasingly unwieldy to store and process on GPUs.
Additionally, calculating operations for extended sequences requires considerable computational resources, leading to slower training and inference, which may be impractical and environmentally taxing.
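To get a feel for why this matters, the back-of-the-envelope snippet below (my own numbers, not from the paper) estimates the memory a single full N × N attention-weight matrix would need in fp16:

```python
# Rough memory needed to store one full N x N attention-weight matrix in fp16 (2 bytes per entry).
# Illustrative only: real models also hold activations, multiple heads and layers, and optimizer state.
for n in (2_048, 32_768, 1_000_000):
    gib = n * n * 2 / 2**30
    print(f"N = {n:>9,}: {gib:,.2f} GiB per head, per layer")
```

Already at a 32K context the single matrix reaches about 2 GiB, and at a million tokens it would not fit on any current GPU.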
The primary challenge is the trade-off between computational complexity and model expressivity, and it will persist until new methods address it.
Current Architectures for LLMs
Recurrent Neural Networks (RNNs) can in principle process sequences of arbitrary length, but their step-by-step computation cannot be parallelized across the sequence, which is essential for long-sequence modeling. Long Short-Term Memory (LSTM) networks, the most prominent variant, also struggle with vanishing gradients on very long sequences.
Convolutional Neural Networks (CNNs) train in parallel and can be stacked to cover longer ranges, but their performance generally lags behind transformers at typical context lengths. Techniques such as position embeddings and multi-step attention have been explored, with limited success.
Transformers remain the most effective architecture, built from multi-head self-attention and feed-forward layers. However, computing full self-attention over long sequences is their main bottleneck.
Challenges in Scaling
The computational demands for training LLMs with extended context can be daunting. For recurrent models, the cost of maintaining previous contexts results in O(Nd²) complexity, where d is the number of hidden dimensions. In contrast, vanilla attention incurs a cost of O(N²d).
Because many of the relationships in the attention matrix contribute little, sparse attention patterns can reduce the computational burden to O(N√N d). That is still too expensive for very long contexts, which is what motivates LongNet.
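A quick, illustrative comparison of how these cost terms grow with sequence length, using an assumed hidden size of d = 1024; the O(Nd) line is the complexity the LongNet paper reports for its dilated attention:

```python
import math

d = 1024  # assumed hidden dimension, for illustration only
for N in (8_192, 131_072, 1_000_000):
    vanilla = N * N * d              # O(N^2 d): full self-attention
    sparse = N * math.sqrt(N) * d    # O(N sqrt(N) d): fixed sparse patterns
    dilated = N * d                  # O(N d): dilated attention, per the LongNet paper
    print(f"N={N:>9,}  vanilla={vanilla:.2e}  sparse={sparse:.2e}  dilated={dilated:.2e}")
```

At a million tokens the vanilla term is roughly a million times larger than the dilated one, which is the gap LongNet is designed to close.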
Microsoft's Solution: LongNet
LongNet introduces two key concepts: segment length and dilation. The input sequence is first split into segments of a fixed length, which keeps each attention computation to a manageable size.
Within each segment, a dilation rate, essentially a skipping factor, keeps only every r-th token, so the vectors that actually enter the attention computation are shorter than the segment itself. This combination lets the model represent much longer contexts without overwhelming memory.
Dilation inevitably leaves some positions out of any single attention computation. LongNet addresses this with multi-headed attention: each head shifts the starting offset of the dilated pattern, so that together the heads cover all token positions, and the heads can be computed in parallel.
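To make the segment-and-dilation idea concrete, here is a minimal single-head sketch in NumPy. It is my own simplification under stated assumptions, not the paper's implementation: `dense_attention` stands in for a proper attention layer (no learned projections, no causal masking), and the head-mixing at the end is a crude average rather than the paper's dynamic weighting.

```python
import numpy as np

def dense_attention(X):
    """Plain scaled dot-product self-attention (no learned projections, no masking)."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ X

def dilated_attention_sketch(X, segment_length, dilation, head_offset):
    """Split the sequence into segments, keep every `dilation`-th token within each
    segment (starting at `head_offset`), attend densely over that small subset, and
    scatter the results back. Positions skipped by this head stay zero here; in
    LongNet they are covered by other heads and other (segment, dilation) pairs."""
    N, _ = X.shape
    out = np.zeros_like(X)
    for start in range(0, N, segment_length):
        idx = np.arange(start, min(start + segment_length, N))[head_offset::dilation]
        out[idx] = dense_attention(X[idx])
    return out

# Illustrative usage: 3 heads with shifted offsets so every position is attended by some head.
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))
heads = [dilated_attention_sketch(X, segment_length=16, dilation=3, head_offset=h)
         for h in range(3)]
out = np.mean(heads, axis=0)  # crude average; the paper combines outputs with dynamic weights
print(out.shape)              # (64, 16)
```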
The Distributed Trainer
While LongNet's overall computational load is manageable, covering very long sequences still means repeating the segment-level attention many times, so the authors pair it with a distributed training algorithm. Because the segment computations are independent of one another, the sequence can be split across multiple GPUs and processed in parallel, keeping runtime practical for long contexts without sacrificing performance.
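The sketch below is only a single-process simulation of that independence, assuming the per-slice work can simply be farmed out to workers; the paper's actual distributed algorithm goes further, partitioning the sequence dimension across devices and exchanging keys and values where needed, which is not shown here.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    """Stand-in for the per-device attention work on one slice of the sequence."""
    scores = chunk @ chunk.T / np.sqrt(chunk.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ chunk

# Simulate 4 "devices", each owning a contiguous slice of the sequence dimension.
rng = np.random.default_rng(0)
X = rng.normal(size=(4096, 64))   # 4096 tokens, 64-dim embeddings (illustrative)
chunks = np.array_split(X, 4)     # partition along the sequence dimension
with ThreadPoolExecutor(max_workers=4) as pool:
    outputs = list(pool.map(process_chunk, chunks))
Y = np.concatenate(outputs)       # gather the per-device results
print(Y.shape)                    # (4096, 64)
```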
Results and Verification of Scaling to 1B Tokens
Microsoft confirmed the capability of scaling to one billion tokens using contemporary distributed computing frameworks. They started with an 8K sequence length and scaled up to 1B tokens, shrinking the batch size as the sequence grew so that the number of tokens per batch stayed roughly constant.
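As a rough illustration of that adjustment (with an invented token budget; the paper's exact schedule may differ), the batch size shrinks as the sequence length grows so that each batch contains about the same number of tokens:

```python
# Invented token budget per batch, for illustration only.
tokens_per_batch = 4_194_304  # ~4M tokens
for seq_len in (8_192, 65_536, 524_288, 4_194_304):
    batch_size = tokens_per_batch // seq_len
    print(f"sequence length {seq_len:>9,} -> batch size {batch_size:>4}")
```

Past that point a single sequence fills the budget on its own, which is where the distributed trainer and sequence partitioning take over.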
In experiments comparing sequence lengths of 2K, 8K, and 32K, LongNet achieved substantially lower perplexity than both sparse and vanilla transformers, at a lower computational cost.
The findings suggest that LongNet adheres to established scaling laws, indicating that dense transformers are not essential for effective LLM scaling.
Closing Thoughts
The LongNet paper opens avenues for modeling extensive sequences, potentially allowing entire corpora or even the internet to be treated as a continuous sequence. However, such claims may be overstated, particularly regarding comparisons to artificial general intelligence (AGI).
Nonetheless, LongNet significantly enhances the feasibility of utilizing longer context lengths, empowering smaller entities to compete with larger organizations in the LLM landscape.
Thank you for reading! Your engagement is appreciated. Stay tuned for further updates on fascinating developments in the AI realm. Please consider following or subscribing for more insights.
Explore membership options at: https://ithinkbot.com/membership
Connect with me on LinkedIn: https://www.linkedin.com/in/mandarkarhade/