<aside> 💡
The Multi-scale Positional Encoding (Ms-PoE) architecture, introduced in "Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding," revolutionizes how language models handle long contexts. Built upon the transformer architecture, Ms-PoE employs a modified Rotary Position Embedding (RoPE) that rotates token embeddings in the complex plane based on their sequence position. This rotation is achieved by applying a position-dependent angle θ to each token's embedding, effectively encoding positional information into the representation itself.
The architecture's workflow begins with input tokenization, converting the text into a sequence of tokens. Each token is then assigned a position-dependent embedding using the modified RoPE. The key innovation is a rescaling function f(pos) that compresses the positional range for middle tokens. This compression is crucial: it makes tokens in the middle of long sequences more distinguishable, directly addressing the "lost-in-the-middle" problem that plagues many language models.
The rescaling function works by modifying the position indices used in the positional encoding calculation. For tokens in the middle of the sequence, the function compresses their positional range, effectively bringing them "closer together" in the embedding space. This compression is balanced to preserve the relative ordering of tokens while still making middle tokens more prominent to the model's attention mechanism.
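The middle-compressing behavior described above can be sketched as a monotone, piecewise-linear map. This is an illustrative sketch, not the paper's exact function: the band boundaries (`start`, `end`) and compression factor (`ratio`) are assumed values chosen to make the effect visible.

```python
import numpy as np

def rescale_positions(positions, start=0.25, end=0.75, ratio=0.5):
    """Hypothetical middle-compressing rescaling: positions in the middle
    band [start, end] (as fractions of sequence length) advance at the
    reduced rate `ratio`. The map is strictly monotone, so the relative
    ordering of tokens is preserved while the middle band is compressed."""
    n = positions.max() + 1
    lo, hi = start * n, end * n
    out = np.empty_like(positions, dtype=float)
    for k, p in enumerate(positions):
        if p < lo:
            out[k] = p                                   # early tokens: unchanged
        elif p <= hi:
            out[k] = lo + (p - lo) * ratio               # middle band: compressed
        else:
            out[k] = lo + (hi - lo) * ratio + (p - hi)   # late tokens: shifted down
    return out

print(rescale_positions(np.arange(8)))  # middle positions drawn closer together
```

Note how positions 2 through 6 end up spanning a smaller range than the untouched early and late positions, while the ordering never changes.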
Following the positional encoding, the architecture implements attention head-specific scaling. In this step, different attention heads are assigned varying scaling ratios (r_h), creating a multi-scale representation. This is achieved by applying a unique scaling factor to the positional encodings for each attention head. Heads with smaller r_h values focus on local context, allowing the model to capture fine-grained, nearby relationships between tokens. Conversely, heads with larger r_h values are designed to capture long-range dependencies, enabling the model to understand broader context and relationships across the entire sequence.
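The head-specific ratios can be sketched as follows. The linear spacing across heads and the `[1.2, 1.8]` range are assumptions for this illustration, as is the form f_h(pos) = pos / r_h for how a ratio rescales positions; the source text only states that each head receives its own scaling factor.

```python
import numpy as np

def head_scaling_ratios(num_heads, r_min=1.2, r_max=1.8):
    """Assign each attention head a scaling ratio r_h. Smaller ratios leave
    positions nearly unscaled (local focus); larger ratios compress the
    position range more strongly (long-range focus)."""
    return np.linspace(r_min, r_max, num_heads)

def scaled_positions(pos, r_h):
    """One simple rescaling: divide the position index by the head ratio."""
    return pos / r_h

ratios = head_scaling_ratios(8)
print(ratios)
# The same absolute position looks much "closer" to a large-ratio head:
print(scaled_positions(1024, ratios[0]), scaled_positions(1024, ratios[-1]))
```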
The complete Ms-PoE encoding can be expressed as Ms-PoE(pos, h, i) = exp(j · θ_i · f_h(pos)), where pos is the token position, h is the attention head index, i is the dimension index, θ_i is the per-dimension base rotation angle, and j denotes the imaginary unit. The rescaled position f_h(pos) enters the rotation angle rather than multiplying the magnitude, so the encoding remains a pure rotation. This formula encapsulates how positional information is encoded and scaled for each token across different attention heads and embedding dimensions.
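As a quick numeric check of the rotation view, one embedding pair can be treated as a complex number and rotated by the position-dependent angle. The dimension, head ratio, and position values below are arbitrary examples; the key property shown is that the rotation leaves the vector's norm unchanged.

```python
import cmath

d_model, dim_i = 64, 0
theta_i = 10000 ** (-2 * dim_i / d_model)  # per-dimension base frequency
r_h = 1.5                                  # example head scaling ratio
pos = 10
angle = (pos / r_h) * theta_i              # example f_h(pos) = pos / r_h

x = complex(0.6, 0.8)                      # one unit-norm embedding pair as x1 + j*x2
rotated = x * cmath.exp(1j * angle)        # exp(j * theta_i * f_h(pos)) * x

print(abs(x), abs(rotated))                # the rotation preserves the norm
```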
During the forward pass of the model, these multi-scale positional encodings interact with the self-attention mechanism. The varying scales allow the model to simultaneously attend to both local and global contexts. For instance, when processing a long document, some heads might focus on understanding the relationship between adjacent words or sentences, while others might capture document-wide themes or references.
This innovative approach enables language models to process local details and global context in long sequences concurrently, significantly improving performance on tasks requiring comprehensive understanding of extended text. In practical applications, such as analyzing lengthy legal documents or research papers, the model can maintain awareness of both specific details and overarching themes throughout the entire document.
Importantly, Ms-PoE achieves this enhanced capability without introducing additional parameters or computational overhead. The rescaling and multi-scale attention are implemented through modifications to the existing positional encoding and attention mechanisms, rather than by adding new layers or components to the model. This makes Ms-PoE a highly efficient and practical solution for improving long-context processing in large language models, allowing for easy integration into existing architectures without the need for extensive retraining or increased computational resources.
</aside>
Ms-PoE builds upon the transformer architecture, introducing a novel positional encoding scheme with three key components:
RoPE forms the foundation of Ms-PoE, encoding absolute positions using a rotation matrix:
$$ R_{\theta}(x) = [x_1 \cos(\theta) - x_2 \sin(\theta), x_1 \sin(\theta) + x_2 \cos(\theta)] $$
Where $$x$$ is the token embedding and $$\theta$$ is the position-dependent angle.
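The rotation matrix above translates directly into code; a minimal sketch applying it to a single embedding pair:

```python
import math

def rotate_pair(x1, x2, theta):
    """Apply the 2-D rotation R_theta from the formula above to one
    embedding pair (x1, x2)."""
    return (x1 * math.cos(theta) - x2 * math.sin(theta),
            x1 * math.sin(theta) + x2 * math.cos(theta))

# Rotating (1, 0) by 90 degrees maps it onto (0, 1).
print(rotate_pair(1.0, 0.0, math.pi / 2))
```

In practice RoPE applies this rotation to every consecutive pair of embedding dimensions, each with its own frequency.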
RoPE is called "rotary" because it rotates token embeddings in the complex plane based on their sequence position. This rotation-based method offers several advantages:

- Relative positions emerge naturally: the attention score between two tokens depends only on their positional offset, even though each token is rotated by its absolute position.
- Inter-token dependency decays gracefully with distance, matching the intuition that far-apart tokens are usually less related.
- No additional learned parameters are introduced, since the rotation angles are fixed functions of position and dimension.
Ms-PoE introduces a rescaling function $$f(pos)$$ that modifies position indices:
$$ PE_{\text{Ms-PoE}}(pos, 2i) = \sin\left(\frac{f(pos)}{10000^{2i/d_{model}}}\right) $$
This function is designed to:

- Compress the positional range of tokens in the middle of long sequences, making them more distinguishable to the attention mechanism.
- Preserve the relative ordering of tokens, so the compression never reorders positional information.
- Directly mitigate the "lost-in-the-middle" problem described above.
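The rescaled sinusoidal form can be computed directly. The linear rescaling `f(pos) = pos / r_h` and the head ratio `r_h = 1.5` below are example assumptions used to make the formula concrete.

```python
import math

def ms_poe_sinusoidal(pos, i, d_model, r_h=1.5):
    """Sinusoidal encoding with the position index rescaled before the
    usual sin(pos / 10000^(2i/d_model)) computation."""
    f_pos = pos / r_h  # example rescaling f(pos) = pos / r_h
    return math.sin(f_pos / (10000 ** (2 * i / d_model)))

print(ms_poe_sinusoidal(100, 0, 64))   # dimension 0: fastest frequency
print(ms_poe_sinusoidal(100, 16, 64))  # higher dimension: slower frequency
```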