Overview of position embedding methods used in LLMs
If we look at the original Transformer (the architecture that underlies everything from GPT to Llama), we will find a peculiar design choice right at the beginning: before the network does any heavy lifting, it adds a “Position Embedding” to the word embedding.
Why is this necessary, and how does it work?
To understand position embeddings, we first have to understand what a Transformer lacks compared to its predecessors.
Previous architectures, like Recurrent Neural Networks (RNNs), processed words sequentially: they looked at word 1, then word 2, then word 3. The “position” was inherent in the order of processing.
Transformers, however, process all tokens in a sequence without considering their positions in the sequence. Without some extra help, the model sees the sentence as a “bag of words.” To a raw Transformer, the sentence:
“The dog bit the man”
Looks mathematically identical to:
“The man bit the dog”
Because the model has no inherent sense of order, we must explicitly inject position information into the data. This is where Position Embeddings come in. They assign a unique vector to every index ($0, 1, 2, \dots, N-1$) in the sequence.
Let $(w_0, w_1, \dots, w_i, \dots, w_{N-1})$ be a sequence of $N$ input tokens, with $w_i$ being the $i^{\text{th}}$ token. Each $w_i$ is mapped to a $d_\text{model}$-dimensional embedding vector $x_i \in \mathbb{R}^{d_\text{model}}$ that carries no position information. These token embeddings $X$, combined with position information, are then transformed into the queries, keys, and values used in the self-attention layer of the Transformer architecture.
\[\begin{aligned} \boldsymbol{q}_m &= f_q(\boldsymbol{x}_m, m) \\ \boldsymbol{k}_n &= f_k(\boldsymbol{x}_n, n) \\ \boldsymbol{v}_n &= f_v(\boldsymbol{x}_n, n), \end{aligned}\]where $\boldsymbol{q}_m$, $\boldsymbol{k}_n$, and $\boldsymbol{v}_n$ are the $m^{\text{th}}$ row of the query matrix $Q \in \mathbb{R}^{N\times d_k}$ and the $n^{\text{th}}$ rows of the key matrix $K \in \mathbb{R}^{N\times d_k}$ and value matrix $V \in \mathbb{R}^{N\times d_v}$, respectively, as used in the self-attention mechanism:
\[\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V\]Note that the dimension changes from $d_\text{model}$ to $d_k$ or $d_v$ because the Transformer doesn’t use $X$ directly for attention: it multiplies $X$ by three separate learnable weight matrices ($W^Q$, $W^K$, $W^V$) to project the data into the per-head dimension.
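As a concrete illustration, here is a minimal single-head sketch of these projections and the attention formula in PyTorch (the sizes and random inputs are placeholders, not values from any particular model):

```python
import math
import torch

torch.manual_seed(0)

N, d_model, d_k, d_v = 6, 512, 64, 64   # sequence length and dimensions

X = torch.randn(N, d_model)             # token embeddings (position info not yet added)

# Learnable projection matrices W^Q, W^K, W^V
W_Q = torch.randn(d_model, d_k) / math.sqrt(d_model)
W_K = torch.randn(d_model, d_k) / math.sqrt(d_model)
W_V = torch.randn(d_model, d_v) / math.sqrt(d_model)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V     # shapes: (N, d_k), (N, d_k), (N, d_v)

# Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / math.sqrt(d_k)       # (N, N) similarity between every query and key
weights = torch.softmax(scores, dim=-1) # each row sums to 1
output = weights @ V                    # (N, d_v)
```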
The first approach, sinusoidal position embeddings, was introduced in the original Transformer paper by Vaswani et al.
Imagine a clock with many hands moving at different speeds. By looking at the positions of all the hands simultaneously, you can determine the exact time. Sinusoidal embeddings work similarly: each dimension of the position vector corresponds to a sinusoid of a different frequency.
For a specific position $pos$ and a specific dimension index $i$, the embedding is calculated as:
\[\begin{aligned} PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_\text{model}}}\right) \\ PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_\text{model}}}\right) \end{aligned}\]where $pos$ is the token’s position in the sequence, $i$ indexes the dimension pair, and $d_\text{model}$ is the embedding dimension, so each pair of dimensions $(2i, 2i+1)$ corresponds to a sinusoid of a different frequency.
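A minimal sketch of how this table could be computed, vectorized over all positions (the sizes `max_len=1024` and `d_model=512` are illustrative choices):

```python
import torch

def sinusoidal_position_embeddings(max_len: int, d_model: int) -> torch.Tensor:
    """Return a (max_len, d_model) matrix of sinusoidal position embeddings."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dims: 0, 2, 4, ...
    freq = 1.0 / (10000 ** (dim / d_model))                         # 1 / 10000^(2i / d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * freq)   # even indices get sin
    pe[:, 1::2] = torch.cos(pos * freq)   # odd indices get cos
    return pe

pe = sinusoidal_position_embeddings(max_len=1024, d_model=512)
print(pe.shape)  # torch.Size([1024, 512])
```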
Why this is clever: the values stay bounded, no extra parameters need to be learned, and for any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$, which gives the model an easy handle on relative positions.
NOTE: Mathematically, the sinusoidal method has no hard limit on sequence length, because $PE_{(pos, i)}$ accepts any $pos$ as input. The code will not crash, and the model will run. However, in practice, if you train a Transformer with sinusoidal embeddings on a context length of 1024 and then test it on length 2048, performance usually degrades sharply.
The Unseen Signal Problem: Even though each sin/cos value stays between -1 and 1, the combination of values across all $d_\text{model}$ dimensions (512 in the original Transformer) creates a specific “fingerprint” for every position. Positions beyond the training range produce fingerprints the model has never learned to interpret, even though every individual value looks familiar.
While the sinusoidal approach is elegant, later models like BERT and the early GPT series took a “brute force” approach that is often easier to implement.
How it works
Instead of a fixed formula, the model allocates a trainable embedding matrix of shape $(\text{max\_len}, d_\text{model})$. The vector for position $i$ is simply row $i$ of that table, updated by gradient descent like any other parameter.
The Trade-off
The table has a hard size limit: a model trained with $\text{max\_len} = 512$ has no vector at all for position 513, so it cannot handle longer sequences without resizing and retraining, and it gives up the structure that lets the sinusoidal formula generalize.
Regardless of whether you use the Sinusoidal or Learned method, the application is usually identical. The position vector is added (element-wise) to the token embedding before entering the first Transformer layer:
\[\boldsymbol{x}_i' = \boldsymbol{x}_i + PE_i\]
This “stamps” the token with information about where it sits in the sentence, allowing the Attention mechanism to differentiate between the first “The” and the second “The” in a sentence.
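Here is a minimal sketch of the learned variant and this element-wise addition in PyTorch (`vocab_size`, `max_len`, and the token ids are hypothetical, chosen only for illustration):

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 32_000, 512, 768   # illustrative sizes

tok_emb = nn.Embedding(vocab_size, d_model)       # token lookup table
pos_emb = nn.Embedding(max_len, d_model)          # learned position table (BERT / GPT-2 style)

token_ids = torch.tensor([[101, 2023, 2003, 1037, 7099, 102]])   # (batch=1, seq_len=6), arbitrary ids
positions = torch.arange(token_ids.size(1)).unsqueeze(0)          # [[0, 1, 2, 3, 4, 5]]

# "Stamp" each token with its position before the first Transformer layer.
x = tok_emb(token_ids) + pos_emb(positions)        # (1, 6, d_model)
```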
While Absolute Position Embeddings (both Sinusoidal and Learned) are effective, they share a fundamental flaw: they treat position as a fixed address. But in language, the absolute address doesn’t matter as much as the relative distance. The word “dog” (at index 500) relates to the word “barked” (at index 505) exactly the same way “dog” (at index 5) relates to “barked” (at index 10). The relationship is defined by the distance ($+5$), not the coordinate.
To solve this, researchers (Shaw et al., 2018) introduced Relative Position Embeddings (RPE): instead of encoding where each token sits, the model learns representations (or bias terms) for the offset $i - j$ between a pair of tokens and injects them directly into the attention computation.
The Problem with RPE: While accurate, standard RPE is computationally expensive. It often requires materializing massive $N \times N$ ($N$ is the sequence length) matrices to store these bias terms, or it complicates the optimized attention kernels (like FlashAttention). We needed a method that had the efficiency of Absolute Embeddings (just modifying the vectors once) but the mathematical properties of Relative Embeddings.
Introduced by Su et al. (2021), Rotary Position Embedding (RoPE) takes a different approach: it encodes position by rotating the vector in geometric space.
\[\boldsymbol{x}' = \boldsymbol{R}_{pos} \cdot \boldsymbol{x}\]Why rotation? Because in 2D space, if you have a vector at angle $\theta$ and you rotate it by $\phi$, the new angle is simply $\theta + \phi$. Rotation is inherently additive in angles, which preserves relative information perfectly when we take the dot product.
How RoPE Works
RoPE treats the embedding vector of size $d$ not as a single chunk, but as $d/2$ pairs of numbers. Each pair is treated as a coordinate $(x, y)$ in a 2D plane. For a token at position $m$, we rotate each pair by an angle $m \cdot \theta_i$, where $\theta_i$ is the frequency for that specific $i^{th}$ dimension. Using complex numbers, this is elegantly simple. For a 2D vector represented as a complex number $q$:\(f(q, m) = q \cdot e^{im\theta}\)In linear algebra terms (real numbers), this is a rotation matrix multiplication. For a feature pair $(q_1, q_2)$ at position $m$:
\[\begin{pmatrix} q'_1 \\ q'_2 \end{pmatrix} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_1 \\ q_2 \end{pmatrix}\]
The “Relative” Magic (The Dot Product)
The reason RoPE took over the world is what happens when two rotated vectors interact in the Self-Attention layer. Let’s look at the dot product between a Query at position $m$ and a Key at position $n$. $\boldsymbol{q}_m$ is rotated by angle $m\theta$. $\boldsymbol{k}_n$ is rotated by angle $n\theta$. When we take their dot product (which measures similarity):
\[\langle \boldsymbol{q}_m, \boldsymbol{k}_n \rangle = \text{Real}( (\boldsymbol{q} e^{im\theta}) \cdot (\boldsymbol{k} e^{in\theta})^* )\]Using exponent rules ($e^A \cdot e^{-B} = e^{A-B}$), the absolute positions $m$ and $n$ cancel out, leaving only the difference:
\[\langle \boldsymbol{q}_m, \boldsymbol{k}_n \rangle = \langle \boldsymbol{q}, \boldsymbol{k} \rangle \cos((m-n)\theta) + \dots\]The attention score depends only on the relative distance $(m-n)$. The model naturally understands “5 steps back” regardless of whether it’s at step 100 or step 1000.
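You can check this numerically with a single feature pair: rotating a query to position $m$ and a key to position $n$ gives the same score as rotating them to positions $m+100$ and $n+100$, because only the offset $m-n$ matters. A small sketch:

```python
import math
import torch

def rot(angle: float) -> torch.Tensor:
    """2x2 rotation matrix for a single RoPE feature pair."""
    c, s = math.cos(angle), math.sin(angle)
    return torch.tensor([[c, -s], [s, c]])

theta = 0.1
q = torch.tensor([0.3, -1.2])   # arbitrary 2D query pair
k = torch.tensor([0.7,  0.5])   # arbitrary 2D key pair

# Dot product at positions (m=3, n=1) vs. (m=103, n=101): same offset m - n = 2.
score_a = (rot(3 * theta) @ q) @ (rot(1 * theta) @ k)
score_b = (rot(103 * theta) @ q) @ (rot(101 * theta) @ k)
print(torch.allclose(score_a, score_b))  # True: only m - n matters
```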
In practice, RoPE is applied efficiently: the rotation matrices are never materialized. The rotation is applied only to the queries and keys (the values are left untouched), using element-wise multiplications with precomputed cosine and sine tables and a “rotate half” rearrangement of the vector.
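Below is a minimal sketch of this element-wise formulation. The “rotate half” pairing convention follows the style common in open-source implementations; exact tensor layouts differ between libraries, so treat this as illustrative rather than a reference implementation:

```python
import torch

def rope_angles(seq_len: int, d_head: int, base: float = 10000.0):
    """Precompute cos/sin tables of shape (seq_len, d_head)."""
    theta = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))   # (d_head/2,)
    m = torch.arange(seq_len).float()                                        # positions 0..seq_len-1
    angles = torch.outer(m, theta)                                           # (seq_len, d_head/2)
    angles = torch.cat([angles, angles], dim=-1)                             # (seq_len, d_head)
    return angles.cos(), angles.sin()

def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Split the last dim into two halves and map (x1, x2) -> (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat([-x2, x1], dim=-1)

def apply_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """x: (seq_len, d_head). Rotate each position by its own angle, element-wise."""
    return x * cos + rotate_half(x) * sin

seq_len, d_head = 16, 64
cos, sin = rope_angles(seq_len, d_head)
q = torch.randn(seq_len, d_head)
k = torch.randn(seq_len, d_head)
q_rot, k_rot = apply_rope(q, cos, sin), apply_rope(k, cos, sin)   # values are NOT rotated
```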
Note: While RoPE extrapolates better than learned embeddings, extending it to massive lengths still requires tricks like “NTK-Aware Scaling” or “Linear Scaling,” which are simple adjustments to the rotation frequency.
You might ask: “If the Relative Bias approach is so inefficient, why did models like T5 use it?”
The answer is that it offers a very simple, robust guarantee for extrapolation. It solves the “unknown position” problem by simply refusing to distinguish between long distances.
Translation Invariance
First, like RoPE, the Relative Bias method relies on the distance $i-j$, not the absolute positions. The model learns a parameter for “distance 5.” It applies that parameter whether the tokens are at indices (10, 15) or indices (1000, 1005). This means it inherently understands that the local structure of language is the same everywhere in the document.
The “Clipping” or “Bucketing” Trick
The real secret to its extrapolation capability is how it handles the “infinite” tail of potential distances. In the original paper (Shaw et al.) and T5, they don’t learn a unique bias for every integer to infinity. Instead, they clip the distance at a certain maximum (let’s say $k=128$).
\[\text{used-distance} = \min(|i - j|, k)\]This acts as a “catch-all” bucket.
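A minimal sketch of how such a clipped bias table could be built and added to the attention logits (a single scalar bias per distance is a simplification; T5, for example, learns one bias per attention head and uses log-spaced buckets):

```python
import torch
import torch.nn as nn

k = 128                                   # maximum distance that gets its own parameter
bias_table = nn.Embedding(k + 1, 1)       # one learned scalar bias per clipped distance

def relative_bias(seq_len: int) -> torch.Tensor:
    """Return a (seq_len, seq_len) matrix of learned biases, clipped at distance k."""
    pos = torch.arange(seq_len)
    dist = (pos[None, :] - pos[:, None]).abs()   # |i - j| for every pair
    dist = dist.clamp(max=k)                     # everything beyond k falls in the catch-all bucket
    return bias_table(dist).squeeze(-1)          # (seq_len, seq_len)

# Added to the attention logits before the softmax:
# scores = Q @ K.T / math.sqrt(d_k) + relative_bias(seq_len)
bias = relative_bias(seq_len=2048)   # works even if training only ever saw length 512
```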
Why This Enables Extrapolation
Imagine you train a model on sequences of length 512. The model learns precise relationships for distances 0–128 and a generic relationship for “anything further than 128.” When you deploy this model on a document with 10,000 tokens:
The model never encounters an “unknown” state. It simply categorizes all new, ultra-long distances into the “far away” category it already learned during training.
The “Near-Far” Analogy
Think of how you perceive objects: you can easily tell the difference between something 1 meter away and something 2 meters away, but something 500 meters away and something 510 meters away both just register as “far.”
Relative Bias works the same way. Once a word is “far enough” (past the clipping point $k$), the model stops caring about the exact meter-by-meter distance and just treats it as “background context.” This is what lets it handle arbitrarily long inputs without breaking.
While Rotary Positional Embeddings (RoPE) are naturally more flexible than absolute embeddings, they aren’t magic. If you train a model on 1,024 tokens and suddenly feed it 2,048, the attention mechanism will likely collapse. This happens because the model encounters rotation angles it never saw during training.
To fix this without a costly full retraining, researchers use two primary scaling “tricks”: Linear Scaling and NTK-Aware Scaling.
Linear Scaling (Position Interpolation)
Linear scaling is the “rubber band” approach. It stretches the original training range across the new, longer sequence.
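Concretely, every position index is divided by the scale factor $s = L_\text{new} / L_\text{train}$ before the rotation angles are computed, so the longer range of positions is squeezed back into the range seen during training. A minimal sketch, reusing the angle-table idea from the RoPE code above:

```python
import torch

def rope_angles_linear_scaled(seq_len: int, d_head: int,
                              train_len: int, base: float = 10000.0):
    """Position interpolation: divide positions by s = seq_len / train_len."""
    scale = max(seq_len / train_len, 1.0)                 # only compress, never stretch
    theta = 1.0 / (base ** (torch.arange(0, d_head, 2).float() / d_head))
    m = torch.arange(seq_len).float() / scale             # e.g. position 2047 behaves like 1023.5
    angles = torch.outer(m, theta)
    angles = torch.cat([angles, angles], dim=-1)
    return angles.cos(), angles.sin()

cos, sin = rope_angles_linear_scaled(seq_len=2048, d_head=64, train_len=1024)
```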
NTK-Aware Scaling
NTK-Aware scaling is a more surgical method. Instead of scaling the position index, we scale the base frequency ($b$) of the RoPE calculation.
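One commonly used recipe (originally shared informally, with several variants in circulation) keeps the position indices intact but enlarges the base to $b' = b \cdot s^{d/(d-2)}$, where $s$ is the length ratio and $d$ the head dimension. This leaves the highest-frequency dimensions almost untouched while stretching the lowest-frequency ones the most. A minimal sketch:

```python
import torch

def rope_angles_ntk_scaled(seq_len: int, d_head: int,
                           train_len: int, base: float = 10000.0):
    """NTK-aware scaling: scale the base b instead of the position index."""
    s = max(seq_len / train_len, 1.0)
    new_base = base * s ** (d_head / (d_head - 2))        # b' = b * s^(d / (d - 2))
    theta = 1.0 / (new_base ** (torch.arange(0, d_head, 2).float() / d_head))
    m = torch.arange(seq_len).float()                      # positions are NOT rescaled
    angles = torch.outer(m, theta)
    angles = torch.cat([angles, angles], dim=-1)
    return angles.cos(), angles.sin()

cos, sin = rope_angles_ntk_scaled(seq_len=2048, d_head=64, train_len=1024)
```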
Why High-Frequency Dims Don’t Break for NTK-Aware Scaling
A common question arises: if we are at token 2,048, the high-frequency dimensions will produce an absolute angle the model never saw in training. Why doesn’t this break the model?
The answer lies in two properties of the attention mechanism: First, the high-frequency sinusoids are periodic, and within the training context they already sweep through their full range of angles many times, so the angles produced at position 2,048 are (modulo $2\pi$) angles the model has effectively seen before. Second, the attention score depends only on the relative rotation $(m-n)\theta_i$, and the high-frequency dimensions mainly resolve short relative offsets, which look exactly as they did during training; the genuinely novel long-range offsets live in the low-frequency dimensions, which are precisely the ones NTK-Aware scaling stretches.