From the series: Models.
some berts and modernbert
Introduction
Published by Devlin et al. in 2018, BERT (Bidirectional Encoder Representations from Transformers) revolutionized natural language processing. From its inception, it set new benchmarks across tasks like text classification, named entity recognition, and question answering.
We can consider it one of the most important milestones in the evolution of NLP model architectures: it was one of the first transformer models to introduce contextual embeddings and achieve real-world success. (Take a look at my blog post "Journey of Embeddings"!)
While autoregressive models in the GPT lineage revolutionized natural language generation in particular, BERT-based models maintained their popularity in use cases centered on natural language understanding.
However, BERT's design was not without limitations. Its high computational cost, inefficiencies in fine-tuning, and inability to handle long sequences triggered a wave of improvements. Over the years, a series of successors refined BERT's architecture. Models like RoBERTa, ALBERT, DistilBERT, ELECTRA and MosaicBERT improved efficiency, while Longformer and DeBERTa extended its capabilities for long-context and nuanced tasks.
Ideas from all of these improved models pushed the field further and paved the way for a new generation of encoder-based architectures. Finally, ModernBERT arrived: a model designed to redefine the capabilities and efficiency of encoder-only transformers. Introduced in the late-2024 paper "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference" by Warner et al., ModernBERT represents a significant leap forward by addressing the limitations of previous models.
Unlike its predecessors, ModernBERT is not only optimized for shorter tasks like classification and retrieval but also excels in long-context applications, handling sequences up to 8,192 tokens, a substantial improvement over the 512-token limit of BERT. This extended capacity opens up opportunities in domains such as legal text analysis, long-form document retrieval, and code understanding, where handling extensive context efficiently is crucial.
At the heart of ModernBERT’s design are several important advancements:
Rotary Positional Embeddings (RoPE): Enabling effective processing of both short and long-context sequences while maintaining performance consistency.
Flash Attention: A breakthrough memory-efficient attention mechanism that reduces computational overhead during both training and inference.
Unpadding Mechanism: Inspired by MosaicBERT, this technique eliminates unnecessary padding tokens and improves efficiency for variable-length sequences.
Global and Local Attention: This mechanism lets the model attend to both global and local contexts, enhancing its ability to capture long-range dependencies efficiently.
Better Training Data: Trained on 2 trillion tokens spanning diverse domains such as web documents, scientific literature, and code.
By synthesizing previous advancements and introducing its own, ModernBERT establishes itself as a Pareto improvement, delivering better performance while demanding fewer resources.
It achieves state-of-the-art results on GLUE and BEIR benchmarks, all while maintaining a balance between accuracy, speed, and resource efficiency. This balance makes ModernBERT not only a technological achievement but also a practical solution for real-world NLP challenges.
Of course, I will not just mention ModernBERT as if writing a social media post and stop here. I almost could not help writing a "Journey of ModernBERT" in the style of my previous posts, but that could have taken an era, so I tried to find a balance.
In the following sections, we will explore what happened after BERT, what the main ideas were, how we got to ModernBERT, and how ModernBERT works.
Let's start by reviewing BERT.
BERT
After the publication of the paper "Attention Is All You Need" by Vaswani et al. in 2017, the transformer architecture entered the NLP scene. BERT was one of the first models that successfully used this paradigm in a form ready for industrial settings.
BERT achieved its success through two main ideas: masked language modeling with a bidirectional attention mechanism, and the pretraining/fine-tuning paradigm.
Masked Language Modeling (MLM)
MLM is a pretraining objective where a certain percentage of tokens in the input text are randomly replaced with a special [MASK] token. The model is then trained to predict these masked tokens based on the surrounding context. This approach enables the model to learn bidirectional representations of language, meaning it considers both the tokens before and after the masked token when making predictions.
For example, in the sentence:
"BERT achieved its [MASK] through its two key innovations"
BERT uses the words "BERT achieved its" and "through its two key innovations" to infer that the missing word is likely "success" or another fitting noun. This bidirectional analysis ensures that BERT captures how words are influenced by their left and right context, which is critical for understanding nuanced relationships in language.
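To make this concrete, here is a minimal sketch of masked-token prediction using the Hugging Face transformers fill-mask pipeline with the publicly available bert-base-uncased checkpoint. The exact predictions and scores will vary, but a fitting noun such as "success" typically ranks highly.

```python
from transformers import pipeline

# Load a pretrained BERT checkpoint behind a fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

sentence = "BERT achieved its [MASK] through its two key innovations"
for prediction in fill_mask(sentence, top_k=3):
    # Each prediction contains the proposed token and the model's confidence.
    print(prediction["token_str"], round(prediction["score"], 3))
```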
Both GPT and BERT create contextual embeddings, but their contextual scope differs. GPT's unidirectional embeddings are sufficient for tasks like generation, while BERT's bidirectional embeddings provide a richer understanding of language for comprehension-focused tasks.
Pretraining/Fine-Tuning Paradigm
BERT's pretraining/fine-tuning paradigm was another critical innovation that reshaped NLP. During the pretraining phase, BERT learns general language representations by optimizing its objectives: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP), the latter initially included but later shown to be less impactful.
Once pretrained, BERT can be fine-tuned for specific downstream tasks with relatively small datasets and minimal additional training.
This two-step paradigm significantly lowered the barrier to developing high-performing NLP models for various applications. There was no need to train models from scratch for every task, making BERT a versatile foundation for real-world applications.
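As an illustration, here is a minimal fine-tuning sketch (not the exact recipe from the paper): a pretrained BERT encoder is loaded with a fresh classification head, and all weights are updated on a small labeled batch. The texts and labels below are placeholders.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2  # e.g., a binary sentiment task
)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(texts, labels):
    # Tokenize a batch, compute the classification loss, and update all weights,
    # i.e., full fine-tuning of the pretrained encoder plus the new head.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    outputs = model(**batch, labels=torch.tensor(labels))
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

loss = training_step(["great movie", "terrible plot"], [1, 0])
```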
Upon release, BERT topped the leaderboards for GLUE (General Language Understanding Evaluation) and SQuAD (Stanford Question Answering Dataset), demonstrating state-of-the-art capabilities in natural language understanding. Within months, Google incorporated BERT into its search engine, improving the relevance of search results for over 10% of queries according to its official blog post "Understanding searches better than ever before".
Despite its revolutionary approach, BERT had key limitations:
Heavy Computational Requirements: Its large model size and high memory demands limited real-world usability.
Limited Efficiency in Fine-Tuning: Adapting BERT to specific tasks still required significant resources.
Struggles with Long Sequences: BERT could process only a limited number of tokens, making it less effective for long-context tasks.
After BERT
RoBERTa
RoBERTa, introduced by Liu et al. in 2019 as a robust optimization of BERT, focused on refining the pretraining process and improving the training datasets to extract maximum performance from the original architecture.
In the original BERT implementation, the same tokens in a given training example were masked every epoch. This led to less variation in the training signals the model received. With dynamic masking, RoBERTa randomly selects different tokens to mask each time an example is seen.
For example, consider the sentence:
"BERT achieved its success through its two key innovations."
In the first epoch, the model might mask the word "success", resulting in the input: "BERT achieved its [MASK] through its two key innovations."
In the second epoch, the model might mask a different token, such as "two", resulting in the input: "BERT achieved its success through its [MASK] key innovations."
This dynamic variation forces the model to infer context under a broader range of conditions, making it less likely to overfit to specific masking patterns. It encourages the model to explore the relationships between tokens in different contexts, almost like augmenting the data itself, and this enhances its ability to learn robust, generalized bidirectional representations.
Dynamic masking also helps capture more nuanced relationships in language:
When "success" is masked, the model learns the relationship between "achieved" and "through its two key innovations."
When "two" is masked, the model focuses on numeric relationships and the structure of the phrase.
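A minimal sketch of dynamic masking is shown below. It is simplified (it always substitutes the [MASK] token, while BERT and RoBERTa also keep or randomly replace a fraction of the selected tokens), but it captures the key point: mask positions are resampled every time an example is seen.

```python
import random

def dynamically_mask(token_ids, mask_token_id, mask_prob=0.15):
    # Called once per epoch per example, so the masked positions differ each time.
    masked_input = list(token_ids)
    labels = [-100] * len(token_ids)         # -100 = position ignored by the MLM loss
    for i, token_id in enumerate(token_ids):
        if random.random() < mask_prob:
            labels[i] = token_id             # the original token becomes the target
            masked_input[i] = mask_token_id  # the input sees [MASK] instead
    return masked_input, labels
```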
Moreover, RoBERTa was pretrained on much larger datasets than BERT, including BookCorpus, CC-News, OpenWebText, and Stories. This extended training allowed it to generalize better across tasks. Remember, in deep learning models, scale matters.
These changes significantly boosted RoBERTa’s performance across multiple benchmarks. For example, on GLUE, RoBERTa achieved an average score of 88.5, outperforming BERT’s 83.1. On SQuAD v1.1, it achieved an F1 score of 94.6 compared to BERT’s 88.5. By addressing inefficiencies in BERT’s pretraining, RoBERTa established itself as a powerful model for tasks requiring robust contextual understanding, making it a preferred choice for production environments needing enhanced accuracy.
ALBERT
ALBERT, introduced by Lan et al. in 2020, aimed to make transformer-based models more resource-efficient while maintaining competitive performance.
It tackled one of BERT's primary drawbacks, its large memory footprint, through two key innovations: cross-layer parameter sharing and factorized embeddings.
Cross-Layer Parameter Sharing
Transformer-based models like BERT normally require a unique set of parameters for every layer, which leads to significant memory consumption and inefficiency. ALBERT changes this by reusing parameters across layers, drastically reducing the number of parameters in the model.
Many of the weights in deeper layers of BERT are highly correlated, meaning they learn redundant or similar transformations. So, instead of learning them repeatedly, sharing parameters would exploit this redundancy.
In ALBERT, weights for certain components such as the attention mechanism and feedforward networks are shared across all transformer layers. For example, the same set of weights for the feedforward network is applied at every layer of the model.
Instead of learning a unique transformation at each layer, ALBERT applies the same transformation iteratively across all layers.
While the parameters are shared, the intermediate representations at each layer remain distinct because they depend on the input processed through the shared parameters.
This innovation made the model lighter without compromising the performance too much.
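A minimal sketch of this idea in PyTorch: one encoder layer is instantiated and applied repeatedly, so every depth reuses the same weights. The layer sizes here are illustrative, not ALBERT's actual configuration.

```python
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    def __init__(self, hidden_size=512, num_heads=8, num_layers=12):
        super().__init__()
        self.num_layers = num_layers
        # A single set of layer weights instead of num_layers separate sets.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=num_heads, batch_first=True
        )

    def forward(self, x):
        for _ in range(self.num_layers):
            x = self.shared_layer(x)  # same parameters, different activations per depth
        return x
```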
Factorized Embeddings
The embedding layer in transformer-based models like BERT directly tied the vocabulary size (number of tokens in the vocabulary) to the hidden layer size (dimensions of the internal representations of the model). This coupling led to massive memory requirements, particularly for large vocabularies, as the embedding matrix was often the largest component in terms of storage.
For instance, if a model had a vocabulary of 30,000 tokens and a hidden layer size of 512 dimensions, the embedding matrix would require 30,000 × 512 ≈ 15.4 million parameters. This placed a significant burden on memory, especially in environments with constrained resources.
ALBERT decoupled these by introducing a smaller embedding dimension and a separate projection layer to match the hidden layer size. This design significantly reduced the size of the embedding matrix. This change was an important milestone in making transformer models more lightweight and accessible for deployment.
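A minimal sketch with the illustrative sizes from above (30,000 tokens, 128-dimensional embeddings projected up to a 512-dimensional hidden size):

```python
import torch.nn as nn

vocab_size, embed_dim, hidden_size = 30_000, 128, 512

# BERT-style tied embedding: 30,000 x 512 = 15,360,000 parameters.
tied_embedding = nn.Embedding(vocab_size, hidden_size)

# ALBERT-style factorized embedding:
# 30,000 x 128 + 128 x 512 ≈ 3.9 million parameters, roughly 4x smaller.
factorized_embedding = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),  # token id -> small embedding
    nn.Linear(embed_dim, hidden_size),    # project up to the hidden size
)
```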
These innovations enabled ALBERT to achieve state-of-the-art results on several benchmarks, including GLUE and SQuAD. On GLUE, ALBERT achieved an overall score of 89.4, surpassing BERT's 83.1, while maintaining a significantly smaller parameter footprint. On SQuAD v1.1, ALBERT recorded an F1 score of 92.2 compared to BERT's 88.5.
DistilBERT
DistilBERT, introduced by Sanh et al. in 2019, was one of the first successful applications of knowledge distillation to transformer models. Knowledge distillation is a process where a smaller model (the student) learns to mimic a larger, pretrained model (the teacher), capturing most of its capabilities while significantly reducing size and computational requirements.
Knowledge Distillation
At the core of DistilBERT’s knowledge distillation process is the use of soft labels, which are the probability distributions produced by the teacher model for each output. Unlike traditional training that relies on hard labels (one-hot encoded ground truth), soft labels encode richer information about the teacher’s confidence in its predictions and the relationships between classes.
For example, in a classification task, while hard labels might indicate:
[bicycle=1, plane=0, car=0]
soft labels from the teacher model might convey:
[bicycle=0.85, plane=0.05, car=0.10]
The student is trained to mimic these soft labels, capturing not just the correct answer but also the relative confidence in each class.
This probabilistic knowledge transfer enables the student model to generalize better, even with significantly fewer parameters.
(It resembles the transition from one-hot encodings to dense word vectors. Fractal nature of innovations!)
In addition to soft labels, DistilBERT leverages the intermediate representations of the teacher model to align the internal behaviors of the student and the teacher. By replicating the teacher's embeddings and attention patterns, the student learns to capture the same latent knowledge while operating with reduced computational complexity. This step enhances the student's ability to mirror the teacher's contextual understanding.
The training process combines multiple loss functions (sketched in code after this list):
1. Distillation Loss: Measures the divergence between the student’s output probabilities and the teacher’s soft labels (e.g., using Kullback-Leibler divergence).
2. Supervised Loss: Ensures the student performs well on the original task using hard labels (e.g., cross-entropy loss).
3. Intermediate Layer Loss: Encourages the student’s hidden states to match those of the teacher, aligning their internal representations.
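A minimal sketch of such a combined objective is below. The temperature, the loss weights, and the use of mean-squared error for the intermediate-layer term are illustrative choices, not DistilBERT's exact recipe.

```python
import torch.nn.functional as F

def distillation_objective(student_logits, teacher_logits, labels,
                           student_hidden, teacher_hidden,
                           temperature=2.0, alpha=0.5, beta=0.1):
    # 1. Distillation loss: KL divergence between temperature-softened distributions.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd_loss = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2

    # 2. Supervised loss: ordinary cross-entropy against the hard labels.
    ce_loss = F.cross_entropy(student_logits, labels)

    # 3. Intermediate layer loss: align the student's hidden states with the teacher's.
    hidden_loss = F.mse_loss(student_hidden, teacher_hidden)

    return alpha * kd_loss + (1 - alpha) * ce_loss + beta * hidden_loss
```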
DistilBERT’s lightweight architecture made it ideal for applications requiring efficient real-time processing, such as mobile applications and web services.
With a 40% reduction in parameters, DistilBERT achieved an overall GLUE score of 82.3, compared to BERT's 83.1.
Also, on SQuAD v1.1, DistilBERT recorded an F1 score of 86.9, which was close to BERT’s 88.5, demonstrating its ability to maintain high accuracy despite its significantly smaller size and faster inference speeds.
These results proved that distillation techniques could make transformer models accessible for resource-constrained environments without substantial trade-offs in performance.
ELECTRA
ELECTRA, introduced by Clark et al. in 2020, took a novel approach to pretraining by combining a generator-discriminator framework, fundamentally changing how masked tokens were handled. Unlike BERT, which masks tokens and predicts them directly, ELECTRA uses a two-step process.
Step 1: Generator
For simulating a realistic corruption process, a lightweight generator replaces selected tokens in the input sequence with plausible alternatives.
The generator is trained with a masked language modeling (MLM) objective similar to BERT. But it is designed to be small and efficient as it plays only a supporting role in the framework.
Step 2: Discriminator
The discriminator is trained to identify whether each token in the input is original or replaced. This binary classification task forces the model to attend to subtle contextual clues across all tokens in the sequence, not just the masked ones.
For example:
Original Sequence: "Unlike BERT which masks tokens and predicts them directly, ELECTRA uses a two-step process."
Generator Output: "Unlike BERT which masks tokens and guesses them directly, ELECTRA uses a two-step process." (the masked token "predicts" has been replaced with the plausible alternative "guesses")
Discriminator Task: Predict, for each token, whether it is original or replaced; here, "guesses" should be flagged as replaced.
This process makes every token a learning opportunity and dramatically increases the density of training signals compared to BERT.
BERT’s pretraining relies on predicting a small percentage of masked tokens (e.g., 15%). In contrast, ELECTRA utilizes all tokens in the sequence by training the discriminator to assess each one. This results in richer and more efficient learning, as the model captures relationships across the entire input.
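A minimal sketch of this replaced-token-detection setup is below. The generator and discriminator are assumed to be small callable modules returning per-token logits; greedy sampling is used purely for brevity.

```python
import torch
import torch.nn.functional as F

def rtd_step(generator, discriminator, input_ids, mask_positions, mask_token_id):
    # mask_positions: boolean tensor marking which tokens to corrupt.
    # Step 1: the generator fills masked positions with plausible tokens.
    masked_inputs = input_ids.clone()
    masked_inputs[mask_positions] = mask_token_id
    gen_logits = generator(masked_inputs)          # (batch, seq_len, vocab_size)
    sampled = gen_logits.argmax(dim=-1)            # greedy sampling for simplicity

    corrupted = input_ids.clone()
    corrupted[mask_positions] = sampled[mask_positions]

    # Step 2: the discriminator labels every token as original (0) or replaced (1).
    is_replaced = (corrupted != input_ids).float()
    disc_logits = discriminator(corrupted).squeeze(-1)   # (batch, seq_len)
    return F.binary_cross_entropy_with_logits(disc_logits, is_replaced)
```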
ELECTRA’s pretraining approach dramatically reduces the computational cost compared to BERT. For instance, it achieved comparable performance on benchmarks like GLUE while requiring only 25% of the compute during pretraining.
On the GLUE benchmark, ELECTRA recorded a score of 88.6, matching RoBERTa's 88.5, but with significantly lower resource requirements.
This made ELECTRA particularly appealing for scenarios where computational efficiency is critical, such as real-time systems and environments with constrained resources.
Longformer
Longformer was introduced by Beltagy et al. in 2020 to address one of the key limitations of BERT: its inability to handle long input sequences. The main reason for this limitation is the quadratic computational complexity of the attention mechanism.
Longformer achieves this improvement through sparse attention, a mechanism that reduces the complexity of self-attention from quadratic to linear with respect to the sequence length.
In standard transformer architectures, the self-attention mechanism computes attention scores for all pairs of tokens in the input sequence. This results in a computational complexity of O(n^2) where n is the sequence length. This limitation makes processing long sequences prohibitively expensive.
Longformer reduces this complexity to O(n) by introducing sparse attention, which focuses attention computations on a subset of tokens.
It employs a combination of:
Local Attention: Each token attends only to a fixed window of nearby tokens to capture local context efficiently. This is particularly useful for tasks where adjacent tokens are most relevant.
Global Attention: Specific tokens, such as classification tokens or task-relevant keywords, are designated as global tokens. This allows them to attend to all tokens in the sequence and ensures that critical global context is not lost.
Longformer introduces flexibility through customizable attention patterns to enable task-specific configurations.
For example (see the sketch after these examples):
- In classification tasks, the CLS token can be assigned global attention.
- In summarization, key sentences or paragraphs can receive global attention to enhance the ability of the model to capture critical information.
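Here is a minimal sketch of how such a sparse pattern can be constructed as a boolean attention mask. The window size and global positions are illustrative, and Longformer's actual implementation relies on custom kernels rather than a dense mask.

```python
import torch

def sparse_attention_mask(seq_len, window=4, global_positions=(0,)):
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for i in range(seq_len):
        lo, hi = max(0, i - window), min(seq_len, i + window + 1)
        allowed[i, lo:hi] = True          # local attention: a fixed window of neighbors
    for g in global_positions:            # e.g., the CLS token at position 0
        allowed[g, :] = True              # the global token attends to every token
        allowed[:, g] = True              # and every token attends back to it
    return allowed

mask = sparse_attention_mask(seq_len=16)
print(mask.sum().item(), "allowed pairs out of", 16 * 16)
```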
This adaptability allows Longformer to succeed in a variety of NLP tasks requiring long-context understanding, such as summarization and long-form question answering. For example, on the TriviaQA benchmark, Longformer achieved an F1 score of 79.4, significantly surpassing BERT's 72.6.
In document classification tasks, Longformer handles sequences up to 4096 tokens efficiently, compared to BERT’s 512.
Its ability to efficiently handle long sequences has made it a valuable tool in domains such as document understanding, legal text analysis, and biomedical research where large context windows are crucial for success.
DeBERTa
DeBERTa (Decoding-enhanced BERT with Disentangled Attention), introduced by He et al. in 2021, brought significant advancements in handling embeddings and attention mechanisms. Its primary innovation lies in disentangling content and positional embeddings.
Disentangled Content and Positional Embeddings
Traditional transformer models like BERT combine content and positional information into a single embedding. Each word is basically represented as a sum of its word embedding and position embedding. While effective, this coupling can lead to interference between semantic and positional relationships and can reduce the model’s ability to represent these aspects independently.
DeBERTa addresses this limitation by separating content embeddings (semantic information of tokens) and positional embeddings (relative positioning of tokens in the sequence). Each word is represented by these two vectors.
This disentanglement allows the model to represent semantic meaning and positional relationships independently, as sketched below.
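A minimal sketch of the resulting attention scores: content and relative-position vectors get separate projections, and their pairwise interactions are summed. Shapes are illustrative, and the relative-position lookup is heavily simplified compared to the paper.

```python
import torch
import torch.nn as nn

hidden, seq_len = 512, 16
Wq_c, Wk_c = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden)  # content projections
Wq_r, Wk_r = nn.Linear(hidden, hidden), nn.Linear(hidden, hidden)  # position projections

content = torch.randn(seq_len, hidden)   # token (content) embeddings
rel_pos = torch.randn(seq_len, hidden)   # relative-position embeddings (simplified)

c2c = Wq_c(content) @ Wk_c(content).T    # content-to-content interactions
c2p = Wq_c(content) @ Wk_r(rel_pos).T    # content-to-position interactions
p2c = Wq_r(rel_pos) @ Wk_c(content).T    # position-to-content interactions

scores = (c2c + c2p + p2c) / (3 * hidden) ** 0.5  # scaled to account for the three terms
attention_weights = scores.softmax(dim=-1)
```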
Enhanced Mask Decoder
DeBERTa further enhances its pretraining objective through the Enhanced Mask Decoder (EMD), a mechanism that refines how the model predicts masked tokens in the Masked Language Modeling (MLM) task of BERT. Instead of predicting each masked token independently, EMD evaluates pairwise relationships between the masked token and all other tokens in the sequence to capture richer contextual interactions.
The EMD achieves this through:
- Pairwise Decoding: Instead of treating each masked token prediction as isolated, the decoder models the relationships between the masked token and its surrounding tokens.
- Contextual Interaction: By combining the disentangled content embeddings and relative position embeddings, EMD processes the interactions between tokens more effectively.
- Dense Training Signal: The pairwise decoding mechanism increases the amount of useful training information extracted from each sequence.
These innovations allowed DeBERTa to achieve state-of-the-art results on multiple NLP benchmarks. For example, on GLUE, DeBERTa scored 90.1, surpassing previous models like RoBERTa (88.5) and BERT (83.1). On SQuAD v2.0, DeBERTa achieved an F1 score of 89.9, significantly improving over BERT’s 79.0.
Following the success of DeBERTa v1, DeBERTa v2 introduced refinements to enhance efficiency, scalability, and pretraining effectiveness.
It built upon the disentangled embeddings and relative position bias of its predecessor while incorporating an ELECTRA-style generator-discriminator framework (also experimented with in the DeBERTa paper as a replaced-token-detection objective), leading to a more data-efficient and powerful language model.
Also, DeBERTa v2 was pretrained on 1.5TB of text, significantly larger than v1’s 78GB dataset. The dataset included CC-News, OpenWebText, Wikipedia, and BooksCorpus to improve generalization across NLP tasks.
DeBERTa v2 outperformed v1 across multiple benchmarks:
- GLUE Benchmark: 90.1 (v2) vs. 88.3 (v1)
- SQuAD v2.0 F1 Score: 89.9 (v2) vs. 88.2 (v1)
- SuperGLUE Score: 89.2 (v2) vs. 87.8 (v1)
While DeBERTa v2 improved efficiency, it still required significant compute resources. DeBERTa v3 further improved the model's performance and efficiency.
Gradient-Disentangled Objective
DeBERTa v3’s most notable innovation was the Gradient-Disentangled Objective (GDC), which aimed to mitigate gradient conflicts between MLM (Masked Language Modeling) and RTD (Replaced Token Detection) objectives.
In models like DeBERTa v2 (which used both MLM and RTD for pretraining), conflicting gradients from these two objectives could interfere with learning, leading to suboptimal representations. GDC decouples these gradients, allowing the model to learn from both objectives independently and improving training efficiency.
GDC introduces a gradient projection layer that separates the updates from MLM and RTD, ensuring that one does not negatively impact the learning signal of the other.
(disentangling again!)
This disentangling mechanism reduces the required training steps and makes DeBERTa v3 more cost-efficient while maintaining state-of-the-art performance.
With GDC and optimizations in pretraining efficiency, DeBERTa v3 achieved superior performance across multiple tasks:
- GLUE Benchmark: 91.3 (v3) vs. 90.1 (v2)
- SQuAD v2.0 F1 Score: 91.1 (v3) vs. 89.9 (v2)
- SuperGLUE Score: 90.5 (v3) vs. 89.2 (v2)
MosaicBERT
MosaicBERT, introduced by Portes et al. in 2023, is an efficient transformer architecture and training recipe designed to make BERT-style pretraining fast and affordable. While large transformer models like BERT and RoBERTa deliver state-of-the-art performance, they often come at the cost of high memory usage and computational inefficiency.
MosaicBERT integrates several architectural advancements that collectively improve both speed and accuracy. These include FlashAttention, Attention with Linear Biases (ALiBi), Gated Linear Units (GLU), unpadding mechanism, and low-precision LayerNorm, all of which significantly enhance training efficiency.
Unpadding Mechanism
In BERT, input sequences are padded to a fixed length (e.g., 512 tokens) to ensure efficient batch processing. This is necessary because transformers operate on fixed-size tensors, and modern deep learning frameworks (like PyTorch and TensorFlow) require uniform input dimensions for parallel computation. However, this padding introduces significant inefficiencies, especially when dealing with sequences of varying lengths.
BERT employs attention masks to prevent the model from attending to [PAD] tokens during self-attention computations. However, despite this masking, padded tokens still consume memory and computational resources, since they are passed through all layers of the model.
Instead of processing padded sequences through all transformer layers, MosaicBERT dynamically tracks actual token lengths, eliminates padding, and reindexes tokens for efficient computation. This optimization lets the self-attention mechanism focus only on meaningful tokens, significantly reducing FLOPs and memory usage.
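A minimal sketch of the unpad/repad bookkeeping is below. Real implementations fuse this with the attention kernels; here it is spelled out with plain tensor indexing.

```python
import torch

def unpad(hidden_states, attention_mask):
    # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = real token.
    batch, seq_len, hidden = hidden_states.shape
    flat_mask = attention_mask.reshape(-1).bool()
    indices = torch.nonzero(flat_mask).squeeze(-1)        # positions of real tokens only
    tokens = hidden_states.reshape(-1, hidden)[indices]   # (num_real_tokens, hidden)
    return tokens, indices

def repad(tokens, indices, batch, seq_len):
    hidden = tokens.shape[-1]
    out = torch.zeros(batch * seq_len, hidden, dtype=tokens.dtype)
    out[indices] = tokens                                 # scatter results back into place
    return out.reshape(batch, seq_len, hidden)
```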
MosaicBERT employs an architecture that removes unnecessary complexity while maintaining the core transformer structure.
GeGLU
Instead of the traditional feed-forward networks (FFNs) in transformer layers, MosaicBERT uses Gated Linear Units (GLU) with the Gaussian Error Linear Unit (GeLU) activation, i.e., GeGLU. This improves parameter efficiency and enhances expressivity while maintaining speed.
Flash Attention Integration
To further optimize computational efficiency, MosaicBERT incorporates Flash Attention, a technique that reduces the amount of memory-intensive operations involved in self-attention. This allows the model to scale efficiently while maintaining high-speed execution.
Low-Precision LayerNorm
The model uses bfloat16 precision instead of float32 (as in BERT) for LayerNorm operations. This reduces memory bandwidth requirements and improves throughput without compromising numerical stability.
Attention with Linear Biases (ALiBi)
Instead of BERT's learned positional embeddings, MosaicBERT employs ALiBi, which introduces a position-aware bias directly into the attention scores. This approach reduces overhead and enables better generalization to longer sequences.
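A minimal sketch for a single attention head: a penalty proportional to token distance is added to the raw attention scores before the softmax. The slope is illustrative; ALiBi assigns a different slope to each head, and for a bidirectional encoder the penalty is symmetric in distance.

```python
import torch

def alibi_bias(seq_len, slope=0.5):
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()  # |i - j| for every pair
    return -slope * distance          # more distant tokens receive larger penalties

scores = torch.randn(8, 8)            # raw query-key scores for one head, seq_len = 8
attention_weights = (scores + alibi_bias(8)).softmax(dim=-1)
```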
On the GLUE benchmark, MosaicBERT achieved an average score of 79.6 in just 1.13 hours on 8×A100 GPUs, at a fraction of the cost of traditional BERT training setups.
MosaicBERT-Base achieves an accuracy of 83.2 on GLUE in just 4.6 hours, while a standard BERT-Base requires 11.5 hours on the same hardware.
The combination of unpadding, Flash Attention, and GeGLU allows MosaicBERT to be Pareto-optimal, meaning it consistently outperforms standard BERT in the tradeoff between accuracy and training time.
The cumulative effect of all of these works and experiments pushed the field further.
ModernBERT
ModernBERT, introduced in the 2024 paper "Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference", represents a significant leap in the development of encoder-only transformer models. Authored by researchers from institutions including Answer.AI, LightOn, and Johns Hopkins University, ModernBERT builds on the legacy of BERT by incorporating contemporary innovations that address longstanding limitations in efficiency, scalability, and context handling.
ModernBERT is designed to achieve three primary goals:
- Efficiency: Optimized for fast inference and reduced memory usage
- Versatility: Effective in handling both short- and long-context tasks
- Scalability: Natively capable of processing sequences up to 8192 tokens
So, let's talk about the techniques.
Rotary Positional Embeddings
In BERT, each token has two components in its final embedding:
Token embedding (semantic meaning): Represents the meaning of a word, e.g., "BERT" is represented as a vector E_BERT.
Positional embedding (absolute position): Represents the position of the word in the sequence, e.g., position 1 is P1, position 2 is P2, and so on.
For the sentence "In BERT, each token has two components", BERT computes:
E_In + P1, E_BERT + P2, E_each + P3, E_token + P4, E_has + P5, E_two + P6, E_components + P7
While absolute positional embeddings allow models like BERT to encode token positions in a sequence, they treat positions as fixed, independent vectors. This approach lacks an explicit representation of the relative relationships between tokens, such as how close or far apart they are. As a result, models using absolute embeddings rely on the self-attention mechanism to infer these relationships indirectly, which can make it harder to generalize effectively across sequences of varying lengths or handle long contexts efficiently.
RoPE, on the other hand, applies a rotational transformation to the token embeddings to encode relative positional information directly into the attention mechanism. It was first proposed in "RoFormer: Enhanced Transformer with Rotary Position Embedding" by Su et al. in 2021.
For the same sentence, RoPE rotates each token embedding by a position-dependent angle: Rotated(E_In, θ1), Rotated(E_BERT, θ2), Rotated(E_each, θ3), Rotated(E_token, θ4), Rotated(E_has, θ5), Rotated(E_two, θ6), Rotated(E_components, θ7)
Now, differences between angles represent relative distances.
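A minimal sketch of the rotation itself: each pair of embedding dimensions is rotated by an angle that grows with the token's position, so the dot product between two rotated vectors ends up depending on their relative distance. The frequency schedule follows the usual 10000^(-2i/d) convention, and the dimension pairing is simplified.

```python
import torch

def apply_rope(x):
    # x: (seq_len, dim) with an even dim; dimensions are paired as (even, odd).
    seq_len, dim = x.shape
    positions = torch.arange(seq_len, dtype=torch.float32)[:, None]           # (seq_len, 1)
    freqs = 10000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)    # (dim/2,)
    angles = positions * freqs                                                # (seq_len, dim/2)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = torch.empty_like(x)
    rotated[:, 0::2] = x_even * angles.cos() - x_odd * angles.sin()
    rotated[:, 1::2] = x_even * angles.sin() + x_odd * angles.cos()
    return rotated

# e.g., the 7 tokens of the example sentence, each with a 64-dimensional embedding
rotated_queries = apply_rope(torch.randn(7, 64))
```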
Relative Positional Awareness: RoPE makes the model aware of the relative distances between tokens, which is crucial for capturing relationships like "which word modifies which" or "what is the main subject".
Generalization to Long Contexts: Since RoPE doesn’t rely on a fixed positional embedding table like BERT, it can extend to longer sequences without predefined limits.
Improved Attention Mechanism: By encoding relative positions directly into the token embeddings, the attention mechanism can more effectively weigh relationships between distant tokens.
Global and Local Attention
ModernBERT leverages a mix of global and local attention to optimize efficiency and capture both short-range and long-range dependencies.
While Longformer and similar models use predefined local and global attention patterns, ModernBERT optimizes attention placement dynamically. Instead of static global tokens, ModernBERT can assign different levels of global attention based on task-specific needs. This means that for some tasks (like classification) the [CLS] and task-specific tokens receive stronger global attention, whereas for retrieval tasks important spans of text may be dynamically assigned global attention.
By balancing computational efficiency with effective context retention, ModernBERT achieves superior performance on long-context tasks without the quadratic cost of full self-attention.
Flash Attention
ModernBERT incorporates Flash Attention, an optimized attention mechanism introduced by Dao et al. in 2022. Flash Attention was designed to improve both the speed and memory efficiency of self-attention calculations in transformers, particularly for long sequences. Traditional self-attention mechanisms, as used in BERT, compute attention scores across all token pairs and store large attention matrices. Naturally, this leads to high memory consumption and slow processing times.
Flash Attention eliminates these inefficiencies by introducing three key optimizations:
Blocked Attention Computation: Traditional attention computes a full n x n attention matrix, where n is the sequence length, leading to quadratic complexity. Flash Attention processes attention in small blocks that fit within fast on-chip memory (SRAM), significantly reducing memory overhead.
Memory-Efficient Backpropagation: Standard transformers store all intermediate attention scores, leading to excessive memory use. Flash Attention does not store the attention matrix during the forward pass; instead, it recomputes the values needed for gradients on demand, using on-chip memory.
Hardware Optimization: Traditional attention mechanisms perform many unnecessary memory reads/writes to slow DRAM. Flash Attention schedules computations to minimize memory access, leveraging modern GPUs' fast SRAM for significantly faster execution.
By integrating Flash Attention, ModernBERT achieves up to 3x faster self-attention computations while using significantly less memory.
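Flash Attention is a fused GPU kernel rather than something to re-implement in a few lines. As a usage sketch, PyTorch 2.x exposes torch.nn.functional.scaled_dot_product_attention, which can dispatch to a Flash-style fused kernel on supported GPUs (on other backends it silently falls back to a standard implementation):

```python
import torch
import torch.nn.functional as F

batch, heads, seq_len, head_dim = 2, 8, 1024, 64
q = torch.randn(batch, heads, seq_len, head_dim)
k = torch.randn(batch, heads, seq_len, head_dim)
v = torch.randn(batch, heads, seq_len, head_dim)

# With a fused (Flash-style) kernel, the full seq_len x seq_len attention matrix
# is never materialized in GPU memory; computation happens block by block.
out = F.scaled_dot_product_attention(q, k, v)
```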
Unpadding
As in MosaicBERT, an unpadding method is applied. However, the implementation differs in scope, execution, and efficiency gains.
MosaicBERT optimizes transformer efficiency by removing padding tokens before computation and dynamically tracking token lengths. Standard transformers process all tokens (including padded ones) through every layer and waste compute on non-informative tokens. While attention masks prevent padded tokens from influencing results, they still consume memory and FLOPs because padded sequences are fed through every layer.
ModernBERT takes unpadding optimization further by removing padding tokens before the token embedding layer itself and ensuring that no padded tokens ever enter the main model pipeline. Unlike MosaicBERT’s approach, ModernBERT completely avoids the need for intermediate repadding.
GeGLU
ModernBERT uses GeGLU, a GeLU-based gated linear unit, instead of the traditional GeLU activation function. GeGLU enhances the model's expressivity by introducing an additional gating mechanism, allowing the model to control the flow of information dynamically.
GeLU applies a simple non-linear transformation to the input. GeGLU instead splits the input into two parts, applies the activation to one part, and uses the other as a gate. This allows for better feature selection and improves generalization.
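A minimal sketch of a GeGLU feed-forward block (sizes are illustrative, and which half receives the activation is a matter of convention):

```python
import torch.nn as nn
import torch.nn.functional as F

class GeGLUFeedForward(nn.Module):
    def __init__(self, hidden_size=512, intermediate_size=2048):
        super().__init__()
        self.in_proj = nn.Linear(hidden_size, 2 * intermediate_size)  # produces both halves
        self.out_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x):
        value, gate = self.in_proj(x).chunk(2, dim=-1)
        return self.out_proj(F.gelu(value) * gate)  # one half is activated, the other gates it
```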
Normalization after Embeddings
Unlike BERT, which applies LayerNorm at multiple points, ModernBERT introduces early normalization right after the embedding layer. This ensures that token embeddings are properly scaled before they enter the transformer layers, leading to more stable training and improved generalization.
Disabling Bias Terms
In large models, bias parameters contribute marginally to performance but increase computation. ModernBERT removes bias terms from fully connected layers, a technique that has been shown to improve generalization and reduce parameter redundancy. This works particularly well when paired with normalization layers, which naturally compensate for the absence of biases.
Training Data
Of course, the data... ModernBERT follows the trend of leveraging high-quality, large-scale datasets to enhance both generalization and task-specific performance. Unlike the original BERT, which was trained on BookCorpus and English Wikipedia (16GB of text data), ModernBERT benefits from a significantly more diverse, high-volume dataset.
ModernBERT is trained on 2 trillion tokens, sourced from a mix of web content, academic literature, code repositories, and domain-specific texts. This expanded dataset ensures better coverage of linguistic patterns, terminology, and factual knowledge, making ModernBERT more robust across various applications.
Evaluation
This highly diverse dataset enables ModernBERT to outperform its predecessors in natural language understanding (GLUE, BEIR), retrieval tasks (DPR, MLDR), and long-context reasoning.
ModernBERT demonstrates state-of-the-art performance on the GLUE benchmark, showcasing its robustness across various sentence-pair understanding tasks:
- ModernBERT-base: GLUE score of 88.4, surpassing BERT-base (84.7) and RoBERTa-base (86.4).
- ModernBERT-large: GLUE score of 90.4, outperforming BERT-large (85.2) and RoBERTa-large (88.9).
ModernBERT excels in information retrieval tasks under the Dense Passage Retrieval (DPR) and ColBERT settings:
BEIR benchmark (DPR setting)
- ModernBERT-base: Achieved a score of 41.6, narrowly surpassing GTE-en-MLM-base (41.4).
- ModernBERT-large: Scored 44.0, outperforming GTE-en-MLM-large (42.5).
BEIR benchmark (ColBERT setting)
- ModernBERT-base: 51.3, versus BERT-base (49.0).
- ModernBERT-large: 52.4, surpassing GTE-en-MLM-large (50.7).
ModernBERT achieves competitive results on the MLDR benchmark, which evaluates retrieval performance for long-context tasks:
Out-of-domain performance (MLDR-OOD)
- ModernBERT-base: 27.4, trailing GTE-en-MLM-base (34.3).
- ModernBERT-large: 34.3, slightly behind GTE-en-MLM-large (36.4).
In-domain performance (MLDR-ID)
- ModernBERT-base: 44.0, comparable to GTE-en-MLM-base (44.4).
- ModernBERT-large: 48.6, approaching GTE-en-MLM-large (48.9).
And, ModernBERT is significantly more successful in code understanding tasks than previous encoder-only models:
CodeSearchNet
- ModernBERT-base: Achieved a score of 56.4, outperforming GTE-en-MLM-base (44.9) and RoBERTa-base (44.3).
- ModernBERT-large: Scored 59.5, exceeding GTE-en-MLM-large (40.5) and RoBERTa-large (47.3).
StackOverflow-QA (SQA)
- ModernBERT-base: 73.6, surpassing BERT-base (59.5).
- ModernBERT-large: 83.9, significantly ahead of BERT-large (60.8).
ModernBERT outperforms competitors in terms of efficiency, thanks to advancements like flash attention and unpadding:
Maximum batch size
- ModernBERT-base: 128k tokens, surpassing GTE-en-MLM-base (102k) and RoBERTa-base (102k).
- ModernBERT-large: 128k tokens, exceeding GTE-en-MLM-large (102k) and RoBERTa-large (102k).
Inference speed (tokens/sec)
- ModernBERT-base: 148.1 fixed, 147.3 variable.
- ModernBERT-large: 52.9 fixed, 52.3 variable.
With its blend of cutting-edge innovations and practical design, ModernBERT represents a new standard for encoder-only models.
Conclusion
To conclude, encoder-only models made a comeback with ModernBERT, and I expect even further improvements, most likely involving the generator-discriminator framework or other methods tried in BERT-inspired architectures.
There are many groundbreaking improvements in generative AI models, but the landscape of NLP is not limited to them, and practical, promising work continues in every subfield.