NVIDIA / Megatron-LM

Ongoing research training transformer models at scale


Top Related Projects

  • DeepSpeed (42,282 stars): a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Transformers: 🤗 Transformers, the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.
  • gpt-neox: an implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries.
  • fairseq (32,157 stars): Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
  • BERT (40,008 stars): TensorFlow code and pre-trained models for BERT.
  • AllenNLP (11,890 stars): an open-source NLP research library, built on PyTorch.

Quick Overview

Megatron-LM is an open-source project by NVIDIA for training large language models efficiently on distributed GPU systems. It focuses on optimizing transformer-based models for scale, supporting various architectures like BERT, GPT, and T5. The library is designed to enable training of models with billions of parameters across multiple GPUs and nodes.

Pros

  • Highly optimized for distributed training on NVIDIA GPUs
  • Supports multiple model architectures (BERT, GPT, T5)
  • Implements efficient parallelism techniques (data, model, and pipeline parallelism); see the sketch after this list
  • Provides tools for efficient checkpointing and model loading
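
As a taste of what the parallelism setup looks like in practice, here is a minimal sketch using the modern megatron.core.parallel_state API (the group sizes are illustrative; the legacy mpu call in the code examples below is the older equivalent):

import torch
from megatron.core import parallel_state

# Join the default process group; expects torchrun-style env vars
# (RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT) to be set
torch.distributed.init_process_group(backend="nccl")

# On 8 ranks: 2-way tensor parallel x 2-way pipeline parallel; the
# remaining factor (2-way) becomes the data parallel dimension
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=2,
    pipeline_model_parallel_size=2,
)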

Cons

  • Steep learning curve for users not familiar with distributed training
  • Primarily focused on NVIDIA hardware, limiting its use on other platforms
  • Requires significant computational resources for large-scale training
  • Documentation can be sparse for some advanced features

Code Examples

  1. Initializing a GPT model (legacy Megatron-LM API):

from megatron import get_args
from megatron.model import GPTModel

args = get_args()  # Megatron's parsed command-line arguments
model = GPTModel(num_tokentypes=0, parallel_output=True)

  2. Setting up model parallel groups (data parallel groups are created implicitly from the remaining ranks):

from megatron import mpu

mpu.initialize_model_parallel(model_parallel_size=4)

  3. Training loop with mixed precision (via NVIDIA Apex):

import torch
from apex import amp

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# O2: cast model weights to FP16, keep FP32 master weights in the optimizer
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

for batch in dataloader:
    optimizer.zero_grad()
    loss = model(batch)
    # Scale the loss so FP16 gradients do not underflow
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
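
Apex's amp module has since been deprecated in favor of PyTorch's native automatic mixed precision; a roughly equivalent loop with torch.cuda.amp (a generic sketch, not code from the Megatron repository) looks like this:

import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()

for batch in dataloader:
    optimizer.zero_grad()
    # Run the forward pass in reduced precision where it is safe to do so
    with torch.cuda.amp.autocast():
        loss = model(batch)
    # Scale the loss, backprop, then unscale before stepping the optimizer
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()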

Getting Started

To get started with Megatron-LM:

  1. Clone the repository:

    git clone https://github.com/NVIDIA/Megatron-LM.git
    cd Megatron-LM
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Set up your dataset and configuration file.

  4. Run the training script (with $CHECKPOINT_PATH and $DATA_PATH pointing at your checkpoint and preprocessed data locations):

    python pretrain_gpt.py \
        --model-parallel-size 2 \
        --num-layers 24 \
        --hidden-size 1024 \
        --num-attention-heads 16 \
        --batch-size 4 \
        --seq-length 1024 \
        --max-position-embeddings 1024 \
        --train-iters 500000 \
        --lr-decay-iters 320000 \
        --save $CHECKPOINT_PATH \
        --load $CHECKPOINT_PATH \
        --data-path $DATA_PATH \
        --vocab-file gpt2-vocab.json \
        --merge-file gpt2-merges.txt \
        --data-impl mmap \
        --split 949,50,1 \
        --distributed-backend nccl \
        --lr 0.00015 \
        --lr-decay-style cosine \
        --min-lr 1.0e-5 \
        --weight-decay 1e-2 \
        --clip-grad 1.0 \
        --warmup .01 \
        --checkpoint-activations \
        --fp16
    

Competitor Comparisons

DeepSpeed (42,282 stars)

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More flexible and adaptable to various deep learning frameworks
  • Offers a wider range of optimization techniques beyond model parallelism
  • Provides easier integration with existing codebases

Cons of DeepSpeed

  • May require more setup and configuration for optimal performance
  • Less specialized for transformer-based models compared to Megatron-LM

Code Comparison

Megatron-LM:

model = MegatronModule(
    num_layers=args.num_layers,
    hidden_size=args.hidden_size,
    num_attention_heads=args.num_attention_heads,
    vocab_size=args.vocab_size,
    max_position_embeddings=args.max_position_embeddings,
)

DeepSpeed:

model = MyModel(args)
model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,
    model=model,
    model_parameters=model.parameters()
)

DeepSpeed takes a more generalized approach: you define your own model architecture and let deepspeed.initialize wrap it with the selected optimizations. Megatron-LM instead provides specialized transformer implementations with model parallelism built in.

Both libraries aim to improve training efficiency for large language models, but DeepSpeed offers a broader set of optimization techniques applicable across deep learning tasks, while Megatron-LM concentrates on transformer architectures and model parallelism.
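
In practice, DeepSpeed's optimizations are switched on through a JSON-style config passed to deepspeed.initialize; a minimal sketch (the values are illustrative, not taken from either repository):

import deepspeed

# Illustrative config: global batch size, FP16, and ZeRO stage-2
# optimizer-state sharding
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,  # any torch.nn.Module
    model_parameters=model.parameters(),
    config=ds_config,
)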

Transformers

🤗 Transformers: the model-definition framework for state-of-the-art machine learning models in text, vision, audio, and multimodal models, for both inference and training.

Pros of Transformers

  • Extensive model support and easy-to-use API for various NLP tasks
  • Active community and frequent updates with new models and features
  • Seamless integration with popular deep learning frameworks like PyTorch and TensorFlow

Cons of Transformers

  • Less optimized for large-scale distributed training compared to Megatron-LM
  • May require more memory and computational resources for very large models

Code Comparison

Transformers:

from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

Megatron-LM:

import megatron
from megatron import get_args, get_tokenizer, get_model

args = get_args()
tokenizer = get_tokenizer()
model = get_model(args)

Summary

Transformers offers a user-friendly API with broad model support, making it ideal for various NLP tasks and experimentation. Megatron-LM, on the other hand, is optimized for large-scale distributed training of massive language models. While Transformers is more versatile and easier to use, Megatron-LM excels in scenarios requiring efficient training of extremely large models across multiple GPUs or nodes.
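
The ease-of-use point is easy to see in practice: with the standard Hugging Face pipeline API, inference takes a couple of lines (the model name here is just an illustrative choice):

from transformers import pipeline

# Downloads a small pretrained model and runs text generation
generator = pipeline("text-generation", model="gpt2")
print(generator("Megatron-LM is", max_new_tokens=20)[0]["generated_text"])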

gpt-neox

An implementation of model parallel autoregressive transformers on GPUs, based on the Megatron and DeepSpeed libraries.

Pros of gpt-neox

  • More user-friendly and easier to set up for newcomers
  • Includes additional features like wandb integration and custom tokenizers
  • Actively maintained with frequent updates and community contributions

Cons of gpt-neox

  • May have slightly lower performance compared to Megatron-LM in some scenarios
  • Less extensive documentation and fewer examples for advanced use cases

Code Comparison

Megatron-LM initialization:

model = MegatronModule(
    init_method=init_method,
    output_layer_init_method=scaled_init_method,
    num_tokentypes=num_tokentypes,
    parallel_output=parallel_output)

gpt-neox initialization:

model = GPTNeoX(
    num_tokentypes=num_tokentypes,
    parallel_output=parallel_output,
    use_cache=use_cache,
    config=config)

Both repositories provide powerful tools for training large language models, but gpt-neox offers a more accessible approach for newcomers and includes additional features. Megatron-LM, on the other hand, may offer slightly better performance and more extensive documentation for advanced users. The code comparison shows similarities in model initialization, with gpt-neox using a more streamlined approach.

fairseq (32,157 stars)

Facebook AI Research Sequence-to-Sequence Toolkit written in Python.

Pros of fairseq

  • Broader support for various NLP tasks and architectures
  • More extensive documentation and examples
  • Active community and frequent updates

Cons of fairseq

  • Less optimized for large-scale language models
  • May require more setup and configuration for specific tasks

Code Comparison

fairseq:

from fairseq.models.transformer import TransformerModel

model = TransformerModel.from_pretrained('/path/to/model')
tokens = model.encode('Hello world!')
output = model.decode(tokens)

Megatron-LM:

from megatron import get_args, get_tokenizer, get_model
from megatron.initialize import initialize_megatron

args = get_args()
tokenizer = get_tokenizer()
model = get_model(args)

tokens = tokenizer.tokenize('Hello world!')
output = model.generate(tokens)

The code snippets demonstrate that fairseq offers a more straightforward API for loading and using pre-trained models, while Megatron-LM requires more setup and initialization steps. However, Megatron-LM's approach allows for greater customization and optimization for large-scale language models.
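
fairseq also publishes pretrained models through torch.hub, which makes the "straightforward API" point concrete; a sketch using one of its WMT'19 translation models (names as published by fairseq, but treat the exact identifiers as illustrative):

import torch

# Load a pretrained English-to-German transformer released by fairseq
en2de = torch.hub.load(
    "pytorch/fairseq",
    "transformer.wmt19.en-de.single_model",
    tokenizer="moses",
    bpe="fastbpe",
)
print(en2de.translate("Hello world!"))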

BERT (40,008 stars)

TensorFlow code and pre-trained models for BERT.

Pros of BERT

  • Simpler architecture and easier to understand for beginners
  • Extensive documentation and community support
  • Widely adopted and used in various NLP tasks

Cons of BERT

  • Limited scalability for very large language models
  • Less efficient for distributed training on multiple GPUs

Code Comparison

BERT (shown via the Hugging Face PyTorch port; the google-research/bert repository itself is TensorFlow):

from transformers import BertForSequenceClassification

model = BertForSequenceClassification.from_pretrained('bert-base-uncased')
outputs = model(input_ids, attention_mask=attention_mask, labels=labels)
loss = outputs.loss
loss.backward()

Megatron-LM:

model = get_model(args)
output = model(tokens, labels, attention_mask)
loss = output['loss']
model.backward(loss)

Key Differences

  • Megatron-LM is designed for training large language models with billions of parameters, while BERT is more suitable for smaller models.
  • Megatron-LM offers advanced features for distributed training and model parallelism, which are not present in BERT.
  • BERT provides pre-trained models and easy fine-tuning capabilities, making it more accessible for various NLP tasks.
  • Megatron-LM focuses on performance and scalability, while BERT prioritizes ease of use and widespread adoption.

AllenNLP (11,890 stars)

An open-source NLP research library, built on PyTorch.

Pros of AllenNLP

  • More comprehensive and feature-rich NLP toolkit
  • Easier to use for researchers and developers new to NLP
  • Better documentation and tutorials

Cons of AllenNLP

  • Less optimized for large-scale language model training
  • May not scale as efficiently on multi-GPU systems
  • Fewer options for advanced parallelism techniques

Code Comparison

AllenNLP:

from typing import Iterable

from allennlp.data import DatasetReader, Instance
from allennlp.data.fields import TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

class MyDatasetReader(DatasetReader):
    # text_to_instance (building a TextField-based Instance) would also be
    # defined on this reader
    def _read(self, file_path: str) -> Iterable[Instance]:
        with open(file_path, "r") as file:
            for line in file:
                yield self.text_to_instance(line.strip())

Megatron-LM:

from megatron import get_args
from megatron import print_rank_0
from megatron import get_tokenizer
from megatron.data.dataset_utils import build_train_valid_test_datasets

args = get_args()
tokenizer = get_tokenizer()
# data_prefix, data_impl, splits_string, etc. are typically derived from
# args (e.g. args.data_path, args.data_impl, args.split)
train_dataset, valid_dataset, test_dataset = build_train_valid_test_datasets(
    data_prefix, data_impl, splits_string,
    train_valid_test_num_samples,
    seq_length, seed, skip_warmup)


README

Megatron-LM and Megatron Core

GPU-optimized library for training transformer models at scale


About

This repository contains two components: Megatron-LM and Megatron Core.

Megatron-LM is a reference example that includes Megatron Core plus pre-configured training scripts. Best for research teams, learning distributed training, and quick experimentation.

Megatron Core is a composable library with GPU-optimized building blocks for custom training frameworks. It provides transformer building blocks, advanced parallelism strategies (TP, PP, DP, EP, CP), mixed precision support (FP16, BF16, FP8, FP4), and model architectures. Best for framework developers and ML engineers building custom training pipelines.
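
To make the "composable building blocks" idea concrete, here is a minimal single-GPU sketch in the spirit of the Megatron Core quickstart (module paths and the deliberately tiny sizes are illustrative and may differ across versions):

import os
import torch
from megatron.core import parallel_state
from megatron.core.models.gpt.gpt_model import GPTModel
from megatron.core.models.gpt.gpt_layer_specs import get_gpt_layer_local_spec
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed
from megatron.core.transformer.transformer_config import TransformerConfig

# Single-process "distributed" setup: one GPU, TP=1, PP=1
os.environ.setdefault("MASTER_ADDR", "localhost")
os.environ.setdefault("MASTER_PORT", "29500")
torch.distributed.init_process_group(backend="nccl", world_size=1, rank=0)
parallel_state.initialize_model_parallel(
    tensor_model_parallel_size=1,
    pipeline_model_parallel_size=1,
)
model_parallel_cuda_manual_seed(123)

# A toy transformer configuration
config = TransformerConfig(
    num_layers=2,
    hidden_size=128,
    num_attention_heads=4,
    use_cpu_initialization=True,
)
model = GPTModel(
    config=config,
    transformer_layer_spec=get_gpt_layer_local_spec(),
    vocab_size=1024,
    max_sequence_length=64,
)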

Megatron Bridge provides bidirectional Hugging Face ↔ Megatron checkpoint conversion with production-ready recipes.

Getting Started

Install from PyPI:

uv pip install megatron-core

Or clone and install from source:

git clone https://github.com/NVIDIA/Megatron-LM.git
cd Megatron-LM
uv pip install -e .

Note: Building from source can use a lot of memory. If the build runs out of memory, limit parallel compilation jobs by setting MAX_JOBS (e.g. MAX_JOBS=4 uv pip install -e .).

For NGC container setup and all installation options, see the Installation Guide.
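
A quick way to sanity-check the install from Python is to query the installed package version (a standard-library call, nothing Megatron-specific):

from importlib.metadata import version

# Prints the installed megatron-core package version
print(version("megatron-core"))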

Latest News

  • [2026/03] Deprecating Python 3.10 support: We're officially dropping Python 3.10 support with the upcoming 0.17.0 release. Downstream applications must raise their minimum Python version to 3.12 to stay compatible with MCore.
  • [2026/01] Dynamic Context Parallelism - Up to 1.48x speedup for variable-length sequence training with adaptive CP sizing.
  • [2025/12] Megatron Core development has moved to GitHub! All development and CI now happens in the open. We welcome community contributions.
  • [2025/10] Megatron Dev Branch - early access branch with experimental features.
  • [2025/10] Megatron Bridge - Bidirectional converter for interoperability between Hugging Face and Megatron checkpoints, featuring production-ready recipes for popular models.
  • [2025/08] MoE Q3-Q4 2025 Roadmap - Comprehensive roadmap for MoE features including DeepSeek-V3, Qwen3, advanced parallelism strategies, FP8 optimizations, and Blackwell performance enhancements.
  • [2025/08] GPT-OSS Model - Advanced features including YaRN RoPE scaling, attention sinks, and custom activation functions are being integrated into Megatron Core.
  • [2025/06] Megatron MoE Model Zoo - Best practices and optimized configurations for training DeepSeek-V3, Mixtral, and Qwen3 MoE models with performance benchmarking and checkpoint conversion tools.
  • [2025/05] Megatron Core v0.11.0 brings new capabilities for multi-data center LLM training (blog).

Previous News

  • [2024/07] Megatron Core v0.7 improves scalability and training resiliency and adds support for multimodal training (blog).
  • [2024/06] Megatron Core added support for Mamba-based models. Check out our paper An Empirical Study of Mamba-based Language Models and code example.
  • [2024/01 Announcement] NVIDIA has released the core capabilities in Megatron-LM into Megatron Core in this repository. Megatron Core expands upon Megatron-LM's GPU-optimized techniques with more cutting-edge innovations on system-level optimizations, featuring composable and modular APIs.

Project Structure

Megatron-LM/
├── megatron/
│   ├── core/                    # Megatron Core (kernels, parallelism, building blocks)
│   │   ├── models/              # Transformer models
│   │   ├── transformer/         # Transformer building blocks
│   │   ├── tensor_parallel/     # Tensor parallelism
│   │   ├── pipeline_parallel/   # Pipeline parallelism
│   │   ├── distributed/         # Distributed training (FSDP, DDP)
│   │   ├── optimizer/           # Optimizers
│   │   ├── datasets/            # Dataset loaders
│   │   ├── inference/           # Inference engines and server
│   │   └── export/              # Model export (e.g. TensorRT-LLM)
│   ├── training/                # Training scripts
│   ├── legacy/                  # Legacy components
│   ├── post_training/           # Post-training (quantization, distillation, pruning, etc.)
│   └── rl/                      # Reinforcement learning (RLHF, etc.)
├── examples/                    # Ready-to-use training examples
├── tools/                       # Utility tools
├── tests/                       # Comprehensive test suite
└── docs/                        # Documentation

Performance Benchmarking

For our latest performance benchmarking results, please refer to NVIDIA Megatron Bridge Performance Summary.

Our codebase efficiently trains models from 2B to 462B parameters across thousands of GPUs, achieving up to 47% Model FLOP Utilization (MFU) on H100 clusters.
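
For intuition, MFU is achieved model FLOPs divided by aggregate peak hardware FLOPs; a back-of-the-envelope sketch using the common 6 × parameters × tokens FLOPs approximation (the throughput and peak-TFLOPs numbers below are illustrative assumptions, not measurements from this repository):

# Rough MFU estimate via the standard 6 * N * T FLOPs approximation
params = 462e9               # model parameters (N)
tokens_per_sec = 1.0e6       # training throughput (illustrative)
num_gpus = 6144
peak_flops_per_gpu = 989e12  # approximate H100 BF16 dense peak

achieved = 6 * params * tokens_per_sec
peak = num_gpus * peak_flops_per_gpu
print(f"MFU ~ {achieved / peak:.1%}")  # ~45.6% with these numbers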

[Model table]

Benchmark Configuration:

  • Vocabulary size: 131,072 tokens
  • Sequence length: 4096 tokens
  • Model scaling: Varied hidden size, attention heads, and layers to achieve target parameter counts
  • Communication optimizations: Fine-grained overlapping with DP (--overlap-grad-reduce, --overlap-param-gather), TP (--tp-comm-overlap), and PP (enabled by default)

Key Results:

  • 6144 H100 GPUs: Successfully benchmarked 462B parameter model training
  • Superlinear scaling: MFU increases from 41% to 47-48% with model size
  • End-to-end measurement: Throughputs include all operations (data loading, optimizer steps, communication, logging)
  • Production ready: Full training pipeline with checkpointing and fault tolerance
  • Note: Performance results measured without training to convergence

Weak Scaling Results

Our weak scaling results show superlinear scaling (MFU increases from 41% for the smallest model considered to 47-48% for the largest models); this is because larger GEMMs have higher arithmetic intensity and are consequently more efficient to execute.

[Weak scaling chart]

Strong Scaling Results

We also strong-scaled the standard GPT-3 model (our version has slightly more than 175 billion parameters due to the larger vocabulary size) from 96 to 4608 H100 GPUs, using the same batch size of 1152 sequences throughout. Communication becomes more exposed at larger scale, reducing MFU from 47% to 42%.

[Strong scaling chart]

Roadmaps

  • MoE Roadmap - DeepSeek-V3, Qwen3, advanced parallelism, FP8 optimizations, and Blackwell enhancements

Resources

Getting Help

  • 📖 Documentation - Official documentation
  • 🐛 Issues - Bug reports and feature requests

Contributing

We ❤️ contributions! Ways to contribute:

  • 🐛 Report bugs - Help us improve reliability
  • 💡 Suggest features - Shape the future of Megatron Core
  • 📝 Improve docs - Make Megatron Core more accessible
  • 🔧 Submit PRs - Contribute code improvements

→ Contributing Guide

Citation

If you use Megatron in your research or project, please cite:

@article{megatron-lm,
  title={Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism},
  author={Shoeybi, Mohammad and Patwary, Mostofa and Puri, Raul and LeGresley, Patrick and Casper, Jared and Catanzaro, Bryan},
  journal={arXiv preprint arXiv:1909.08053},
  year={2019}
}