NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Top Related Projects

  • fairscale — PyTorch extensions for high performance and large scale training.
  • DeepSpeed — a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Horovod — a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • PyTorch — tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Accelerate — a simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.
  • intel-extension-for-pytorch — a Python package for extending the official PyTorch to easily obtain performance on Intel platforms.

Quick Overview

NVIDIA/apex is a PyTorch extension that provides tools for mixed precision and distributed training. It aims to improve performance and memory efficiency in deep learning workflows, particularly for large-scale models and datasets.

Pros

  • Enables mixed precision training, which can significantly speed up computations and reduce memory usage
  • Provides optimized CUDA kernels for common operations, enhancing performance on NVIDIA GPUs
  • Offers easy-to-use distributed training utilities for multi-GPU and multi-node setups
  • Integrates seamlessly with PyTorch, allowing for minimal code changes in existing projects

Cons

  • Primarily focused on NVIDIA GPUs, limiting its usefulness for other hardware
  • Requires careful tuning and understanding of mixed precision concepts for optimal results
  • May introduce additional complexity to the training pipeline
  • Some features may not be compatible with the latest PyTorch versions immediately upon release

Code Examples

  1. Initializing mixed precision training:
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
  2. Using distributed data parallel with Apex:
from apex.parallel import DistributedDataParallel as DDP

model = DDP(model)
  3. Applying gradient clipping with Apex:
import torch
from apex import amp

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Clip the unscaled master (FP32) gradients, not the model's FP16 gradients
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
  4. Using Apex's optimized layer normalization (see the usage sketch after this list):
from apex.normalization import FusedLayerNorm

layer_norm = FusedLayerNorm(normalized_shape)
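
FusedLayerNorm is intended as a drop-in replacement for torch.nn.LayerNorm backed by a fused CUDA kernel, and it requires Apex built with the CUDA extensions. A minimal usage sketch; the hidden size and tensor shapes below are illustrative, not from the original examples:

import torch
from apex.normalization import FusedLayerNorm

hidden_size = 768                                     # illustrative value
x = torch.randn(8, 128, hidden_size, device="cuda")   # (batch, sequence, hidden)

# Same call signature and semantics as torch.nn.LayerNorm(hidden_size)
layer_norm = FusedLayerNorm(hidden_size).cuda()
y = layer_norm(x)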

Getting Started

To get started with NVIDIA/apex, follow these steps:

  1. Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Import and use Apex in your PyTorch code (a multi-GPU variant is sketched after the example):
import torch
from apex import amp

# Define your model and optimizer
# (amp.initialize expects the model to already be on the GPU)
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

# Initialize mixed precision training
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Train your model using Apex features
# (criterion, dataloader, and num_epochs are assumed to be defined elsewhere)
for epoch in range(num_epochs):
    for inputs, target in dataloader:
        optimizer.zero_grad()
        inputs, target = inputs.cuda(), target.cuda()
        output = model(inputs)
        loss = criterion(output, target)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
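
For multi-GPU training, the amp and DistributedDataParallel pieces shown above are typically combined. A minimal sketch, assuming one process per GPU launched with torchrun or torch.distributed.launch; YourModel and local_rank are placeholders supplied by your own code and launcher:

import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# Bind this process to its GPU and join the process group
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

# Initialize amp first, then wrap the model with Apex's DistributedDataParallel
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model)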

Competitor Comparisons

PyTorch extensions for high performance and large scale training.

Pros of fairscale

  • More comprehensive distributed training support, including model parallelism and pipeline parallelism
  • Broader compatibility across different hardware platforms, not limited to NVIDIA GPUs
  • Active development and regular updates from Facebook AI Research team

Cons of fairscale

  • May have a steeper learning curve due to more advanced features
  • Potentially slower performance for some operations compared to Apex's CUDA-optimized implementations
  • Less focus on mixed-precision training compared to Apex

Code Comparison

Apex (Mixed Precision Training):

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

fairscale (Sharded Data Parallel):

model = ShardedDataParallel(model, sharded_optimizer=optimizer)
output = model(input)
loss = criterion(output, target)
loss.backward()

Both libraries aim to improve training efficiency, but fairscale offers a wider range of distributed training techniques, while Apex focuses more on mixed-precision training and NVIDIA-specific optimizations.
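
To make the fairscale snippet above more concrete: ShardedDataParallel is designed to pair with fairscale's OSS optimizer wrapper, which shards optimizer state across ranks. A minimal sketch, assuming torch.distributed has been launched with one process per GPU; the Linear layer and SGD settings are placeholders:

import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

dist.init_process_group(backend="nccl")
model = torch.nn.Linear(10, 10).cuda()

# OSS shards the optimizer state across ranks; ShardedDataParallel syncs gradients to match
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, optimizer)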

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More comprehensive optimization toolkit with features like ZeRO, pipeline parallelism, and 1-bit Adam
  • Better support for distributed training across multiple GPUs and nodes
  • More active development and frequent updates

Cons of DeepSpeed

  • Steeper learning curve due to more complex features and configurations
  • May require more setup and tuning to achieve optimal performance
  • Potentially less stable due to rapid development and frequent changes

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

DeepSpeed:

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params
)
loss = model_engine(batch)
model_engine.backward(loss)
model_engine.step()

Both libraries aim to optimize training performance, but DeepSpeed offers a more comprehensive suite of features for large-scale distributed training, while Apex focuses primarily on mixed precision training with a simpler API.
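
Most of DeepSpeed's features (fp16, ZeRO stages, optimizer choices) are driven by a configuration rather than code changes. A sketch of what that configuration can look like, passed as a dict here for brevity; a JSON file supplied via --deepspeed_config is the more traditional route. The keys shown are standard DeepSpeed config options, but the values are placeholders, not recommendations:

import deepspeed

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},          # mixed precision, roughly comparable to amp O1/O2
    "zero_optimization": {"stage": 2},  # ZeRO: partition optimizer state and gradients
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)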

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Framework-agnostic: Works with TensorFlow, PyTorch, and MXNet
  • Supports distributed training across multiple GPUs and nodes
  • Easier to scale to large clusters and supercomputers

Cons of Horovod

  • Requires more setup and configuration compared to Apex
  • May have slightly higher overhead for single-node multi-GPU training
  • Less integrated with NVIDIA-specific optimizations

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Horovod:

hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
loss.backward()

Both libraries aim to improve distributed training performance, but Apex focuses on mixed precision training and NVIDIA GPU optimizations, while Horovod emphasizes scalability across different frameworks and distributed environments. Apex is more tightly integrated with PyTorch and NVIDIA hardware, offering easier setup for single-node multi-GPU scenarios. Horovod provides greater flexibility for large-scale distributed training across various frameworks and hardware configurations.
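
For completeness, the Horovod snippet above is usually accompanied by binding each process to a GPU, scaling the learning rate, and broadcasting optimizer state. A minimal sketch; the Linear model and learning rate are placeholders:

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 10).cuda()
# Scaling the learning rate by the number of workers is a common Horovod convention
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Average gradients across workers via allreduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from the same model and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)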

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Broader ecosystem and community support
  • More comprehensive documentation and tutorials
  • Wider range of built-in features and functionalities

Cons of PyTorch

  • Slower for certain operations that Apex ships as fused CUDA kernels (e.g. fused layer norm and fused optimizers)
  • Native AMP (autocast + GradScaler) offers less fine-grained control than Apex's opt levels (O0-O3)
  • May require more memory for large-scale models

Code Comparison

PyTorch:

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for data, target in dataset:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Apex:

import torch
from apex import amp

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for data, target in dataset:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Pros of Accelerate

  • Easier to use and more beginner-friendly
  • Supports a wider range of hardware and platforms
  • Integrates seamlessly with Hugging Face ecosystem

Cons of Accelerate

  • May not offer the same level of performance optimization as Apex
  • Less fine-grained control over mixed precision training

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Accelerate:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

Apex focuses on mixed precision training and optimization, while Accelerate provides a more general-purpose solution for distributed training and hardware acceleration. Apex offers more advanced features for performance tuning, but Accelerate is easier to integrate into existing projects and works across a broader range of hardware configurations.

Accelerate is designed to be more user-friendly and requires less code modification, making it a good choice for those new to distributed training or working with diverse hardware setups. Apex, on the other hand, may be preferred by users who need fine-grained control over mixed precision training and are willing to invest time in optimizing their models for NVIDIA GPUs.
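
To show how the prepared objects above fit into a training step, here is a minimal Accelerate sketch; model, optimizer, training_dataloader, and loss_fn are assumed to exist, and the mixed_precision value depends on your hardware:

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

for inputs, target in training_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), target)
    accelerator.backward(loss)  # replaces loss.backward(); handles scaling/unscaling
    optimizer.step()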

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform

Pros of intel-extension-for-pytorch

  • Optimized for Intel hardware, including CPUs and GPUs
  • Supports a wider range of Intel-specific optimizations and features
  • Integrates seamlessly with Intel's oneAPI toolkit for enhanced performance

Cons of intel-extension-for-pytorch

  • Limited to Intel hardware, reducing flexibility for users with diverse hardware setups
  • May have a smaller community and fewer resources compared to Apex
  • Potentially slower adoption of new PyTorch features due to focus on Intel-specific optimizations

Code Comparison

apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

intel-extension-for-pytorch:

import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)

Both extensions aim to improve PyTorch performance, but they target different hardware ecosystems. Apex focuses on NVIDIA GPUs and provides mixed precision training, while intel-extension-for-pytorch optimizes for Intel hardware. The code snippets demonstrate the simplicity of integrating these extensions into existing PyTorch projects, with slight differences in syntax and functionality.
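
As a rough illustration of the intel-extension-for-pytorch side, a common inference pattern combines ipex.optimize with bfloat16 autocast on the CPU. A minimal sketch; MyModel and example_input are placeholders:

import torch
import intel_extension_for_pytorch as ipex

model = MyModel().eval()

# Apply Intel-specific kernel and graph optimizations for bf16 inference
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model(example_input)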

README

Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Installation

Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext. Note that contrib modules do not necessarily support stable PyTorch releases; some of them might only be compatible with nightlies.

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly Pytorch obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing Ninja to make compilation faster.

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions using environment variables:

Using Environment Variables (Recommended)

git clone https://github.com/NVIDIA/apex
cd apex
# Build with core extensions (cpp and cuda)
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation .

# To build with additional extensions, specify them with environment variables
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 APEX_FAST_MULTIHEAD_ATTN=1 APEX_FUSED_CONV_BIAS_RELU=1 pip install -v --no-build-isolation .

# To build all contrib extensions at once
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 APEX_ALL_CONTRIB_EXT=1 pip install -v --no-build-isolation .

To reduce the build time, parallel building can be enabled:

NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation .

When CPU cores or memory are limited, the --parallel option is generally preferred over --threads. See pull#1882 for more details.

Using Command-Line Flags (Legacy Method)

The traditional command-line flags are still supported:

# Using pip config-settings (pip >= 23.1)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# For older pip versions
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

# To build with additional extensions
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_multihead_attn" ./

Python-Only Build

APEX also supports a Python-only build via:

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam.
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
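
If you are unsure whether your install includes the compiled extensions, one quick, unofficial check is to look for the extension modules by name — for example amp_C and fused_layer_norm_cuda from the table below — which are absent in a Python-only build:

import importlib.util

# These modules only exist when Apex was built with the C++/CUDA extensions
for ext in ("amp_C", "fused_layer_norm_cuda", "syncbn"):
    status = "built" if importlib.util.find_spec(ext) else "missing (Python-only build)"
    print(f"{ext}: {status}")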

[Experimental] Windows

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" . may work if you were able to build Pytorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed Pytorch in a Conda environment, make sure to install Apex in that same environment.

Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

Module Name | Environment Variable | Install Option | Misc
apex_C | APEX_CPP_EXT=1 | --cpp_ext |
amp_C | APEX_CUDA_EXT=1 | --cuda_ext |
syncbn | APEX_CUDA_EXT=1 | --cuda_ext |
fused_layer_norm_cuda | APEX_CUDA_EXT=1 | --cuda_ext | apex.normalization
mlp_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
scaled_upper_triang_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
generic_scaled_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
scaled_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
fused_weight_gradient_mlp_cuda | APEX_CUDA_EXT=1 | --cuda_ext | Requires CUDA >= 11
permutation_search_cuda | APEX_PERMUTATION_SEARCH=1 | --permutation_search | apex.contrib.sparsity
bnp | APEX_BNP=1 | --bnp | apex.contrib.groupbn
xentropy | APEX_XENTROPY=1 | --xentropy | apex.contrib.xentropy
focal_loss_cuda | APEX_FOCAL_LOSS=1 | --focal_loss | apex.contrib.focal_loss
fused_index_mul_2d | APEX_INDEX_MUL_2D=1 | --index_mul_2d | apex.contrib.index_mul_2d
fused_adam_cuda | APEX_DEPRECATED_FUSED_ADAM=1 | --deprecated_fused_adam | apex.contrib.optimizers
fused_lamb_cuda | APEX_DEPRECATED_FUSED_LAMB=1 | --deprecated_fused_lamb | apex.contrib.optimizers
fast_layer_norm | APEX_FAST_LAYER_NORM=1 | --fast_layer_norm | apex.contrib.layer_norm; different from fused_layer_norm
fmhalib | APEX_FMHA=1 | --fmha | apex.contrib.fmha
fast_multihead_attn | APEX_FAST_MULTIHEAD_ATTN=1 | --fast_multihead_attn | apex.contrib.multihead_attn
transducer_joint_cuda | APEX_TRANSDUCER=1 | --transducer | apex.contrib.transducer
transducer_loss_cuda | APEX_TRANSDUCER=1 | --transducer | apex.contrib.transducer
cudnn_gbn_lib | APEX_CUDNN_GBN=1 | --cudnn_gbn | Requires cuDNN >= 8.5; apex.contrib.cudnn_gbn
peer_memory_cuda | APEX_PEER_MEMORY=1 | --peer_memory | apex.contrib.peer_memory
nccl_p2p_cuda | APEX_NCCL_P2P=1 | --nccl_p2p | Requires NCCL >= 2.10; apex.contrib.nccl_p2p
fast_bottleneck | APEX_FAST_BOTTLENECK=1 | --fast_bottleneck | Requires peer_memory_cuda and nccl_p2p_cuda; apex.contrib.bottleneck
fused_conv_bias_relu | APEX_FUSED_CONV_BIAS_RELU=1 | --fused_conv_bias_relu | Requires cuDNN >= 8.4; apex.contrib.conv_bias_relu
distributed_adam_cuda | APEX_DISTRIBUTED_ADAM=1 | --distributed_adam | apex.contrib.optimizers
distributed_lamb_cuda | APEX_DISTRIBUTED_LAMB=1 | --distributed_lamb | apex.contrib.optimizers
_apex_nccl_allocator | APEX_NCCL_ALLOCATOR=1 | --nccl_allocator | Requires NCCL >= 2.19; apex.contrib.nccl_allocator
_apex_gpu_direct_storage | APEX_GPU_DIRECT_STORAGE=1 | --gpu_direct_storage | apex.contrib.gpu_direct_storage

You can also build all contrib extensions at once by setting APEX_ALL_CONTRIB_EXT=1.