NVIDIA/apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch

Top Related Projects

  • fairscale — PyTorch extensions for high performance and large scale training.
  • DeepSpeed — a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.
  • Horovod — a distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.
  • PyTorch — tensors and dynamic neural networks in Python with strong GPU acceleration.
  • Accelerate — a simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, with automatic mixed precision (including fp8) and easy-to-configure FSDP and DeepSpeed support.
  • intel-extension-for-pytorch — a Python package for extending the official PyTorch to easily obtain performance on Intel platforms.

Quick Overview

NVIDIA/apex is a PyTorch extension that provides tools for mixed precision and distributed training. It aims to improve performance and memory efficiency in deep learning workflows, particularly for large-scale models and datasets.

Pros

  • Enables mixed precision training, which can significantly speed up computations and reduce memory usage
  • Provides optimized CUDA kernels for common operations, enhancing performance on NVIDIA GPUs
  • Offers easy-to-use distributed training utilities for multi-GPU and multi-node setups
  • Integrates seamlessly with PyTorch, allowing for minimal code changes in existing projects

Cons

  • Primarily focused on NVIDIA GPUs, limiting its usefulness for other hardware
  • Requires careful tuning and understanding of mixed precision concepts for optimal results
  • May introduce additional complexity to the training pipeline
  • Some features may not be compatible with the latest PyTorch versions immediately upon release

Code Examples

  1. Initializing mixed precision training:
from apex import amp

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
  2. Using distributed data parallel with Apex:
from apex.parallel import DistributedDataParallel as DDP

model = DDP(model)
  3. Applying gradient clipping with Apex:
import torch
from apex import amp

with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
# Clip the unscaled master (FP32) gradients, not the model's FP16 gradients
torch.nn.utils.clip_grad_norm_(amp.master_params(optimizer), max_norm=1.0)
  4. Using Apex's optimized layer normalization (see the usage sketch after this list):
from apex.normalization import FusedLayerNorm

layer_norm = FusedLayerNorm(normalized_shape)
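
FusedLayerNorm is intended as a drop-in replacement for torch.nn.LayerNorm backed by a fused CUDA kernel, and it requires Apex built with the CUDA extensions. A minimal usage sketch; the hidden size and tensor shapes below are illustrative, not from the original examples:

import torch
from apex.normalization import FusedLayerNorm

hidden_size = 768                                     # illustrative value
x = torch.randn(8, 128, hidden_size, device="cuda")   # (batch, sequence, hidden)

# Same call signature and semantics as torch.nn.LayerNorm(hidden_size)
layer_norm = FusedLayerNorm(hidden_size).cuda()
y = layer_norm(x)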

Getting Started

To get started with NVIDIA/apex, follow these steps:

  1. Install Apex:
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --disable-pip-version-check --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
  2. Import and use Apex in your PyTorch code (a multi-GPU variant is sketched after the example):
import torch
from apex import amp

# Define your model and optimizer
# (amp.initialize expects the model to already be on the GPU)
model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

# Initialize mixed precision training
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

# Train your model using Apex features
# (criterion, dataloader, and num_epochs are assumed to be defined elsewhere)
for epoch in range(num_epochs):
    for inputs, target in dataloader:
        optimizer.zero_grad()
        inputs, target = inputs.cuda(), target.cuda()
        output = model(inputs)
        loss = criterion(output, target)
        with amp.scale_loss(loss, optimizer) as scaled_loss:
            scaled_loss.backward()
        optimizer.step()
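
For multi-GPU training, the amp and DistributedDataParallel pieces shown above are typically combined. A minimal sketch, assuming one process per GPU launched with torchrun or torch.distributed.launch; YourModel and local_rank are placeholders supplied by your own code and launcher:

import torch
import torch.distributed as dist
from apex import amp
from apex.parallel import DistributedDataParallel as DDP

# Bind this process to its GPU and join the process group
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

model = YourModel().cuda()
optimizer = torch.optim.Adam(model.parameters())

# Initialize amp first, then wrap the model with Apex's DistributedDataParallel
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
model = DDP(model)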

Competitor Comparisons

PyTorch extensions for high performance and large scale training.

Pros of fairscale

  • More comprehensive distributed training support, including model parallelism and pipeline parallelism
  • Broader compatibility across different hardware platforms, not limited to NVIDIA GPUs
  • Active development and regular updates from Facebook AI Research team

Cons of fairscale

  • May have a steeper learning curve due to more advanced features
  • Potentially slower performance for some operations compared to Apex's CUDA-optimized implementations
  • Less focus on mixed-precision training compared to Apex

Code Comparison

Apex (Mixed Precision Training):

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

fairscale (Sharded Data Parallel):

model = ShardedDataParallel(model, sharded_optimizer=optimizer)
output = model(input)
loss = criterion(output, target)
loss.backward()

Both libraries aim to improve training efficiency, but fairscale offers a wider range of distributed training techniques, while Apex focuses more on mixed-precision training and NVIDIA-specific optimizations.
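
To make the fairscale snippet above more concrete: ShardedDataParallel is designed to pair with fairscale's OSS optimizer wrapper, which shards optimizer state across ranks. A minimal sketch, assuming torch.distributed has been launched with one process per GPU; the Linear layer and SGD settings are placeholders:

import torch
import torch.distributed as dist
from fairscale.optim.oss import OSS
from fairscale.nn.data_parallel import ShardedDataParallel

dist.init_process_group(backend="nccl")
model = torch.nn.Linear(10, 10).cuda()

# OSS shards the optimizer state across ranks; ShardedDataParallel syncs gradients to match
optimizer = OSS(params=model.parameters(), optim=torch.optim.SGD, lr=0.1)
model = ShardedDataParallel(model, optimizer)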

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective.

Pros of DeepSpeed

  • More comprehensive optimization toolkit with features like ZeRO, pipeline parallelism, and 1-bit Adam
  • Better support for distributed training across multiple GPUs and nodes
  • More active development and frequent updates

Cons of DeepSpeed

  • Steeper learning curve due to more complex features and configurations
  • May require more setup and tuning to achieve optimal performance
  • Potentially less stable due to rapid development and frequent changes

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

DeepSpeed:

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args, model=model, model_parameters=params
)
loss = model_engine(batch)
model_engine.backward(loss)
model_engine.step()

Both libraries aim to optimize training performance, but DeepSpeed offers a more comprehensive suite of features for large-scale distributed training, while Apex focuses primarily on mixed precision training with a simpler API.
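
Most of DeepSpeed's features (fp16, ZeRO stages, optimizer choices) are driven by a configuration rather than code changes. A sketch of what that configuration can look like, passed as a dict here for brevity; a JSON file supplied via --deepspeed_config is the more traditional route. The keys shown are standard DeepSpeed config options, but the values are placeholders, not recommendations:

import deepspeed

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},          # mixed precision, roughly comparable to amp O1/O2
    "zero_optimization": {"stage": 2},  # ZeRO: partition optimizer state and gradients
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)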

Distributed training framework for TensorFlow, Keras, PyTorch, and Apache MXNet.

Pros of Horovod

  • Framework-agnostic: Works with TensorFlow, PyTorch, and MXNet
  • Supports distributed training across multiple GPUs and nodes
  • Easier to scale to large clusters and supercomputers

Cons of Horovod

  • Requires more setup and configuration compared to Apex
  • May have slightly higher overhead for single-node multi-GPU training
  • Less integrated with NVIDIA-specific optimizations

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Horovod:

hvd.init()
optimizer = hvd.DistributedOptimizer(optimizer)
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
loss.backward()

Both libraries aim to improve distributed training performance, but Apex focuses on mixed precision training and NVIDIA GPU optimizations, while Horovod emphasizes scalability across different frameworks and distributed environments. Apex is more tightly integrated with PyTorch and NVIDIA hardware, offering easier setup for single-node multi-GPU scenarios. Horovod provides greater flexibility for large-scale distributed training across various frameworks and hardware configurations.
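
For completeness, the Horovod snippet above is usually accompanied by binding each process to a GPU, scaling the learning rate, and broadcasting optimizer state. A minimal sketch; the Linear model and learning rate are placeholders:

import torch
import horovod.torch as hvd

hvd.init()
torch.cuda.set_device(hvd.local_rank())

model = torch.nn.Linear(10, 10).cuda()
# Scaling the learning rate by the number of workers is a common Horovod convention
optimizer = torch.optim.SGD(model.parameters(), lr=0.1 * hvd.size())

# Average gradients across workers via allreduce
optimizer = hvd.DistributedOptimizer(optimizer, named_parameters=model.named_parameters())

# Start all workers from the same model and optimizer state
hvd.broadcast_parameters(model.state_dict(), root_rank=0)
hvd.broadcast_optimizer_state(optimizer, root_rank=0)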

Tensors and Dynamic neural networks in Python with strong GPU acceleration

Pros of PyTorch

  • Broader ecosystem and community support
  • More comprehensive documentation and tutorials
  • Wider range of built-in features and functionalities

Cons of PyTorch

  • Slower for certain operations that Apex ships as fused CUDA kernels (e.g. fused layer norm and fused optimizers)
  • Native AMP (autocast + GradScaler) offers less fine-grained control than Apex's opt levels (O0-O3)
  • May require more memory for large-scale models

Code Comparison

PyTorch:

import torch

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()

for data, target in dataset:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():
        output = model(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()

Apex:

import torch
from apex import amp

model = torch.nn.Linear(10, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

for data, target in dataset:
    optimizer.zero_grad()
    output = model(data)
    loss = loss_fn(output, target)
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()

🚀 A simple way to launch, train, and use PyTorch models on almost any device and distributed configuration, automatic mixed precision (including fp8), and easy-to-configure FSDP and DeepSpeed support

Pros of Accelerate

  • Easier to use and more beginner-friendly
  • Supports a wider range of hardware and platforms
  • Integrates seamlessly with Hugging Face ecosystem

Cons of Accelerate

  • May not offer the same level of performance optimization as Apex
  • Less fine-grained control over mixed precision training

Code Comparison

Apex:

model, optimizer = amp.initialize(model, optimizer, opt_level="O1")
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()

Accelerate:

from accelerate import Accelerator
accelerator = Accelerator()
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

Apex focuses on mixed precision training and optimization, while Accelerate provides a more general-purpose solution for distributed training and hardware acceleration. Apex offers more advanced features for performance tuning, but Accelerate is easier to integrate into existing projects and works across a broader range of hardware configurations.

Accelerate is designed to be more user-friendly and requires less code modification, making it a good choice for those new to distributed training or working with diverse hardware setups. Apex, on the other hand, may be preferred by users who need fine-grained control over mixed precision training and are willing to invest time in optimizing their models for NVIDIA GPUs.
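
To show how the prepared objects above fit into a training step, here is a minimal Accelerate sketch; model, optimizer, training_dataloader, and loss_fn are assumed to exist, and the mixed_precision value depends on your hardware:

from accelerate import Accelerator

accelerator = Accelerator(mixed_precision="fp16")
model, optimizer, training_dataloader = accelerator.prepare(
    model, optimizer, training_dataloader
)

for inputs, target in training_dataloader:
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), target)
    accelerator.backward(loss)  # replaces loss.backward(); handles scaling/unscaling
    optimizer.step()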

A Python package for extending the official PyTorch that can easily obtain performance on Intel platform

Pros of intel-extension-for-pytorch

  • Optimized for Intel hardware, including CPUs and GPUs
  • Supports a wider range of Intel-specific optimizations and features
  • Integrates seamlessly with Intel's oneAPI toolkit for enhanced performance

Cons of intel-extension-for-pytorch

  • Limited to Intel hardware, reducing flexibility for users with diverse hardware setups
  • May have a smaller community and fewer resources compared to Apex
  • Potentially slower adoption of new PyTorch features due to focus on Intel-specific optimizations

Code Comparison

apex:

from apex import amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O1")

intel-extension-for-pytorch:

import intel_extension_for_pytorch as ipex
model = ipex.optimize(model)

Both extensions aim to improve PyTorch performance, but they target different hardware ecosystems. Apex focuses on NVIDIA GPUs and provides mixed precision training, while intel-extension-for-pytorch optimizes for Intel hardware. The code snippets demonstrate the simplicity of integrating these extensions into existing PyTorch projects, with slight differences in syntax and functionality.
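
As a rough illustration of the intel-extension-for-pytorch side, a common inference pattern combines ipex.optimize with bfloat16 autocast on the CPU. A minimal sketch; MyModel and example_input are placeholders:

import torch
import intel_extension_for_pytorch as ipex

model = MyModel().eval()

# Apply Intel-specific kernel and graph optimizations for bf16 inference
model = ipex.optimize(model, dtype=torch.bfloat16)

with torch.no_grad(), torch.cpu.amp.autocast(dtype=torch.bfloat16):
    output = model(example_input)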

README

Introduction

This repository holds NVIDIA-maintained utilities to streamline mixed precision and distributed training in Pytorch. Some of the code here will be included in upstream Pytorch eventually. The intent of Apex is to make up-to-date utilities available to users as quickly as possible.

Installation

Each apex.contrib module requires one or more install options other than --cpp_ext and --cuda_ext. Note that contrib modules do not necessarily support stable PyTorch releases; some of them might only be compatible with nightlies.

Containers

NVIDIA PyTorch Containers are available on NGC: https://catalog.ngc.nvidia.com/orgs/nvidia/containers/pytorch. The containers come with all the custom extensions available at the moment.

See the NGC documentation for details such as:

  • how to pull a container
  • how to run a pulled container
  • release notes

From Source

To install Apex from source, we recommend using the nightly Pytorch obtainable from https://github.com/pytorch/pytorch.

The latest stable release obtainable from https://pytorch.org should also work.

We recommend installing Ninja to make compilation faster.

Linux

For performance and full functionality, we recommend installing Apex with CUDA and C++ extensions using environment variables:

Using Environment Variables (Recommended)

git clone https://github.com/NVIDIA/apex
cd apex
# Build with core extensions (cpp and cuda)
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation .

# To build with additional extensions, specify them with environment variables
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 APEX_FAST_MULTIHEAD_ATTN=1 APEX_FUSED_CONV_BIAS_RELU=1 pip install -v --no-build-isolation .

# To build all contrib extensions at once
APEX_CPP_EXT=1 APEX_CUDA_EXT=1 APEX_ALL_CONTRIB_EXT=1 pip install -v --no-build-isolation .

To reduce the build time, parallel building can be enabled:

NVCC_APPEND_FLAGS="--threads 4" APEX_PARALLEL_BUILD=8 APEX_CPP_EXT=1 APEX_CUDA_EXT=1 pip install -v --no-build-isolation .

When CPU cores or memory are limited, the --parallel option is generally preferred over --threads. See pull#1882 for more details.

Using Command-Line Flags (Legacy Method)

The traditional command-line flags are still supported:

# Using pip config-settings (pip >= 23.1)
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" ./

# For older pip versions
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" ./

# To build with additional extensions
pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_multihead_attn" ./

Python-Only Build

APEX also supports a Python-only build via:

pip install -v --disable-pip-version-check --no-build-isolation --no-cache-dir ./

A Python-only build omits:

  • Fused kernels required to use apex.optimizers.FusedAdam.
  • Fused kernels required to use apex.normalization.FusedLayerNorm and apex.normalization.FusedRMSNorm.
  • Fused kernels that improve the performance and numerical stability of apex.parallel.SyncBatchNorm.
  • Fused kernels that improve the performance of apex.parallel.DistributedDataParallel and apex.amp. DistributedDataParallel, amp, and SyncBatchNorm will still be usable, but they may be slower.
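
If you are unsure whether your install includes the compiled extensions, one quick, unofficial check is to look for the extension modules by name — for example amp_C and fused_layer_norm_cuda from the table below — which are absent in a Python-only build:

import importlib.util

# These modules only exist when Apex was built with the C++/CUDA extensions
for ext in ("amp_C", "fused_layer_norm_cuda", "syncbn"):
    status = "built" if importlib.util.find_spec(ext) else "missing (Python-only build)"
    print(f"{ext}: {status}")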

[Experimental] Windows

pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings "--build-option=--cpp_ext" --config-settings "--build-option=--cuda_ext" . may work if you were able to build Pytorch from source on your system. A Python-only build via pip install -v --no-cache-dir . is more likely to work.
If you installed Pytorch in a Conda environment, make sure to install Apex in that same environment.

Custom C++/CUDA Extensions and Install Options

If a requirement of a module is not met, then it will not be built.

Module Name | Environment Variable | Install Option | Misc
apex_C | APEX_CPP_EXT=1 | --cpp_ext |
amp_C | APEX_CUDA_EXT=1 | --cuda_ext |
syncbn | APEX_CUDA_EXT=1 | --cuda_ext |
fused_layer_norm_cuda | APEX_CUDA_EXT=1 | --cuda_ext | apex.normalization
mlp_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
scaled_upper_triang_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
generic_scaled_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
scaled_masked_softmax_cuda | APEX_CUDA_EXT=1 | --cuda_ext |
fused_weight_gradient_mlp_cuda | APEX_CUDA_EXT=1 | --cuda_ext | Requires CUDA >= 11
permutation_search_cuda | APEX_PERMUTATION_SEARCH=1 | --permutation_search | apex.contrib.sparsity
bnp | APEX_BNP=1 | --bnp | apex.contrib.groupbn
xentropy | APEX_XENTROPY=1 | --xentropy | apex.contrib.xentropy
focal_loss_cuda | APEX_FOCAL_LOSS=1 | --focal_loss | apex.contrib.focal_loss
fused_index_mul_2d | APEX_INDEX_MUL_2D=1 | --index_mul_2d | apex.contrib.index_mul_2d
fused_adam_cuda | APEX_DEPRECATED_FUSED_ADAM=1 | --deprecated_fused_adam | apex.contrib.optimizers
fused_lamb_cuda | APEX_DEPRECATED_FUSED_LAMB=1 | --deprecated_fused_lamb | apex.contrib.optimizers
fast_layer_norm | APEX_FAST_LAYER_NORM=1 | --fast_layer_norm | apex.contrib.layer_norm; different from fused_layer_norm
fmhalib | APEX_FMHA=1 | --fmha | apex.contrib.fmha
fast_multihead_attn | APEX_FAST_MULTIHEAD_ATTN=1 | --fast_multihead_attn | apex.contrib.multihead_attn
transducer_joint_cuda | APEX_TRANSDUCER=1 | --transducer | apex.contrib.transducer
transducer_loss_cuda | APEX_TRANSDUCER=1 | --transducer | apex.contrib.transducer
cudnn_gbn_lib | APEX_CUDNN_GBN=1 | --cudnn_gbn | Requires cuDNN >= 8.5; apex.contrib.cudnn_gbn
peer_memory_cuda | APEX_PEER_MEMORY=1 | --peer_memory | apex.contrib.peer_memory
nccl_p2p_cuda | APEX_NCCL_P2P=1 | --nccl_p2p | Requires NCCL >= 2.10; apex.contrib.nccl_p2p
fast_bottleneck | APEX_FAST_BOTTLENECK=1 | --fast_bottleneck | Requires peer_memory_cuda and nccl_p2p_cuda; apex.contrib.bottleneck
fused_conv_bias_relu | APEX_FUSED_CONV_BIAS_RELU=1 | --fused_conv_bias_relu | Requires cuDNN >= 8.4; apex.contrib.conv_bias_relu
distributed_adam_cuda | APEX_DISTRIBUTED_ADAM=1 | --distributed_adam | apex.contrib.optimizers
distributed_lamb_cuda | APEX_DISTRIBUTED_LAMB=1 | --distributed_lamb | apex.contrib.optimizers
_apex_nccl_allocator | APEX_NCCL_ALLOCATOR=1 | --nccl_allocator | Requires NCCL >= 2.19; apex.contrib.nccl_allocator
_apex_gpu_direct_storage | APEX_GPU_DIRECT_STORAGE=1 | --gpu_direct_storage | apex.contrib.gpu_direct_storage

You can also build all contrib extensions at once by setting APEX_ALL_CONTRIB_EXT=1.