ruotianluo/self-critical.pytorch

Unofficial PyTorch implementation of Self-Critical Sequence Training for Image Captioning, among other captioning methods.

Top Related Projects

  • neuraltalk2: Efficient Image Captioning code in Torch, runs on GPU
  • a-PyTorch-Tutorial-to-Image-Captioning: Show, Attend, and Tell | a PyTorch Tutorial to Image Captioning
  • pytorch-tutorial: PyTorch Tutorial for Deep Learning Researchers
  • CTRL: Conditional Transformer Language Model for Controllable Generation

Quick Overview

Self-critical.pytorch is a PyTorch implementation of the Self-Critical Sequence Training (SCST) method for image captioning. It provides a framework for training and evaluating image captioning models using reinforcement learning techniques, specifically the SCST approach, which aims to optimize the model directly on the evaluation metric.
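The core idea can be written in a few lines. The sketch below is illustrative only: model.sample and cider_reward are assumed interfaces, not this repo's exact API. SCST samples a caption from the model, decodes a greedy caption as a baseline, and uses the difference in CIDEr scores as a REINFORCE advantage.

import torch

def scst_loss(model, fc_feats, att_feats, cider_reward):
    # Sample captions from the model's own distribution (the "policy").
    sample_caps, sample_logprobs = model.sample(fc_feats, att_feats, greedy=False)
    # Greedy decoding provides the self-critical baseline.
    with torch.no_grad():
        greedy_caps, _ = model.sample(fc_feats, att_feats, greedy=True)
    # Advantage: reward of each sampled caption minus its greedy baseline's reward.
    advantage = cider_reward(sample_caps) - cider_reward(greedy_caps)
    # REINFORCE: raise the log-probability of samples that beat their baseline.
    return -(advantage.detach() * sample_logprobs.sum(dim=1)).mean()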

Pros

  • Implements SCST, a powerful reinforcement learning technique for image captioning
  • Built on PyTorch, offering flexibility and ease of use
  • Includes pre-trained models and evaluation scripts
  • Supports multiple datasets and evaluation metrics

Cons

  • Limited documentation and examples
  • May require significant computational resources for training
  • Focused specifically on image captioning, limiting its applicability to other domains
  • Requires familiarity with reinforcement learning concepts

Code Examples

  1. Loading a pre-trained model (paths are placeholders):
import pickle
import torch
import models

# infos_*.pkl stores the training options and vocabulary as a plain pickle.
with open('path/to/model/infos_td-best.pkl', 'rb') as f:
    infos = pickle.load(f)
# models.setup builds the caption model specified in the saved options.
model = models.setup(infos['opt'])
model.load_state_dict(torch.load('path/to/model/model-best.pth'))
  2. Generating captions for an image:
from dataloader import DataLoader
import eval_utils

# infos['opt'] is the options object loaded above; eval_kwargs selects the split
# and decoding options (see eval_utils for the exact interface).
loader = DataLoader(infos['opt'])
eval_kwargs = {'split': 'test'}
eval_utils.eval_split(model, loader, eval_kwargs)
  3. Training the model using SCST:
import opts
import train

# Parse the command-line options (as when running the train script) and train;
# SCST kicks in after --self_critical_after epochs.
opt = opts.parse_opt()
train.train(opt)

Getting Started

  1. Clone the repository:

    git clone https://github.com/ruotianluo/self-critical.pytorch.git
    cd self-critical.pytorch
    
  2. Install dependencies:

    pip install -r requirements.txt
    
  3. Prepare the dataset (e.g., MSCOCO; see data/README.md in the repo for the required arguments):

    python scripts/prepro_labels.py
    python scripts/prepro_feats.py
    
  4. Train the model:

    python train.py --id st_lt --caption_model transformer --label_smoothing 0.2 --noamopt --noamopt_warmup 20000 --seq_per_img 5 --batch_size 10 --learning_rate 5e-4 --num_layers 6 --input_encoding_size 512 --rnn_size 2048 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_st_lt
    
  5. Evaluate the model:

    python eval.py --model log_st_lt/model-best.pth --infos_path log_st_lt/infos_st_lt-best.pkl --image_folder /path/to/images --num_images 5000
    

Competitor Comparisons

Efficient Image Captioning code in Torch, runs on GPU

Pros of neuraltalk2

  • Pioneering implementation of image captioning using RNNs and CNNs
  • Well-documented and easy to understand for beginners
  • Includes pre-trained models for quick experimentation

Cons of neuraltalk2

  • Older implementation using Torch, which is less popular now
  • Limited flexibility for customization and experimentation
  • Lacks more recent advancements in image captioning techniques

Code Comparison

neuraltalk2:

local cnn_backend = opt.backend
local layer_config = opt.layer_config
local img_size = opt.image_size
local input_encoding_size = opt.input_encoding_size

self-critical.pytorch:

class AttentionModel(CaptionModel):
    def __init__(self, opt):
        super(AttentionModel, self).__init__()
        self.vocab_size = opt.vocab_size
        self.input_encoding_size = opt.input_encoding_size

The code snippets show the difference in language (Lua vs. Python) and framework (Torch vs. PyTorch) used in the two repositories. self-critical.pytorch uses a more modern and widely-adopted framework, making it easier to integrate with other deep learning projects and leverage recent advancements in the field.

Show, Attend, and Tell | a PyTorch Tutorial to Image Captioning

Pros of a-PyTorch-Tutorial-to-Image-Captioning

  • More beginner-friendly with detailed explanations and step-by-step tutorial
  • Includes data preprocessing and visualization tools
  • Offers a simpler implementation focused on learning

Cons of a-PyTorch-Tutorial-to-Image-Captioning

  • Less advanced techniques compared to self-critical.pytorch
  • May not achieve state-of-the-art performance
  • Limited to a specific model architecture

Code Comparison

a-PyTorch-Tutorial-to-Image-Captioning:

class Attention(nn.Module):
    def __init__(self, encoder_dim, decoder_dim, attention_dim):
        super(Attention, self).__init__()
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)
        self.full_att = nn.Linear(attention_dim, 1)
        self.relu = nn.ReLU()
        self.softmax = nn.Softmax(dim=1)

self-critical.pytorch:

class AttentionLayer(nn.Module):
    def __init__(self, dim):
        super(AttentionLayer, self).__init__()
        self.dim = dim
        self.h2att = nn.Linear(dim, dim)
        self.alpha_net = nn.Linear(dim, 1)

    def forward(self, h, att_feats, p_att_feats):
        att = self.alpha_net(torch.tanh(self.h2att(h)[:, None, :] + p_att_feats))
        att = F.softmax(att.squeeze(-1), dim=-1)
        att_res = torch.bmm(att.unsqueeze(1), att_feats).squeeze(1)
        return att_res

PyTorch Tutorial for Deep Learning Researchers

Pros of pytorch-tutorial

  • Comprehensive coverage of PyTorch basics and various deep learning models
  • Well-structured with clear explanations and comments in the code
  • Suitable for beginners learning PyTorch and deep learning concepts

Cons of pytorch-tutorial

  • Focuses on general PyTorch usage rather than specific applications like image captioning
  • May lack advanced techniques and optimizations for production-level projects
  • Does not include reinforcement learning or policy gradient methods

Code Comparison

pytorch-tutorial (Basic PyTorch usage):

import torch
import torch.nn as nn

# Define a simple neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.fc = nn.Linear(784, 10)

self-critical.pytorch (Image captioning model):

class AttentionModel(nn.Module):
    def __init__(self, opt):
        super(AttentionModel, self).__init__()
        self.att_type = opt.att_type
        self.rnn_type = opt.rnn_type
        self.rnn_size = opt.rnn_size

The code snippets highlight the difference in focus between the two repositories. pytorch-tutorial provides a basic example of creating a neural network, while self-critical.pytorch shows a more specialized model for image captioning with attention mechanisms.

Conditional Transformer Language Model for Controllable Generation

Pros of CTRL

  • Larger scale language model with 1.63 billion parameters
  • Supports controllable text generation with prompt-based control codes
  • Trained on a diverse dataset including web pages, books, and Wikipedia

Cons of CTRL

  • More complex and resource-intensive to run and fine-tune
  • Less focused on image captioning tasks specifically
  • May require more expertise to adapt for specialized applications

Code Comparison

CTRL example:

from transformers import CTRLTokenizer, CTRLLMHeadModel

tokenizer = CTRLTokenizer.from_pretrained("ctrl")
# The LM-head variant is required for text generation.
model = CTRLLMHeadModel.from_pretrained("ctrl")

input_ids = tokenizer.encode("Links", return_tensors="pt")
outputs = model.generate(input_ids, max_length=50)

self-critical.pytorch example:

import torch
from models import AttModel

# `opt` holds the model hyper-parameters (see opts.py); shapes are illustrative,
# with att_feats of shape (batch, num_regions, att_feat_size), e.g. 36 regions.
model = AttModel(opt)
fc_feats = torch.randn(1, opt.fc_feat_size)
att_feats = torch.randn(1, 36, opt.att_feat_size)
seq, _ = model.sample(fc_feats, att_feats)

Summary

CTRL is a more general-purpose language model with controllable generation capabilities, while self-critical.pytorch focuses specifically on image captioning using self-critical sequence training. CTRL offers more flexibility but requires more resources, whereas self-critical.pytorch is more specialized and potentially easier to use for image captioning tasks.

README

An Image Captioning codebase

This is a codebase for image captioning research.

It supports self-critical (SCST) training as well as standard cross-entropy training on COCO and Flickr30k, with bottom-up features and BLEU/METEOR/CIDEr evaluation (see the sections below).

A simple demo Colab notebook is available here.

Requirements

  • Python 3
  • PyTorch 1.3+ (along with torchvision) (tested with 1.13)
  • cider (already added as a submodule)
  • coco-caption (already added as a submodule; remember to follow the initialization steps in coco-caption/README.md)
  • yacs
  • lmdbdict
  • Optional: pytorch-lightning (Tested with 2.0)

Install

If you have difficulty running the training scripts in tools, you can try installing this repo as a Python package:

python -m pip install -e .

Pretrained models

Checkout MODEL_ZOO.md.

If you only want to do evaluation, follow the evaluation sections below after downloading the pretrained models (as well as the pretrained ResNet-101 or the precomputed bottom-up features; see data/README.md).

Train your own network on COCO/Flickr30k

Prepare data.

We now support both flickr30k and COCO. See details in data/README.md. (Note: the later sections assume COCO dataset; it should be trivial to use flickr30k.)

Start training

$ python tools/train.py --id fc --caption_model newfc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30

or

$ python tools/train.py --cfg configs/fc.yml --id fc

The train script will dump checkpoints into the folder specified by --checkpoint_path (default = log_$id/). By default, only the best-performing checkpoint on validation and the latest checkpoint are saved, to conserve disk space. You can also set --save_history_ckpt to 1 to save every checkpoint.

To resume training, specify the --start_from option as the path containing infos.pkl and model.pth (usually you can just set --start_from and --checkpoint_path to the same directory).
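For example, to resume the cross-entropy run from the command above (a sketch; adjust the id and paths to your own run):

$ python tools/train.py --cfg configs/fc.yml --id fc --start_from log_fc --checkpoint_path log_fc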

To inspect the training or validation curves, you can use TensorBoard; the loss histories are automatically dumped into --checkpoint_path.
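For example, assuming TensorBoard is installed and the checkpoint path from the command above:

$ tensorboard --logdir log_fc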

The command above uses scheduled sampling; you can set --scheduled_sampling_start to -1 to turn scheduled sampling off.

If you'd like to evaluate BLEU/METEOR/CIDEr scores during training in addition to the validation cross-entropy loss, use the --language_eval 1 option, but don't forget to pull the coco-caption submodule.
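If you cloned without submodules, the standard git command below initializes them (the coco-caption README may list extra setup steps):

$ git submodule update --init --recursive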

For all of the arguments, you can specify them in a yaml file and pass it with --cfg. Command-line arguments override the cfg file when they conflict.
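As a purely illustrative sketch (not the repo's actual configs/fc.yml; the keys are assumed to mirror the command-line option names used above), such a file might look like:

caption_model: newfc
input_json: data/cocotalk.json
input_fc_dir: data/cocotalk_fc
input_att_dir: data/cocotalk_att
input_label_h5: data/cocotalk_label.h5
batch_size: 10
learning_rate: 0.0005
learning_rate_decay_start: 0
scheduled_sampling_start: 0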

For more options, see opts.py.

Train using self critical

First, preprocess the dataset and build the cache used for computing the CIDEr score:

$ python scripts/prepro_ngrams.py --input_json data/dataset_coco.json --dict_json data/cocotalk.json --output_pkl data/coco-train --split train

Then copy the model pretrained with cross entropy. (Copying is not mandatory; it is just a backup.)

$ bash scripts/copy_model.sh fc fc_rl

Then

$ python tools/train.py --id fc_rl --caption_model newfc --input_json data/cocotalk.json --input_fc_dir data/cocotalk_fc --input_att_dir data/cocotalk_att --input_label_h5 data/cocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --cached_tokens coco-train-idxs --max_epoch 50 --train_sample_n 5

or

$ python tools/train.py --cfg configs/fc_rl.yml --id fc_rl

You will see a large boost in CIDEr score. :)

A few notes on training: starting self-critical training after 30 epochs, the CIDEr score goes up to 1.05 after 600k iterations (including the 30 epochs of pretraining).

Generate image captions

Evaluate on raw images

Note: this doesn't work for models trained with bottom-up features. Place all your images of interest into a folder, e.g. blah, and run the eval script:

$ python tools/eval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10

This tells the eval script to run on up to 10 images from the given folder. If you have a big GPU you can speed up evaluation by increasing batch_size. Use --num_images -1 to process all images. The eval script will create a vis.json file inside the vis folder, which can then be visualized with the provided HTML interface:

$ cd vis
$ python -m http.server

Now visit localhost:8000 in your browser and you should see your predicted captions.

Evaluate on Karpathy's test split

$ python tools/eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 

The default split to evaluate is test. The default inference method is greedy decoding (--sample_method greedy); to sample from the posterior, set --sample_method sample.

Beam Search. Beam search can improve performance over greedy decoding by roughly 5%, at the cost of slower inference. To turn on beam search, use --beam_size N with N greater than 1.
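For example, to rerun the Karpathy test-split evaluation above with a beam of 5 (flags as used elsewhere in this README):

$ python tools/eval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 --beam_size 5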

Evaluate on COCO test set

$ python tools/eval.py --input_json cocotest.json --input_fc_dir data/cocotest_bu_fc --input_att_dir data/cocotest_bu_att --input_label_h5 none --num_images -1 --model model.pth --infos_path infos.pkl --language_eval 0

You can download the preprocessed files cocotest.json, cocotest_bu_att, and cocotest_bu_fc from the link.

Miscellanea

Using cpu. The code currently uses the GPU by default; there is no option to switch. If someone really needs a CPU model, please open an issue; I can potentially create a CPU checkpoint and modify eval.py to run the model on the CPU. However, there is no point in using CPUs to train the model.
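A minimal sketch of the kind of change this would involve (load_on_cpu is a hypothetical helper; map_location is a standard torch.load argument):

import torch

def load_on_cpu(model, checkpoint_path):
    # map_location='cpu' remaps tensors that were saved on the GPU onto the CPU.
    model.load_state_dict(torch.load(checkpoint_path, map_location='cpu'))
    return model.cpu().eval()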

Train on other dataset. It should be trivial to port the code if you can create a file like dataset_coco.json for your own dataset.
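As an illustration only, the sketch below builds the kind of per-image entry a dataset_coco.json-style file holds; the field names are assumptions and should be checked against a real dataset_coco.json before relying on them.

import json

# Illustrative entry: one image with one tokenized caption and a split assignment.
entry = {
    "filename": "img_0001.jpg",   # image file name
    "filepath": "images",         # sub-folder containing the image
    "split": "train",             # one of train / val / test
    "sentences": [
        {"raw": "a dog runs on the grass",
         "tokens": ["a", "dog", "runs", "on", "the", "grass"]},
    ],
}
with open("dataset_mydataset.json", "w") as f:
    json.dump({"dataset": "mydataset", "images": [entry]}, f)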

Live demo. Not supported for now. Pull requests are welcome.

For more advanced features:

Checkout ADVANCED.md.

Reference

If you find this repo useful, please consider citing (no obligation at all):

@article{luo2018discriminability,
  title={Discriminability objective for training descriptive captions},
  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},
  journal={arXiv preprint arXiv:1803.04376},
  year={2018}
}

Of course, please also cite the original papers of the models you are using (you can find the references in the model files).

Acknowledgements

Thanks to the original neuraltalk2 and the awesome PyTorch team.