data
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Quick Overview
Meta-PyTorch/data is a repository containing datasets and data loading utilities for PyTorch, specifically tailored for meta-learning tasks. It provides a collection of popular meta-learning datasets and tools to efficiently load and preprocess data for meta-learning experiments.
Pros
- Specialized for meta-learning tasks, saving time on dataset preparation
- Includes popular meta-learning datasets like Omniglot and Mini-ImageNet
- Offers efficient data loading and preprocessing utilities
- Integrates seamlessly with PyTorch ecosystem
Cons
- Limited to meta-learning datasets, not suitable for general-purpose machine learning tasks
- May require additional dependencies for specific datasets
- Documentation could be more comprehensive for some datasets
- Updates and maintenance may not be as frequent as larger, more general-purpose libraries
Code Examples
Loading the Omniglot dataset:
```python
from meta_pytorch.data import OmniglotDataset

dataset = OmniglotDataset(root='./data', download=True)
```
Creating a meta-learning task sampler:
```python
from meta_pytorch.data import TaskSampler

sampler = TaskSampler(dataset, n_way=5, k_shot=1, query_size=15)
```
Using a data loader for meta-learning:
```python
from torch.utils.data import DataLoader

dataloader = DataLoader(dataset, batch_sampler=sampler, num_workers=4)
```
Getting Started
To get started with meta-pytorch/data, follow these steps:
1. Install the library:

```
pip install meta-pytorch
```

2. Import and use the datasets:

```python
from meta_pytorch.data import OmniglotDataset, MiniImageNetDataset

omniglot = OmniglotDataset(root='./data', download=True)
mini_imagenet = MiniImageNetDataset(root='./data', download=True)
```

3. Create a task sampler and data loader:

```python
from meta_pytorch.data import TaskSampler
from torch.utils.data import DataLoader

sampler = TaskSampler(omniglot, n_way=5, k_shot=1, query_size=15)
dataloader = DataLoader(omniglot, batch_sampler=sampler, num_workers=4)
```

4. Iterate through the data in your meta-learning experiment:

```python
for batch in dataloader:
    # Your meta-learning training loop here
    pass
```
Competitor Comparisons
A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.
Pros of data
- More comprehensive dataset handling capabilities
- Better integration with PyTorch ecosystem
- Actively maintained with regular updates
Cons of data
- Potentially more complex API for simple use cases
- May have a steeper learning curve for beginners
- Larger codebase, which could impact performance in some scenarios
Code Comparison
data:
```python
from torchdata.datapipes.iter import IterableWrapper, FileOpener

dp = IterableWrapper(["file1.txt", "file2.txt"])
dp = FileOpener(dp, mode="r")
# FileOpener yields (path, stream) tuples, not raw strings.
for path, stream in dp:
    print(stream.read())
```
data>:
```python
from torch.utils.data import Dataset

class CustomDataset(Dataset):
    """Map-style dataset that returns the text content of each file."""

    def __init__(self, file_list):
        self.file_list = file_list

    def __getitem__(self, index):
        # Read the whole file at the given index as one sample.
        with open(self.file_list[index], "r") as f:
            return f.read()

    def __len__(self):
        return len(self.file_list)
```
The data repository offers a more flexible and powerful approach to data handling, utilizing DataPipes for efficient data processing. On the other hand, data> provides a simpler, more traditional Dataset implementation that may be easier to understand for those familiar with PyTorch's basic data utilities.
The fastai deep learning library
Pros of fastai
- More comprehensive library with a wider range of deep learning applications
- Higher-level API, making it easier for beginners to get started
- Extensive documentation and educational resources
Cons of fastai
- Less flexible for low-level customization
- Potentially slower execution compared to pure PyTorch implementations
- Steeper learning curve for understanding the entire ecosystem
Code Comparison
fastai:
```python
from fastai.vision.all import *

path = untar_data(URLs.PETS)
dls = ImageDataLoaders.from_folder(path, valid_pct=0.2, item_tfms=Resize(224))
learn = cnn_learner(dls, resnet34, metrics=error_rate)
learn.fine_tune(1)
```
meta-pytorch/data:
```python
import torch
from torchvision import datasets, transforms

transform = transforms.Compose([transforms.Resize(224), transforms.ToTensor()])
dataset = datasets.ImageFolder('path/to/data', transform=transform)
dataloader = torch.utils.data.DataLoader(dataset, batch_size=32, shuffle=True)
```
The fastai code showcases its high-level API for quick model creation and training, while the meta-pytorch/data example demonstrates a more low-level approach to data loading and preprocessing.
README
TorchData
What is TorchData? | Stateful DataLoader | Install guide | Contributing | License
What is TorchData?
The TorchData project is an iterative enhancement to the PyTorch torch.utils.data.DataLoader and torch.utils.data.Dataset/IterableDataset to make them scalable, performant dataloading solutions. We will be iterating on the enhancements under the torchdata repo.
Our first change adds checkpointing to torch.utils.data.DataLoader via stateful_dataloader, a drop-in replacement for torch.utils.data.DataLoader. It defines load_state_dict and state_dict methods that enable mid-epoch checkpointing, along with an API for users to track custom iteration progress and other custom state from the dataloader workers, such as token buffers and/or RNG states.
Stateful DataLoader
torchdata.stateful_dataloader.StatefulDataLoader is a drop-in replacement for torch.utils.data.DataLoader which
provides state_dict and load_state_dict functionality. See
the Stateful DataLoader main page for more information and examples. Also check out the
examples
in this Colab notebook.
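The checkpoint/resume pattern that StatefulDataLoader provides can be illustrated with a minimal, library-free sketch. This is a conceptual illustration only, not TorchData's actual implementation: the real StatefulDataLoader also captures worker state, RNG state, and prefetch buffers.

```python
# Conceptual sketch of mid-epoch checkpointing: an iterator that can
# export its position as a state dict and resume from it later.
# Hypothetical class, NOT part of torchdata.

class ResumableLoader:
    def __init__(self, data):
        self.data = list(data)
        self._pos = 0

    def __iter__(self):
        while self._pos < len(self.data):
            item = self.data[self._pos]
            self._pos += 1
            yield item

    def state_dict(self):
        # Snapshot of iteration progress, safe to serialize.
        return {"pos": self._pos}

    def load_state_dict(self, state):
        # Restore progress so iteration resumes mid-epoch.
        self._pos = state["pos"]

loader = ResumableLoader(range(6))
it = iter(loader)
first = [next(it) for _ in range(3)]   # consume half the epoch
ckpt = loader.state_dict()             # save a mid-epoch checkpoint

resumed = ResumableLoader(range(6))
resumed.load_state_dict(ckpt)          # restore progress in a fresh loader
rest = list(resumed)

print(first, rest)  # [0, 1, 2] [3, 4, 5]
```

With the real StatefulDataLoader, the same pattern applies: call `state_dict()` at any point during iteration, serialize it with your model checkpoint, and pass it to `load_state_dict()` on a new loader to resume where you left off.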
torchdata.nodes
torchdata.nodes is a library of composable iterators (not iterables!) that lets you chain together common dataloading and pre-processing operations. It follows a streaming programming model, although a "sampler + Map-style" setup can still be configured if you desire. See the torchdata.nodes main page for more details. Stay tuned for a tutorial on torchdata.nodes, coming soon!
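The idea of chaining composable iterator nodes can be sketched in plain Python. The `Mapper` and `Batcher` classes below are hypothetical illustrations of the streaming style, not the torchdata.nodes API:

```python
# Minimal sketch of composable iterator "nodes": each node wraps an
# upstream iterable and applies one dataloading/pre-processing step.
# Hypothetical classes, NOT torchdata.nodes itself.

class Mapper:
    def __init__(self, source, fn):
        self.source, self.fn = source, fn

    def __iter__(self):
        # Stream items through, applying fn one at a time.
        for item in self.source:
            yield self.fn(item)

class Batcher:
    def __init__(self, source, batch_size):
        self.source, self.batch_size = source, batch_size

    def __iter__(self):
        batch = []
        for item in self.source:
            batch.append(item)
            if len(batch) == self.batch_size:
                yield batch
                batch = []
        if batch:  # emit the final partial batch
            yield batch

# Chain nodes: square each element, then group into batches of 3.
pipeline = Batcher(Mapper(range(7), lambda x: x * x), batch_size=3)
print(list(pipeline))  # [[0, 1, 4], [9, 16, 25], [36]]
```

Because each node only holds one item (or one batch) at a time, the chain processes data in a streaming fashion without materializing intermediate results.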
Installation
Version Compatibility
The following table lists the torchdata version and supported Python versions corresponding to each torch release.
| torch | torchdata | python |
|---|---|---|
| master / nightly | main / nightly | >=3.9, <=3.13 |
| 2.6.0 | 0.11.0 | >=3.9, <=3.13 |
| 2.5.0 | 0.10.0 | >=3.9, <=3.12 |
| 2.5.0 | 0.9.0 | >=3.9, <=3.12 |
| 2.4.0 | 0.8.0 | >=3.8, <=3.12 |
| 2.0.0 | 0.6.0 | >=3.8, <=3.11 |
| 1.13.1 | 0.5.1 | >=3.7, <=3.10 |
| 1.12.1 | 0.4.1 | >=3.7, <=3.10 |
| 1.12.0 | 0.4.0 | >=3.7, <=3.10 |
| 1.11.0 | 0.3.0 | >=3.7, <=3.10 |
Local pip or conda
First, set up an environment. We will be installing a PyTorch binary as well as torchdata. If you're using conda, create a conda environment:
```
conda create --name torchdata
conda activate torchdata
```
If you wish to use venv instead:
```
python -m venv torchdata-env
source torchdata-env/bin/activate
```
Install torchdata:
Using pip:
```
pip install torchdata
```
Using conda:
```
conda install -c pytorch torchdata
```
From source
```
pip install .
```
In case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the contributing page.
From nightly
The nightly version of TorchData is also provided and updated daily from main branch.
Using pip:
```
pip install --pre torchdata --index-url https://download.pytorch.org/whl/nightly/cpu
```
Using conda:
```
conda install torchdata -c pytorch-nightly
```
Contributing
We welcome PRs! See the CONTRIBUTING file.
Beta Usage and Feedback
We'd love to hear from and work with early adopters to shape our designs. Please reach out by raising an issue if you're interested in using this tooling for your project.
License
TorchData is BSD licensed, as found in the LICENSE file.