Top Related Projects
- lm-evaluation-harness: A framework for few-shot evaluation of language models.
- promptsource: Toolkit for creating, sharing and using natural language prompts.
- BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
- evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
- evaluate: 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Quick Overview
The allenai/natural-instructions repository is a large collection of natural language tasks and their instructions. It aims to facilitate research on instruction-following language models and to improve their generalization capabilities. The project contains a diverse set of tasks across various domains, languages, and formats.
Pros
- Extensive collection of tasks: Contains over 1,600 diverse tasks in multiple languages
- Well-structured data: Tasks are organized with clear instructions, inputs, and outputs
- Facilitates research: Enables studies on instruction-following models and their generalization
- Open-source: Allows for community contributions and improvements
Cons
- Large dataset size: May require significant computational resources to process and use
- Potential inconsistencies: Given the diverse sources and contributors, some tasks may have varying quality or formatting
- Limited to text-based tasks: Does not include multimodal or audio-based instructions
- Ongoing development: May have frequent updates, requiring users to stay current with changes
Code Examples
This repository primarily contains data and does not include a code library. However, here are some examples of how you might interact with the data:
# Example 1: Loading a single task file
import json

with open('tasks/task001_quoref_question_generation.json', 'r') as f:
    task_data = json.load(f)

# Field names follow the task schema described in the README (e.g., "Definition", "Positive Examples")
print(task_data['Definition'][0])
print(task_data['Positive Examples'][0]['input'])
print(task_data['Positive Examples'][0]['output'])
# Example 2: Iterating over tasks that belong to a specific category
import json
import os

category = 'Question Generation'  # any value that appears in a task's "Categories" field
for filename in os.listdir('tasks'):
    if not filename.endswith('.json'):
        continue
    with open(os.path.join('tasks', filename), 'r') as f:
        task_data = json.load(f)
    if category in task_data.get('Categories', []):
        print(f"Task: {filename}")
# Example 3: Extracting all unique languages used in the dataset
import glob
import json

languages = set()
for file in glob.glob('tasks/*.json'):
    with open(file, 'r') as f:
        task_data = json.load(f)
    for key in ('Input_language', 'Output_language', 'Instruction_language'):
        languages.update(task_data.get(key, []))
print(f"Unique languages in the dataset: {sorted(languages)}")
Getting Started
To get started with the natural-instructions dataset:
- Clone the repository:
git clone https://github.com/allenai/natural-instructions.git
cd natural-instructions
- Explore the tasks in the tasks/ directory.
- Use the provided scripts in the scripts/ directory to process or analyze the data as needed.
- Refer to the README.md file for detailed information on the dataset structure and guidelines for usage.
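After cloning, a quick sanity check is to count the task files. This is a minimal sketch and assumes it is run from the repository root:
# Count the task files in the cloned repository (run from the repo root).
import glob

task_files = glob.glob('tasks/task*.json')
print(f"Found {len(task_files)} task files")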
Competitor Comparisons
lm-evaluation-harness: A framework for few-shot evaluation of language models.
Pros of lm-evaluation-harness
- More comprehensive evaluation suite with a wider range of tasks and metrics
- Better support for distributed evaluation across multiple GPUs
- More active development and community contributions
Cons of lm-evaluation-harness
- Steeper learning curve and more complex setup process
- Less focus on natural language instructions and more on traditional NLP benchmarks
- Potentially higher computational requirements for running evaluations
Code Comparison
natural-instructions:
from natural_instructions import Task
task = Task.from_file("tasks/task_name.json")
result = task.evaluate(model_output)
lm-evaluation-harness:
from lm_eval import tasks, evaluator
task_dict = tasks.get_task_dict(["task1", "task2"])
results = evaluator.evaluate(model, task_dict, num_fewshot=5)
The code comparison suggests that natural-instructions lends itself to simple, per-task evaluation (the snippet above is illustrative, since the repository ships data rather than an installable Python library), while lm-evaluation-harness offers a more flexible approach for evaluating multiple tasks simultaneously, with additional parameters such as the number of few-shot examples.
Both repositories serve different purposes in the field of language model evaluation, with natural-instructions emphasizing natural language task descriptions and lm-evaluation-harness providing a broader evaluation framework for various NLP benchmarks.
promptsource: Toolkit for creating, sharing and using natural language prompts.
Pros of promptsource
- More user-friendly interface for prompt creation and management
- Supports a wider range of datasets and prompt types
- Offers a collaborative platform for prompt engineering
Cons of promptsource
- Less focus on instruction-following tasks
- Smaller collection of pre-defined prompts and instructions
- May require more setup and configuration
Code comparison
natural-instructions:
from natural_instructions import NaturalInstructionsDataset
dataset = NaturalInstructionsDataset("path/to/tasks")
for example in dataset:
    print(example.input, example.output)
promptsource:
from datasets import load_dataset
from promptsource.templates import DatasetTemplates
dataset = load_dataset("squad")
templates = DatasetTemplates("squad")
template = templates[templates.all_template_names[0]]
for example in dataset["train"]:
    prompt = template.apply(example)[0]  # apply() returns [input_text, target_text]
    print(prompt)
Both repositories aim to improve prompt engineering and dataset creation for language models. natural-instructions focuses on instruction-following tasks with a large collection of pre-defined prompts, while promptsource offers a more flexible and collaborative approach to prompt creation across various datasets and task types.
BIG-bench: Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models.
Pros of BIG-bench
- Larger scale with over 200 diverse tasks, providing a more comprehensive evaluation
- Collaborative effort with contributions from multiple institutions and researchers
- Includes more complex and multi-step reasoning tasks
Cons of BIG-bench
- Higher complexity and resource requirements for implementation and evaluation
- Less focus on natural language instructions, which may limit applicability for certain use cases
- Potentially more challenging to extend or customize for specific domains
Code Comparison
natural-instructions:
from natural_instructions import NaturalInstructionsTask
task = NaturalInstructionsTask("task_name")
inputs = task.get_inputs()
outputs = task.get_outputs()
BIG-bench:
from bigbench import task_api
task = task_api.Task.create("task_name")
inputs = task.get_examples()
scores = task.evaluate_model(model)
Both repositories aim to evaluate language models, but natural-instructions focuses on instruction-following capabilities, while BIG-bench offers a broader range of tasks for assessing various aspects of language model performance. natural-instructions may be more suitable for researchers interested in instruction-based learning, while BIG-bench provides a more comprehensive benchmark for general language model capabilities.
evals: Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
Pros of evals
- More comprehensive evaluation framework with a wider range of tasks and metrics
- Better integration with OpenAI's ecosystem and API
- More active development and community support
Cons of evals
- Primarily focused on OpenAI models, potentially limiting its applicability to other AI systems
- More complex setup and configuration process
- Steeper learning curve for new users
Code comparison
natural-instructions:
from natural_instructions import NaturalInstructions
dataset = NaturalInstructions("data/tasks")
for example in dataset:
    print(example.input, example.output)
evals:
from evals.api import CompletionFn, CompletionResult
from evals.elsuite.basic.match import Match
def completion_fn(prompt: str, **kwargs) -> CompletionResult:
    # Implement your model's completion function here
    pass

eval = Match(completion_fn=completion_fn)
eval.run()
evaluate: 🤗 Evaluate: A library for easily evaluating machine learning models and datasets.
Pros of evaluate
- More comprehensive evaluation framework with support for various metrics and tasks
- Integrates seamlessly with Hugging Face's ecosystem (datasets, models, etc.)
- Active development and community support
Cons of evaluate
- Steeper learning curve for beginners
- May be overkill for simple evaluation tasks
- Requires additional dependencies
Code comparison
natural-instructions:
from natural_instructions import NaturalInstructions
ni = NaturalInstructions()
task = ni.get_task("task_name")
result = ni.evaluate(task, model_output)
evaluate:
from evaluate import load
metric = load("accuracy")
results = metric.compute(predictions=predictions, references=references)
Key differences
- natural-instructions focuses on instruction-following tasks, while evaluate is a general-purpose evaluation framework
- evaluate offers more flexibility in metric selection and customization
- natural-instructions provides a simpler API for specific instruction-based evaluations
Use cases
- natural-instructions: Best for evaluating models on instruction-following tasks
- evaluate: Ideal for comprehensive model evaluation across various metrics and tasks
README
A Repository of Language Instructions for NLP Tasks
TLDR; this repository maintains a community effort to create a large collection of tasks and their natural language definitions/instructions.
Check the releases for the summary of the latest changes and additions to the tasks.
If you have any suggestions to improve the data, let us know. We're looking for more contributions to make this data better and bigger!
News Bulletin
- May 2022: We released several models trained on our data. Check out the code and checkpoints.
- April 2022: A paper on our data is out!
- October 15, 2021: the goal date for our v2 dataset.
  - The community has contributed over 1,500 tasks!
  - We are working on cleaning up the new tasks and publishing a paper summarizing our new findings!
  - You can still submit new tasks! The new tasks will be part of future data releases.
- Sept 2021: the general call for contributions is out!
- June 2021: we initiated this repository with 61 tasks!
Background
Why define tasks in natural language?
While the current dominant paradigm (supervised learning with task-specific labeled examples) has been successful in building task-specific models, such models can't generalize to unseen tasks; for example, a model supervised to answer questions cannot solve a classification task. We hypothesize that a model equipped with the ability to understand and reason over natural language instructions should be able to generalize to any task that can be defined in terms of natural language.
Any empirical evidence that this might be true?
In our earlier effort, we built a smaller dataset (61 tasks) and observed that language models benefit from language instructions: their generalization to unseen tasks improves when they are provided with instructions. Generalization to unseen tasks also improves as the model is trained on more tasks.
Why build this dataset?
We believe that our earlier work is just scratching the surface and there is probably much more that can be studied in this setup. We hope to put together a much larger dataset that covers a wider range of reasoning abilities. We believe that this expanded dataset will serve as a useful playground for the community to study and build the next generation of AI/NLP models. See this blog post for a summary of the motivation behind this work.
Task schema
Each task consists of input/output pairs. For example, consider the task of sentiment classification:
- Input:
I thought the Spiderman animation was good, but the movie disappointed me.
- Output:
Mixed
Here is another example from the same task:
- Input:
The pumpkin was one of the worst that I've had in my life.
- Output:
Negative
Additionally, each task contains a task definition:
Given a tweet, classify it into one of 4 categories: Positive, Negative, Neutral, or Mixed.
Overall, each task follows this schema. In json, it looks like this:
{
"Contributors": [""],
"Source": [""],
"URL": [""],
"Categories": [""],
"Reasoning": [""],
"Definition": [""],
"Input_language": [""],
"Output_language": [""],
"Instruction_language": [""],
"Domains": [""],
"Positive Examples": [ { "input": "", "output": "", "explanation": ""} ],
"Negative Examples": [ { "input": "", "output": "", "explanation": ""} ],
"Instances": [ { "id": "", "input": "", "output": [""]} ],
}
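Given this schema, a common way to use a task is to turn its definition, a few positive examples, and an instance input into a single prompt. The snippet below is a minimal sketch of that pattern; the field names come from the schema above, but the prompt layout itself is only an illustration, not an official format:
# Sketch: build a prompt from a task's definition, positive examples, and one instance.
# Field names follow the schema above; the exact prompt layout is illustrative.
import json

def build_prompt(task_path: str, instance_index: int = 0, num_examples: int = 2) -> str:
    with open(task_path, encoding="utf-8") as f:
        task = json.load(f)
    parts = ["Definition: " + task["Definition"][0]]
    for ex in task["Positive Examples"][:num_examples]:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    instance = task["Instances"][instance_index]
    parts.append(f"Input: {instance['input']}\nOutput:")
    return "\n\n".join(parts)

# Example usage:
# print(build_prompt("tasks/task001_quoref_question_generation.json"))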
How to contribute
We would appreciate any external contributions! You can contribute in a variety of ways.
- If you think an important task is missing, you can contribute it via a pull request. You can also get inspiration from the task suggestions in the GitHub issues, which you can sign up to work on.
- If you have any other suggested tasks but you're not sure if they're a good fit, bring them up in the issues.
- If you have any questions or suggestions, please use the issues feature.
- If you're adding a new task, make sure to review the following guidelines:
  - Each task must contain a .json file with the task content. You can look inside the tasks/ directory for several examples.
    - Make sure that your json is human readable (use proper indentation; e.g., in Python: json.dumps(your_json_string, indent=4, ensure_ascii=False)).
    - Make sure that your json file is not bigger than 50MB.
    - Make sure your task has no more than 6.5k instances (input/output pairs).
    - Each instance must have a unique id, which should be the task number plus a string generated by uuid.uuid4().hex, e.g., task1356-bb5ff013dc5d49d7a962e85ed1de526b (see the sketch after this list).
    - Make sure to include the task category and domains, based on this list.
    - Make sure to number your task json correctly: look at the task number in the latest pull request; the task number in your submission should be the next number.
    - Make sure to include the source dataset name and the task type when naming your task json file. You can use this format: taskabc_<source_dataset>_<task_type>.json. E.g., in task001_quoref_question_generation.json, the source dataset is quoref and the task is question generation.
      - Note that the source need not necessarily be a dataset and can be a website, e.g., leetcode.
      - If you have created the json without any reference, use synthetic in place of the source.
    - You should have one pull request per dataset. Name your pull request as Task Name <start_task_number>-<end_task_number>.
    - If you're building your tasks based on existing datasets and their crowdsourcing templates, see these guidelines.
  - Add your task to our list of tasks.
  - To make sure that your addition is formatted correctly, run the tests: > python src/test_all.py
    - To only test the formatting of a range of tasks, run > python src/test_all.py --task <begin_task_number> <end_task_number>. For example, running > python src/test_all.py --task 5 10 will run the tests from task005 to task010.
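The sketch below shows how some of these checks might be automated before submitting. It is a hypothetical helper, not the official test suite; the authoritative checks are the ones run by src/test_all.py.
# Hypothetical pre-submission checks for a new task file (sketch only; the
# authoritative checks are the ones in src/test_all.py).
import json
import os
import re
import uuid

def draft_instance_id(task_number: int) -> str:
    # e.g. "task1356-bb5ff013dc5d49d7a962e85ed1de526b"
    return f"task{task_number}-{uuid.uuid4().hex}"

def check_task_file(path: str, max_mb: int = 50, max_instances: int = 6500) -> None:
    assert os.path.getsize(path) <= max_mb * 1024 * 1024, "file larger than 50MB"
    with open(path, encoding="utf-8") as f:
        task = json.load(f)
    instances = task["Instances"]
    assert len(instances) <= max_instances, "more than 6.5k instances"
    ids = [inst["id"] for inst in instances]
    assert len(ids) == len(set(ids)), "instance ids are not unique"
    assert all(re.fullmatch(r"task\d+-[0-9a-f]{32}", i) for i in ids), "unexpected id format"

# Example usage:
# check_task_file("tasks/task001_quoref_question_generation.json")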
Benchmarking cross-task generalization
As introduced in our paper, this dataset can be used for the systematic study of cross-task generalization, i.e., training on a subset of tasks and evaluating on the remaining unseen ones. To make comparisons among different methods easier, we provide an official split here, as described in the paper. You can follow the instructions to set up your experiments.
We also released our experiment code and checkpoints for reproducibility and future research.
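As a rough illustration, an experiment along these lines reads the official split's task lists and loads the corresponding task files. The split file paths below (splits/default/train_tasks.txt and test_tasks.txt) are an assumption about the split layout; check the repository's instructions for the actual paths.
# Sketch of a cross-task generalization setup: train on one set of tasks,
# evaluate on held-out ones. The split file paths are assumptions; see the
# official split instructions in the repository for the real layout.
import json
from pathlib import Path

def load_task_names(split_file: str) -> list[str]:
    return [line.strip() for line in Path(split_file).read_text().splitlines() if line.strip()]

def load_tasks(task_names: list[str], tasks_dir: str = "tasks") -> dict:
    return {name: json.loads(Path(tasks_dir, f"{name}.json").read_text(encoding="utf-8"))
            for name in task_names}

train_tasks = load_tasks(load_task_names("splits/default/train_tasks.txt"))
test_tasks = load_tasks(load_task_names("splits/default/test_tasks.txt"))
print(len(train_tasks), "training tasks;", len(test_tasks), "held-out evaluation tasks")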
License
All the data here (except the instances of each task) are released under the Apache-2.0 license. The instances of each task are subject to the license under which the original dataset was released. This license information is available under the "Instance License" field within each task file.
Misc.
If you want to use Natural Instructions v1, here's the code: link
Feel free to cite us.
@inproceedings{naturalinstructions,
title={Cross-task generalization via natural language crowdsourcing instructions},
author={Mishra, Swaroop and Khashabi, Daniel and Baral, Chitta and Hajishirzi, Hannaneh},
booktitle={ACL},
year={2022}
}
@inproceedings{supernaturalinstructions,
title={Super-NaturalInstructions: Generalization via Declarative Instructions on 1600+ Tasks},
author={Wang, Yizhong and Mishra, Swaroop and Alipoormolabashi, Pegah and Kordi, Yeganeh and Mirzaei, Amirreza and Arunkumar, Anjana and Ashok, Arjun and Dhanasekaran, Arut Selvan and Naik, Atharva and Stap, David and others},
booktitle={EMNLP},
year={2022}
}