Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

Top Related Projects

Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

AIF360: A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

fairlearn: A Python package to assess and improve fairness of machine learning models.

Interpret: Fit interpretable models. Explain blackbox machine learning.

SHAP: A game theoretic approach to explain the output of any machine learning model.

AIX360: Interpretability and explainability of data and machine learning models

Quick Overview

Giskard-AI/giskard-oss is an open-source AI testing framework designed to detect and prevent AI failures in production. It provides a comprehensive suite of tools for testing, monitoring, and debugging machine learning models, with a focus on ensuring model reliability, fairness, and performance across various scenarios.

Pros

  • Comprehensive testing suite for AI models, covering various aspects such as performance, fairness, and robustness
  • User-friendly interface for creating and managing tests, making it accessible to both technical and non-technical users
  • Integrates well with popular machine learning frameworks and workflows
  • Supports multiple programming languages and environments

Cons

  • May require a learning curve for users new to AI testing concepts
  • Documentation could be more extensive for advanced use cases
  • Limited community support compared to more established testing frameworks
  • Some features may be better suited for enterprise-level projects, potentially overwhelming for smaller teams

Code Examples

  1. Creating a simple test case:
from giskard import test

@test
def test_model_accuracy(model, dataset):
    predictions = model.predict(dataset)
    accuracy = (predictions == dataset.target).mean()
    assert accuracy > 0.8, "Model accuracy is below 80%"

  2. Testing for fairness across protected groups:
from giskard import test, FairnessMetric

@test
def test_gender_fairness(model, dataset):
    fairness = FairnessMetric(protected_feature='gender')
    score = fairness.compute(model, dataset)
    assert score > 0.9, "Gender bias detected in model predictions"

  3. Generating adversarial examples:
from giskard import test, AdversarialGenerator

@test
def test_adversarial_robustness(model, dataset):
    generator = AdversarialGenerator()
    adversarial_examples = generator.generate(model, dataset)
    robustness_score = model.evaluate(adversarial_examples)
    assert robustness_score > 0.7, "Model is not robust against adversarial attacks"

Getting Started

To get started with Giskard, follow these steps:

  1. Install Giskard:
pip install giskard

  2. Import Giskard and set up your model and dataset:
from giskard import Model, Dataset

model = Model(model=your_model_function, model_type="classification")
dataset = Dataset(df=your_dataframe, target="your_target_column")

  3. Create and run tests:
from giskard import test, Suite

@test
def your_custom_test(model, dataset):
    # Your test logic here
    pass

suite = Suite(tests=[your_custom_test])
results = suite.run(model, dataset)
print(results.summary())

Competitor Comparisons

Responsible AI Toolbox is a suite of tools providing model and data exploration and assessment user interfaces and libraries that enable a better understanding of AI systems. These interfaces and libraries empower developers and stakeholders of AI systems to develop and monitor AI more responsibly, and take better data-driven actions.

Pros of Responsible AI Toolbox

  • Comprehensive suite of tools for responsible AI development, including interpretability, fairness, and error analysis
  • Extensive documentation and tutorials for easy adoption
  • Backed by Microsoft, ensuring long-term support and updates

Cons of Responsible AI Toolbox

  • Steeper learning curve due to the wide range of features
  • Primarily focused on tabular data and traditional machine learning models
  • Less emphasis on real-time monitoring and production-ready features

Code Comparison

Responsible AI Toolbox:

from raiwidgets import ExplanationDashboard

ExplanationDashboard(global_explanation, model, dataset, true_y, features)

Giskard:

from giskard import scan

scan_results = scan(model, dataset)

The Responsible AI Toolbox code snippet demonstrates the use of an explanation dashboard, while Giskard's code shows a simpler scanning function for model analysis. Giskard's approach appears more straightforward, but the Responsible AI Toolbox offers more detailed visualization options.

AIF360: A comprehensive set of fairness metrics for datasets and machine learning models, explanations for these metrics, and algorithms to mitigate bias in datasets and models.

Pros of AIF360

  • Comprehensive suite of fairness metrics and algorithms
  • Well-established project with extensive documentation
  • Supports multiple programming languages (Python, R, and NodeJS)

Cons of AIF360

  • Steeper learning curve due to its extensive feature set
  • Less focus on model monitoring and debugging
  • Primarily designed for offline analysis rather than real-time monitoring

Code Comparison

AIF360:

from aif360.datasets import BinaryLabelDataset
from aif360.metrics import BinaryLabelDatasetMetric

dataset = BinaryLabelDataset(...)
metric = BinaryLabelDatasetMetric(dataset, unprivileged_groups, privileged_groups)

Giskard:

from giskard import Dataset, Model, scan

dataset = Dataset(...)
model = Model(...)
scan_results = scan(model, dataset)

Summary

AIF360 offers a comprehensive suite of fairness metrics and algorithms, making it suitable for in-depth fairness analysis across multiple programming languages. However, it has a steeper learning curve and is primarily designed for offline analysis.

Giskard, on the other hand, focuses on model monitoring and debugging, providing a more user-friendly interface for real-time analysis. It may be more suitable for users looking for quick insights and continuous monitoring of their ML models.

fairlearn: A Python package to assess and improve fairness of machine learning models.

Pros of fairlearn

  • More established project with a larger community and longer history
  • Focuses specifically on fairness metrics and mitigation techniques
  • Integrates well with popular machine learning libraries like scikit-learn

Cons of fairlearn

  • Limited scope compared to Giskard's broader testing capabilities
  • Less emphasis on model debugging and error analysis
  • May require more manual configuration for complex fairness scenarios

Code Comparison

fairlearn example:

from fairlearn.metrics import demographic_parity_difference
from fairlearn.reductions import DemographicParity, ExponentiatedGradient

# Mitigate demographic parity violations via the exponentiated gradient reduction
mitigator = ExponentiatedGradient(estimator, constraints=DemographicParity())
mitigator.fit(X, y, sensitive_features=A)
y_pred_mitigated = mitigator.predict(X)

Giskard example:

from giskard import scan, Dataset

dataset = Dataset(df, target="target")
scan_results = scan(model, dataset)
fairness_issues = scan_results.fairness_issues

While fairlearn focuses on specific fairness metrics and mitigation techniques, Giskard offers a more comprehensive approach to model testing and debugging, including fairness analysis as part of a broader suite of tests. fairlearn may be more suitable for projects with a strong focus on fairness, while Giskard provides a more general-purpose testing framework for machine learning models.

Fit interpretable models. Explain blackbox machine learning.

Pros of Interpret

  • More comprehensive and established library for interpretable machine learning
  • Supports a wider range of interpretation techniques and algorithms
  • Better documentation and examples for various use cases

Cons of Interpret

  • Steeper learning curve due to its extensive feature set
  • May be overkill for simpler interpretation tasks
  • Less focus on testing and quality assurance aspects of ML models

Code Comparison

Interpret:

from interpret import set_visualize_provider
from interpret.provider import InlineProvider
set_visualize_provider(InlineProvider())

from interpret.glassbox import ExplainableBoostingClassifier
ebm = ExplainableBoostingClassifier()
ebm.fit(X_train, y_train)

ebm_global = ebm.explain_global()
ebm_global.visualize()

Giskard:

from giskard import Model, Dataset, scan

model = Model(predict_function, model_type="classification")
dataset = Dataset(df_test, target="target", name="test_dataset")

giskard_report = scan(model, dataset)
giskard_report.to_html("scan_report.html")

Both libraries offer tools for model interpretation, but Interpret provides a more comprehensive set of techniques, while Giskard focuses on testing and quality assurance aspects of ML models.

SHAP: A game theoretic approach to explain the output of any machine learning model.

Pros of SHAP

  • More established and widely adopted in the data science community
  • Focuses specifically on model interpretability and feature importance
  • Supports a broader range of machine learning models and frameworks

Cons of SHAP

  • Limited to model explanation and doesn't offer comprehensive testing features
  • May require more manual effort to integrate into existing ML pipelines
  • Less emphasis on bias detection and fairness assessment

Code Comparison

SHAP example:

import shap
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)
shap.summary_plot(shap_values, X)

Giskard example:

import giskard
scan_report = giskard.scan(model, dataset)
test_suite = scan_report.generate_test_suite("My test suite")
results = test_suite.run()

SHAP focuses on generating and visualizing feature importance, while Giskard provides a more comprehensive testing suite for ML models, including performance, robustness, and fairness assessments. SHAP is better suited for in-depth model interpretability, whereas Giskard offers a broader range of testing capabilities for ML pipelines.

AIX360: Interpretability and explainability of data and machine learning models

Pros of AIX360

  • More comprehensive set of explainability algorithms, including LIME, SHAP, and ProtoDash
  • Stronger focus on interpretability for various AI models, not just testing
  • Better documentation and tutorials for understanding complex AI concepts

Cons of AIX360

  • Less emphasis on continuous testing and monitoring of AI systems
  • Fewer features for detecting data drift and model performance issues
  • Not as user-friendly for non-technical users or those new to AI explainability

Code Comparison

AIX360:

from aix360.algorithms.contrastive import CEMExplainer
explainer = CEMExplainer(model)
explanation = explainer.explain_instance(x, num_features=5)

Giskard:

from giskard import Model, Dataset, scan
model = Model(predict_fn, model_type="classification")
dataset = Dataset(df, target="target")
scan_results = scan(model, dataset)

The AIX360 code focuses on generating explanations for specific instances, while Giskard's code is geared towards scanning entire datasets for potential issues.

README

The Evaluation & Testing framework for AI systems

Control risks of performance, bias and security issues in AI systems

Docs • Website • Community


Install Giskard 🐢

Install the latest version of Giskard from PyPi using pip:

pip install "giskard[llm]" -U

We officially support Python 3.9, 3.10 and 3.11.

Try in Colab 📙

Open Colab notebook


Giskard is an open-source Python library that automatically detects performance, bias & security issues in AI applications. The library covers LLM-based applications such as RAG agents, all the way to traditional ML models for tabular data.
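
For traditional ML models on tabular data, the workflow is the same: wrap your model and dataset, then scan. Below is a minimal, hedged sketch assuming a trained scikit-learn classifier clf and a pandas DataFrame df with a "label" column (both are placeholders):

import giskard

# Wrap the dataset and the model so Giskard's detectors can run on them
wrapped_dataset = giskard.Dataset(df, target="label", name="my_tabular_dataset")
wrapped_model = giskard.Model(
    model=clf.predict_proba,  # prediction function returning class probabilities
    model_type="classification",
    classification_labels=list(clf.classes_),
    feature_names=[c for c in df.columns if c != "label"],
)

# Detect performance, robustness and bias issues, then export the report
scan_results = giskard.scan(wrapped_model, wrapped_dataset)
scan_results.to_html("tabular_scan_report.html")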

Scan: Automatically assess your LLM-based agents for performance, bias & security issues ⤵️

Issues detected include:

  • Hallucinations
  • Harmful content generation
  • Prompt injection
  • Robustness issues
  • Sensitive information disclosure
  • Stereotypes & discrimination
  • many more...

Scan Example

RAG Evaluation Toolkit (RAGET): Automatically generate evaluation datasets & evaluate RAG application answers ⤵️

If you're testing a RAG application, you can get an even more in-depth assessment using RAGET, Giskard's RAG Evaluation Toolkit.

  • RAGET can automatically generate a list of question, reference_answer and reference_context entries from the knowledge base of your RAG agent. You can then use this generated test set to evaluate your RAG agent.

  • RAGET computes scores for each component of the RAG agent. The scores are computed by aggregating the correctness of the agent’s answers on different question types (see the evaluation sketch after this list).

    • Here is the list of components evaluated with RAGET:
      • Generator: the LLM used inside the RAG to generate the answers
      • Retriever: fetches relevant documents from the knowledge base according to a user query
      • Rewriter: rewrites the user query to make it more relevant to the knowledge base or to account for chat history
      • Router: filters the user's query based on their intent
      • Knowledge Base: the set of documents given to the RAG to generate the answers
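
To give a concrete feel for this step, here is a minimal, hedged sketch of running the evaluation with giskard.rag.evaluate (answer_fn and my_rag_agent are placeholders for your own agent; the test set and knowledge base are assumed to have been generated as in the Quickstart below):

from giskard.rag import evaluate, QATestset

# Load a test set previously generated with RAGET
testset = QATestset.load("my_testset.jsonl")

def answer_fn(question, history=None):
    # Placeholder: call your own RAG agent here and return its answer as a string
    return my_rag_agent.answer(question)

# Aggregate correctness per question type and per RAG component,
# using the knowledge base built during test set generation
report = evaluate(answer_fn, testset=testset, knowledge_base=knowledge_base)
report.to_html("rag_evaluation_report.html")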

Test Suite Example

Giskard works with any model, in any environment and integrates seamlessly with your favorite tools ⤵️


Looking for solutions to evaluate computer vision models? Check out giskard-vision, a library dedicated to computer vision tasks.

🤸‍♀️ Quickstart

1. 🏗️ Build an LLM agent

Let's build an agent that answers questions about climate change, based on the 2023 Climate Change Synthesis Report by the IPCC.

Before starting, let's install the required libraries:

pip install langchain langchain-community langchain-openai tiktoken "pypdf<=3.17.0"

from langchain import FAISS, PromptTemplate
from langchain_openai import OpenAIEmbeddings, OpenAI
from langchain.document_loaders import PyPDFLoader
from langchain.chains import RetrievalQA
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Prepare vector store (FAISS) with IPCC report
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100, add_start_index=True)
loader = PyPDFLoader("https://www.ipcc.ch/report/ar6/syr/downloads/report/IPCC_AR6_SYR_LongerReport.pdf")
db = FAISS.from_documents(loader.load_and_split(text_splitter), OpenAIEmbeddings())

# Prepare QA chain
PROMPT_TEMPLATE = """You are the Climate Assistant, a helpful AI assistant made by Giskard.
Your task is to answer common questions on climate change.
You will be given a question and relevant excerpts from the IPCC Climate Change Synthesis Report (2023).
Please provide short and clear answers based on the provided context. Be polite and helpful.

Context:
{context}

Question:
{question}

Your answer:
"""

llm = OpenAI(model="gpt-3.5-turbo-instruct", temperature=0)
prompt = PromptTemplate(template=PROMPT_TEMPLATE, input_variables=["question", "context"])
climate_qa_chain = RetrievalQA.from_llm(llm=llm, retriever=db.as_retriever(), prompt=prompt)

2. 🔎 Scan your model for issues

Next, wrap your agent to prepare it for Giskard's scan:

import giskard
import pandas as pd

def model_predict(df: pd.DataFrame):
    """Wraps the LLM call in a simple Python function.

    The function takes a pandas.DataFrame containing the input variables needed
    by your model, and must return a list of the outputs (one for each row).
    """
    return [climate_qa_chain.run({"query": question}) for question in df["question"]]

# Don’t forget to fill the `name` and `description`: they are used by Giskard
# to generate domain-specific tests.
giskard_model = giskard.Model(
    model=model_predict,
    model_type="text_generation",
    name="Climate Change Question Answering",
    description="This model answers any question about climate change based on IPCC reports",
    feature_names=["question"],
)

✨✨✨Then run Giskard's magical scan✨✨✨

scan_results = giskard.scan(giskard_model)

Once the scan completes, you can display the results directly in your notebook:

display(scan_results)

# Or save it to a file
scan_results.to_html("scan_results.html")

If you're facing issues, check out our docs for more information.

3. 🪄 Automatically generate an evaluation dataset for your RAG applications

If the scan found issues in your model, you can automatically extract a test suite based on the issues found:

test_suite = scan_results.generate_test_suite("My first test suite")

To build a full evaluation dataset for a RAG application, use RAGET's test set generator. By default, RAGET automatically generates 6 different question types (these can be selected if needed, see advanced question generation). The total number of questions is divided equally between each question type. To make the question generation more relevant and accurate, you can also provide a description of your agent.


from giskard.rag import generate_testset, KnowledgeBase

# Load your data and initialize the KnowledgeBase
df = pd.read_csv("path/to/your/knowledge_base.csv")

knowledge_base = KnowledgeBase.from_pandas(df, columns=["column_1", "column_2"])

# Generate a testset with 10 questions & answers for each question type (this will take a while)
testset = generate_testset(
    knowledge_base,
    num_questions=60,
    language='en',  # optional, we'll auto detect if not provided
    agent_description="A customer support chatbot for company X", # helps generate better questions
)

Depending on how many questions you generate, this can take a while. Once you’re done, you can save this generated test set for future use:

# Save the generated testset
testset.save("my_testset.jsonl")

You can easily load it back:

from giskard.rag import QATestset

loaded_testset = QATestset.load("my_testset.jsonl")

# Convert it to a pandas dataframe
df = loaded_testset.to_pandas()

Here’s an example of a generated question:

  • question: For which countries can I track my shipping?
  • reference_context: Document 1: We offer free shipping on all orders over $50. For orders below $50, we charge a flat rate of $5.99. We offer shipping services to customers residing in all 50 states of the US, in addition to providing delivery options to Canada and Mexico. Document 2: Once your purchase has been successfully confirmed and shipped, you will receive a confirmation email containing your tracking number. You can simply click on the link provided in the email or visit our website’s order tracking page.
  • reference_answer: We ship to all 50 states in the US, as well as to Canada and Mexico. We offer tracking for all our shippings.
  • metadata: {"question_type": "simple", "seed_document_id": 1, "topic": "Shipping policy"}

Each row of the test set contains 5 columns:

  • question: the generated question
  • reference_context: the context that can be used to answer the question
  • reference_answer: the answer to the question (generated with GPT-4)
  • conversation_history: not shown in the example above; contains the history of the conversation with the agent as a list. It is only relevant for conversational questions and is an empty list otherwise.
  • metadata: a dictionary with various metadata about the question, including the question_type, the seed_document_id (the id of the document used to generate the question) and the topic of the question
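
If you want to run this test set as part of your regular Giskard tests, it can also be turned into a test suite. A short sketch, assuming the QATestset.to_test_suite helper and the giskard_model wrapper from the Quickstart:

# Convert the generated test set into a Giskard test suite and run it against the wrapped agent
test_suite = loaded_testset.to_test_suite("RAG correctness suite")
test_suite.run(model=giskard_model)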

👋 Community

We welcome contributions from the AI community! Read this guide to get started, and join our thriving community on Discord.

🌟 Leave us a star, it helps the project to get discovered by others and keeps us motivated to build awesome open-source tools! 🌟

❤️ If you find our work useful, please consider sponsoring us on GitHub. With a monthly sponsorship, you can get a sponsor badge, display your company in this README, and get your bug reports prioritized. We also offer one-time sponsorships if you want us to get involved in a consulting project, run a workshop, or give a talk at your company.

💚 Current sponsors

We thank the following companies, which sponsor our project with monthly donations:

Lunary

Biolevate
