🤖 Evaluating the likelihood of data points in an LLM's training set 🔍

Conjecture is a tool designed to evaluate whether specific data points are likely present in a machine learning model's training dataset. It uses various methods to assess model performance and potential data leakage. At the end of the day, this is just 'conjecture'...

📝 Data Retrieval: Fetch data from multiple sources, including: Wikipedia, YouTube, File, and Direct Input.
🔍 Model Evaluation: Use Attention Pattern Analysis (APA), Membership Inference Attacks (MIA), and Query Based Data Extraction (QBDE) to estimate the likelihood that the data exists in the model's training dataset.
📊 Results Display: View formatted tables of scores and average scores for each assessment category.
📈 Data Presence Check: Determine if the given data is likely present in the model's training dataset based on the assessment results.

⚙️ Setup

System Requirements

The recommended hardware setup includes:

Minimum 16GB of RAM.
Nvidia GPU with at least 4GB of VRAM for enhanced performance.

Tested on Windows 11. Should also be compatible with Unix-like systems.

GPU and CUDA Setup

For enhanced performance, it is highly recommended to install Nvidia CUDA. Follow the steps below:

Ensure your Nvidia drivers are up to date: https://www.nvidia.com/en-us/geforce/drivers/
Install the appropriate dependencies from here: https://pytorch.org/get-started/locally/
Validate that CUDA is installed correctly by running the following command. It should print a tensor and return you to the prompt without errors: python -c "import torch; print(torch.rand(2,3).cuda())"
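
For a more descriptive check than the one-liner above, a short script can report whether PyTorch sees a CUDA device and how much VRAM it has. This is a minimal sketch and not part of Conjecture: the function name cuda_status is made up for illustration, and it relies only on torch.cuda.is_available, torch.cuda.get_device_name, and torch.cuda.get_device_properties.

```python
def cuda_status() -> str:
    """Return a one-line summary of CUDA availability as seen by PyTorch."""
    try:
        import torch
    except ImportError:
        return "PyTorch is not installed"

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is available"

    # Device 0 is the default GPU; total_memory is reported in bytes.
    name = torch.cuda.get_device_name(0)
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return f"CUDA device found: {name} ({vram_gb:.1f} GB VRAM)"


if __name__ == "__main__":
    print(cuda_status())
```

If the summary reports that no CUDA device is available, revisiting the driver and PyTorch installation steps is a reasonable first move.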

Install the required Python dependencies:

pip install -r requirements.txt

Installation

Clone the repository and install Conjecture:

git clone https://github.com/yourusername/conjecture.git
cd conjecture
python -m pip install .

🏃 Running

To use Conjecture, run the conjecture command with the appropriate arguments.

Command-Line Arguments

dataset_strings: List of dataset strings to evaluate.
file: Path to a file containing newline-separated dataset strings.
wikipedia: Wikipedia page name to fetch data from.
youtube: YouTube video ID to fetch data from.
model_name: Name of the model to evaluate.
num_permutations: Number of permutations for generating strings (default: 5).
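
To make the shape of these arguments concrete, the sketch below mirrors them with argparse. It is an illustration only and not Conjecture's actual parser: the helper name build_parser is hypothetical, and details such as which arguments are required or mutually exclusive are assumptions left open here. Only the argument names and the default of 5 for num_permutations come from the list above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser that mirrors the documented arguments (illustrative only)."""
    parser = argparse.ArgumentParser(prog="conjecture")
    # The four data sources; the real tool may validate these differently.
    parser.add_argument("--dataset_strings", nargs="+", default=[])
    parser.add_argument("--file")
    parser.add_argument("--wikipedia")
    parser.add_argument("--youtube")
    parser.add_argument("--model_name")
    parser.add_argument("--num_permutations", type=int, default=5)
    return parser


# Parse a sample argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(
    ["--dataset_strings", "data1", "data2", "--model_name", "example/model"]
)
print(args.dataset_strings, args.num_permutations)
```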

Example Commands

Evaluate dataset strings directly:

conjecture --dataset_strings "data1" "data2" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"

Load dataset from a file:

conjecture --file "path/to/dataset.txt" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
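
The dataset file is expected to hold newline-separated strings. As a rough sketch of what that format implies, the snippet below writes a small file and reads it back into a list. The helper load_dataset_strings is hypothetical, and skipping blank lines is an assumption; Conjecture's own loader may treat them differently.

```python
import tempfile
from pathlib import Path


def load_dataset_strings(path: str) -> list[str]:
    """Read newline-separated strings, skipping blank lines (an assumption)."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


# Write a throwaway dataset file, then load it the way a caller might.
with tempfile.TemporaryDirectory() as tmp:
    dataset_path = Path(tmp) / "dataset.txt"
    dataset_path.write_text("first string\n\nsecond string\n", encoding="utf-8")
    strings = load_dataset_strings(str(dataset_path))

print(strings)
```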

Fetch data from Wikipedia:

conjecture --wikipedia "Machine_learning" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"

Fetch transcript from YouTube:

conjecture --youtube "wHSjrRX_eY4" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"

Programmatic Examples

YouTube Video Example

Fetches the transcript of a YouTube video and assesses whether the data is present in the model's training dataset:

from youtube_transcript_api import YouTubeTranscriptApi
from conjecture.Judge import Judge

YOUTUBE_VIDEO_ID = "wHSjrRX_eY4"
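# get_transcript returns a list of segments, each a dict with a "text" key,
# so join the segment texts into a single string before splitting.
segments = YouTubeTranscriptApi.get_transcript(YOUTUBE_VIDEO_ID)
data = " ".join(segment["text"] for segment in segments)

judge = Judge("unsloth/mistral-7b-instruct-v0.3-bnb-4bit", data.split("."))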
judge.assess()

Wikipedia Example

Fetches introductory text from a Wikipedia page and assesses whether the data is present in the model's training dataset:

import requests
from conjecture.Judge import Judge

WIKIPEDIA_PAGE_NAME = "YouTube"

response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'titles': WIKIPEDIA_PAGE_NAME,
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
    }
).json()

page = next(iter(response['query']['pages'].values()))
data = page['extract']

judge = Judge("unsloth/mistral-7b-instruct-v0.3-bnb-4bit", data.split("."))
judge.assess()
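
Note that str.split(".") leaves an empty or whitespace-only string at the end when the text finishes with a full stop. Whether Judge tolerates such entries is not stated here, so cleaning the list first is a cautious assumption rather than a requirement. A minimal sketch, with split_sentences as a hypothetical helper name:

```python
def split_sentences(text: str) -> list[str]:
    """Split on full stops and drop empty or whitespace-only fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


sample = "YouTube is a video platform. It launched in 2005.\n"
print(split_sentences(sample))
```

The resulting list could then be passed to Judge in place of data.split(".").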

🔍 Output

Conjecture will display the following:

Tables of Scores: APA, MIA, and QBDE scores along with their averages.
Data Presence Check: A message indicating whether the data is likely present in the model's training dataset.
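
To illustrate how per-category averages can feed a presence decision, the sketch below uses made-up numbers. Everything in it is an assumption for illustration: the score values, the 0-to-1 scale, the THRESHOLD constant, and the rule of comparing the overall mean against it. It does not describe how Conjecture itself computes or combines its scores.

```python
from statistics import mean

# Made-up scores for three input strings; real values come from Conjecture.
scores = {
    "APA": [0.25, 0.75, 0.5],
    "MIA": [0.5, 0.5, 0.5],
    "QBDE": [1.0, 0.0, 0.5],
}

THRESHOLD = 0.6  # hypothetical cut-off, not taken from Conjecture

# Average each category, then average the category means.
averages = {category: mean(values) for category, values in scores.items()}
overall = mean(list(averages.values()))

for category, value in averages.items():
    print(f"{category:<5} {value:.2f}")
print(f"Likely present: {overall >= THRESHOLD}")
```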

🙏 Contributions

Conjecture is open-source and welcomes contributions. To contribute:

Fork the repository on GitHub.
Create a new branch for your changes.
Implement and test your changes.
Submit a pull request with a clear description.

🐛 Issues and Feature Requests

Report bugs or request features by opening an issue on GitHub. Provide detailed information to assist in addressing your concerns.

📜 License

GNU General Public License v3.0
