🤖 Evaluating the likelihood of data points in an LLM's training set 🔍
Conjecture is a tool designed to evaluate whether specific data points are likely present in a machine learning model's training dataset. It uses various methods to assess model performance and potential data leakage. At the end of the day, this is just 'conjecture'...
- 📝 Data Retrieval: Fetch data from multiple sources, including: Wikipedia, YouTube, File, and Direct Input.
- 🔍 Model Evaluation: Use Attention Pattern Analysis (APA), Membership Inference Attacks (MIA), and Query-Based Data Extraction (QBDE) to estimate the likelihood that the data exists in the model's training dataset.
- 📊 Results Display: View formatted tables of scores and average scores for each assessment category.
- 📈 Data Presence Check: Determine if the given data is likely present in the model's training dataset based on the assessment results.
The recommended hardware setup includes:
- Minimum 16GB of RAM.
- Nvidia GPU with at least 4GB of VRAM for enhanced performance.
Tested on Windows 11. It should also be compatible with Unix-like systems.
For enhanced performance, it is highly recommended to install Nvidia CUDA. Follow the steps below:
- Ensure your Nvidia drivers are up to date: https://www.nvidia.com/en-us/geforce/drivers/
- Install the appropriate dependencies from here: https://pytorch.org/get-started/locally/
- Verify that CUDA is installed correctly by running the following command, which should print a tensor without raising an error:
python -c "import torch; print(torch.rand(2,3).cuda())"
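For a more descriptive check than the one-liner above, a small script can report whether PyTorch sees a GPU and how much VRAM it has. This is a minimal sketch and not part of Conjecture: the function name `describe_cuda` is hypothetical, and it relies only on the standard PyTorch calls `torch.cuda.is_available` and `torch.cuda.get_device_properties`.

```python
import importlib.util


def describe_cuda() -> str:
    """Return a one-line summary of the CUDA setup that PyTorch can see."""
    # Check for PyTorch first so a missing install gives a message, not a traceback.
    if importlib.util.find_spec("torch") is None:
        return "PyTorch is not installed"

    import torch

    if not torch.cuda.is_available():
        return "PyTorch is installed, but no CUDA device is available"

    # Report the first GPU and its total memory in GiB.
    props = torch.cuda.get_device_properties(0)
    vram_gb = props.total_memory / 1024**3
    return f"CUDA device: {props.name} ({vram_gb:.1f} GB VRAM)"


if __name__ == "__main__":
    print(describe_cuda())
```

If the reported VRAM is below the recommended 4GB, the model may still run, but likely more slowly or not at all, depending on the model chosen.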
Clone the repository:
git clone https://github.com/yourusername/conjecture.git
cd conjecture
Install the required Python dependencies:
pip install -r requirements.txt
Install Conjecture:
python -m pip install .
To use Conjecture, run the command with appropriate arguments:
Command-Line Arguments
- --dataset_strings: List of dataset strings to evaluate.
- --file: Path to a file containing newline-separated dataset strings.
- --wikipedia: Wikipedia page name to fetch data from.
- --youtube: YouTube video ID to fetch data from.
- --model_name: Name of the model to evaluate.
- --num_permutations: Number of permutations for generating strings (default: 5).
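To make the argument list concrete, the sketch below shows how a command-line interface with these options could be declared using Python's `argparse`. It is an illustration only, not Conjecture's actual parser: the function name `build_parser`, the help strings, and the choice to make `--model_name` required are all assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (illustrative only)."""
    parser = argparse.ArgumentParser(prog="conjecture")
    parser.add_argument("--dataset_strings", nargs="*", default=[],
                        help="Dataset strings to evaluate")
    parser.add_argument("--file",
                        help="File containing newline-separated dataset strings")
    parser.add_argument("--wikipedia", help="Wikipedia page name")
    parser.add_argument("--youtube", help="YouTube video ID")
    # Assumed to be required, since every documented example passes it.
    parser.add_argument("--model_name", required=True,
                        help="Name of the model to evaluate")
    parser.add_argument("--num_permutations", type=int, default=5,
                        help="Number of permutations for generating strings")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--dataset_strings", "data1", "data2", "--model_name", "some/model"]
    )
    print(args.dataset_strings, args.num_permutations)
```

Using `nargs="*"` lets several quoted strings follow a single `--dataset_strings` flag, which matches the way the documented examples pass them.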
Evaluate dataset strings directly:
conjecture --dataset_strings "data1" "data2" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
Load dataset from a file:
conjecture --file "path/to/dataset.txt" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
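The file passed to the file option is described as newline-separated dataset strings. A minimal sketch of reading such a file is shown below; the helper name `load_dataset_strings` is hypothetical, and skipping blank lines is an assumption about how such input would reasonably be treated, not a statement of what Conjecture does internally.

```python
from pathlib import Path


def load_dataset_strings(path: str) -> list[str]:
    """Read one dataset string per line, ignoring blank lines."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    # Strip surrounding whitespace and drop lines that end up empty.
    return [line.strip() for line in lines if line.strip()]
```

The resulting list has the same shape as the strings given directly on the command line, so both input routes can feed the same evaluation step.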
Fetch data from Wikipedia:
conjecture --wikipedia "Machine_learning" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
Fetch transcript from YouTube:
conjecture --youtube "wHSjrRX_eY4" --model_name "unsloth/mistral-7b-instruct-v0.3-bnb-4bit"
Fetches the transcript of a YouTube video and assesses whether the data is present in the model's training dataset:
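from youtube_transcript_api import YouTubeTranscriptApi
from conjecture.Judge import Judge
YOUTUBE_VIDEO_ID = "wHSjrRX_eY4"
# get_transcript returns a list of segments, each a dict with a "text" key.
# Newer releases of youtube-transcript-api may expose a different interface.
transcript = YouTubeTranscriptApi.get_transcript(YOUTUBE_VIDEO_ID)
# Join the segments into one string so it can be split into sentences.
data = " ".join(segment["text"] for segment in transcript)
judge = Judge("unsloth/mistral-7b-instruct-v0.3-bnb-4bit", data.split("."))
judge.assess()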
Fetches introductory text from a Wikipedia page and assesses whether the data is present in the model's training dataset:
import requests
from conjecture.Judge import Judge
WIKIPEDIA_PAGE_NAME = "YouTube"
response = requests.get(
    'https://en.wikipedia.org/w/api.php',
    params={
        'action': 'query',
        'format': 'json',
        'titles': WIKIPEDIA_PAGE_NAME,
        'prop': 'extracts',
        'exintro': True,
        'explaintext': True,
    }
).json()
page = next(iter(response['query']['pages'].values()))
data = page['extract']
judge = Judge("unsloth/mistral-7b-instruct-v0.3-bnb-4bit", data.split("."))
judge.assess()
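Both examples split the text on periods before handing it to `Judge`. A plain `split(".")` leaves leading spaces on each fragment and usually a trailing empty string, so it may help to clean the fragments first. The helper below is a minimal sketch with a hypothetical name, `split_sentences`; it is still a naive splitter and will break on abbreviations and decimal numbers.

```python
def split_sentences(text: str) -> list[str]:
    """Split on periods, dropping empty and whitespace-only fragments."""
    return [part.strip() for part in text.split(".") if part.strip()]


if __name__ == "__main__":
    sample = "YouTube is a video platform. It launched in 2005."
    print(split_sentences(sample))
    # ['YouTube is a video platform', 'It launched in 2005']
```

The cleaned list can be passed as the second argument to `Judge` in place of `data.split(".")`.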
Conjecture will display the following:
- Tables of Scores: APA, MIA, and QBDE scores along with their averages.
- Data Presence Check: A message indicating whether the data is likely present in the model's training dataset.
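As a rough illustration of how per-category averages could feed a presence decision, the sketch below averages each category's scores and compares the overall mean with a cutoff. Everything here is an assumption made for illustration: the function name `summarize`, scores normalized to the range 0 to 1, the equal weighting of categories, and the 0.5 threshold. Conjecture's actual scoring and decision rule may differ.

```python
from statistics import mean


def summarize(scores: dict[str, list[float]], threshold: float = 0.5):
    """Average each category, then compare the overall mean to a threshold.

    Assumes scores lie in [0, 1]; the threshold value is hypothetical.
    """
    averages = {name: mean(values) for name, values in scores.items()}
    overall = mean(averages.values())
    return averages, overall, overall >= threshold


if __name__ == "__main__":
    # Made-up scores for two strings, used only to exercise the function.
    example = {"APA": [0.25, 0.75], "MIA": [1.0, 0.5], "QBDE": [0.0, 0.5]}
    averages, overall, likely_present = summarize(example)
    print(averages, overall, likely_present)
```

Whatever rule is used, a result of this kind is best read as weak evidence and not as proof that the data was or was not in the training set.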
Conjecture is open-source and welcomes contributions. To contribute:
- Fork the repository on GitHub.
- Create a new branch for your changes.
- Implement and test your changes.
- Submit a pull request with a clear description.
Report bugs or request features by opening an issue on GitHub. Provide detailed information to assist in addressing your concerns.
GNU General Public License v3.0