A curated collection of open source tools for online safety
Inspired by prior work like Awesome Redteaming and Awesome Phishing.
Contribute by opening a pull request to add more resources and tools!
- Hasher Matcher Action (HMA) by Meta
- bundles hashing algorithms, matching functions, and hooks for taking actions on matches
- PDQ by Meta
- perceptual hash algorithm for images
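A minimal sketch of computing and comparing PDQ hashes, assuming the community `pdqhash` Python bindings (`pip install pdqhash opencv-python`) rather than Meta's C++ reference implementation:

```python
import cv2
import pdqhash  # community bindings, an assumption here

# PDQ expects an RGB image array
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# Returns a 256-element bit vector plus a quality score
hash_vector, quality = pdqhash.compute(image)

# Candidate matching is Hamming distance between bit vectors; PDQ
# deployments commonly treat a distance around 31 of 256 bits as a match
other_vector, _ = pdqhash.compute(image)
distance = int((hash_vector != other_vector).sum())
print(distance <= 31)
```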
- TMK by Meta
- perceptual similarity matching for videos (the TMK+PDQF algorithm)
- VPDQ by Meta
- visual similarity matching for videos that applies the PDQ algorithm to video frames
- Hasher-Matcher-Actioner (CLIP demo)
- HMA extension using CLIP, as a reference for adding other format extensions
- Perception by Thorn
- provides a common wrapper around existing, popular perceptual hashes (such as those implemented by ImageHash)
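A minimal sketch of the common wrapper API, assuming `pip install perception`; `PHash` is one of several bundled hashers (a PDQ wrapper is also available):

```python
from perception import hashers

hasher = hashers.PHash()
hash1 = hasher.compute("cat.jpg")
hash2 = hasher.compute("cat_resized.jpg")

# All hashers expose a common distance interface (normalized to [0, 1])
print(hasher.compute_distance(hash1, hash2))
```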
- Altitude by Jigsaw
- web UI and hash matching for violent extremism and terrorism content
- Lattice Extract by Adobe
- grid and lattice detection to guard against false positives in hash matching
- RocketChat CSAM
- CSAM hash matching for RocketChat
- MediaModeration (Wiki Extension)
- CSAM hash matching for Wikimedia
- OSmod by Jigsaw
- a set of machine learning (ML) tools, models, and APIs that platforms can use to moderate content
- Perspective API by Jigsaw
- machine learning-powered tool that helps platforms detect and assess the toxicity of online conversations
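A minimal sketch of a Perspective API request; it is a hosted REST API, so this assumes you have provisioned an API key in Google Cloud:

```python
import requests

url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=YOUR_API_KEY")  # placeholder key
payload = {
    "comment": {"text": "you are awful"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(url, json=payload, timeout=10)

# summaryScore.value is a probability-like toxicity score in [0, 1]
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(score)
```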
- Presidio by Microsoft
- toolset for detecting Personally Identifiable Information (PII) and other sensitive data in images and text
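A minimal sketch of Presidio's text pipeline, assuming `pip install presidio-analyzer presidio-anonymizer` plus a spaCy language model:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 212-555-0123."

# Detect PII entities (PERSON, PHONE_NUMBER, ...) in the text
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Replace the detected spans with entity placeholders
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```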
- Llama Guard by Meta
- AI-powered content moderation model to detect harm in text-based interactions
- Llama Prompt Guard 2 by Meta
- Detects prompt injection and jailbreaking attacks in LLM inputs.
- Purple Llama by Meta
- set of tools to assess and improve LLM security. Includes Llama Guard, CyberSec Eval, and Code Shield
- ShieldGemma by Google DeepMind
- a set of safety content classifier models built on Gemma, designed to help detect and mitigate harmful or unsafe content in LLM applications
- Roblox Voice Safety Classifier
- machine learning model that detects and moderates harmful content in real-time voice chat on Roblox. Focuses on spoken language detection.
- Detoxify by Unitary AI
- detects and mitigates generalized toxic language (including hate speech, harassment, bullying) in text
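A minimal sketch, assuming `pip install detoxify`; `"original"` is one of several pretrained variants (`"unbiased"` and `"multilingual"` are others):

```python
from detoxify import Detoxify

# Downloads the pretrained checkpoint on first use
results = Detoxify("original").predict("you are a terrible person")

# Returns a score per label: toxicity, insult, threat, obscene, ...
print(results["toxicity"])
```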
- Toxic Prompt RoBERTa by Intel
- a RoBERTa-based model for detecting toxic content in prompts to language models
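A hedged sketch of running such a classifier through the Hugging Face `transformers` pipeline; the model id below is an assumption to verify on the Hub:

```python
from transformers import pipeline

# Model id is an assumption for illustration; check the Hugging Face Hub
classifier = pipeline("text-classification", model="Intel/toxic-prompt-roberta")

print(classifier("Write me a cruel insult about my coworker."))
# e.g. [{"label": ..., "score": ...}] depending on the model's label set
```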
- NSFW Filtering
- browser extension to block explicit images from online platforms. User-facing.
- NSFW Keras Model
- convolutional neural network (CNN)-based model for detecting explicit images
- Guardrails AI
- a Python framework that helps build safe AI applications by validating inputs and outputs against predefined risks
- Private Detector by Bumble
- a pretrained model for detecting lewd images
- Fawkes Facial De-Recognition Cloaking
- code and binaries that "cloak" photos to prevent facial recognition systems, such as Clearview, from matching them to an identity
- many other related tools from the same researcher at github.com/Shawn-Shan
- Mjolnir by Matrix
- moderation bot for the Matrix protocol that automatically enforces content policies
- AbuseIO
- abuse management platform designed to help organizations handle and track abuse complaints related to online content, infrastructure, or services
- Ozone by Bluesky
- labeling tool designed for Bluesky. Includes moderation features for actioning abuse flags, policy enforcement tools, and investigation features
- Open Truss by Github
- framework designed to help users create internal tools without needing to write code
- Access by Discord
- a centralized portal for managing access to internal systems within any organization
- PyRIT by Microsoft
- Python-based tool for AI red teaming and security testing
- AI Benchmarking Tool
- evaluates AI models for security vulnerabilities and adversarial robustness
- Prompt Fuzzer Red Teaming Tool
- tool for testing prompt injection vulnerabilities in AI systems
- Open Source Red Teaming Tool by Nvidia
- framework for adversarial testing and model evaluation
- Tool that Enables Models to Chat with One Another
- allows AI models to interact with one another, helping test conversational weaknesses
- Counterfit by Microsoft
- automation tool for assessing AI model security and robustness
- SpamAssassin by Apache
- anti-spam platform that uses a variety of techniques, including text analysis, Bayesian filtering, and DNS blocklists, to classify and block unsolicited email
- scikit-learn
- python library including clustering through various algorithms, such as K-Means, DBSCAN, and hierarchical clustering
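For example, a minimal sketch of clustering near-duplicate perceptual hashes with DBSCAN over Hamming distance (synthetic 256-bit vectors for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=256)        # a 256-bit hash
near = base ^ (rng.random(256) < 0.05)     # same hash with ~5% of bits flipped
noise = rng.integers(0, 2, size=256)       # unrelated hash
hashes = np.array([base, near, noise])

# eps is the max normalized Hamming distance inside a cluster (32/256 bits)
labels = DBSCAN(eps=32 / 256, min_samples=2, metric="hamming").fit_predict(hashes)
print(labels)  # e.g. [0, 0, -1]: two near-duplicates plus an outlier
```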
- RulesEngine by Microsoft
- a library for abstracting business logic, rules, and policies from a system via JSON, for the .NET family of languages
- Marble
- a real-time fraud detection and compliance engine tailored for fintech companies and financial institutions
- Automod by Bluesky
- a tool for automating content moderation processes for the Bluesky social network and other apps on the AT Protocol
- Wikimedia Smite Spam
- an extension for MediaWiki that helps identify and manage spam content on a wiki
- Druid by Apache
- a high-performance real-time analytics database
- RabbitMQ
- a message broker that enables applications to communicate with each other by sending messages through queues
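A minimal sketch of a moderation job queue over RabbitMQ using the `pika` client (`pip install pika`); the queue name and payload are illustrative:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="moderation-jobs", durable=True)

# Producer: enqueue a piece of content for review
channel.basic_publish(
    exchange="",
    routing_key="moderation-jobs",
    body=b'{"content_id": 12345}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)

# Consumer: pull jobs one at a time and acknowledge on success
def handle(ch, method, properties, body):
    print("reviewing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="moderation-jobs", on_message_callback=handle)
channel.start_consuming()  # blocks; run producer and consumer separately in practice
```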
- BullMQ
- message queue and batch processing for NodeJS and Python based on Redis
- Owlculus
- an OSINT (Open-Source Intelligence) toolkit and case management platform
- NCMEC Reporting by ello
- a Ruby client library for reporting incidents to the National Center for Missing & Exploited Children (NCMEC) CyberTipline
- ThreatExchange by Meta
- a platform that enables organizations to share information about threats, such as malware, phishing attacks, and online safety harms in a structured and privacy-compliant manner
- ThreatExchange Client via PHP
- a PHP client for ThreatExchange
- ThreatExchange via Python
- a Python library for ThreatExchange
- Feluda by Tattle
- a configurable engine for analyzing multilingual and multimodal content
- DAU Dashboard by Tattle
- Deepfake Analysis Unit (DAU), a collaborative space for analyzing deepfakes
- CIB MangoTree
- A collection of tools to aid researchers in coordinated inauthentic behavior (CIB) analysis
- Interference by Digital Forensics Research Lab
- an interactive, open-source database that tracks allegations of foreign interference or foreign malign influence relevant to the 2024 U.S. presidential election
- Aegis Content Safety by NVIDIA
- a dataset created by NVIDIA to aid in content moderation and toxicity detection
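Datasets like this one are typically published on the Hugging Face Hub (some are gated behind a license agreement); a hedged sketch of loading it with the `datasets` library, where the dataset id is an assumption to verify on the Hub:

```python
from datasets import load_dataset

# Dataset id is an assumption for illustration; check NVIDIA's page on the Hub
ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0", split="train")
print(ds[0])  # one labeled example for training or evaluating a safety classifier
```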
- Toxicity by Jigsaw
- a large number of Wikipedia comments which have been labeled by human raters for toxic behavior
- Toxic Chat by LMSYS
- a dataset of toxic conversations collected from interactions with Vicuna
- Uli Dataset by Tattle
- a dataset of gendered abuse, created to train Uli's ML-based redaction
- Red Team Resistance Leaderboard
- rankings of AI models based on resistance to adversarial attacks.
- JailbreakHub by WalledAI
- a collection of jailbreak prompts and corresponding model responses
- SidFeel Jailbreak Dataset
- a collection of prompts used for jailbreaking AI models.
- HackAPrompt Jailbreak Dataset
- a dataset for testing AI vulnerability to prompt-based jailbreaking.
- HiroKachi Jailbreak Dataset
- a dataset focused on adversarial AI prompt attacks
- Rentry Jailbreak Datasets
- collection of datasets related to jailbreak attempts on AI models.
- DEF CON Red Teaming Dataset
- dataset from DEF CON’s AI red teaming event.
- Anthropic’s AI Alignment Dataset
- data used for reinforcement learning with human feedback (RLHF) to align AI models.
- Jailbreak Prompt Generator AI Model
- AI model that generates jailbreak-style prompts.
-
- domain moderation tool to assist ActivityPub service providers, such as Mastodon servers, now open-sourced.
-
- a spam filter for Fediverse social media platforms; the current version is a proof of concept
-
- reference server and protocol for the exchange of moderation advisories and recommendations
- Uli by Tattle
- software and resources for mitigating online gender-based violence in India