A curated collection of open source tools for online safety
Inspired by prior work like Awesome Redteaming and Awesome Phishing.
Contribute by opening a pull request to add more resources and tools!
- Hasher Matcher Action (HMA) by Meta
- bundles hashing algorithms, matching functions, and hooks for taking actions on matches
- PDQ by Meta
- perceptual hash algorithm for images
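A minimal sketch of computing and comparing PDQ hashes, assuming the community `pdqhash` Python bindings (`pip install pdqhash opencv-python`) rather than Meta's C++ reference implementation:

```python
import cv2
import pdqhash  # community bindings, an assumption here

# PDQ expects an RGB image array
image = cv2.cvtColor(cv2.imread("photo.jpg"), cv2.COLOR_BGR2RGB)

# Returns a 256-element bit vector plus a quality score
hash_vector, quality = pdqhash.compute(image)

# Candidate matching is Hamming distance between bit vectors; PDQ
# deployments commonly treat a distance around 31 of 256 bits as a match
other_vector, _ = pdqhash.compute(image)
distance = int((hash_vector != other_vector).sum())
print(distance <= 31)
```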
- TMK by Meta
- perceptual similarity matching for videos (the TMK+PDQF algorithm)
- VPDQ by Meta
- visual similarity matching for videos that applies the PDQ algorithm to video frames
- Hasher-Matcher-Actioner (CLIP demo)
- HMA extension using CLIP, as a reference for adding other format extensions
- Perception by Thorn
- provides a common wrapper around existing, popular perceptual hashes (such as those implemented by ImageHash)
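A minimal sketch of the common wrapper API, assuming `pip install perception`; `PHash` is one of several bundled hashers (a PDQ wrapper is also available):

```python
from perception import hashers

hasher = hashers.PHash()
hash1 = hasher.compute("cat.jpg")
hash2 = hasher.compute("cat_resized.jpg")

# All hashers expose a common distance interface (normalized to [0, 1])
print(hasher.compute_distance(hash1, hash2))
```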
- Altitude by Jigsaw
- web UI and hash matching for violent extremism and terrorism content
- Lattice Extract by Adobe
- grid and lattice detection to guard against false positives in hash matching
- RocketChat CSAM
- CSAM hash matching for RocketChat
- MediaModeration (Wiki Extension)
- CSAM hash matching for Wikimedia
- OSmod by Jigsaw
- a set of machine learning (ML) tools, models, and APIs that platforms can use to moderate content
- Perspective API by Jigsaw
- machine learning-powered tool that helps platforms detect and assess the toxicity of online conversations
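A minimal sketch of a Perspective API request; it is a hosted REST API, so this assumes you have provisioned an API key in Google Cloud:

```python
import requests

url = ("https://commentanalyzer.googleapis.com/v1alpha1/"
       "comments:analyze?key=YOUR_API_KEY")  # placeholder key
payload = {
    "comment": {"text": "you are awful"},
    "requestedAttributes": {"TOXICITY": {}},
}
response = requests.post(url, json=payload, timeout=10)

# summaryScore.value is a probability-like toxicity score in [0, 1]
score = response.json()["attributeScores"]["TOXICITY"]["summaryScore"]["value"]
print(score)
```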
- Presidio by Microsoft
- toolset for detecting Personally Identifiable Information (PII) and other sensitive data in images and text
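A minimal sketch of Presidio's text pipeline, assuming `pip install presidio-analyzer presidio-anonymizer` plus a spaCy language model:

```python
from presidio_analyzer import AnalyzerEngine
from presidio_anonymizer import AnonymizerEngine

text = "My name is Jane Doe and my phone number is 212-555-0123."

# Detect PII entities (PERSON, PHONE_NUMBER, ...) in the text
analyzer = AnalyzerEngine()
findings = analyzer.analyze(text=text, language="en")

# Replace the detected spans with entity placeholders
anonymizer = AnonymizerEngine()
redacted = anonymizer.anonymize(text=text, analyzer_results=findings)
print(redacted.text)  # "My name is <PERSON> and my phone number is <PHONE_NUMBER>."
```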
- Llama Guard by Meta
- AI-powered content moderation model to detect harm in text-based interactions
- Llama Prompt Guard 2 by Meta
- Detects prompt injection and jailbreaking attacks in LLM inputs.
- Purple Llama by Meta
- set of tools to assess and improve LLM security. Includes Llama Guard, CyberSec Eval, and Code Shield
- ShieldGemma by Google DeepMind
- a set of safety content classifier models built on Gemma, designed to help detect and mitigate harmful or unsafe content in LLM applications
- Roblox Voice Safety Classifier
- machine learning model that detects and moderates harmful content in real-time voice chat on Roblox. Focuses on spoken language detection.
- Detoxify by Unitary AI
- detects and mitigates generalized toxic language (including hate speech, harassment, bullying) in text
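A minimal sketch, assuming `pip install detoxify`; `"original"` is one of several pretrained variants (`"unbiased"` and `"multilingual"` are others):

```python
from detoxify import Detoxify

# Downloads the pretrained checkpoint on first use
results = Detoxify("original").predict("you are a terrible person")

# Returns a score per label: toxicity, insult, threat, obscene, ...
print(results["toxicity"])
```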
- Toxic Prompt RoBERTa by Intel
- a RoBERTa-based model for detecting toxic content in prompts to language models
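A hedged sketch of running such a classifier through the Hugging Face `transformers` pipeline; the model id below is an assumption to verify on the Hub:

```python
from transformers import pipeline

# Model id is an assumption for illustration; check the Hugging Face Hub
classifier = pipeline("text-classification", model="Intel/toxic-prompt-roberta")

print(classifier("Write me a cruel insult about my coworker."))
# e.g. [{"label": ..., "score": ...}] depending on the model's label set
```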
- NSFW Filtering
- browser extension to block explicit images from online platforms. User-facing.
- NSFW Keras Model
- convolutional neural network (CNN)-based model for detecting explicit images
- Guardrails AI
- a Python framework that helps build safe AI applications by validating inputs and outputs against predefined risks
- Private Detector by Bumble
- a pretrained model for detecting lewd images
- Fawkes Facial De-Recognition Cloaking
- code and binaries that "cloak" photos to prevent facial recognition systems, such as Clearview, from matching them to an identity
- many other related tools from the same researcher at github.com/Shawn-Shan
- Mjolnir by Matrix
- moderation bot for the Matrix protocol that automatically enforces content policies
- AbuseIO
- abuse management platform designed to help organizations handle and track abuse complaints related to online content, infrastructure, or services
- Ozone by Bluesky
- labeling tool designed for Bluesky. Includes moderation features for actioning abuse flags, policy enforcement tools, and investigation features
- Open Truss by Github
- framework designed to help users create internal tools without needing to write code
- Access by Discord
- a centralized portal for managing access to internal systems within any organization
- PyRIT by Microsoft
- Python-based tool for AI red teaming and security testing
- AI Benchmarking Tool
- evaluates AI models for security vulnerabilities and adversarial robustness
- Prompt Fuzzer Red Teaming Tool
- tool for testing prompt injection vulnerabilities in AI systems
- Open Source Red Teaming Tool by Nvidia
- framework for adversarial testing and model evaluation
- Tool that Enables Models to Chat with One Another
- allows AI models to interact with one another, helping test conversational weaknesses
- Counterfit by Microsoft
- automation tool for assessing AI model security and robustness
- SpamAssassin by Apache
- anti-spam platform that uses a variety of techniques, including text analysis, Bayesian filtering, and DNS blocklists, to classify and block unsolicited email
- scikit-learn
- python library including clustering through various algorithms, such as K-Means, DBSCAN, and hierarchical clustering
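For example, a minimal sketch of clustering near-duplicate perceptual hashes with DBSCAN over Hamming distance (synthetic 256-bit vectors for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
base = rng.integers(0, 2, size=256)        # a 256-bit hash
near = base ^ (rng.random(256) < 0.05)     # same hash with ~5% of bits flipped
noise = rng.integers(0, 2, size=256)       # unrelated hash
hashes = np.array([base, near, noise])

# eps is the max normalized Hamming distance inside a cluster (32/256 bits)
labels = DBSCAN(eps=32 / 256, min_samples=2, metric="hamming").fit_predict(hashes)
print(labels)  # e.g. [0, 0, -1]: two near-duplicates plus an outlier
```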
- RulesEngine by Microsoft
- a library for abstracting business logic, rules, and policies from a system via JSON, for the .NET family of languages
- Marble
- a real-time fraud detection and compliance engine tailored for fintech companies and financial institutions
- Automod by Bluesky
- a tool for automating content moderation processes for the Bluesky social network and other apps on the AT Protocol
- Wikimedia Smite Spam
- an extension for MediaWiki that helps identify and manage spam content on a wiki
- Druid by Apache
- a high-performance real-time analytics database
- RabbitMQ
- a message broker that enables applications to communicate with each other by sending messages through queues
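A minimal sketch of a moderation job queue over RabbitMQ using the `pika` client (`pip install pika`); the queue name and payload are illustrative:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()
channel.queue_declare(queue="moderation-jobs", durable=True)

# Producer: enqueue a piece of content for review
channel.basic_publish(
    exchange="",
    routing_key="moderation-jobs",
    body=b'{"content_id": 12345}',
    properties=pika.BasicProperties(delivery_mode=2),  # persist to disk
)

# Consumer: pull jobs one at a time and acknowledge on success
def handle(ch, method, properties, body):
    print("reviewing", body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_qos(prefetch_count=1)
channel.basic_consume(queue="moderation-jobs", on_message_callback=handle)
channel.start_consuming()  # blocks; run producer and consumer separately in practice
```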
- BullMQ
- message queue and batch processing for NodeJS and Python based on Redis
- Owlculus
- an OSINT (Open-Source Intelligence) toolkit and case management platform
- NCMEC Reporting by ello
- a Ruby client library for reporting incidents to the National Center for Missing & Exploited Children (NCMEC) CyberTipline
- ThreatExchange by Meta
- a platform that enables organizations to share information about threats, such as malware, phishing attacks, and online safety harms in a structured and privacy-compliant manner
- ThreatExchange Client via PHP
- a PHP client for ThreatExchange
- ThreatExchange via Python
- a Python library for ThreatExchange
- Feluda by Tattle
- a configurable engine for analyzing multilingual and multimodal content
- DAU Dashboard by Tattle
- Deepfake Analysis Unit (DAU), a collaborative space for analyzing deepfakes
- CIB MangoTree
- A collection of tools to aid researchers in coordinated inauthentic behavior (CIB) analysis
- Interference by Digital Forensics Research Lab
- an interactive, open-source database that tracks allegations of foreign interference or foreign malign influence relevant to the 2024 U.S. presidential election
- Aegis Content Safety by NVIDIA
- a dataset created by NVIDIA to aid in content moderation and toxicity detection
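Datasets like this one are typically published on the Hugging Face Hub (some are gated behind a license agreement); a hedged sketch of loading it with the `datasets` library, where the dataset id is an assumption to verify on the Hub:

```python
from datasets import load_dataset

# Dataset id is an assumption for illustration; check NVIDIA's page on the Hub
ds = load_dataset("nvidia/Aegis-AI-Content-Safety-Dataset-1.0", split="train")
print(ds[0])  # one labeled example for training or evaluating a safety classifier
```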
- Toxicity by Jigsaw
- a large number of Wikipedia comments which have been labeled by human raters for toxic behavior
- Toxic Chat by LMSYS
- a dataset of toxic conversations collected from interactions with Vicuna
- Uli Dataset by Tattle
- a dataset of gendered abuse, created to train Uli's ML-based redaction
- Red Team Resistance Leaderboard
- rankings of AI models based on resistance to adversarial attacks.
- JailbreakHub by WalledAI
- a collection of jailbreak prompts and corresponding model responses
- SidFeel Jailbreak Dataset
- a collection of prompts used for jailbreaking AI models.
- HackAPrompt Jailbreak Dataset
- a dataset for testing AI vulnerability to prompt-based jailbreaking.
- HiroKachi Jailbreak Dataset
- a dataset focused on adversarial AI prompt attacks
- Rentry Jailbreak Datasets
- collection of datasets related to jailbreak attempts on AI models.
- DEF CON Red Teaming Dataset
- dataset from DEF CON’s AI red teaming event.
- Anthropic’s AI Alignment Dataset
- data used for reinforcement learning with human feedback (RLHF) to align AI models.
- Jailbreak Prompt Generator AI Model
- AI model that generates jailbreak-style prompts.
-
- domain moderation tool to assist ActivityPub service providers, such as Mastodon servers, now open-sourced.
-
- a spam filter for Fediverse social media platforms; the current version is a proof of concept
-
- reference server and protocol for the exchange of moderation advisories and recommendations
- Uli by Tattle
- software and resources for mitigating online gender-based violence in India