Stars
Language-Agnostic SEntence Representations
Source code for the paper 'Audio Captioning Transformer'
Source code for models described in the paper "AudioCLIP: Extending CLIP to Image, Text and Audio" (https://arxiv.org/abs/2106.13043)
[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
SVIT: Scaling up Visual Instruction Tuning
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
PyTorch implementation of Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities.
PyTorch implementation of "Towards Practical and Efficient Image-to-Speech Captioning with Vision-Language Pre-training and Multi-modal Tokens"
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
Faster Whisper transcription with CTranslate2
This repo lists relevant papers summarized in our survey paper: A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models.
One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models
RobustBench: a standardized adversarial robustness benchmark [NeurIPS 2021 Benchmarks and Datasets Track]
A new adversarial purification method that uses the forward and reverse processes of diffusion models to remove adversarial perturbations.
Adversarial Robustness Toolbox (ART) - Python Library for Machine Learning Security - Evasion, Poisoning, Extraction, Inference - Red and Blue Teams
Prompt Learning for Vision-Language Models (IJCV'22, CVPR'22)
[ECCV 2024] Adversarial Prompt Tuning for Vision-Language Models
Targeted Adversarial Examples on Speech-to-Text systems
Implementation of "Defense against Adversarial Attacks on Audio DeepFake Detection"