Stars
🔥🔥🔥 Latest papers, code, and datasets on Video LLMs (Vid-LLMs).
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
DilatedToothSegNet: Tooth Segmentation Network on 3D Dental Meshes Through Increasing Receptive Vision
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
WebRTC/RTSP/RTMP/HTTP/HLS/HTTP-FLV/WebSocket-FLV/HTTP-TS/HTTP-fMP4/WebSocket-TS/WebSocket-fMP4/GB28181/SRT server and client framework based on C++11
WEB VIDEO PLATFORM is a network video platform implementing the GB28181-2016 standard. It supports NAT traversal and access from IPC, NVR, and DVR devices by Hikvision, Dahua, Uniview, and other brands. It supports national-standard (GB) cascading, forwarding rtsp/rtmp and other video streams to GB platforms, and forwarding rtsp/rtmp pushed streams to GB platforms.
[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
Strong, open-source foundation models for image recognition.
End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
Scenic: A Jax Library for Computer Vision Research and Beyond
LLM UI with advanced features, easy setup, and multiple backend support.
LAVIS - A One-stop Library for Language-Vision Intelligence
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
arctanbell / LLaVA
Forked from haotian-liu/LLaVA. Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities.
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (V…
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
[IEEE T-PAMI 2023] Awesome BEV perception research and cookbook for audiences at all levels in autonomous driving
[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
Object detection and instance segmentation toolkit based on PaddlePaddle.
The repository containing tools and information about the WoodScape dataset.