Stars
🔥🔥🔥 Latest papers, code, and datasets on Video LLMs (Vid-LLMs).
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
DilatedToothSegNet: Tooth Segmentation Network on 3D Dental Meshes Through Increasing Receptive Vision
[NeurIPS 2024] An official implementation of ShareGPT4Video: Improving Video Understanding and Generation with Better Captions
WebRTC/RTSP/RTMP/HTTP/HLS/HTTP-FLV/WebSocket-FLV/HTTP-TS/HTTP-fMP4/WebSocket-TS/WebSocket-fMP4/GB28181/SRT server and client framework based on C++11
WEB VIDEO PLATFORM is a network video platform implementing the GB28181-2016 standard. It supports NAT traversal and access from IPC, NVR, and DVR devices by Hikvision, Dahua, Uniview, and other brands. It supports national-standard (GB) cascading, forwarding rtsp/rtmp and other video streams to GB platforms, and forwarding rtsp/rtmp pushed streams to GB platforms.
[CVPR 2024] Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data. Foundation Model for Monocular Depth Estimation
Strong, open-source foundation models for image recognition.
End-to-End Dense Video Captioning with Parallel Decoding (ICCV 2021)
Multi-model video-to-text by combining embeddings from Flan-T5 + CLIP + Whisper + SceneGraph. The 'backbone LLM' is pre-trained from scratch on YouTube (YT-1B dataset).
Scenic: A Jax Library for Computer Vision Research and Beyond
LLM UI with advanced features, easy setup, and multiple backend support.
LAVIS - A One-stop Library for Language-Vision Intelligence
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
arctanbell / LLaVA
Forked from haotian-liu/LLaVA. Large Language-and-Vision Assistant built towards multimodal GPT-4 level capabilities.
A framework for training and evaluating AI models on a variety of openly available dialogue datasets.
The largest collection of PyTorch image encoders / backbones. Including train, eval, inference, export scripts, and pretrained weights -- ResNet, ResNeXT, EfficientNet, NFNet, Vision Transformer (V…
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools, so that you can focus on what matters.
Chinese LLaMA & Alpaca large language models, with local CPU/GPU training and deployment (Chinese LLaMA & Alpaca LLMs)
[IEEE T-PAMI 2023] Awesome BEV perception research and cookbook for audiences at all levels in autonomous driving
[ECCV 2022] This is the official implementation of BEVFormer, a camera-only framework for autonomous driving perception, e.g., 3D object detection and semantic map segmentation.
Object detection and instance segmentation toolkit based on PaddlePaddle.
The repository containing tools and information about the WoodScape dataset.