VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models

This is the code repo for the paper VERA: Explainable Video Anomaly Detection via Verbalized Learning of Vision-Language Models (CVPR 2025). In this repo, we use InternVL2 as the example backbone to illustrate how to use the provided open-source code.

0. Environment and Data

The environment for running VERA depends on the chosen backbone, so please set up the environment required by the backbone you want to run VERA with. For example, this repo uses InternVL2, so we install the environment following the InternVL2 installation instructions (https://internvl.readthedocs.io/en/latest/get_started/installation.html). We also install the PyTorch Lightning package (https://lightning.ai/docs/pytorch/stable/).
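For reference, a minimal setup might look like the following; the environment name vera and the Python version are placeholders, and the backbone dependencies should be installed exactly as described in the InternVL2 installation guide.

conda create -n vera python=3.9 -y
conda activate vera
# install the backbone dependencies following
# https://internvl.readthedocs.io/en/latest/get_started/installation.html
pip install lightning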

For datasets, we evaluate on UCF-Crime and XD-Violence. Please download the original videos from the links provided by the authors. UCF-Crime's project page is https://www.crcv.ucf.edu/projects/real-world/ (we recommend using the Dropbox link) and XD-Violence's project page is https://roc-ng.github.io/XD-Violence/. Please let us know if you cannot download from the official links, and we will be happy to help.

With the downloaded videos, we need to extract frames from the original videos. Please use the code provided in the Preprocessing folder to extract the frames. After that, put the frames in the Data folder.
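For reference, a minimal frame-extraction sketch with OpenCV is shown below; the script in the Preprocessing folder is what we actually use, and the paths and sampling stride here are placeholders.

# Illustrative frame extraction (see the Preprocessing folder for the script we use).
import os
import cv2

def extract_frames(video_path, out_dir, stride=1):
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    idx = saved = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % stride == 0:
            cv2.imwrite(os.path.join(out_dir, f"{saved:06d}.jpg"), frame)
            saved += 1
        idx += 1
    cap.release()
    return saved

# Example (placeholder paths): extract all frames of one video into the Data folder.
# extract_frames("videos/Abuse001_x264.mp4", "Data/UCF_Crime/Abuse001_x264")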

1. Training in VERA

We can run the following script to obtain the optimal guiding questions. The default batch size is 2 and the number of sampled frames is 8. After training, we usually choose the question set with the highest validation accuracy as the guiding questions used in testing (AUC can also be used as the selection metric).

python training.py
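After the runs finish, picking the final guiding questions is a simple argmax over the validation metric. The sketch below is illustrative only: the candidate question sets and accuracy values are hypothetical placeholders for the ones logged by training.py.

# Hypothetical selection of the guiding questions after training.
# Each candidate pairs a verbalized question set with its validation accuracy
# (validation AUC could be used as the metric instead).
candidates = [
    (["question set A, five questions ..."], 0.81),
    (["question set B, five questions ..."], 0.86),
]
best_questions, best_acc = max(candidates, key=lambda c: c[1])
print(f"Selected question set with validation accuracy {best_acc:.2f}")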

For UCF-Crime, the best guiding questions we find are as follows:

1. Are there any people in the video who are not in their typical positions or engaging in activities that are not consistent with their usual behavior?
2. Are there any vehicles in the video that are not in their typical positions or being used in a way that is not consistent with their usual function?
3. Are there any objects in the video that are not in their typical positions or being used in a way that is not consistent with their usual function?
4. Is there any visible damage or unusual movement in the video that indicates an anomaly?
5. Are there any unusual sounds or noises in the video that suggest an anomaly?

For XD-Violence, the best guiding questions we find are as follows:

1. Are there any frames where the video content shows an object or person performing an unusual action that does not fit the usual pattern of the video content?
2. Are there any frames where the video content shows an object or person behaving in a way that is clearly abnormal or unexpected?
3. Are there any frames where the video content shows a clear indication of an anomaly in the behavior of objects or people, such as an object moving in an unexpected way or a person performing an unusual action?
4. Are there any frames where the video content shows a player or object being out of place or not fitting the usual pattern of the video content?
5. Are there any frames where the video content shows a disruption in the normal flow of the video, such as a sudden change in the camera angle or a loss of video quality?

2. Inference in VERA

After finishing training and deciding which guiding questions to use, we need to run inference on the test videos. The related code is in the Inference folder. We put the learned questions into the backbone and run it on the test data to get the initial anomaly score at the segment level. This is Step 1 in VERA's testing (please refer to Sec. 3.3 in our paper). For this step, run the following script after putting the guiding questions in the variable new_qs in the script.

python generate_intial_score.py
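For example, with UCF-Crime the variable new_qs simply holds the five learned questions listed above; the sketch below only shows its shape, and generate_intial_score.py itself handles prompting the backbone and turning its answers into segment-level scores.

# The learned guiding questions for UCF-Crime, as placed in new_qs inside
# generate_intial_score.py (the script handles prompting and scoring).
new_qs = [
    "Are there any people in the video who are not in their typical positions or engaging in activities that are not consistent with their usual behavior?",
    "Are there any vehicles in the video that are not in their typical positions or being used in a way that is not consistent with their usual function?",
    "Are there any objects in the video that are not in their typical positions or being used in a way that is not consistent with their usual function?",
    "Is there any visible damage or unusual movement in the video that indicates an anomaly?",
    "Are there any unusual sounds or noises in the video that suggest an anomaly?",
]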

Note: We have provided the initial scores obtained by VERA (backbone: InternVL2) on UCF-Crime at this link, and the initial scores on XD-Violence at this link.

Next, given the initial scores, we need to run Step 2 and Step 3 introduced in Sec. 3.3 to obtain the frame-level anomaly scores. We put those two steps in one script, so we just need to run this line:

python compute_auc.py
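The reported metric is a frame-level AUC over all test videos; a minimal sketch of that final computation with scikit-learn is shown below, using toy label/score arrays in place of the real concatenated per-frame ground truth and the smoothed scores that compute_auc.py produces in Steps 2 and 3.

# Minimal frame-level AUC sketch (compute_auc.py implements the full Step 2/3 pipeline).
import numpy as np
from sklearn.metrics import roc_auc_score

frame_labels = np.array([0, 0, 1, 1, 0, 1])               # toy 0/1 label per frame
frame_scores = np.array([0.1, 0.2, 0.8, 0.7, 0.3, 0.9])   # toy anomaly score per frame
print("Frame-level AUC:", roc_auc_score(frame_labels, frame_scores))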

Please note that computing the inference scores requires extracting the vision features beforehand. We follow LAVAD (https://github.com/lucazanella/lavad) and provide code that uses ImageBind to extract the vision features of each segment. You can access the extracted features we use for UCF-Crime directly from this link. The extracted features for XD-Violence are provided at this link.
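For reference, a hedged sketch of per-segment feature extraction with ImageBind follows; it is based on the usage example in the ImageBind repository, the frame paths are placeholders, and mean-pooling the frame embeddings into one segment feature is only an illustrative assumption.

# Hedged ImageBind feature-extraction sketch (placeholder frame paths; mean pooling
# over a segment's frames is an assumption made for illustration).
import torch
from imagebind import data
from imagebind.models import imagebind_model
from imagebind.models.imagebind_model import ModalityType

device = "cuda:0" if torch.cuda.is_available() else "cpu"
model = imagebind_model.imagebind_huge(pretrained=True)
model.eval()
model.to(device)

frame_paths = ["Data/UCF_Crime/Abuse001_x264/000000.jpg",
               "Data/UCF_Crime/Abuse001_x264/000016.jpg"]
inputs = {ModalityType.VISION: data.load_and_transform_vision_data(frame_paths, device)}
with torch.no_grad():
    embeddings = model(inputs)[ModalityType.VISION]  # one embedding per frame
segment_feature = embeddings.mean(dim=0)             # pooled feature for the segment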

Note: We also provide the computation of AUC and AP for the XD-Violence dataset, in compute_auc_XD_Violence.py and compute_AP_XD_Violence.py, respectively. For AP computation on XD-Violence, we follow Wu et al. for the preprocessing and postprocessing used to obtain the ground truth and to smooth the scores.
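Analogously, a minimal AP sketch with scikit-learn is shown below using toy arrays; compute_AP_XD_Violence.py adds the Wu et al. preprocessing and postprocessing (ground-truth construction and smoothing) on top of this.

# Minimal frame-level AP sketch for XD-Violence.
import numpy as np
from sklearn.metrics import average_precision_score

frame_labels = np.array([0, 0, 1, 1, 0, 1])               # toy 0/1 label per frame
frame_scores = np.array([0.2, 0.1, 0.9, 0.6, 0.4, 0.8])   # toy anomaly score per frame
print("Frame-level AP:", average_precision_score(frame_labels, frame_scores))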

Acknowledgement

We thank Zanella et al., Lv et al., and Wu et al. for sharing their code.

Citation

If you use this code or find our work helpful, please consider citing:

@article{ye2024vera,
  title={Vera: Explainable video anomaly detection via verbalized learning of vision-language models},
  author={Ye, Muchao and Liu, Weiyang and He, Pan},
  journal={arXiv preprint arXiv:2412.01095},
  year={2024}
}
