Authors: Branko Brkljač §, Vladimir Kalušev §, Branislav Popović and Milan Sečujski
Face detection and face recognition have been in the focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool for efficiently creating structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID), which also allows cataloging and characterization of faces and the creation of structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition and passive tracking-by-detection in order to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. The presented results are based on a test implementation that achieves 18-25 fps on a consumer-type notebook. Ablation experiments also confirmed that the proposed algorithm brings a relative reduction in the number of false identities in the range of 73%-93%. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
📄 Publication preprint available at: https://arxiv.org/abs/2505.02060
Code: TBA ...
✅ Main characteristics include:
- Near real-time operation at ~18-25 fps (consumer-type notebook with GPU)
- On-line or off-line processing mode with different types of result visualizations
- 🔍 Detailed log file of face identities found by the system, suitable for video cataloging and for spatial-temporal localization of every frame in which the same person appears (see the log-parsing sketch after this list)
- Fast post-production of video stories based on the results of video analysis stored in the corresponding log file: a single run of face ReID produces multiple outputs
- Modular and independent of the specific choice of methods for each of the components in Algorithm 1 (face detection and face recognition models)
- ⚡ Successfully tested on open-set face ReID in open-world indoor and outdoor scenes
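Since the log file is the pivot between a single ReID run and the multiple produced outputs, a minimal parsing sketch is given below. The CSV layout assumed here (one line per detection: frame index, person ID, bounding box coordinates) and the file name are hypothetical; the actual log format written by VideoFace2.0 may differ.

```python
import csv
from collections import defaultdict

def frames_per_identity(log_path):
    """Group all detections in the ReID log by person identity.

    Assumed (hypothetical) log layout per line:
    frame_index,person_id,x1,y1,x2,y2
    """
    catalog = defaultdict(list)
    with open(log_path, newline="") as f:
        for frame_idx, person_id, *bbox in csv.reader(f):
            catalog[person_id].append((int(frame_idx), [float(v) for v in bbox]))
    return catalog

# Hypothetical usage: list identities and count the frames of "person30".
catalog = frames_per_identity("testVideo2_reid_log.csv")
print(sorted(catalog))
print(len(catalog["person30"]))
```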
✅ Main applications include:
- TV production, media analysis and creative industries.
- ⭐ Production of custom video-based datasets for machine learning (ML) tasks involving multi-modal inputs like speech, text and image.
- Automated video analysis and cataloging.
- ⭐ Production of structured video outputs or video stories.
- 🔨 Editing of interviews, reportages, talk shows, podcasts and other formats that include multiple speakers or participants.
- Input testVideo2, which corresponds to a reportage brought by a field reporter to the TV studio, is automatically processed by VideoFace2.0, and the shown video story, corresponding to an unknown person identified by the system as "person30", is created.
- Produced "person30" video story alongside the extracted face and mouth region videos (side-by-side visualization):
- All frames of the original input video in which the open-set face of the selected person appears are identified by the system and mixed together into the shown video story with synchronized audio (a minimal sketch of this cutting step is given below).
- Original testVideo2 reportage: "Vancouver Talks" - by the @impsquared YouTube™ channel.
- The produced video story also includes overlaid visualizations of the face bounding boxes and face landmarks of all other persons present in the same frames in which the selected "person30" appears (other persons identified by the proposed face ReID procedure).
- Produced video story, download link:
  ▶️ testVideo2-->person30_video_story
- *More video examples are available via the corresponding links in the Experimental results section below.
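The "mixed together with synchronized audio" step can be sketched with moviepy (1.x API); this is only an illustration under assumed names, not the authors' implementation: contiguous runs of frames containing the selected identity are cut out of the source video together with their audio track and concatenated.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def cut_story(video_path, frame_indices, out_path):
    """Concatenate all segments of the video in which a person appears.

    frame_indices: sorted, non-empty list of frame numbers taken from the
    ReID log (hypothetical input; the real pipeline may cut differently).
    """
    clip = VideoFileClip(video_path)  # subclips keep their audio
    # Collapse sorted frame indices into contiguous (start, end) runs.
    runs, start, prev = [], frame_indices[0], frame_indices[0]
    for i in frame_indices[1:]:
        if i != prev + 1:
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))
    story = concatenate_videoclips(
        [clip.subclip(s / clip.fps, (e + 1) / clip.fps) for s, e in runs]
    )
    story.write_videofile(out_path, audio=True)

# Hypothetical usage:
# cut_story("testVideo2.mp4", person30_frames, "person30_video_story.mp4")
```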
(a) face region video story;
(b) mouth region video story;
(c) face identity mismatch;
(d) ablation experiments on testVideo2 (side-by-side visual comparison of 4 different algorithms);
(e) on-screen presence of all 23 identities found by VideoFace2.0 in testVideo2 in the case of the full Algorithm 1, the proposed face ReID procedure corresponding to the best face ReID result shown in the lower right part of the ablation experiments visualization in (d).
| Number of found identities | exp 1 | exp 2 | exp 3 | exp 4 | true | γ [%] | duration [m:s] |
|---|---|---|---|---|---|---|---|
| testVideo1 | 50 | 42 | 30 | 7 | 4 | 83 | 02:44 |
| testVideo2 | 421 | 378 | 263 | 23 | 13 | 93 | 07:25 |
| testVideo3 | 39 | 37 | 25 | 9 | 6 | 73 | 18:45 |
Table 1 notes:
- exp *i*, *i* = 1..4: ablation experiments.
- γ: relative gain of Algorithm 1 in terms of the number of found identities in comparison to the other experiments (a worked check is given after these notes), calculated as:
  $\gamma = \left(1 - \frac{\exp_4}{\frac{1}{3}\sum_{i=1}^{3}\exp_i}\right) \times 100\%$
- exp 1: detection + recognition;
- exp 2: detection + recognition + passive tracker filtering of new identities;
- exp 3: detection + recognition + passive tracker filtering of new identities + detection confidence score;
- exp 4 (full Algorithm 1): detection + recognition + passive tracker filtering of new identities + detection confidence score + temporal post filtering;
- "true": the expected or true number of unique identities in each video, i.e. the number of distinct faces that the system is expected to find (it does not mean that all of these faces have significant on-screen presence).
- [m:s]: duration in minutes and seconds.
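To make the definition of γ concrete, the testVideo2 row of Table 1 can be recomputed directly (a worked check, not part of the released code):

```python
# Table 1, testVideo2 row: identities found by the four ablation experiments.
exp = {1: 421, 2: 378, 3: 263, 4: 23}

# gamma = (1 - exp_4 / (mean of exp_1..exp_3)) * 100%
gamma = (1 - exp[4] / (sum(exp[i] for i in (1, 2, 3)) / 3)) * 100
print(int(gamma))  # -> 93, the testVideo2 value of gamma in Table 1
```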
Video demonstrations of VideoFace2.0 functionalities are available on the following YouTube™ channel:
- Presented experiments include 3 specific test videos with challenging face ReID situations and scene environments characteristic of the above-mentioned application scenarios.
💡 Below are image previews and individual YouTube™ links for some of the conducted experiments.
• Face ReID results based on full Algorithm 1 with parameters set to:
• Face ReID ablation experiments, side-by-side comparison:
Experiments are numbered 1-4 and consist of (a schematic sketch of these stages follows the list):
1. Upper left: detection + recognition (exp 1)
2. Upper right: detection + recognition + passive tracker filtering of new identities (exp 2)
3. Lower left: detection + recognition + passive tracker filtering of new identities + detection confidence score (exp 3)
4. Lower right: detection + recognition + passive tracker filtering of new identities + detection confidence score + temporal post filtering (proposed full Algorithm 1, exp 4)
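The sketch below illustrates how these stages interact when a detected face does not match any known identity; it paraphrases the ablated components rather than reproducing Algorithm 1 from [1], and the similarity and confidence thresholds are assumed values.

```python
import numpy as np

class Gallery:
    """Open-set identity gallery of L2-normalized face embeddings."""

    def __init__(self, sim_thresh=0.4):  # assumed cosine similarity threshold
        self.embs, self.sim_thresh = [], sim_thresh

    def match(self, emb):
        """Return the best matching identity, or None below the threshold."""
        if not self.embs:
            return None
        sims = [float(e @ emb) for e in self.embs]  # cosine similarity
        best = int(np.argmax(sims))
        return best if sims[best] >= self.sim_thresh else None

    def enroll(self, emb):
        self.embs.append(emb)
        return len(self.embs) - 1  # new person ID

def assign_identity(emb, det_score, overlaps_known_track, gallery,
                    det_thresh=0.7):  # assumed detection confidence gate
    pid = gallery.match(emb)
    if pid is not None:
        return pid                   # exp 1: recognized as a known face
    if overlaps_known_track:
        return None                  # exp 2: passive tracker suppresses a new ID
    if det_score < det_thresh:
        return None                  # exp 3: low-confidence detections cannot enroll
    return gallery.enroll(emb)       # provisional new identity; exp 4 later drops
                                     # identities with too short on-screen presence
```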
• Face video story:
• Mouth region video story:
• Face ReID results based on full Algorithm 1 (with same set of parameters as for testVideo1):
• Face ReID results together with face and mouth region extraction (side-by-side) for the selected person identified as "person30":
The video consists of 3 parts:
1. Left side: Face re-identification (ReID) results.
The video shows all persons that have been identified as present together (in the same frame) with the selected "person30": their bounding boxes, person IDs and face landmark points.
2. Top right: Zoomed-in face image regions for the selected person.
This video part contains face images of "person30" cropped to the face detection bounding box and:
3. Bottom right: Mouth region extraction for the selected person.
The interpretation of this video part is the same as for the face image regions described in point 2 above (a geometric sketch of both crops is given below).
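A possible geometry for the two crops, assuming 5-point landmarks in InsightFace order (left eye, right eye, nose, left and right mouth corner); the margins are illustrative, not the exact values used to produce the videos:

```python
import numpy as np

def crop(img, x1, y1, x2, y2):
    """Clip the crop window to the image and return the region."""
    h, w = img.shape[:2]
    return img[max(0, int(y1)):min(h, int(y2)), max(0, int(x1)):min(w, int(x2))]

def face_and_mouth(img, bbox, kps, mouth_margin=0.6):  # margin is an assumption
    face = crop(img, *bbox)                        # detector bounding box
    l, r = np.asarray(kps[3]), np.asarray(kps[4])  # mouth corner landmarks
    center = (l + r) / 2
    half = (r[0] - l[0]) * (0.5 + mouth_margin)    # square window around the mouth
    mouth = crop(img, center[0] - half, center[1] - half,
                 center[0] + half, center[1] + half)
    return face, mouth
```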
• Face ReID ablation experiments, side-by-side comparison:
• Face ReID results (with same set of parameters as for testVideo1):
• Face re-identification results together with landmark points, face and mouth region extraction (side-by-side) for the selected person identified as "person1":
• Face ReID ablation experiments, side-by-side comparison:
Original testVideo2 and testVideo3 are available at the following links under YouTube™'s "Creative Commons Attribution license (reuse allowed)":
- Original testVideo2: "Vancouver Talks" - by the @impsquared YouTube™ channel.
- Original testVideo3: "Reportaža Superior Velika Plana" - by the @tvpirotpirot8451 YouTube™ channel.
The presented implementation and experimental results are based on pre-trained face detection and face recognition models kindly provided by the InsightFace project (state-of-the-art 2D and 3D face analysis); a minimal usage example follows.
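This sketch shows how detections, landmarks and embeddings are obtained with the InsightFace Python package; the `buffalo_l` model pack and the settings shown are common defaults, not necessarily those used in [1].

```python
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # bundled detection + recognition models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread("frame.jpg")               # any video frame
for face in app.get(img):
    print(face.bbox)              # [x1, y1, x2, y2] detection box
    print(face.det_score)         # detection confidence (cf. exp 3 gating)
    print(face.kps)               # 5 facial landmark points
    print(face.normed_embedding)  # L2-normalized recognition embedding
```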
VideoFace2.0 is released under the terms of the MIT License; see the provided LICENSE file.
[1] Brkljač, B., Kalušev, V., Popović, B., Sečujski, M. (2025). Transforming faces into video stories - VideoFace2.0. Preprint submitted to the 14th Mediterranean Conference on Embedded Computing (MECO 2025), Budva, Montenegro, 10-14 June 2025.
```bibtex
@inproceedings{brkljacVideoface2025,
  author    = {Brklja{\v{c}}, Branko and Kalu{\v{s}}ev, Vladimir and Popovi{\'c}, Branislav and Se{\v{c}}ujski, Milan},
  title     = {Transforming faces into video stories - {VideoFace2.0}},
  booktitle = {Preprint submitted to the 14\textsuperscript{th} Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro},
  volume    = {1},
  pages     = {1--4},
  month     = {10--14 June},
  year      = {2025},
  doi       = {-}
}
```
[2] Brkljač, B., Kalušev, V., Popović, B., Sečujski, M. (2025). Transforming faces into video stories - VideoFace2.0. arXiv preprint arXiv:2505.02060
```bibtex
@misc{brkljac2025transformingfacesvideostories,
  title         = {Transforming faces into video stories - {VideoFace2.0}},
  author        = {Branko Brklja{\v{c}} and Vladimir Kalu{\v{s}}ev and Branislav Popovi{\'c} and Milan Se{\v{c}}ujski},
  year          = {2025},
  eprint        = {2505.02060},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2505.02060},
  doi           = {10.48550/arXiv.2505.02060}
}
```