Authors: Branko Brkljač §, Vladimir Kalušev §, Branislav Popović and Milan Sečujski
Face detection and face recognition have been in the focus of the vision community since its very beginnings. Inspired by the success of the original Videoface digitizer, a pioneering device that allowed users to capture video signals from any source, we have designed an advanced video analytics tool for efficiently creating structured video stories, i.e. identity-based information catalogs. VideoFace2.0 is the name of the developed system for spatial and temporal localization of each unique face in the input video, i.e. face re-identification (ReID), which also allows cataloging and characterization of faces and the creation of structured video outputs for later downstream tasks. The developed near real-time solution is primarily designed for application scenarios involving TV production and media analysis, and as an efficient tool for creating the large video datasets necessary for training machine learning (ML) models in challenging vision tasks such as lip reading and multimodal speech recognition. The conducted experiments confirm the applicability of the proposed face ReID algorithm, which combines the concepts of face detection, face recognition and passive tracking-by-detection in order to achieve robust and efficient face ReID. The system is envisioned as a compact and modular extension of existing video production equipment. The presented results are based on a test implementation that achieves 18-25 fps on a consumer-type notebook. Ablation experiments also confirmed that the proposed algorithm brings a relative reduction in the number of false identities in the range of 73%-93%. We hope that the presented work and shared code will stimulate further interest in the development of similar, application-specific video analysis tools, and lower the entry barrier for the production of high-quality multi-modal ML datasets in the future.
📄 Publication preprint available at: https://arxiv.org/abs/2505.02060
Code: TBA ...
✅ Main characteristics include:
- Near real-time operation at ~18-25 fps (consumer-type notebook with GPU)
- On-line or off-line processing mode with different types of result visualizations
- 🔍 Detailed log file of face identities found by the system, suitable for video cataloging and for spatial-temporal localization of every frame in which the same person appears (see the log-parsing sketch after this list)
- Fast post-production of video stories based on the results of video analysis stored in the corresponding log file: a single run of face ReID produces multiple outputs
- Modular and independent of the specific choice of methods for each of the components in Algorithm 1 (face detection and face recognition models)
- ⚡ Successfully tested on open-set face ReID in open-world indoor and outdoor scenes
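Since the log file is the pivot between a single ReID run and the multiple produced outputs, a minimal parsing sketch is given below. The CSV layout assumed here (one line per detection: frame index, person ID, bounding box coordinates) and the file name are hypothetical; the actual log format written by VideoFace2.0 may differ.

```python
import csv
from collections import defaultdict

def frames_per_identity(log_path):
    """Group all detections in the ReID log by person identity.

    Assumed (hypothetical) log layout per line:
    frame_index,person_id,x1,y1,x2,y2
    """
    catalog = defaultdict(list)
    with open(log_path, newline="") as f:
        for frame_idx, person_id, *bbox in csv.reader(f):
            catalog[person_id].append((int(frame_idx), [float(v) for v in bbox]))
    return catalog

# Hypothetical usage: list identities and count the frames of "person30".
catalog = frames_per_identity("testVideo2_reid_log.csv")
print(sorted(catalog))
print(len(catalog["person30"]))
```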
✅ Main applications include:
- TV production, media analysis and creative industries.
- ⭐ Production of custom video-based datasets for machine learning (ML) tasks involving multi-modal inputs like speech, text and image.
- Automated video analysis and cataloging.
- ⭐ Production of structured video outputs or video stories.
- 🔨 Editing of interviews, reportages, talk shows, podcasts and other formats that include multiple speakers or participants.
- Input testVideo2, which corresponds to a reportage brought by a field reporter to the TV studio, is automatically processed by VideoFace2.0, and the shown video story, corresponding to an unknown person identified by the system as "person30", is created.
- Produced "person30" video story alongside the extracted face and mouth region videos (side-by-side visualization):
- All frames of the original input video in which the open-set face of the selected person appears are identified by the system and mixed together into the shown video story with synchronized audio (a minimal sketch of this cutting step is given below).
- Original testVideo2 reportage: "Vancouver Talks" - by the @impsquared YouTube™ channel.
- The produced video story also includes overlaid visualizations of the face bounding boxes and face landmarks of all other persons present in the same frames in which the selected "person30" appears (other persons identified by the proposed face ReID procedure).
- Produced video story, download link:
  ▶️ testVideo2-->person30_video_story
- *More video examples are available via the corresponding links in the Experimental results section below.
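The "mixed together with synchronized audio" step can be sketched with moviepy (1.x API); this is only an illustration under assumed names, not the authors' implementation: contiguous runs of frames containing the selected identity are cut out of the source video together with their audio track and concatenated.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def cut_story(video_path, frame_indices, out_path):
    """Concatenate all segments of the video in which a person appears.

    frame_indices: sorted, non-empty list of frame numbers taken from the
    ReID log (hypothetical input; the real pipeline may cut differently).
    """
    clip = VideoFileClip(video_path)  # subclips keep their audio
    # Collapse sorted frame indices into contiguous (start, end) runs.
    runs, start, prev = [], frame_indices[0], frame_indices[0]
    for i in frame_indices[1:]:
        if i != prev + 1:
            runs.append((start, prev))
            start = i
        prev = i
    runs.append((start, prev))
    story = concatenate_videoclips(
        [clip.subclip(s / clip.fps, (e + 1) / clip.fps) for s, e in runs]
    )
    story.write_videofile(out_path, audio=True)

# Hypothetical usage:
# cut_story("testVideo2.mp4", person30_frames, "person30_video_story.mp4")
```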
(a) face region video story;
(b) mouth region video story;
(c) face identity mismatch;
(d) ablation experiments on testVideo2 (side-by-side visual comparison of 4 different algorithms);
(e) on-screen presence of all 23 identities found by VideoFace2.0 in testVideo2 in the case of the full Algorithm 1, the proposed face ReID procedure corresponding to the best face ReID result shown in the lower right part of the ablation experiments visualization in (d).
| Number of found identities | exp 1 | exp 2 | exp 3 | exp 4 | true | γ [%] | duration [m:s] |
|---|---|---|---|---|---|---|---|
| testVideo1 | 50 | 42 | 30 | 7 | 4 | 83 | 02:44 |
| testVideo2 | 421 | 378 | 263 | 23 | 13 | 93 | 07:25 |
| testVideo3 | 39 | 37 | 25 | 9 | 6 | 73 | 18:45 |
Table 1 notes:
- exp *i*, *i* = 1..4: ablation experiments.
- γ: relative gain of Algorithm 1 in terms of the number of found identities in comparison to the other experiments (a worked check is given after these notes), calculated as:
  $\gamma = \left(1 - \frac{\exp_4}{\frac{1}{3}\sum_{i=1}^{3}\exp_i}\right) \times 100\%$
- exp 1: detection + recognition;
- exp 2: detection + recognition + passive tracker filtering of new identities;
- exp 3: detection + recognition + passive tracker filtering of new identities + detection confidence score;
- exp 4 (full Algorithm 1): detection + recognition + passive tracker filtering of new identities + detection confidence score + temporal post filtering;
- "true": the expected or true number of unique identities in each video, i.e. the number of distinct faces that the system is expected to find (it does not mean that all of these faces have significant on-screen presence).
- [m:s]: duration in minutes and seconds.
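To make the definition of γ concrete, the testVideo2 row of Table 1 can be recomputed directly (a worked check, not part of the released code):

```python
# Table 1, testVideo2 row: identities found by the four ablation experiments.
exp = {1: 421, 2: 378, 3: 263, 4: 23}

# gamma = (1 - exp_4 / (mean of exp_1..exp_3)) * 100%
gamma = (1 - exp[4] / (sum(exp[i] for i in (1, 2, 3)) / 3)) * 100
print(int(gamma))  # -> 93, the testVideo2 value of gamma in Table 1
```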
Video demonstrations of VideoFace2.0 functionalities are available on the following YouTube™ channel:
- Presented experiments include 3 specific test videos with challenging face ReID situations and scene environments characteristic of the above-mentioned application scenarios.
💡 Below are image previews and individual YouTube™ links for some of the conducted experiments.
• Face ReID results based on full Algorithm 1 with parameters set to:
• Face ReID ablation experiments, side-by-side comparison:
Experiments are numbered 1-4 and consist of (a schematic sketch of these stages follows the list):
1. Upper left: detection + recognition (exp 1)
2. Upper right: detection + recognition + passive tracker filtering of new identities (exp 2)
3. Lower left: detection + recognition + passive tracker filtering of new identities + detection confidence score (exp 3)
4. Lower right: detection + recognition + passive tracker filtering of new identities + detection confidence score + temporal post filtering (proposed full Algorithm 1, exp 4)
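The sketch below illustrates how these stages interact when a detected face does not match any known identity; it paraphrases the ablated components rather than reproducing Algorithm 1 from [1], and the similarity and confidence thresholds are assumed values.

```python
import numpy as np

class Gallery:
    """Open-set identity gallery of L2-normalized face embeddings."""

    def __init__(self, sim_thresh=0.4):  # assumed cosine similarity threshold
        self.embs, self.sim_thresh = [], sim_thresh

    def match(self, emb):
        """Return the best matching identity, or None below the threshold."""
        if not self.embs:
            return None
        sims = [float(e @ emb) for e in self.embs]  # cosine similarity
        best = int(np.argmax(sims))
        return best if sims[best] >= self.sim_thresh else None

    def enroll(self, emb):
        self.embs.append(emb)
        return len(self.embs) - 1  # new person ID

def assign_identity(emb, det_score, overlaps_known_track, gallery,
                    det_thresh=0.7):  # assumed detection confidence gate
    pid = gallery.match(emb)
    if pid is not None:
        return pid                   # exp 1: recognized as a known face
    if overlaps_known_track:
        return None                  # exp 2: passive tracker suppresses a new ID
    if det_score < det_thresh:
        return None                  # exp 3: low-confidence detections cannot enroll
    return gallery.enroll(emb)       # provisional new identity; exp 4 later drops
                                     # identities with too short on-screen presence
```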
• Face video story:
• Mouth region video story:
• Face ReID results based on full Algorithm 1 (with same set of parameters as for testVideo1):
• Face ReID results together with face and mouth region extraction (side-by-side) for the selected person identified as "person30":
The video consists of 3 parts:
1. Left side: Face re-identification (ReID) results.
The video shows all persons that have been identified as present together (in the same frame) with the selected "person30": their bounding boxes, person IDs and face landmark points.
2. Top right: Zoomed-in face image regions for the selected person.
This video part contains face images of "person30" cropped to the face detection bounding box and:
3. Bottom right: Mouth region extraction for the selected person.
The interpretation of this video part is the same as for the face image regions described in point 2 above (a geometric sketch of both crops is given below).
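A possible geometry for the two crops, assuming 5-point landmarks in InsightFace order (left eye, right eye, nose, left and right mouth corner); the margins are illustrative, not the exact values used to produce the videos:

```python
import numpy as np

def crop(img, x1, y1, x2, y2):
    """Clip the crop window to the image and return the region."""
    h, w = img.shape[:2]
    return img[max(0, int(y1)):min(h, int(y2)), max(0, int(x1)):min(w, int(x2))]

def face_and_mouth(img, bbox, kps, mouth_margin=0.6):  # margin is an assumption
    face = crop(img, *bbox)                        # detector bounding box
    l, r = np.asarray(kps[3]), np.asarray(kps[4])  # mouth corner landmarks
    center = (l + r) / 2
    half = (r[0] - l[0]) * (0.5 + mouth_margin)    # square window around the mouth
    mouth = crop(img, center[0] - half, center[1] - half,
                 center[0] + half, center[1] + half)
    return face, mouth
```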
• Face ReID ablation experiments, side-by-side comparison:
• Face ReID results (with same set of parameters as for testVideo1):
• Face re-identification results together with landmark points, face and mouth region extraction (side-by-side) for the selected person identified as "person1":
• Face ReID ablation experiments, side-by-side comparison:
Original testVideo2 and testVideo3 are available at the following links under YouTube™'s "Creative Commons Attribution license (reuse allowed)":
- Original testVideo2: "Vancouver Talks" - by the @impsquared YouTube™ channel.
- Original testVideo3: "Reportaža Superior Velika Plana" - by the @tvpirotpirot8451 YouTube™ channel.
The presented implementation and experimental results are based on pre-trained face detection and face recognition models kindly provided by the InsightFace project (state-of-the-art 2D and 3D face analysis); a minimal usage example follows.
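This sketch shows how detections, landmarks and embeddings are obtained with the InsightFace Python package; the `buffalo_l` model pack and the settings shown are common defaults, not necessarily those used in [1].

```python
import cv2
from insightface.app import FaceAnalysis

app = FaceAnalysis(name="buffalo_l")        # bundled detection + recognition models
app.prepare(ctx_id=0, det_size=(640, 640))  # ctx_id=0 -> first GPU, -1 -> CPU

img = cv2.imread("frame.jpg")               # any video frame
for face in app.get(img):
    print(face.bbox)              # [x1, y1, x2, y2] detection box
    print(face.det_score)         # detection confidence (cf. exp 3 gating)
    print(face.kps)               # 5 facial landmark points
    print(face.normed_embedding)  # L2-normalized recognition embedding
```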
VideoFace2.0 is released under the terms of the MIT License; see the provided LICENSE file.
[1] Brkljač, B., Kalušev, V., Popović, B., Sečujski, M. (2025). Transforming faces into video stories - VideoFace2.0. Preprint submitted to the 14th Mediterranean Conference on Embedded Computing (MECO 2025), Budva, Montenegro, 10-14 June 2025.
```bibtex
@inproceedings{brkljacVideoface2025,
  author    = {Brklja{\v{c}}, Branko and Kalu{\v{s}}ev, Vladimir and Popovi{\'c}, Branislav and Se{\v{c}}ujski, Milan},
  title     = {Transforming faces into video stories - {VideoFace2.0}},
  booktitle = {Preprint submitted to the 14\textsuperscript{th} Mediterranean Conference on Embedded Computing (MECO), Budva, Montenegro},
  volume    = {1},
  pages     = {1--4},
  month     = {10--14 June},
  year      = {2025},
  doi       = {-}
}
```
[2] Brkljač, B., Kalušev, V., Popović, B., Sečujski, M. (2025). Transforming faces into video stories - VideoFace2.0. arXiv preprint arXiv:2505.02060
```bibtex
@misc{brkljac2025transformingfacesvideostories,
  title         = {Transforming faces into video stories - {VideoFace2.0}},
  author        = {Branko Brklja{\v{c}} and Vladimir Kalu{\v{s}}ev and Branislav Popovi{\'c} and Milan Se{\v{c}}ujski},
  year          = {2025},
  eprint        = {2505.02060},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  url           = {https://arxiv.org/abs/2505.02060},
  doi           = {10.48550/arXiv.2505.02060}
}
```