Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization

P. Koutras, A. Zlatintsi, E. Iosif, A. Katsamanis, P. Maragos, A. Potamianos

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection, we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words in the movie subtitles. To evaluate the proposed system, we employ a simple non-parametric classifier, k-nearest neighbors (KNN). Detection results are reported on the MovSum database, using objective evaluations against ground truth denoting the perceptually salient events, as well as human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving the comprehension of the semantics.
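The abstract names the Teager-Kaiser Energy Operator as the basis of the auditory features without giving details. As a minimal sketch of the standard discrete form of that operator only, Ψ[x](n) = x(n)² − x(n−1)·x(n+1), the Python below computes it on a toy signal. The function names, the non-overlapping framing and the test tone are illustrative assumptions and are not taken from the paper, whose actual feature pipeline may differ.

```python
import math


def teager_kaiser_energy(x):
    """Standard discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1].

    Returns one value per interior sample, so the output has len(x) - 2 entries.
    """
    return [x[n] * x[n] - x[n - 1] * x[n + 1] for n in range(1, len(x) - 1)]


def frame_mean_energy(x, frame_len):
    """Mean TK energy over non-overlapping frames.

    The framing scheme is an assumption made for illustration only.
    """
    psi = teager_kaiser_energy(x)
    return [
        sum(psi[i:i + frame_len]) / frame_len
        for i in range(0, len(psi) - frame_len + 1, frame_len)
    ]


# Toy input (assumed): a pure tone A*cos(omega*n).
amplitude, omega = 2.0, 0.3
tone = [amplitude * math.cos(omega * n) for n in range(200)]

energy = teager_kaiser_energy(tone)
frames = frame_mean_energy(tone, 50)
```

For a pure tone A·cos(ωn + φ), the operator gives the constant A²·sin²(ω), so the output tracks both amplitude and frequency, which is why it is commonly used as an energy-like measure for audio.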

Original language: English
Title of host publication: 2015 IEEE International Conference on Image Processing, ICIP 2015 - Proceedings
Publisher: IEEE Computer Society
Number of pages: 5
ISBN (Electronic): 9781479983391
Publication status: Published - 9 Dec 2015
Externally published: Yes
Event: IEEE International Conference on Image Processing, ICIP 2015 - Quebec City, Canada
Duration: 27 Sept 2015 to 30 Sept 2015

Publication series

Name: Proceedings - International Conference on Image Processing, ICIP
ISSN (Print): 1522-4880


Conference: IEEE International Conference on Image Processing, ICIP 2015
City: Quebec City


  • affective text analysis
  • audio-visual salient events
  • auditory saliency
  • movie summarization
  • visual saliency

