TY - GEN
T1 - Predicting audio-visual salient events based on visual, audio and text modalities for movie summarization
AU - Koutras, P.
AU - Zlatintsi, A.
AU - Iosif, E.
AU - Katsamanis, A.
AU - Maragos, P.
AU - Potamianos, A.
N1 - Publisher Copyright:
© 2015 IEEE.
PY - 2015/12/9
Y1 - 2015/12/9
N2 - In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words in the movie subtitles. For the evaluation of the proposed system, we employ an elementary, non-parametric classification technique, namely KNN. Detection results are reported on the MovSum database, using objective evaluations against ground truth denoting the perceptually salient events, as well as human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving the comprehension of the semantics.
AB - In this paper, we present a new and improved synergistic approach to the problem of audio-visual salient event detection and movie summarization based on visual, audio and text modalities. Spatio-temporal visual saliency is estimated through a perceptually inspired frontend based on 3D (space, time) Gabor filters, and frame-wise features are extracted from the saliency volumes. For auditory salient event detection we extract features based on the Teager-Kaiser Energy Operator, while text analysis incorporates part-of-speech tagging and affective modeling of single words in the movie subtitles. For the evaluation of the proposed system, we employ an elementary, non-parametric classification technique, namely KNN. Detection results are reported on the MovSum database, using objective evaluations against ground truth denoting the perceptually salient events, as well as human evaluations of the movie summaries. Our evaluation verifies the appropriateness of the proposed methods compared to our baseline system. Finally, our newly proposed summarization algorithm produces summaries that consist of salient and meaningful events, while also improving the comprehension of the semantics.
KW - affective text analysis
KW - audio-visual salient events
KW - auditory saliency
KW - movie summarization
KW - visual saliency
UR - http://www.scopus.com/inward/record.url?scp=84956615075&partnerID=8YFLogxK
U2 - 10.1109/ICIP.2015.7351630
DO - 10.1109/ICIP.2015.7351630
M3 - Conference contribution
AN - SCOPUS:84956615075
T3 - Proceedings - International Conference on Image Processing, ICIP
SP - 4361
EP - 4365
BT - 2015 IEEE International Conference on Image Processing, ICIP 2015 - Proceedings
PB - IEEE Computer Society
T2 - IEEE International Conference on Image Processing, ICIP 2015
Y2 - 27 September 2015 through 30 September 2015
ER -