Audio-based distributional representations of meaning using a fusion of feature encodings

Giannis Karamanolakis, Elias Iosif, Athanasia Zlatintsi, Aggelos Pikrakis, Alexandros Potamianos

Research output: Contribution to journal › Conference article › peer-review

Abstract

Recently, a "Bag-of-Audio-Words" approach was proposed [1] for the combination of lexical features with audio clips in a multimodal semantic representation, i.e., an Audio Distributional Semantic Model (ADSM). An important step towards the creation of ADSMs is the estimation of the semantic distance between clips in the acoustic space, which is especially challenging given the diversity of audio collections. In this work, we investigate the use of different feature encodings in order to address this challenge following a two-step approach. First, an audio clip is categorized with respect to three classes, namely, music, speech and other. Next, the feature encodings are fused according to the posterior probabilities estimated in the previous step. Using a collection of audio clips annotated with tags we derive a mapping between words and audio clips. Based on this mapping and the proposed audio semantic distance, we construct an ADSM model in order to compute the distance between words (lexical semantic similarity task). The proposed model is shown to significantly outperform (23.6% relative improvement in correlation coefficient) the state-of-the-art results reported in the literature.
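
A minimal, hypothetical sketch (in Python, not the authors' released code) of the pipeline described above: each clip gets class-specific feature encodings that are weighted by the music/speech/other posteriors and combined, words are mapped to the acoustic space by averaging the fused encodings of the clips that carry them as tags, and lexical similarity is then read off as a cosine in that space. All function names, the choice of concatenation as the fusion operation, and the toy random features are illustrative assumptions.

    import numpy as np

    CLASSES = ("music", "speech", "other")

    def fuse_encodings(encodings, posteriors):
        # Weight each class-specific encoding by its classifier posterior
        # and concatenate into a single fused clip representation.
        return np.concatenate([posteriors[c] * encodings[c] for c in CLASSES])

    def cosine_similarity(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def word_vector(word, tag_index, clip_vectors):
        # Map a word to the acoustic space by averaging the fused encodings
        # of all clips whose tags contain that word.
        return np.mean([clip_vectors[c] for c in tag_index[word]], axis=0)

    # Toy usage with random stand-ins for the real encoders and classifier.
    rng = np.random.default_rng(0)
    clip_vectors = {}
    for clip_id in ("clip1", "clip2", "clip3"):
        encodings = {c: rng.normal(size=64) for c in CLASSES}   # per-class encodings
        post = rng.dirichlet(np.ones(len(CLASSES)))              # music/speech/other posteriors
        clip_vectors[clip_id] = fuse_encodings(encodings, dict(zip(CLASSES, post)))

    tag_index = {"rain": ["clip1", "clip2"], "thunder": ["clip2", "clip3"]}
    w1 = word_vector("rain", tag_index, clip_vectors)
    w2 = word_vector("thunder", tag_index, clip_vectors)
    print("lexical semantic similarity:", cosine_similarity(w1, w2))
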

Original language: English
Pages (from-to): 3658-3662
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 08-12-September-2016
Publication status: Published - 2016
Externally published: Yes
Event: 17th Annual Conference of the International Speech Communication Association, INTERSPEECH 2016 - San Francisco, United States
Duration: 8 Sept 2016 – 16 Sept 2016

Keywords

  • Audio representations
  • Bag-of-audio-words
  • Feature space fusion
  • Lexical semantic similarity
