Audio-visual content structuring for automatic summarization

These last years, with the advent of sites such as Youtube, Dailymotion or Blip TV, the number of videos available on the Internet has increased considerably. The size and their lack of structure of these collections limit access to the contents. Sum- marization is one way to produce snippets that extract the essential content and present it as concisely as possible.In this work, we focus on extraction methods for video summary, based on au- dio analysis. We treat various scientific problems related to this objective : content extraction, document structuring, definition and estimation of objective function and algorithm extraction.On each of these aspects, we make concrete proposals that are evaluated.On content extraction, we present a fast spoken-term detection. The main no- velty of this approach is that it relies on the construction of a detector based on search terms. We show that this strategy of self-organization of the detector im- proves system robustness, which significantly exceeds the classical approach based on automatic speech recogntion.We then present an acoustic filtering method for automatic speech recognition based on Gaussian mixture models and factor analysis as it was used recently in speaker identification. The originality of our contribution is the use of decomposi- tion by factor analysis for estimating supervised filters in the cepstral domain.We then discuss the issues of structuring video collections. We show that the use of different levels of representation and different sources of information in or- der to characterize the editorial style of a video is principaly based on audio analy- sis, whereas most previous works suggested that the bulk of information on gender was contained in the image. Another contribution concerns the type of discourse identification ; we propose low-level models for detecting spontaneous speech that significantly improve the state of the art for this kind of approaches.The third focus of this work concerns the summary itself. As part of video summarization, we first try, to define what a synthetic view is. Is that what cha- racterizes the whole document, or what a user would remember (by example an emotional or funny moment) ? This issue is discussed and we make some concrete proposals for the definition of objective functions corresponding to three different criteria : salience, expressiveness and significance. We then propose an algorithm for finding the sum of the maximum interest that derives from the one introduced in previous works, based on integer linear programming.

Data and Resources

Additional Info

Field Value
Source https://theses.hal.science/tel-00954238
Author Rouvier, Mickaël
Maintainer CCSD
Last Updated May 6, 2026, 04:27 (UTC)
Created May 6, 2026, 04:27 (UTC)
Identifier NNT: 2011AVIG0192
Language fr
Rights https://about.hal.science/hal-authorisation-v1/
contributor Laboratoire Informatique d'Avignon (LIA) ; Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
creator Rouvier, Mickaël
date 2011-12-05T00:00:00
harvest_object_id 1c37d787-19a7-4361-b9eb-c959fabf5da0
harvest_source_id 3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title test moissonnage SELUNE
metadata_modified 2026-03-31T00:00:00
set_spec type:THESE