Audio-visual content structuring for automatic summarization

In recent years, with the advent of sites such as YouTube, Dailymotion or Blip TV, the number of videos available on the Internet has increased considerably. The size of these collections and their lack of structure limit access to their contents. Summarization is one way to produce snippets that extract the essential content and present it as concisely as possible.

In this work, we focus on extraction methods for video summarization based on audio analysis. We address several scientific problems related to this objective: content extraction, document structuring, definition and estimation of objective functions, and extraction algorithms. On each of these aspects, we make concrete proposals that are evaluated.

On content extraction, we present a fast spoken-term detection method. The main novelty of this approach is that it relies on the construction of a detector based on the search terms. We show that this strategy of self-organization of the detector improves system robustness, which significantly exceeds that of the classical approach based on automatic speech recognition.

We then present an acoustic filtering method for automatic speech recognition based on Gaussian mixture models and factor analysis, as recently used in speaker identification. The originality of our contribution is the use of decomposition by factor analysis for estimating supervised filters in the cepstral domain.

We then discuss the issues of structuring video collections. We show that characterizing the editorial style of a video, using different levels of representation and different sources of information, relies principally on audio analysis, whereas most previous works suggested that the bulk of the information on genre was contained in the image. Another contribution concerns the identification of the type of discourse; we propose low-level models for detecting spontaneous speech that significantly improve on the state of the art for this kind of approach.

The third focus of this work concerns the summary itself. As part of video summarization, we first try to define what a synthetic view is. Is it what characterizes the whole document, or what a user would remember (for example, an emotional or funny moment)? This issue is discussed, and we make some concrete proposals for the definition of objective functions corresponding to three different criteria: salience, expressiveness and significance. We then propose an algorithm for finding the summary of maximum interest, derived from the one introduced in previous works and based on integer linear programming.
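
As an illustration only, the selection of a summary of maximum interest can be written as a small 0/1 integer program: maximize the sum of s_i * x_i subject to the sum of d_i * x_i staying within a duration budget, with x_i in {0, 1}. The sketch below is not the thesis's algorithm. The per-segment interest scores, the integer durations in seconds, the budget and the function name `select_segments` are all assumptions made for the example, and the program is solved exactly by dynamic programming rather than with an ILP solver.

```python
from typing import List, Tuple


def select_segments(
    scores: List[float], durations: List[int], budget: int
) -> Tuple[float, List[int]]:
    """Pick segments maximizing total score with total duration <= budget.

    Hypothetical formulation: max sum(s_i * x_i)
    subject to sum(d_i * x_i) <= budget, x_i in {0, 1}.
    Durations and budget are assumed to be non-negative integers.
    """
    n = len(scores)
    # best[i][b]: best total score using only the first i segments
    # with at most b units of duration.
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        s, d = scores[i - 1], durations[i - 1]
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]  # option 1: skip segment i-1
            if d <= b and best[i - 1][b - d] + s > best[i][b]:
                best[i][b] = best[i - 1][b - d] + s  # option 2: keep it

    # Walk back through the table to recover which segments were kept.
    chosen: List[int] = []
    b = budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= durations[i - 1]
    chosen.reverse()
    return best[n][budget], chosen


if __name__ == "__main__":
    # Four candidate segments with made-up scores and durations (seconds).
    total, chosen = select_segments([3.0, 4.0, 2.0, 5.0], [10, 20, 5, 25], 35)
    print(total, chosen)
```

In this toy setting each score could stand for any one of the three criteria named above, or for a combination of them; how the scores are actually estimated is outside the scope of the sketch.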

Additional Info

Field	Value
Source	https://theses.hal.science/tel-00954238
Author	Rouvier, Mickaël
Maintainer	CCSD
Last Updated	May 6, 2026, 04:27 (UTC)
Created	May 6, 2026, 04:27 (UTC)
Identifier	NNT: 2011AVIG0192
Language	fr
Rights	https://about.hal.science/hal-authorisation-v1/
contributor	Laboratoire Informatique d'Avignon (LIA) ; Avignon Université (AU)-Centre d'Enseignement et de Recherche en Informatique - CERI
creator	Rouvier, Mickaël
date	2011-12-05T00:00:00
harvest_object_id	1c37d787-19a7-4361-b9eb-c959fabf5da0
harvest_source_id	3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title	test moissonnage SELUNE
metadata_modified	2026-03-31T00:00:00
set_spec	type:THESE