Robust French syntax analysis: reconciling statistical methods and linguistic knowledge in the Talismane toolkit

In this thesis we explore robust statistical syntax analysis for French. Our main concern is to find methods whereby the linguist can inject linguistic knowledge and/or resources into the robust statistical engine in order to improve results for specific phenomena. We first examine the dependency annotation schema for French, concentrating on certain phenomena. Next, we look into the various algorithms capable of producing this annotation, and in particular the transition-based parsing algorithm used in the rest of this thesis. After reviewing supervised machine learning algorithms for NLP classification problems, we present the Talismane toolkit for syntax analysis, built within the framework of this thesis. The toolkit includes four statistical modules (sentence boundary detection, tokenisation, POS tagging and parsing), as well as the various linguistic resources used for the baseline model, including corpora, lexicons and feature sets. Our first experiments test various machine learning configurations in order to identify the best baseline. We then look into improvements made possible by a beam search and beam propagation. Next, we present a series of experiments aimed at correcting errors related to specific linguistic phenomena, using targeted features. One of our innovations is the introduction of rules that can impose or prohibit certain decisions locally, thus bypassing the statistical model. We explore the usage of rules for errors that the features are unable to correct. Finally, we look into the enhancement of targeted features by large-scale linguistic resources, and in particular a semi-supervised approach using a distributional semantic resource.

Additional Info

Field Value
Source https://theses.hal.science/tel-00979681
Author Urieli, Assaf
Maintainer CCSD
Last Updated May 5, 2026, 14:28 (UTC)
Created May 5, 2026, 14:28 (UTC)
Identifier tel-00979681
Language en
Rights https://about.hal.science/hal-authorisation-v1/
contributor Cognition, Langues, Langage, Ergonomie (CLLE-ERSS) ; École Pratique des Hautes Études (EPHE) ; Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Université Toulouse - Jean Jaurès (UT2J) ; Communauté d'universités et établissements de Toulouse (Comue de Toulouse)-Communauté d'universités et établissements de Toulouse (Comue de Toulouse)-Université Bordeaux Montaigne (UBM)-Centre National de la Recherche Scientifique (CNRS)
creator Urieli, Assaf
date 2013-12-17T00:00:00
harvest_object_id b2d76e4b-44df-419c-aec7-d96c845260c4
harvest_source_id 3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title test moissonnage SELUNE
metadata_modified 2025-09-23T00:00:00
set_spec type:THESE