Toric grammars: a new statistical approach to natural language modeling

We propose a new statistical model for computational linguistics. Rather than directly estimating the probability distribution of a random sentence of the language, we define a Markov chain on finite sets of sentences with many finite recurrent communicating classes, and define our language model as the invariant probability measures of the chain on each recurrent communicating class. This Markov chain, which we call a communication model, randomly recombines at each step the set of sentences forming its current state, using a set of grammar rules. When the grammar rules are fixed and known in advance, instead of being estimated on the fly, we can prove additional mathematical properties. In particular, we can prove in this case that all states are recurrent, so that the chain partitions its state space into finite recurrent communicating classes. We show that our approach is a decisive departure from Markov models at the sentence level and discuss its relationship with Context Free Grammars. Although the toric grammars we use are closely related to Context Free Grammars, the way we generate the language from the grammar is qualitatively different. Our communication model has two purposes. On the one hand, it is used to define indirectly the probability distribution of a random sentence of the language. On the other hand, it can serve as a (crude) model of language transmission from one speaker to another through the communication of a (large) set of sentences.
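
To make the idea of a Markov chain on sets of sentences more concrete, here is a minimal toy sketch in Python. It is not the authors' toric grammar construction: the recombination rule (swapping the tails of two sentences at a shared word) and the names `recombine`, `word_counts`, and `simulate` are assumptions chosen only for illustration. In this toy, every move can be undone by a move of the same kind and the multiset of words is conserved, so the states reachable from a starting set form a finite class that the chain never leaves.

```python
import random
from collections import Counter


def recombine(sentences, rng):
    """One toy step: swap the tails of two sentences at a shared word.

    This rule is a hypothetical stand-in for grammar-based recombination.
    """
    state = [s.split() for s in sentences]
    i, j = rng.sample(range(len(state)), 2)
    a, b = state[i], state[j]
    # All position pairs (p, q) where both sentences carry the same word.
    cuts = [(p, q) for p, w in enumerate(a) for q, v in enumerate(b) if w == v]
    if cuts:
        p, q = rng.choice(cuts)
        state[i], state[j] = a[:p] + b[q:], b[:q] + a[p:]
    # Sort so that a state is an unordered collection of sentences.
    return tuple(sorted(" ".join(s) for s in state))


def word_counts(sentences):
    """Multiset of words used across all sentences of a state."""
    return Counter(w for s in sentences for w in s.split())


def simulate(start, steps, seed=0):
    """Run the toy chain and count how often each state is visited."""
    rng = random.Random(seed)
    state = tuple(sorted(start))
    visits = Counter()
    for _ in range(steps):
        state = recombine(state, rng)
        visits[state] += 1
    return visits


if __name__ == "__main__":
    start = ["the cat sees the dog", "a dog chases the cat"]
    visits = simulate(start, steps=1000)
    for state, count in visits.most_common(5):
        print(count, state)
```

Over a long run, the empirical visit frequencies give a rough picture of an invariant measure on the class containing the starting set, which is the role the abstract assigns to the language model. The marginal distribution of a single sentence can then be read off by counting sentences across visited states.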

Additional Info

Field Value
Source https://hal.science/hal-00934487
Author Catoni, Olivier; Mainguy, Thomas
Maintainer CCSD
Last Updated May 7, 2026, 07:40 (UTC)
Created May 7, 2026, 07:40 (UTC)
Identifier hal-00934487
Language en
contributor Département de Mathématiques et Applications - ENS-PSL (UMR8553) (DMA) ; École normale supérieure - Paris (ENS-PSL) ; Université Paris Sciences et Lettres (PSL)-Université Paris Sciences et Lettres (PSL)-Centre National de la Recherche Scientifique (CNRS)
creator Catoni, Olivier
date 2013-02-11T00:00:00
harvest_object_id bde0bb07-12f6-49fb-97b2-2c83293e263a
harvest_source_id 3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title test moissonnage SELUNE
metadata_modified 2025-03-20T00:00:00
relation info:eu-repo/semantics/altIdentifier/arxiv/1302.2569
set_spec type:UNDEFINED