Combining evolution and machine learning for functional annotation and classification of remote homologous proteins.

Detection of remote homologous proteins is essential for the functional and structural classification of protein sequences and for completing the annotation of highly divergent genomes. Here, we present two new methods to address these problems. For the first problem, we introduce ILP-SVM Homology, which combines inductive logic programming (ILP) and propositional models. It proposes a novel logical representation of physico-chemical properties, conserved amino acid positions and conserved physico-chemical positions in sequence alignments. Based on these signals, ILP finds the most frequent patterns and uses them to train models such as decision trees and support vector machines. ILP-SVM Homology performs at least as well as other methods. To address the second problem, we propose CASH, a large-scale pipeline to annotate highly divergent genomes. CASH was applied to Plasmodium falciparum, but it is applicable to any species. In CASH we explore different evolutionary pathways, including those that are phylogenetically distant from P. falciparum. As a result, each known domain is represented by an ensemble of heterogeneous models, and the outputs are combined through a meta-classifier that assigns a confidence score to each prediction. Based on this score and on properties such as domain co-occurrence, CASH finds the most probable architecture for each query sequence by solving a multi-objective optimization problem. CASH provides domain annotation for 70% of proteins in P. falciparum, while its competitors achieve at most 58%. We find additional domains in already annotated proteins, and predict domains for proteins with unknown function.
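
The architecture-selection step described above can be illustrated with a deliberately simplified sketch. The abstract does not give the actual formulation, so this is not CASH's method: it replaces the multi-objective problem with a single objective (total confidence), ignores domain co-occurrence, and treats architecture selection as picking non-overlapping domain hits. The `Hit` type, the `best_architecture` function, and the domain names and scores are all hypothetical.

```python
from bisect import bisect_right
from typing import List, NamedTuple


class Hit(NamedTuple):
    domain: str   # hypothetical domain name
    start: int    # first residue covered (inclusive)
    end: int      # last residue covered (inclusive)
    score: float  # confidence score, assumed to come from a meta-classifier


def best_architecture(hits: List[Hit]) -> List[Hit]:
    """Return the non-overlapping subset of hits with maximal total score.

    Simplification: a single objective solved by weighted interval
    scheduling, not the multi-objective problem named in the abstract.
    """
    ordered = sorted(hits, key=lambda h: h.end)
    ends = [h.end for h in ordered]
    # best[i] = (total score, chosen hits) using only the first i hits
    best = [(0.0, [])]
    for i, hit in enumerate(ordered):
        # count of earlier hits that end strictly before this hit starts
        k = bisect_right(ends, hit.start - 1, 0, i)
        take = (best[k][0] + hit.score, best[k][1] + [hit])
        skip = best[i]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1][1]


# Made-up candidate hits on one query sequence
candidates = [
    Hit("A", 1, 100, 0.9),
    Hit("B", 80, 200, 0.6),
    Hit("C", 101, 250, 0.8),
    Hit("D", 260, 300, 0.5),
]
architecture = [h.domain for h in best_architecture(candidates)]
```

In this toy input, hit B overlaps both A and C, so it is dropped in favour of the higher-scoring compatible pair. A fuller treatment would add further objectives, such as a co-occurrence term, rather than summing confidence alone.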

Additional Info

Field Value
Source https://theses.hal.science/tel-00684155
Author Silva Bernardes, Juliana
Maintainer CCSD
Last Updated May 23, 2026, 01:41 (UTC)
Created May 23, 2026, 01:41 (UTC)
Identifier tel-00684155
Language en
Rights https://about.hal.science/hal-authorisation-v1/
contributor Génomique des Microorganismes (LGM) ; Université Pierre et Marie Curie - Paris 6 (UPMC)-Centre National de la Recherche Scientifique (CNRS)
creator Silva Bernardes, Juliana
date 2012-03-28T00:00:00
harvest_object_id 1e8592a3-b444-4258-8699-1a79fb0edf36
harvest_source_id 3374d638-d20b-4672-ba96-a23232d55657
harvest_source_title test moissonnage SELUNE
metadata_modified 2025-08-12T00:00:00
set_spec type:THESE