This thesis deals with emotional voices during a human-robot interaction. In a natural interaction, we define at least, four kinds of variabilities: environment (room, microphone); speaker, its physic characteristics (gender, age, voice type) and personality; emotional states; and finally the kind of interaction (game scenario, emergency, everyday life). From audio signals collected in different conditions, we tried to find out, with acoustic features, to overlap speaker and his emotional state characterisation taking into account these variabilities.To find which features are essential and which are to avoid is hard challenge because it needs to work with a high number of variabilities and then to have riche and diverse data to our disposal. The main results are about the collection and the annotation of natural emotional corpora that have been recorded with different kinds of speakers (children, adults, elderly people) in various environments, and about how reliable are acoustic features across the four variabilities. This analysis led to two interesting aspects: the audio characterisation of a corpus and the drawing of a black list of features which vary a lot. Emotions are ust a part of paralinguistic features that are supported by the audio channel, other paralinguistic features have been studied such as personality and stress in the voice. We have also built automatic emotion recognition and speaker characterisation module that we have tested during realistic interactions. An ethic discussion have been driven on our work.