Uma análise audiovisual da produção de tons lexicais (An audiovisual analysis of lexical tone production)
Year of defense: | 2020 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Master's dissertation |
Access type: | Open access |
Language: | Portuguese (por) |
Defending institution: | Universidade Federal de Minas Gerais, Brasil. ENG - Departamento de Engenharia Elétrica, Programa de Pós-Graduação em Engenharia Elétrica (UFMG) |
Graduate program: | Not informed by the institution |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords in Portuguese: | |
Access link: | http://hdl.handle.net/1843/34183 https://orcid.org/0000-0002-7612-9754 |
Abstract:

It is known that speech manifests itself not only acoustically but also visually, through facial movements and body gestures, in addition to having physiological correlates such as vocal tract movement and neural activity. This work presents an audiovisual analysis of the production of lexical tones, which are pitch variations that change the meaning of words in tone languages. Lexical tones are traditionally studied in terms of acoustic parameters, such as the fundamental frequency (F0) of the speech signal. This work, however, adopts an integrated approach, investigating the contribution, both in isolation and jointly, of the acoustic and visual components of speech to the differentiation of lexical tones in three tone languages (Cantonese, Mandarin, and Thai). The approach consists in classifying the tones of each language from each component taken in isolation and comparing the resulting performances.

Data were collected in audiovisual speech production experiments with seven speakers of the three languages. The visual component of speech was captured through 3D tracking of markers fixed to the participants' faces and heads, and the acoustic component was recorded simultaneously by a microphone. After the experiment, the marker positions were subjected to a head-movement compensation procedure in order to separate them into two components: one due to the movement of the face and the other due to the rigid-body movement of the head. The F0 of the acoustic signal was estimated with the autocorrelation method. At this point, the visual component is represented by three types of signals: Total Movement (marker positions), Face, and Head (the latter two resulting from the decomposition); the acoustic component is represented by the F0 curves.

All types of signals were parameterized using polynomial regression, being represented by coefficients that approximate their original trajectories. The parameterized signals were then used to train linear and non-linear classifiers, with the tones of each language as class labels. The ability of each type of signal to discriminate the lexical tones was measured by the accuracy of each classifier, obtained with K-fold cross-validation (K = 5).

Visual signals were able to classify lexical tones in all three languages with above-chance accuracy. The highest accuracy was obtained by the F0 signals. Among the visual signals, the highest accuracies were obtained, in decreasing order, by the Total Movement and Face signals. In addition, some lexical tones of the same language were classified with above-average accuracy, suggesting that some tones are easier to classify than others. The results are consistent with the literature and suggest that lexical tones can be predicted not only from F0 but also, to a lesser extent, from the movements of the face and head.
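As a concrete illustration of the head-movement compensation step, the sketch below separates tracked marker positions into a rigid Head component and a non-rigid Face residual. The abstract does not name the algorithm actually used; rigid alignment via the Kabsch algorithm is a common choice and is assumed here, as are the array shapes and the existence of a dedicated subset of head-mounted markers.

```python
# Minimal sketch, assuming Kabsch rigid alignment (not confirmed by the thesis).
import numpy as np

def kabsch(P, Q):
    """Best rigid transform (R, t) mapping point set P onto Q, both (N, 3)."""
    cP, cQ = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cP).T @ (Q - cQ)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    return R, cQ - R @ cP

def decompose(frames, head_idx, ref=0):
    """Split (T, N, 3) marker trajectories into Face and Head components.

    head_idx indexes markers rigidly attached to the head (hypothetical
    layout); frame `ref` supplies the reference head pose.
    """
    face = np.empty_like(frames)
    head = np.empty_like(frames)
    for k, frame in enumerate(frames):
        R, t = kabsch(frame[head_idx], frames[ref][head_idx])
        face[k] = frame @ R.T + t        # head motion removed: Face signal
        head[k] = (frames[ref] - t) @ R  # reference shape carried by the rigid motion: Head signal
    return face, head
```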
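The F0 estimation method is named explicitly (autocorrelation), so a minimal sketch follows. Frame length, hop size, pitch range, and the voicing threshold are illustrative assumptions, not values from the thesis.

```python
# Minimal sketch of autocorrelation-based F0 estimation; parameters are assumptions.
import numpy as np

def estimate_f0(x, sr, frame_ms=40.0, hop_ms=10.0, fmin=60.0, fmax=400.0):
    """Return one F0 estimate per frame (Hz, or 0.0 for unvoiced)."""
    frame = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    lo, hi = int(sr / fmax), int(sr / fmin)   # candidate lag range
    f0 = []
    for start in range(0, len(x) - frame, hop):
        seg = x[start:start + frame]
        seg = seg - seg.mean()                # remove DC offset
        ac = np.correlate(seg, seg, mode="full")[frame - 1:]  # lags >= 0
        if ac[0] <= 0:                        # silent frame: no energy
            f0.append(0.0)
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Crude voicing check: the peak must be a sizeable fraction of ac[0]
        f0.append(sr / lag if ac[lag] > 0.3 * ac[0] else 0.0)
    return np.array(f0)
```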
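The parameterization step reduces each trajectory (an F0 curve or one marker coordinate over time) to polynomial regression coefficients. A minimal sketch, assuming a cubic fit over normalized time; the polynomial order is not given in the abstract.

```python
# Minimal sketch of the polynomial parameterization; order=3 is an assumption.
import numpy as np

def parameterize(trajectory, order=3):
    """Fit a polynomial over normalized time; return its (order + 1) coefficients."""
    t = np.linspace(0.0, 1.0, len(trajectory))  # normalize out duration differences
    return np.polyfit(t, trajectory, order)

def feature_vector(signals, order=3):
    """Concatenate per-trajectory coefficients into one token-level feature vector."""
    return np.concatenate([parameterize(s, order) for s in signals])
```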
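Finally, the classification step scores each signal type with linear and non-linear classifiers under 5-fold cross-validation, with tone labels as classes. The models below (logistic regression and an RBF-kernel SVM) are illustrative stand-ins, since the abstract does not say which classifiers were used.

```python
# Minimal sketch of the classification step; model choices are assumptions.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def tone_classification_accuracy(X, y):
    """X: (tokens, coefficients) feature matrix; y: lexical-tone labels."""
    models = {
        "linear": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
        "non-linear": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    }
    # Mean accuracy over K = 5 cross-validation folds, per model
    return {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}
```

Comparing these accuracies across the Total Movement, Face, Head, and F0 feature sets is what supports the abstract's claim that F0 classifies tones best, with the visual signals still above chance.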