RAMBLE: robust acoustic modeling for Brazilian learners of English

Detalhes bibliográficos
Ano de defesa: 2018
Autor(a) principal: Shulby, Christopher Dane
Orientador(a): Não Informado pela instituição
Banca de defesa: Não Informado pela instituição
Tipo de documento: Tese
Tipo de acesso: Acesso aberto
Idioma: eng
Instituição de defesa: Biblioteca Digitais de Teses e Dissertações da USP
Programa de Pós-Graduação: Não Informado pela instituição
Departamento: Não Informado pela instituição
País: Não Informado pela instituição
Palavras-chave em Português:
Link de acesso: http://www.teses.usp.br/teses/disponiveis/55/55134/tde-19102018-112733/
Resumo: The gains made by current deep-learning techniques have often come with the price tag of big data and where that data is not available, a new solution must be found. Such is the case for accented and noisy speech where large databases do not exist and data augmentation techniques, which are less than perfect, present an even larger obstacle. Another problem is that state-of-the-art results are rarely reproducible because they use proprietary datasets, pretrained networks and/or weight initializations from other larger networks. An example of a low resource scenario exists even in the fifth largest land in the world; home to most of the speakers of the seventh most spoken language on earth. Brazil is the leader in the Latin-American economy and as a BRIC country aspires to become an ever-stronger player in the global marketplace. Still, English proficiency is low, even for professionals in businesses and universities. Low intelligibility and strong accents can damage professional credibility. It has been established in the literature for foreign language teaching that it is important that adult learners are made aware of their errors as outlined by the Noticing Theory, explaining that a learner is more successful when he is able to learn from his own mistakes. An essential objective of this dissertation is to classify phonemes in the acoustic model which is needed to properly identify phonemic errors automatically. A common belief in the community is that deep learning requires large datasets to be effective. This happens because brute force methods create a highly complex hypothesis space which requires large and complex networks which in turn demand a great amount of data samples in order to generate useful networks. Besides that, the loss functions used in neural learning does not provide statistical learning guarantees and only guarantees the network can memorize the training space well. In the case of accented or noisy speech where a new sample can carry a great deal of variation from the training samples, the generalization of such models suffers. The main objective of this dissertation is to investigate how more robust acoustic generalizations can be made, even with little data and noisy accented-speech data. The approach here is to take advantage of raw feature extraction provided by deep learning techniques and instead focus on how learning guarantees can be provided for small datasets to produce robust results for acoustic modeling without the dependency of big data. This has been done by careful and intelligent parameter and architecture selection within the framework of the statistical learning theory. Here, an intelligently defined CNN architecture, together with context windows and a knowledge-driven hierarchical tree of SVM classifiers achieves nearly state-of-the-art frame-wise phoneme recognition results with absolutely no pretraining or external weight initialization. A goal of this thesis is to produce transparent and reproducible architectures with high frame-level accuracy, comparable to the state of the art. Additionally, a convergence analysis based on the learning guarantees of the statistical learning theory is performed in order to evidence the generalization capacity of the model. The model achieves 39.7% error in framewise classification and a 43.5% phone error rate using deep feature extraction and SVM classification even with little data (less than 7 hours). These results are comparable to studies which use well over ten times that amount of data. Beyond the intrinsic evaluation, the model also achieves an accuracy of 88% in the identification of epenthesis, the error which is most difficult for Brazilian speakers of English This is a 69% relative percentage gain over the previous values in the literature. The results are significant because it shows how deep feature extraction can be applied to little data scenarios, contrary to popular belief. The extrinsic, task-based results also show how this approach could be useful in tasks like automatic error diagnosis. Another contribution is the publication of a number of freely available resources which previously did not exist, meant to aid future researches in dataset creation.