Efficient and multilingual text-to-image synthesis: exploring novel architectures and cross-language strategies

Bibliographic details
Year of defense: 2024
Main author: Souza, Douglas Matos de
Advisor: Not informed by the institution
Defense committee: Not informed by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Pontifícia Universidade Católica do Rio Grande do Sul
Escola Politécnica
Brazil
PUCRS
Programa de Pós-Graduação em Ciência da Computação
Graduate program: Not informed by the institution
Department: Not informed by the institution
Country: Not informed by the institution
Keywords in Portuguese:
Access link: https://tede2.pucrs.br/tede2/handle/tede/11642
Abstract: Text-to-image synthesis is the task of generating images from text descriptions. Given a textual description, a text-to-image algorithm can generate multiple novel images that contain the details described in the text. Text-to-image algorithms are appealing for various real-world tasks: with them, machines can draw truly novel images that can be used, for example, for content generation or assisted drawing. The general framework of text-to-image approaches can be divided into two main parts: i) a text encoder and ii) a generative model for images, which learns a conditional distribution over the encoded text. Current text-to-image approaches leverage multiple neural networks to overcome the challenges of learning a generative model over images, which increases the overall framework's complexity as well as the computation required to generate high-resolution images. Additionally, no work so far has explored cross-language models in the context of text-to-image generation, limiting current approaches to supporting only English. This limitation has a significant downside: it restricts access to the technology to users familiar with English, leaving out a substantial number of people who could benefit. In this thesis, we make the following contributions to address each of these gaps. First, we propose a new end-to-end text-to-image approach that relies on a single neural network for the image generator, reducing complexity and computation. Second, we propose a new loss function that improves training and yields more accurate models. Finally, we study how text encoders affect the overall performance of text-to-image generation and propose a novel cross-language approach that extends models to support multiple languages simultaneously.
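
The two-part framework described in the abstract (a text encoder feeding a conditional image generator) can be sketched in PyTorch as follows. This is a minimal illustration only: the module structure, layer sizes, and names below are hypothetical and are not the models proposed in the thesis.

# Minimal sketch of the two-part text-to-image framework: a text encoder
# and an image generator conditioned on the encoded text. All names and
# dimensions are illustrative assumptions, not the thesis's architecture.
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Encodes a tokenized caption into a fixed-size embedding."""
    def __init__(self, vocab_size=30000, embed_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, embed_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)   # (B, T, D)
        _, hidden = self.rnn(embedded)         # (1, B, D)
        return hidden.squeeze(0)               # (B, D)

class ConditionalGenerator(nn.Module):
    """Maps noise concatenated with a text embedding to an image,
    i.e. samples from a distribution conditioned on the encoded text."""
    def __init__(self, noise_dim=100, embed_dim=256, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(noise_dim + embed_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
            nn.Tanh(),
        )

    def forward(self, noise, text_embedding):
        z = torch.cat([noise, text_embedding], dim=1)
        img = self.net(z)
        return img.view(-1, 3, self.img_size, self.img_size)

# Usage: encode a caption, then sample an image conditioned on it.
encoder = TextEncoder()
generator = ConditionalGenerator()
tokens = torch.randint(0, 30000, (1, 16))      # dummy tokenized caption
image = generator(torch.randn(1, 100), encoder(tokens))
print(image.shape)                             # torch.Size([1, 3, 64, 64])

Swapping the text encoder (for instance, for a multilingual one) while keeping the generator fixed is the kind of design question the thesis's cross-language study addresses.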