Efficient and multilingual text-to-image synthesis: exploring novel architectures and cross-language strategies
Year of defense: | 2024 |
---|---|
Main author: | |
Advisor: | |
Defense committee: | |
Document type: | Thesis |
Access type: | Open access |
Language: | eng |
Defending institution: | Pontifícia Universidade Católica do Rio Grande do Sul, Escola Politécnica, Brasil, PUCRS, Programa de Pós-Graduação em Ciência da Computação |
Graduate program: | Not informed by the institution |
Department: | Not informed by the institution |
Country: | Not informed by the institution |
Keywords in Portuguese: | |
Access link: | https://tede2.pucrs.br/tede2/handle/tede/11642 |
Abstract: | Text-to-image synthesis is the task of generating images from text descriptions. Given a textual description, a text-to-image algorithm can generate multiple novel images that contain the details described in the text. Text-to-image algorithms are appealing for various real-world tasks: with them, machines can draw truly novel images that can be used, for example, for content generation or assisted drawing. The general framework of text-to-image approaches can be divided into two main parts: i) a text encoder and ii) a generative model for images, which learns a conditional distribution over encoded text. Currently, text-to-image approaches leverage multiple neural networks to overcome the challenges of learning a generative model over images, increasing the overall framework's complexity as well as the computation required to generate high-resolution images. Additionally, no prior work has explored cross-language models in the context of text-to-image generation, limiting current approaches to supporting only English. This limitation has a significant downside: it restricts access to the technology to users familiar with the English language, leaving out a substantial number of people who could benefit from it. In this thesis, we make the following contributions to address each of the aforementioned gaps. First, we propose a new end-to-end text-to-image approach that relies on a single neural network for the image generator model, reducing complexity and computation. Second, we propose a new loss function that improves training and yields more accurate models. Finally, we study how text encoders affect the overall performance of text-to-image generation and propose a novel cross-language approach to extend models to support multiple languages simultaneously. |
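The two-part framework described in the abstract can be illustrated with a minimal sketch: a text encoder maps a description to a fixed-size embedding, and a conditional generator maps that embedding plus random noise to an image. This is a toy stand-in, not the thesis' actual architecture; all names, dimensions, and the bag-of-words encoder are illustrative assumptions.

```python
import numpy as np

# Toy sketch of the generic text-to-image pipeline from the abstract:
# (i) a text encoder and (ii) a conditional image generator.
# Everything here (vocabulary, dimensions, linear maps) is a hypothetical
# placeholder, not the model proposed in the thesis.

rng = np.random.default_rng(0)

VOCAB = {"a": 0, "red": 1, "bird": 2, "on": 3, "branch": 4}
EMB_DIM, NOISE_DIM, IMG_SIZE = 8, 4, 16

# (i) Text encoder: mean of learned token embeddings (bag-of-words stand-in
# for a real sentence encoder).
token_table = rng.normal(size=(len(VOCAB), EMB_DIM))

def encode_text(words):
    ids = [VOCAB[w] for w in words if w in VOCAB]
    return token_table[ids].mean(axis=0)

# (ii) Conditional generator: a single linear map over [noise; embedding],
# so the generated image depends on both the text and the random seed.
W = rng.normal(size=(NOISE_DIM + EMB_DIM, IMG_SIZE * IMG_SIZE))

def generate_image(text_embedding, noise):
    h = np.concatenate([noise, text_embedding])
    return np.tanh(h @ W).reshape(IMG_SIZE, IMG_SIZE)

emb = encode_text(["a", "red", "bird"])
img = generate_image(emb, rng.normal(size=NOISE_DIM))
print(img.shape)  # (16, 16)
```

Sampling different noise vectors for the same embedding mimics how one description yields multiple novel images; in a real system the linear maps would be deep networks trained so the conditional distribution matches the data.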