Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
Main Author: | Bordalo, João |
---|---|
Publication Date: | 2024 |
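Other Authors: | Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao |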
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | http://hdl.handle.net/10362/179722 |
Summary: | Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics. |
id |
RCAP_01d07fee0e8f8a4a7e14ed03fcd6a97f |
---|---|
oai_identifier_str |
oai:run.unl.pt:10362/179722 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
spelling |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks; Computer Science Applications; Linguistics and Language; Language and Linguistics; SDG 3 - Good Health and Well-being; Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.; Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision and Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across the sequence of visual illustrations.; Association for Computational Linguistics (ACL); NOVALincs; RUN; Bordalo, João; Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao; 2025-02-24T23:25:14Z; 2024; 2024-01-01T00:00:00Z; conference object; info:eu-repo/semantics/publishedVersion; 21; application/pdf; http://hdl.handle.net/10362/179722; eng; 9798891760943; 0736-587X; PURE: 107298862; https://doi.org/10.18653/v1/2024.acl-long.690; info:eu-repo/semantics/openAccess; reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP; 2025-03-03T01:38:38Z; oai:run.unl.pt:10362/179722; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; info@rcaap.pt; opendoar:https://opendoar.ac.uk/repository/7160; 2025-05-29T00:06:58.759087; Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; false |
dc.title.none.fl_str_mv |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
title |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
spellingShingle |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks; Bordalo, João; Computer Science Applications; Linguistics and Language; Language and Linguistics; SDG 3 - Good Health and Well-being |
title_short |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
title_full |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
title_fullStr |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
title_full_unstemmed |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
title_sort |
Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks |
author |
Bordalo, João |
author_facet |
Bordalo, João; Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao |
author_role |
author |
author2 |
Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao |
author2_role |
author author author author author author author |
dc.contributor.none.fl_str_mv |
NOVALincs; RUN |
dc.contributor.author.fl_str_mv |
Bordalo, João; Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao |
dc.subject.por.fl_str_mv |
Computer Science Applications; Linguistics and Language; Language and Linguistics; SDG 3 - Good Health and Well-being |
topic |
Computer Science Applications; Linguistics and Language; Language and Linguistics; SDG 3 - Good Health and Well-being |
description |
Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics. |
publishDate |
2024 |
dc.date.none.fl_str_mv |
2024; 2024-01-01T00:00:00Z; 2025-02-24T23:25:14Z |
dc.type.driver.fl_str_mv |
conference object |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
http://hdl.handle.net/10362/179722 |
url |
http://hdl.handle.net/10362/179722 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
9798891760943; 0736-587X; PURE: 107298862; https://doi.org/10.18653/v1/2024.acl-long.690 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
21; application/pdf |
dc.publisher.none.fl_str_mv |
Association for Computational Linguistics (ACL) |
publisher.none.fl_str_mv |
Association for Computational Linguistics (ACL) |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833600402834063360 |