Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks

Bibliographic Details
Main Author: Bordalo, João
Publication Date: 2024
Other Authors: Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10362/179722
Summary: Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.
id RCAP_01d07fee0e8f8a4a7e14ed03fcd6a97f
oai_identifier_str oai:run.unl.pt:10362/179722
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
Computer Science Applications
Linguistics and Language
Language and Linguistics
SDG 3 - Good Health and Well-being
Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.
Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision and Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across the sequence of visual illustrations.
Association for Computational Linguistics (ACL)
NOVALincs
RUN
Bordalo, João
Ramos, Vasco
Valério, Rodrigo
Glória-Silva, Diogo
Bitton, Yonatan
Yarom, Michal
Szpektor, Idan
Magalhaes, Joao
2025-02-24T23:25:14Z
2024
2024-01-01T00:00:00Z
conference object
info:eu-repo/semantics/publishedVersion
21
application/pdf
http://hdl.handle.net/10362/179722
eng
9798891760943
0736-587X
PURE: 107298862
https://doi.org/10.18653/v1/2024.acl-long.690
info:eu-repo/semantics/openAccess
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
2025-03-03T01:38:38Z
oai:run.unl.pt:10362/179722
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
info@rcaap.pt
opendoar:https://opendoar.ac.uk/repository/7160
2025-05-29T00:06:58.759087
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
false
dc.title.none.fl_str_mv Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
spellingShingle Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
Bordalo, João
Computer Science Applications
Linguistics and Language
Language and Linguistics
SDG 3 - Good Health and Well-being
title_short Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_full Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_fullStr Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_full_unstemmed Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_sort Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
author Bordalo, João
author_facet Bordalo, João
Ramos, Vasco
Valério, Rodrigo
Glória-Silva, Diogo
Bitton, Yonatan
Yarom, Michal
Szpektor, Idan
Magalhaes, Joao
author_role author
author2 Ramos, Vasco
Valério, Rodrigo
Glória-Silva, Diogo
Bitton, Yonatan
Yarom, Michal
Szpektor, Idan
Magalhaes, Joao
author2_role author
author
author
author
author
author
author
dc.contributor.none.fl_str_mv NOVALincs
RUN
dc.contributor.author.fl_str_mv Bordalo, João
Ramos, Vasco
Valério, Rodrigo
Glória-Silva, Diogo
Bitton, Yonatan
Yarom, Michal
Szpektor, Idan
Magalhaes, Joao
dc.subject.por.fl_str_mv Computer Science Applications
Linguistics and Language
Language and Linguistics
SDG 3 - Good Health and Well-being
topic Computer Science Applications
Linguistics and Language
Language and Linguistics
SDG 3 - Good Health and Well-being
description Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.
publishDate 2024
dc.date.none.fl_str_mv 2024
2024-01-01T00:00:00Z
2025-02-24T23:25:14Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/179722
url http://hdl.handle.net/10362/179722
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 9798891760943
0736-587X
PURE: 107298862
https://doi.org/10.18653/v1/2024.acl-long.690
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv 21
application/pdf
dc.publisher.none.fl_str_mv Association for Computational Linguistics (ACL)
publisher.none.fl_str_mv Association for Computational Linguistics (ACL)
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833600402834063360