Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks

Bordalo, João; Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao

Bibliographic Details
Main Author:	Bordalo, João
Publication Date:	2024
Other Authors:	Ramos, Vasco; Valério, Rodrigo; Glória-Silva, Diogo; Bitton, Yonatan; Yarom, Michal; Szpektor, Idan; Magalhaes, Joao
Language:	eng
Source:	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full:	http://hdl.handle.net/10362/179722
Summary:	Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.

Item metadata

id	RCAP_01d07fee0e8f8a4a7e14ed03fcd6a97f
oai_identifier_str	oai:run.unl.pt:10362/179722
network_acronym_str	RCAP
network_name_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str	https://opendoar.ac.uk/repository/7160
spelling	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks Computer Science Applications Linguistics and Language Language and Linguistics SDG 3 - Good Health and Well-being Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics. Multistep instructions, such as recipes and how-to guides, greatly benefit from visual aids, such as a series of images that accompany the instruction steps. While Large Language Models (LLMs) have become adept at generating coherent textual steps, Large Vision and Language Models (LVLMs) are less capable of generating accompanying image sequences. The most challenging aspect is that each generated image needs to adhere to the relevant textual step instruction, as well as be visually consistent with earlier images in the sequence. To address this problem, we propose an approach for generating consistent image sequences, which integrates a Latent Diffusion Model (LDM) with an LLM to transform the sequence into a caption to maintain the semantic coherence of the sequence. In addition, to maintain the visual coherence of the image sequence, we introduce a copy mechanism to initialise reverse diffusion processes with a latent vector iteration from a previously generated image from a relevant step. Both strategies will condition the reverse diffusion process on the sequence of instruction steps and tie the contents of the current image to previous instruction steps and corresponding images. Experiments show that the proposed approach is preferred by humans in 46.6% of the cases against 26.6% for the second best method. In addition, automatic metrics showed that the proposed method maintains semantic coherence and visual consistency across the sequence of visual illustrations. Association for Computational Linguistics (ACL) NOVALincs RUN Bordalo, João Ramos, Vasco Valério, Rodrigo Glória-Silva, Diogo Bitton, Yonatan Yarom, Michal Szpektor, Idan Magalhaes, Joao 2025-02-24T23:25:14Z 2024 2024-01-01T00:00:00Z conference object info:eu-repo/semantics/publishedVersion 21 application/pdf http://hdl.handle.net/10362/179722 eng 9798891760943 0736-587X PURE: 107298862 https://doi.org/10.18653/v1/2024.acl-long.690 info:eu-repo/semantics/openAccess reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP 2025-03-03T01:38:38Z oai:run.unl.pt:10362/179722 Portal Agregador ONG https://www.rcaap.pt/oai/openaire info@rcaap.pt opendoar:https://opendoar.ac.uk/repository/7160 2025-05-29T00:06:58.759087 Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia false
dc.title.none.fl_str_mv	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
spellingShingle	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks Bordalo, João Computer Science Applications Linguistics and Language Language and Linguistics SDG 3 - Good Health and Well-being
title_short	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_full	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_fullStr	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_full_unstemmed	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
title_sort	Generating Coherent Sequences of Visual Illustrations for Real-World Manual Tasks
author	Bordalo, João
author_facet	Bordalo, João Ramos, Vasco Valério, Rodrigo Glória-Silva, Diogo Bitton, Yonatan Yarom, Michal Szpektor, Idan Magalhaes, Joao
author_role	author
author2	Ramos, Vasco Valério, Rodrigo Glória-Silva, Diogo Bitton, Yonatan Yarom, Michal Szpektor, Idan Magalhaes, Joao
author2_role	author author author author author author author
dc.contributor.none.fl_str_mv	NOVALincs RUN
dc.contributor.author.fl_str_mv	Bordalo, João Ramos, Vasco Valério, Rodrigo Glória-Silva, Diogo Bitton, Yonatan Yarom, Michal Szpektor, Idan Magalhaes, Joao
dc.subject.por.fl_str_mv	Computer Science Applications Linguistics and Language Language and Linguistics SDG 3 - Good Health and Well-being
topic	Computer Science Applications Linguistics and Language Language and Linguistics SDG 3 - Good Health and Well-being
description	Funding Information: This work was partially supported by Amazon Alexa Prize TaskBot, by a Google Research Gift and by NOVA LINCS ref. UIDB/04516/2020 (https://doi.org/10.54499/UIDB/04516/2020). Publisher Copyright: © 2024 Association for Computational Linguistics.
publishDate	2024
dc.date.none.fl_str_mv	2024 2024-01-01T00:00:00Z 2025-02-24T23:25:14Z
dc.type.driver.fl_str_mv	conference object
dc.type.status.fl_str_mv	info:eu-repo/semantics/publishedVersion
status_str	publishedVersion
dc.identifier.uri.fl_str_mv	http://hdl.handle.net/10362/179722
url	http://hdl.handle.net/10362/179722
dc.language.iso.fl_str_mv	eng
language	eng
dc.relation.none.fl_str_mv	9798891760943 0736-587X PURE: 107298862 https://doi.org/10.18653/v1/2024.acl-long.690
dc.rights.driver.fl_str_mv	info:eu-repo/semantics/openAccess
eu_rights_str_mv	openAccess
dc.format.none.fl_str_mv	21 application/pdf
dc.publisher.none.fl_str_mv	Association for Computational Linguistics (ACL)
publisher.none.fl_str_mv	Association for Computational Linguistics (ACL)
dc.source.none.fl_str_mv	reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP
instname_str	FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str	RCAAP
institution	RCAAP
reponame_str	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv	Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv	info@rcaap.pt
_version_	1833600402834063360
