Show and Guide

Bibliographic Details
Main Author: Glória-Silva, Diogo
Publication Date: 2024
Other Authors: Semedo, David; Magalhães, João
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10362/181117
Summary: Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics.
id RCAP_a34989dba687c1a76ca0fca547a7d7c3
oai_identifier_str oai:run.unl.pt:10362/181117
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Show and Guide
Instructional-Plan Grounded Vision and Language Model
Computational Theory and Mathematics
Computer Science Applications
Information Systems
Linguistics and Language
Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics.
Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to deliver an effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.
Association for Computational Linguistics (ACL)
NOVALincs
RUN
Glória-Silva, Diogo
Semedo, David
Magalhães, João
2025-03-21T21:28:13Z
2024
2024-01-01T00:00:00Z
conference object
info:eu-repo/semantics/publishedVersion
19
application/pdf
http://hdl.handle.net/10362/181117
eng
9798891761643
PURE: 113515029
https://doi.org/10.18653/v1/2024.emnlp-main.1191
info:eu-repo/semantics/openAccess
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
2025-03-31T02:03:45Z
oai:run.unl.pt:10362/181117
Portal Agregador
ONG
https://www.rcaap.pt/oai/openaire
info@rcaap.pt
opendoar:https://opendoar.ac.uk/repository/7160
2025-05-29T04:42:17.001015
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
false
dc.title.none.fl_str_mv Show and Guide
Instructional-Plan Grounded Vision and Language Model
title Show and Guide
spellingShingle Show and Guide
Glória-Silva, Diogo
Computational Theory and Mathematics
Computer Science Applications
Information Systems
Linguistics and Language
title_short Show and Guide
title_full Show and Guide
title_fullStr Show and Guide
title_full_unstemmed Show and Guide
title_sort Show and Guide
author Glória-Silva, Diogo
author_facet Glória-Silva, Diogo
Semedo, David
Magalhães, João
author_role author
author2 Semedo, David
Magalhães, João
author2_role author
author
dc.contributor.none.fl_str_mv NOVALincs
RUN
dc.contributor.author.fl_str_mv Glória-Silva, Diogo
Semedo, David
Magalhães, João
dc.subject.por.fl_str_mv Computational Theory and Mathematics
Computer Science Applications
Information Systems
Linguistics and Language
topic Computational Theory and Mathematics
Computer Science Applications
Information Systems
Linguistics and Language
description Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics.
publishDate 2024
dc.date.none.fl_str_mv 2024
2024-01-01T00:00:00Z
2025-03-21T21:28:13Z
dc.type.driver.fl_str_mv conference object
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/181117
url http://hdl.handle.net/10362/181117
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 9798891761643
PURE: 113515029
https://doi.org/10.18653/v1/2024.emnlp-main.1191
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv 19
application/pdf
dc.publisher.none.fl_str_mv Association for Computational Linguistics (ACL)
publisher.none.fl_str_mv Association for Computational Linguistics (ACL)
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833602125899235328