Show and Guide
| Main Author: | Glória-Silva, Diogo |
|---|---|
| Publication Date: | 2024 |
| Other Authors: | Semedo, David; Magalhães, João |
| Language: | eng |
| Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Download full: | http://hdl.handle.net/10362/181117 |
| Summary: | Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics. |
| id | RCAP_a34989dba687c1a76ca0fca547a7d7c3 |
|---|---|
| oai_identifier_str | oai:run.unl.pt:10362/181117 |
| network_acronym_str | RCAP |
| network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str | https://opendoar.ac.uk/repository/7160 |
| spelling | Show and Guide: Instructional-Plan Grounded Vision and Language Model; Computational Theory and Mathematics; Computer Science Applications; Information Systems; Linguistics and Language; Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics.; Guiding users through complex procedural plans is an inherently multimodal task in which having visually illustrated plan steps is crucial to deliver an effective plan guidance. However, existing works on plan-following language models (LMs) often are not capable of multimodal input and output. In this work, we present MM-PlanLLM, the first multimodal LLM designed to assist users in executing instructional tasks by leveraging both textual plans and visual information. Specifically, we bring cross-modality through two key tasks: Conversational Video Moment Retrieval, where the model retrieves relevant step-video segments based on user queries, and Visually-Informed Step Generation, where the model generates the next step in a plan, conditioned on an image of the user's current progress. MM-PlanLLM is trained using a novel multitask-multistage approach, designed to gradually expose the model to multimodal instructional-plans semantic layers, achieving strong performance on both multimodal and textual dialogue in a plan-grounded setting. Furthermore, we show that the model delivers cross-modal temporal and plan-structure representations aligned between textual plan steps and instructional video moments.; Association for Computational Linguistics (ACL); NOVALincs; RUN; Glória-Silva, Diogo; Semedo, David; Magalhães, João; 2025-03-21T21:28:13Z; 2024; 2024-01-01T00:00:00Z; conference object; info:eu-repo/semantics/publishedVersion; 19; application/pdf; http://hdl.handle.net/10362/181117; eng; 9798891761643; PURE: 113515029; https://doi.org/10.18653/v1/2024.emnlp-main.1191; info:eu-repo/semantics/openAccess; reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP; 2025-03-31T02:03:45Z; oai:run.unl.pt:10362/181117; Portal Agregador; ONG; https://www.rcaap.pt/oai/openaire; info@rcaap.pt; opendoar:https://opendoar.ac.uk/repository/7160; 2025-05-29T04:42:17.001015; Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; false |
| dc.title.none.fl_str_mv | Show and Guide: Instructional-Plan Grounded Vision and Language Model |
| title | Show and Guide |
| spellingShingle | Show and Guide; Glória-Silva, Diogo; Computational Theory and Mathematics; Computer Science Applications; Information Systems; Linguistics and Language |
| title_short | Show and Guide |
| title_full | Show and Guide |
| title_fullStr | Show and Guide |
| title_full_unstemmed | Show and Guide |
| title_sort | Show and Guide |
| author | Glória-Silva, Diogo |
| author_facet | Glória-Silva, Diogo; Semedo, David; Magalhães, João |
| author_role | author |
| author2 | Semedo, David; Magalhães, João |
| author2_role | author; author |
| dc.contributor.none.fl_str_mv | NOVALincs; RUN |
| dc.contributor.author.fl_str_mv | Glória-Silva, Diogo; Semedo, David; Magalhães, João |
| dc.subject.por.fl_str_mv | Computational Theory and Mathematics; Computer Science Applications; Information Systems; Linguistics and Language |
| topic | Computational Theory and Mathematics; Computer Science Applications; Information Systems; Linguistics and Language |
| description | Funding Information: This work was supported by the FCT Ph.D. scholarship grant Ref. PRT/BD/152810/2021 awarded by CMU Portugal Affiliated Ph.D. program, and by the FCT project NOVA LINCS Ref. (UIDB/04516/2020). Data collection was possible under the Alexa Prize Taskbot Challenge organized by Amazon Science. Publisher Copyright: © 2024 Association for Computational Linguistics. |
| publishDate | 2024 |
| dc.date.none.fl_str_mv | 2024; 2024-01-01T00:00:00Z; 2025-03-21T21:28:13Z |
| dc.type.driver.fl_str_mv | conference object |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10362/181117 |
| url | http://hdl.handle.net/10362/181117 |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | 9798891761643; PURE: 113515029; https://doi.org/10.18653/v1/2024.emnlp-main.1191 |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | 19; application/pdf |
| dc.publisher.none.fl_str_mv | Association for Computational Linguistics (ACL) |
| publisher.none.fl_str_mv | Association for Computational Linguistics (ACL) |
| dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
| instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str | RCAAP |
| institution | RCAAP |
| reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv | info@rcaap.pt |
| _version_ | 1833602125899235328 |