Large language models overcome the challenges of unstructured text data in ecology

Bibliographic details
Main author: Castro, Andry
Publication date: 2024
Other authors: Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César
Document type: Article
Language: eng
Source title: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Full text: http://hdl.handle.net/10362/172961
Abstract: Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded by FCT through funds to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment - Individual Support’ and by FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors
id RCAP_573b79df1d6dcb63ba1baeb6704c3a1f
oai_identifier_str oai:run.unl.pt:10362/172961
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
spelling Large language models overcome the challenges of unstructured text data in ecology
The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format.
GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
dc.title.none.fl_str_mv Large language models overcome the challenges of unstructured text data in ecology
title Large language models overcome the challenges of unstructured text data in ecology
spellingShingle Large language models overcome the challenges of unstructured text data in ecology
Castro, Andry
AI
Automation
Data integration
GPT
LLaMA
Unstructured data
Ecology, Evolution, Behavior and Systematics
Ecology
Modelling and Simulation
Ecological Modelling
Computer Science Applications
Computational Theory and Mathematics
Applied Mathematics
author Castro, Andry
author_facet Castro, Andry
Pinto, João
Reino, Luís
Pipek, Pavel
Capinha, César
author_role author
author2 Pinto, João
Reino, Luís
Pipek, Pavel
Capinha, César
author2_role author
author
author
author
dc.contributor.none.fl_str_mv Vector borne diseases and pathogens (VBD)
Instituto de Higiene e Medicina Tropical (IHMT)
Global Health and Tropical Medicine (GHTM)
RUN
dc.contributor.author.fl_str_mv Castro, Andry
Pinto, João
Reino, Luís
Pipek, Pavel
Capinha, César
dc.subject.por.fl_str_mv AI
Automation
Data integration
GPT
LLaMA
Unstructured data
Ecology, Evolution, Behavior and Systematics
Ecology
Modelling and Simulation
Ecological Modelling
Computer Science Applications
Computational Theory and Mathematics
Applied Mathematics
description Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded by FCT through funds to GHTM (UID/04413/2020). LR was funded through the FCT contract ‘CEECIND/00445/2017’ under the ‘Stimulus of Scientific Employment - Individual Support’ and by FCT ‘UNRAVEL’ project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors
publishDate 2024
dc.date.none.fl_str_mv 2024-10-03T22:26:22Z
2024-09
2024-09-01T00:00:00Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10362/172961
url http://hdl.handle.net/10362/172961
dc.language.iso.fl_str_mv eng
language eng
dc.relation.none.fl_str_mv 1574-9541
PURE: 100591487
https://doi.org/10.1016/j.ecoinf.2024.102742
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833597757354409984