Large language models overcome the challenges of unstructured text data in ecology
| Main author: | Castro, Andry |
|---|---|
| Publication date: | 2024 |
| Other authors: | Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César |
| Document type: | Article |
| Language: | eng |
| Source title: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| Full text: | http://hdl.handle.net/10362/172961 |
| Abstract: | Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded through FCT for funds to GHTM (UID/04413/2020). LR was funded through the FCT contract 'CEECIND/00445/2017' under the 'Stimulus of Scientific Employment – Individual Support' and by FCT 'UNRAVEL' project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors |
| id | RCAP_573b79df1d6dcb63ba1baeb6704c3a1f |
|---|---|
| oai_identifier_str | oai:run.unl.pt:10362/172961 |
| network_acronym_str | RCAP |
| network_name_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository_id_str | https://opendoar.ac.uk/repository/7160 |
| spelling | Large language models overcome the challenges of unstructured text data in ecology. Keywords: AI; Automation; Data integration; GPT; LLaMA; Unstructured data; Ecology, Evolution, Behavior and Systematics; Ecology; Modelling and Simulation; Ecological Modelling; Computer Science Applications; Computational Theory and Mathematics; Applied Mathematics. Abstract: The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data. |
| dc.title.none.fl_str_mv | Large language models overcome the challenges of unstructured text data in ecology |
| title | Large language models overcome the challenges of unstructured text data in ecology |
| spellingShingle | Large language models overcome the challenges of unstructured text data in ecology; Castro, Andry; AI; Automation; Data integration; GPT; LLaMA; Unstructured data; Ecology, Evolution, Behavior and Systematics; Ecology; Modelling and Simulation; Ecological Modelling; Computer Science Applications; Computational Theory and Mathematics; Applied Mathematics |
| title_short | Large language models overcome the challenges of unstructured text data in ecology |
| title_full | Large language models overcome the challenges of unstructured text data in ecology |
| title_fullStr | Large language models overcome the challenges of unstructured text data in ecology |
| title_full_unstemmed | Large language models overcome the challenges of unstructured text data in ecology |
| title_sort | Large language models overcome the challenges of unstructured text data in ecology |
| author | Castro, Andry |
| author_facet | Castro, Andry; Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César |
| author_role | author |
| author2 | Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César |
| author2_role | author; author; author; author |
| dc.contributor.none.fl_str_mv | Vector borne diseases and pathogens (VBD); Instituto de Higiene e Medicina Tropical (IHMT); Global Health and Tropical Medicine (GHTM); RUN |
| dc.contributor.author.fl_str_mv | Castro, Andry; Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César |
| dc.subject.por.fl_str_mv | AI; Automation; Data integration; GPT; LLaMA; Unstructured data; Ecology, Evolution, Behavior and Systematics; Ecology; Modelling and Simulation; Ecological Modelling; Computer Science Applications; Computational Theory and Mathematics; Applied Mathematics |
| topic | AI; Automation; Data integration; GPT; LLaMA; Unstructured data; Ecology, Evolution, Behavior and Systematics; Ecology; Modelling and Simulation; Ecological Modelling; Computer Science Applications; Computational Theory and Mathematics; Applied Mathematics |
| description | Funding Information: AC was supported by a grant (PRT/BD/152100/2021) financed by the Portuguese Foundation for Science and Technology (FCT) under the MIT Portugal Program. AC and CC acknowledge support from FCT through support to CEG/IGOT Research Unit (UIDB/00295/2020 and UIDP/00295/2020). JP was funded through FCT for funds to GHTM (UID/04413/2020). LR was funded through the FCT contract 'CEECIND/00445/2017' under the 'Stimulus of Scientific Employment – Individual Support' and by FCT 'UNRAVEL' project (PTDC/BIA-ECO/0207/2020; https://doi.org/10.54499/PTDC/BIA-ECO/0207/2020). PP acknowledges support from the Czech Science Foundation (project no. 23-07278S). Publisher Copyright: © 2024 The Authors |
| publishDate | 2024 |
| dc.date.none.fl_str_mv | 2024-10-03T22:26:22Z; 2024-09; 2024-09-01T00:00:00Z |
| dc.type.status.fl_str_mv | info:eu-repo/semantics/publishedVersion |
| dc.type.driver.fl_str_mv | info:eu-repo/semantics/article |
| format | article |
| status_str | publishedVersion |
| dc.identifier.uri.fl_str_mv | http://hdl.handle.net/10362/172961 |
| url | http://hdl.handle.net/10362/172961 |
| dc.language.iso.fl_str_mv | eng |
| language | eng |
| dc.relation.none.fl_str_mv | 1574-9541; PURE: 100591487; https://doi.org/10.1016/j.ecoinf.2024.102742 |
| dc.rights.driver.fl_str_mv | info:eu-repo/semantics/openAccess |
| eu_rights_str_mv | openAccess |
| dc.format.none.fl_str_mv | application/pdf |
| dc.source.none.fl_str_mv | reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP); instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia; instacron:RCAAP |
| instname_str | FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| instacron_str | RCAAP |
| institution | RCAAP |
| reponame_str | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| collection | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
| repository.name.fl_str_mv | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
| repository.mail.fl_str_mv | info@rcaap.pt |
| _version_ | 1833597757354409984 |