Large language models overcome the challenges of unstructured text data in ecology

Bibliographic Details
Main Author: Castro, Andry
Publication Date: 2024
Other Authors: Pinto, João; Reino, Luís; Pipek, Pavel; Capinha, César
Format: Article
Language: eng
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10400.5/96850
Summary: The vast volume of currently available unstructured text data, such as research papers, news, and technical report data, shows great potential for ecological research. However, manual processing of such data is labour-intensive, posing a significant challenge. In this study, we aimed to assess the application of three state-of-the-art prompt-based large language models (LLMs), GPT-3.5, GPT-4, and LLaMA-2-70B, to automate the identification, interpretation, extraction, and structuring of relevant ecological information from unstructured textual sources. We focused on species distribution data from two sources: news outlets and research papers. We assessed the LLMs for four key tasks: classification of documents with species distribution data, identification of regions where species are recorded, generation of geographical coordinates for these regions, and supply of results in a structured format. GPT-4 consistently outperformed the other models, demonstrating a high capacity to interpret textual data and extract relevant information, with the percentage of correct outputs often exceeding 90% (average accuracy across tasks: 87–100%). Its performance also depended on the data source type and task, with better results achieved with news reports, in the identification of regions with species reports and presentation of structured output. Its predecessor, GPT-3.5, exhibited slightly lower accuracy across all tasks and data sources (average accuracy across tasks: 81–97%), whereas LLaMA-2-70B showed the worst performance (37–73%). These results demonstrate the potential benefit of integrating prompt-based LLMs into ecological data assimilation workflows as essential tools to efficiently process large volumes of textual data.
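The four tasks named in the summary (document classification, region identification, coordinate generation, and structured output) can be illustrated with a minimal Python sketch. It is not the authors' code: this record does not give the study's prompts or output schema, so the prompt wording, the JSON field names, and the helpers `build_prompt`, `parse_reply`, and `accuracy` are all assumptions made for illustration. No model is called; the sketch only shows how a reply could be requested, validated, and scored.

```python
import json
from dataclasses import dataclass

# Hypothetical prompt wording and JSON schema (assumptions, not taken from the paper).
PROMPT_TEMPLATE = (
    "Decide whether the text below reports where a species occurs.\n"
    'Reply with JSON only, in the form {{"has_distribution_data": <true|false>, '
    '"records": [{{"species": "...", "region": "...", '
    '"latitude": <number>, "longitude": <number>}}]}}.\n\n'
    "Text:\n{text}"
)


@dataclass(frozen=True)
class Occurrence:
    species: str
    region: str
    latitude: float
    longitude: float


def build_prompt(text: str) -> str:
    """Fill the (hypothetical) prompt template with one document."""
    return PROMPT_TEMPLATE.format(text=text.strip())


def parse_reply(reply: str) -> list[Occurrence]:
    """Convert a model reply into validated records; return [] if it is unusable."""
    try:
        payload = json.loads(reply)
    except json.JSONDecodeError:
        return []
    if not isinstance(payload, dict) or not payload.get("has_distribution_data"):
        return []
    raw_records = payload.get("records")
    if not isinstance(raw_records, list):
        return []
    records = []
    for item in raw_records:
        try:
            occ = Occurrence(
                species=str(item["species"]),
                region=str(item["region"]),
                latitude=float(item["latitude"]),
                longitude=float(item["longitude"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # skip malformed entries instead of failing the whole reply
        # Keep only coordinates that fall inside the valid geographic range.
        if -90 <= occ.latitude <= 90 and -180 <= occ.longitude <= 180:
            records.append(occ)
    return records


def accuracy(predicted: list[bool], expected: list[bool]) -> float:
    """Percentage of outputs that match the manually checked answers."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)


if __name__ == "__main__":
    # Made-up reply for illustration only; it is not data from the study.
    sample = (
        '{"has_distribution_data": true, "records": [{"species": "Vespa velutina", '
        '"region": "Lisbon", "latitude": 38.72, "longitude": -9.14}]}'
    )
    print(parse_reply(sample))
```

Validating the reply before use reflects the summary's point that structured output is itself one of the assessed tasks: a reply that is not valid JSON, or that carries out-of-range coordinates, is treated as unusable rather than passed downstream.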
id RCAP_6d43c4cc93c0c2f1e76711593fcff70f
oai_identifier_str oai:repositorio.ulisboa.pt:10400.5/96850
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
dc.contributor.none.fl_str_mv Repositório da Universidade de Lisboa
dc.contributor.author.fl_str_mv Castro, Andry
Pinto, João
Reino, Luís
Pipek, Pavel
Capinha, César
dc.subject.por.fl_str_mv AI
Automation
Data integration
GPT
LLaMA
Unstructured data
publishDate 2024
dc.date.none.fl_str_mv 2024
2024-01-01T00:00:00Z
2025-01-06T12:18:13Z
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10400.5/96850
dc.language.iso.fl_str_mv eng
dc.relation.none.fl_str_mv Castro, A., Pinto, J., Reino, L., Pipek, P., & Capinha, C. (2024). Large language models overcome the challenges of unstructured text data in ecology. Ecological Informatics, 82, 102742. https://doi.org/10.1016/j.ecoinf.2024.102742
1574-9541
10.1016/j.ecoinf.2024.102742
1878-0512
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
dc.format.none.fl_str_mv application/pdf
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt