An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing

Bibliographic Details
Main Author: Carnaz, Gonçalo
Publication Date: 2021
Other Authors: Antunes, Mário, Nogueira, Vitor Beires
Format: Article
Language: por
Source: Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
Download full: http://hdl.handle.net/10174/34695
Carnaz, G.; Antunes, M.; Nogueira, V.B. An Annotated Corpus of Crime-Related Portuguese Documents for NLP and Machine Learning Processing. Data 2021, 6, 71. https://doi.org/10.3390/data6070071
https://doi.org/10.3390/data6070071
Summary: Criminal investigations collect and analyze the facts related to a crime, from which investigators can deduce evidence to be used in court. Criminal investigation is a multidisciplinary applied science that includes interviews, interrogations, evidence collection, preservation of the chain of custody, and other investigative methods and techniques. These techniques produce both digital and paper documents that must be carefully analyzed to identify correlations and interactions among suspects, places, license plates, and other entities mentioned in the investigation. Computerized processing of these documents supports the criminal investigation because it allows the automatic identification of entities and their relations, some of which are difficult to identify manually. A wide set of dedicated tools exists, but they share a major limitation: they cannot process criminal reports in Portuguese, because no annotated corpus exists for that purpose. This paper presents an annotated corpus composed of a collection of anonymized crime-related documents extracted from official and open sources. The dataset is the result of an exploratory initiative to collect crime-related data from websites and restricted-access police reports. In an evaluation of the dataset, classification of the annotated named entities in the crime-related documents obtained a mean precision of 0.808, recall of 0.722, and F1-score of 0.733. The corpus can be used to benchmark Machine Learning (ML) and Natural Language Processing (NLP) methods and tools that detect and correlate entities in the documents, for example sentence detection, named-entity recognition, and identification of terms related to the criminal domain.
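The summary reports mean precision, recall, and F1-score but does not say how the means were taken. The reported F1 of 0.733 is not the harmonic mean of 0.808 and 0.722 (that would be about 0.763), which would be consistent with averaging per-entity-type scores, although this record does not confirm it. The sketch below illustrates that kind of macro-averaging, assuming per-class counts of true positives, false positives, and false negatives. The entity labels and all counts are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    """Per-class confusion counts for one entity type (hypothetical values)."""
    tp: int  # entities correctly labelled with this type
    fp: int  # spans wrongly labelled with this type
    fn: int  # entities of this type that were missed


def prf(c: Counts) -> tuple[float, float, float]:
    """Precision, recall, and F1 for a single entity type."""
    precision = c.tp / (c.tp + c.fp) if (c.tp + c.fp) else 0.0
    recall = c.tp / (c.tp + c.fn) if (c.tp + c.fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def macro_average(per_class: dict[str, Counts]) -> tuple[float, float, float]:
    """Unweighted mean of the per-class precision, recall, and F1 scores."""
    scores = [prf(c) for c in per_class.values()]
    n = len(scores)
    return (
        sum(s[0] for s in scores) / n,
        sum(s[1] for s in scores) / n,
        sum(s[2] for s in scores) / n,
    )


# Hypothetical entity types and counts, for illustration only.
per_class = {
    "PERSON": Counts(tp=90, fp=10, fn=10),
    "LOCATION": Counts(tp=40, fp=10, fn=40),
    "LICENSE_PLATE": Counts(tp=10, fp=5, fn=10),
}

macro_p, macro_r, macro_f1 = macro_average(per_class)

# Harmonic mean of the averaged precision and recall, for comparison.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)

print(f"macro precision = {macro_p:.3f}")
print(f"macro recall    = {macro_r:.3f}")
print(f"macro F1        = {macro_f1:.3f}")
print(f"harmonic mean of macro P and R = {harmonic:.3f}")
```

With these made-up counts the mean of the per-class F1 scores differs from the harmonic mean of the averaged precision and recall, which is why a set of reported means need not satisfy the usual F1 formula.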
id RCAP_d4686fc79746ffd1ee567800e3db2109
oai_identifier_str oai:dspace.uevora.pt:10174/34695
network_acronym_str RCAP
network_name_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository_id_str https://opendoar.ac.uk/repository/7160
author Carnaz, Gonçalo
author_facet Carnaz, Gonçalo
Antunes, Mário
Nogueira, Vitor Beires
author_role author
author2 Antunes, Mário
Nogueira, Vitor Beires
author2_role author
author
dc.contributor.author.fl_str_mv Carnaz, Gonçalo
Antunes, Mário
Nogueira, Vitor Beires
dc.subject.por.fl_str_mv crime-related documents
cybersecurity
criminal investigation
Portuguese language corpus
publishDate 2021
dc.date.none.fl_str_mv 2021-06-26T00:00:00Z
2023-02-24T12:58:23Z
2023-02-24
dc.type.status.fl_str_mv info:eu-repo/semantics/publishedVersion
dc.type.driver.fl_str_mv info:eu-repo/semantics/article
format article
status_str publishedVersion
dc.identifier.uri.fl_str_mv http://hdl.handle.net/10174/34695
https://doi.org/10.3390/data6070071
dc.language.iso.fl_str_mv por
language por
dc.relation.none.fl_str_mv d34707@alunos.uevora.pt
mario.antunes@ipleiria.pt
vbn@uevora.pt
498
dc.rights.driver.fl_str_mv info:eu-repo/semantics/openAccess
eu_rights_str_mv openAccess
dc.source.none.fl_str_mv reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron:RCAAP
instname_str FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
instacron_str RCAAP
institution RCAAP
reponame_str Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
collection Repositórios Científicos de Acesso Aberto de Portugal (RCAAP)
repository.name.fl_str_mv Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia
repository.mail.fl_str_mv info@rcaap.pt
_version_ 1833592875960500224