Using a provenance model and spatiotemporal information to integrate heterogeneous biodiversity semantic data.

Bibliographic details
Year of defense: 2017
Main author: Amanqui, Flor Karina Mamani
Advisor: Not provided by the institution
Examination committee: Not provided by the institution
Document type: Thesis
Access type: Open access
Language: eng
Defense institution: Biblioteca Digitais de Teses e Dissertações da USP
Graduate program: Not provided by the institution
Department: Not provided by the institution
Country: Not provided by the institution
Keywords in Portuguese:
Access link: http://www.teses.usp.br/teses/disponiveis/55/55134/tde-30012018-093704/
Abstract: In the last few years, the Web of Data has been rapidly populated with biodiversity data. However, when researchers need to retrieve, integrate, and visualize these data, they must rely on semi-manual approaches. This is because biodiversity repositories, such as GBIF, offer data only as plain strings in CSV spreadsheets. There is no machine-readable metadata that could add meaning (semantics) to the data. Without this metadata, automatic solutions are impossible, and labor-intensive semi-manual approaches to data integration and visualization are unavoidable. To mitigate this problem, we present a novel architecture, called STBioData, that automatically links spatiotemporal biodiversity data from heterogeneous data sources to enable easier searching, visualization, and downloading of relevant data. It supports the generation of interactive maps and the mapping between biodiversity data and the ontologies describing them (such as Darwin Core, DBpedia, GeoSPARQL, Time, and PROV-O). We also propose a new biodiversity provenance model (BioProv), which extends the W3C PROV Data Model. BioProv enables applications that deal with biodiversity data to incorporate provenance data in their information. A web-based prototype built on this architecture was implemented. It supports biodiversity domain experts in tasks such as identifying a species' conservation status by automating most of the necessary steps. It uses collection data from important Brazilian biodiversity research institutions, together with species geographic distributions and conservation status from the IUCN Red List of Threatened Species. These data are converted to linked data, enriched, and saved as RDF triples. Users can access the system through a web interface and search for collection and species distribution records based on species names, time ranges, and geographic location. After a data set is retrieved, it can be displayed in an interactive map.
The record contents are also shown (including provenance data), together with links to the original records at GBIF and IUCN. Users can export data sets as a CSV or RDF file, or get a printout in PDF (including the visualizations). By choosing different time ranges, users can, for instance, follow the evolution of a species' distribution. The STBioData prototype was tested with use cases. For the tests, 46,211 collection records from SpeciesLink and 38,589 conservation status records (including maps) from IUCN, all for marine mammals, were converted to 2,233,782 RDF triples and linked using well-known ontologies. Of the biodiversity experts who used the tool to determine conservation status, 90% were able to find information about dolphin species with a satisfactory retrieval time and were able to understand the interactive map. In an information retrieval experiment, compared with the SpeciesLink keyword-based search, the prototype's semantic-based search performed, on average, 24% better in precision tests and 22% better in recall tests. These figures do not take into account cases where only the prototype returned search results. These results demonstrate the value of having publicly available linked biodiversity data with semantics.
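
The conversion and search steps described in the abstract can be illustrated with a minimal Python sketch. It is not the STBioData implementation: the sample records, the example.org URIs, and the function names record_to_triples, load_triples, and search are hypothetical. It assumes CSV columns named after Darwin Core terms, represents each triple as a plain (subject, predicate, object) tuple instead of using an RDF store, and records the source of each record with the PROV-O property wasDerivedFrom rather than with BioProv, whose terms are not given here.

```python
import csv
import io
from datetime import date

# Real namespace prefixes for Darwin Core terms and PROV-O.
DWC = "http://rs.tdwg.org/dwc/terms/"
PROV = "http://www.w3.org/ns/prov#"

# Hypothetical sample records (not data from the thesis).
SAMPLE_CSV = """occurrenceID,scientificName,eventDate,decimalLatitude,decimalLongitude,source
occ-1,Sotalia guianensis,2005-03-14,-23.95,-46.30,http://example.org/source/a
occ-2,Sotalia guianensis,2012-08-02,-3.10,-38.50,http://example.org/source/a
occ-3,Tursiops truncatus,2010-01-20,-27.60,-48.40,http://example.org/source/b
"""


def record_to_triples(row, base="http://example.org/occurrence/"):
    """Turn one CSV row into (subject, predicate, object) tuples."""
    subject = base + row["occurrenceID"]
    return [
        (subject, DWC + "scientificName", row["scientificName"]),
        (subject, DWC + "eventDate", row["eventDate"]),
        (subject, DWC + "decimalLatitude", row["decimalLatitude"]),
        (subject, DWC + "decimalLongitude", row["decimalLongitude"]),
        # Provenance: link the record back to the source it came from.
        (subject, PROV + "wasDerivedFrom", row["source"]),
    ]


def load_triples(csv_text):
    """Convert every row of a CSV document into triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        triples.extend(record_to_triples(row))
    return triples


def search(triples, name, start, end, bbox):
    """Return subjects matching a species name, a date range, and a
    bounding box given as (min_lat, min_lon, max_lat, max_lon)."""
    # Group the triples by subject so each record can be tested as a whole.
    records = {}
    for subject, predicate, obj in triples:
        records.setdefault(subject, {})[predicate] = obj

    min_lat, min_lon, max_lat, max_lon = bbox
    hits = []
    for subject, props in records.items():
        when = date.fromisoformat(props[DWC + "eventDate"])
        lat = float(props[DWC + "decimalLatitude"])
        lon = float(props[DWC + "decimalLongitude"])
        if (
            props[DWC + "scientificName"] == name
            and start <= when <= end
            and min_lat <= lat <= max_lat
            and min_lon <= lon <= max_lon
        ):
            hits.append(subject)
    return sorted(hits)
```

Calling search(load_triples(SAMPLE_CSV), "Sotalia guianensis", date(2000, 1, 1), date(2010, 12, 31), (-30.0, -50.0, -20.0, -40.0)) selects only the first sample record. A real deployment would more likely store the triples in an RDF store and express the same filter as a SPARQL or GeoSPARQL query.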