Term frequency dynamics in collaborative articles
Main Author: | Sérgio Nunes |
---|---|
Publication Date: | 2010 |
Other Authors: | Cristina Ribeiro, Gabriel David |
Format: | Book |
Language: | eng |
Source: | Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
Download full: | https://hdl.handle.net/10216/70210 |
Summary: | Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques focus primarily on the latest version of a document, generally ignoring its evolution over time. In this work, we study term frequency dynamics in web documents over their lifespan. We use Wikipedia as a document collection because it is a broad, public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation, we find that term frequency in encyclopedic documents (i.e., documents that are comprehensive and focused on a single topic) exhibits a rapid and steady progression towards the document's current version. The content of early versions quickly becomes very similar to the present version of the document. |
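
The comparison described in the summary, measuring how similar each revision's term frequencies are to those of the current version, can be illustrated with a minimal sketch. The record does not state which similarity measure or tokenization the authors used, so the cosine similarity, the simple letter-based tokenizer, and the function names below are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def term_frequencies(text):
    """Count terms after lowercasing (assumed tokenizer: runs of a-z letters)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(tf_a, tf_b):
    """Cosine similarity between two term-frequency vectors (assumed measure)."""
    dot = sum(count * tf_b[term] for term, count in tf_a.items())
    norm_a = math.sqrt(sum(c * c for c in tf_a.values()))
    norm_b = math.sqrt(sum(c * c for c in tf_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty revision shares no terms with anything
    return dot / (norm_a * norm_b)


def similarity_to_current(revisions):
    """Similarity of every revision to the last (current) one, in revision order."""
    current = term_frequencies(revisions[-1])
    return [cosine_similarity(term_frequencies(r), current) for r in revisions]


# Hypothetical revision texts, oldest first.
revisions = ["cats", "cats purr", "cats purr loudly"]
for order, score in enumerate(similarity_to_current(revisions), start=1):
    print(f"revision {order}: {score:.3f}")
```

Indexing the scores by revision order, as here, corresponds to one of the two projection variables named in the summary; pairing each score with the revision's timestamp instead would give the projection over revision date.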
id |
RCAP_48b09ef1934e2d9f73c19dd0c6aa1606 |
---|---|
oai_identifier_str |
oai:repositorio-aberto.up.pt:10216/70210 |
network_acronym_str |
RCAP |
network_name_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository_id_str |
https://opendoar.ac.uk/repository/7160 |
dc.title.none.fl_str_mv |
Term frequency dynamics in collaborative articles |
title |
Term frequency dynamics in collaborative articles |
spellingShingle |
Term frequency dynamics in collaborative articles Sérgio Nunes Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
title_short |
Term frequency dynamics in collaborative articles |
title_full |
Term frequency dynamics in collaborative articles |
title_fullStr |
Term frequency dynamics in collaborative articles |
title_full_unstemmed |
Term frequency dynamics in collaborative articles |
title_sort |
Term frequency dynamics in collaborative articles |
author |
Sérgio Nunes |
author_facet |
Sérgio Nunes Cristina Ribeiro Gabriel David |
author_role |
author |
author2 |
Cristina Ribeiro Gabriel David |
author2_role |
author author |
dc.contributor.author.fl_str_mv |
Sérgio Nunes Cristina Ribeiro Gabriel David |
dc.subject.por.fl_str_mv |
Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
topic |
Engenharia electrotécnica, electrónica e informática Electrical engineering, Electronic engineering, Information engineering |
description |
Documents on the World Wide Web are dynamic entities. Mainstream information retrieval systems and techniques focus primarily on the latest version of a document, generally ignoring its evolution over time. In this work, we study term frequency dynamics in web documents over their lifespan. We use Wikipedia as a document collection because it is a broad, public resource and, more importantly, because it provides access to the complete revision history of each document. We investigate the progression of similarity values over two projection variables, namely revision order and revision date. Based on this investigation, we find that term frequency in encyclopedic documents (i.e., documents that are comprehensive and focused on a single topic) exhibits a rapid and steady progression towards the document's current version. The content of early versions quickly becomes very similar to the present version of the document. |
publishDate |
2010 |
dc.date.none.fl_str_mv |
2010 2010-01-01T00:00:00Z |
dc.type.status.fl_str_mv |
info:eu-repo/semantics/publishedVersion |
dc.type.driver.fl_str_mv |
info:eu-repo/semantics/book |
format |
book |
status_str |
publishedVersion |
dc.identifier.uri.fl_str_mv |
https://hdl.handle.net/10216/70210 |
url |
https://hdl.handle.net/10216/70210 |
dc.language.iso.fl_str_mv |
eng |
language |
eng |
dc.relation.none.fl_str_mv |
10.1145/1860559.1860620 |
dc.rights.driver.fl_str_mv |
info:eu-repo/semantics/openAccess |
eu_rights_str_mv |
openAccess |
dc.format.none.fl_str_mv |
application/pdf |
dc.source.none.fl_str_mv |
reponame:Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) instname:FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia instacron:RCAAP |
instname_str |
FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
instacron_str |
RCAAP |
institution |
RCAAP |
reponame_str |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
collection |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) |
repository.name.fl_str_mv |
Repositórios Científicos de Acesso Aberto de Portugal (RCAAP) - FCCN, serviços digitais da FCT – Fundação para a Ciência e a Tecnologia |
repository.mail.fl_str_mv |
info@rcaap.pt |
_version_ |
1833599858493095936 |