Universal indexes for highly repetitive document collections
Author
dc.contributor.author
Claude, Francisco
Author
dc.contributor.author
Fariña, Antonio
Author
dc.contributor.author
Martínez Prieto, Miguel
Author
dc.contributor.author
Navarro, Gonzalo
Admission date
dc.date.accessioned
2016-12-14T18:42:54Z
Available date
dc.date.available
2016-12-14T18:42:54Z
Publication date
dc.date.issued
2016
Cita de ítem
dc.identifier.citation
Information Systems 61(2016)1–23
es_ES
Identifier
dc.identifier.other
10.1016/j.is.2016.04.002
Identifier
dc.identifier.uri
https://repositorio.uchile.cl/handle/2250/141884
Abstract
dc.description.abstract
Indexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These Collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space.
We introduce new techniques for compressing inverted indexes that exploit this near copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists.
We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet they are orders of magnitude slower. (C) 2016 Elsevier Ltd. All rights reserved
es_ES
Patrocinador
dc.description.sponsorship
European Union 690941
Fondecyt (Conicyt, Chile) 1-140796
MINECO (PGE) TIN2013-47090-C3-3-P TIN2015-69951-R TIN2013-46238-C4-3-R
MINECO (FEDER) TIN2013-47090-C3-3-P TIN2015-69951-R TIN2013-46238-C4-3-R
ICT COST Action IC1302
Xunta de Galicia (FEDER) (Spain) GRC2013/053