Show simple item record

Authordc.contributor.authorClaude, Francisco 
Authordc.contributor.authorFariña, Antonio 
Authordc.contributor.authorMartínez Prieto, Miguel 
Authordc.contributor.authorNavarro, Gonzalo 
Admission datedc.date.accessioned2016-12-14T18:42:54Z
Available datedc.date.available2016-12-14T18:42:54Z
Publication datedc.date.issued2016
Cita de ítemdc.identifier.citationInformation Systems 61(2016)1–23es_ES
Identifierdc.identifier.other10.1016/j.is.2016.04.002
Identifierdc.identifier.urihttps://repositorio.uchile.cl/handle/2250/141884
Abstractdc.description.abstractIndexing highly repetitive collections has become a relevant problem with the emergence of large repositories of versioned documents, among other applications. These Collections may reach huge sizes, but are formed mostly of documents that are near-copies of others. Traditional techniques for indexing these collections fail to properly exploit their regularities in order to reduce space. We introduce new techniques for compressing inverted indexes that exploit this near copy regularity. They are based on run-length, Lempel-Ziv, or grammar compression of the differential inverted lists, instead of the usual practice of gap-encoding them. We show that, in this highly repetitive setting, our compression methods significantly reduce the space obtained with classical techniques, at the price of moderate slowdowns. Moreover, our best methods are universal, that is, they do not need to know the versioning structure of the collection, nor that a clear versioning structure even exists. We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language texts) and represent the text collection plus the index structure (not an inverted index) in integrated form. We show that these techniques can compress much further, using a small fraction of the space required by our new inverted indexes. Yet they are orders of magnitude slower. (C) 2016 Elsevier Ltd. All rights reservedes_ES
Patrocinadordc.description.sponsorshipEuropean Union 690941 Fondecyt (Conicyt, Chile) 1-140796 MINECO (PGE) TIN2013-47090-C3-3-P TIN2015-69951-R TIN2013-46238-C4-3-R MINECO (FEDER) TIN2013-47090-C3-3-P TIN2015-69951-R TIN2013-46238-C4-3-R ICT COST Action IC1302 Xunta de Galicia (FEDER) (Spain) GRC2013/053es_ES
Lenguagedc.language.isoenes_ES
Publisherdc.publisherPergamon-Elsevieres_ES
Type of licensedc.rightsAttribution-NonCommercial-NoDerivs 3.0 Chile*
Link to Licensedc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/cl/*
Sourcedc.sourceInformation Systemses_ES
Keywordsdc.subjectRepetitive collectionses_ES
Keywordsdc.subjectInverted indexes_ES
Keywordsdc.subjectSelf-indexes_ES
Títulodc.titleUniversal indexes for highly repetitive document collectionses_ES
Document typedc.typeArtículo de revista
Catalogueruchile.catalogadorapces_ES
Indexationuchile.indexArtículo de publicación ISIes_ES


Files in this item

Icon

This item appears in the following Collection(s)

Show simple item record

Attribution-NonCommercial-NoDerivs 3.0 Chile
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile