Hybrid compression of inverted lists for reordered document collections
Author
dc.contributor.author
Arroyuelo, Diego
Author
dc.contributor.author
Oyarzún, Mauricio
Author
dc.contributor.author
González, Senén
Author
dc.contributor.author
Sepúlveda, Víctor
Admission date
dc.date.accessioned
2019-05-31T15:20:02Z
Available date
dc.date.available
2019-05-31T15:20:02Z
Publication date
dc.date.issued
2018
Item citation
dc.identifier.citation
Information Processing and Management, Volumen 54, Issue 6, 2018, Pages 1308-1324
Identifier
dc.identifier.issn
03064573
Identifier
dc.identifier.other
10.1016/j.ipm.2018.05.007
Identifier
dc.identifier.uri
https://repositorio.uchile.cl/handle/2250/169432
Abstract
dc.description.abstract
Text search engines are a fundamental tool today. Their efficiency relies on a simple and popular data structure: the inverted index. An inverted index stores one inverted list per term of the vocabulary; the inverted list of a given term stores, among other things, the document identifiers (docIDs) of the documents that contain the term. Inverted indexes can be stored efficiently using integer compression schemes. Previous research has also studied how an optimized document ordering can be used to assign docIDs to the document database, yielding important improvements in index compression and query processing time. In this paper we show that using a hybrid compression approach on the inverted lists is more effective in this scenario, with two main contributions:
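As background for the contributions below, the standard scheme the abstract refers to stores each inverted list as gaps (deltas) between consecutive docIDs; after a good document reordering, docIDs cluster and the gaps become small, so they compress well with integer codes. A minimal sketch of gap encoding (illustrative only, not the paper's implementation; function names are our own):

```python
def gap_encode(doc_ids):
    """Encode a strictly increasing docID list as gaps (deltas)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)  # small after a good reordering
        prev = d
    return gaps

def gap_decode(gaps):
    """Recover the original docIDs by prefix-summing the gaps."""
    doc_ids, total = [], 0
    for g in gaps:
        total += g
        doc_ids.append(total)
    return doc_ids

# Clustered docIDs produce mostly small gaps, which a variable-length
# integer code can then store in few bits each.
print(gap_encode([3, 4, 5, 9, 10]))  # → [3, 1, 1, 4, 1]
```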
•
First, we introduce a document reordering approach that aims at generating runs of consecutive docIDs in a properly-selected subset of inverted lists of the index.
•
Second, we introduce hybrid compression approaches that combine gap and run-length encodings within inverted lists, in order to take advantage not only of small gaps, but also of long runs of consecutive docIDs generated by our document reordering approach.
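The hybrid idea described above can be sketched as follows. This is an illustrative simplification, not the paper's actual encoding: each list entry becomes a (gap, run-length) pair, where the gap is the distance from the previous docID and the run length counts how many further consecutive docIDs follow it. Long runs collapse to a single pair, and a decoder may even keep them implicit (as intervals) instead of expanding them:

```python
def hybrid_encode(doc_ids):
    """Encode an increasing docID list as (gap, run_length) pairs.

    gap        : distance from the previous docID to the run's start
    run_length : number of additional consecutive docIDs in the run
    """
    pairs, prev, i, n = [], 0, 0, len(doc_ids)
    while i < n:
        gap, run = doc_ids[i] - prev, 0
        # Absorb the maximal run of consecutive docIDs into one pair.
        while i + 1 < n and doc_ids[i + 1] == doc_ids[i] + 1:
            run += 1
            i += 1
        pairs.append((gap, run))
        prev = doc_ids[i]
        i += 1
    return pairs

def hybrid_decode(pairs):
    """Expand (gap, run_length) pairs back to explicit docIDs."""
    doc_ids, prev = [], 0
    for gap, run in pairs:
        start = prev + gap
        doc_ids.extend(range(start, start + run + 1))
        prev = start + run
    return doc_ids

# Runs of consecutive docIDs become single pairs:
print(hybrid_encode([1, 2, 3, 7, 10, 11]))  # → [(1, 2), (4, 0), (3, 1)]
```

Note how the run [1, 2, 3] costs one pair instead of three gaps; returning the pairs themselves (intervals) rather than expanding them corresponds to the "implicit decompression" of runs discussed in the results.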
Our experimental results indicate a reduction of about 10%–30% in the space usage of the whole index (regarding docIDs only), compared with the most efficient state-of-the-art results. Decompression is up to 1.22 times faster when the runs of consecutive docIDs must be explicitly decompressed, and up to 4.58 times faster when implicit decompression of these runs is allowed (e.g., representing the runs as intervals in the output). Finally, we also improve the query processing time of AND queries (by up to 12%), WAND queries (by up to 23%), and full (non-ranked) OR queries (by up to 86%), outperforming the best existing approaches.