Practical compressed string dictionaries

Martínez Prieto, Miguel A.; Brisaboa, Nieves; Cánovas, Rodrigo; Claude, Francisco; Navarro, Gonzalo

Author	dc.contributor.author	Martínez Prieto, Miguel A.
Author	dc.contributor.author	Brisaboa, Nieves
Author	dc.contributor.author	Cánovas, Rodrigo
Author	dc.contributor.author	Claude, Francisco
Author	dc.contributor.author	Navarro, Gonzalo
Accession date	dc.date.accessioned	2016-05-14T21:51:00Z
Available date	dc.date.available	2016-05-14T21:51:00Z
Publication date	dc.date.issued	2016
Item citation	dc.identifier.citation	Information Systems 56 (2016) 73–108	en_US
Identifier	dc.identifier.other	DOI: 10.1016/j.is.2015.08.008
Identifier	dc.identifier.uri	https://repositorio.uchile.cl/handle/2250/138288
General note	dc.description	ISI publication article	en_US
Abstract	dc.description.abstract	The need to store and query a set of strings - a string dictionary - arises in many kinds of applications. While classically these string dictionaries have accounted for a small share of the total space budget (e.g., in Natural Language Processing or when indexing text collections), recent applications in Web engines, Semantic Web (RDF) graphs, Bioinformatics, and many others handle very large string dictionaries, whose size is a significant fraction of the whole data. In these cases, string dictionary management is a scalability issue by itself. This paper focuses on the problem of managing large static string dictionaries in compressed main memory space. We revisit classical solutions for string dictionaries like hashing, tries, and front-coding, and improve them by using compression techniques. We also introduce some novel string dictionary representations built on top of recent advances in succinct data structures and full-text indexes. All these structures are empirically compared on a heterogeneous testbed formed by real-world string dictionaries. We show that the compressed representations may use as little as 5% of the original dictionary size, while supporting lookup operations within a few microseconds. These numbers outperform the state-of-the-art space/time tradeoffs in many cases. Furthermore, we enhance some representations to provide prefix- and substring-based searches, which also perform competitively. The results show that compressed string dictionaries are a useful building block for various data-intensive applications in different domains.	en_US
Sponsor	dc.description.sponsorship	Spanish Ministry of Economy and Competitiveness TIN2013-46238-C4-3-R; ICT COST Action KEYSTONE IC1302; Conicyt, Chile FB0001; Fondecyt Iniciacion 11130104	en_US
Language	dc.language.iso	en	en_US
Publisher	dc.publisher	Elsevier	en_US
Type of license	dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Chile	*
Link to License	dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/cl/	*
Keywords	dc.subject	Compressed string dictionaries	en_US
Keywords	dc.subject	Text processing	en_US
Keywords	dc.subject	Text databases	en_US
Keywords	dc.subject	Compressed data structures	en_US
Title	dc.title	Practical compressed string dictionaries	en_US
Document type	dc.type	Journal article

Files in this item

Name: Practical-compressed-string-di ...
Size: 2.405Mb
Format: PDF

This item appears in the following Collection(s)

Journal articles

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile