Stronger Lempel-Ziv Based Compressed Text Indexing
Abstract
Given a text T [1..u] over an alphabet of size σ, the full-text search problem
consists in finding the occ occurrences of a given pattern P[1..m] in T . In indexed
text searching we build an index on T to improve the search time, yet increasing the
space requirement. The current trend in indexed text searching is that of compressed
full-text self-indices, which replace the text with a more space-efficient representation
of it, at the same time providing indexed access to the text. Thus, we can provide
efficient access within compressed space.
The Lempel-Ziv index (LZ-index) of Navarro is a compressed full-text self-index
able to represent T using 4uHk(T ) + o(ulogσ) bits of space, where Hk(T ) denotes
the k-th order empirical entropy of T , for any k = o(logσ u). This space is about
four times the compressed text size. The index can locate all the occ occurrences of
a pattern P in T in O(m3 log σ + (m + occ) log u) worst-case time. Although this
index has proven very competitive in practice, the O(m3 logσ) term can be excessive for long patterns. Also, the factor 4 in its space complexity makes it larger than other
state-of-the-art alternatives.
In this paper we present stronger Lempel-Ziv based indices (LZ-indices), improving
the overall performance of the original LZ-index. We achieve indices requiring
(2 + )uHk(T ) + o(ulogσ) bits of space, for any constant > 0, which makes
them the smallest existing LZ-indices. We simultaneously improve the search time
to O(m2 + (m + occ) log u), which makes our indices very competitive with stateof-
the-art alternatives. Our indices support displaying any text substring of length
in optimal O( / logσ u) time. In addition, we show how the space can be squeezed
to (1+ )uHk(T )+o(ulogσ) to obtain a structure with O(m2) average search time
for m 2 logσ u. Alternatively, the search time of LZ-indices can be improved to
O((m + occ) log u) with (3 + )uHk(T ) + o(ulogσ) bits of space, which is much
less than the space needed by other Lempel-Ziv-based indices achieving the same
search time. Overall our indices stand out as a very attractive alternative for spaceefficient
indexed text searching.
Patrocinador
CONICYT PhD Fellowship Program, Fondecyt Grant 1-080019, Grant-in-Aid of the Ministry of Education, Science, Sports
and Culture of Japan
Quote Item
Algorithmica (2012) 62:54–101
Collections