A Lempel-Ziv Text Index on Secondary Storage
Author
Abstract
Full-text searching consists in locating the occurrences of a
given pattern P[1::m] in a text T[1::u], both sequences over an alpha-
bet of size ¾. In this paper we de¯ne a new index for full-text searching
on secondary storage, based on the Lempel-Ziv compression algorithm
and requiring 8uHk +o(u log ¾) bits of space, where Hk denotes the k-th
order empirical entropy of T, for any k = o(log¾ u). Our experimental re-
sults show that our index is signi¯cantly smaller than any other practical
secondary-memory data structure: 1.4{2.3 times the text size including
the text, which means 39%{65% the size of traditional indexes like String
B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index re-
quires more disk access to locate the pattern occurrences. Our index is
able to report up to 600 occurrences per disk access, for a disk page of
32 kilobytes. If we only need to count pattern occurrences, the space can
be reduced to about 1.04{1.68 times the text size, requiring about 20{60
disk accesses, depending on the pattern length.
General note
Artículo de publicación ISI
Identifier
URI: https://repositorio.uchile.cl/handle/2250/125817
Collections