Colored range queries and document retrieval
Abstract
Colored range queries are a well-studied topic in computational geometry and database
research that, in the past decade, have found exciting applications in information retrieval.
In this paper, we give improved time and space bounds for three important onedimensional
colored range queries — colored range listing, colored range top-k queries
and colored range counting — and, as a consequence, new bounds for various document
retrieval problems on general collections of sequences. Colored range listing is the problem
of preprocessing a sequence S[1, n] of colors so that, later, given an interval [i, i+ℓ−1], we
list the different colors in S[i, i+ℓ−1]. Colored range top-k queries ask instead for k most
frequent colors in the interval. Colored range counting asks for the number of different
colors in the interval.
We first describe a framework including almost all recent results on colored range
listing and document listing, which suggests new combinations of data structures for
these problems. For example, we give the first compressed data structure (using nHk(S) +
o(n log σ) bits, for any k = o(logσ n), where Hk(S) is the k-th order empirical entropy of
S and σ the number of different colors in S) that answers colored range listing queries
in constant time per returned result. We also give an efficient data structure for document
listing whose size is bounded in terms of the k-th order entropy of the library of documents.
We then show how (approximate) colored top-k queries can be reduced to (approximate)
range-mode queries on subsequences, yielding the first efficient data structure for this
problem. Finally, we show how modified wavelet trees can support colored range counting
using nH0(S) + O(n) + o(nH0(S)) bits, and answer queries in O(log ℓ) time. As far as we
know, this is the first data structure in which the query time depends only on ℓ and not on
n. We also show how our data structure can be made dynamic.
General note
Artículo de publicación ISI
Identifier
URI: https://repositorio.uchile.cl/handle/2250/126273
DOI: doi:10.1016/j.tcs.2012.08.004
ISSN: 0304-3975
Quote Item
Theoretical Computer Science 483 (2013) 36–50
Collections