New Space / Time Tradeo s for Top- k Document Retrieval on Sequences
Artículo
Publication date
2014Metadata
Show full item record
Cómo citar
Navarro, Gonzalo
Cómo citar
New Space / Time Tradeo s for Top- k Document Retrieval on Sequences
Author
Abstract
We address the problem of indexing a collection
D
=
f
T
1
;
T
2
;:::
T
D
g
of
D
string documents of total length
n
, so that we
can e
ciently answer
top-k queries
: retrieve
k
documents most relevant to a pattern
P
of length
p
given at query time.
There exist linear-space data structures, that is, using
O
(
n
) words, that answer such queries in optimal
O
(
p
+
k
) time
for an ample set of notions of relevance. However, using linear space is not su
ciently good for large text collections.
In this paper we explore how far the space
/
time tradeo
for this problem can be pushed. We obtain three results: (1)
When relevance is measured as term frequency (number of times
P
appears in a document
T
i
), an index occupying
j
CSA
j
+
o
(
n
) bits answers the query in time
O
(
t
search
(
p
)
+
k
lg
2
k
lg
"
n
), where
CSA
is a compressed su
x array indexing
D
,
t
search
is its time to find the su
x array interval of
P
, and
" >
0 is any constant. (2) With the same measure of
relevance, an index occupying
j
CSA
j
+
n
lg
D
+
o
(
n
lg
+
n
lg
D
) bits answers the query in time
O
(
t
search
(
p
)
+
k
lg
k
),
where lg
k
is the iterated logarithm of
k
. (3) When the relevance depends only on the documents, an index occupying
j
CSA
j
+
O
(
n
lg lg
n
) bits answers the query in
O
(
t
search
(
p
)
+
k t
SA
) time, where
t
SA
is the time the
CSA
needs to retrieve
a su
x array cell. On our way, we obtain some other results of independent interest
General note
Artículo de publicación ISI
Identifier
URI: https://repositorio.uchile.cl/handle/2250/126562
DOI: dx.doi.org/10.1016/j.tcs.2014.05.005
Quote Item
Theoretical Computer Science 542 (2014) 83–97
Collections