New Space / Time Tradeo s for Top- k Document Retrieval on Sequences
Author
dc.contributor.author
Navarro, Gonzalo
Author
dc.contributor.author
Thankachan, Sharma V.
es_CL
Admission date
dc.date.accessioned
2014-12-15T12:47:03Z
Available date
dc.date.available
2014-12-15T12:47:03Z
Publication date
dc.date.issued
2014
Cita de ítem
dc.identifier.citation
Theoretical Computer Science 542 (2014) 83–97
en_US
Identifier
dc.identifier.other
dx.doi.org/10.1016/j.tcs.2014.05.005
Identifier
dc.identifier.uri
https://repositorio.uchile.cl/handle/2250/126562
General note
dc.description
Artículo de publicación ISI
en_US
Abstract
dc.description.abstract
We address the problem of indexing a collection
D
=
f
T
1
;
T
2
;:::
T
D
g
of
D
string documents of total length
n
, so that we
can e
ciently answer
top-k queries
: retrieve
k
documents most relevant to a pattern
P
of length
p
given at query time.
There exist linear-space data structures, that is, using
O
(
n
) words, that answer such queries in optimal
O
(
p
+
k
) time
for an ample set of notions of relevance. However, using linear space is not su
ciently good for large text collections.
In this paper we explore how far the space
/
time tradeo
for this problem can be pushed. We obtain three results: (1)
When relevance is measured as term frequency (number of times
P
appears in a document
T
i
), an index occupying
j
CSA
j
+
o
(
n
) bits answers the query in time
O
(
t
search
(
p
)
+
k
lg
2
k
lg
"
n
), where
CSA
is a compressed su
x array indexing
D
,
t
search
is its time to find the su
x array interval of
P
, and
" >
0 is any constant. (2) With the same measure of
relevance, an index occupying
j
CSA
j
+
n
lg
D
+
o
(
n
lg
+
n
lg
D
) bits answers the query in time
O
(
t
search
(
p
)
+
k
lg
k
),
where lg
k
is the iterated logarithm of
k
. (3) When the relevance depends only on the documents, an index occupying
j
CSA
j
+
O
(
n
lg lg
n
) bits answers the query in
O
(
t
search
(
p
)
+
k t
SA
) time, where
t
SA
is the time the
CSA
needs to retrieve
a su
x array cell. On our way, we obtain some other results of independent interest