Time-optimal top-k document retrieval
Author
Abstract
LetDbe a collection ofDdocuments, which are strings over an alphabet of sizeσ,of total lengthn. We describe a data structure that uses linear space and reportskmost relevantdocuments that contain a query patternP, which is a string of lengthppacked inp/logσnwords, intimeO(p/logσn+k). This is optimal in the RAM model in the general case where logD= Θ(logn),and involves a novel RAM-optimal suffix tree search. Our construction supports an ample setof important relevance measures, such as the number of timesPappears in a document (calledterm frequency), a fixed document importance, and the minimal distance between two occurrencesofPin a document. When logD=o(logn), we show how to reduce the space of the data structurefromO(nlogn) toO(n(logσ+ logD+ log logn)) bits, and toO(n(logσ+ logD)) bits in the caseof the popular term frequency measure of relevance, at the price of an additive termO(logεσn)in the query time, for any constantε >0. We also consider the dynamic scenario, where doc-uments can be inserted and deleted from the collection. We obtain linear space and query timeO(p(log logn)2/logσn+ logn+klog logk), whereas insertions and deletions requireO(log1+εn)time per symbol, for any constantε >0. Finally, we consider an extended static scenario wherean extra parameterpar(P,d) is defined, and the query must retrieve only documentsdsuch thatpar(P,d)∈[τ1,τ2], where this range is specified at query time. We solve these queries using linearspace andO(p/logσn+ log1+εn+klogεn) time, for any constantε >0. Our technique is to trans-late these top-kproblems into multidimensional geometric search problems. As a bonus, we describesome improvements to those problems.
Indexation
Artículo de publicación SCOPUS
Identifier
URI: https://repositorio.uchile.cl/handle/2250/168813
DOI: 10.1137/140998949
ISSN: 10957111
00975397
Quote Item
SIAM Journal on Computing, Volumen 46, Issue 1, 2017, Pages 80-113
Collections