Top-k Term-Proximity in Succinct Space
Author
Abstract
LetD={T1,T2,...,TD}be a collection ofDstring doc-uments ofncharacters in total, that are drawn from an alphabet setΣ= [σ]. Thetop-kdocument retrieval problemis to preprocessDintoa data structure that, given a query (P[1..p],k), can return thekdocu-ments ofDmost relevant to patternP. The relevance is captured usinga predefined ranking function, which depends on the set of occurrencesofPinTd. For example, it can be the term frequency (i.e., the num-ber of occurrences ofPinTd), or it can be the term proximity (i.e., thedistance between the closest pair of occurrences ofPinTd), or a pattern-independent importance score ofTdsuch as PageRank. Linear space andoptimal query time solutions already exist for this problem. Compressedand compact space solutions are also known, but only for a few rank-ing functions such as term frequency and importance. However, spaceefficient data structures for term proximity based retrieval have beenevasive. In this paper we present the first sub-linear space data structurefor this relevance function, which uses onlyo(n) bits on top of any com-pressed suffix array ofDand solves queries in timeO((p+k) polylogn).
Indexation
Artículo de publicación SCOPUS
Identifier
URI: https://repositorio.uchile.cl/handle/2250/168876
DOI: 10.1007/s00453-016-0167-2
ISSN: 14320541
01784617
Quote Item
Algorithmica, Volumen 78, Issue 2, 2017, Pages 379-393
Collections