Abstract | dc.description.abstract | We show that, by combining an existing compression boosting technique with the wavelet tree data structure, we are able to design a variant of the FM-index which scales well with the size of the input alphabet Sigma. The size of the new index built on a string T[1, n] is bounded by nH(k) (T)+O ((n log log n) / log(\Sigma\) n) bits, where H-k(T) is the k-th order empirical entropy of T.
The above bound holds simultaneously for all k less than or equal to alphalog(\Sigma\) n and 0 < alpha < 1. Moreover, the index design does not depend on the parameter k, which plays a role only in analysis of the space occupancy.
Using our index, the counting of the occurrences of an arbitrary pattern P[1,p] as a substring of T takes O(p log \Sigma\) time. Locating each pattern occurrence takes O(log \Sigma\ (log(2) n / log log n)) time. Reporting a text substring of length 2 takes O((l + log(2) n/ log log n) log \Sigma\) time. | en |