Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Volume 10807 LNCS, 2018, Pages 490–503
Identifier (dc.identifier.issn): 1611-3349
Identifier (dc.identifier.issn): 0302-9743
Identifier (dc.identifier.other): 10.1007/978-3-319-77404-6_36
Identifier (dc.identifier.uri): https://repositorio.uchile.cl/handle/2250/169320
Abstract (dc.description.abstract)
Shannon’s entropy is a clear lower bound for statistical compression. The situation is not so well understood for dictionary-based compression. A plausible lower bound is b, the least number of phrases of a general bidirectional parse of a text, where phrases can be copied from anywhere else in the text. Since computing b is NP-complete, a popular gold standard is z, the number of phrases in the Lempel-Ziv parse of the text, where phrases can be copied only from the left. While z can be computed in linear time, almost nothing has been known for decades about its approximation ratio with respect to b. In this paper we prove that z = O(b log(n/b)), where n is the text length. We also show that the bound is tight as a function of n, by exhibiting a string family where z = Ω(b log n). Our upper bound is obtained by building a run-length context-free grammar based on a locally consistent parsing of the text. Our lower bound is obtained by relating b with r, the number of equal-letter runs in the Burrows-Wheeler transform of the text. On our way, we prove other relevant bounds between compressibility measures.
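As a small illustration of two of the measures named in the abstract (this sketch is not part of the paper), the following Python code computes z with a naive greedy left-to-right Lempel-Ziv parse and r by building the Burrows-Wheeler transform from sorted rotations. The function names, the sentinel convention, and the example string are ours; both routines are deliberately simple and only suitable for short strings.

```python
def lz_phrase_count(s: str) -> int:
    """Number of phrases z in a greedy left-to-right Lempel-Ziv parse.

    Each phrase is either a fresh character or the longest prefix of the
    remaining suffix that also occurs starting strictly before the current
    position (overlapping sources allowed). Naive and quadratic; for
    illustration only, since z is computable in linear time.
    """
    i, z, n = 0, 0, len(s)
    while i < n:
        length, l = 0, 1
        # Extend the candidate phrase while s[i:i+l] occurs starting at
        # some position < i (the end bound i+l-1 enforces exactly that).
        while i + l <= n and s.find(s[i:i + l], 0, i + l - 1) != -1:
            length = l
            l += 1
        i += max(length, 1)  # a fresh character forms a phrase of length 1
        z += 1
    return z


def bwt_run_count(s: str, sentinel: str = "$") -> int:
    """Number of equal-letter runs r in the Burrows-Wheeler transform.

    Builds the BWT naively by sorting all rotations of s + sentinel; the
    sentinel is assumed absent from s and smaller than its characters.
    """
    t = s + sentinel
    rotations = sorted(t[k:] + t[:k] for k in range(len(t)))
    bwt = "".join(rot[-1] for rot in rotations)
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)


if __name__ == "__main__":
    text = "alabaralalabarda"
    print("z =", lz_phrase_count(text))
    print("r =", bwt_run_count(text))
```

The sketch only makes the definitions of z and r concrete on a toy input; it says nothing about b, whose computation is NP-complete, nor about the O(b log(n/b)) bound proved in the paper.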