Author | dc.contributor.author | Fariña, Antonio | |
Author | dc.contributor.author | Navarro, Gonzalo | es_CL |
Author | dc.contributor.author | Paramá, José R. | es_CL |
Accession date | dc.date.accessioned | 2012-05-31T19:28:22Z | |
Available date | dc.date.available | 2012-05-31T19:28:22Z | |
Publication date | dc.date.issued | 2012-01 | |
Item citation | dc.identifier.citation | COMPUTER JOURNAL Volume: 55 Issue: 1 Pages: 111-131 Published: JAN 2012 | es_CL |
Identifier | dc.identifier.other | DOI: 10.1093/comjnl/bxr096 | |
Identifier | dc.identifier.uri | https://repositorio.uchile.cl/handle/2250/125627 | |
Abstract | dc.description.abstract | Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30-35%, they allow fast direct searching of the compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously obtained only by the slow prediction by partial matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases. | es_CL |
Sponsor | dc.description.sponsorship | Fondecyt (Chile)
1-110066
Ministerio de Educacion y Ciencia
TIN2009-14560-C03-02
TIN2010-21246-C02-01
Ministerio de Ciencia e Innovacion
CDTI CEN-20091048
Xunta de Galicia
2010/17 | es_CL |
Language | dc.language.iso | en | es_CL |
Publisher | dc.publisher | OXFORD UNIV PRESS | es_CL |
Keywords | dc.subject | natural language text compression | es_CL |
Title | dc.title | Boosting Text Compression with Word-Based Statistical Encoding | es_CL |
Document type | dc.type | Journal article | |