Boosting Text Compression with Word-Based Statistical Encoding

Fariña, Antonio; Navarro, Gonzalo; Paramá, José R.

Author	dc.contributor.author	Fariña, Antonio
Author	dc.contributor.author	Navarro, Gonzalo	es_CL
Author	dc.contributor.author	Paramá, José R.	es_CL
Admission date	dc.date.accessioned	2012-05-31T19:28:22Z
Available date	dc.date.available	2012-05-31T19:28:22Z
Publication date	dc.date.issued	2012-01
Item citation	dc.identifier.citation	COMPUTER JOURNAL Volume: 55 Issue: 1 Pages: 111-131 Published: JAN 2012	es_CL
Identifier	dc.identifier.other	DOI: 10.1093/comjnl/bxr096
Identifier	dc.identifier.uri	https://repositorio.uchile.cl/handle/2250/125627
Abstract	dc.description.abstract	Semistatic word-based byte-oriented compressors are known to be attractive alternatives for compressing natural language texts. With compression ratios around 30-35%, they allow fast direct searching of the compressed text. In this article, we reveal that these compressors have even more benefits. We show that most state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing step achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% on typical large English texts, a level previously reached only by the slow prediction by partial matching (PPM) compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step: they achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.	es_CL
Sponsor	dc.description.sponsorship	Fondecyt (Chile) 1-110066 Ministerio de Educacion y Ciencia TIN2009-14560-C03-02 TIN2010-21246-C02-01 Ministerio de Ciencia e Innovacion CDTI CEN-20091048 Xunta de Galicia 2010/17	es_CL
Language	dc.language.iso	en	es_CL
Publisher	dc.publisher	OXFORD UNIV PRESS	es_CL
Keywords	dc.subject	natural language text compression	es_CL
Title	dc.title	Boosting Text Compression with Word-Based Statistical Encoding	es_CL
Document type	dc.type	Journal article

Files in this item

Name: Farina_Antonio_Boosting.pdf
Size: 714.4Kb
Format: PDF

This item appears in the following Collection(s)

Artículos de revistas
