About
Contact
Help
Sending publications
How to publish
Advanced Search
View Item 
  •   Home
  • Facultad de Ciencias Físicas y Matemáticas
  • Artículos de revistas
  • View Item
  •   Home
  • Facultad de Ciencias Físicas y Matemáticas
  • Artículos de revistas
  • View Item
JavaScript is disabled for your browser. Some features of this site may not work without it.

Browse byCommunities and CollectionsDateAuthorsTitlesSubjectsThis CollectionDateAuthorsTitlesSubjects

My Account

Login to my accountRegister
Biblioteca Digital - Universidad de Chile
Revistas Chilenas
Repositorios Latinoamericanos
Tesis LatinoAmericanas
Tesis chilenas
Related linksRegistry of Open Access RepositoriesOpenDOARGoogle scholarCOREBASE
My Account
Login to my accountRegister

Boosting Text Compression with Word-Based Statistical Encoding(1)

Artículo
Thumbnail
Open/Download
IconFarina_Antonio_Boosting.pdf (714.4Kb)
Publication date
2012-01
Metadata
Show full item record
Cómo citar
Fariña, Antonio
Cómo citar
Boosting Text Compression with Word-Based Statistical Encoding(1)
.
Copiar
Cerrar

Author
  • Fariña, Antonio;
  • Navarro, Gonzalo;
  • Paramá, José R.;
Abstract
Semistatic word-based byte-oriented compressors are known to be attractive alternatives to compress natural language texts. With compression ratios around 30-35%, they allow fast direct searching of compressed text. In this article, we reveal that these compressors have even more benefits. We show that most of the state-of-the-art compressors benefit from compressing not the original text, but the compressed representation obtained by a word-based byte-oriented statistical compressor. For example, p7zip with a dense-coding preprocessing achieves even better compression ratios and much faster compression than p7zip alone. We reach compression ratios below 17% in typical large English texts, which was obtained only by the slow prediction by partial matching compressors. Furthermore, searches perform much faster if the final compressor operates over word-based compressed text. We show that typical self-indexes also profit from our preprocessing step. They achieve much better space and time performance when indexing is preceded by a compression step. Apart from using the well-known Tagged Huffman code, we present a new suffix-free Dense-Code-based compressor that compresses slightly better. We also show how some self-indexes can handle non-suffix-free codes. As a result, the compressed/indexed text requires around 35% of the space of the original text and allows indexed searches for both words and phrases.
Patrocinador
Fondecyt (Chile) 1-110066 Ministerio de Educacion y Ciencia TIN2009-14560-C03-02 TIN2010-21246-C02-01 Ministerio de Ciencia e Innovacion CDTI CEN-20091048 Xunta de Galicia 2010/17
Identifier
URI: https://repositorio.uchile.cl/handle/2250/125627
DOI: DOI: 10.1093/comjnl/bxr096
Quote Item
COMPUTER JOURNAL Volume: 55 Issue: 1 Pages: 111-131 Published: JAN 2012
Collections
  • Artículos de revistas
xmlui.footer.title
31 participating institutions
More than 73,000 publications
More than 110,000 topics
More than 75,000 authors
Published in the repository
  • How to publish
  • Definitions
  • Copyright
  • Frequent questions
Documents
  • Dating Guide
  • Thesis authorization
  • Document authorization
  • How to prepare a thesis (PDF)
Services
  • Digital library
  • Chilean academic journals portal
  • Latin American Repository Network
  • Latin American theses
  • Chilean theses
Dirección de Servicios de Información y Bibliotecas (SISIB)
Universidad de Chile

© 2020 DSpace
  • Access my account