Show simple item record

Professor Advisordc.contributor.advisorNavarro Badino, Gonzalo
Authordc.contributor.authorCáceres Reyes, Manuel Ariel 
Associate professordc.contributor.otherBarcelo Baeza, Pablo
Associate professordc.contributor.otherBustos Cárdenas, Benjamín
Associate professordc.contributor.otherArroyuelo Billiardi, Diego
Admission datedc.date.accessioned2019-12-05T13:21:27Z
Available datedc.date.available2019-12-05T13:21:27Z
Publication datedc.date.issued2019
Identifierdc.identifier.urihttps://repositorio.uchile.cl/handle/2250/172755
General notedc.descriptionTesis para optar al grado de Magíster en Ciencias, Mención Computaciónes_ES
General notedc.descriptionMemoria para optar al título de Ingeniero Civil en Computación
Abstractdc.description.abstractThe Block Tree is a recently proposed data structure representing a sequence T of length n in space bounded by the number of phrases z of the Lempel-Ziv parsing of T. It uses O(z log(n/z)) space and supports access to symbols and other operations in logarithmic time. We contribute both to the theoretical and practical development of Block Trees, proving new properties, presenting the first implementation faithful to its theoretical description, making further improvements to the structure, and studying a wide set of variants. Suffix trees are a fundamental data structure in stringology with a myriad of applications in areas like bioinformatics, enabling efficient solutions to complex problems such as (approximate) pattern matching, finding repeated substrings, computing matching statistics, computing maximal matches, and many others. However, their space usage, though linear, is an important problem in applications. A line of research addressing this problem consists on building compact representations of suffix trees, named Compressed Suffix Trees (CSTs), which simulate all the suffix tree functionality within space bounded by the information content of the sequence. Current implementations use 10 bits per symbol and support the operations in a few microseconds; or use 5 bits per symbol but their operation times raise to milliseconds. A recent branch of this area is repetition-aware CSTs, which focuses on building suffix trees for repetitive inputs, such as a versioned document or a collection of DNA sequences of a group of humans. CSTs in this area adapt Lempel-Ziv, Grammar and run-length based indexes. The most successful CST in practice is the grammar-compressed suffix tree (GCST), which uses about 2 bits per symbol and supports the operations in tens to hundreds of microseconds. We apply our Block Trees to design a new repetition-aware CST that, though larger than the GCST (i.e., using 3--4 bits per symbol), is faster by orders of magnitude. We obtain this result by using Block Trees on various components of the CST. First, we build them on the parentheses representation of the tree topology; these Block Trees must be augmented to support tree navigation operations on the parentheses. Second, we use Block Trees on various differentially-encoded arrays that compose a CST: the suffix array, its inverse, and the longest-common-prefix array. We end up with a set of recommended combinations for practitioners. All our code is publicly available at https://github.com/elarielcl/BT-CST.es_ES
Patrocinadordc.description.sponsorshipProyecto Fondecyt 1-170048, Proyecto Basal FB00001 (CeBiB) y Proyecto BIRDSes_ES
Lenguagedc.language.isoenes_ES
Publisherdc.publisherUniversidad de Chilees_ES
Type of licensedc.rightsAttribution-NonCommercial-NoDerivs 3.0 Chile*
Link to Licensedc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/cl/*
Keywordsdc.subjectCompresión de datos (Ciencia de la computación)es_ES
Keywordsdc.subjectBioinformáticaes_ES
Keywordsdc.subjectSuffix treeses_ES
Títulodc.titleCompressed suffix trees for repetitive collections based on block treeses_ES
Document typedc.typeTesis
Catalogueruchile.catalogadorgmmes_ES
Facultyuchile.facultadFacultad de Ciencias Físicas y Matemáticases_ES


Files in this item

Icon
Icon

This item appears in the following Collection(s)

Show simple item record

Attribution-NonCommercial-NoDerivs 3.0 Chile
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile