
Author (dc.contributor.author): Fariña, Antonio
Author (dc.contributor.author): Brisaboa, Nieves R.
Author (dc.contributor.author): Navarro, Gonzalo
Author (dc.contributor.author): Claude, Francisco
Author (dc.contributor.author): Places, Angeles S.
Author (dc.contributor.author): Rodríguez, Eduardo
Accession date (dc.date.accessioned): 2012-05-31T19:22:22Z
Available date (dc.date.available): 2012-05-31T19:22:22Z
Publication date (dc.date.issued): 2012-02
Item citation (dc.identifier.citation): ACM Transactions on Information Systems, Volume 30, Issue 1, Article Number 1, Published: Feb 2012
Identifier (dc.identifier.other): DOI: 10.1145/2094072.2094073
Identifier (dc.identifier.uri): https://repositorio.uchile.cl/handle/2250/125626
Abstract (dc.description.abstract): The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
Sponsor (dc.description.sponsorship): MICINN TIN2009-14560-C03-02 TIN2010-21246-C02-01 Ministerio de Ciencia e Innovacion CDTI CEN-20091048 Xunta de Galicia 2010/17 NSERC Canada David R. Cheriton Scholarships Program Fondecyt, Chile 1-080019 1-110066
Language (dc.language.iso): en
Publisher (dc.publisher): ASSOC COMPUTING MACHINERY
Keywords (dc.subject): algorithms
Title (dc.title): Word-Based Self-Indexes for Natural Language Text
Document type (dc.type): Journal article

