Word-Based Self-Indexes for Natural Language Text

Article
Publication date
2012-02
Author
  • Fariña, Antonio
  • Brisaboa, Nieves R.
  • Navarro, Gonzalo
  • Claude, Francisco
  • Places, Angeles S.
  • Rodríguez, Eduardo
Abstract
The inverted index supports efficient full-text searches on natural language text collections. It requires some extra space over the compressed text that can be traded for search speed. It is usually fast for single-word searches, yet phrase searches require more expensive intersections. In this article we introduce a different kind of index. It replaces the text using essentially the same space required by the compressed text alone (compression ratio around 35%). Within this space it supports not only decompression of arbitrary passages, but efficient word and phrase searches. Searches are orders of magnitude faster than those over inverted indexes when looking for phrases, and still faster on single-word searches when little space is available. Our new indexes are particularly fast at counting the occurrences of words or phrases. This is useful for computing relevance of words or phrases. We adapt self-indexes that succeeded in indexing arbitrary strings within compressed space to deal with large alphabets. Natural language texts are then regarded as sequences of words, not characters, to achieve word-based self-indexes. We design an architecture that separates the searchable sequence from its presentation aspects. This permits applying case folding, stemming, removing stopwords, etc. as is usual on inverted indexes.
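The core idea in the abstract, treating the text as a sequence of words rather than characters and answering phrase queries directly on that sequence, can be illustrated with a small sketch. This is not the authors' data structure: it uses a plain, uncompressed suffix array over word identifiers, whereas the article adapts compressed self-indexes. The names `WordSuffixArray`, `tokenize`, `count`, and `locate` are hypothetical, and lowercasing stands in for the case folding the abstract mentions.

```python
import re


def tokenize(text):
    """Split text into words and apply case folding (illustrative normalization only)."""
    return [w.lower() for w in re.findall(r"\w+", text)]


class WordSuffixArray:
    """Uncompressed word-level suffix array; an assumption-laden stand-in for a self-index."""

    def __init__(self, text):
        words = tokenize(text)
        # Map each distinct word to an integer: the "large alphabet" of the sequence.
        self.vocab = {w: i for i, w in enumerate(sorted(set(words)))}
        self.seq = [self.vocab[w] for w in words]
        # Sort suffix start positions by the word-id suffix they begin.
        # Quadratic-space comparison keys are acceptable only for a toy example.
        self.sa = sorted(range(len(self.seq)), key=lambda i: self.seq[i:])

    def _bound(self, ids, upper):
        """Binary search for the first suffix whose m-word prefix is >= ids (or > ids if upper)."""
        lo, hi = 0, len(self.sa)
        m = len(ids)
        while lo < hi:
            mid = (lo + hi) // 2
            start = self.sa[mid]
            prefix = self.seq[start:start + m]
            if prefix < ids or (upper and prefix == ids):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def _range(self, phrase):
        ids = []
        for w in tokenize(phrase):
            if w not in self.vocab:
                return 0, 0  # a word absent from the text cannot occur in any phrase
            ids.append(self.vocab[w])
        if not ids:
            return 0, 0
        return self._bound(ids, False), self._bound(ids, True)

    def count(self, phrase):
        """Number of occurrences of a word or phrase; needs only the suffix-array range."""
        lo, hi = self._range(phrase)
        return hi - lo

    def locate(self, phrase):
        """Word positions (0-based) where the phrase starts, in text order."""
        lo, hi = self._range(phrase)
        return sorted(self.sa[lo:hi])
```

In this sketch a phrase is matched with the same binary search as a single word, so no posting-list intersection is needed, and `count` never touches the individual positions. Calling `WordSuffixArray("To be or not to be").count("to be")` exercises both the case folding and the phrase search.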
Sponsor
MICINN: TIN2009-14560-C03-02, TIN2010-21246-C02-01; Ministerio de Ciencia e Innovacion CDTI: CEN-20091048; Xunta de Galicia: 2010/17; NSERC Canada; David R. Cheriton Scholarships Program; Fondecyt, Chile: 1-080019, 1-110066
Identifier
URI: https://repositorio.uchile.cl/handle/2250/125626
DOI: 10.1145/2094072.2094073
Citation
ACM Transactions on Information Systems, Volume 30, Issue 1, Article 1, February 2012
Collections
  • Artículos de revistas