Show simple item record

Professor Advisordc.contributor.advisorNavarro Badino, Gonzalo
Authordc.contributor.authorGonzález Cornejo, Senen Andrés 
Staff editordc.contributor.editorFacultad de Ciencias Físicas y Matemáticas
Staff editordc.contributor.editorDepartamento de Ciencias de la Computación
Associate professordc.contributor.otherBustos Cárdenas, Benjamín
Associate professordc.contributor.otherSeco Naveiras, Diego
Admission datedc.date.accessioned2014-06-23T20:23:37Z
Available datedc.date.available2014-06-23T20:23:37Z
Publication datedc.date.issued2014
Identifierdc.identifier.urihttps://repositorio.uchile.cl/handle/2250/116403
General notedc.descriptionMagíster en Ciencias, Mención Computación
Abstractdc.description.abstractWeb search has become an important part of day-to-day life. Web search engines are important tools that give access to the information stored in the web. The success of a web search engine mostly depends on its efficiency and the quality of its ranking function. But also, web search engines give extra aids to their users, which make them more usable. An instance of this is the ability of generating result snippets and being able to retrieve the in-cache version of a web page, among others. Inverted indexes are a fundamental data structure used by web search engines to efficiently answer user queries. In a basic setup, inverted indexes only allow for simple (though fairly effective) ranking functions (e.g., BM25). It is well known that the high quality of nowadays search-engine results is due to sophisticated ranking functions. A particular example that has been widely studied in the literature is that of positional ranking functions, where the positions of the query terms within the resulting documents are used in order to rank them. To support this kind of ranking, the classical solution are positional inverted indexes. However, these usually demand large amounts of extra space, typically about three times the space of an inverted index. Moreover, if the web search engine needs to produce text snippets or display a cached copy of a web page, the textual data must be also stored. In this thesis we study time/space trade-offs for web search engines with positional ranking functions and text snippet generation. We aim to answer the question of whether positional inverted indexes are the most efficient way to store and retrieve positional data. In particular, we propose to get rid of positional data in inverted indexes, and instead obtain that information from the text collection itself. The challenge is to compress the text collection such that one can support the extraction of arbitrary documents, in order to find the positions of the query terms within them. We study and compare several alternatives for compressing the textual data. The first one uses a succinct data structure (in particular, a Wavelet Tree). We show how the space of the data structure can be reduced significantly, but also slowed down, by using high-order compressors within the nodes of the data structure. We then show how several text compression alternatives behave when used to obtain arbitrary documents (note that decompression speed is key in this application). Our starting point are compressors that either: (1) use little space for the text, yet with a slow decompression speed; and (2) have a very efficient decompression time (achieving a total performance comparable to that of positional inverted indexes), yet with a poor compression ratio. We then show how to obtain the best from both worlds: an efficient compression ratio, with a high decompression speed. We conclude that there exist a wide range of practical time/space trade-offs, other than just positional inverted indexes. The main result is that using only about 50% of the space of current solutions (i.e., positional inverted indexes plus the compressed text), one can support positional ranking and snippet generation almost with no time penalties. This seems to indicate that not to index positional data is the best solution in many practical scenarios. This can change the way in which positional data is stored and retrieved in web search engines.en_US
Lenguagedc.language.isoenen_US
Publisherdc.publisherUniversidad de Chileen_US
Type of licensedc.rightsAttribution-NonCommercial-NoDerivs 3.0 Chile*
Link to Licensedc.rights.urihttp://creativecommons.org/licenses/by-nc-nd/3.0/cl/*
Keywordsdc.subjectRecuperación de informaciónen_US
Keywordsdc.subjectEstructuras de datos (Ciencia de la computación)en_US
Keywordsdc.subjectIndices comprimidosen_US
Títulodc.titleTo index or not to index:|bTime-space trade-offs in search engines with positional ranking functionsen_US
Document typedc.typeTesis


Files in this item

Icon

This item appears in the following Collection(s)

Show simple item record

Attribution-NonCommercial-NoDerivs 3.0 Chile
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile