To index or not to index: Time-space trade-offs in search engines with positional ranking functions

González Cornejo, Senen Andrés

Professor Advisor	dc.contributor.advisor	Navarro Badino, Gonzalo
Author	dc.contributor.author	González Cornejo, Senen Andrés
Staff editor	dc.contributor.editor	Facultad de Ciencias Físicas y Matemáticas
Staff editor	dc.contributor.editor	Departamento de Ciencias de la Computación
Associate professor	dc.contributor.other	Bustos Cárdenas, Benjamín
Associate professor	dc.contributor.other	Seco Naveiras, Diego
Admission date	dc.date.accessioned	2014-06-23T20:23:37Z
Available date	dc.date.available	2014-06-23T20:23:37Z
Publication date	dc.date.issued	2014
Identifier	dc.identifier.uri	https://repositorio.uchile.cl/handle/2250/116403
General note	dc.description	Magíster en Ciencias, Mención Computación
Abstract	dc.description.abstract	Web search has become an important part of day-to-day life. Web search engines are important tools that give access to the information stored on the web. The success of a web search engine depends mostly on its efficiency and the quality of its ranking function. Web search engines also offer extra aids that make them more usable, such as generating result snippets and retrieving the cached version of a web page. Inverted indexes are a fundamental data structure used by web search engines to answer user queries efficiently. In a basic setup, inverted indexes only allow for simple (though fairly effective) ranking functions (e.g., BM25). It is well known that the high quality of today's search-engine results is due to sophisticated ranking functions. A particular example that has been widely studied in the literature is that of positional ranking functions, where the positions of the query terms within the resulting documents are used to rank them. The classical solution for supporting this kind of ranking is the positional inverted index. However, positional inverted indexes usually demand large amounts of extra space, typically about three times the space of an inverted index. Moreover, if the web search engine needs to produce text snippets or display a cached copy of a web page, the textual data must also be stored. In this thesis we study time/space trade-offs for web search engines with positional ranking functions and text snippet generation. We aim to answer the question of whether positional inverted indexes are the most efficient way to store and retrieve positional data. In particular, we propose to get rid of positional data in inverted indexes, and instead obtain that information from the text collection itself.
The challenge is to compress the text collection such that one can support the extraction of arbitrary documents, in order to find the positions of the query terms within them. We study and compare several alternatives for compressing the textual data. The first one uses a succinct data structure (in particular, a Wavelet Tree). We show how the space of the data structure can be reduced significantly, at the cost of slower operation, by using high-order compressors within its nodes. We then show how several text compression alternatives behave when used to obtain arbitrary documents (note that decompression speed is key in this application). Our starting points are compressors that either (1) use little space for the text, yet decompress slowly, or (2) decompress very efficiently (achieving a total performance comparable to that of positional inverted indexes), yet with a poor compression ratio. We then show how to obtain the best of both worlds: a good compression ratio with a high decompression speed. We conclude that there exists a wide range of practical time/space trade-offs, other than just positional inverted indexes. The main result is that using only about 50% of the space of current solutions (i.e., positional inverted indexes plus the compressed text), one can support positional ranking and snippet generation with almost no time penalty. This seems to indicate that not indexing positional data is the best solution in many practical scenarios. This can change the way in which positional data is stored and retrieved in web search engines.	en_US
Language	dc.language.iso	en	en_US
Publisher	dc.publisher	Universidad de Chile	en_US
Type of license	dc.rights	Attribution-NonCommercial-NoDerivs 3.0 Chile	*
Link to License	dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/cl/	*
Keywords	dc.subject	Information retrieval	en_US
Keywords	dc.subject	Data structures (Computer science)	en_US
Keywords	dc.subject	Compressed indexes	en_US
Title	dc.title	To index or not to index: Time-space trade-offs in search engines with positional ranking functions	en_US
Document type	dc.type	Thesis

Files in this item

Name: cf-gonzalez_sc.pdf
Size: 2.192Mb
Format: PDF

This item appears in the following Collection(s)

Tesis Postgrado

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 Chile