Advisor | dc.contributor.advisor | Bravo Márquez, Felipe | |
Author | dc.contributor.author | Iturra Bocaz, Gabriel Emerson | |
Associate professor | dc.contributor.other | Abeliuk Kimelman, Andrés | |
Associate professor | dc.contributor.other | Gutiérrez Gallardo, Claudio | |
Associate professor | dc.contributor.other | Scheihing García, Eliana | |
Accession date | dc.date.accessioned | 2023-11-27T18:20:54Z | |
Available date | dc.date.available | 2023-11-27T18:20:54Z | |
Publication date | dc.date.issued | 2023 | |
Identifier | dc.identifier.uri | https://repositorio.uchile.cl/handle/2250/196539 | |
Abstract | dc.description.abstract | Word embeddings have become indispensable tools in various natural language processing and information retrieval tasks, including document classification, ranking, and question answering. However, traditional word embedding models have a major limitation in their static nature, which hinders their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To address this challenge, incremental word embedding algorithms have been introduced, enabling dynamic updating of word representations in response to new language patterns and continuous data streams.
This thesis presents RiverText, a comprehensive framework for training and evaluating incremental word embeddings from text data streams. Our tool provides a valuable resource for the natural language processing community that deals with word embeddings in streaming scenarios, such as social media analysis. The library implements various incremental word embedding techniques in a standardized framework, including Skip-gram, Continuous Bag of Words, and Word Context Matrix. Additionally, it uses PyTorch as its backend for neural network training, enabling efficient and flexible training.
We have also implemented a module that adapts intrinsic static word embedding evaluation tasks, such as word similarity and categorization, to a streaming setting. Finally, we compare the performance of our framework using different hyperparameter settings and discuss the results.
Our open-source library is available at https://github.com/dccuchile/rivertext. It includes detailed documentation and examples to help users get started with the framework quickly and easily. We believe that our framework will greatly benefit researchers and practitioners in natural language processing, especially those working with large-scale streaming text data. | es_ES |
Sponsor | dc.description.sponsorship | ANID FONDECYT grant 1200290, National Center for Artificial Intelligence CENIA FB210017, and ANID-Millennium Science Initiative Program - Code ICN17 002 | es_ES |
Language | dc.language.iso | en | es_ES |
Publisher | dc.publisher | Universidad de Chile | es_ES |
Type of license | dc.rights | Attribution-NonCommercial-NoDerivs 3.0 United States | * |
Link to License | dc.rights.uri | http://creativecommons.org/licenses/by-nc-nd/3.0/us/ | * |
Keywords | dc.subject | Procesamiento de lenguaje natural (Ciencia de la computación) | es_ES |
Keywords | dc.subject | Natural language processing (Computer science) | es_ES |
Keywords | dc.subject | Word embeddings | es_ES |
Keywords | dc.subject | Data streams | es_ES |
Keywords | dc.subject | Incremental learning | es_ES |
Title | dc.title | RiverText: A framework for training and evaluating incremental word embeddings from text data streams | es_ES |
Document type | dc.type | Thesis | es_ES |
dc.description.version | dc.description.version | Author's original version | es_ES |
dcterms.accessRights | dcterms.accessRights | Open access | es_ES |
Cataloguer | uchile.catalogador | gmm | es_ES |
Department | uchile.departamento | Departamento de Ciencias de la Computación | es_ES |
Faculty | uchile.facultad | Facultad de Ciencias Físicas y Matemáticas | es_ES |
uchile.titulacion | uchile.titulacion | Double degree | es_ES |
uchile.carrera | uchile.carrera | Ingeniería Civil en Computación | es_ES |
uchile.gradoacademico | uchile.gradoacademico | Master's | es_ES |
uchile.notadetesis | uchile.notadetesis | Thesis submitted for the degree of Master of Science, Computer Science track (Magíster en Ciencias, Mención Computación) | es_ES |
uchile.notadetesis | uchile.notadetesis | Report submitted for the professional title of Ingeniero Civil en Computación | |