RiverText: A framework for training and evaluating incremental word embeddings from text data streams
Thesis
Access note
Open access
Publication date
2023
How to cite
Bravo Márquez, Felipe
Author
Professor Advisor
Abstract
Word embeddings have become indispensable tools in various natural language processing and information retrieval tasks, including document classification, ranking, and question answering. However, traditional word embedding models have a major limitation in their static nature, which hinders their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To address this challenge, incremental word embedding algorithms have been introduced, enabling dynamic updating of word representations in response to new language patterns and continuous data streams.
This thesis presents RiverText, a comprehensive framework for training and evaluating incremental word embeddings from text data streams. The tool is intended as a resource for members of the natural language processing community who work with word embeddings in streaming scenarios, such as social media analysis. The library implements several incremental word embedding techniques in a standardized framework, including Skip-gram, Continuous Bag of Words, and Word Context Matrix. It uses PyTorch as its backend, which makes neural network training efficient and flexible.
We have also implemented a module that adapts intrinsic static word embedding evaluation tasks, such as word similarity and categorization, to a streaming setting. Finally, we compare the performance of our framework using different hyperparameter settings and discuss the results.
Our open-source library is available at https://github.com/dccuchile/rivertext. It includes detailed documentation and examples to help users get started with the framework quickly and easily. We believe that our framework will greatly benefit researchers and practitioners in natural language processing, especially those working with large-scale streaming text data.
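To make the streaming setting in the abstract concrete, the following is a minimal sketch, in plain Python, of an incremental word-context count model that is updated one document at a time and evaluated periodically on a few probe word pairs. It is not the RiverText API: the class `StreamingContextCounts`, the function `periodic_evaluation`, and the parameters `window`, `probe_pairs`, and `period` are hypothetical names chosen for illustration, and raw co-occurrence counts stand in for the embedding methods the library actually implements.

```python
import math
from collections import Counter, defaultdict


class StreamingContextCounts:
    """Hypothetical incremental word-context count model (not the RiverText API)."""

    def __init__(self, window=2):
        self.window = window
        # word -> Counter mapping each context word to its co-occurrence count
        self.counts = defaultdict(Counter)

    def learn_one(self, text):
        """Update co-occurrence counts from a single document of the stream."""
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            start = max(0, i - self.window)
            end = min(len(tokens), i + self.window + 1)
            for j in range(start, end):
                if j != i:
                    self.counts[word][tokens[j]] += 1

    def similarity(self, a, b):
        """Cosine similarity between the sparse count vectors of two words."""
        va, vb = self.counts.get(a), self.counts.get(b)
        if not va or not vb:
            return 0.0  # at least one word has not been seen yet
        dot = sum(count * vb[ctx] for ctx, count in va.items())
        norm_a = math.sqrt(sum(c * c for c in va.values()))
        norm_b = math.sqrt(sum(c * c for c in vb.values()))
        return dot / (norm_a * norm_b)


def periodic_evaluation(stream, model, probe_pairs, period):
    """Train on each document, yielding probe-pair similarities every `period` documents."""
    for n_seen, text in enumerate(stream, start=1):
        model.learn_one(text)
        if n_seen % period == 0:
            yield n_seen, {pair: model.similarity(*pair) for pair in probe_pairs}


if __name__ == "__main__":
    # Toy stream; a real stream would be unbounded (e.g. social media posts).
    stream = [
        "the cat sat on the mat",
        "the dog sat on the rug",
        "a cat and a dog play",
        "new words arrive",
    ]
    model = StreamingContextCounts(window=2)
    for n_seen, scores in periodic_evaluation(stream, model, [("cat", "dog")], period=2):
        print(n_seen, scores)
```

In an intrinsic word similarity task, the probe-pair similarities at each checkpoint would typically be compared against human ratings (for example with a rank correlation), which yields a score that can be tracked as the stream advances.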
Thesis note
Thesis submitted for the degree of Master of Science in Computer Science (Magíster en Ciencias, Mención Computación); report submitted for the professional title of Civil Engineer in Computer Science (Ingeniero Civil en Computación)
Sponsor
ANID FONDECYT grant 1200290, National Center for Artificial Intelligence CENIA FB210017, and ANID-Millennium Science Initiative Program - Code ICN17 002
Identifier
URI: https://repositorio.uchile.cl/handle/2250/196539