RiverText: A framework for training and evaluating incremental word embeddings from text data streams

Word embeddings have become indispensable tools in various natural language processing and information retrieval tasks, including document classification, ranking, and question answering. However, traditional word embedding models have a major limitation in their static nature, which hinders their ability to adapt to the constantly evolving language patterns that emerge in sources such as social media and the web (e.g., new hashtags or brand names). To address this challenge, incremental word embedding algorithms have been introduced, enabling dynamic updating of word representations in response to new language patterns and continuous data streams. This thesis presents RiverText, a comprehensive framework for training and evaluating incremental word embeddings from text data streams. Our tool provides a valuable resource for the natural language processing community that deals with word embeddings in streaming scenarios, such as social media analysis. The library implements various incremental word embedding techniques in a standardized framework, including Skip-gram, Continuous Bag of Words, and Word Context Matrix. Additionally, it uses PyTorch as its backend for neural network training, enabling efficient and flexible training. We have also implemented a module that adapts intrinsic static word embedding evaluation tasks, such as word similarity and categorization, to a streaming setting. Finally, we compare the performance of our framework using different hyperparameter settings and discuss the results. Our open-source library is available at https://github.com/dccuchile/rivertext. It includes detailed documentation and examples to help users get started with the framework quickly and easily. We believe that our framework will greatly benefit researchers and practitioners in natural language processing, especially those working with large-scale streaming text data.
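To make the idea of an incremental, count-based embedding concrete, the following is a minimal sketch of a word-context matrix that is updated one text at a time and queried at any point in the stream. It is not RiverText's API: the class `StreamingCooccurrence`, its `learn_one` and `ppmi_vector` methods, the whitespace tokenizer, the window size, and the toy stream are all hypothetical choices made for illustration, and the PPMI weighting is an assumed, common way of turning raw co-occurrence counts into vectors.

```python
import math
from collections import Counter, defaultdict


class StreamingCooccurrence:
    """Hypothetical illustration of an incrementally updated word-context matrix.

    This is NOT the RiverText implementation; names and design are assumptions.
    """

    def __init__(self, window=2):
        self.window = window
        self.pair_counts = defaultdict(Counter)  # word -> context -> count
        self.word_totals = Counter()             # row sums
        self.context_totals = Counter()          # column sums
        self.total = 0                           # sum of all counts

    def learn_one(self, text):
        """Update the counts with a single text from the stream."""
        tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
        for i, word in enumerate(tokens):
            start = max(0, i - self.window)
            end = min(len(tokens), i + self.window + 1)
            for j in range(start, end):
                if j == i:
                    continue
                context = tokens[j]
                self.pair_counts[word][context] += 1
                self.word_totals[word] += 1
                self.context_totals[context] += 1
                self.total += 1

    def ppmi_vector(self, word):
        """Return a sparse PPMI vector (context -> weight) for the word so far."""
        vector = {}
        for context, count in self.pair_counts.get(word, {}).items():
            expected = self.word_totals[word] * self.context_totals[context]
            pmi = math.log(count * self.total / expected)
            if pmi > 0:
                vector[context] = pmi
        return vector


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy stream (made up for this example).
stream = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "a new hashtag trends on the web",
]

model = StreamingCooccurrence(window=2)
for text in stream:
    model.learn_one(text)  # the representation changes with every text

similarity = cosine(model.ppmi_vector("cat"), model.ppmi_vector("dog"))
print(f"cat/dog similarity after {len(stream)} texts: {similarity:.3f}")
```

Because the vectors can be read at any time, an intrinsic task such as word similarity could, under the same assumptions, be evaluated periodically while the stream is still being consumed rather than once after training ends.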

Thesis note

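Thesis submitted for the degree of Master of Science, Computer Science specialization (Magíster en Ciencias, Mención Computación)

Thesis submitted for the professional title of Civil Engineer in Computing (Ingeniero Civil en Computación)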

Sponsor

ANID FONDECYT grant 1200290, National Center for Artificial Intelligence CENIA FB210017, and ANID-Millennium Science Initiative Program - Code ICN17 002

Identifier

URI: https://repositorio.uchile.cl/handle/2250/196539
DOI: 10.58011/vpzd-4212
