Empirical study of the visual reasoning capabilities of the neural state machine

Chaperón Burgos, Gabriel Alejandro

Thesis

Access note

Open access

Publication date

2023

Author

Chaperón Burgos, Gabriel Alejandro

Professor Advisor

Abstract

Deep learning is a field within computer science, statistics and mathematics in which practitioners design deep neural networks that mimic, to some extent, abilities inherent to human beings. In this field, tasks are used to evaluate the ability of a model to perform a specific human skill, such as object recognition, text classification or speech recognition. In late 2019, a new architecture called the Neural State Machine (NSM) was proposed for the task of visual question answering, in which a model has to answer a question about an image. The architecture is heavily inspired by traditional state machines from automata theory, and works by iteratively following a path over the objects in the image until it finds the answer to the question. In this work we empirically study the limitations of this new architecture. From automata theory we know that the lack of memory in traditional state machines limits the kinds of inputs they can process. Considering this observation and the NSM's design based on state machines, we hypothesize that the architecture will be unable to process certain kinds of image-based questions. We test this hypothesis experimentally. First, we define a number of question categories on which we expect the NSM to struggle. These questions come from previous efforts in the literature to establish benchmarks for multimodal text-and-vision systems. Next, we evaluate the architecture and compare the results to a baseline on which the NSM performs almost perfectly. Our findings show that the NSM does struggle with the proposed questions. The decrease in performance varies from case to case, in some cases reaching random-level performance. Our results suggest that a comprehensive solution to the visual question answering task needs to go beyond a neural network that represents a finite state machine.

Thesis note

Thesis submitted for the degree of Master of Science, Computer Science specialization (Magíster en Ciencias, Mención Computación)

Dissertation submitted for the professional title of Ingeniero Civil en Computación

Identifier

URI: https://repositorio.uchile.cl/handle/2250/199836
DOI: 10.58011/5h7p-q442
