Evaluation of convolutional and attentional modules for one-shot object detection

Loyola Maureira, Cristóbal Andrés

Professor Advisor	dc.contributor.advisor	Saavedra Rondo, José
Author	dc.contributor.author	Loyola Maureira, Cristóbal Andrés
Associate professor	dc.contributor.other	Barriere, Valentín
Associate professor	dc.contributor.other	Pino Urtubia, José
Associate professor	dc.contributor.other	Saavedra Ruiz, Carolina
Admission date	dc.date.accessioned	2025-05-07T17:48:07Z
Available date	dc.date.available	2025-05-07T17:48:07Z
Publication date	dc.date.issued	2024
Identifier	dc.identifier.other	10.58011/kqy8-bq64
Identifier	dc.identifier.uri	https://repositorio.uchile.cl/handle/2250/204755
Abstract	dc.description.abstract	A central problem in computer vision is object detection, that is, finding the location of all object instances in a given image. Its applications include vehicle detection, people counting, face detection and number-plate recognition. Despite the success of deep neural networks on image recognition tasks, many challenges remain open when designing an object detector. One of them is data labeling, a process that requires a lot of time and resources. This means that it is not feasible to train (at least in a fully supervised manner) a model capable of recognizing all object categories that appear in a real-world scenario. Within this context, this thesis tackles the specific problem of one-shot object detection, which consists of finding all instances of a query object in a target image, with the restriction that the query category was not seen during model training. To study this problem, we implemented and evaluated different deep learning models in two contexts: when both the target and the query are images, and when the target is an image but the query is a sketch. Through our experiments, we analyzed several aspects of the one-shot object detection task: the training strategy used, the effect of incorporating attentional modules and the relative importance of the different components of a detection head.
The results show that: i) a contrastive training strategy considerably improves the ability of the models to detect only those objects belonging to the query class; ii) although the detectors that use attentional modules achieved lower results than the fully convolutional ones, both exhibit the ability to generalize to unseen classes; iii) the two-stage detectors evaluated achieved considerably better results than the one-stage detectors; iv) in the detection heads, using a global relation between the proposed objects and the query can reduce performance due to the loss of spatial information. Finally, we verified that one-shot models originally designed for images can be successfully extended to sketch-guided detection. We hypothesize that the low performance of the attentional modules is due to the amount of data used in training. To improve them, we propose training with larger datasets that have higher variability. On the other hand, even though one-stage detectors delivered poor results, this approach is still desirable because of its efficiency. Considering this, we propose the following modifications as future work: incorporating Transformer-based backbones such as ViT [5], using state-of-the-art models like SAM [21] to find the regions of interest, and obtaining the final detections with a mask-based segmentation approach as in Mask2Former [4].	es_ES
Sponsor	dc.description.sponsorship	This work was partially funded by Fondef IDea I+D Project ID23I10107 and Centro Nacional de Inteligencia Artificial CENIA FB210017, ANID Basal Funding
Language	dc.language.iso	en	es_ES
Type of license	dc.rights	Attribution-NonCommercial-NoDerivs 3.0 United States	*
Link to License	dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/3.0/us/	*
Title	dc.title	Evaluation of convolutional and attentional modules for one-shot object detection	es_ES
Document type	dc.type	Thesis	es_ES
dc.description.version	dc.description.version	Author's original version	es_ES
dcterms.accessRights	dcterms.accessRights	Open access	es_ES
Cataloguer	uchile.catalogador	chb	es_ES
Department	uchile.departamento	Departamento de Ciencias de la Computación	es_ES
Faculty	uchile.facultad	Facultad de Ciencias Físicas y Matemáticas	es_ES
uchile.carrera	uchile.carrera	Ingeniería Civil en Computación	es_ES
uchile.gradoacademico	uchile.gradoacademico	Master's degree	es_ES
uchile.notadetesis	uchile.notadetesis	Thesis submitted for the degree of Magíster en Ciencias, Mención Computación (Master of Science, Computer Science specialization)	es_ES

Files in this item

Name: Evaluation-of-convolutional-an ...
Size: 9.819Mb
Format: PDF

This item appears in the following Collection(s)

Tesis Postgrado

Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivs 3.0 United States