Extracting Structured Supervision From Captions for Weakly Supervised Semantic Segmentation
Author
dc.contributor.author
Vilar, Daniel R.
Author
dc.contributor.author
Pérez Flores, Claudio
Accession date
dc.date.accessioned
2021-09-21T14:30:20Z
Available date
dc.date.available
2021-09-21T14:30:20Z
Publication date
dc.date.issued
2021
Item citation
dc.identifier.citation
IEEE Access, vol. 9, pp. 65702-65720, 2021
Identifier
dc.identifier.other
10.1109/ACCESS.2021.3076074
Identifier
dc.identifier.uri
https://repositorio.uchile.cl/handle/2250/182014
Abstract
dc.description.abstract
Weakly supervised semantic segmentation (WSSS) methods have received significant attention in recent years, since they can dramatically reduce the annotation costs of fully supervised alternatives. While most previous studies focused on leveraging classification labels, we instead explore the use of image captions, which can be obtained easily from the web and contain richer visual information. Existing methods for this task assign text snippets to relevant semantic labels by simply matching class names, and then employ a model trained to localize arbitrary text in images to generate pseudo-ground-truth segmentation masks. Instead, we propose a dedicated caption processing module that extracts structured supervision from captions, consisting of improved relevant object labels, their visual attributes, and additional background categories, all of which are useful for improving segmentation quality. This module uses syntactic structures learned from text data and semantic relations retrieved from a knowledge database, without requiring additional annotations on the specific image domain, and can therefore be extended immediately to new object categories. We then present a novel localization network, trained to localize only these structured labels. This strategy simplifies model design while focusing training signals on relevant visual information. Finally, we describe a method for leveraging all types of localization maps to obtain high-quality segmentation masks, which are used to train a supervised model. On the challenging MS-COCO dataset, our method advances the state of the art for WSSS with image-level supervision by 7.6% absolute (26.7% relative) mean Intersection-over-Union, achieving 54.5% precision and 50.9% recall.
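The caption processing step described in the abstract can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; it only shows, under assumed tooling (spaCy for dependency parsing, NLTK's WordNet as the knowledge database, and a hypothetical subset of MS-COCO class names), how candidate object labels, their adjectival attributes, and background nouns might be pulled from a caption.

# Illustrative sketch only, not the paper's module. Assumes spaCy with the
# "en_core_web_sm" model and NLTK's WordNet corpus are installed
# (nltk.download("wordnet")); TARGET_CLASSES is a hypothetical subset of
# MS-COCO category names used as a stand-in for the real class list.
import spacy
from nltk.corpus import wordnet as wn

nlp = spacy.load("en_core_web_sm")
TARGET_CLASSES = {"person", "dog", "frisbee", "grass"}  # hypothetical subset

def matches_class(noun, classes):
    """Match a noun to a target class via its WordNet synonyms and hypernyms."""
    candidates = {noun}
    for syn in wn.synsets(noun, pos=wn.NOUN):
        candidates.update(l.name().replace("_", " ") for l in syn.lemmas())
        for hyper in syn.hypernyms():
            candidates.update(l.name().replace("_", " ") for l in hyper.lemmas())
    return next((c for c in classes if c in candidates), None)

def parse_caption(caption, classes=TARGET_CLASSES):
    """Split a caption's nouns into class labels, their attributes, and background cues."""
    doc = nlp(caption)
    labels, attributes, background = set(), {}, set()
    for token in doc:
        if token.pos_ != "NOUN":
            continue
        cls = matches_class(token.lemma_, classes)
        if cls is not None:
            labels.add(cls)
            # adjectival modifiers of the noun are kept as visual attributes
            attributes[cls] = [c.lemma_ for c in token.children if c.dep_ == "amod"]
        else:
            # nouns outside the class list are treated as background cues
            background.add(token.lemma_)
    return labels, attributes, background

print(parse_caption("A brown dog catches a frisbee on the green grass"))

For the example caption, such a sketch would yield object labels {"dog", "frisbee", "grass"}, attributes like {"dog": ["brown"], "grass": ["green"]}, and no background nouns, mirroring the three kinds of structured supervision the abstract describes.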
Sponsor
dc.description.sponsorship
This work was supported by ANID (Agencia Nacional de Investigación y Desarrollo) under Grants FONDECYT 1191610 and FONDEF ID16I20290, and by the Department of Electrical Engineering and the Advanced Mining Technology Center (CONICYT Project AFB180004), Universidad de Chile.