TY - GEN
T1 - Dynamic Multimodal Instance Segmentation Guided by Natural Language Queries
AU - Margffoy-Tuay, Edgar
AU - Pérez, Juan C.
AU - Botero, Emilio
AU - Arbeláez, Pablo
N1 - Publisher Copyright:
© 2018, Springer Nature Switzerland AG.
PY - 2018
Y1 - 2018
N2 - We address the problem of segmenting an object given a natural language expression that describes it. Current techniques tackle this task by either (i) directly or recursively merging linguistic and visual information in the channel dimension and then performing convolutions; or by (ii) mapping the expression to a space in which it can be thought of as a filter, whose response is directly related to the presence of the object at a given spatial coordinate in the image, so that a convolution can be applied to look for the object. We propose a novel method that integrates these two insights in order to fully exploit the recursive nature of language. Additionally, during the upsampling process, we take advantage of the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained. We compare our method against the state-of-the-art approaches in four standard datasets, in which it surpasses all previous methods in six of eight of the splits for this task.
AB - We address the problem of segmenting an object given a natural language expression that describes it. Current techniques tackle this task by either (i) directly or recursively merging linguistic and visual information in the channel dimension and then performing convolutions; or by (ii) mapping the expression to a space in which it can be thought of as a filter, whose response is directly related to the presence of the object at a given spatial coordinate in the image, so that a convolution can be applied to look for the object. We propose a novel method that integrates these two insights in order to fully exploit the recursive nature of language. Additionally, during the upsampling process, we take advantage of the intermediate information generated when downsampling the image, so that detailed segmentations can be obtained. We compare our method against the state-of-the-art approaches in four standard datasets, in which it surpasses all previous methods in six of eight of the splits for this task.
KW - Dynamic convolutional filters
KW - Instance segmentation
KW - Multimodal interaction
KW - Natural language processing
KW - Referring expressions
UR - https://www.scopus.com/pages/publications/85055121079
U2 - 10.1007/978-3-030-01252-6_39
DO - 10.1007/978-3-030-01252-6_39
M3 - Conference contribution
AN - SCOPUS:85055121079
SN - 9783030012519
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 656
EP - 672
BT - Computer Vision – ECCV 2018 - 15th European Conference, 2018, Proceedings
A2 - Ferrari, Vittorio
A2 - Sminchisescu, Cristian
A2 - Weiss, Yair
A2 - Hebert, Martial
PB - Springer Verlag
T2 - 15th European Conference on Computer Vision, ECCV 2018
Y2 - 8 September 2018 through 14 September 2018
ER -