Exploring and Visualizing Referring Expression Comprehension
Abstract
Human-machine interaction is one of the main objectives in the field of Artificial Intelligence today. This work contributes to enhancing that interaction by exploring the task of Referring Expression Comprehension (REC): given an image and a referring expression (a linguistic phrase or human speech), detect the object to which the expression refers, i.e., produce a binary segmentation of the referred object. The multimodal nature of this task requires the use of different deep learning architectures, among them convolutional neural networks (computer vision), and recurrent neural networks and the Transformer model (natural language processing).
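The pipeline described above (encode the image, encode the expression, and combine both to score each pixel) can be sketched minimally as follows. This is a hypothetical illustration with random projections standing in for trained networks; the function names (`encode_image`, `encode_expression`, `refer_segment`) and the simple dot-product fusion are assumptions for clarity, not the architecture used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image, dim=8):
    """Stand-in for a CNN backbone: map an H x W grayscale image to a
    per-pixel feature map of shape (H, W, dim)."""
    h, w = image.shape
    proj = rng.standard_normal((1, dim))
    return image.reshape(h, w, 1) @ proj

def encode_expression(tokens, vocab, dim=8):
    """Stand-in for an RNN/Transformer text encoder: average the
    embeddings of the tokens in the referring expression."""
    table = rng.standard_normal((len(vocab), dim))
    ids = [vocab[t] for t in tokens]
    return table[ids].mean(axis=0)

def refer_segment(image, tokens, vocab, threshold=0.0):
    """Score each pixel by the similarity between its visual feature and
    the expression embedding, then threshold to obtain the binary
    segmentation of the referred object."""
    feats = encode_image(image)              # (H, W, D)
    text = encode_expression(tokens, vocab)  # (D,)
    scores = feats @ text                    # (H, W)
    return scores > threshold

vocab = {"the": 0, "bright": 1, "region": 2}
image = rng.random((4, 4))
mask = refer_segment(image, ["the", "bright", "region"], vocab)
print(mask.shape, mask.dtype)
```

In a real REC model the two encoders are trained jointly and the fusion step is far richer (e.g., attention between words and image regions), but the overall shape of the computation is the same: two modality-specific encoders feeding a fusion module that outputs a per-pixel mask.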
This thesis is presented as a self-contained document that can be understood by a reader with no prior knowledge of machine learning. The bulk of the work consists of an exhaustive study of the REC task: from its applications, through a complete description of the current state of the art, to the study, comparison, and implementation of models. In addition, a functional, free, and public web page is presented that allows simple interaction with the model described in this work.
Keywords
Referring Expression Comprehension • Artificial Intelligence • Machine Learning • Deep Learning • Computer Vision • Natural Language Processing • Multimodal Learning
Mathematics Subject Classification
68T45
Thesis
Access to full text: Referring Expression Comprehension - David Álvarez Rosa.pdf
Slides: Referring Expression Comprehension (Slides) - David Álvarez Rosa.pdf
Source code: Git repository