Exploring and Visualizing Referring Expression Comprehension
Abstract
Human-machine interaction is one of the main objectives in the field of Artificial Intelligence today. This work contributes to enhancing that interaction by exploring the task of Referring Expression Comprehension (REC): given an image and a referring expression (a linguistic phrase or human speech), detect the object to which the expression refers, i.e., produce a binary segmentation of the referred object. The multimodal nature of this task requires the use of different deep learning architectures, among them convolutional neural networks (computer vision), and recurrent neural networks and the Transformer model (natural language processing).
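The pipeline described above (encode the image, encode the expression, and combine both to score each pixel) can be sketched minimally as follows. This is a hypothetical illustration with random projections standing in for trained networks; the function names (`encode_image`, `encode_expression`, `refer_segment`) and the simple dot-product fusion are assumptions for clarity, not the architecture used in the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image, dim=8):
    """Stand-in for a CNN backbone: map an H x W grayscale image to a
    per-pixel feature map of shape (H, W, dim)."""
    h, w = image.shape
    proj = rng.standard_normal((1, dim))
    return image.reshape(h, w, 1) @ proj

def encode_expression(tokens, vocab, dim=8):
    """Stand-in for an RNN/Transformer text encoder: average the
    embeddings of the tokens in the referring expression."""
    table = rng.standard_normal((len(vocab), dim))
    ids = [vocab[t] for t in tokens]
    return table[ids].mean(axis=0)

def refer_segment(image, tokens, vocab, threshold=0.0):
    """Score each pixel by the similarity between its visual feature and
    the expression embedding, then threshold to obtain the binary
    segmentation of the referred object."""
    feats = encode_image(image)              # (H, W, D)
    text = encode_expression(tokens, vocab)  # (D,)
    scores = feats @ text                    # (H, W)
    return scores > threshold

vocab = {"the": 0, "bright": 1, "region": 2}
image = rng.random((4, 4))
mask = refer_segment(image, ["the", "bright", "region"], vocab)
print(mask.shape, mask.dtype)
```

In a real REC model the two encoders are trained jointly and the fusion step is far richer (e.g., attention between words and image regions), but the overall shape of the computation is the same: two modality-specific encoders feeding a fusion module that outputs a per-pixel mask.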
This thesis is presented as a self-contained document that can be understood by a reader with no prior knowledge of machine learning. The bulk of the work consists of an exhaustive study of the REC task: from its applications, through a complete description of the current state of the art, to the study, comparison, and implementation of models. In addition, a functional, free, and public web page is presented that allows simple interaction with the model described in this work.
Keywords
Referring Expression Comprehension • Artificial Intelligence • Machine Learning • Deep Learning • Computer Vision • Natural Language Processing • Multimodal Learning
Mathematics Subject Classification
68T45
Thesis
Access to full text: Referring Expression Comprehension - David Álvarez Rosa.pdf
Slides: Referring Expression Comprehension (Slides) - David Álvarez Rosa.pdf
Source code: Git repository