Originally published on August 30, 2018 by SingularityNet.IO
Exploring a hybrid approach to visual question answering through deeper integration of OpenCog and a Vision Subsystem.
Let us imagine a scenario in which Sophia, the social humanoid robot, is asked a simple question by someone:
“Sophia, is it raining?”
If Sophia says “yes” to the question, does she know why she gave that answer? In other words, how does Sophia answer the question?
The ability to answer questions about visual scenes, that is, to perform Visual Question Answering (VQA), comes naturally to humans. However, current state-of-the-art VQA models leave much to be desired.
One of the control systems used to operate Sophia is OpenCog, a cognitive architecture. OpenCog operates over a knowledge base represented as a hypergraph called Atomspace. For Sophia to accurately answer questions about visual scenes, the content of those scenes needs to be made accessible to OpenCog.
In an earlier research article, we discussed how the simplest way to achieve this would be to process images with a Deep Neural Network (DNN) and insert the resulting descriptions into Atomspace. One example of such a DNN would be YOLO, which describes an image as a set of labeled bounding boxes.
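To make this concrete, here is a minimal sketch (not the project's actual pipeline) of how YOLO-style detections might be rendered as Atomese, OpenCog's textual notation for Atomspace contents. The detection dictionary format, the `bounding-box` predicate, and the node-naming scheme are all illustrative assumptions.

```python
def detections_to_atomese(image_id, detections):
    """Render labeled bounding boxes as Atomese s-expressions (sketch)."""
    atoms = []
    for i, det in enumerate(detections):
        label = det["label"]
        x, y, w, h = det["box"]
        inst = f"{image_id}-{label}-{i}"  # one unique node per detected object
        # Link the detected instance to its object class.
        atoms.append(
            f'(InheritanceLink (ConceptNode "{inst}") (ConceptNode "{label}"))'
        )
        # Attach the bounding box coordinates via a hypothetical predicate.
        atoms.append(
            f'(EvaluationLink (PredicateNode "bounding-box") '
            f'(ListLink (ConceptNode "{inst}") '
            f'(NumberNode "{x}") (NumberNode "{y}") '
            f'(NumberNode "{w}") (NumberNode "{h}")))'
        )
    return atoms

atoms = detections_to_atomese(
    "img42", [{"label": "umbrella", "box": (10, 20, 50, 80)}]
)
for a in atoms:
    print(a)
```

Each detection thus becomes a pair of atoms: an InheritanceLink stating what kind of object was seen, and an EvaluationLink recording where it was seen, which is the kind of flat scene description the next paragraph argues is useful but insufficient.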
Although such a simple approach can be useful for semantic image retrieval, it will not be sufficient for answering arbitrary visual questions.