Image Caption Technical Report
Submitted by:
(101403022) Ajay Kumar Chhimpa
(101403023) Akash Gupta
(101403024) Akash Kumar Sikarwar
(101583005) Ayush Garg
Aim
The aim of this project is to develop a digital assistant that can generate descriptive
captions for images using neural language models. The digital assistant answers the user's
questions, which are given as spoken commands.
Intended audience
This project can act as vision for visually impaired people, as it can identify nearby
objects through the camera and give the output in audio form. The app provides a highly
interactive platform for specially abled people.
Project Scope
The goal is to design an Android application that covers all the functions of image
description and provides the user with a digital assistant interface. The digital assistant
answers the user's questions, which are given as spoken commands.
Using deep learning techniques, the project performs the following:
The purpose of this model is to encode the visual information from an image and the semantic
information from a caption into a common embedding space; this embedding space has the property
that vectors close to each other are visually or semantically related. For a batch of
images and captions, we can use the model to map them all into this embedding space,
compute a distance metric, and find the nearest neighbours of each image and each caption.
Ranking the neighbours by distance ranks how relevant images and captions are to each other.
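As a minimal sketch of this ranking step, the snippet below uses random vectors as stand-ins for the encoder outputs (the 300-dimensional embedding space matches the one described later in this document; the batch size of 5 is illustrative). It maps a batch of image and caption embeddings to a pairwise cosine-similarity matrix and ranks, for each image, the nearest captions:

```python
import numpy as np

def cosine_similarity(a, b):
    # Pairwise cosine similarity between rows of a and rows of b.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(5, 300))    # 5 images in a 300-d embedding space
caption_emb = rng.normal(size=(5, 300))  # 5 captions in the same space

sim = cosine_similarity(image_emb, caption_emb)  # shape (5, 5)
# For each image, rank captions from most to least similar.
ranking = np.argsort(-sim, axis=1)
```

In a trained model the rows of `image_emb` and `caption_emb` would come from the image and caption encoders, so `ranking[i]` would list caption indices by relevance to image `i`.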
Previous work
Traditionally, pre-defined templates have been used to generate captions for images, but this
approach is very limited because it cannot produce lexically rich captions.
Research on caption generation has surged since advances in training neural networks and the
availability of large classification datasets. Most of the related work is based on training
deep recurrent neural networks. The first use of neural networks for generating image captions
was proposed by Kiros et al. [4], who used a multimodal log-bilinear model biased by the
features obtained from the input image.
Karpathy et al. [3] developed a model that generates text descriptions for images based on
labels in the form of sets of sentences and images. They use multimodal embeddings to
align images and text based on a ranking model they propose. Their model was evaluated
in both full-frame and region-level experiments, and their Multimodal Recurrent Neural
Network architecture outperformed retrieval baselines.
In our project we have used a Convolutional Neural Network coupled with an LSTM-based
architecture. An image is passed as input to the CNN, which yields a set of annotation
vectors. Based on a notion of attention inspired by human vision, a context vector is
computed as a function of these annotation vectors and passed as input to the LSTM.
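The attention step can be sketched as below. This is an illustrative soft-attention computation, not necessarily the project's exact formulation: the dot-product scoring, the 196 (14×14) annotation locations, and the 512-dimensional annotation vectors are assumptions.

```python
import numpy as np

def soft_attention(annotations, query):
    # annotations: (L, D) annotation vectors from the CNN feature maps.
    # query: (D,) vector, e.g. the previous LSTM hidden state.
    scores = annotations @ query              # relevance score per location
    weights = np.exp(scores - scores.max())   # numerically stable softmax
    weights /= weights.sum()
    context = weights @ annotations           # weighted sum = context vector
    return context, weights

rng = np.random.default_rng(1)
ann = rng.normal(size=(196, 512))  # e.g. a 14x14 feature map, 512 channels
h = rng.normal(size=(512,))        # stand-in for an LSTM hidden state
context, alpha = soft_attention(ann, h)
```

At each decoding step the LSTM would receive `context`, so the weights `alpha` determine which image regions the model attends to while emitting the next word.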
Methodology
Annotation vector extraction
Figure 5 illustrates how a CNN extracts feature vectors from an image. Given an input
image of size 24×24, the CNN generates 4 matrices by convolving the image with 4 different
filters (one filter over the entire image at a time). This yields 4 sub-images, or feature maps,
of size 20×20. These are then subsampled to decrease the size of the feature maps. The
convolution and subsampling procedures are repeated at subsequent stages. After a certain
number of stages, the 2-dimensional feature maps are converted to a 1-dimensional vector
through a fully connected layer. This 1-dimensional vector can then be used for classification
or other tasks. In our work, we use the feature maps (not the 1-dimensional hidden vector),
called annotation vectors, to generate context vectors.
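The shape arithmetic above can be verified with a small sketch. The 5×5 filter size is an assumption inferred from a 24×24 input producing 20×20 valid-convolution outputs, and 2×2 average pooling stands in for the unspecified subsampling step:

```python
import numpy as np

def conv2d_valid(image, kernel):
    # 'valid' 2-D convolution: output is (H - kH + 1, W - kW + 1).
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kH, j:j + kW] * kernel)
    return out

def subsample2x2(fmap):
    # 2x2 average pooling halves each spatial dimension.
    H, W = fmap.shape
    return fmap[:H // 2 * 2, :W // 2 * 2].reshape(H // 2, 2, W // 2, 2).mean(axis=(1, 3))

rng = np.random.default_rng(2)
image = rng.normal(size=(24, 24))
kernels = rng.normal(size=(4, 5, 5))               # four 5x5 filters (assumed size)
fmaps = [conv2d_valid(image, k) for k in kernels]  # four 20x20 feature maps
pooled = [subsample2x2(f) for f in fmaps]          # four 10x10 maps after subsampling
```

Repeating these two operations shrinks the maps stage by stage, exactly as the text describes, until a fully connected layer flattens them.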
For the image-caption relevancy task, recurrent neural networks help accumulate the
semantics of a sentence. Sentences are parsed into words, each of which has a
GloVe vector representation found in a lookup table. These word vectors are fed into a
recurrent neural network sequentially, and the network captures the semantic meaning of
the entire sequence of words in its hidden state. We treat the hidden state after the
recurrent net has seen the last word in the sentence as the sentence embedding.
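A minimal sketch of this sentence-embedding step follows, using a plain Elman RNN with randomly initialised weights in place of the trained recurrent network; the 50-dimensional GloVe vectors, the 300-dimensional hidden state, and the random "word vectors" are all assumptions for illustration:

```python
import numpy as np

def rnn_sentence_embedding(word_vectors, W_h, W_x, b):
    # Feed word vectors one at a time through a simple (Elman) RNN and
    # return the final hidden state as the sentence embedding.
    h = np.zeros(W_h.shape[0])
    for x in word_vectors:
        h = np.tanh(W_h @ h + W_x @ x + b)
    return h

rng = np.random.default_rng(3)
d_word, d_hidden = 50, 300  # e.g. 50-d GloVe vectors, 300-d embedding
W_h = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_x = rng.normal(scale=0.1, size=(d_hidden, d_word))
b = np.zeros(d_hidden)

sentence = rng.normal(size=(7, d_word))  # 7 words, each a GloVe lookup stand-in
embedding = rnn_sentence_embedding(sentence, W_h, W_x, b)
```

In the real system the lookup table would map each token to its pre-trained GloVe vector, and the recurrent weights would be learned jointly with the image encoder.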
Requirement Analysis:
Use Case Diagram:
Use Case: User Login.
Description:
User enters the username and password for authentication.
Level: Low Level
Primary Actor:
Application User
Pre-Conditions:
User should be registered.
User should have entered the username and password.
Post Conditions:
Minimal Guarantee:
The user's username and password are encrypted.
Trigger:
Unauthorized user opens the app.
Frequency:
Once, unless logged out.
Use Case: User Registration.
Description:
User makes an account in the application.
Primary Actor:
Application User
Pre-Conditions:
App is opened and any other user is not logged in.
The user information is valid in registration form.
Post Conditions
Minimal Guarantee:
Only through valid details the user gets registered.
Two users cannot register with the same username.
Trigger:
Unauthorized user opens the app.
Frequency:
Once, unless another user wants to create an account.
Use Case: Image upload by the user.
Description:
User selects a particular Image from the Phone Gallery or Clicks the image through
Camera.
Primary Actor:
Application User
Pre-Conditions:
User must be logged in.
Post Conditions
Minimal Guarantee:
The file will only be uploaded if it is valid.
Trigger
User starts the Image Captioning process by clicking the Image Captioning button.
Frequency:
About 10 times per hour.
Use Case: Speech Recognition.
Description:
Speech recognition is an extension of the overall application and part of the digital
assistant. As the user speaks, the speech is recognized and can be used for Google
search and other actions.
Level: Sub-Function
Primary Actor:
Application User
Pre-Conditions:
User must be logged in.
Speech Button is selected.
Post Conditions:
Minimal Guarantee:
The user will be notified of the error.
Trigger:
User starts the Speech Recognition process by clicking the Speech Recognition button.
Frequency:
Once a day.
Use Case: Caption Receipt.
Description:
User receives the description of the image they uploaded. The user can get the description
in speech or text form.
Primary Actor:
Application User
Pre-Conditions:
Image must be uploaded.
Captioning algorithm applied.
Post Conditions:
Trigger:
Image upload by the user.
1. Introduction
Purpose
Apps in the modern world are unamenable to specially abled humans, who have a
hard time interacting with them.
To design an app that can describe images in a meaningful way in the form of
speech.
An app that takes input in the form of voice and returns results to the user in the form
of voice.
Project Scope
The goal is to design an Android application that covers all the functions of image
description and provides the user with a digital assistant interface. The digital assistant
answers the user's questions, which are given as spoken commands.
Using deep learning and natural language processing techniques, the project performs:
1. Image captioning: recognising the different types of objects in an image and creating
a meaningful sentence that describes the image to visually impaired persons.
2. Text-to-speech conversion.
3. Speech-to-text conversion and identifying results for the user's query.
References
https://developer.android.com
mscoco.org/dataset/
http://cs.stanford.edu/people/karpathy/deepimagesent/flickr8k.zip
2. Overall Description
Product Perspective
Our project, named AISH, is a self-contained project which aims at recognizing the
objects in an image and then describing the image completely in a meaningful way. This
project can act as vision for visually impaired people, as it can identify nearby objects
through the camera and give the output in audio form. The app provides a highly
interactive platform for specially abled people. The app implements:
an image encoder: a linear transformation from the 4096-dimensional image feature
vector to a 300-dimensional embedding space
a caption encoder: a recurrent neural network which takes word vectors as input at
each time step, accumulates their collective meaning, and outputs a single semantic
embedding by the end of the sentence
a cost function that computes a similarity metric, namely cosine similarity, between
image and caption embeddings
[Figure: system diagram of the image encoder and caption encoder working to map the data
into a visual-semantic embedding space]
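A common way to train such a pair of encoders is a max-margin ranking cost built on cosine similarity. The sketch below assumes that formulation; the margin value and the random stand-in embeddings are illustrative, not the project's actual settings:

```python
import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def ranking_cost(image_emb, caption_emb, margin=0.2):
    # Max-margin cost: each matching image/caption pair (i, i) should score
    # higher, by at least `margin`, than every mismatched pair (i, j).
    n = len(image_emb)
    cost = 0.0
    for i in range(n):
        s_pos = cosine(image_emb[i], caption_emb[i])
        for j in range(n):
            if j != i:
                # contrastive captions for image i, contrastive images for caption i
                cost += max(0.0, margin - s_pos + cosine(image_emb[i], caption_emb[j]))
                cost += max(0.0, margin - s_pos + cosine(image_emb[j], caption_emb[i]))
    return cost

rng = np.random.default_rng(4)
img = rng.normal(size=(4, 300))
aligned_cap = img + 0.01 * rng.normal(size=(4, 300))  # nearly aligned pairs
random_cap = rng.normal(size=(4, 300))                # unrelated "captions"
low = ranking_cost(img, aligned_cap)
high = ranking_cost(img, random_cap)
```

Minimising this cost pulls matching image and caption vectors together in the embedding space while pushing mismatched pairs apart, which is exactly the nearest-neighbour property described above.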
Product Features
The goal is to design a multi-utility virtual assistant that any user, without restriction,
can use to simplify simple daily tasks, with entertainment applications available through
voice and a single click. The app specifically targets visually impaired people. Brief
knowledge of smartphones is required, as the app receives input in the form of speech.
English is used for interaction.
Operating Environment
The software has to be integrated onto the user's smartphone, which in turn has
extremely limited support for machine learning APIs.
The interface is to be built keeping in mind that it can be easily operated by visually
impaired people.
3. System Features
User registration<Priority:1>
Description
Authenticate and log the user into the system. The input can be voice or text.
New users should be able to register with the system.
A registered user should be able to change his password if he forgets it.
A registered user should be able to update his profile.
Stimulus/Response Sequence
User opens the app.
Speaks the command to register.
Enters the details.
Speaks the command to sign up.
Speech Recognition<Priority:1>
When the app is opened, the speech recognizer runs in the background,
listening for user input.
Image Captioning<Priority:1>
User receives the description of the image uploaded by them. The user can get the
description in speech or text form.
Stimulus/Response Sequence
Open the app.
Click Image Captioning Button or speak the command.
Click or Upload the Image.
Image caption is generated.
Text to speech<Priority:1>
The text generated after image captioning is described to the user through speech.
1. USER INTERFACES:
Login Activity:
The user interacts with this activity for authentication, which is required for storing
the user information and captioned content on the server so that the user can
access them later.
The login activity requires username and password fields.
If the login fails, the user is notified by an error message and error speech.
Signup Activity:
The user registers through this activity. User details such as name, email and phone
number are requested. The user is notified if another user already has the same username
or if there is any other data validation error.
Tabbed Activity:
Tab 1: Captioning
This tab contains the interface for capturing and uploading the image. The user
gives a voice command to capture the image and upload it for captioning, or
presses the button and follows the procedure.
The output is text describing the image, shown in the text view. The text is then read
out to the user. The captioning algorithm runs in the background.
The tab has relative layout.
Tab 2: Tools:
Tools tab contains additional features like -
Loading images from the user's social media accounts on Facebook or Instagram
for captioning.
Reading news or weather report.
Sending suggestions
The tab has linear list view layout for listing the features.
Tab 3: Profile
Profile has textviews and edittext for changing the personal information.
There is a button for logout.
2. HARDWARE INTERFACES:
3. SOFTWARE INTERFACES:
Operating System: Android
Language: Java
Database: MySQL database.
Libraries:
Keras and TensorFlow for deep learning models.
Retrofit library for communicating with the MySQL database.
CloudRail for API integration with multiple social media sites.
a. Performance Requirements
Performance: The software is designed for the smartphone and cannot run from a
standalone desktop PC. The software will support simultaneous user access only if there are
multiple terminals. Only voice information will be handled by the software. The amount of
information to be handled can vary from user to user.
Usability: The software has a simple GUI and is easy to use. It has been designed in such a
way that visually impaired people can use it with minimal problems. The voice
commands are particularly helpful for them.
Reliability: The reliability of the software depends entirely on the availability of the
server. As long as the server is available, the software will work without a problem.
Security: Credentials of the user are encrypted and the application is accessible only to
authenticated users.
Manageability: Once the image captioning algorithm is devised, no frequent changes will
be required; the software is easily manageable.
Appendix A: Glossary
Table 1 explains the most commonly used terms in this SRS document.
Abbreviations
Table 2 gives the full forms of the most commonly used abbreviations in this SRS document.
WBS:
Section 4: Design Specifications
[1] COLLOBERT, R., WESTON, J., BOTTOU, L., KARLEN, M., KAVUKCUOGLU, K.,
AND KUKSA, P. Natural language processing (almost) from scratch. The Journal of Machine
Learning Research 12 (2011).
[3] KARPATHY, A., AND FEI-FEI, L. Deep visual-semantic alignments for generating image
descriptions. arXiv preprint arXiv:1412.2306 (2014).
[4] KIROS, R., SALAKHUTDINOV, R., AND ZEMEL, R. Multimodal neural language models.
In Proceedings of the 31st International Conference on Machine Learning (ICML-14)
(2014), T. Jebara and E. P. Xing, Eds., JMLR Workshop and Conference Proceedings,
pp. 595-603.
[5] SIMONYAN, K., AND ZISSERMAN, A. Very deep convolutional networks for large-scale
image recognition. CoRR abs/1409.1556 (2014).