Show and Tell Using a Computer: A Deep Learning Approach

Artificial Intelligence (AI) is making its way into every field by improving many kinds of work. The field is grounded in understanding cognition, learning, and teaching in a human-like way, and it is a topic many people are eager to understand. Artificial Neural Networks (ANNs) form an important part of AI: an ANN is an attempt to simulate the function of the biological neurons in the human brain, and such networks now underpin AI systems everywhere. In this report, we share our experience working with Artificial Neural Networks.
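The core idea of simulating a biological neuron can be sketched as a single artificial neuron: a weighted sum of inputs plus a bias, passed through an activation function. The function below is an illustrative toy, not part of the project's actual model:

```python
import math

def neuron(inputs, weights, bias):
    """A single artificial neuron: a weighted sum of inputs passed
    through a sigmoid activation, loosely mimicking how a biological
    neuron fires once its combined stimulus crosses a threshold."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))
```

Stacking many such units in layers, and learning the weights from data, is what turns this simple mechanism into a deep neural network.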

This deep learning application can benefit people with visual impairments. Well-trained models can be deployed within screen readers to help visually challenged users understand the context and contents that an image may be trying to communicate. A screen reader is a type of assistive technology, realised as software or as specially designed hardware devices, that visually challenged people use to navigate the web; it is also useful for general operation of electronic devices such as mobile phones, computers, and tablets. The main goal of screen-reading assistive technology is to encode and transcribe visual data into synthesised speech. If the screen reader supports braille output, it generates the relevant braille patterns, which the user reads by tracing their fingers over the cells.
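As a minimal sketch (not a full braille implementation, and not part of the project's code), mapping a generated caption to Grade 1 braille cells could look as follows; the mapping covers lowercase letters and the space character only:

```python
# Grade 1 braille cells for the lowercase Latin alphabet plus space,
# using the Unicode Braille Patterns block (U+2800-U+28FF).
BRAILLE = dict(zip(
    "abcdefghijklmnopqrstuvwxyz ",
    "⠁⠃⠉⠙⠑⠋⠛⠓⠊⠚⠅⠇⠍⠝⠕⠏⠟⠗⠎⠞⠥⠧⠺⠭⠽⠵ ",
))

def to_braille(text):
    """Translate text to braille cells, skipping any character
    outside the simple mapping above (digits, punctuation, etc.)."""
    return "".join(BRAILLE.get(ch, "") for ch in text.lower())
```

A refreshable braille display would then render each cell as raised dots; a production screen reader would of course handle numbers, punctuation, and contractions as well.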

Our final-year Master's project is an image captioning system that produces output as both text and speech. Our web application can identify most images and caption them correctly with decent accuracy, but a new image may not always be identified accurately by the model. The reason for this is a lack of training image data; as a solution, we propose that future versions of the application train the model on a broader range of data sets. We implemented the Transformer deep learning model in our web application as the primary model for training on and captioning large sets of images from the data sets. We used the Django framework and Python for the back end. We trained the model on a high-performance GPU, an RTX5000 with 30GB of RAM, using the software-as-a-service platform provided by PaperSpace. We built the prototype using JavaScript for client-side dynamic interactions and form submissions, with Django handling the web server. Finally, we deployed the web application using the Heroku deployment service by SalesForce and synchronised the code to the UbiOps inference system, which processes the user's input image and leverages the training on the data set to produce accurate and descriptive captions.
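At inference time, a Transformer captioning model typically generates a caption one token at a time. The toy sketch below illustrates that greedy decoding loop; `next_token_scores` is a hypothetical stand-in, where a real model would attend over the image features and the tokens generated so far:

```python
VOCAB = ["<start>", "a", "dog", "running", "<end>"]

def next_token_scores(image_features, tokens):
    # Toy stand-in that walks through a fixed caption. A real
    # Transformer decoder would score every vocabulary token by
    # attending over image_features and the generated prefix.
    order = ["<start>", "a", "dog", "running", "<end>"]
    target = order[min(len(tokens), len(order) - 1)]
    return [1.0 if tok == target else 0.0 for tok in VOCAB]

def greedy_caption(image_features, max_len=20):
    """Generate a caption by repeatedly appending the highest-scoring
    next token until the end marker (or a length limit) is reached."""
    tokens = ["<start>"]
    while tokens[-1] != "<end>" and len(tokens) < max_len:
        scores = next_token_scores(image_features, tokens)
        tokens.append(VOCAB[scores.index(max(scores))])
    return " ".join(tokens[1:-1])  # drop <start> and <end>
```

In practice, beam search is often used instead of pure greedy decoding, trading generation speed for better captions.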

Creating an image caption generation application using deep learning, together with a front end built on the respective technologies, has been a challenging yet fulfilling endeavour that allowed the members of this team not only to learn and implement new things, but also to build character and learn to work as a team. Producing a well-trained deep learning model that can discern image features while using those features to generate grammatically correct, semantically meaningful sentences is a notable achievement. This project serves as a stepping stone for future developments and advancements in web technologies and deep learning, as there is always scope for improvement, be it in image feature extraction, the generation speed of the model, or the quality of the captions it produces. We hope this project marks a milestone in the years to come, and that the hard work and creative ingenuity behind it will be long remembered.