
Do it for the ‘gram: Instagram-style Caption Generator

source link: https://towardsdatascience.com/do-it-for-the-gram-instagram-style-caption-generator-4e7044766e34?gi=e337a744d5bd

Generating captions for Instagram photos using a Keras CNN-RNN framework


Dec 3 · 9 min read

Authors: Camille Bowman, Sejal Dua, Erika Nakagawa

1. Introduction

Image captioning refers to the Deep Learning application of generating a textual description of an image using Natural Language Processing (NLP) and Computer Vision (CV). This task requires an algorithm to not only understand the content of the image, but also to generate language that connects to its interpretation. We wanted to take this challenge one step further by generating captions specifically for Instagram pictures.

How would you describe the picture below?

[Photo: the Instagram post by @sejaldua99 discussed below]

A simple caption generator could describe the image as something along the lines of “four friends in a pink room”. Microsoft’s CaptionBot answered with “I think it’s a group of people posing for a camera”. Isn’t that how you would describe it as well? But if you wanted to post this image on Instagram, would that be the caption for it?

Instagram captions tend to be more advanced than a simple descriptor; they consist of puns, inside jokes, lyrics, references, sentiment, and sarcasm. In some cases, the caption may not be relevant to the presented image at all. Case in point: user @sejaldua99 posted the above image with the caption “aarhus goes way 2 crazzyyy”.

Our work aims to generate captions that follow a specific style with specific vocabulary and expressions. To achieve this, our model consists of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN).

2. Prerequisites

This blog post assumes familiarity with basic Deep Learning concepts such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), gradient descent, backpropagation, overfitting, probability, Python syntax and data structures, the Keras library, and TensorFlow.

3. Data Collection

We scraped Instagram using a profile crawler powered by Selenium that we forked from GitHub user @timgrossmann. The scraper built JSON objects for each Instagram user, including post information, location information, comments on each post, the timestamp of each user interaction, and the number of likes, among other things. We wrote a simple script to parse each JSON in our scraped profile directories, and from there, we extracted Instagram photos and captions.
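
A minimal sketch of that parsing step, assuming one JSON file per scraped profile; the field names used here (posts, image_url, caption) are hypothetical placeholders for whatever keys the crawler actually writes:

```python
import json
from pathlib import Path

def extract_posts(profile_dir):
    """Yield (image_url, caption) pairs from a directory of scraped profile JSONs.

    NOTE: the keys 'posts', 'image_url', and 'caption' are hypothetical;
    the real names depend on the crawler's output schema.
    """
    for json_path in Path(profile_dir).glob("*.json"):
        with open(json_path, encoding="utf-8") as f:
            profile = json.load(f)
        for post in profile.get("posts", []):
            image_url = post.get("image_url")
            caption = post.get("caption")
            # Keep only posts that have both an image and a caption
            if image_url and caption:
                yield image_url, caption
```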

Our dataset consists of posts from Instagram users that follow our accounts: @camillebowman, @sejaldua99, and @erikaaanakagawa. Because the scraper required logging into Instagram via headless Chrome, we were only able to access information from users we follow and from public profiles.


network visualization of the data source: 3 core nodes and one degree of follower connections

The network above depicts the scope of our data. It is worth noting that because we collected our data from three largely distinct spheres of people, we have introduced some slight bias. It would not be too surprising, for example, if our neural network generated captions relating to “Johns Hopkins”, “Middlebury”, or “Tufts”, because we each follow a lot of users from our own college communities, and those users happen to post about the events happening on their respective campuses.

We collected data over the course of two weeks, scraping user profiles all day and all night (my computer is feeling very sleep-deprived). We managed to acquire 100 posts from just under 3,000 users, but not all of the data from these users made it into our CSV file. It turns out our scraper fails fairly often (23% of the time, to be precise) due to ChromeDriver variations, scroll-rate errors, and Instagram updates since the time the scripts were written.

All factors considered, the profile crawler we used was still very much functional and helped us procure an enormous dataset with which to train our neural network.

4. Data Cleaning


As with any data science project, 80% of the work is acquiring the data, storing it, parsing it, and cleaning it. The cleaning of the JSON involved a file system traversal and some really elaborate conditional logic.


summary report: composition of the dataset

We made some deliberate choices to construct a dataset that we thought would minimize bias and offer as diverse a collection of image-caption pairs as possible. The design choices were as follows:

  • no image, no caption, collections: We got rid of any posts without captions and discarded collections and videos. This unfortunately eliminated roughly 30% of the uncleaned dataset, but relative to our large volume of posts (around 130,000 in total), this was not too costly.
  • too few posts: We excluded users with fewer than 10 posts because the impact of a user with, say, 3 posts versus a user with 100 posts seemed off by an order of magnitude, and we wanted to correct for any bias that may have introduced. After throwing out users with too few posts, the median number of posts per user shot up from 21 to 46, which is much closer to the theoretical median.
  • low num followers: Instagram is a platform on which people like to tell funny stories and share worthwhile life updates. Younger Instagram users often have two accounts: one they use to share photos more publicly, and another they use to share photos and stories with closer friends. We wanted to rule out these smaller, more private accounts because the captions from those types of accounts almost never match the corresponding image. For this reason, we threw out all post data from users with fewer than 200 followers.
  • non-english: While neural networks have the potential to be incredibly intelligent, one thing we wanted to avoid was confusing our model with captions in different languages. Imagine how much time and effort it takes a human being to become fluent in more than 5 languages. We figured that mixing languages would substantially increase our training time, so we installed a library called pyenchant, which checks whether a given word is in the English dictionary. We stipulated that if the text portion of a caption (excluding hashtags and mentions) is longer than 5 words and more than 80% of those words are not in the English dictionary, we would designate the caption as “non-English” and discard the image-caption pair (a rough sketch of this check follows below). Only around 300 posts met this condition, and we think we were able to retain captions resembling “update: i am moving to france. c’est la vie.” Click the link to view all of the non-English captions that were discarded.
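
Here is a rough sketch of that language check using pyenchant's English dictionary; the punctuation stripping and threshold handling shown here are illustrative, not the exact script we ran:

```python
import enchant  # pyenchant

english = enchant.Dict("en_US")

def is_non_english(caption, min_words=5, threshold=0.8):
    # Keep only the plain-text portion: drop hashtags, mentions, and punctuation
    words = [w for w in caption.split() if not w.startswith(("#", "@"))]
    words = [w.strip(".,!?:;\"'()") for w in words]
    words = [w for w in words if w.isalpha()]
    if len(words) <= min_words:
        return False  # too short to judge reliably; keep the caption
    # Fraction of words that the English dictionary does not recognize
    unknown = sum(1 for w in words if not english.check(w.lower()))
    return unknown / len(words) > threshold
```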


pie chart representation of the dataset (blue represents viable data, red represents discarded data)

After tossing out roughly 38% of the dataset, we decided to briefly look into the composition of our viable data. Instagram captions are unique in their highly fluid structure: a caption can consist of alphabetical text, special characters, hashtags, mentions, emojis, or some combination of the above. We wanted to quantify what we were working with, so we parsed each caption for characters like “#” and “@” and used the Python emoji library to detect the Unicode symbols that denote emojis. In order from highest to lowest prevalence, captions contained emojis, then hashtags, then mentions, then quotes. A little less than half of all usable Instagram captions did not contain any meaningful special characters.
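
The sketch below shows one way such a breakdown could be computed; the emoji check is illustrative (here we detect emojis by seeing whether emoji.demojize changes the string), and the exact quote detection is an assumption:

```python
import emoji
from collections import Counter

def caption_composition(captions):
    # Tally how many captions contain each kind of special content
    counts = Counter()
    for cap in captions:
        has_special = False
        if "#" in cap:
            counts["hashtag"] += 1
            has_special = True
        if "@" in cap:
            counts["mention"] += 1
            has_special = True
        if '"' in cap or "\u201c" in cap:
            counts["quote"] += 1
            has_special = True
        # demojize rewrites emojis as :name: text, so any change means
        # the caption contained at least one emoji
        if emoji.demojize(cap) != cap:
            counts["emoji"] += 1
            has_special = True
        if not has_special:
            counts["no special characters"] += 1
    return counts
```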

5. Network Architecture

Our network was inspired by Jason Brownlee’s How to Develop a Deep Learning Photo Caption Generator from Scratch article.

We combined a Convolutional Neural Network for image classification with a Recurrent Neural Network for sequence modeling to create a single neural network that generates Instagram captions for images.

In other words, we used a 16-layer Oxford Visual Geometry Group (VGG) model pre-trained on the ImageNet dataset to interpret the content of the photos. We removed the last layer of the CNN and collected the extracted feature vectors to use as input to an RNN decoder that generates captions.

We also used a tokenizer to create an “alphabet” of all the characters that existed in our body of captions. Unlike in Jason Brownlee’s article, we decided to tokenize at the character level instead of the word level and to include punctuation and capitalization. We thought this would better reflect the captions we were attempting to generate, as it allowed us to preserve key aspects of a caption (like emoticons and hashtags) and gave our network a better chance of generating emojis. We then encoded each caption by transforming it into an array of numbers, with each index representing a character. Finally, we wrapped each encoded caption in a start sequence and a stop sequence to facilitate training.

preprocessing images and obtaining feature vectors from the CNN
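
A minimal sketch of this step, using Keras’ pre-trained VGG16 with its final classification layer removed; the preprocessing shown here follows the standard Keras recipe and is an approximation of our script:

```python
import numpy as np
from tensorflow.keras.applications.vgg16 import VGG16, preprocess_input
from tensorflow.keras.preprocessing.image import load_img, img_to_array
from tensorflow.keras.models import Model

def build_feature_extractor():
    # Pre-trained VGG16 with the final classification layer removed,
    # so the output is the 4096-dimensional penultimate activation
    vgg = VGG16()
    return Model(inputs=vgg.inputs, outputs=vgg.layers[-2].output)

def extract_features(image_path, extractor):
    # Resize to VGG16's expected 224x224 input and apply its preprocessing
    img = load_img(image_path, target_size=(224, 224))
    x = img_to_array(img)
    x = preprocess_input(np.expand_dims(x, axis=0))
    return extractor.predict(x, verbose=0)[0]  # shape: (4096,)
```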
tokenizing captions on a character level
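
And a sketch of the character-level tokenization, using Keras’ Tokenizer with char_level=True; the tab and newline start/stop markers are an assumption (any characters not present in the captions would do):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# Assumed start/stop markers; any unused characters work
START, STOP = "\t", "\n"

def build_char_tokenizer(captions):
    # Wrap each caption in start/stop markers, then fit a character-level
    # tokenizer that keeps punctuation and capitalization
    wrapped = [START + cap + STOP for cap in captions]
    tokenizer = Tokenizer(char_level=True, filters="", lower=False)
    tokenizer.fit_on_texts(wrapped)
    return tokenizer, wrapped

def encode_captions(tokenizer, wrapped):
    # Each caption becomes an array of integers, one index per character
    return tokenizer.texts_to_sequences(wrapped)
```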

We created input-output vectors for each caption, where the input was the image’s feature vector plus the first n characters of the caption, and the output was the (n+1)th character of the caption. We made one input-output pairing for each character in the caption vector: the first pairing was the image and the start sequence as input with the first character of the caption as output, and the final pairing was the image and the entire caption as input with the stop sequence as output. To keep the number of input-output pairings from exploding, and to limit the influence of any one caption (we were only able to train on 10,000 images), we only used image-caption pairings whose captions were no more than 60 characters long.
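A sketch of that pairing step; here vocab_size would be the number of characters in the tokenizer’s index plus one, and max_len the length of the longest encoded caption (the names and padding details are assumptions, not our exact code):

```python
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

def make_training_pairs(feature, encoded_caption, max_len, vocab_size):
    # One training sample per character: the first i characters (plus the
    # image feature vector) predict the (i+1)th character
    X_img, X_seq, y = [], [], []
    for i in range(1, len(encoded_caption)):
        in_seq = pad_sequences([encoded_caption[:i]], maxlen=max_len)[0]
        out_char = to_categorical([encoded_caption[i]], num_classes=vocab_size)[0]
        X_img.append(feature)
        X_seq.append(in_seq)
        y.append(out_char)
    return np.array(X_img), np.array(X_seq), np.array(y)
```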

We then trained a Long Short-Term Memory (LSTM) decoder as a language model that took the feature vector and encoded character arrays as input, and produced an encoded character as output.

building the layers of the neural network
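
The layer structure below is a hedged reconstruction following the merge architecture from Brownlee’s article, adapted to character-level inputs; the exact layer sizes are assumptions:

```python
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, Dense, Dropout, Embedding, LSTM, add

def define_model(vocab_size, max_len, feature_dim=4096, units=256):
    # Image-feature branch: compress the 4096-d VGG vector
    img_in = Input(shape=(feature_dim,))
    img_branch = Dense(units, activation="relu")(Dropout(0.5)(img_in))

    # Character-sequence branch: embed each character index, then run an LSTM
    seq_in = Input(shape=(max_len,))
    seq_branch = Embedding(vocab_size, units, mask_zero=True)(seq_in)
    seq_branch = LSTM(units)(Dropout(0.5)(seq_branch))

    # Decoder: merge both branches and predict the next character
    merged = add([img_branch, seq_branch])
    out = Dense(vocab_size, activation="softmax")(Dense(units, activation="relu")(merged))

    model = Model(inputs=[img_in, seq_in], outputs=out)
    model.compile(loss="categorical_crossentropy", optimizer="adam")
    return model
```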


summary report: structure of our model

6. Results & Conclusion

We ran multiple iterations with varying parameters. In every iteration, our network ended up getting stuck in a local minimum and generated the same caption no matter what image it was fed. Here are some examples of the single caption generated by some of our training iterations:

  • summer things and I can’t wait to be a second-year (little summer?? I’m so much love you sae you are
  • summer is summer :/D
  • The world with the best weekend #thetan.
  • Happy birthday to the state of my favorite place! :purple_heart::kissing_heart:

Although the predicted results lack diversity, they show that our network was able to learn and pick up the language of our Instagram followers. It used hashtags, emoticons, punctuation, and emojis, and these captions are mostly composed of legitimate English words. We believe that our largest limitation in terms of model quality was how long the model took to train. With 1,000 captions of any length, the model took 2 days to train. Limiting the caption length to 60 characters or fewer allowed us to use 10,000 images and decreased the training time to 10 hours. But needing to let the model train overnight or all day limited our ability to test multiple iterations and troubleshoot the bugs we were encountering. It also prevented us from training with more epochs and on a larger subset of the data, which are two strategies we used in class to improve performance. Ideally we would have been able to use a significantly larger fraction of the 80,000 image-caption pairs available in our dataset, and a supercomputer would have been nice to have as well!

Of course, there are always ways to modify our model to improve the accuracy:

  • Using a larger dataset
  • Changing the model architecture
  • Hyperparameter tuning (learning rate, batch size, number of layers, etc.)

However, as stated above, with our limited time and computing power, many of these modifications were out of reach.

Despite all the room for improvement, we are proud of our neural network. It did learn some things about Instagram captions and even attempted to create an emoticon! Clearly, it still has a ways to go, but we hope it will inspire someone to continue and improve upon our work in the future. Thanks for reading :)

All of our code can be found at the GitHub repo linked below!

Acknowledgements

We would like to thank Dr. Ulf Aslak Jensen for teaching us the fundamentals of various neural network models in his Danish Institute of Study Abroad (DIS) course “Artificial Neural Networks and Deep Learning” and giving us some valuable tools with which to tackle this challenge. Congratulations on your recent doctorate and wedding! We would also like to thank the many GitHub repositories and online articles we consulted while creating this project. Shoutout to the open source community. As always, the biggest thank you to StackOverflow for helping idiots like us solve our problems every day.

References

Brownlee, Jason. (2019). How to Develop a Deep Learning Photo Caption Generator from Scratch.

Grossmann, Tim. (2019). Instagram-Profilecrawl GitHub repository.

Park, Cesc & Kim, Byeongchang & Kim, Gunhee. (2017). Attend to You: Personalized Image Captioning with Context Sequence Memory Networks.

