


A Python Flask app that predicts the personality type on the basis of user entries using text analytics.

The painting above (called Vertical Flow) by Irene Rice Pereira is one that I find most interesting. To me, it describes the heterogeneity in colors and, at the same time, how these different colors originate from the same essence. Colors, at the end of the day, operate on a spectrum rather than as individual points. One could make the case that human personalities also operate on a similar spectrum rather than in isolation, and that different personality types coincide and diverge continuously on this spectrum. In this article, I describe an application that helps users predict their own personality on the Myers-Briggs spectrum (of 16 personalities) using text analytics in Python. The final application is deployed on Heroku and can be accessed here. The application is developed in Python and makes use of Flask for app development and the TfidfVectorizer from sklearn for text analytics.

Data used for analysis and our key question

The data used for analysis is a database that contains posts by Reddit users corresponding to different personality types. The database can be accessed on Kaggle here. This is a free, open-source, public-domain dataset. We will create a personality-finder application where a user can enter any set of words that describes them. The application will compute a similarity score between the user-entered text and the various entries from the personality-type database to predict the personality type (one of the 16 Myers-Briggs personalities). It will also predict the likelihood of the four Myers-Briggs characteristics: Introverted/Extraverted, Intuitive/Observant, Thinking/Feeling and Judging/Perceiving. The video below shows the final application in action. The application will show the top 5 predictions with their similarity scores, so the user sees a spectrum of personalities that fit him or her.

[Video: the final application in action]

Structure of the application

Since this is a traditional Flask application, we first need to define a structure. The structure we will use is a standard Flask application structure: an index.html file that contains our user interface and hosts our results, and a Python file with the code that creates our application.

#Structure of the flask application
----Root folder
    --Base data files #This is basically just the csv with the base data from kaggle
    --application.py #This is the main python code file
    ----templates
        --index.html #The templates folder contains our index.html file with the user interface
    ----static
        #The static folder contains any images we will want to display in the application. 

Below is a diagrammatic representation of the flow of data in the application.

[Diagram: flow of data in the application]

Defining the user interface (index.html)

So, let's get started with the user interface itself. The first part of the index.html file is easy. This is just some basic text that describes the application, plus a simple image of the painting above, which we will host in our 'static' folder. So, that will look like:

<html>
<head>
<section>
  <img class="mySlides" src="/static/Vertical.jpg" style="height:50%;width:100%">
</section>
<h1> Personality finder (developed using text analysis)</h1>
<p> Find your personality type using this python flask application that makes use of a text vectorizer to analyze
posts by individuals identifying as different personality types. The app will yield your top 5 personalities from the
16 Myers Briggs personality types. Data downloaded from kaggle. By the way, the painting displayed here is "Vertical Flow" by I. Rice Pereira.</p>
<p> Please note that the script might take a minute to run since it has to parse through over 8000 entries.</p>
</head>

Now, let's create a form where a user can submit words or sentences that we will use for the text analysis. This is a simple HTML post form with a submit button. We will have to keep track of the name we assign to the input field, since this will be used later in the back end.

<form method="post">
    <input name="question1_field" placeholder="enter search term here">
    <input type="submit">
</form>

I will skip an explanation of the CSS styling for the HTML page. Finally, we will create a table which will output the top 5 entries for the user. We will also print out various headers such as the personality type, similarity score, search term, rank etc. We will need to define table heads ('th') and table data ('td') in HTML. Let's define the heads first:

<!--Each of the heads is some data that we want to print out for the user-->
<h2>Relevant personality types displayed below</h2>
<table>  
<tr>    
<th>Type</th>    
<th>SimiScore</th>    
<th>Search Term</th>    
<th>Rank</th>    
<th>Introversion/Extraversion</th>   
<th>Intuitive/Observant</th>    
<th>Thinking/Feeling</th>    
<th>Judging/Perceiving</th>    
</tr>

Now, the data will be brought in from the Python side, so we can use some Jinja code to access our Python dataframe variables. We will call our Python dataframe in the backend 'docs'. Jinja will allow us to iterate over this dataframe as shown below. The order in which each variable is called should match the order of the heads we defined in the block above.

{% for doc in docs %}
<tr>
<td>{{doc["type"]}}</td>
<td>{{doc["Simiscore"]}}</td>
<td>{{doc["Search Term"]}}</td>
<td>{{doc["Rank"]}}</td>
<td>{{doc["Introversion/Extraversion"]}}</td>
<td>{{doc["Intuitive/Observant"]}}</td>
<td>{{doc["Thinking/Feeling"]}}</td>
<td>{{doc["Judging/Perceiving"]}}</td> 
</tr>
{% endfor %}
</table>

Defining the backend (python)

The complete backend Python code file can be accessed here. The first step is obviously to import all of the required Python libraries. Since we will be using text analytics in Python, we will need to import the TfidfVectorizer function from the sklearn package. Also, let's go ahead and create the application itself in Flask. We will also create a standard GET and POST route for the application so we can effectively send data to it and receive data from it.

#Import required packages
from flask import Flask, flash, redirect, render_template, request, session, abort, send_from_directory, send_file
import pandas as pd
import numpy as np
import json
import xlrd
from sklearn.feature_extraction.text import TfidfVectorizer

#Create a flask application
application = Flask(__name__)

#Create an application route
@application.route("/", methods=["GET","POST"])

Now let's define what will happen on the homepage using a homepage function within the Flask application. Here is where we first call the data that the user has entered in the form we defined above in HTML. As we know, the form field itself is named 'question1_field'. We can access this data using a simple request function that invokes the form data. We can also specify a default word to be used in case the user has not entered any words yet, for instance on the initial GET request (the default word for this example is 'Education'). Let's also read in the raw data itself as a dataframe using pandas. I have cleaned up the user entries from the Reddit data from Kaggle. A clean version of the csv can be accessed here. Below is a screenshot of the dataset. The 'posts' column contains multiple posts by a user. Please note that this is an open-source dataset, available on Kaggle, that contains no confidential information.

[Screenshot: the dataset, with a 'type' column and a 'posts' column]

def homepage():
    #Call value from HTML form
    words = request.form.get('question1_field', 'Education')

    #Read raw data as a dataframe
    datasource1 = pd.read_csv("Personalities.csv")

Dealing with multiple words: a user may enter multiple words as part of a single entry. We want to split these words into multiple items so that we can run the text analysis for each word separately before running the analysis for all the words together. We also want to filter the posts in the dataframe for each of the words and bring those filtered posts into a new dataframe, which we will use going forward. After creating this new dataframe, let's drop duplicates just in case the same post got picked twice, and reset the index so that the positional loops we write later still work.

#Separate words into multiple entries
words = words.split() #(This will be stored as separate items in a list)

#Filter the data for posts that contain any of these words
datasource = []
for i in words:
    d2 = datasource1[datasource1['posts'].str.contains(str(i))]
    datasource.append(d2)

datasource = pd.concat(datasource, ignore_index=True)
#Drop duplicates and reset the index so .loc[i, ...] works in the loops below
datasource = datasource.drop_duplicates().reset_index(drop=True)
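One caveat worth flagging (my observation, not something the original app handles): str.contains is case-sensitive by default, so a search for 'metal' will not match a post that only says 'Metal'. If you wanted case-insensitive filtering, a one-line tweak to the filter line inside the loop above would do it:

#Hypothetical variant: match posts regardless of letter case
d2 = datasource1[datasource1['posts'].str.contains(str(i), case=False)]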

Ok, before we proceed, let's talk about what a similarity score is, and how and why we are going to calculate it in the given context. A similarity score, or 'pairwise similarity score', measures how close one piece of text is to another in terms of the significant words they share. So if a user says 'I love metal', this text would have a very high similarity with a post that mentions anything related to metal. Since each post corresponds to a personality type, we can 'predict' that the user's personality is similar to the personality type of the user who made that post.
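To make this concrete, here is a minimal sketch of the idea with made-up sentences (the sentences and scores are illustrative, not from the app's data):

from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "I love metal",                                #user entry
    "Went to a metal concert, metal is my life",   #post that dwells on metal
    "Saw a metal sign on the highway yesterday",   #post that mentions metal in passing
]

#Vectorize all three texts, then multiply by the transpose to get
#pairwise cosine similarities (rows are L2-normalized by default)
tfidf = TfidfVectorizer(stop_words="english").fit_transform(documents)
pairwise_similarity = (tfidf * tfidf.T).A

#Row 0 holds the similarity of the user entry to itself and to each post;
#the first post scores higher because 'metal' carries more weight in it
print(pairwise_similarity[0])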

The tf-idf vectorizer computes the frequency and significance of words that appear within a particular post. So, in the above example, if the word 'metal' occurs in a post just in passing, its tf-idf score would be very low, thus yielding a low similarity score. The tf-idf vectorizer computes its score on the basis of the frequency of the word and its uniqueness across documents. I won't go into the details of how the TfidfVectorizer works in this article. To compute the similarity between the user text and the post text, we need the two values side by side. We already have a filtered dataframe with the relevant posts for the user. Let's create an empty list called documents which we will use for the text analysis. We want to reset this list each time we move to the next post, so we re-create it within the for loop below.

for i in range(0, len(datasource)):
    documents = []
    documents.append(" ".join(words)) #Re-join the word list into a single string
    documents.append(datasource.loc[i, 'posts'])

Now for each entry in the dataframe, we want to compute the tf-idf score for the post with respect to the user entry. This can be accomplished with the TfidfVectorizer function. The function allows us to specify stop words, i.e. words that the function should ignore, such as articles (a, an, the) and prepositions (over, under, between). If we just set the stop_words parameter in the function to 'english', these words get ignored automatically. Adding the tf-idf vectorizer to compute a score for each entry, our function now looks like the below.

for i in range(0, len(datasource)):
    documents = []
    documents.append(" ".join(words)) #Re-join the word list into a single string
    documents.append(datasource.loc[i, 'posts'])

    #tfidf is the tf-idf matrix we compute for this post with respect to the entered text
    tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 4)).fit_transform(documents)
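If you are curious what 'english' actually filters out, sklearn exposes its built-in list (the exact contents can vary slightly by version):

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

#A frozenset of common English words that the vectorizer will ignore
print(len(ENGLISH_STOP_WORDS))          #318 in recent versions
print("the" in ENGLISH_STOP_WORDS)      #True
print("between" in ENGLISH_STOP_WORDS)  #True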

Now we know the significance of the user-entered words within each of the selected posts. But we now want to compute a similarity score using this tf-idf score, i.e. how similar the words entered by the user are to the words in the post. The similarity can also be defined as the cosine distance between the texts. This post from Stack Overflow explains the computation of the pairwise similarity in detail. The short explanation is that you can compute the pairwise similarity by multiplying the tf-idf matrix by its transpose; because the vectorizer L2-normalizes each row by default, the dot product of two rows is exactly their cosine similarity. This yields a similarity matrix, and accessing the off-diagonal element (row 0, column 1) gives us the similarity score between the user text and the post (the diagonal holds each text's similarity with itself, which is 1). We will also format the similarity score to display only four decimal places and add it to the dataframe as its own column. So, the above code with the pairwise similarity calculation becomes:

for i in range(0, len(datasource)):
    documents = []
    documents.append(" ".join(words)) #Re-join the word list into a single string
    documents.append(datasource.loc[i, 'posts'])

    tfidf = TfidfVectorizer(stop_words="english", ngram_range=(1, 4)).fit_transform(documents)

    #Compute pairwise similarity by multiplying by the transpose
    pairwise_similarity = tfidf * tfidf.T

    #Extract the score from row 0, column 1 of the matrix
    simiscore = pairwise_similarity.A[0][1]

    #Format the similarity score to four decimal places
    simiscore = "{0:.4f}".format(simiscore)

    #Finally, add the similarity score to the dataframe
    datasource.loc[i, 'Simiscore'] = simiscore
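As a sanity check on the indexing, here is what the matrix looks like for a two-document case (made-up sentences again):

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats and dogs", "dogs chase cats all day"]
m = TfidfVectorizer(stop_words="english").fit_transform(docs)
sim = (m * m.T).A

print(sim.shape)    #(2, 2)
print(sim[0][0])    #1.0 -- self-similarity sits on the diagonal
print(sim[0][1])    #the cross-similarity we store as Simiscore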

Now that the loop is done, we have similarity scores for all the posts in our dataset, computed against the text entered by the user. The next steps are ranking the similarity scores, picking the top 5, adding the search term as its own column to the dataframe and selecting the relevant columns that we would like to display. The final output will be assigned to a dataframe called docs that we will pass to our HTML table.

#Compute the rank and add it to the dataframe as its own column
datasource['Rank'] = datasource['Simiscore'].rank(method='average', ascending=False)

#Sort by the rank
datasource = datasource.sort_values('Rank', ascending=True)

#Select top 5 entries
datasource = datasource.head(5)

#Add the search term to the dataframe as its own column
#(words is a list at this point, so re-join it into a single string)
datasource['Search Term'] = " ".join(words)

#Select relevant columns and assign to a dataframe called docs
docs = datasource[["type", "Rank", "Simiscore", "Search Term", "Introversion/Extraversion", "Intuitive/Observant", "Thinking/Feeling", "Judging/Perceiving"]]
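A quick illustration of how rank(method='average', ascending=False) behaves on toy scores (tied scores share the average of their ranks; the values here are made up):

import pandas as pd

scores = pd.Series([0.30, 0.10, 0.30, 0.20])
print(scores.rank(method='average', ascending=False).tolist())
#[1.5, 4.0, 1.5, 3.0] -- the two 0.30s split ranks 1 and 2 between them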

Since we are sending this data to an HTML table, we will have to convert it to JSON format. Thankfully, this can be easily achieved in pandas using the to_json function. We can also set the orientation to 'records', so that each row becomes a record. We can then complete the homepage function in Flask by returning the 'docs' object that our front end is expecting. We will use the render_template function in Flask to send the data to our index.html file.

#Convert to json
docs = docs.to_json(orient="records")

#Return data to the front end for display
return render_template('index.html', docs=json.loads(docs))
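For reference, here is what orient="records" produces on a toy dataframe (hypothetical values), and why json.loads is applied before rendering: it turns the JSON string back into a list of dicts, which the Jinja for-loop can iterate over.

import json
import pandas as pd

df = pd.DataFrame({"type": ["INFP", "ENTJ"], "Simiscore": ["0.1532", "0.1427"]})
as_json = df.to_json(orient="records")
print(as_json)              #[{"type":"INFP","Simiscore":"0.1532"},{"type":"ENTJ","Simiscore":"0.1427"}]
print(json.loads(as_json))  #a list of dicts, ready for the Jinja template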

Finally, let’s finish our application in standard flask style,

if __name__ == "__main__":
    #Note: the Flask object was named 'application' above, so run that
    application.run(debug=True)

There you have it! On running the code, you should see the following message with a link to the application on your local server.

* Running on http://127.0.0.1:5000/ (Press CTRL+C to quit)

On accessing this link, you will be directed to a working application!

[Video: the working application in action]

There it is! Try out some words yourself.

Here are some relevant links:

  1. Link to the github project for this application- https://github.com/kanishkan91/Personality-type-finder
  2. Link to a working version of this application deployed on heroku- https://personality-type-finder.herokuapp.com/
  3. Link to the base data used for this application- https://github.com/kanishkan91/Personality-type-finder/blob/master/Personalities.csv

PS: The GitHub project contains a .gitignore and a requirements.txt file that you can use to deploy a modified version of this application to any server.

A special thank you to Anish Hadkar and Aditya Kulkarni for their feedback on this application during its development. Any additional feedback that you might have is always welcome!

