Making Text Ascent

How I built and deployed a machine learning web app

Nov 3 · 5 min read

I’ve often found myself reading an article, say on data science, and wondering, where can I read something simpler on this topic? I realized I wasn’t the only one when a friend posted a similar question on LinkedIn. She asked how to find articles in a specific range between most simple and most complex. I realized we don’t have an easy system for that type of search besides manually reading for a good fit.

Business Understanding

Building on my interests in web search, I created Text Ascent, a web app that uses unsupervised ML to help users discover content based on text complexity. I hope Text Ascent can be one tool for finding content at every stage of our learning journeys. A central goal for Text Ascent is to make niche topics more accessible to more people.

Photo By Ted Bryan Yu from Unsplash

Data Understanding

I used Wikipedia-API, a Python wrapper for Wikipedia’s API, to gather article titles on topics ranging from art to science. Then I ran a data gathering function (scrape_to_mongodb.py) that took those titles and scraped 11k+ articles for summaries, full text, and URLs into a MongoDB database. I excluded articles whose full text was shorter than 300 words because Wikipedia has entries like ‘music file’ that did not serve my model’s purpose.

See the Data Collection Notebook & Data Exploration Notebook.
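
The 300-word cutoff described above can be sketched in a few lines. This is only an illustration, not the code in scrape_to_mongodb.py: the Article class, the to_documents helper, and the field names are hypothetical, and words are counted by splitting on whitespace, which may differ from how the original script counted them.

```python
from dataclasses import dataclass, asdict

MIN_WORDS = 300  # threshold stated in the article


@dataclass
class Article:
    """Hypothetical record holding the fields the article says were scraped."""
    title: str
    summary: str
    text: str
    url: str


def word_count(text: str) -> int:
    # Whitespace split is an assumption; the original script may count differently.
    return len(text.split())


def to_documents(articles, min_words=MIN_WORDS):
    """Return dicts for articles whose full text has at least min_words words."""
    return [asdict(a) for a in articles if word_count(a.text) >= min_words]
```

The resulting list of dicts is in a shape that a MongoDB client such as pymongo could store with a collection's insert_many method.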

Data Preparation

The content returned from the Wikipedia-API wrapper did not require further cleaning. I did need to make sure that, when the content was displayed on the web app, the html was read as JSON to avoid carriage returns displaying to the user. I graded the full text of each document using the textstat package’s Flesch-Kincaid Grade.

These files are saved in an AWS S3 bucket to make the web app accessible. See the Data Preparation Notebook.
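
To show what the Flesch-Kincaid Grade measures, here is a from-scratch sketch of the standard formula: 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. It is not the textstat implementation. The syllable counter is a crude vowel-group heuristic and the sentence splitter is a simple regular expression, so scores will differ somewhat from textstat's.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count groups of consecutive vowels."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Treat a trailing 'e' as silent ("made"), except in "-le" endings ("table").
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Approximate U.S. school grade level needed to read the text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Short sentences made of short words produce a low (even negative) grade, while long sentences with many polysyllabic words push the grade up.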

Modeling

The current model uses cosine distance between the top 20 features of importance in corpus vectors and user input vectors to return content from the library that is similar to the user input. The model features were created with a TF-IDF vectorizer, which splits the corpus documents into words, removes stop words, and computes a term frequency for each word in each document, adjusted for how frequently the word appears across the corpus. In other words, uncommon words are given more weight than commonly used words.
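
The idea can be sketched without any libraries. The code below is a simplified illustration, not the deployed model: the stop word list is a tiny placeholder, the function names are hypothetical, the IDF uses a common smoothed form, and keeping the top_k heaviest terms per vector is only one possible reading of "top 20 features". In practice a library vectorizer such as scikit-learn's TfidfVectorizer would do the weighting.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "on"}  # tiny illustrative list


def tokenize(text):
    """Lowercase, split into alphabetic tokens, and drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def fit_idf(corpus):
    """Learn smoothed inverse document frequencies from the corpus."""
    n_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(tokenize(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_vector(text, idf, top_k=None):
    """Weight term counts by IDF; optionally keep only the top_k heaviest terms."""
    counts = Counter(t for t in tokenize(text) if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    if top_k is not None:
        ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
        weights = dict(ordered[:top_k])
    return weights


def cosine_distance(u, v):
    """1 - cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0  # no shared vocabulary: treat as maximally distant
    return 1.0 - dot / (norm_u * norm_v)


def rank(query, corpus, top_k=20):
    """Return corpus indices ordered from most to least similar to the query."""
    idf = fit_idf(corpus)
    q = tfidf_vector(query, idf, top_k)
    scored = [(cosine_distance(q, tfidf_vector(d, idf, top_k)), i)
              for i, d in enumerate(corpus)]
    return [i for _, i in sorted(scored)]
```

Because rare terms get a larger IDF, a query that shares an uncommon word with a document moves closer to it than a query that shares only a word appearing in most of the corpus.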

Reproduce This Model

Get a list of documents of interest and format it into a dataframe like clean_df. Get text difficulty scores using TextStat. My example on AWS S3: clean_df
Fit your vectorizer to your corpus (this learns the vocabulary and idf from the training set), which is the text series in your df. My example on AWS S3: vectorizer
Use the vectorizer's transform function (which transforms documents into a document-term matrix) to create your corpus vectors. My example on AWS S3: corpus vectors
Clone this repository.
In the traverse_flask directory, create an empty subdirectory named data.
Run the flask app from the traverse_flask directory in the terminal with $ export FLASK_APP=app followed by $ flask run. This flask app.py takes in functions from functions.py. Adjust the functions to change the data pipeline on the backend. Adjust the brython in static/templates/index.html to change the way data is presented to the user.

See the Model Functions.

Evaluation

This product is successful if users are able to discover content related to what they were already reading that is of a different reading difficulty. User satisfaction, repeat usage, web app traffic, and sharing of the app are the metrics I am using to evaluate Text Ascent’s success. I evaluated 4 models before going with the model deployed on the web app:

Model 1: Used TextStat, Gensim, and Spacy.
Model 2: Used Latent Dirichlet Allocation (LDA) topic modeling with 10 topics, then sorted user content into a topic.
Model 3: Used TextStat and TF-IDF Vectorizer with 2000 dimensions.
Model 4: Used TextStat and TF-IDF Vectorizer with top 20 features.

Each iteration was done so that the resulting content was more similar to the user input content.

Future Modeling

I would also like to compare a pre-trained neural network to my current TF-IDF vectorization to see if the quality of returned content improves. Improvement would be measured through user feedback in a simple manual grading system to be added to the web app. See the Evaluation Notebook.

Deployment

Text Ascent has been deployed as a flask-enabled web app at traverse.sherzyang.com on an EC2 instance (currently not running). The app uses brython to interact between python functions and html. Below are two images from the web app. Given any user input text, the model will output related articles from the library with links in the title to full length articles. Users can scroll or traverse from simpler content to more complex content and the table will update accordingly.

[Image: two screenshots of the Text Ascent web app]
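
The traversal from simpler to more complex content can be thought of as filtering the similar articles by reading grade. The sketch below assumes each result is a dict with hypothetical title, grade, and distance keys, and that the user picks a target grade; the window size is an arbitrary illustrative choice. The deployed app may order and filter its table differently.

```python
def traverse(results, target_grade, window=2.0):
    """Return results within `window` grade levels of target_grade.

    Results closest to the target grade come first; ties are broken by
    cosine distance so that more similar content is listed earlier.
    """
    nearby = [r for r in results if abs(r["grade"] - target_grade) <= window]
    return sorted(nearby, key=lambda r: (abs(r["grade"] - target_grade), r["distance"]))


# Hypothetical results for one query.
results = [
    {"title": "A", "grade": 6.0, "distance": 0.3},
    {"title": "B", "grade": 12.5, "distance": 0.1},
    {"title": "C", "grade": 8.0, "distance": 0.2},
    {"title": "D", "grade": 15.0, "distance": 0.4},
]
simpler = traverse(results, target_grade=7.0)
harder = traverse(results, target_grade=14.0)
```

Moving the target grade up or down changes which articles fall inside the window, which is one way a table could update as the user scrolls.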

Future Iterations

As part of my interests in search and our new world of one-shot answers (thank you Alexa, Siri and Google Home), I plan on deploying Text Ascent as an Amazon Alexa skill. The skill will allow a user to “scroll” or “traverse” along a gradient of simpler to more complex summaries on a topic, just like telling Alexa to play a song louder or softer. I believe creating options in content will expand us beyond the world of one-shot answers in a positive way.

Additionally, I am eager to grow the corpus to include books from Project Gutenberg and beyond. If you have some content you’d like to see added to the current library of Wikipedia articles, please send me a message on LinkedIn. I’ve seen several web extensions that grade a book’s reading difficulty on Amazon or Goodreads (Read Up is a great one). Those products inspire me to develop a corpus-free search functionality for Text Ascent in the future. I envision Text Ascent becoming much more useful when it can return content from the Google or Bing web search APIs.
