
The DataHour: How to Stay Relevant in the World of AI?

Source: https://www.analyticsvidhya.com/blog/2022/06/the-datahour-how-to-stay-relevant-in-world-of-ai/

Overview

Analytics Vidhya has long been at the forefront of imparting data science knowledge to its community. With the intent of making learning data science more engaging for the community, we began our new initiative, “DataHour”.

DataHour is a series of webinars by top industry experts where they teach and democratize data science knowledge. On 29th March 2022, we were joined by Anastasiia Molodoria for a DataHour session on “How to Stay Relevant in the Booming World of AI?” 

Anastasiia has a strong math background and experience in predictive modeling, NLP (Natural Language Processing), data processing, and deep learning. She has successfully integrated ML, DL, and NLP solutions for retailers and product tech companies, focusing on optimizing and automating routine daily tasks and increasing business efficiency.

Currently, she’s working at MobiDev as the Data Science Team Leader.

Are you excited to dive deeper into the world of Data Science and Machine Learning? We've got you covered. Let's get started with the major highlights of this session: How to Stay Relevant in the Booming World of AI?

Introduction

From this session, you'll take away two learnings:

  • First, what are the most popular AI directions?
  • Second, an understanding of where to start in order to work successfully in these areas.

Anastasiia covered these topics through business cases for a deeper understanding of the value of AI integration in solving real-world problems. This will also help you gain insight into the main steps needed to successfully deliver an ML product to a client, and into how to set the right expectations with the business when you don't know the exact output of your ML research beforehand.

Prerequisites: Some basic understanding of Data Science.

So, let’s dive into the ocean of AI.

Starting AI Project: How to Provide the Right Estimates and Meet Expectations?

With a few basic examples, we'll try to understand what estimates and expectations we need to meet to kick-start a project. So, let's begin.

First, we need to distinguish between two things: PoC and MVP.

PoC vs MVP

What exactly are these two terms, and why do we need them?

We need these to know:

  • What business goal are we solving?
  • Do we have any idea how to solve this task?
  • What are the evaluation criteria?
  • What technology should we use?

Now, let's look at how PoC and MVP work.

PoC – Proof of Concept

It takes an input (the task we need to solve) and helps us arrive at the desired output.

How PoC works: Input → Output

A PoC gives us the following outputs:

  • understand whether you have the capability to develop the solution
  • make more accurate estimates
  • understand the necessity of engaging 3rd-party developers (back-end, front-end, etc.)
  • build a solid basis for the MVP solution
  • show expertise in practice

MVP – Minimum Viable Product

It helps maintain a balance between "minimum" and "viable", that is, it gives us the optimal set of features to start with.

MVP

For example, donuts in a market. The market is flooded with donuts, and suppose you are a newcomer to this business. Knowing only how to make donuts will not benefit your business, because so many brands already exist. What you need is the idea that will boost your business and make it successful: the extra features you add to your donut. Those extra features are what the MVP solution captures.

Now, let's see how to (and how not to) build an MVP with another example.

Example: how (and how not) to build an MVP

Explanation of the example: The first way of making a car is not a minimum viable product because, if something goes wrong at step three, it's not possible to get the desired product, and all the money, time, and effort invested go to waste.

The second way, on the other hand, is the right way to build a minimum viable product.

This was all about PoC and MVP individually.

Now let's see when to use a PoC and when to use an MVP.

Here, we'll answer a basic question: do we know where to start?

Case 1: If yes, the next question is: are the goal and all the steps clear?

  • If yes, choose MVP.
  • If no, choose PoC.

Case 2: If no, you have no idea where to start, so choose PoC.

CRISP-DM Process (Cross-Industry Standard Process for Data Mining)

CRISP-model

A data science project is not a sequential development project. In a sequential project, such as developing a mobile application, we know what the next step is and what result to expect. Data science projects are iterative. For this type of project, we need the following phases:

Business Understanding: Ask your client what he/she wants to develop.

Data Understanding: Once the client hands the data over to you, there might be situations where you cannot get insights from it. In that case, connect with the client again to build a proper understanding of the data.

Data Preparation: This is one of the major areas a data scientist has to invest in, whatever project you are handling.

Modeling: Select the model that best fits your idea. If you observe a problem with the model, you can go back to data preparation, correct it, and then build a new model or adjust the existing one.

Evaluation: Evaluate whether the model will work or not. If it works, go for deployment. If not, you need to revisit the business understanding.

Deployment: Deploy the AI project.

New Project: Specifics of AI Estimation

Project: Specifics of AI estimation

 Source: Presenter

An AI project is a different kind of project, yet we still need to estimate it somehow. With a mobile application, for example, it's more or less clear what we are dealing with: a specific operating system, some buttons to add, and so on. But how do you estimate an AI project properly when you don't know the results in advance? You don't know whether it will work or what accuracy you will get, yet "please estimate it" is a frequently asked request, and you have to do it.

A Few Recommendations for AI Projects

  • Be sure that you can solve the core task. Start with PoC.

Why? Because if, for example, the idea is to develop a mobile application based on AI and the core AI task cannot be solved, there is no need to gather all the other developers at all, since the main functionality cannot be delivered. So if you are not sure, ask the client to start with a PoC; that is completely fine.

  • Don't commit to specific numbers in metrics.

Why? A client may ask you for an accuracy commitment. Don't give one, because you don't know the results in advance; explain this to the client. What you can do, for example, is select several models you are going to try in the first stage. These models are usually benchmarked on open-source datasets, so they come with published metrics; share those numbers with the client, while making clear that you are not sure how they will perform on their data, because you haven't tried yet and the model still needs to be developed. That is fine.

  • Provide the client with the project risks and limitations; it's crucially important.

Why? Let's imagine that you test your PoC or experiment on audio files and it looks good to you, so you are sure it will work in production. But when it becomes a product, the data you see is completely different: there is a lot of background noise, and the approach stops working. It's better to write this down as a risk. If some output goes wrong, you can refer back to the point you gave the client at the very beginning.

  • Explain this project flow to the client.

Describe the flow to the client, just as we described it to you, so that you are both on the same page.

  • More decomposition means more understanding of how to achieve the goal.

More understanding is great because, first of all, you will be more confident in your estimates and in achieving the goal. The main point here: the more detail you have about how to reach your goal, the better.

  • Clarify the runtime requirements.

Runtime is important even when the client doesn't mention it (a true story we will cover a bit more later on). If the client doesn't talk about runtime, it doesn't mean it doesn't matter, so it's better to clarify it on your own.

How to Deliver a Result to the Client?

    • Demo and visualization are the best options (python/R visualization tools, streamlit, gradio, etc.).
    • Even if you have a small subset of client’s data – use it for demo.
    • Make sure that client understands your point.
    • Provide a report with all the details of your work (a minimal demo sketch follows this list).
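For instance, here is a minimal Streamlit demo sketch. The file name and the 'date'/'units_sold' columns are hypothetical placeholders for whatever small sample the client shares; Streamlit is one of the tools named above, not the only option.

```python
# demo_app.py (run with: streamlit run demo_app.py)
import pandas as pd
import streamlit as st

st.title("Sales Forecast - PoC Demo")

# Hypothetical input: a small sample of the client's data with 'date' and 'units_sold'.
uploaded = st.file_uploader("Upload a CSV with 'date' and 'units_sold' columns", type="csv")
if uploaded is not None:
    df = pd.read_csv(uploaded, parse_dates=["date"])
    st.dataframe(df.head())  # show the raw sample the client provided
    weekly = df.set_index("date")["units_sold"].resample("W").sum()
    st.line_chart(weekly)    # a simple visual the client can react to
```

Walking the client through a chart like this live is usually far more convincing than a slide of metrics.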

Popular AI Directions Overview

Popular AI directions

           Source: Presenter

Showing something abstract, where you apply a model to open-source data, is one story. It's a completely different story when the client sees insights from you on their own data, even if they didn't provide much of it (e.g., three pictures). You will definitely earn more trust from the client. So even if you have only a small amount of the client's data, try to use it for the demo, and make sure the client understands your point.

We all work in data science, an area where it's really easy to lose people, because there are a lot of technical details and the client is often not as technical as we are. So we have to describe complicated things in simple words. Make sure the client understands your plan; ask whether it was clear, and paraphrase if needed. It's better to do this every time to avoid miscommunication in the future, which would be far worse.

And the last recommendation: provide the client with a report containing all the details of your work. It's a really good practice, because when the meeting is over you can share this report and the client can go through the details. The most interesting AI directions, which we cover next along with optimization, are computer vision, NLP, and time series.

Business case: Time Series Predictions

Let's understand this with the example of a cafe chain owner.

Time series example

The main goal of this owner is to support and maintain the business. The owner came to you with two questions:

  • He wants to know the number of products that will be sold.
  • He wants to understand the performance of employees: who is a good performer and who is not.

As input data, the client provides you with a SQL database containing a bunch of tables and the connections between them. So, if the client wants to know the number of products that will be sold, what is the expected solution here?

What is the Expected Solution?

  • We can predict the number of products that will be sold (a small forecasting sketch follows the suggestions below).
  • For employee performance, we can suggest developing some kind of employee rating based on sold products, tips, or whatever else is in the data.

But this isn't enough; try to think wider and deeper, as a competent data scientist should.

What Else Can You Suggest?

  • Recommendations for the best-selling products, so the client will know which products sell together in pairs; employees can then apply cross-selling and increase revenue.
  • You can suggest employee anomaly detection: for example, analyzing whether some employees are cheating and trying to detect these anomalies in the data.
  • Popularity of products per hour and per day. This is a great input for a marketing strategy, for instance applying advertisements on specific days, at weekends, or at lunchtime.
  • You can suggest a dashboard, so the client will be able to see the data in real time.
  • You can suggest clustering analysis too: with the data we can group customers and identify behavior patterns inside these groups.
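To make the forecasting part concrete, here is a minimal sketch, assuming the client's SQL data has already been exported to a hypothetical daily_sales.csv with 'date' and 'units_sold' columns. Holt-Winters with weekly seasonality is just one common baseline, not the presenter's prescribed model.

```python
import pandas as pd
from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Hypothetical export of the client's SQL data: one row per day with units sold.
sales = pd.read_csv("daily_sales.csv", parse_dates=["date"]).set_index("date")
daily = sales["units_sold"].asfreq("D").fillna(0)

# Holt-Winters with weekly seasonality: a simple first baseline for cafe sales.
model = ExponentialSmoothing(daily, trend="add", seasonal="add", seasonal_periods=7).fit()
forecast = model.forecast(14)  # predicted units sold for the next two weeks
print(forecast.round())
```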

Business Case: NLP

We'll understand this with an example. There is a business owner who has a product, and the main goals of this person are:

  • to sell the product and
  • provide customer support

Let's imagine customer support via chat, because we need text somewhere in this business case. So the text lives in the chats, and this client comes to you with two requests:

  • Get an understanding of employee performance.
  • Help in providing services more effectively.

What Solution Will You Propose?

We propose to the client:

  • Sentiment Analysis: Based on the text, we can identify the emotion and tone of a conversation. We can detect negative cases and try to understand which employees have more negative conversations, and then do some analysis based on these sentiments.
  • Text Summarization: To summarize whole conversations. For example, a customer comes to chat support and refers to a problem they reported before: “I talked with someone a few days ago.” The agent then has to dig through the database to find that conversation and read the ticket to understand it, which takes a lot of time. Summarization will definitely help speed up this process.
  • Keyword Detection: The idea is to detect keywords in a conversation and use them to tag it. For example, if a problem was already solved by one agent and another agent faces the same problem, the earlier conversation will be easy to find by searching on those tags (a minimal sketch of these three ideas follows the list).
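As a rough illustration of how these three ideas could be prototyped, here is a minimal sketch using off-the-shelf Hugging Face pipelines and spaCy; the example chat text and the default pipeline models are assumptions, not the presenter's setup.

```python
import spacy
from transformers import pipeline

chat = ("Customer: I reported a billing problem last week and it is still not fixed. "
        "Agent: I am sorry about that, let me find your previous ticket and escalate it.")

# Sentiment analysis: tone of the conversation (default English model).
sentiment = pipeline("sentiment-analysis")
print(sentiment(chat))

# Text summarization: a short recap the next agent can read instantly.
summarizer = pipeline("summarization")
print(summarizer(chat, max_length=30, min_length=5))

# Keyword detection: naive tagging via noun chunks.
nlp = spacy.load("en_core_web_sm")
print({chunk.text.lower() for chunk in nlp(chat).noun_chunks})
```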

How to Start?

Tabular Data

Tabular data, or structured data, is data organized into rows and columns. We can say tabular data is a table that stores data of different types, whether boolean, numeric, or text. Tabular data makes deriving insights more efficient. It usually involves, although is not limited to, three main tasks: regression, classification, and clusterization.

Regression: This task is for predicting a specific number, such as a price or sales units, i.e., a numeric, possibly decimal, value.

Classification: This is about assigning a class to an observation, for example disease detection, or whether a sentiment is positive or negative.

Clusterization: Here we don't know the number of classes, but we want to group our data into some number of groups, like the customer group detection discussed earlier, grouping people by behavior.
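A minimal sketch of the three task types on synthetic data, using scikit-learn (the library choice is an assumption; any standard ML toolkit works the same way):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs, make_classification, make_regression
from sklearn.linear_model import LinearRegression, LogisticRegression

# Regression: predict a continuous number (price, units sold, ...).
X, y = make_regression(n_samples=200, n_features=5, random_state=0)
print("Regression R^2:", LinearRegression().fit(X, y).score(X, y))

# Classification: assign a class label (disease / no disease, positive / negative).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
print("Classification accuracy:", LogisticRegression(max_iter=1000).fit(X, y).score(X, y))

# Clusterization: group observations when the labels are unknown (customer segments).
X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10).fit(X).labels_
print("Cluster sizes:", np.bincount(labels))
```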

Data provided for AI Projects

It's true that expectations and reality differ. We expect that there will be no missing data, that all variables are well known, that the data is clean and everything is fine, and that our only job is to apply a model and tune it. But this is not reality. To get good insights from data, we need to follow these steps:

Understanding the data: In a classical tabular data task, the first and most important step is data understanding. Everything unknown should become completely clear at this stage, so you need a full understanding; if something isn't clear, ask the client. If you are not sure about the data, it's not possible to develop a really valuable model.

Data Cleaning and Preparation: Get the data ready for modeling.

Feature engineering: Before modeling, feature engineering is a great and interesting step. You can generate new features, gain new insights, and discuss them with the client; which experiments to run is up to you, and it's really interesting.

Modeling: Train and compare models on the prepared data. From this step you can go back to feature engineering or data cleaning again; we are working with an iterative process, not a sequential one.

Evaluation: Validate the model on your dataset. Make sure that you are not overfitting and that everything works as you expected.
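Put together, a minimal end-to-end sketch of these steps might look like the following. The clients.csv file and its columns (age, visits, total_spend, churned) are hypothetical, and a random forest is just one reasonable first model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical client table; the column names are illustrative only.
df = pd.read_csv("clients.csv")

# Data cleaning and preparation: handle missing values and duplicates.
df["age"] = df["age"].fillna(df["age"].median())
df = df.drop_duplicates()

# Feature engineering: derive a new signal from existing columns.
df["spend_per_visit"] = df["total_spend"] / df["visits"].clip(lower=1)

X = df[["age", "visits", "total_spend", "spend_per_visit"]]
y = df["churned"]

# Evaluation: hold out data so overfitting shows up before the client sees the model.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```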

How can this result be improved further?

  • Look at the data from different perspectives
  • Add 3rd-party data to enrich the model
  • Try to get new insights from text features

NLP (Natural Language Processing)

This is basically the area of working with text. The presenter found really great research on the NLP market, which you can follow via the link if you are interested in reading more. The main takeaway is the amount of investment in NLP as of 2020 and the huge growth expected; considering the number of NLP projects we already work on, this area will definitely grow. The research covers two key growth directions:

  • First, cloud-based solutions such as AWS and GCP, i.e., using hosted NLP services on demand.
  • Second, the increasing usage of smart devices to facilitate smart environments.

What does this mean? We have all become used to Siri, or, for example, when we turn on the TV and want to find something on YouTube, we don't want to type it, we just want to say it, which is much easier. That is still NLP, and all of this will be developed further. Within NLP there are also a lot of directions.

Popular NLP directions

NLP directions

How to Work with Text (NLP)?

First, we need to split the text, with or without punctuation (that's a separate story), and then assign a token to each unique value. Of course, there are different options for splitting and for tokenization, but the high-level idea is to convert the text into numbers and then work with that numeric sequence. It may look simple, and if you haven't worked with text yet, hopefully this gives you the sense that it's not really complicated under the hood.
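A toy sketch of that high-level idea, splitting the text and assigning an integer id to each unique token (a real project would use a library tokenizer such as spaCy, NLTK, or a Hugging Face tokenizer):

```python
from collections import defaultdict

text = "AI is booming. AI is changing how we work."

# Naive split on whitespace after stripping the period, then map each unique token to an id.
tokens = text.lower().replace(".", " ").split()
vocab = defaultdict(lambda: len(vocab))   # the first unseen token gets the next free id
ids = [vocab[token] for token in tokens]

print(tokens)  # ['ai', 'is', 'booming', 'ai', 'is', 'changing', 'how', 'we', 'work']
print(ids)     # repeated words share an id: [0, 1, 2, 0, 1, 3, 4, 5, 6]
```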

NLP Approaches

We have two approaches generally:

  • Classical NLP
  • NLP with deep learning (neural networks)

Text classification and keyword or text-similarity tasks can be solved with both approaches. But for some tasks, even if a classical approach can solve them, the results may no longer be relevant for the real world.

Now the main focus is on deep learning, and models trained with neural networks produce really great results. For example, we can enrich such a model with our own custom data, so this area is developing much more right now.

Intuitions for solving NLP

Let’s say the task is to generate summaries for a given input text.

Solution

  1. Training an NLP model from scratch is not really efficient in practice, because it takes a lot of time, money, and labeled data; it is really hard and takes enormous time and money compared to transfer learning.
  2. Transfer learning and pre-trained models are a better choice. A pre-trained model means someone has already trained it and you can try to use it for your case. Transfer learning means taking a pre-trained model, adding your data, and continuing the training, enriching the already-trained model with your custom data. It's one of the best choices right now in NLP (a minimal sketch follows this list).
  3. The dataset used for the pre-trained NLP model matters. For example, if you choose an NLP model trained on Wikipedia-style general text and your main task is in medicine, most likely (90 percent) it will not produce great results. So even if you plan to retrain the model, try to find one that was pre-trained on a relevant dataset. New NLP directions will add further growth.
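For the summarization task above, a minimal pre-trained-model sketch could look like this. The checkpoint name is one example of a public summarization model on Hugging Face, not a recommendation; transfer learning would then mean continuing its training on your own labeled summaries.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# A publicly available summarization checkpoint (example choice, not the presenter's).
name = "sshleifer/distilbart-cnn-12-6"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForSeq2SeqLM.from_pretrained(name)

text = "Replace this with the long input document you want to summarize ..."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
summary_ids = model.generate(**inputs, max_length=60)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```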

New NLP Directions

These are:

  • Extractive Summary
  • Abstractive Summary
  • Image Captioning

Two years ago, the extractive summary approach was everywhere, while the abstractive approach was rare and did not produce good results. Right now the situation is completely the opposite. The core intuition behind the extractive summary approach: we split the text into sentences, score and rank each sentence, and select the most relevant ones. But the output contains the original sentences, and most likely parts of it will read out of context.

The abstractive summary approach can paraphrase and generate new sentences. So if the input is, say, 10 sentences, the output can be a single short, paraphrased sentence. This kind of summarization is now more popular and produces really great results, although several years ago it was the complete opposite.

Image captioning: The idea here is that we take an image as input, detect the objects in it, and then generate a text description. It could be useful for the automatic creation of text descriptions for photos. An interesting application is annotation for blind people: we can convert the caption to voice and describe to them what is shown.

Some Helpful Research/Resources for NLP:

  • Hugging Face (huggingface.co): pretrained models and the possibility of fine-tuning
  • Github repositories: implemented solutions and new modules
  • Python packages: spacy, nltk, gensim, etc.
  • 3rd party API and external platforms: AWS, GCP, etc.

Computer Vision

Computer vision is the area of AI for working with images and pictures.

Example: Let's try to understand how to detect whether there is a cat or a dog in a picture, which is a classical computer vision task. We'll approach it through the main computer vision directions.

How to Work with Image Data?

Pictures need to be converted to numbers, because models work with numbers. Under the hood, when you read an ordinary JPG or PNG picture, it looks like a three-layered matrix: two dimensions are the height and width in pixels, and the third is the three RGB channels.

For a picture there are usually three channels (the number can differ, but usually it's three: red, green, and blue). Most of the pictures we look at on our devices contain these three layers, and the final image comes from the intensity of each color.
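A tiny sketch of what "a picture is a matrix of numbers" looks like in code, assuming OpenCV and a hypothetical cat.jpg on disk (note that OpenCV loads channels in BGR order rather than RGB):

```python
import cv2

# cv2.imread returns a height x width x 3 NumPy array (or None if the file is missing).
img = cv2.imread("cat.jpg")
print(img.shape)   # e.g. (480, 640, 3): pixels high, pixels wide, 3 colour channels
print(img[0, 0])   # channel intensities (0-255) of the top-left pixel, in BGR order
```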

But what if you don't know anything about computer vision? In that scenario:

  • Read image data, look at the properties and data you have (OpenCV, etc).
  • Review and understand neural network components: layers, activation functions, etc.
  • Review and understand the logic behind popular NN architectures: ResNet, UNet, etc.
  • Choose Image Classification as your first task in CV.
  • Write and train your first custom NN model on the ImageNet dataset.
  • Use a pretrained model and apply a transfer learning approach.

What if we don't have a dataset? The client hasn't provided one but wants us to detect a person, for example. What do you do as a data scientist? Don't say no to the client. You have three options (a minimal sketch follows the list):

  • Try to find a suitable pre-trained model (if there is no data).
  • Apply transfer learning (if you have a small amount of data).
  • If it's possible to gather data, build a custom model (using new technologies) and deliver fruitful results to the client.
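A minimal sketch of the first two options using torchvision (assuming a recent torchvision version; the two-class "person / no person" head is illustrative):

```python
import torch
from torchvision import models

# Option 1: use a pre-trained classifier as-is (no client data needed).
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
model.eval()

# Option 2: transfer learning. Freeze the pretrained backbone and retrain only
# a new head on a small custom dataset (e.g. "person" vs "no person").
for param in model.parameters():
    param.requires_grad = False
model.fc = torch.nn.Linear(model.fc.in_features, 2)  # new 2-class output layer
# ...then train only model.fc on the client's (small) labelled images.
```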

New Technologies in the World of AI

These are:

  • LIDAR
  • GANs
  • Human Pose Estimation

LIDAR: It's really interesting. Some of you may already have this camera in your mobile phone. The idea is that the camera has a laser, and the output is not just a picture like the ones we normally take; it's a picture together with information about where the light comes from, so there is plenty of extra information from LIDAR. It's very highly developed right now.

GANs: These are really interesting neural networks in computer vision. The main idea is that we take an input picture and modify it: apply a smile, or change the hairstyle, haircut, or something else in a person's appearance. GANs can be useful in different areas, from generating new data samples for your datasets to photo editing and face animation.

Human Pose Estimation: The idea here is to detect key points of the human body. It can be useful for a fitness app, for example, to identify whether you are doing an exercise properly. Use cases can be studied further in this human pose estimation guide.

DS Optimization for Big Data Solution

Example why to apply ds

In this scenario, how do you meet this type of client expectation?

  • Runtime is always important, even when the client doesn't mention it.
  • Think about speed during development; it's better to write optimized code from the start.
  • Use GPU/CPU resources to the maximum.
  • Multiprocessing is a great option for parallelization (see the sketch after this list).
  • Write the project with a pipeline-style running approach; it will make life easier later, when your model is deployed.
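A minimal multiprocessing sketch, as mentioned in the list above; the chunks and the work done per chunk are placeholders for whatever heavy, independent processing your project needs.

```python
from multiprocessing import Pool

def process_chunk(chunk):
    # Placeholder for heavy, independent work (feature extraction, scoring, etc.).
    return sum(x * x for x in chunk)

chunks = [range(0, 1_000_000), range(1_000_000, 2_000_000), range(2_000_000, 3_000_000)]

if __name__ == "__main__":
    # Spread the chunks across CPU cores instead of processing them in a sequential loop.
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, chunks)
    print(sum(results))
```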

A few things you must consider when writing code for AI projects:

  • AUXILIARY VARIABLES: Don't create a lot of auxiliary variables, especially with heavy tables, because you will use a lot of RAM.
  • ‘FOR’ LOOPS: Avoid ‘for’ loops as much as possible. Try to apply vectorized functions: apply(), etc.
  • BIG INPUT TABLES: Read only the necessary columns. Ultimately, it will speed up reading and reduce RAM consumption.
  • DATA TYPES: Use ‘lightweight’ data types as much as you can to speed up processing time (a short sketch of these tips follows the list).
  • SQL QUERIES: Write effective SQL queries for getting data from DB. It will dramatically speed up runtime.
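A short pandas sketch of three of these tips (column selection, lightweight data types, vectorization); the transactions.csv file and its columns are hypothetical.

```python
import pandas as pd

# BIG INPUT TABLES: read only the columns you actually need.
df = pd.read_csv("transactions.csv", usecols=["price", "quantity"])

# DATA TYPES: downcast to lighter types where the value range allows it.
df["quantity"] = df["quantity"].astype("int32")
df["price"] = df["price"].astype("float32")

# 'FOR' LOOP: replace row-by-row loops with a single vectorized operation.
df["revenue"] = df["price"] * df["quantity"]
print(df["revenue"].sum())
```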

Conclusion

I hope you have enjoyed the session and that, afterwards, you will stay relevant in the world of AI. The layman-friendly examples should have complemented your learning. Wishing you good luck: learn more, grow higher.

