What fits you as a data scientist?

source link: https://towardsdatascience.com/what-fits-you-as-a-data-scientist-ef149bc774df?gi=3fa564c140

Discover your place and get the right direction

Photo by Monty Allen on Unsplash

Data science strives to understand the natural world, which is, by nature, very complicated. But how? By analyzing data, often a significant amount of it (so-called big data), trying to understand it, and squeezing out knowledge and experience in order to make decisions and solve problems. For a better understanding of what working in data science and machine learning is like, please check my introductory article about machine learning and artificial intelligence (link below).

As a data scientist, the first thing to know is the data lifecycle, which is made up of the following steps.

Data Collection

Nowadays, data collection is an easy task: it is the act of gathering data from various sources. Web pages, news, social media, reports, graphs, tables, etc. are all sources of digital raw data, ready to be consumed by anyone interested.

Flows of data (from Giphy)

In this field, a good data scientist develops an inherent curiosity about the world; he is data-driven, so he spends enormous amounts of time collecting data to answer the questions of interest. The required skills are:

  • thinking about what data are needed to solve the problem at hand
  • knowing how to collect data from various sources and how to combine them in a structured way
  • knowing some tools or applications for data collection and ETL (Extract, Transform and Load); a minimal sketch follows this list
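To make the ETL idea concrete, here is a minimal sketch in Python with pandas. The URLs, file names, and column names are hypothetical placeholders, not real datasets.

```python
# A hypothetical collection step: pull raw data from two sources and
# combine them into one structured table (Extract, Transform, Load).
import pandas as pd

# Extract: read tabular data from two different (made-up) sources
sales = pd.read_csv("https://example.com/sales_2020.csv")    # placeholder URL
regions = pd.read_json("https://example.com/regions.json")   # placeholder URL

# Transform: join them on a shared key into a single structured frame
raw = sales.merge(regions, on="region_id", how="left")

# Load: persist the combined raw data for the next stage of the pipeline
raw.to_csv("raw_data.csv", index=False)
```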

Data Cleaning

Once collected, most of the time, raw data are “messy”.

A very messy office with a bunch of documents and raw data to organize (Photo by Wonderlane on Unsplash)

Data cleaning is a complex task, which involves:

  • detecting and correcting data that are corrupt or inaccurate because of partial or faulty data gathering
  • validating data and estimating missing values, based on information about the relevant phenomena and on the problem at hand
  • enhancing data via harmonization and normalization
  • transforming data to obtain uniformity and comparability of values in the dataset (see the sketch below)
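As a rough illustration, here is what a few of these steps might look like with pandas, assuming the hypothetical raw table from the collection sketch above; the column names are placeholders.

```python
# A minimal cleaning sketch on the hypothetical table: drop corrupt rows,
# estimate missing values, harmonize a text column, normalize a numeric one.
import pandas as pd

df = pd.read_csv("raw_data.csv")

# Detect and drop corrupt records: duplicates and rows missing the key field
df = df.drop_duplicates().dropna(subset=["region_id"])

# Estimate missing values from the relevant phenomenon (here: median per region)
df["amount"] = df.groupby("region_id")["amount"].transform(
    lambda s: s.fillna(s.median())
)

# Harmonize categories and normalize the numeric column to the [0, 1] range
df["channel"] = df["channel"].str.strip().str.lower()
df["amount_norm"] = (df["amount"] - df["amount"].min()) / (
    df["amount"].max() - df["amount"].min()
)
```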

Exploratory Data Analysis

Exploratory data analysis (EDA) is a collection of techniques for seeing what the data can tell us. In EDA, we use both mathematical models and common sense to grasp the significance of our data.

Graphical representation of data (Photo by Stephen Dawson on Unsplash)

As data scientists, we must know what to expect from the data we collect, formulate hypotheses, and “fill the gaps” in the information we have.

There are many tools to help us:

  • Descriptive statistics: to obtain a representation of the data through tables, graphs, summary values, etc.
  • Inferential statistics: to use our collection of data, which is an incomplete representation of reality, to infer and make assumptions about the fundamental characteristics of the phenomena (see the sketch after this list).
  • A deep understanding of the environment, that is to say, the context of the problem we are trying to solve with data science techniques.
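A tiny EDA sketch along these lines, still assuming the hypothetical table from the earlier sketches: descriptive statistics for an overall picture, plus a simple inferential check (a two-sample t-test with SciPy) of a hypothesis about two groups.

```python
# Descriptive and inferential statistics on the hypothetical table.
import pandas as pd
from scipy import stats

df = pd.read_csv("raw_data.csv")

# Descriptive statistics: summary values and a quick look at relationships
print(df.describe())
print(df.select_dtypes("number").corr())

# Inferential statistics: do two (made-up) groups differ in mean amount?
online = df.loc[df["channel"] == "online", "amount"].dropna()
store = df.loc[df["channel"] == "store", "amount"].dropna()
t_stat, p_value = stats.ttest_ind(online, store, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```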

It’s worth recalling that, in classical machine learning (ML), this phase of the data lifecycle is up to us, the data scientists, whereas in deep learning (DL), and even more so in reinforcement learning (RL), it is up to the model, the machine, to cope with it. In DL, during the training phase, the algorithm learns the characteristics of the data provided and adapts to them. In RL, the environment is, even more, an active part of the learning process.

Model Building

Model building is a fundamental part of the ML process. When we create a model, we can then train a machine to learn patterns in our data (the training set) in order to predict unknown or future data.

Be creative… it’s time for model building! (Photo by Jo Szczepanska on Unsplash)

In model building, we try to predict outcomes from the analysis.

Again, some skills are essential here:

  • applying the right learning schema to our data to solve a specific problem (regression, classification, association, clustering, etc.)
  • testing and evaluating the results of model training via defined metrics that quantify performance
  • combining several techniques and models to get a better result in terms of prediction, model robustness, etc. (ensemble modeling); see the sketch below
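Here is a hedged sketch of these steps with scikit-learn, using a built-in toy dataset: a single classifier trained and evaluated with a defined metric, followed by a simple voting ensemble. It illustrates the workflow, not any particular production model.

```python
# Train, evaluate, and ensemble two classifiers on a toy dataset.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Single model: fit on the training set, evaluate with a defined metric
model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)
print("logistic regression accuracy:", accuracy_score(y_test, model.predict(X_test)))

# Ensemble: combine models to improve prediction and robustness
ensemble = VotingClassifier([
    ("lr", LogisticRegression(max_iter=5000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=42)),
])
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", accuracy_score(y_test, ensemble.predict(X_test)))
```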

Model Deployment

When our model is ready and we get good results on the training and test sets (the training and evaluation stage), it’s time to put it into production. This is the final stage, when we get results from data for business, study, research, or maybe just for fun!

Time to get results (Photo by NeONBRAND on Unsplash)

We need to know:

  • how to deploy our model on various ready-to-use, state-of-the-art frameworks; think, for example, of Python tools and libraries such as NumPy, pandas, scikit-learn (ML), TensorFlow, Keras, PyTorch (DL), OpenAI tools for RL, etc.
  • how to get results into the production environment, for optimization, anomaly detection, automation, prediction, etc.
  • how to summarize results for stakeholders (a minimal serving sketch follows this list)
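One possible, minimal way to serve a model, sketched here as an assumption rather than a prescribed method, is to persist the trained estimator with joblib and expose it through a small Flask prediction endpoint that the production environment can call.

```python
# A tiny, hypothetical serving layer: load a saved model and answer
# prediction requests over HTTP.
import joblib
from flask import Flask, jsonify, request

# Earlier, after training: joblib.dump(ensemble, "model.joblib")
app = Flask(__name__)
model = joblib.load("model.joblib")

@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON payload like {"features": [[...], [...]]}
    features = request.get_json()["features"]
    prediction = model.predict(features).tolist()
    return jsonify({"prediction": prediction})

if __name__ == "__main__":
    app.run(port=8000)
```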

So, what’s next?

So far, we have talked about the process. But who is needed at every step? Let’s clarify the concepts and the roles involved.

First of all, let’s summarize the whole process with a picture.

Roles and the data pipeline (image by the author)

Data scientist

As you can see, a data scientist is expected to do everything from data collection to model deployment; he must be aware of the real-world problems and know many techniques for every stage of the process. So, the required skills are:

  • a grasp of SQL and other methods of querying datasets (a small example follows this list)
  • a deep understanding of the algebra, statistics, and set theory behind useful data modeling techniques
  • knowledge of Python, R, Java, C++, or other languages for data cleaning, data manipulation, EDA, and visualization
  • the ability to select or combine modeling techniques suitable for solving the problem, based on the data and the expected results
  • knowing how to combine the data pipeline in a production environment with methods of visualizing and presenting the results, in the form of a web application, reports, commands to machines, etc.
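As a small illustration of the querying skill, here is a sketch that pulls an aggregate from a hypothetical SQLite database with SQL and hands the result to pandas; the database file and table are made up.

```python
# Query a (hypothetical) local SQLite warehouse and continue in pandas.
import sqlite3
import pandas as pd

with sqlite3.connect("warehouse.db") as conn:
    totals = pd.read_sql_query(
        "SELECT region_id, SUM(amount) AS total FROM orders GROUP BY region_id",
        conn,
    )
print(totals.head())
```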

Data engineer

A data engineer is more focused on data collection and data cleaning. In this position, we must be expert in database and data query techniques, besides being able to perform ETL (extract, transform and load) on data from various sources. We must also know how to clean data, deal with null or inconsistent values, and apply many more techniques to build a strong foundation of sources for the ML models that follow.

Data analyst

A data analyst works hard on data cleaning and EDA. He masters statistics, both descriptive and inferential, and is always trying to squeeze every single bit of information out of the data. His role is crucial for data modeling and for building reliable models that can actually capture the behavior of the environment. On the data science path to knowledge, in my opinion, this can be the first step, from which we can then explore the other phases of the process.

Machine learning engineer

A machine learning engineer knows how to get the most out of the data, based on various techniques and ML algorithms; he masters ML models, hyperparameter optimization, evaluation and metrics, and stays on the edge of the latest research in the field. Besides that, he also knows how to scale and deploy models into production systems.

