
Resources I wish I knew when I started out with Data Science

 4 years ago
source link: https://mc.ai/resources-i-wish-i-knew-when-i-started-out-with-data-science/

A Powerful Learning Path

Photo by Radu Marcusu on Unsplash

If you’re reading this, you’ve probably made the decision to continue with Data Science. Congratulations!

Learning Data Science is a mammoth task. The only way to eat an elephant is one bite at a time.

It isn’t much different from any other challenging task. There might be more bites to take depending on where you start, but there’s no magic to it and no shortcut. It’s just one bite at a time.

The learning path below consists of 2 main parts: the Mathematics and the Technical Skills. It’s a good idea to start with the Maths so you can gauge whether Data Science is for you. Data Science relies heavily on mathematical concepts, and for some people that can be daunting. Finding that out at the beginning can save you from regret later on.

To summarize: start with the Mathematics.

Part 1: Mathematics

Three topics that are especially important in Data Science are Linear Algebra, Calculus and Statistics. For the majority of tasks, though, you can get away with just Statistics. Even so, I’ve linked useful resources under all 3 topics if you’d like to read more about each.

Linear Algebra

Linear Algebra is (almost) everywhere in Data Science. In a majority of calculations, your computer’s going to use a lot of Linear Algebra. In neural networks, the representation and processing of the network use Linear Algebra. It’s quite hard to think about any model that isn’t implemented using this branch of Mathematics.

As important as this looks, in the majority of cases you won’t be handwriting code to apply matrix transformations to your dataset. What you really need is a good intuition for its core principles.
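To make the neural network remark concrete, here is a minimal sketch of a single network layer written as a matrix product. The input values, weights and the `dense_layer` helper are made up for illustration, and it assumes NumPy is installed.

```python
import numpy as np

# Hypothetical toy input: 2 samples with 3 features each.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Hypothetical weights mapping 3 inputs to 2 outputs, plus one bias per output.
W = np.array([[0.1, 0.2],
              [0.3, 0.4],
              [0.5, 0.6]])
b = np.array([0.5, -0.5])


def dense_layer(X, W, b):
    """Apply a linear map followed by ReLU: max(0, XW + b)."""
    return np.maximum(0.0, X @ W + b)


out = dense_layer(X, W, b)
print(out.shape)  # (2, 2): one row per sample, one column per output
print(out)
```

Libraries do this work for you in practice, which is why intuition about shapes and matrix products tends to matter more than hand-written transformations.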

Use the following resource if you’d like to learn the theory of Linear Algebra and apply it in code:

Calculus

Like Linear Algebra, Calculus plays a large role in Data Science, specifically in the algorithms used in Machine Learning. And, as with Linear Algebra, you don’t have to be a Calculus guru to master Data Science. What you need is an understanding of its core principles and how those principles might affect your models.
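One place where Calculus shows up directly is gradient descent, which many Machine Learning algorithms use to minimise an error function. The sketch below is only an illustration: the function `f`, the learning rate and the step count are assumptions chosen to keep the example small.

```python
def f(x):
    """A toy error function with its minimum at x = 3."""
    return (x - 3.0) ** 2


def grad_f(x):
    """Derivative of f, worked out by hand: d/dx (x - 3)^2 = 2(x - 3)."""
    return 2.0 * (x - 3.0)


def gradient_descent(grad, x0, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= learning_rate * grad(x)
    return x


x_min = gradient_descent(grad_f, x0=0.0)
print(x_min)  # very close to 3.0
```

If the learning rate is too large the steps overshoot and the value diverges, which is one example of how a core principle can affect your model.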

Statistics & Probability

Statistics & Probability is a topic that you’re really going to have to learn. This will take a significant chunk of your time, but the good news is that the concepts aren’t that difficult, so there’s no reason why you shouldn’t master this topic.
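As a small taste of the topic, the sketch below computes a few descriptive statistics and one exact probability using only the Python standard library. The sample values are made up for illustration.

```python
import statistics
from fractions import Fraction
from itertools import product

# Hypothetical sample of measurements.
data = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(data)        # arithmetic average
median = statistics.median(data)    # middle value of the sorted data
spread = statistics.pstdev(data)    # population standard deviation

# Probability that two fair dice sum to 7, by enumerating all 36 outcomes.
outcomes = list(product(range(1, 7), repeat=2))
favourable = [pair for pair in outcomes if sum(pair) == 7]
p_seven = Fraction(len(favourable), len(outcomes))

print(mean, median, spread)  # 5 4.5 2.0
print(p_seven)               # 1/6
```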

Other Topics in Math

These are topics that you probably won’t use on a daily basis as a beginner Data Scientist, but if you want to up your game, they are particularly useful.

1. Graph Theory

“Graph Theory is the study of relationships . Given a set of nodes — which can be used to abstract anything from cities to computer data — Graph Theory studies the relationship between them in a very deep manner and provides answers to many arrangement, networking, optimisation, matching and operational problems. And the strength of it is the power to be used to abstract such a vast array of real problems.” — Frank Hannary

If you’re looking to build models to optimize routes for a logistics company or building a fraud detection system, a graph-based approach will sometimes outperform other solutions.
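For the route optimisation case, a minimal sketch of Dijkstra's shortest-path algorithm is shown below. The depot names and travel costs are hypothetical, and a real logistics model would need far more than this.

```python
import heapq


def shortest_distances(graph, source):
    """Dijkstra's algorithm: cheapest cost from source to each reachable node."""
    distances = {source: 0}
    queue = [(0, source)]  # min-heap of (cost so far, node)
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > distances.get(node, float("inf")):
            continue  # stale entry: a cheaper path was already found
        for neighbor, weight in graph[node].items():
            candidate = dist + weight
            if candidate < distances.get(neighbor, float("inf")):
                distances[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return distances


# Hypothetical road network: node -> {neighbor: travel cost}.
roads = {
    "A": {"B": 4, "C": 1},
    "B": {"D": 1},
    "C": {"B": 2, "D": 5},
    "D": {},
}

distances = shortest_distances(roads, "A")
print(distances["D"])  # 4, via A -> C -> B -> D
```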

2. Discrete Mathematics

Discrete mathematics is the branch of mathematics which deals with objects that can assume only distinct, separated values. Whereas discrete objects can often be characterized by integers, continuous objects require real numbers.

As soon as you start to use mathematics with machines, you’re in the world of discrete mathematics where each number only has so many “bits” available to represent it.
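The limited number of bits is easy to see in Python itself. The sketch below assumes the usual 64-bit floating point numbers (IEEE 754 doubles), which is what Python uses on common platforms.

```python
import math
import sys

# 0.1, 0.2 and 0.3 cannot be stored exactly in binary, so the sum is slightly off.
exact_match = (0.1 + 0.2 == 0.3)
close_enough = math.isclose(0.1 + 0.2, 0.3)

# Number of bits available for the significant digits of a float.
mantissa_bits = sys.float_info.mant_dig

# Beyond 2**53, adding 1.0 no longer changes the stored value.
big = float(2 ** 53)
lost_increment = (big + 1.0 == big)

print(exact_match)     # False
print(close_enough)    # True
print(mantissa_bits)   # 53 for IEEE 754 doubles
print(lost_increment)  # True
```

This is why comparisons between computed floats are usually done with a tolerance, as `math.isclose` does, rather than with `==`.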

If you’re terrified of math and dread the sight of mathematical equations, you’re not going to have much fun as a data scientist. If, however, you’re willing to invest time to improve your familiarity with probability and statistics and to learn the principles underlying calculus and linear algebra, math should not get in the way of you becoming a professional data scientist.

PS: Math really is fun. As you go deeper into math, you too will share Data Scientists’ unbridled passion for Mathematics.

Part 2: Technical Skills

Now to the next, slightly more interesting part. Roughly 2.5 exabytes of data are generated every day (1 exabyte = 10¹⁸ bytes). Data at this scale is often referred to as ‘Big Data’, and it would be absurd to try to analyze even small portions of it without computers.

“How much programming is required in data science, particularly statistical analysis and machine learning?”

A lot. In practice, almost every data science job will require you to code, for the reasons specified above, and also because most companies require some data cleaning, implementation and productization, and adaptation of algorithms to their own specific purposes. If you can’t implement your own solutions into something product-ready, then you are a much less useful employee. (Source)

Python

Python is by far the most widely used programming language in Data Science. In JetBrains’ 2016 survey, almost four out of five developers said that Python was their main language.

I would recommend that you focus on Python and spend a little time on R as well.

R

Python will suffice for a majority of your Data Science projects, but to really be called a well-rounded Data Scientist, you need to have R in your toolkit. You don’t have to be a guru in both R and Python. Choose one and learn the fundamentals of the other.

Data Analysis with Python (Numpy, Pandas & Matplotlib)

If you’re learning Python specifically for Data Science, you’ll need to know how to analyze data, specifically how to load, manipulate, and visualize data.
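The load, manipulate and visualize steps might look like the sketch below. The CSV content, column names and cities are made up for illustration, and the example assumes pandas is installed. The plotting call is left as a comment because it also needs Matplotlib.

```python
import io

import pandas as pd

# Hypothetical sales data; in practice you would pass a file path to read_csv.
csv_text = """city,month,sales
Lagos,Jan,120
Lagos,Feb,150
Nairobi,Jan,90
Nairobi,Feb,
"""

# Load: parse the CSV into a DataFrame (the empty cell becomes NaN).
df = pd.read_csv(io.StringIO(csv_text))

# Manipulate: fill the missing value with the column mean, then total per city.
df["sales"] = df["sales"].fillna(df["sales"].mean())
totals = df.groupby("city")["sales"].sum()

print(totals)

# Visualize (requires Matplotlib):
# totals.plot(kind="bar")
```

Filling missing values with the mean is only one possible choice. Whether it is appropriate depends on the data.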

Machine Learning with Python

Machine learning algorithms are at the core of Data Science. Invest some time to grasp their theory and applications.
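To connect the theory to an application, here is a minimal sketch of one of the simplest learning algorithms, least-squares linear regression, written in plain Python. The data points and the `fit_line` helper are hypothetical. In real projects you would normally reach for a library instead.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept to the given points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data that follows y = 2x + 1 exactly.
train_x = [1, 2, 3, 4]
train_y = [3, 5, 7, 9]

slope, intercept = fit_line(train_x, train_y)

# Predict for an input the model has not seen during fitting.
prediction = slope * 5 + intercept
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```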

SQL

SQL, or Structured Query Language, is vital for a data scientist. One of the fundamental steps in data modelling is extracting the data in the first place. More often than not, this involves running SQL queries against a database.
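A typical extraction query can be tried out without any database server using Python's built-in `sqlite3` module. The `orders` table and its rows below are hypothetical, and a company database will use a different engine and schema.

```python
import sqlite3

# In-memory database so the example leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("alice", 30.0), ("bob", 20.0), ("alice", 25.5)],
)

# Extract: total spend per customer, largest first.
rows = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    ORDER BY total DESC
    """
).fetchall()
conn.close()

print(rows)  # [('alice', 55.5), ('bob', 20.0)]
```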

Production Systems

If you’re in a job, you’ll be utilizing the company’s computational resources to extract, transform and analyze data. A single machine is often not enough to perform these tasks.

It’s worth taking the time to learn the tools below, because they’re used extensively in the industry today.

SQL is one of them. Cloud-computing platforms like Amazon Web Services (AWS), Google Cloud and Microsoft Azure are also used by a large number of businesses.

Another useful skill is Version Control.

Gaining Practical Experience

Following the resources I mentioned above will only get you halfway. To apply the knowledge you’ve just gained, you need to, well, apply it by practising!

To truly master these concepts, you will need to use your skills in projects that ideally resemble real-world applications. You will encounter problems to work through, such as erroneous data (there’s no escaping this), and working through them is how you develop a really deep level of expertise in Data Science.

Here are a couple of good places where you can get this practical experience from (for free):

Kaggle

Machine Learning competitions serve many purposes. They serve as a channel for problem-solving and brainstorming. Kaggle is one of the most well-known platforms for Data Science. It’s a great way to apply your newly-gained skills.

Here is a list of 10 Data Science Competitions for you to hone your skills:

UCI Machine Learning Repository

The UCI Machine Learning Repository is a large source of publicly available data sets which you can use to put together your own data projects.

It is worth noting that storing your projects publicly on GitHub is a good idea, as this creates a portfolio showcasing your skills that you can use for future job applications.

UCI ML Repository

Contributing to Open Source

A good option to consider is contributing to open source projects. There are many Python projects that rely on the developer community to maintain them. GitHub is a good place to start.

NumFOCUS, which supports many open source scientific computing projects, is a good place to find projects like this.

