Setting Up Your Data Science Work Bench

[Image: photo by ThisisEngineering RAEng on Unsplash]

Get your computer ready for learning data science

Mar 28 · 6 min read

In my last post, I covered the core tools required for data science work. In this article, I am going to give a step-by-step guide to getting your computer set up to perform typical data science and machine learning tasks.

I personally work on a Mac, so most of the setup instructions are written for macOS.

Install Python

As discussed in my last post, Python is now the most popular programming language for data science practitioners. The first step in configuring your computer is therefore to install Python.

To install and configure Python on your computer you will need to use the terminal. If you have not already set up Apple's developer tools, you will need to download and install Xcode (Apple’s Integrated Development Environment).

Mac OS X comes with Python 2.7 already installed. However, for many data science projects you will need to be able to work with a variety of different Python versions.

There are a number of tools available for installing and managing different Python versions; pyenv is probably one of the simplest to use. It supports managing Python versions at both the user and project level.

To install pyenv you will first need to install Homebrew, which is a package manager for Mac. Once you have this you can install pyenv with this command (for Windows installation see these instructions).

brew install pyenv

You will then need to add the pyenv initializer to your shell startup script and reload it by running the following. These commands assume bash; if your shell is zsh, use ~/.zshrc in place of ~/.bash_profile.

echo 'eval "$(pyenv init -)"' >> ~/.bash_profile
source ~/.bash_profile

To view the versions of Python that are available to install, run the following command (pyenv versions lists the ones already installed on your system).

pyenv install --list

To install a new version of Python simply run the following.

pyenv install <python-version>
# e.g. pyenv install 3.7.6
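Once a version is installed, a quick way to confirm which interpreter your shell actually picks up is to ask Python itself. The following is a minimal sketch; the function name meets_minimum and the 3.7 threshold are illustrative assumptions, not part of pyenv.

```python
import sys


def meets_minimum(required=(3, 7)):
    """Return True if the running interpreter is at least `required`.

    `required` is a (major, minor) tuple; (3, 7) is only an example value.
    """
    return sys.version_info[:2] >= required


if __name__ == "__main__":
    # sys.executable shows the path of the interpreter that is running.
    print("Interpreter:", sys.executable)
    print("Version:", sys.version.split()[0])
    print("At least 3.7:", meets_minimum())
```

If the path printed for the interpreter does not sit under your pyenv directory, the shell initializer has probably not been loaded yet.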

Install Python packages

pip is the preferred tool for installing Python packages and is included by default with Python 3.4 and above. You will need this to install any open-source Python libraries.

To install a package using pip simply run the following.

pip install <package>
# e.g. pip install pandas
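To check from Python whether a package was actually installed into the interpreter you are using, the standard library can report its version. This is a small sketch that assumes Python 3.8 or newer (importlib.metadata is not available before that); installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # The package names here are only examples.
    for name in ("pandas", "numpy"):
        print(name, "->", installed_version(name) or "not installed")
```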

Virtual environments

Different Python projects will require different dependencies and versions of Python. It is therefore important to have a way to create isolated and reproducible environments for each project. Virtual environments accomplish this.

There are a number of tools to create Python virtual environments, but I personally use pipenv.

Pipenv can be installed with Homebrew.

brew install pipenv

To create a new environment using a specific version of Python, make a new directory and then run the following command from inside it.

mkdir pip-test
cd pip-test
pipenv --python 3.7

To activate the environment run pipenv shell. You will now be working inside a new environment for ‘pip-test’.

If you inspect the contents of the directory you will see that pipenv has created a new file called Pipfile. This is the pipenv equivalent of a requirements file and lists the packages and version constraints used in the environment.
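If you are ever unsure whether the environment is active, Python can tell you. The sketch below relies only on the sys module; in_virtual_environment is a hypothetical helper name, and the check covers both older virtualenv releases and the standard venv mechanism.

```python
import sys


def in_virtual_environment():
    """Return True if the interpreter is running inside a virtual environment."""
    # Older virtualenv releases set sys.real_prefix; venv and newer
    # virtualenv releases make sys.prefix differ from sys.base_prefix.
    if hasattr(sys, "real_prefix"):
        return True
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Inside a virtual environment:", in_virtual_environment())
    # sys.prefix is the root directory of the active environment.
    print("Environment location:", sys.prefix)
```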

To install packages into the pipenv environment simply run the following.

pipenv install pandas

Any packages installed will be reflected in the Pipfile, which means that the environment can be recreated easily using this file.

Jupyter Notebooks

Jupyter Notebooks are a web-based application for writing code. They are particularly suited to data science tasks because they enable you to render documentation, diagrams, tables and charts directly in line with your code. This creates a highly interactive and shareable platform for developing data science projects.

To install Jupyter Notebooks simply run pip install notebook or, if you are working in a pipenv environment, pipenv install notebook.

As this is a web-based application you need to start the notebook server to begin writing your code. You do this by running this command.

jupyter notebook

This will open the application in your web browser; the default URL is http://127.0.0.1:8888.

[Figure: The Notebook application running in the browser]

Jupyter Notebooks can work with virtual environments, so you can run the notebooks for a project in the correct project environment. To make the pipenv environment available in the web application, run the following from inside the pipenv shell (the ipykernel package must be installed in the environment).

python -m ipykernel install --user --name myenv --display-name "Python (myenv)"
# e.g. python -m ipykernel install --user --name jup-test --display-name "Python (jup-test)"

If you now restart the web application and go to New, you will see your pipenv environment available. Selecting this will start a new notebook that will run with all the dependencies you have set up in your pipenv shell.

[Figure: A notebook running in the Python (jup-test) environment]
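If the environment does not show up in the menu, it can help to list the kernels Jupyter knows about. The sketch below assumes the jupyter_client package (installed alongside notebook) is importable and returns None otherwise; available_kernels is a hypothetical helper name.

```python
def available_kernels():
    """Return the sorted names of registered Jupyter kernels.

    Returns None if the jupyter_client package is not installed.
    """
    try:
        from jupyter_client.kernelspec import KernelSpecManager
    except ImportError:
        return None
    # find_kernel_specs() maps each kernel name to its resource directory.
    return sorted(KernelSpecManager().find_kernel_specs())


if __name__ == "__main__":
    print(available_kernels())
```

The name passed to --name when registering the kernel (jup-test in the example above) is the one that should appear in this list.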

Python IDE

Jupyter Notebooks are very good for exploratory data science projects and for writing code that you will only use once. However, for efficiency, it is a good idea to write commonly used pieces of code into functions within modules that can be imported and used across projects (this is known as modularising your code).
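As an illustration of what modularising might look like, here is a small helper of the kind that tends to get copied between notebooks. The module name cleaning.py and the function normalise_column_names are hypothetical; the point is only that the function lives in a file that any project can import.

```python
# cleaning.py (hypothetical module name)
import re


def normalise_column_names(names):
    """Return lower-case, underscore-separated versions of column names."""
    cleaned = []
    for name in names:
        name = name.strip().lower()
        # Collapse every run of non-alphanumeric characters into one underscore.
        name = re.sub(r"[^a-z0-9]+", "_", name)
        cleaned.append(name.strip("_"))
    return cleaned
```

With that file on your path, a notebook would only need from cleaning import normalise_column_names rather than a pasted copy of the function.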

Notebooks are not particularly well suited to writing modules. For this task, it is better to use an IDE (Integrated Development Environment). There are many available but I personally use PyCharm. The benefit of using IDEs is that they come with tools such as GitHub integration and unit testing built in.

PyCharm has both a paid professional version and a free community edition. To download and install PyCharm visit this website and follow the installation instructions.
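To give a sense of what a unit test looks like, here is a minimal sketch using the standard library's unittest module, which IDE test runners can typically discover and run. The function split_rows and the test class are hypothetical examples, not code from any particular project.

```python
import unittest


def split_rows(rows, train_fraction=0.8):
    """Split a list into (train, test) parts without shuffling."""
    if not 0 < train_fraction < 1:
        raise ValueError("train_fraction must be between 0 and 1")
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


class SplitRowsTest(unittest.TestCase):
    def test_sizes(self):
        train, test = split_rows(list(range(10)), train_fraction=0.8)
        self.assertEqual(len(train), 8)
        self.assertEqual(len(test), 2)

    def test_rejects_bad_fraction(self):
        with self.assertRaises(ValueError):
            split_rows([1, 2, 3], train_fraction=1.5)
```

Outside an IDE, the same tests can be run from the terminal with python -m unittest.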

Version control

One final tool you will want to use for your data science projects is GitHub, the most commonly used hosting platform for Git version control. Version control essentially involves storing a version of your project online. Development is then performed locally on branches. A branch is essentially a copy of the project where changes can be made that will not affect the master version.

Once changes have been made locally you can push them to GitHub, where they can be merged into the master branch in a controlled process known as a pull request.

Using GitHub will enable you to track changes to your project. You can also make changes and test the impact they have before integrating them into the final version. GitHub also enables collaboration with others, as they can safely make changes without impacting the master branch.

To use GitHub you first need to install Git, which can be done by following these instructions. You will then need to visit the GitHub website and create an account.
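Before going further, it can be worth confirming that Git and the other command-line tools from this guide are on your PATH. The sketch below uses only the standard library; the TOOLS tuple and the find_tools name are illustrative assumptions, so edit the list to match what you installed.

```python
import shutil

# Command-line tools mentioned in this article; edit the tuple to suit.
TOOLS = ("git", "pyenv", "pipenv", "jupyter")


def find_tools(names=TOOLS):
    """Map each tool name to its path on PATH, or None if it is not found."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        print(f"{name:8} {path or 'not found'}")
```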

Once you have an account you can create a new repository.

[Figure: Creating a new repository]

To work on the project locally you will need to clone the repository.

cd my-directory
git clone https://github.com/rebeccavickery/my-repository.git

This article is a guide to setting up your computer ready to work on data science projects. Many of the tools I have listed are my personally preferred tools. However, in most cases, there are several alternatives. It is worth exploring the different options to find those best suited to your working style and projects.

Thanks for reading!

I send out a monthly newsletter. If you would like to join, please sign up via this link. Looking forward to being part of your learning journey!
