
Best practices for organizing data science projects

source link: https://www.tuicool.com/articles/U7vMNfZ
Photo by Siriwan Srisuwan on  Unsplash

Data science projects involve, in most cases, a lot of data artifacts (documents, Excel files, data from websites, R files, Python files) and require repeating and improving each step while understanding the underlying logic behind each decision.

1) Objectives of a data organization system

There are several objectives to achieve:

  1. Optimization of time: we need to save time by minimizing lost files, problems reproducing code, and problems explaining the reasoning behind decisions.
  2. Reproducibility: data science projects have an active component of repetition, and it is a benefit if the organization system helps you easily recreate any part of your code (or the entire project), now and perhaps at some moment in the future (6 months, 1 year, 2 years …).
  3. Improved quality of the projects: organized projects usually mean detailed explanations along the process. While documenting and having to explain the reason behind each step, it is more likely that bugs and inconsistencies are found.

2) Starting a new project: the beginning

From the very beginning, it is good practice to start a data science project with a solid organization; instead of considering that a waste of time, we can see it as a savvy approach to saving time in different ways.

Also, because we are working with others inside an organization, it is important to understand that everyone has different workflows and ways of working. For a shared project it is a good idea to reach a real consensus not only about the folder structure but also about the expected content of each folder.
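As an illustration, a skeleton like the one sketched below can be agreed upon and created at the start of the project. The folder names are assumptions, loosely based on common community conventions (such as Cookiecutter Data Science); adapt them to whatever your team agrees on.

```python
# Minimal sketch: create an agreed-upon folder skeleton for a new project.
# Folder names are assumptions inspired by common conventions; adjust to your team's consensus.
from pathlib import Path

FOLDERS = [
    "data/raw",         # original, immutable data dumps
    "data/interim",     # intermediate, transformed datasets
    "data/processed",   # final datasets ready for modeling
    "notebooks",        # exploratory analysis
    "src",              # reusable R/Python code
    "reports/figures",  # generated analysis and plots
    "docs",             # project documentation
]

def create_skeleton(root: str = "my-ds-project") -> None:
    """Create the project root and the agreed folder structure inside it."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_skeleton()
```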

3) Use version control

Why is it necessary to use version control? To delegate basic tasks like:

  • To have an automated backup system for the work; for that reason alone, the effort needed to implement it is highly valuable.
  • To handle changes to the files throughout the project, and to return to previous versions in order to check something. Version control systems solve the problem of reviewing and retrieving previous changes and allow a single file to be used rather than duplicated copies.
  • To facilitate working with others by making it easy to share files and keep working on them.

Some of the most popular tools are Git and Subversion (SVN); no matter the final choice, the best idea is to adopt one.
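A minimal sketch of getting started with Git from Python is shown below. It simply shells out to the `git` command line, which is assumed to be installed and configured (user.name and user.email set); the project folder name is a placeholder.

```python
# Minimal sketch: put a new project under Git version control.
# Assumes the `git` command-line tool is installed and on the PATH,
# and that user.name/user.email are already configured.
# "my-ds-project" is a hypothetical project folder.
import subprocess
from pathlib import Path

project = Path("my-ds-project")
project.mkdir(exist_ok=True)

subprocess.run(["git", "init"], cwd=project, check=True)
subprocess.run(["git", "add", "."], cwd=project, check=True)
subprocess.run(
    ["git", "commit", "--allow-empty", "-m", "Initial project structure"],
    cwd=project,
    check=True,
)
```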

4) Document everything

When we speak about documentation, we are referring to:

  • documents included for analysis
  • intermediate datasets
  • intermediate versions of your code

The most challenging decision is determining how much time to invest in documentation: too much and, yes, it is a waste of time; too little and the documentation will be incomplete and useless.
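As a lightweight illustration of documenting an intermediate dataset, the sketch below writes a small metadata "sidecar" file next to it. All file names and fields here are purely illustrative assumptions.

```python
# Minimal sketch: document an intermediate dataset with a metadata sidecar file.
# All file names and fields are illustrative assumptions.
import json
from datetime import date
from pathlib import Path

Path("data/interim").mkdir(parents=True, exist_ok=True)

metadata = {
    "file": "data/interim/sales_cleaned.csv",  # hypothetical intermediate dataset
    "source": "data/raw/sales_2019.csv",       # hypothetical raw input
    "script": "src/clean_sales.py",            # code that produced the file
    "created": date.today().isoformat(),
    "notes": "Removed duplicate orders; normalized currency to EUR.",
}

with open("data/interim/sales_cleaned.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```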

5) Improve the process

The fundamental idea is to evaluate the process and improve the workflow.

When finishing a project or delivering something, it is a good idea to evaluate whether there is anything to improve: a better organization for the files, or a correction to the way you document. No matter what, in the end the idea is to understand that any process is in constant motion and you need to keep improving it.

Conclusions

Managing the organization of a data project means evaluating the objectives of the organization system, how to structure the data, the best way to establish a backup system and version control, and finally how to document all the processes.

