
Best practices for organizing data science projects

source link: https://www.tuicool.com/articles/U7vMNfZ
Photo by Siriwan Srisuwan on  Unsplash

Data science projects involve, in most cases, a lot of data artifacts (documents, Excel files, data from websites, R files, Python files) and require repeating and improving each step while understanding the underlying logic behind each decision.

1) Objectives of a data organization system

There are several objectives to achieve:

  1. Optimization of time: we need to save time by minimizing lost files, problems reproducing code, and problems explaining the reasoning behind decisions.
  2. Reproducibility: data science projects have an active component of repetition, and it is a benefit if the organization system helps you easily recreate any part of your code (or the entire project), now and perhaps at some moment in the future (6 months, 1 year, 2 years …).
  3. Improved quality of the projects: organized projects usually mean detailed explanations along the process. While documenting and having to explain the reason behind each step, it is more likely that bugs and inconsistencies are found.

2) Starting a new project: the beginning

From the very beginning, it is good practice to start a data science project with a solid organization; instead of considering that a waste of time, we can see it as a savvy approach to saving time in different ways.

Also, because we are working with others inside an organization, it is important to understand that everyone has different workflows and ways of working. For a shared project it is a good idea to reach a real consensus not only about the folder structure but also about the expected content of each folder.
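As an illustration, a skeleton like the one sketched below can be agreed upon and created at the start of the project. The folder names are assumptions, loosely based on common community conventions (such as Cookiecutter Data Science); adapt them to whatever your team agrees on.

```python
# Minimal sketch: create an agreed-upon folder skeleton for a new project.
# Folder names are assumptions inspired by common conventions; adjust to your team's consensus.
from pathlib import Path

FOLDERS = [
    "data/raw",         # original, immutable data dumps
    "data/interim",     # intermediate, transformed datasets
    "data/processed",   # final datasets ready for modeling
    "notebooks",        # exploratory analysis
    "src",              # reusable R/Python code
    "reports/figures",  # generated analysis and plots
    "docs",             # project documentation
]

def create_skeleton(root: str = "my-ds-project") -> None:
    """Create the project root and the agreed folder structure inside it."""
    for folder in FOLDERS:
        Path(root, folder).mkdir(parents=True, exist_ok=True)

if __name__ == "__main__":
    create_skeleton()
```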

3) Use version control

Why is it necessary to use version control? To delegate basic tasks like:

  • To have an automated backup system for the work; for that reason alone, the effort needed to implement it is highly valuable.
  • To handle changes to the files throughout the project, and to return to previous versions in order to check something. Version control systems solve the problem of reviewing and retrieving previous changes and allow a single file to be used rather than duplicated copies.
  • To facilitate working with others by making it easy to share files and keep working on them.

Some of the most popular tools are Git and Subversion (SVN); no matter the final choice, the best idea is to adopt one.
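A minimal sketch of getting started with Git from Python is shown below. It simply shells out to the `git` command line, which is assumed to be installed and configured (user.name and user.email set); the project folder name is a placeholder.

```python
# Minimal sketch: put a new project under Git version control.
# Assumes the `git` command-line tool is installed and on the PATH,
# and that user.name/user.email are already configured.
# "my-ds-project" is a hypothetical project folder.
import subprocess
from pathlib import Path

project = Path("my-ds-project")
project.mkdir(exist_ok=True)

subprocess.run(["git", "init"], cwd=project, check=True)
subprocess.run(["git", "add", "."], cwd=project, check=True)
subprocess.run(
    ["git", "commit", "--allow-empty", "-m", "Initial project structure"],
    cwd=project,
    check=True,
)
```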

4) Document everything

When we speak about documentation, we are referring to:

  • documents included for analysis
  • intermediate datasets
  • intermediate versions of your code

The most challenging decision is determining how much time to invest in documentation: too much and, yes, it is a waste of time; too little and the documentation will be incomplete and useless.
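As a lightweight illustration of documenting an intermediate dataset, the sketch below writes a small metadata "sidecar" file next to it. All file names and fields here are purely illustrative assumptions.

```python
# Minimal sketch: document an intermediate dataset with a metadata sidecar file.
# All file names and fields are illustrative assumptions.
import json
from datetime import date
from pathlib import Path

Path("data/interim").mkdir(parents=True, exist_ok=True)

metadata = {
    "file": "data/interim/sales_cleaned.csv",  # hypothetical intermediate dataset
    "source": "data/raw/sales_2019.csv",       # hypothetical raw input
    "script": "src/clean_sales.py",            # code that produced the file
    "created": date.today().isoformat(),
    "notes": "Removed duplicate orders; normalized currency to EUR.",
}

with open("data/interim/sales_cleaned.meta.json", "w") as f:
    json.dump(metadata, f, indent=2)
```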

5) Improve the process

The fundamental idea is to evaluate the process and improve the workflow.

When finishing a project or delivering something, it is a good idea to evaluate whether there is anything to improve: a better organization for the files, or a correction to the way you document. No matter what, in the end the idea is to understand that any process is in constant motion and you need to keep improving it.

Conclusions

Managing the organization of a data project means evaluating the objectives of the organization system, how to structure the data, the best way to establish a backup system and version control, and finally how to document all the processes.

