

Best practices for organizing data science projects
source link: https://www.tuicool.com/articles/U7vMNfZ



Data science projects usually involve many data artifacts (documents, Excel files, data from websites, R files, Python files) and require repeating and improving each step while understanding the logic behind each decision.
1) Objectives of data organization
There are several objectives to achieve:
- Optimization of time: we need to save time by minimizing lost files, problems reproducing code, and problems explaining the reasoning behind decisions.
- Reproducibility: data science projects involve a lot of repetition, so it helps if the organization system makes it easy to recreate any part of your code (or the entire project), now and at some point in the future (6 months, 1 year, 2 years…).
- Improving the quality of the projects: organized projects usually come with detailed explanations along the way. While documenting the work and having to explain the reason behind each step, you are more likely to find bugs and inconsistencies.
2) Starting a new project: the beginning
It is good practice to organize a data science project well from the very beginning. Rather than a waste of time, this is a savvy approach that saves time in different ways.
Also, because we work with others inside an organization, it is important to understand that everyone has different workflows and ways of working. For a shared project, it is a good idea to reach a real consensus not only on the folder structure but also on the expected content of each folder.
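As a minimal sketch of what such a consensus could look like in practice, the script below creates a folder skeleton and writes a short README.txt into each folder stating its expected content. The folder names, the descriptions and the create_skeleton helper are illustrative assumptions, not a layout prescribed by the article; a team would replace them with whatever it agrees on.

```python
from pathlib import Path

# Hypothetical layout: folder name -> the content the team agreed to keep there.
LAYOUT = {
    "data/raw": "Original data exactly as received; never edited by hand.",
    "data/processed": "Intermediate and final datasets produced by code.",
    "docs": "Reference documents and notes on decisions.",
    "src": "R and Python source files.",
    "reports": "Deliverables shared outside the team.",
}


def create_skeleton(root: Path, layout: dict = LAYOUT) -> list:
    """Create each folder with a README.txt stating its expected content."""
    created = []
    for folder, purpose in layout.items():
        path = root / folder
        path.mkdir(parents=True, exist_ok=True)
        readme = path / "README.txt"
        if not readme.exists():  # do not overwrite notes the team has edited
            readme.write_text(purpose + "\n", encoding="utf-8")
        created.append(path)
    return created


if __name__ == "__main__":
    for folder in create_skeleton(Path("my_project")):
        print("ready:", folder)
```

Keeping the agreed description next to the files means a new team member can read the rules in the folder itself rather than in a separate document.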
3) Use version control
Why is version control necessary? It lets you delegate basic tasks:
- Having an automated backup system for the work. That benefit alone justifies the effort needed to set it up.
- Handling changes to the files throughout the project, and returning to previous versions to check something. Version control systems solve the problem of reviewing and retrieving previous changes, and they let you keep a single file instead of duplicated copies.
- Facilitating work with others, since it becomes easy to share files and keep working on them.
Some of the most popular tools are Git, Subversion (SVN)… Whatever the final choice, the important thing is to use one.
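To make the idea concrete, here is a rough sketch of recording a snapshot of a project with Git from Python. It assumes the git command-line tool is installed; the snapshot helper, the author name and the email address are placeholders invented for illustration, and in everyday use most people would simply type the same git commands in a terminal.

```python
import subprocess
from pathlib import Path


def git(project_dir: Path, *args: str) -> str:
    """Run one git command inside project_dir and return its output."""
    result = subprocess.run(
        ["git", *args],
        cwd=project_dir,
        check=True,  # raise an error if git reports a failure
        capture_output=True,
        text=True,
    )
    return result.stdout.strip()


def snapshot(project_dir: Path, message: str) -> str:
    """Commit the current state of the project and return the commit hash."""
    if not (project_dir / ".git").exists():
        git(project_dir, "init")
    git(project_dir, "add", "--all")
    git(
        project_dir,
        "-c", "user.name=Example Analyst",        # placeholder identity
        "-c", "user.email=analyst@example.com",   # placeholder identity
        "commit", "--allow-empty", "-m", message,
    )
    return git(project_dir, "rev-parse", "HEAD")
```

Each call to snapshot leaves a point in the history that can be reviewed or restored later, which is what replaces keeping duplicated copies of a file.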
4) Document everything
When we speak about documentation, we are referring to:
- documents included for analysis
- intermediate datasets
- intermediate versions of your code
The most challenging decision is how much time to invest in documentation: too much and it becomes a waste of time; too little and the documentation will be incomplete and useless.
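One low-cost way to keep documentation from taking too much time is a plain-text decision log with one short entry per step. The sketch below is only an illustration: the log_decision helper, the file name and the entry format are assumptions, not something the article specifies.

```python
from datetime import date
from pathlib import Path


def log_decision(log_file: Path, step: str, decision: str, reason: str) -> str:
    """Append one dated entry recording what was decided and why."""
    entry = (
        f"{date.today().isoformat()} | {step}\n"
        f"  decision: {decision}\n"
        f"  reason:   {reason}\n"
    )
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a", encoding="utf-8") as handle:
        handle.write(entry)
    return entry


if __name__ == "__main__":
    print(
        log_decision(
            Path("docs/decisions.txt"),  # hypothetical location
            step="cleaning",
            decision="dropped rows with a missing customer id",
            reason="they cannot be joined to the sales table",
        )
    )
```

Because entries are appended rather than rewritten, the file also keeps the order in which decisions were made.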
5) Improve the process
The fundamental idea is to evaluate the process and improve the workflow.
When you finish a project or deliver something, it is a good idea to evaluate whether there is something to improve: a better organization of the files, or a better way to document. In the end, the idea is to understand that any process is in constant movement and needs continual improvement.
Conclusions
Managing the organization of a data project means evaluating the objectives of the organization system, how to structure the data, the best way to set up backups and version control, and finally how to document all the processes.