GIT Essentials for Every Data Scientist
source link: https://towardsdatascience.com/git-essentials-for-every-data-scientist-8d70429e0e92?gi=acbb3f90f033
Image by RegalShave from Pixabay
Introduction to Version Control
Version control is all about managing changes to files and directories by one or many contributors. Git is an incredibly popular system for version control and the one we will be running through in this article.
There are many benefits to version control, and to Git specifically. These include a view of the historical changes made to your project; automatic notification of conflicting work, where two individuals have effectively written conflicting lines of code; and support for collaboration across many individuals, which allows teams to grow.
Version control is a staple of software engineering and something that is slowly being adopted across data science teams, where data scientists often work in silos with their own technologies and workflows. While a degree of autonomy is critical, close collaboration among data scientists is very hard to scale without some process around version control.
What is a Repository?
You’ve likely heard the term many times… repository.
All of your data science projects managed with Git will have two main components. The first is all of the work you're doing in the form of files and directories: your scripts, your models, and where and how they're stored. The second is the information that Git holds onto to maintain a record of all of the changes that have been made to your project over time.
When you add those pieces together, you have yourself a repository, or as the cool kids call it… a repo ;)
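As a minimal sketch of those two pieces, assuming a brand-new project (the directory name churn-model and the file train.py are made up for illustration), the commands below create some work and then ask Git to start keeping its record of it:

```shell
set -eu

# Hypothetical project directory, created in a temp location for illustration.
repo="$(mktemp -d)/churn-model"
mkdir -p "$repo"
cd "$repo"

# Piece one: your work, an ordinary file in the project directory.
echo "print('train model')" > train.py

# Piece two: Git's record keeping. `git init` creates the hidden .git directory.
git init -q

# Lists .git alongside train.py; together they make up the repository.
ls -A
```

Everything Git knows about the project's history lives inside that .git directory, which is why you rarely need to touch it by hand.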
Basic Commands For You to Know
Git status lets you know what is in the “staging area”
The staging area is where you put the changes you want to include in your next commit. It's effectively you prepping a variety of letters and putting them in a box ready to send. Whether you want to remove things from here or add more is up to you, but once you hand the box to the mailman, everything in it goes out together and becomes part of the project's recorded history. Git status tells you which file(s) are in the box ready to go, and it also lists files that have changed but are not in the box yet.
If you ran git status and found there was nothing in your staging area, not to worry! You first need to add files to the staging area. You can do so with git add filename. Whatever file you name here will be moved to the staging area. That means the changes in that file, as they stand at that moment, are ready to be committed to the repo.
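Here is a small sketch of that sequence in a throwaway repository. The file name clean_data.py is hypothetical, and the --short flag is used only to keep the status output compact:

```shell
set -eu

# Throwaway repository for illustration.
repo="$(mktemp -d)"
cd "$repo"
git init -q

echo "import pandas as pd" > clean_data.py   # hypothetical file name

git status --short    # "?? clean_data.py": untracked, nothing staged yet
git add clean_data.py
git status --short    # "A  clean_data.py": staged for the next commit
```

If you change your mind before committing, git restore --staged clean_data.py (or git rm --cached clean_data.py for a file that has never been committed) takes the file back out of the box without touching its contents.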
Now we can see what file is in the staging area with git status, but what about when you want to see what has changed? You can use what's called git diff.
Run on its own, git diff will return the differences between what is in the staging area and the edits you have made since, denoting the two versions as a and b respectively. Rather than a bare git diff, you might actually run git diff -r HEAD. HEAD refers to the most recent commit, so this compares your current files against that commit, whether or not the changes have been staged. The -r flag is not needed for this comparison; it is HEAD that tells Git which version to compare against. If you want to see the changes of one file in particular, you can include the file path after HEAD. Something to the effect of
git diff -r HEAD filepath
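The different comparisons are easier to see side by side. The sketch below assumes a throwaway repository with one hypothetical file, params.txt, and a placeholder user identity so that the commit succeeds on a machine with no Git configuration:

```shell
set -eu

# Throwaway repository with a placeholder identity (illustrative values).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "alpha = 0.1" > params.txt
git add params.txt
git commit -q -m "Add initial parameters"

echo "alpha = 0.5" > params.txt   # edit the file, but do not stage it yet

git diff                   # working files vs staging area: shows the edit
git add params.txt
git diff                   # now prints nothing: no unstaged changes remain
git diff --staged          # staging area vs last commit: shows the edit
git diff HEAD params.txt   # working files vs last commit, for one file only
```

In the output, lines starting with - come from the a side (the older version) and lines starting with + come from the b side (the newer one).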
Once you've added files to your staging area, you can put them in the mailbox with git commit. Keep in mind that anything in the 'box' gets shipped together as one unit. So if you want to undo anything about a given commit, you would have to roll back the entire commit.
A good practice is to commit frequently and in small units, so that each commit is easy to understand and easy to roll back.
One thing to keep in mind is that you usually won't run a bare git commit, which opens a text editor and asks you for a message. Your command will actually look like this: git commit -m "model updates". The -m flag lets you supply the log message right on the command line. A best practice here is to be specific and descriptive about the changes you've made to your project. You'll thank yourself later!
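To illustrate that only what is in the box gets shipped, here is a short sketch. The file names, the commit message, and the user identity are all made up for the example:

```shell
set -eu

# Throwaway repository with a placeholder identity (illustrative values).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "n_estimators = 100" > model_config.txt
echo "scratch notes" > notes.txt

git add model_config.txt    # only this file goes in the box
git commit -q -m "Set n_estimators to 100 for baseline model"

git status --short    # "?? notes.txt": never staged, so not committed
```

A message like the one above says what changed and why, which is far more useful six months later than "model updates".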
Now the last command I'll talk about for now is git log.
git log is where you can pull up your repository's history of commits. It provides a handful of pieces of information for each commit, like the commit hash, the author, the commit date, and the log message.
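As a quick sketch, assuming a throwaway repository with two commits (the file name, messages, and user identity are placeholders), these are a few common ways to read the history:

```shell
set -eu

# Throwaway repository with a placeholder identity (illustrative values).
repo="$(mktemp -d)"
cd "$repo"
git init -q
git config user.name "Example User"
git config user.email "user@example.com"

echo "lr = 0.01" > train_config.txt
git add train_config.txt
git commit -q -m "Add training config"

echo "lr = 0.001" > train_config.txt
git add train_config.txt
git commit -q -m "Tune learning rate"

git log                          # full entries: hash, author, date, message
git log --oneline                # one line per commit, newest first
git log -1 --format='%an | %s'   # author and message of the latest commit
```

The --oneline view is handy for scanning a long history, while the full view is the one to use when you need to know who changed something and when.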
I hope this proves a useful crash-course on git! Git your hands dirty with those commands to get yourself and your data science teams using Git more effectively!
Happy data science-ing!