Extracting Diffs from Git with Python

source link: https://bbengfort.github.io/2016/05/git-diff-extract/

May 6, 2016 · 2 min · Benjamin Bengfort

One of the first steps in analyzing Git repositories is extracting the changes over time, e.g. the Git log. This seems like it should be a very simple thing to do, as visualizations on GitHub and elsewhere show file change analyses through history on a commit-by-commit basis. Moreover, the GitPython library gives you direct, scriptable access to Git repositories. Unfortunately, things aren’t as simple as that, so I present a snippet for extracting change information from a repository.

First things first: dependencies. To use this code you must install GitPython:

$ pip install gitpython

What I’m looking for in this example is the change for every single file throughout time for every commit. This doesn’t necessarily mean the change in the blobs themselves, but metadata about the change that occurred. For example:

  • Object: the path or name of the file
  • Commit: the commit in which the file was changed
  • Author: the username or email of the author of the commit
  • Timestamp: when the file was changed
  • Size: the number of bytes changed (negative for deletions)
  • Type of change: whether the file was added, deleted, modified, or renamed.
  • Stats: the number of lines changed/inserted/deleted.

This pretty straightforward analysis will allow us to build a graph model of how users and files interact inside of a particular project. So here’s the snippet:

## Imports
import os
import git

## Module Constants
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"
EMPTY_TREE_SHA   = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def versions(path, branch='master'):
    """
    This function returns a generator which iterates through all commits of
    the repository located in the given path for the given branch. It yields
    file diff information to show a timeseries of file changes.
    """

    # Create the repository, raises an error if it isn't one.
    repo = git.Repo(path)

    # Iterate through every commit for the given branch in the repository
    for commit in repo.iter_commits(branch):
        # Determine the parent of the commit to diff against.
        # If no parent, this is the first commit, so use empty tree.
        # Then create a mapping of path to diff for each file changed.
        # commit.diff(other) puts the commit itself on the "a" side, so pass
        # R=True to swap sides: a is the parent (old), b is this commit (new).
        parent = commit.parents[0] if commit.parents else EMPTY_TREE_SHA
        diffs  = {
            diff.a_path: diff for diff in commit.diff(parent, R=True)
        }

        # The stats on the commit is a summary of all the changes for this
        # commit, we'll iterate through it to get the information we need.
        for objpath, stats in commit.stats.files.items():

            # Select the diff for the path in the stats
            diff = diffs.get(objpath)

            # If the path is not in the dictionary, it's because it was
            # renamed, so search through the b_paths for the current name.
            if diff is None:
                diff = next(
                    (d for d in diffs.values()
                     if d.renamed and d.b_path == objpath),
                    None,
                )

            # No diff entry matches this stats path, so there is nothing
            # to measure for it; skip rather than report the wrong file.
            if diff is None:
                continue

            # Update the stats with the additional information
            stats.update({
                'object': os.path.join(path, objpath),
                'commit': commit.hexsha,
                'author': commit.author.email,
                'timestamp': commit.authored_datetime.strftime(DATE_TIME_FORMAT),
                'size': diff_size(diff),
                'type': diff_type(diff),
            })

            yield stats


def diff_size(diff):
    """
    Computes the size of the diff by comparing the size of the blobs.
    """
    if diff.b_blob is None and diff.deleted_file:
        # This is a deletion, so return negative the size of the original.
        return diff.a_blob.size * -1

    if diff.a_blob is None and diff.new_file:
        # This is a new file, so return the size of the new value.
        return diff.b_blob.size

    # Otherwise return the change in size: new (b) minus old (a).
    return diff.b_blob.size - diff.a_blob.size


def diff_type(diff):
    """
    Determines the type of the diff by looking at the diff flags.
    """
    if diff.renamed: return 'R'
    if diff.deleted_file: return 'D'
    if diff.new_file: return 'A'
    return 'M'
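A note on the EMPTY_TREE_SHA constant used for the first commit: it is not an arbitrary value, but the SHA-1 that Git assigns to a tree object with no entries. The sketch below recomputes it with the standard library only, assuming a repository that uses SHA-1 object names; the helper name git_object_sha1 is my own and is not part of Git or GitPython.

```python
import hashlib

# Value of EMPTY_TREE_SHA as given in the snippet.
EMPTY_TREE_SHA = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"


def git_object_sha1(kind, payload):
    """
    Hypothetical helper: hash an object the way SHA-1 Git repositories do,
    i.e. sha1(b"<kind> <length>" + NUL byte + payload).
    """
    header = f"{kind} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


# A tree with no entries has an empty payload.
empty_tree_sha = git_object_sha1("tree", b"")
print(empty_tree_sha == EMPTY_TREE_SHA)
```

Because the value depends only on the object format, the same constant can be used as the "parent" of a root commit in any SHA-1 repository.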

The result from this snippet is a generator that yields dictionaries that look something like:

{
  "deletions": 0,
  "insertions": 18,
  "author": "[email protected]",
  "timestamp": "2016-02-23T12:36:59-0500",
  "object": "cloudscope/tests/test_utils/__init__.py",
  "lines": 18,
  "commit": "00c5dd71d86f94dce5fd31b254a1c690c5ec1a53",
  "type": "A",
  "size": 509
}
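The timestamp is a plain string, so for any time-based analysis it probably needs to be parsed back into a datetime. A minimal sketch, assuming Python 3 and the same DATE_TIME_FORMAT as the snippet; parse_timestamp is a name I made up for illustration.

```python
from datetime import datetime, timezone

# Same format string as DATE_TIME_FORMAT in the snippet.
DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"


def parse_timestamp(value):
    """Hypothetical helper: turn a yielded timestamp into an aware datetime."""
    return datetime.strptime(value, DATE_TIME_FORMAT)


parsed = parse_timestamp("2016-02-23T12:36:59-0500")
as_utc = parsed.astimezone(timezone.utc)

print(parsed.isoformat())  # 2016-02-23T12:36:59-05:00
print(as_utc.isoformat())  # 2016-02-23T17:36:59+00:00
```

Normalizing to UTC before comparing is worthwhile, since authors in different time zones will produce different offsets.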

This can be used to create a history of file changes, or to create a graph of files that are commonly changed together.
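Both ideas can be sketched on top of the yielded dictionaries without touching Git again. The example below is only one possible approach: the records are made-up sample data with the same keys as the output above, and file_history and co_changes are hypothetical helper names.

```python
from collections import Counter, defaultdict
from datetime import datetime
from itertools import combinations

DATE_TIME_FORMAT = "%Y-%m-%dT%H:%M:%S%z"

# Made-up sample records, newest commit first, trimmed to the keys used here.
records = [
    {"commit": "c3", "object": "a.py", "type": "M", "timestamp": "2016-02-25T09:00:00-0500"},
    {"commit": "c3", "object": "c.py", "type": "A", "timestamp": "2016-02-25T09:00:00-0500"},
    {"commit": "c2", "object": "a.py", "type": "M", "timestamp": "2016-02-24T09:00:00-0500"},
    {"commit": "c2", "object": "b.py", "type": "M", "timestamp": "2016-02-24T09:00:00-0500"},
    {"commit": "c1", "object": "a.py", "type": "A", "timestamp": "2016-02-23T12:36:59-0500"},
    {"commit": "c1", "object": "b.py", "type": "A", "timestamp": "2016-02-23T12:36:59-0500"},
]


def file_history(records):
    """Map each path to its (commit, type) events, oldest first."""
    events = defaultdict(list)
    for rec in records:
        when = datetime.strptime(rec["timestamp"], DATE_TIME_FORMAT)
        events[rec["object"]].append((when, rec["commit"], rec["type"]))
    return {
        path: [(commit, kind) for _, commit, kind in sorted(items)]
        for path, items in events.items()
    }


def co_changes(records):
    """Count how often each pair of files is changed in the same commit."""
    by_commit = defaultdict(set)
    for rec in records:
        by_commit[rec["commit"]].add(rec["object"])

    pairs = Counter()
    for paths in by_commit.values():
        # Sort so that each pair always has the same (first, second) order.
        pairs.update(combinations(sorted(paths), 2))
    return pairs


history = file_history(records)
pairs = co_changes(records)

print(history["a.py"])
print(pairs.most_common())
```

The pair counts can be read as weighted edges between files, which is one simple way to get at the graph of files that change together.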

