Gitential Datasets for Open Source Projects

TLDR: Raw Git data extracted via libgit2 for a couple of open source projects.

Each directory contains sample CSV files and a README with the download links (Parquet and JSON formats).

We intend to publish more data (a lot, actually) with additional metrics and dimensions.

Please feel free to request datasets for other repositories and/or projects in the issues!

Schema

Commits

Column	Type	Description
id	string	The commit's SHA
delay	int64	Seconds elapsed between the creation and last application of the commit (rebases can cause negative values)
age	int64	Shortest interval between the commit and its parents
ismerge	bool	Whether the commit has two or more parents (or is a squash)
squashof	int64	Whether it is a squash and merge commit (currently parsed from commit message)
author_name	string	The author's name
author_email	string	The author's email address
committer_name	string	The committer's name
committer_email	string	The committer's email address
author_time	datetime	The author signature's timestamp
committer_time	datetime	The committer signature's timestamp
loc_d	int64	Number of lines deleted in this commit
loc_i	int64	Number of lines inserted in this commit
comp_d	int64	Whitespace complexity deleted in this commit
comp_i	int64	Whitespace complexity inserted in this commit
nfiles	int64	Number of files (patches) affected by this commit
message	string	The (nice and shiny and fixless :) commit messages
ndiffs	int64	Number of diffs and parent commits
author_email_dedup	string	Author's deduplicated email address
author_name_dedup	string	Author's deduplicated name
committer_email_dedup	string	Committer's deduplicated email address
committer_name_dedup	string	Committer's deduplicated name
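
As a rough sketch of how the Commits columns can be combined, the snippet below sums inserted and deleted lines per deduplicated author. The rows are hypothetical placeholders that only follow the schema above, and dropping merge commits is an analysis choice made here, not something the dataset prescribes.

```python
import pandas as pd

# Hypothetical rows that follow the Commits schema (not real data).
commits = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "ismerge": [False, False, True, False],
    "author_email_dedup": [
        "ann@example.com", "bob@example.com",
        "ann@example.com", "ann@example.com",
    ],
    "loc_i": [10, 5, 0, 7],
    "loc_d": [2, 1, 0, 3],
})

# Keep only non-merge commits (an assumption of this sketch).
non_merge = commits[~commits["ismerge"]]

# Total inserted and deleted lines per deduplicated author email.
per_author = (
    non_merge.groupby("author_email_dedup")[["loc_i", "loc_d"]]
    .sum()
    .sort_values("loc_i", ascending=False)
)
print(per_author)
```

With the real data, `commits` would come from `pd.read_parquet` instead of being built by hand.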

Patches

These are the individual files touched by commits. Patches are generated by diffing two revisions.

Column	Type	Description
id	string	The commit the patch belongs to
parent_id	string	The parent commit which the diff was generated against
oldpath	string	The file's old path before applying the patch
newpath	string	The file's new path after applying the patch
ismerge	bool	Whether the commit has two or more parents (or is a squash)
status	string	What kind of modification happened with the file (added / deleted / modified / etc)
author_time	datetime	The author signature's timestamp
oldsize	int64	The file's size in bytes before applying the patch
newsize	int64	The file's size in bytes after applying the patch
language	string	Programming language of this patch
langtype	string	Language types given by Github's linguist
skipped	string	Whether the patch generation has been skipped or NOT (otherwise the reason)
istest	bool	Whether the file is a test file or not
loc_d	int64	Number of lines deleted in this patch
loc_i	int64	Number of lines inserted in this patch
comp_d	int64	Whitespace complexity deleted in this patch
comp_i	int64	Whitespace complexity inserted in this patch
loc_d_std	float32	Deleted number of lines deviation in the hunks
loc_i_std	float32	Inserted number of lines deviation in the hunks
comp_d_std	float32	Deleted complexity deviation in the hunks
comp_i_std	float32	Inserted complexity deviation in the hunks
nhunks	int64	Number of hunks in this patch
nblames	int64	Number of unique commits this patch has churned lines from
blame_loc	int64	Number of lines this patch has churned (deleted)
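
Because every patch carries the `id` of its commit, patches can be aggregated either by file attributes or back up to commit level. The sketch below does both on hypothetical rows; the paths and values are made up for illustration, and whether a roll-up like this matches the Commits table exactly (for example for merge commits) is not stated in the source.

```python
import pandas as pd

# Hypothetical rows that follow the Patches schema (not real data).
patches = pd.DataFrame({
    "id": ["a1", "a1", "b2", "b2"],
    "newpath": ["src/odb.c", "tests/odb.c", "src/oid.c", "README.md"],
    "language": ["C", "C", "C", "Markdown"],
    "istest": [False, True, False, False],
    "loc_i": [10, 4, 5, 3],
    "loc_d": [2, 0, 1, 1],
})

# Inserted and deleted lines split by language and test/non-test files.
by_language = patches.groupby(["language", "istest"])[["loc_i", "loc_d"]].sum()

# Roll patches up to one row per commit.
per_commit = patches.groupby("id").agg(
    nfiles=("newpath", "count"),
    loc_i=("loc_i", "sum"),
    loc_d=("loc_d", "sum"),
)

print(by_language)
print(per_commit)
```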

Blames

Contains blame segments for patches (see the example on libgit2's GitHub page).

Column	Type	Description
id	string	The commit's SHA
author_email	string	The author's email address
author_time	datetime	The author signature's timestamp
ismerge	bool	Whether the commit has two or more parents (or is a squash)
newpath	string	The file's new path after applying the patch
istest	bool	Whether the file is a test file or not
blame_id	string	The commit's SHA where this commit has churned lines from
loc_d	int64	Number of churned lines
language	string	Programming language of the affected file
blame_author_email	string	The churned author's email address
blame_author_time	datetime	The author signature's timestamp of the churned commit
blame_ismerge	bool	Whether the churned commit was a merge commit or not
author_email_dedup	string	Author's deduplicated email address
author_name_dedup	string	Author's deduplicated name
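
One way to read the Blames table is as "who removed whose lines, and how old were they". The sketch below uses hypothetical rows; the derived columns `self_churn` and `churned_age_days` are names invented for this illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical rows that follow the Blames schema (not real data).
blames = pd.DataFrame({
    "id": ["c3", "c3", "d4"],
    "author_email": ["ann@example.com", "ann@example.com", "bob@example.com"],
    "author_time": pd.to_datetime(["2018-01-10", "2018-01-10", "2018-01-12"]),
    "blame_id": ["a1", "b2", "a1"],
    "blame_author_email": ["ann@example.com", "bob@example.com", "ann@example.com"],
    "blame_author_time": pd.to_datetime(["2018-01-01", "2018-01-05", "2018-01-01"]),
    "loc_d": [4, 2, 6],
})

# True when an author churned their own earlier lines.
blames["self_churn"] = blames["author_email"] == blames["blame_author_email"]

# Age in days of the churned lines at the time they were removed.
blames["churned_age_days"] = (
    blames["author_time"] - blames["blame_author_time"]
).dt.days

# Churned lines per (churning author, churned author) pair.
pairs = blames.groupby(["author_email", "blame_author_email"])["loc_d"].sum()

# Total churned lines, split into self churn and churn of others' code.
self_vs_other = blames.groupby("self_churn")["loc_d"].sum()

print(pairs)
print(self_vs_other)
```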

Tags

Both lightweight and annotated tags.

Column	Type	Description
id	string	The tag's SHA
name	string	The tag's name (in case of annotated)
message	string	The tag's message (in case of annotated)
type	int64	Git object type (mostly commit)
author_time	datetime	The timestamp of the tag's creation

More detailed documentation is on the way.

Python example

Install Dependencies

Reading Parquet files requires PyArrow.

conda install -y pandas pyarrow

Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /Users/krisz/.pyenv/versions/miniconda3-latest/envs/ml3:

The following packages will be UPDATED:

    pandas: 0.21.0-py36_0 conda-forge --> 0.22.0-py36_0 conda-forge

pandas-0.22.0- 100% |################################| Time: 0:00:04   2.30 MB/s

Download Files

Currently there are two formats available: Parquet and JSON.

wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz

--2018-01-15 11:09:00--  https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1483345 (1.4M) [binary/octet-stream]
Saving to: ‘commits.parquet’

commits.parquet     100%[===================>]   1.41M   616KB/s    in 2.4s

2018-01-15 11:09:03 (616 KB/s) - ‘commits.parquet’ saved [1483345/1483345]

--2018-01-15 11:09:04--  https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1642853 (1.6M) [application/json]
Saving to: ‘commits.json.gz’

commits.json.gz     100%[===================>]   1.57M   903KB/s    in 1.8s

2018-01-15 11:09:06 (903 KB/s) - ‘commits.json.gz’ saved [1642853/1642853]

Reading Parquet file

import pandas as pd

commits = pd.read_parquet('./commits.parquet')

commits[['id', 'message']].head()

id	message
c15648cbd059b92c177586ab1701a167222c7681	Initial draft of libgit2\n\nSigned-off-by: Sha...
44181c23ea6c39d51a4b481dc59ecf2cc3967e76	Mark git_oid parameters const when they should...
46d8b885bd65158e8cb53266ba4b627b5991bce8	Rename git_odb_sread to just git_odb_read\n\nM...
171aaf21d9f7582270c390962f61d3d2613c4d59	Hide GIT_{BEGIN,END}_DECL from doxygen as its ...
b51eb250ed0cbda59d3108d04569fab9413909fd	Cleanup git_odb documentation formatting\n\nSi...

Reading JSON.gz file

import pandas as pd

commits = pd.read_json('./commits.json.gz', compression='infer')

commits[['id', 'message']].head()

id	message
c15648cbd059b92c177586ab1701a167222c7681	Initial draft of libgit2\n\nSigned-off-by: Sha...
44181c23ea6c39d51a4b481dc59ecf2cc3967e76	Mark git_oid parameters const when they should...
46d8b885bd65158e8cb53266ba4b627b5991bce8	Rename git_odb_sread to just git_odb_read\n\nM...
171aaf21d9f7582270c390962f61d3d2613c4d59	Hide GIT_{BEGIN,END}_DECL from doxygen as its ...
b51eb250ed0cbda59d3108d04569fab9413909fd	Cleanup git_odb documentation formatting\n\nSi...
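
To try the JSON.gz path without downloading anything, the following sketch writes a tiny gzip-compressed JSON file and reads it back. The two-row frame and the file name `sample.json.gz` are hypothetical, and the sketch uses pandas' default JSON layout, which may differ from the layout of the published files.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row sample; the real file has the full Commits schema.
sample = pd.DataFrame({
    "id": ["a1", "b2"],
    "message": ["first commit", "second commit"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    # Write gzip-compressed JSON so there is something to read back.
    sample.to_json(path, compression="gzip")
    # compression="infer" selects gzip based on the ".gz" suffix.
    restored = pd.read_json(path, compression="infer")

print(restored[["id", "message"]])
```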
