
GitHub - gitential/datasets: Datasets for popular Open Source projects

 6 years ago
source link: https://github.com/gitential/datasets

Gitential Datasets for Open Source Projects

TL;DR: Raw Git data extracted via libgit2 for a couple of open source projects.

Each directory contains sample CSV files and a README with the download links (Parquet and JSON formats).

We intend to publish further data (a lot, actually) with additional metrics and dimensions.

Please feel free to request datasets for other repositories and/or projects in the issues!

Schema

Commits

Column Type Description
id string The commit's SHA
delay int64 Seconds elapsed between the creation and last application of the commit (rebases can cause negative values)
age int64 Shortest interval between the commit and its parents
ismerge bool Whether the commit has two or more parents (or is a squash)
squashof int64 Whether it is a squash and merge commit (currently parsed from commit message)
author_name string The author's name
author_email string The author's email address
committer_name string The committer's name
committer_email string The committer's email address
author_time datetime The author signature's timestamp
committer_time datetime The committer signature's timestamp
loc_d int64 Number of lines deleted in this commit
loc_i int64 Number of lines inserted in this commit
comp_d int64 Whitespace complexity deleted in this commit
comp_i int64 Whitespace complexity inserted in this commit
nfiles int64 Number of files (patches) affected by this commit
message string The (nice and shiny and fixless :) commit messages
ndiffs int64 Number of diffs and parent commits
author_email_dedup string Author's deduplicated email address
author_name_dedup string Author's deduplicated name
committer_email_dedup string Committer's deduplicated email address
committer_name_dedup string Committer's deduplicated name
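
As an illustration of how the commit columns can be combined, here is a minimal sketch that totals inserted and deleted lines per deduplicated author while skipping merge commits. The function name `lines_changed_by_author` and the sample rows are made up for this example; only the column names come from the schema above.

```python
import pandas as pd


def lines_changed_by_author(commits: pd.DataFrame) -> pd.DataFrame:
    """Sum loc_i and loc_d per deduplicated author email, ignoring merges."""
    # ismerge is a bool column, so ~ selects the non-merge commits.
    non_merge = commits[~commits["ismerge"]]
    return (
        non_merge.groupby("author_email_dedup")[["loc_i", "loc_d"]]
        .sum()
        .sort_values("loc_i", ascending=False)
    )


# Hypothetical rows that only mimic the schema; not taken from a real dataset.
sample_commits = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "ismerge": [False, False, True, False],
    "author_email_dedup": [
        "ann@example.com", "bob@example.com",
        "ann@example.com", "ann@example.com",
    ],
    "loc_i": [10, 5, 100, 3],
    "loc_d": [2, 1, 50, 4],
})

author_totals = lines_changed_by_author(sample_commits)
print(author_totals)
```

With a downloaded file, the same function should work on the frame returned by `pd.read_parquet('./commits.parquet')`, assuming the columns match the table above.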

Patches

These are the individual files touched by commits. Patches are generated by diffing two revisions.

Column Type Description
id string The commit the patch belongs to
parent_id string The parent commit which the diff was generated against
oldpath string The file's old path before applying the patch
newpath string The file's new path after applying the patch
ismerge bool Whether the commit has two or more parents (or is a squash)
status string What kind of modification happened with the file (added / deleted / modified / etc)
author_time datetime The author signature's timestamp
oldsize int64 The file's size in bytes before applying the patch
newsize int64 The file's size in bytes after applying the patch
language string Programming language of this patch
langtype string Language types given by Github's linguist
skipped string Whether the patch generation has been skipped or NOT (otherwise the reason)
istest bool Whether the file is a test file or not
loc_d int64 Number of lines deleted in this patch
loc_i int64 Number of lines inserted in this patch
comp_d int64 Whitespace complexity deleted in this patch
comp_i int64 Whitespace complexity inserted in this patch
loc_d_std float32 Deleted number of lines deviation in the hunks
loc_i_std float32 Inserted number of lines deviation in the hunks
comp_d_std float32 Deleted complexity deviation in the hunks
comp_i_std float32 Inserted complexity deviation in the hunks
nhunks int64 Number of hunks in this patch
nblames int64 Number of unique commits this patch has churned lines from
blame_loc int64 Number of lines this patch has churned (deleted)
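
Because every patch row carries `language` and `istest`, a per-language split between test and non-test code is one way to use this table. The sketch below is only an illustration: the helper name `inserted_lines_by_language`, the `kind` label, and the sample rows are assumptions, not part of the dataset.

```python
import pandas as pd


def inserted_lines_by_language(patches: pd.DataFrame) -> pd.DataFrame:
    """Total loc_i per language, split into test and non-test files."""
    # Turn the istest flag into a readable label to use as column names.
    kind = patches["istest"].map({True: "test", False: "non_test"}).rename("kind")
    return (
        patches.groupby(["language", kind])["loc_i"]
        .sum()
        .unstack(fill_value=0)  # languages without test files get 0
    )


# Hypothetical rows that only mimic the schema; not taken from a real dataset.
sample_patches = pd.DataFrame({
    "id": ["a1", "a1", "b2", "c3"],
    "newpath": ["src/odb.c", "tests/odb.c", "src/oid.c", "script/gen.py"],
    "language": ["C", "C", "C", "Python"],
    "istest": [False, True, False, False],
    "loc_i": [10, 7, 5, 3],
})

by_language = inserted_lines_by_language(sample_patches)
print(by_language)
```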

Blames

Contains blame segments for patches, an example from libgit2's github page.

Column Type Description
id string The commit's SHA
author_email string The author's email address
author_time datetime The author signature's timestamp
ismerge bool Whether the commit has two or more parents (or is a squash)
newpath string The file's new path after applying the patch
istest bool Whether the file is a test file or not
blame_id string The commit's SHA where this commit has churned lines from
loc_d int64 Number of churned lines
language string Programming language of the affected file
blame_author_email string The churned author's email address
blame_author_time datetime The author signature's timestamp of the churned commit
blame_ismerge bool Whether the churned commit was a merge commit or not
author_email_dedup string Author's deduplicated email address
author_name_dedup string Author's deduplicated name
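
One possible use of the blame rows is a matrix of who churned whose lines. The sketch below assumes that `author_email` is the author of the commit removing lines and `blame_author_email` is the author of the lines being removed, which is how the column descriptions above read. The function name `churn_between_authors` and the sample rows are hypothetical.

```python
import pandas as pd


def churn_between_authors(blames: pd.DataFrame) -> pd.DataFrame:
    """Rows: author doing the churning. Columns: author whose lines were churned."""
    return (
        blames.groupby(["author_email", "blame_author_email"])["loc_d"]
        .sum()
        .unstack(fill_value=0)  # pairs that never occur get 0
    )


# Hypothetical rows that only mimic the schema; not taken from a real dataset.
sample_blames = pd.DataFrame({
    "id": ["c3", "c3", "d4", "e5"],
    "author_email": [
        "ann@example.com", "ann@example.com",
        "bob@example.com", "ann@example.com",
    ],
    "blame_id": ["a1", "b2", "a1", "b2"],
    "blame_author_email": [
        "ann@example.com", "bob@example.com",
        "ann@example.com", "bob@example.com",
    ],
    "loc_d": [4, 6, 2, 1],
})

churn = churn_between_authors(sample_blames)

# Lines an author removed from their own earlier commits.
is_self = sample_blames["author_email"] == sample_blames["blame_author_email"]
self_churn = int(sample_blames.loc[is_self, "loc_d"].sum())

print(churn)
print(self_churn)
```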

Tags

Both lightweight and annotated tags.

Column Type Description
id string The tag's SHA
name string The tag's name (in case of annotated)
message string The tag's message (in case of annotated)
type int64 Git object type (mostly commit)
author_time datetime The timestamp of the tag's creation

More detailed documentation is on the way.

Python example

Install Dependencies

Reading Parquet files requires PyArrow

conda install -y pandas pyarrow
Fetching package metadata ...............
Solving package specifications: .

Package plan for installation in environment /Users/krisz/.pyenv/versions/miniconda3-latest/envs/ml3:

The following packages will be UPDATED:

    pandas: 0.21.0-py36_0 conda-forge --> 0.22.0-py36_0 conda-forge

pandas-0.22.0- 100% |################################| Time: 0:00:04   2.30 MB/s

Download Files

Currently there are two formats available: Parquet and JSON.

wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
wget https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz
--2018-01-15 11:09:00--  https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.parquet
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1483345 (1.4M) [binary/octet-stream]
Saving to: ‘commits.parquet’

commits.parquet     100%[===================>]   1.41M   616KB/s    in 2.4s

2018-01-15 11:09:03 (616 KB/s) - ‘commits.parquet’ saved [1483345/1483345]

--2018-01-15 11:09:04--  https://s3.amazonaws.com/gitential-com-datasets/libgit2-libgit2/commits.json.gz
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.216.100.149
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.216.100.149|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1642853 (1.6M) [application/json]
Saving to: ‘commits.json.gz’

commits.json.gz     100%[===================>]   1.57M   903KB/s    in 1.8s

2018-01-15 11:09:06 (903 KB/s) - ‘commits.json.gz’ saved [1642853/1642853]

Reading Parquet file

import pandas as pd

commits = pd.read_parquet('./commits.parquet')
commits[['id', 'message']].head()
id message
c15648cbd059b92c177586ab1701a167222c7681 Initial draft of libgit2\n\nSigned-off-by: Sha...
44181c23ea6c39d51a4b481dc59ecf2cc3967e76 Mark git_oid parameters const when they should...
46d8b885bd65158e8cb53266ba4b627b5991bce8 Rename git_odb_sread to just git_odb_read\n\nM...
171aaf21d9f7582270c390962f61d3d2613c4d59 Hide GIT_{BEGIN,END}_DECL from doxygen as its ...
b51eb250ed0cbda59d3108d04569fab9413909fd Cleanup git_odb documentation formatting\n\nSi...

Reading JSON.gz file

import pandas as pd

commits = pd.read_json('./commits.json.gz', compression='infer')
commits[['id', 'message']].head()
id message
c15648cbd059b92c177586ab1701a167222c7681 Initial draft of libgit2\n\nSigned-off-by: Sha...
44181c23ea6c39d51a4b481dc59ecf2cc3967e76 Mark git_oid parameters const when they should...
46d8b885bd65158e8cb53266ba4b627b5991bce8 Rename git_odb_sread to just git_odb_read\n\nM...
171aaf21d9f7582270c390962f61d3d2613c4d59 Hide GIT_{BEGIN,END}_DECL from doxygen as its ...
b51eb250ed0cbda59d3108d04569fab9413909fd Cleanup git_odb documentation formatting\n\nSi...
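
The `compression='infer'` argument makes pandas choose the decompression method from the file extension, so `.gz` selects gzip. The following self-contained sketch shows that behaviour on a tiny made-up frame written to a temporary file. The file name `sample.json.gz` and the rows are assumptions for illustration, and the published `commits.json.gz` may be laid out differently.

```python
import os
import tempfile

import pandas as pd

# Hypothetical two-row frame standing in for the commits data.
frame = pd.DataFrame({
    "id": ["a1", "b2"],
    "message": ["first commit", "second commit"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.json.gz")
    # to_json also infers compression from the path, so this writes gzip.
    frame.to_json(path)
    with open(path, "rb") as fh:
        magic = fh.read(2)  # gzip streams start with the bytes 1f 8b
    restored = pd.read_json(path, compression="infer")

print(magic == b"\x1f\x8b")
print(restored[["id", "message"]].head())
```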
