
The Git project selects SHA-256 as the next-gen hash function to migrate to

source link: https://www.tuicool.com/articles/hit/quAZnqr

Git Rev News: Edition 42 (August 22nd, 2018)

Welcome to the 42nd edition of Git Rev News, a digest of all things Git. For our goals, the archives, the way we work, and how to contribute or to subscribe, see the Git Rev News page on git.github.io.

This edition covers what happened during the month of July 2018.

Discussions

General

Last month’s edition discussed the state of NewHash work, i.e. the process of selecting Git’s next-generation hash function. This discussion has concluded with the selection of SHA-256. An update to hash-function-transition.txt to change NewHash to SHA-256 is queued in the next branch.
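
To make the role of the hash function concrete, here is a minimal sketch of how a blob's object ID is derived: Git hashes a short header ("blob", the content size, a NUL byte) followed by the content. The helper name git_blob_id is made up for this illustration, and the sketch assumes that a SHA-256 repository keeps the same header layout and only swaps the hash function.

```python
import hashlib


def git_blob_id(content: bytes, algo: str = "sha1") -> str:
    """Hash content the way Git hashes a blob: b"blob <size>\\0" + content.

    git_blob_id is a hypothetical helper for illustration only.
    """
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.new(algo, header + content).hexdigest()


if __name__ == "__main__":
    data = b"hello\n"
    # SHA-1 yields a 40-character hex ID, SHA-256 a 64-character one.
    print("sha1  :", git_blob_id(data, "sha1"))
    print("sha256:", git_blob_id(data, "sha256"))
```

The longer IDs are one reason the transition touches so much of Git: any code or file format that assumes a 20-byte object name has to be revisited.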

Support

Paweł Paruzel reported that he found some test files in his repo appeared modified just after a clone because he had files like “boolStyle_t_f” and “boolStyle_T_F” that differ only in case and was cloning on a case-insensitive Mac.

He suggested having git clone throw an exception when files that differ only in case are cloned to a case-insensitive system.

Brian Carlson replied that this would make it impossible to clone such a repository on a case-insensitive system while the current behavior might still result in a functional repo.

Brian also suggested using something like test $(git status --porcelain | wc -l) -eq 0 to check that a repo is unmodified after clone.
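
For scripted use, the same check can be sketched in Python. The function names porcelain_is_clean and repo_is_clean are hypothetical; the only Git interface assumed is git status --porcelain, which prints one line per changed or untracked path and nothing for a clean working tree.

```python
import subprocess


def porcelain_is_clean(porcelain_output: str) -> bool:
    """Return True if the `git status --porcelain` output lists no paths."""
    return not any(line.strip() for line in porcelain_output.splitlines())


def repo_is_clean(repo_dir: str) -> bool:
    """Run `git status --porcelain` in repo_dir and check for an empty result.

    Raises subprocess.CalledProcessError if git exits with an error,
    for example when repo_dir is not a Git repository.
    """
    result = subprocess.run(
        ["git", "status", "--porcelain"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,
    )
    return porcelain_is_clean(result.stdout)
```

Calling repo_is_clean right after a clone would flag the situation Paweł ran into, since one of the two colliding files shows up as modified.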

Duy Nguyen agreed with Brian and proposed a patch that uses sparse checkout to hide all the paths that fail to check out because of the filesystem. Duy’s patch also emits a warning to tell the user what happened.

Jeff King, alias Peff, replied to Duy suggesting just warning and advising the user. And Duy followed up with a modified patch that does just that.

Simon Ruderich commented that the advice message in Duy’s patch should list the problematic file names to help users.

Peff agreed with Simon and wondered if it was better to detect at checkout time if a file already exists on the filesystem rather than checking before the checkout. Peff also noted that Duy’s patch used strcasecmp() to check if filenames differ only by case, but in some cases, especially with UTF-8 names, a filesystem could use complex folding rules which we would need to follow.

Brian replied to Peff saying that it was indeed possible to detect the issue at checkout time, and Duy replied that it was actually what his patch was doing.

Duy, Peff and Jeff Hostetler then agreed that it would be difficult to follow complex folding rules that a filesystem might use.
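
To illustrate the kind of check being discussed, here is a rough sketch that groups paths by a folded key and reports groups with more than one member. This is not what Duy's patch does. The names fold_key and find_case_collisions are invented for this example, and the folding used here (Unicode NFC normalization plus Python's str.casefold()) is only an approximation, since each filesystem applies its own rules.

```python
import unicodedata
from collections import defaultdict


def fold_key(path: str) -> str:
    """Approximate a case-insensitive filesystem's view of a path.

    Assumption: NFC normalization followed by Unicode case folding.
    Real filesystems may fold differently.
    """
    return unicodedata.normalize("NFC", path).casefold()


def find_case_collisions(paths):
    """Return sorted groups of paths that share the same folded key."""
    groups = defaultdict(list)
    for path in paths:
        groups[fold_key(path)].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    print(find_case_collisions(["boolStyle_t_f", "boolStyle_T_F", "README"]))
    # [['boolStyle_T_F', 'boolStyle_t_f']]
```

A check like this runs before anything touches the disk, which is exactly where it can disagree with the filesystem; detecting an already existing file at checkout time sidesteps that guesswork.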

Duy then started sending a real patch in its own email.

Junio Hamano chimed in to suggest a different implementation and a long discussion thread involving Torsten Bögershausen, Elijah Newren, Duy, Junio, Peff and Jeff Hostetler followed about how to best find all the colliding paths.

Duy sent a version 2 of his patch.

The previous long discussion thread continued following this patch though.

Duy sent a version 3 that actually tries to find all the colliding paths on “UNIXy platforms”.

Szeder Gábor found small issues in the patch, so Duy sent a version 4.

Comments from Torsten started a discussion about clarifying the documentation of the core.checkStat config option which was addressed by a separate patch from Duy and Junio.

Duy then recently sent a version 5 which tries to find all the colliding paths on Windows too, and a version 6 to address a few more comments from Junio and Torsten.

It looks like the latest version will be merged to the “next” branch soon.

Developer Spotlight: Derrick Stolee

  • Who are you and what do you do?

    I’m a software engineer at Microsoft working on the version control client team. My team includes the Git contributors from Microsoft as well as most of the developers for VFS for Git (formerly GVFS). We also work on other version control clients, such as Team Explorer for Visual Studio.

    I joined this team after a couple years on the Git Server team for Visual Studio Team Services (VSTS), where I worked generally on performance features, such as the history algorithm, reachability bitmaps, and other scale issues. While I was on the team, we onboarded the Windows development group to Git, which was a very exciting time to be part of the team. After they were using VSTS, the place that needed the most performance improvement was in the client experience, so I switched teams.

    Before Microsoft, I was an academic. I got my Ph.D. in Mathematics and Computer Science from the University of Nebraska and was an assistant professor for a few years, working in computational graph theory and combinatorics. I found that being a faculty member was not nearly as much fun as being a graduate student, and I couldn’t find enough time to write code for my computational experiments. Luckily, I was able to find a role at Microsoft that could use some of those skills.

  • What would you name your most important contribution to Git?

    My most important contribution to Git has been the commit-graph feature. This feature enables significant performance boosts for Git repos of almost every size. I built a similar feature while rewriting the history algorithm for VSTS, and I changed teams from server to client so I could contribute this feature to core Git.

  • What are you doing on the Git project these days, and why?

    In addition to working on more fallout from the commit-graph feature (including speeding up git log --graph), I’ve been working on the multi-pack-index feature. This allows indexing a list of objects across several packfiles. I’m the first to admit that this feature is not as necessary as the commit-graph until you get to incredibly large repositories. When git repack stops being a realistic option, then the multi-pack-index can keep your lookup times sane. I hope that there are some more future integrations that are beneficial to other repos, such as tying the reachability bitmap to the multi-pack-index instead of a single packfile.

  • If you could get a team of expert developers to work full time on something in Git for a full year, what would it be?

    THE INDEX! The .git/index file is a linear list of every path available in the working directory, even if those files are not checked out (due to a sparse-checkout). This is one of the most central features to all of Git, and is used in many places with a very transparent API.

    The index is the single biggest bottleneck left to tackle for super-large repos like the Windows repo. These enormous repos are too large for most developers to need the entire thing, so they work in a small cone. VFS for Git virtualizes the files until they are read or written, and we use the sparse-checkout feature to limit Git’s interaction to be in that cone. However, we still need to read and write the entire index file whenever we interact with the staging area. If the index was hierarchical or could prune the list for entire subtrees, then we could drastically reduce this cost.

    The biggest problem with this direction is that it requires refactoring almost all of the consumers to use a new API that is not as coupled to the layout of the index. Only after that change happens can we drastically replace the file format. It’s a bit of a chicken-or-the-egg problem, because we can’t refactor the API unless we know the behavior the new format can present, but we can’t test format options until we refactor the API.

    The task is pretty daunting, but I think someone could get there with enough focus and determination.

  • If you could remove something from Git without worrying about backwards compatibility, what would it be?

    I would rebuild revision.c from scratch. I’m going to need to do a lot of replacement to make git log --graph use generation numbers, but it would be easier if I could start over and add features one at a time.

    This is probably a boring answer, but I have found that every single command-line option is someone’s favorite feature, so I don’t want to take that away from anyone. One of Git’s strengths is that it is so flexible for different use cases and different repos.

  • What is your favorite Git-related tool/library, outside of Git itself?

    It’s rather new, but I’ve been enjoying using GitGitGadget to send patches to the mailing list. Too often I make a mistake sending patches upstream, or lose track of which commit I used in v2, and things like that. Starting from a GitHub pull request (that also ran builds on the code for Windows, Mac, & Linux) is much easier (to me) than going through the hoops of git format-patch and git send-email. I think that getting started submitting patches via email is one of the biggest barriers to entry for our community, and I really believe we are losing quality contributors due to that friction. Hopefully, GitGitGadget can be one way that we can attract and retain more contributors.
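
As a side note on the generation numbers Derrick mentions in connection with git log --graph, here is a small sketch of the idea, not Git's actual implementation. It assumes the definition from Git's commit-graph documentation: a commit without parents has generation number 1, and any other commit has one more than the largest generation number among its parents. The function name and the dictionary representation of history are made up for this illustration.

```python
from typing import Dict, List


def generation_numbers(parents: Dict[str, List[str]]) -> Dict[str, int]:
    """Compute generation numbers for a commit DAG.

    parents maps each commit name to the list of its parent commits.
    Uses an explicit stack instead of recursion so long histories
    do not hit Python's recursion limit.
    """
    gen: Dict[str, int] = {}
    for start in parents:
        stack = [start]
        while stack:
            commit = stack[-1]
            if commit in gen:
                stack.pop()
                continue
            missing = [p for p in parents[commit] if p not in gen]
            if missing:
                # Parents must be numbered before the commit itself.
                stack.extend(missing)
            else:
                gen[commit] = 1 + max((gen[p] for p in parents[commit]), default=0)
                stack.pop()
    return gen


if __name__ == "__main__":
    # A is the root, B and C branch from A, D merges B and C.
    history = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    print(generation_numbers(history))
```

The useful property is that a commit can only reach commits with a strictly smaller generation number, so a history walk can stop early once every remaining candidate has a number that is too low.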

Releases

Other News

Various

Light reading

Git tools and sites

  • Web applications for sending patches via email:
    • GitGitGadget (homepage) is a bot (gadget) that serves as glue between GitHub Pull Requests and the Git mailing list, making it possible to submit patch series and to iterate on them.
    • submitGit is an older Heroku app to send GitHub Pull Requests to the Git mailing list, correctly formatting the patches. Its creation was extensively covered in Git Rev News Edition #4.
  • Version control for Data Science:
  • lazygit is a simple [windowed] terminal UI for git commands, written in Go with the gocui library (compare tig, an ncurses-based text-mode interface for Git).
  • Gitrob: Putting the Open Source in OSINT is a tool to help find potentially sensitive files pushed to public repositories on GitHub.
  • git-quick-stats is a simple and efficient way to access various statistics in a Git repository (with a simple homepage).
  • hofs-churn is a small bash script to approximate code churn for a Git repo as described in Brikman’s article “The 10:1 rule of writing and programming”.
  • Atlas, by O’Reilly, is a tool/site for publishing books (which can be written in AsciiDoc, with layout and structure defined in HTML and CSS files), based on Git.
  • Keep a Changelog, a site with a proposed changelog format and a motto: “Don’t let your friends dump git logs into changelogs”.

Credits

This edition of Git Rev News was curated by Christian Couder < [email protected] >, Jakub Narębski < [email protected] >, Markus Jansen < [email protected] > and Gabriel Alcaras < [email protected] > with help from Derrick Stolee and Ævar Arnfjörð Bjarmason.

