Mark Jason Dominus: Notes on using git-replace to get rid of giant objects
source link: https://www.tuicool.com/articles/hit/bqYF7nb
Mon, 08 Oct 2018
Notes on using git-replace to get rid of giant objects
A couple of years ago someone accidentally committed a 350 megabyte
file to our Git repository. Now it's baked in. I wanted to get rid
of it. I thought that I might be able to work out a partial but
lightweight solution using git-replace.
Summary: It didn't work.
Details
In 2016 a programmer committed a 350 megabyte file to my employer's repo, then removed it again in the following commit. Of course it's still in there, because someone might check out the one commit where it existed. Everyone who clones the repo gets a copy of the big file. Every copy of the repo takes up an extra 350 megabytes on disk.
The usual way to fix this is onerous:
- Use git-filter-branch to rebuild all the repository history after the bad commit.
- Update all the existing refs to point to the analogous rebuilt objects.
- Get everyone in the company to update all the refs in their local copies of the repo.
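The first two steps above can be sketched on a throwaway repository. This is only an illustration under stated assumptions: the file names big.bin and README, the commit messages, and the temporary directory are all made up for the demo, and --index-filter with git rm --cached is one common way to drive git-filter-branch, not necessarily the one anybody at the company would have used. The third step, getting everyone to update their clones, is a human problem and is not shown.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hypothetical history: a big file is committed by mistake, then removed.
printf 'hello\n' > README
printf 'pretend this is 350 MB of data\n' > big.bin
git add README big.bin
git commit -q -m "Add README and, by mistake, a big file"
git rm -q big.bin
git commit -q -m "Remove big file"
old_head=$(git rev-parse HEAD)

# Rewrite every commit on every ref, dropping big.bin from each tree.
# The environment variable only suppresses filter-branch's startup warning.
FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch -f \
    --index-filter 'git rm -q --cached --ignore-unmatch big.bin' \
    -- --all >/dev/null

new_head=$(git rev-parse HEAD)

# Count paths named big.bin still reachable from the rewritten branch.
leftover=$(git rev-list --objects HEAD | grep -c 'big\.bin' || true)
echo "old HEAD $old_head, new HEAD $new_head, leftover paths: $leftover"
```

Every commit from the bad one onward gets a new ID, which is exactly why all refs, and all clones, then have to be updated.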
I thought I'd tinker around with git-replace to see if there was
some way around this, maybe something that someone could do locally on
their own repo without requiring everyone else to go along with it.
The git-replace command annotates the Git repository to say that
whenever object A is wanted, object B should be used instead. Say
that the 350 MB file has an ID of ffff9999ffff9999ffff9999ffff9999ffff9999.
I can create a small file that says
    This is a replacement object. It replaces a very large file
    that was committed by mistake. To see the commit as it really
    was, use

        git --no-replace-objects show 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e
        git --no-replace-objects checkout 183a5c7e90b2d4f6183a5c7e90b2d4f6183a5c7e

    or similarly. To see the file itself, use

        git --no-replace-objects show ffff9999ffff9999ffff9999ffff9999ffff9999
I can turn this small file into an object with git-add; say the new
small object has ID 1111333311113333111133331111333311113333. I
then run:
git replace ffff9999ffff9999ffff9999ffff9999ffff9999 1111333311113333111133331111333311113333
This creates .git/refs/replace/ffff9999ffff9999ffff9999ffff9999ffff9999, which
contains the text 1111333311113333111133331111333311113333.
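Here is a minimal sketch of the same procedure on a throwaway repository, so the IDs are real instead of made up. The file name big.bin, the file contents, and the variable names are assumptions for the demo. It also creates the small object with git hash-object -w rather than git-add, which writes a blob directly without touching the index.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

# Commit a stand-in for the giant file, then remove it in the next commit.
printf 'pretend this is 350 MB of data\n' > big.bin
git add big.bin
git commit -q -m "Add big file by mistake"
git rm -q big.bin
git commit -q -m "Remove big file"

# IDs of the unwanted blob and of a small replacement blob.
big=$(git rev-parse HEAD~1:big.bin)
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)

git replace "$big" "$small"

# The replacement is recorded as a ref named after the original object.
recorded=$(git rev-parse "refs/replace/$big")

# Ordinary commands now see the small blob;
# --no-replace-objects still sees the original.
seen=$(git cat-file -p "$big")
real=$(git --no-replace-objects cat-file -p "$big")
printf 'with replacement:    %s\nwithout replacement: %s\n' "$seen" "$real"
```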
Thenceforward, any Git command that tries to access the original
object ffff9999 will silently behave as if it were 11113333
instead. For example, git show 183a5c7e will show the diff between
that commit and the previous, as if the user had committed my small
file back in 2016 instead of their large one. And checking out that
commit will check out the small file instead of the large one.
So far this doesn't help much. The checkout is smaller, but nobody was likely to have that commit checked out anyway. The large file is still in the repository, and clones and transfers still clone and transfer it.
The first thing I tried was a wan hope: will git gc discard the
replaced object? No, of course not. The ref in refs/replace/
counts as a reference to it, and it will never be garbage-collected.
If it had been, you would no longer be able to examine it with the
--no-replace-objects commands. So much for following the rules!
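The experiment is easy to repeat on a scratch repository. The sketch below uses a made-up file name (big.bin) and made-up contents, and it asks git gc to prune immediately with --prune=now so that the default grace period for unreachable objects cannot be what keeps the blob alive.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hypothetical history: big file added by mistake, then removed.
printf 'pretend this is 350 MB of data\n' > big.bin
git add big.bin
git commit -q -m "Add big file by mistake"
git rm -q big.bin
git commit -q -m "Remove big file"

big=$(git rev-parse HEAD~1:big.bin)
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# Collect garbage as aggressively as the normal rules allow.
git gc -q --prune=now

# Check whether the original blob is still present when replacements are off.
if git --no-replace-objects cat-file -e "$big"; then
    survived=yes
else
    survived=no
fi
echo "original blob survived gc: $survived"
```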
Now comes the hacking part: I am going to destroy the actual object. For example, what if I do this:
cp /dev/null .git/objects/ff/ff9999ffff9999ffff9999ffff9999ffff9999
Now the repository is smaller! And maybe Git won't notice, as long as
I do not use --no-replace-objects?
Indeed, much normal Git usage doesn't notice. For example, I can make
new commits with no trouble, and of course any other operation that
doesn't go back as far as 2016 doesn't notice the change. And git-log
works just fine even past the bad commit; it only looks at
the replacement object and never notices that the bad object is
missing.
But some things become wonky. You get an error message when you clone
the repo because an object is missing. The replacement refs are local
to the repo, and don't get cloned, so clone doesn't know to use the
replacement object anyway. In the clone, you can use git replace -f ....
to reinstate the replacement, and then all is well unless
something tries to look at the missing object. So maybe a user could
apply this hack on their own local copy if they are willing to
tolerate a little wonkiness…?
No. Unfortunately, there is a show-stopper: git-gc no longer
works in either the parent repo or in the clone:
fatal: unable to read ffff9999ffff9999ffff9999ffff9999ffff9999
error: failed to run repack
and it doesn't create the pack files. It dies, and leaves behind a
.git/objects/pack/tmp_pack_XxXxXx that has to be cleaned up by hand.
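The whole hack, including the failure, can be reproduced in a scratch repository. This is a sketch under assumptions: big.bin and its contents are invented, and the object is assumed to be stored loose (true for a freshly committed file, not necessarily for a two-year-old one, which would more likely be inside a pack). Loose object files are normally read-only, so the sketch makes the file writable before truncating it. The exact error text may vary between Git versions, so only the exit status of git gc is recorded. Don't try this on a repository you care about.

```shell
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.name "Demo User"
git config user.email "demo@example.com"

# Hypothetical history: big file added by mistake, then removed.
printf 'pretend this is 350 MB of data\n' > big.bin
git add big.bin
git commit -q -m "Add big file by mistake"
git rm -q big.bin
git commit -q -m "Remove big file"

big=$(git rev-parse HEAD~1:big.bin)
small=$(printf 'This is a replacement object.\n' | git hash-object -w --stdin)
git replace "$big" "$small"

# Locate the loose object file for the big blob and truncate it.
obj=".git/objects/$(printf '%s' "$big" | cut -c1-2)/$(printf '%s' "$big" | cut -c3-)"
chmod u+w "$obj"
cp /dev/null "$obj"

# git log only reads commit objects, so it still walks the whole history.
commits=$(git log --oneline | wc -l | tr -d ' ')

# git gc has to read every reachable object to repack it.
if git gc -q 2>/dev/null; then
    gc_result=succeeded
else
    gc_result=failed
fi
echo "commits visible to git log: $commits; git gc $gc_result"
```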
I think I've reached the end of this road. Oh well, it was worth a look.