Some Git repositories at the company contain mostly binary files (Word documents, Excel spreadsheets, PDFs, etc.). Because Git is not designed to track binary files efficiently, such a repository eventually grows pretty large (over 2 GB) and becomes a PITA to `git clone`.
To solve this effectively, you can convert the regular Git repository to Git LFS. This post aims to show you how to do it.
Prerequisites
- The remote Git server MUST support Git LFS (GitHub, GitLab and Bitbucket all support LFS).
- `git` (>=2.27.0) and `git lfs` (>=2.11.0) MUST be installed.
- A Bash 5.0 shell. (On Windows, Git Bash may work, but WSL 2 is recommended.)
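If you want to check these prerequisites before starting, a small script along these lines may help. It is only a sketch: the helper names `version_ge` and `report_version` are made up for this post, and it assumes your `sort` supports the `-V` (version sort) flag, which GNU sort provides (other implementations may not).

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical, and `sort -V` is assumed to exist.

# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Prints whether a tool meets its minimum version.
# Arguments: display name, required version, detected version (may be empty).
report_version() {
  local name="$1" required="$2" actual="$3"
  if [ -z "$actual" ]; then
    echo "$name: not found (need >= $required)"
  elif version_ge "$actual" "$required"; then
    echo "$name: $actual OK (need >= $required)"
  else
    echo "$name: $actual is too old (need >= $required)"
  fi
}

# Detect the installed versions; a missing tool simply yields an empty string.
git_version="$(git --version 2>/dev/null | awk '{ print $3 }')"
lfs_version="$(git lfs version 2>/dev/null | sed -E 's|^git-lfs/([0-9.]+).*|\1|')"

report_version "git" "2.27.0" "$git_version"
report_version "git lfs" "2.11.0" "$lfs_version"
report_version "bash" "5.0" "${BASH_VERSION%%[^0-9.]*}"
```

The script only reports; it never exits with an error, so you can paste it into a terminal safely.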
Steps
NOTE: This post uses Bitbucket as the remote Git server.
Clone a bare repository
To rewrite the entire history of a Git repository, you need to clone the whole repository in its bare form.
IMPORTANT
Make sure you make a backup of the cloned directory.

    $ git clone --mirror git@example.com:awesome/awesomely-heavy.git
    $ cd awesomely-heavy.git
    # Check the current size of the repository so that we can compare it again once the migration is done.
    $ git count-objects -vH
    # Just an example; your result may vary.
    count: 0
    size: 0 bytes
    in-pack: 1053
    packs: 1
    size-pack: 615.10 MiB
    prune-packable: 0
    garbage: 0
    size-garbage: 0 bytes

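One simple way to make that backup is to archive the mirror directory before touching it. The sketch below assumes `tar` with gzip support is available; `backup_repo` is a hypothetical helper name, and the archive naming is just one possible choice.

```shell
#!/usr/bin/env bash
# Sketch only: `backup_repo` is a made-up helper, not part of Git.

# Archives a directory into <dir>.backup.tar.gz next to it
# and prints the path of the archive on success.
backup_repo() {
  local repo_dir="${1%/}"
  local archive="${repo_dir}.backup.tar.gz"
  tar -czf "$archive" -C "$(dirname "$repo_dir")" "$(basename "$repo_dir")" || return 1
  echo "$archive"
}

# Back up the mirror clone from this post if it exists in the current directory.
if [ -d "awesomely-heavy.git" ]; then
  backup_repo "awesomely-heavy.git"
fi
```

If something goes wrong later, extracting the archive with `tar -xzf` gives you the untouched mirror back.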
Migrate binary files to be tracked by LFS (history rewrite alert)
IMPORTANT
The following process deliberately rewrites your entire Git history. All commit IDs will change.

    # Include any other file extensions that you want LFS to track.
    $ git lfs migrate import --everything --include="*.docx,*.pdf"
    migrate: Sorting commits: ..., done.
    migrate: Rewriting commits: 100% (xxx/yyy), done.
    # ...lots of refs..omitted
    migrate: Updating refs: ..., done

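To spot-check that the migration worked, you can look at a migrated file in the rewritten history: it should now be a small LFS pointer file rather than the binary itself. A pointer file starts with a `version https://git-lfs.github.com/spec/v1` line. The helper below (`is_lfs_pointer`, a made-up name) is a minimal sketch that only checks that first line, not the full pointer format.

```shell
#!/usr/bin/env bash
# Sketch only: `is_lfs_pointer` is a hypothetical helper that checks just the
# first line of the content; it does not validate the oid or size fields.

# Reads file content on stdin and succeeds when it starts like an LFS pointer.
is_lfs_pointer() {
  local prefix="version https://git-lfs.github.com/spec/v1"
  # Only the first bytes are read, so large binaries are not loaded in full.
  head -c "${#prefix}" | grep -qxF -- "$prefix"
}
```

For example, inside the mirror clone you could run `git show HEAD:docs/report.docx | is_lfs_pointer && echo "tracked by LFS"`, where `docs/report.docx` is a hypothetical path that you would replace with one of your own files.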
You may wonder, “How do I know which extensions to include?” The following script can help you.

    # The following script outputs the file extensions of the binary files in your Git repository.
    # It was tested on macOS with the aforementioned `git` version.
    $ git log --all --numstat \
      | grep '^-' \
      | cut -f3 \
      `# On Linux (or WSL), you may want to use sed instead of gsed` \
      | gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
      | sort -u `# Up to this point, the full path of every committed file has been printed` \
      | xargs -I{} sh -c 'x="{}";echo ${x##*.}' \
      | sort \
      | uniq \
      | awk '{ print "*."$1 }' \
      | paste -sd, -
    # Example output
    *.docx,*.jar,*.xlsx,*.xltm

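The `xargs ... sh -c` step in that script can trip over file names that contain quotes. As an alternative sketch for the last few stages, the function below (a hypothetical `paths_to_include_patterns`) reads file paths on stdin and prints the comma-separated pattern list. Unlike the original pipeline, it skips files that have no extension at all.

```shell
#!/usr/bin/env bash
# Sketch only: `paths_to_include_patterns` is a made-up helper name.

# Reads one file path per line on stdin and prints a comma-separated list of
# "*.ext" patterns, suitable for the --include option of `git lfs migrate import`.
paths_to_include_patterns() {
  awk -F/ '{ print $NF }' \
    | grep -E '\.[^.]+$' \
    | sed -E 's|.*\.([^.]+)$|*.\1|' \
    | sort -u \
    | paste -sd, -
}

# Demo with sample paths (the paths are invented for illustration).
printf '%s\n' 'docs/a.docx' 'b.pdf' 'lib/x.jar' 'docs/c.docx' 'Makefile' \
  | paths_to_include_patterns
# prints: *.docx,*.jar,*.pdf
```

In practice you would feed it the path list produced by the first half of the pipeline (everything up to `sort -u`) instead of the sample `printf`.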
Clean up the Git repository
At this point, your entire history should have been rewritten as new commits, and the old commits have effectively become orphans. By default, Git still keeps them in the `.git` directory until an explicit `prune` is invoked.

    # Trigger Git GC to run immediately to remove
    # all orphan commits.
    # This will effectively remove all binary files stored
    # in your .git/objects.
    $ git reflog expire --expire-unreachable=now --all
    $ git gc --prune=now

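If you are curious how many orphaned objects are waiting to be pruned, `git fsck --unreachable --no-reflogs` lists them. The sketch below wraps that in a hypothetical `count_unreachable` helper; running it before and after the GC should show the count drop. Keep in mind that `git fsck` can take a while on a large repository.

```shell
#!/usr/bin/env bash
# Sketch only: `count_unreachable` is a made-up helper around `git fsck`.

# Prints the number of objects that no ref can reach (reflogs are ignored).
# Argument: repository directory (defaults to the current directory).
count_unreachable() {
  git -C "${1:-.}" fsck --unreachable --no-reflogs 2>/dev/null \
    | grep -c '^unreachable' || true
}

# Report the count for the current directory when it is a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
  echo "unreachable objects: $(count_unreachable .)"
fi
```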
Verify Changes
Now verify that your Git repository has become smaller.

    # Print the space usage of the current Git repository.
    # Compared to the previous run, it should be way smaller.
    $ git count-objects -vH
    count: 0
    size: 0 bytes
    in-pack: 1054
    packs: 1
    size-pack: 199.19 KiB
    prune-packable: 0
    garbage: 0
    size-garbage: 0 bytes

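If you would rather compare the numbers than eyeball them, the following sketch extracts `size-pack` from `git count-objects -v` (without `-H`, so the value is in KiB) and computes how much the pack shrank. The function names are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: `size_pack_kib` and `shrink_percent` are hypothetical helpers.

# Reads `git count-objects -v` output on stdin and prints the size-pack value in KiB.
size_pack_kib() {
  awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Prints by what percentage the pack shrank.
# Arguments: size before (KiB), size after (KiB).
shrink_percent() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before + 0 <= 0) { print "0.0"; exit }
    printf "%.1f\n", (before - after) * 100 / before
  }'
}

# Print the current pack size when run inside a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
  echo "size-pack (KiB): $(git count-objects -v | size_pack_kib)"
fi
```

Record the value before the migration and again after the GC, then pass both to `shrink_percent`.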
If you wonder where all the files are now, the `du` command can tell you.

    # The -h flag prints human-readable sizes, as in the output below.
    $ du -h -d 1 .
    # In this example output, the files have moved from objects to lfs.
    232K ./objects
    8.0K ./info
    72K ./hooks
    0B ./refs
    1.0G ./lfs
    1.0G .

Push to remote repository
It’s time to push the new Git repository to your remote one.
IMPORTANT
It is critical that all `refs` are pushed to the remote repository.

    # The `--mirror` flag ensures that all `refs` are pushed.
    # @see https://git-scm.com/docs/git-push#Documentation/git-push.txt---mirror
    $ git push --mirror --force

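To double-check that every ref really made it to the remote, you can compare the local refs with what the remote advertises. This is a sketch under the assumption that you are inside the mirror clone and the remote is named `origin`; `refs_in_sync` is a hypothetical helper. Some providers hide or reject certain refs (pull request refs, for example), so a mismatch is a hint to investigate, not necessarily an error.

```shell
#!/usr/bin/env bash
# Sketch only: `refs_in_sync` is a made-up helper and assumes a mirror clone,
# where local ref names are identical to the remote ones.

# Succeeds when the local refs match the refs advertised by the remote.
# Arguments: repository directory (default "."), remote name (default "origin").
refs_in_sync() {
  local repo="${1:-.}" remote="${2:-origin}" local_refs remote_refs
  # `show-ref` separates hash and name with a space, `ls-remote` with a tab.
  local_refs="$(git -C "$repo" show-ref | tr ' ' '\t' | sort)"
  # --refs leaves out HEAD and peeled tag entries, which show-ref does not list.
  remote_refs="$(git -C "$repo" ls-remote --refs "$remote" | sort)"
  [ -n "$local_refs" ] && [ "$local_refs" = "$remote_refs" ]
}

# Report the result for the current directory when it is a Git repository.
if git rev-parse --git-dir >/dev/null 2>&1; then
  if refs_in_sync . origin; then
    echo "all refs match the remote"
  else
    echo "local and remote refs differ"
  fi
fi
```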
Successfully pushing the new Git LFS repository to the remote is not the end of the story.
Because Git is a distributed version control system, your provider also holds a copy of your entire Git repository. It’s important that your provider runs Git GC on its own infrastructure.
For Bitbucket, you have to open a Bitbucket Cloud Support ticket and ask them to run `git gc`. GitLab automatically runs `gc` on each push. I could not find any relevant information for GitHub (though GitHub Enterprise requires you to contact their Support).
References