grokking hard

code smarter, not harder

Convert Git to Git LFS

Posted at — 2020-Sep-16

Some Git repositories at the company contain mostly binary files (Word, Excel, PDF, etc.). As Git is not designed to track binary files efficiently, such a repository eventually grows pretty large (over 2GB) and becomes a PITA to git clone.

To solve this effectively, switch the repository from regular Git to Git LFS. This post aims to show you how to do it.

Prerequisites

Steps

NOTE: In this post, BitBucket is used as the remote Git server.

Clone a bare repository

In order to effectively rewrite the entire history of a Git repository, it is required to clone the entire repository in its bare form.

IMPORTANT

Make sure you make a backup of the cloned directory.

$ git clone --mirror git@example.com:awesome/awesomely-heavy.git

# Make a quick check to see the current size of the repository so that we can verify it again once it's done.

$ git count-objects -vH
# Just an example, your result may vary.
count: 0
size: 0 bytes
in-pack: 1053
packs: 1
size-pack: 615.10 MiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
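The backup mentioned above can be as simple as a recursive copy of the mirror clone. The following is a minimal sketch, not a prescribed procedure: `backup_repo` is a hypothetical helper name, and the `.backup` suffix is an assumption chosen for illustration.

```shell
# Hypothetical helper: copy a bare/mirror clone next to itself before
# rewriting history. Refuses to overwrite an existing backup.
backup_repo() {
    repo_dir="${1%/}"                 # strip a trailing slash, if any
    backup_dir="${repo_dir}.backup"   # assumed naming convention

    if [ -e "$backup_dir" ]; then
        echo "backup already exists: $backup_dir" >&2
        return 1
    fi

    # -R copies the whole directory tree, including packs and refs.
    cp -R "$repo_dir" "$backup_dir" && echo "$backup_dir"
}
```

For the clone above, this would be called as `backup_repo awesomely-heavy.git`. If anything goes wrong later, the copy can be pushed back with `git push --mirror` from inside the backup directory.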

Migrate binary files to be tracked by LFS (history rewrite alert)

IMPORTANT
The following process will purposely rewrite your entire Git history. All commit IDs will be changed.

# Include more file extensions which you want to track by LFS.
$ git lfs migrate import --everything --include="*.docx,*.pdf"

migrate: Sorting commits: ..., done.
migrate: Rewriting commits: 100% (xxx/yyy), done.
# ...lots of refs..omitted
migrate: Updating refs: ..., done

You may wonder “How do I know which extensions to include?”. The following script can help you.

# The following script will output the file extensions existing in your Git repository.
# It was tested on MacOS with the aforementioned `git` version.

$ git log --all --numstat \
    | grep '^-' \
    | cut -f3 \
    `# On your Linux (or WSL), you may want to use sed instead of gsed` \
    | gsed -r 's|(.*)\{(.*) => (.*)\}(.*)|\1\2\4\n\1\3\4|g' \
    | sort -u `# Up until this point, all full-path of files committed will be printed` \
    | xargs -I{} sh -c 'x="{}";echo ${x##*.}' \
    | sort \
    | uniq \
    | awk '{ print "*."$1 }' \
    | paste -sd, -

# example output
*.docx,*.jar,*.xlsx,*.xltm
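The last stages of the pipeline above (path to extension to pattern list) can also be done in a single awk pass, which avoids spawning one `sh` per file and copes with quotes in file names. This is a sketch under assumptions, not a drop-in replacement: `paths_to_lfs_patterns` is a hypothetical name, it expects one path per line on stdin, and it does not expand rename entries such as `{old => new}` the way the sed step does.

```shell
# Hypothetical helper: read file paths on stdin and print a
# comma-separated list of "*.ext" patterns, sorted and de-duplicated.
paths_to_lfs_patterns() {
    awk -F/ '
        {
            name = $NF                        # basename of the path
            n = split(name, parts, ".")
            if (n < 2) next                   # no extension (e.g. README)
            if (n == 2 && parts[1] == "") next  # dotfile (e.g. .gitignore)
            if (parts[n] == "") next          # name ends with a dot
            print "*." parts[n]
        }
    ' | sort -u | paste -sd, -
}
```

It can be fed from the first stages of the original pipeline, for example `git log --all --numstat | grep '^-' | cut -f3 | paths_to_lfs_patterns`. Git LFS also ships `git lfs migrate info --everything`, which reports the file types taking the most space and may be a quicker starting point.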

Clean up the Git repository

At this point, all of your history should have been rewritten as new commits. The old commits effectively become orphans. By default, Git still keeps them in the .git directory until an explicit prune is invoked.

# Trigger Git GC to run immediately to remove
# all orphan commits.
# This will effectively remove all binary files stored
# in your .git/objects.
$ git reflog expire --expire-unreachable=now --all
$ git gc --prune=now
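To check that the prune left nothing behind, the output of `git fsck --unreachable` can be counted. The sketch below assumes the usual `unreachable <type> <hash>` line format of that command; `count_unreachable` is a hypothetical helper name.

```shell
# Hypothetical helper: count "unreachable ..." lines read from stdin.
# Prints 0 when there are none (including on empty input).
count_unreachable() {
    awk '/^unreachable / { n++ } END { print n + 0 }'
}
```

Run inside the repository as `git fsck --unreachable | count_unreachable`. After a successful cleanup the count is expected to be 0; a non-zero count suggests that something (a leftover ref or reflog entry, for instance) still protected the old objects when the GC ran.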

Verify Changes

Now verify if your Git repository has become smaller.

# Print out the space usage of the current Git.
# Compared to the previous run, it should be wayyy smaller.
$ git count-objects -vH

count: 0
size: 0 bytes
in-pack: 1054
packs: 1
size-pack: 199.19 KiB
prune-packable: 0
garbage: 0
size-garbage: 0 bytes
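If you prefer a number over eyeballing the two outputs, the pack size can be pulled out of `git count-objects -v` (without `-H`, so that `size-pack` is reported in KiB) and turned into a percentage. A minimal sketch; both function names are hypothetical.

```shell
# Hypothetical helper: extract the size-pack value (KiB) from the
# output of `git count-objects -v` read on stdin.
pack_size_kib() {
    awk -F': ' '$1 == "size-pack" { print $2 }'
}

# Hypothetical helper: percentage by which the size shrank,
# given the size before and after (same unit for both).
reduction_percent() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before <= 0) { print "before must be > 0" > "/dev/stderr"; exit 1 }
        printf "%.1f\n", (before - after) * 100 / before
    }'
}
```

Capture the value once before the migration with `before=$(git count-objects -v | pack_size_kib)`, again after the cleanup into `after`, then call `reduction_percent "$before" "$after"`.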

If you wonder where all the files are now, the du command can tell you.

$ du -d 1 .
# in this example output, the files are moved from objects to lfs.
232K    ./objects
8.0K    ./info
 72K    ./hooks
  0B    ./refs
1.0G    ./lfs
1.0G    .
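What remains in the Git history after the migration is a small text pointer per binary file: a `version https://git-lfs.github.com/spec/v1` line, followed by `oid sha256:<hash>` and `size <bytes>` lines. The sketch below checks whether content read from stdin looks like such a pointer; `is_lfs_pointer` is a hypothetical helper and it inspects only the first line.

```shell
# Hypothetical helper: succeed if stdin starts with the Git LFS
# pointer version line, fail otherwise. Only the first line is checked.
is_lfs_pointer() {
    head -n 1 | grep -qx 'version https://git-lfs\.github\.com/spec/v1'
}
```

For example, `git show HEAD:docs/report.docx | is_lfs_pointer && echo "stored as LFS pointer"`, where `docs/report.docx` stands in for a path from your own repository. `git lfs ls-files --all` gives the full list of LFS-tracked files.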

Push to remote repository

It’s time to push the new Git repository into your remote one.

IMPORTANT
It is critical that all refs are pushed into the remote repository.

# The `--mirror` option ensures all `refs` are pushed.
# @see https://git-scm.com/docs/git-push#Documentation/git-push.txt---mirror
$ git push --mirror --force

Contact your Git Hosting provider to run git GC

Successfully pushing the new Git LFS repository to the remote is not the end of the story.

As Git is a distributed version control system, your Provider also holds a copy of your entire Git repository. It’s important that your Provider runs Git GC on their own infrastructure.

For BitBucket, you have to open a BitBucket Cloud Support ticket and request them to run git GC. GitLab automatically runs gc on each push. I cannot find any relevant information on GitHub (though GitHub Enterprise requires you to contact their Support).
