
Why the Hell Is This Git Repo So Large? And How to Trim It Down

Ryan Gibson
Quantitative Analyst | Computer Scientist

It is bad practice to commit large data files to version control unless absolutely necessary. However, many novices treat Git more like a general storage solution than a version control system.

Remember: once you commit something, it bloats the repository forever (or at least until someone rewrites the history)!

Git must effectively retain every version of every file that has ever existed in your repository.[1] This is a fundamental feature of version control, allowing recovery of the full history of changes.

To illustrate the issue, let’s look at one of the more popular Hugo themes, Blowfish.

A histogram of Git repository sizes for the top 100 Hugo themes on GitHub. The median Hugo theme is ~8 MB (~5 MB for a shallow clone) while Blowfish is ~534 MB (~129 MB for a shallow clone).
Blowfish has the largest Git repository out of all the major Hugo themes, by far.

As you can see, Blowfish’s repository is ~25x to ~65x larger than the median, depending on whether you’re considering the most recent version or the entire commit history.[2] This has understandably led to a number of user complaints.

Most projects should be quite small and rarely need to exceed a few hundred MB in their Git repositories.[3]

What went wrong with Blowfish?

To understand the problem, let’s see what’s in this repository.

$ ncdu ~/Desktop/blowfish
--- ~/Desktop/blowfish ------------------
  450.1 MiB [##########] /.git
   48.1 MiB [#         ] /exampleSite
    6.8 MiB [          ] /assets
    6.2 MiB [          ] /images
  544.0 KiB [          ] /layouts
  ...

Incredibly, the author has committed the entire example site to the theme itself. Unsurprisingly, this includes the largest files in the Git tree, with images taking up a significant portion.

$ git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 -r | head -n 5
100644 blob 9f1220a6b0f2c87acd0f52187f905c4a12d7547c 5097493	exampleSite/assets/img/ocean.jpg
100644 blob 6352a781b3a41106cf3cf7a1175692d5009410a2 5000159	exampleSite/assets/img/iceland.jpg
100644 blob 946d6ae539f70cbee1a0ebe939ba0beb75fc619e 3715726	images/home-card.png
100644 blob 3dce007acabbec440d303bb632042981d5354e96 3335717	assets/lib/mermaid/mermaid.min.js
100644 blob 2455b8158a66e80544dbb23da982826e5a8b5a37 2360966	exampleSite/content/guides/202310-blowfish-tutorial/img/01.png
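
The raw byte counts above are easier to scan in megabytes. As a minimal sketch (the function name `largest_blobs` is hypothetical and not part of Git), the following filter reads `git ls-tree -r -l` output on standard input and prints the largest blobs with their sizes converted to MB:

```shell
# Hypothetical helper: summarize `git ls-tree -r -l` output.
# stdin: lines of the form "<mode> <type> <object id> <size>\t<path>"
# $1:    number of entries to print (defaults to 5)
largest_blobs() {
  # Keep only blobs and emit "<bytes>\t<path>"; splitting on the tab first
  # keeps paths that contain spaces intact.
  awk -F '\t' '{ split($1, meta, " "); if (meta[2] == "blob") print meta[4] "\t" $2 }' |
    sort -n -r |
    head -n "${1:-5}" |
    awk -F '\t' '{ printf "%.1f MB\t%s\n", $1 / 1000000, $2 }'
}

# Usage (inside a repository):
#   git ls-tree -r -l HEAD | largest_blobs 5
```

Sorting on the byte count before converting keeps the ordering exact even when two files round to the same number of megabytes.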

In fact, 90% of the example site’s file size is composed of images.

$ find exampleSite/ -regex '.*\.\(png\|jpg\)' -exec stat -c %s {} \; | awk '{sum+=$1}END{print sum}'
43200010 # ~43 MB of the ~48 MB exampleSite/ directory
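
The command above relies on GNU-specific options (`find -regex` with `\|` and `stat -c`). As a rough portable sketch, assuming only POSIX `find`, `cat`, `wc`, and `awk`, the same share can be computed directly. The function name `image_share` is hypothetical, and like the original command it only counts `.png` and `.jpg` files as images:

```shell
# Hypothetical helper: print the percentage of a directory's bytes
# that come from .png and .jpg files.
# $1: directory to inspect
image_share() {
  dir=$1
  # Concatenating the files and counting bytes avoids GNU-only `stat -c`.
  img_bytes=$(find "$dir" -type f \( -name '*.png' -o -name '*.jpg' \) -exec cat {} + | wc -c)
  all_bytes=$(find "$dir" -type f -exec cat {} + | wc -c)
  awk -v img="$img_bytes" -v all="$all_bytes" \
    'BEGIN { if (all > 0) printf "%.0f%%\n", 100 * img / all; else print "0%" }'
}

# Usage:
#   image_share exampleSite/
```

This reads every file once, so it is slower than `stat` on very large trees, but it behaves the same on Linux, macOS, and the BSDs.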

Moreover, the author has been committing a screenshot of every site that uses the theme! Needless to say, this is horrifically bad practice as it forces every user to download these files to use the theme.[4]

A screenshot of the "Users" page in Blowfish's documentation and example site.
Every user has a screenshot of their site committed to Blowfish on request.

A glance through a simple analysis from the git-filter-repo command reveals even more issues.

In the past, they have accidentally committed non-minified versions of JavaScript dependencies, added an entire node_modules/ folder of third-party libraries, pushed several complete builds of since-deleted example sites, and regularly committed large images that are unused in the theme itself.

$ git filter-repo --analyze
$ less .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
    ...
    6942334    6943896 2023-10-15 public/docs/welcome/featured.png
    ...
    6600126    6592756 <present>  images/screenshot.png
    ...
    5000159    4997680 2022-10-02 docs/iceland.jpg
    ...
    14460503   1336164 2024-07-02 assets/lib/mermaid/mermaid.js

$ less .git/filter-repo/analysis/directories-deleted-sizes.txt
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
    174077004  166873965 2023-10-15 public
    ...
    100530759   58273713 2022-10-02 docs
    ...
     30062914   29178879 2022-09-11 exampleSitePersonal
    ...
     29283034   28540711 2022-09-10 content
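
These reports are long, so it can help to filter them down to the entries that matter. The sketch below assumes the four-column layout shown in the `Format:` line above (unpacked size, packed size, date deleted, name) and lists only deleted entries whose packed size meets a threshold. The function name `deleted_over` is hypothetical:

```shell
# Hypothetical helper: list deleted paths/directories from a
# git-filter-repo analysis report whose packed size is >= a threshold.
# $1:    minimum packed size in bytes
# stdin: contents of e.g. directories-deleted-sizes.txt
deleted_over() {
  awk -v min="$1" '
    # Data rows start with two integer columns; this skips headers and "..." lines.
    $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ && $3 != "<present>" && $2 + 0 >= min + 0 {
      path = $0
      # Strip the three leading columns so that names containing spaces stay intact.
      sub(/^[ \t]*[0-9]+[ \t]+[0-9]+[ \t]+[^ \t]+[ \t]+/, "", path)
      printf "%.1f MB\t%s\t%s\n", $2 / 1000000, $3, path
    }'
}

# Usage (after `git filter-repo --analyze`):
#   deleted_over 25000000 < .git/filter-repo/analysis/directories-deleted-sizes.txt
```

The packed size is the better column to filter on, since it approximates what each entry actually costs in the `.git/` directory.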

How to fix it

Clearly, there are a lot of issues in this repository, so let’s try to fix them.

The methods available will depend on whether we have full write access to the repo as a maintainer or are simply a user.

As a user

As a user of the theme,

  • You probably don’t need the history of the repository, so using a shallow clone is a good start.
  • You definitely don’t need the example site and related images, so removing them with a sparse checkout will help significantly.

By cloning only the most recent commit and excluding unnecessary paths, we can greatly reduce the size of the download and disk usage.

git clone --filter=blob:none --no-checkout --depth=1 --sparse https://github.com/nunocoracao/blowfish.git
cd blowfish
printf '/*\n!exampleSite/*\n!images/*\n!assets/img/*\n!*.png' > .git/info/sparse-checkout
git checkout

This approach immediately gets the clone down to a <3 MB download and ~9 MB of total disk usage, which should be suitable for most purposes.

The sparse-checkout patterns above read as follows:

/*             # we want everything, except
!exampleSite/* # nothing from exampleSite/
!images/*      # nothing from images/
!assets/img/*  # nothing from assets/img/
!*.png         # and no PNGs from the root directory
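
If you want to exclude a different set of directories, the pattern list can be generated rather than typed by hand. This is a small sketch with a hypothetical function name (`sparse_patterns`). It emits the same "include everything, then exclude" structure described above, one pattern per line:

```shell
# Hypothetical helper: print sparse-checkout patterns that include
# everything except the contents of the given directories.
# $@: directories to exclude (a trailing slash is tolerated)
sparse_patterns() {
  printf '/*\n'
  for dir in "$@"; do
    printf '!%s/*\n' "${dir%/}"
  done
}

# Usage (inside the clone, before `git checkout`):
#   sparse_patterns exampleSite images assets/img > .git/info/sparse-checkout
```

Extra file patterns such as `!*.png` can simply be appended to the generated file.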

Additionally, many of the remaining large files belong to the bundled flowchart and math typesetting libraries, mermaid and katex. If those are not required for your use case, removing them further reduces the overall theme size to ~2 MB!

As a maintainer

As a maintainer with write access to the repository, we’ll need to go back in time and retroactively remove all these unnecessary files and directories. We’ll use the git-filter-repo command to accomplish this.

Warning: rewriting Git history is destructive. Make sure you have local backups before proceeding. After making these changes, all collaborators should start with a fresh clone to prevent conflicts and avoid merging away all your work.
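
One simple way to take that backup is a mirror clone, which copies every branch, tag, and other ref into a separate bare repository. The wrapper below is only a sketch: the function name `backup_repo` and the example paths are hypothetical, while `git clone --mirror` is a standard Git option.

```shell
# Hypothetical helper: back up all refs of a repository before rewriting history.
# $1: path or URL of the repository to back up
# $2: destination path for the bare mirror (must not exist yet)
backup_repo() {
  src=$1
  dest=$2
  if [ -e "$dest" ]; then
    echo "refusing to overwrite existing path: $dest" >&2
    return 1
  fi
  git clone --quiet --mirror "$src" "$dest"
}

# Usage:
#   backup_repo ~/Desktop/blowfish ~/Desktop/blowfish-backup.git
```

Keeping the backup outside the working repository means it is untouched by any later history rewrite.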

We already know that we need to clean up exampleSite/, images/, and assets/img/. We should also eliminate the public/, docs/, exampleSitePersonal/, content/, and node_modules/ directories that were present in the history but not in recent commits.

$ git filter-repo --invert-paths --path exampleSite/ --path images/ --path assets/img/ --path public/ --path docs/ --path exampleSitePersonal/ --path content/ --path node_modules/
$ du -s --si
37M	.

Applying this filter command reduces the repository size from ~534 MB to ~37 MB – a significant improvement. However, to further address the bloat, we’ll need to dig deeper.

$ git filter-repo --analyze
$ head -n 10 .git/filter-repo/analysis/path-all-sizes.txt 
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
    13659165    2684178 <present>  assets/lib/mermaid/mermaid.min.js
    42913700    1754245 <present>  package-lock.json
    14460503    1336163 2024-07-02 assets/lib/mermaid/mermaid.js
    15266445    1238056 <present>  assets/css/compiled/main.css
     4564827    1047141 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-a7fe3362.js.map
     4564827    1047140 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-2f51e52a.js.map
     3306156     918888 <present>  assets/lib/tw-elements/index.min.js
     4121397     702693 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-6f3d7532.js.map

This analysis reveals some non-minified JavaScript dependencies for the mermaid library. In the past, the author accidentally committed both minified and non-minified versions.

The most recent version of the theme only needs the minified assets/lib/mermaid/mermaid.min.js, so we can remove the rest.

$ ls assets/lib/mermaid/
mermaid.min.js
$ git filter-repo --invert-paths --path-regex '^assets\/lib\/mermaid\/(?!mermaid\.min\.js$).*'
$ du -s --si
20M	.
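
Because the rewrite is destructive, it can be worth previewing which historical paths a pattern like this targets before running it. POSIX `grep` and `awk` have no negative lookahead, so the sketch below expresses the same intent with plain string tests: everything under `assets/lib/mermaid/` except exactly `mermaid.min.js`. The function name `mermaid_paths_to_drop` is hypothetical, and this is an approximation of the regex's intent rather than a call into git-filter-repo itself.

```shell
# Hypothetical helper: from a list of paths on stdin, print those that the
# mermaid cleanup is meant to remove.
mermaid_paths_to_drop() {
  awk 'index($0, "assets/lib/mermaid/") == 1 && $0 != "assets/lib/mermaid/mermaid.min.js"'
}

# Usage: list every path that ever existed in history, then filter it.
#   git log --all --name-only --format= | sort -u | mermaid_paths_to_drop
```

Note that a file such as `mermaid.min.js.map` is still listed for removal, matching the `$` anchor inside the lookahead.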

With these additional files filtered out, our local repository size reduces to ~20 MB.

After pushing these changes back to GitHub, a shallow clone takes a ~3 MB download and ~10 MB of total disk usage. Including the full Git history requires just ~18 MB, a substantial reduction.

Overall, we’ve managed to trim the Git history down to just 3.4% of its original size without sacrificing functionality or version control. Even a shallow clone of just the most recent commit is ~8.6% the size of the upstream repository.
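
As a quick check of the full-history figure, the arithmetic can be reproduced from the rounded sizes quoted in this post (~534 MB originally, then ~37 MB, ~20 MB, and ~18 MB). The helper name `percent_of` is hypothetical:

```shell
# Hypothetical helper: print what percentage `part` is of `whole`.
percent_of() {
  awk -v part="$1" -v whole="$2" 'BEGIN { printf "%.1f%%\n", 100 * part / whole }'
}

percent_of 37 534   # after removing the large directories: 6.9%
percent_of 20 534   # after removing the extra mermaid files: 3.7%
percent_of 18 534   # full history after pushing to GitHub: 3.4%
```

Since the inputs are rounded to the nearest megabyte, the results are approximate.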

See also and references


  1. To be clear, Git doesn’t store a full copy of every version of every file. Common components are shared and compressed whenever possible. However, the core lesson remains: adding large files will permanently affect your users.

  2. A default (full) git clone takes >500 MB of disk space, and cloning just the most recent commit with the --depth=1 flag takes >100 MB. The .git/ version control information alone accounts for ~470 MB of a full clone and ~50 MB of a shallow one.

  3. An analysis of the top 100 most starred projects on GitHub shows that the median repository only uses ~90 MB of disk space for a full clone.

    This analysis includes the most popular libraries in the world, such as React, TensorFlow, Bootstrap, Visual Studio Code, the Go programming language, Electron, Kubernetes, Node.js, and many more.

    In fact, the vast majority of these projects remain under 1 GB, which aligns with GitHub’s own recommendation for ideal repository size. GitHub also strongly recommends keeping repositories under 5 GB, a limit only breached by the Linux kernel here, with its 1.3 million commits and >19 years of history. Your typical project should never reach these sizes without an extremely good reason.

    A histogram of Git repository sizes for the top 100 starred GitHub projects. The median repo size is ~91 MB (~23 MB for a shallow clone). Among repos larger than 2.5 MB, the median repo size is ~152 MB (~37 MB for a shallow clone). The 90th percentile repo size is ~1136 MB (~327 MB for a shallow clone).

  4. This is true by default, but we’ll see how to selectively ignore these files later via a Git sparse checkout.
