It is bad practice to commit large data files to version control unless absolutely necessary. However, many novices treat Git more like a general storage solution than a version control system.
Remember: if you commit something, it’s bloating the repository forever!
Git must effectively retain every version of every file that has ever existed in your repository.1 This is a fundamental feature of version control, allowing recovery of the full history of changes.
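To make this concrete, here is a minimal sketch in a throwaway repository. The file name `big.bin` and the 1 MiB size are illustrative assumptions, not taken from any real project. It commits a file, deletes it in a second commit, and then checks that the blob is still stored.

```shell
#!/bin/sh
# Sketch: a file deleted in a later commit still lives in the object store.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

# Commit 1 MiB of incompressible data, then delete it in a second commit.
head -c 1048576 /dev/urandom > big.bin
git add big.bin
git commit -q -m "add big.bin"
blob=$(git rev-parse HEAD:big.bin)
git rm -q big.bin
git commit -q -m "remove big.bin"

# The file is gone from the working tree, but Git can still serve the blob.
[ ! -e big.bin ] && echo "big.bin removed from working tree"
blob_size=$(git cat-file -s "$blob")
echo "blob still in object store: $blob_size bytes"
```

Anyone who makes a full clone of such a repository downloads that blob too, even though no current file refers to it.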
To illustrate the issue, let’s look at one of the more popular Hugo themes, Blowfish.
As you can see, Blowfish’s repository is ~25x to ~65x larger than the median, depending on whether you’re considering the most recent version or the entire commit history.2 This has understandably led to a number of user complaints.
Most projects should be quite small and rarely need to exceed a few hundred MB in their Git repositories.3
What went wrong with Blowfish?#
To understand the problem, let’s see what’s in this repository.
$ ncdu ~/Desktop/blowfish
--- ~/Desktop/blowfish ------------------
450.1 MiB [##########] /.git
48.1 MiB [# ] /exampleSite
6.8 MiB [ ] /assets
6.2 MiB [ ] /images
544.0 KiB [ ] /layouts
...
Incredibly, the author has committed the entire example site to the theme itself. Unsurprisingly, this includes the largest files in the Git tree, with images taking up a significant portion.
$ git ls-tree -r -t -l --full-name HEAD | sort -n -k 4 -r | head -n 5
100644 blob 9f1220a6b0f2c87acd0f52187f905c4a12d7547c 5097493 exampleSite/assets/img/ocean.jpg
100644 blob 6352a781b3a41106cf3cf7a1175692d5009410a2 5000159 exampleSite/assets/img/iceland.jpg
100644 blob 946d6ae539f70cbee1a0ebe939ba0beb75fc619e 3715726 images/home-card.png
100644 blob 3dce007acabbec440d303bb632042981d5354e96 3335717 assets/lib/mermaid/mermaid.min.js
100644 blob 2455b8158a66e80544dbb23da982826e5a8b5a37 2360966 exampleSite/content/guides/202310-blowfish-tutorial/img/01.png
In fact, 90% of the example site’s file size is composed of images.
$ find exampleSite/ -regex '.*\.\(png\|jpg\)' -exec stat -c %s {} \; | awk '{sum+=$1}END{print sum}'
43200010 # ~43 MB of the ~48 MB exampleSite/ directory
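The same kind of measurement can be sketched without GNU-specific `stat` flags by streaming file contents through `wc -c`. The directory layout and file sizes below are made up for the demonstration, and reading every file is slower than `stat` on a large tree.

```shell
#!/bin/sh
# Sketch: what share of a directory's bytes are images?
set -eu
site=$(mktemp -d)
mkdir -p "$site/assets/img" "$site/content"
head -c 9000 /dev/zero > "$site/assets/img/photo.jpg"   # hypothetical image
head -c 1000 /dev/zero > "$site/content/post.md"        # hypothetical page

# Count bytes by concatenating the matching files and measuring the stream.
image_bytes=$(find "$site" -type f \( -name '*.png' -o -name '*.jpg' \) -exec cat {} + | wc -c | tr -d ' ')
total_bytes=$(find "$site" -type f -exec cat {} + | wc -c | tr -d ' ')
share=$(( image_bytes * 100 / total_bytes ))
echo "images: $image_bytes of $total_bytes bytes (${share}%)"
```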
Moreover, the author has been committing a screenshot of every site that uses the theme! Needless to say, this is horrifically bad practice as it forces every user to download these files to use the theme.4
A glance through a simple analysis from the git-filter-repo command reveals even more issues.
In the past, they have accidentally committed non-minified versions of JavaScript dependencies, added an entire node_modules/ folder of third-party libraries, pushed several complete builds of since-deleted example sites, and regularly committed large images that are unused in the theme itself.
$ git filter-repo --analyze
$ less .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
...
6942334 6943896 2023-10-15 public/docs/welcome/featured.png
...
6600126 6592756 <present> images/screenshot.png
...
5000159 4997680 2022-10-02 docs/iceland.jpg
...
14460503 1336164 2024-07-02 assets/lib/mermaid/mermaid.js
$ less .git/filter-repo/analysis/directories-deleted-sizes.txt
=== Deleted directories by reverse size ===
Format: unpacked size, packed size, date deleted, directory name
174077004 166873965 2023-10-15 public
...
100530759 58273713 2022-10-02 docs
...
30062914 29178879 2022-09-11 exampleSitePersonal
...
29283034 28540711 2022-09-10 content
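If git-filter-repo is not installed, a rougher version of this report can be produced with Git's own plumbing. The sketch below builds a throwaway repository (file names and sizes are assumptions for the demo) and lists the largest blobs reachable from any ref, including ones that were later deleted.

```shell
#!/bin/sh
# Sketch: find the largest blobs anywhere in history using only Git plumbing.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

head -c 200000 /dev/urandom > screenshot.png
echo "small" > README.md
git add .
git commit -q -m "add files"
git rm -q screenshot.png
git commit -q -m "remove screenshot"

# rev-list prints "<object id> <path>"; cat-file adds type and size;
# awk keeps blobs only; sort orders them by size, largest first.
largest=$(git rev-list --objects --all |
  git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
  awk '$1 == "blob" { print $2, $3 }' |
  sort -n -r | head -n 5)
echo "$largest"
```

Unlike the filter-repo analysis, this reports unpacked sizes only and does not say when a path was deleted.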
How to fix it#
Clearly, there are a lot of issues in this repository, so let’s try to fix them.
The methods available will depend on whether we have full write access to the repo as a maintainer or are simply a user.
As a user#
As a user of the theme,
- You probably don’t need the history of the repository, so using a shallow clone is a good start.
- You definitely don’t need the example site and related images, so removing them with a sparse checkout will help significantly.
By cloning only the most recent commit and excluding unnecessary paths, we can greatly reduce the size of the download and disk usage.
git clone --filter=blob:none --no-checkout --depth=1 --sparse https://github.com/nunocoracao/blowfish.git
cd blowfish
printf '/*\n!exampleSite/*\n!images/*\n!assets/img/*\n!*.png' > .git/info/sparse-checkout
git checkout
This approach immediately gets the clone down to a <3 MB download and ~9 MB of total disk usage, which should be suitable for most purposes.
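To see what `--depth=1` does in isolation, the following sketch clones a small throwaway repository (three hypothetical commits) and compares how much history each copy holds. A `file://` URL is used because Git ignores `--depth` when cloning from a plain local path.

```shell
#!/bin/sh
# Sketch: a shallow clone carries only the most recent commit.
set -eu
origin=$(mktemp -d)
git init -q "$origin"
git -C "$origin" config user.email "demo@example.com"
git -C "$origin" config user.name "Demo"
for i in 1 2 3; do
  echo "revision $i" > "$origin/file.txt"
  git -C "$origin" add file.txt
  git -C "$origin" commit -q -m "commit $i"
done

shallow=$(mktemp -d)
git clone -q --depth=1 "file://$origin" "$shallow"

full_count=$(git -C "$origin" rev-list --count HEAD)
shallow_count=$(git -C "$shallow" rev-list --count HEAD)
echo "origin: $full_count commits, shallow clone: $shallow_count commit"
```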
For the sparse checkout, we are only interested in the following paths:
/* # we want everything, except
!exampleSite/* # nothing from exampleSite/
!images/* # nothing from images/
!assets/img/* # nothing from assets/img/
!*.png # and no PNGs at any depth (a pattern with no slash matches in every directory; use !/*.png for the root only)
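The pattern semantics can be checked on a throwaway repository before touching a real clone. This sketch uses made-up paths and enables sparse checkout through the `core.sparseCheckout` setting rather than `git clone --sparse`, which is an assumption made to keep the demo independent of the Git version's cone-mode defaults. It also shows that a pattern without a slash, such as `!*.png`, applies in every directory.

```shell
#!/bin/sh
# Sketch: try sparse-checkout patterns on a tiny local repository.
set -eu
origin=$(mktemp -d)
git init -q "$origin"
git -C "$origin" config user.email "demo@example.com"
git -C "$origin" config user.name "Demo"
mkdir -p "$origin/layouts" "$origin/exampleSite/content" \
         "$origin/images" "$origin/assets/icons"
echo x > "$origin/layouts/index.html"
echo x > "$origin/exampleSite/content/post.md"
echo x > "$origin/images/shot.png"
echo x > "$origin/assets/icons/logo.png"
echo x > "$origin/assets/main.css"
git -C "$origin" add .
git -C "$origin" commit -q -m "initial"

work=$(mktemp -d)
git clone -q --no-checkout "$origin" "$work"
cd "$work"
git config core.sparseCheckout true
git config core.sparseCheckoutCone false   # full pattern matching, not cone mode
mkdir -p .git/info
printf '/*\n!exampleSite/*\n!images/*\n!*.png\n' > .git/info/sparse-checkout
git checkout -q

# List what actually landed in the working tree.
find . -path ./.git -prune -o -type f -print | sort
```

In this demo only `assets/main.css` and `layouts/index.html` should be checked out; the nested `assets/icons/logo.png` is skipped along with everything under `exampleSite/` and `images/`.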
Additionally, many of the remaining large files belong to the bundled flowchart and math typesetting libraries, mermaid and katex. If those are not required for your use case, removing them further reduces the overall theme size to ~2 MB!
As a maintainer#
As a maintainer with write access to the repository, we’ll need to go back in time and retroactively remove all these unnecessary files and directories. We’ll use the git-filter-repo command to accomplish this.
Warning: rewriting Git history is destructive. Make sure you have local backups before proceeding. After making these changes, all collaborators should start with a fresh clone to prevent conflicts and avoid merging away all your work.
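One way to take such a backup is a mirror clone, which copies every ref and object into a separate bare repository. The sketch below demonstrates it on a throwaway repository; the paths are temporary directories chosen for the demo, and `--no-hardlinks` is used so the backup shares no files with the original.

```shell
#!/bin/sh
# Sketch: back up a repository with a mirror clone before rewriting history.
set -eu
repo=$(mktemp -d)
git init -q "$repo"
git -C "$repo" config user.email "demo@example.com"
git -C "$repo" config user.name "Demo"
echo "content" > "$repo/file.txt"
git -C "$repo" add file.txt
git -C "$repo" commit -q -m "initial"

# Copy all refs and objects into an independent bare repository.
backup="$(mktemp -d)/backup.git"
git clone -q --mirror --no-hardlinks "$repo" "$backup"

original_head=$(git -C "$repo" rev-parse HEAD)
backup_head=$(git -C "$backup" rev-parse HEAD)
[ "$original_head" = "$backup_head" ] && echo "backup matches original HEAD"
```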
We already know that we need to clean up exampleSite/, images/, and assets/img/. We should also eliminate the public/, docs/, exampleSitePersonal/, content/, and node_modules/ directories that were present in the history but not in recent commits.
$ git filter-repo --invert-paths --path exampleSite/ --path images/ --path assets/img/ --path public/ --path docs/ --path exampleSitePersonal/ --path content/ --path node_modules/
$ du -s --si
37M .
Applying this filter command reduces the repository size from ~534 MB to ~37 MB – a significant improvement. However, to further address the bloat, we’ll need to dig deeper.
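As a sanity check after a rewrite like this, one option is to count the commits that still touch a removed path; the count should be zero once the path is gone from every commit. The sketch below only demonstrates how that count behaves, using an unrewritten throwaway repository in which `docs/` and `public/` are stand-in directory names.

```shell
#!/bin/sh
# Sketch: count commits on any ref that touch a given path.
set -eu
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "demo@example.com"
git config user.name "Demo"

mkdir docs
echo "guide" > docs/guide.md
echo "readme" > README.md
git add .
git commit -q -m "add docs"
git rm -q -r docs
git commit -q -m "remove docs"

# docs/ was added and later deleted, so two commits still reference it.
docs_commits=$(git rev-list --all --count -- docs/)
# public/ never existed here, which is what a fully cleaned path looks like.
public_commits=$(git rev-list --all --count -- public/)
echo "commits touching docs/: $docs_commits, public/: $public_commits"
```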
$ git filter-repo --analyze
$ head -n 10 .git/filter-repo/analysis/path-all-sizes.txt
=== All paths by reverse accumulated size ===
Format: unpacked size, packed size, date deleted, path name
13659165 2684178 <present> assets/lib/mermaid/mermaid.min.js
42913700 1754245 <present> package-lock.json
14460503 1336163 2024-07-02 assets/lib/mermaid/mermaid.js
15266445 1238056 <present> assets/css/compiled/main.css
4564827 1047141 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-a7fe3362.js.map
4564827 1047140 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-2f51e52a.js.map
3306156 918888 <present> assets/lib/tw-elements/index.min.js
4121397 702693 2024-07-02 assets/lib/mermaid/flowchart-elk-definition-6f3d7532.js.map
This analysis reveals some non-minified JavaScript dependencies for the mermaid library. The author has accidentally committed both minified and non-minified versions in the past. The most recent version of the theme only needs the minified assets/lib/mermaid/mermaid.min.js, so we can remove the rest.
$ ls assets/lib/mermaid/
mermaid.min.js
$ git filter-repo --invert-paths --path-regex '^assets\/lib\/mermaid\/(?!mermaid\.min\.js$).*'
$ du -s --si
20M .
With these additional files filtered out, our local repository size reduces to ~20 MB.
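Because a mistake in a history-rewriting regex is costly, it can help to dry-run the pattern against a list of paths first. This sketch feeds a few paths from the analysis above through `grep -P`, assuming a GNU grep built with PCRE support; git-filter-repo itself evaluates the pattern with Python's regex engine, which treats this particular lookahead the same way.

```shell
#!/bin/sh
# Sketch: preview which paths the negative-lookahead regex would remove.
set -eu
paths='assets/lib/mermaid/mermaid.min.js
assets/lib/mermaid/mermaid.js
assets/lib/mermaid/flowchart-elk-definition-a7fe3362.js.map
assets/lib/tw-elements/index.min.js'

# Matches everything under assets/lib/mermaid/ except mermaid.min.js itself.
removed=$(printf '%s\n' "$paths" |
  grep -P '^assets\/lib\/mermaid\/(?!mermaid\.min\.js$).*')
removed_count=$(printf '%s\n' "$removed" | wc -l | tr -d ' ')
echo "would remove $removed_count paths:"
echo "$removed"
```

Only `mermaid.js` and the `.js.map` file are selected; `mermaid.min.js` and the path outside `assets/lib/mermaid/` are left alone.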
After pushing these changes back to GitHub, a shallow clone takes ~3 MB of download and ~10 MB of total disk usage. Including the full Git history requires just ~18 MB, which is a substantial reduction.
Overall, we’ve managed to trim the Git history down to just 3.4% of its original size without sacrificing functionality or version control. Even a shallow clone of just the most recent commit is ~8.6% the size of the upstream repository.
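The 3.4% figure follows from the sizes quoted earlier: ~18 MB of full history against the original ~534 MB. A small sketch of that arithmetic, also covering the two intermediate sizes (~37 MB and ~20 MB), using the rounded numbers from the text:

```shell
#!/bin/sh
# Sketch: express each repository size as a share of the original ~534 MB.
set -eu
original_mb=534
for size_mb in 37 20 18; do
  awk -v size="$size_mb" -v orig="$original_mb" \
    'BEGIN { printf "%d MB is %.1f%% of %d MB\n", size, size / orig * 100, orig }'
done

# The final full-history size as a share of the original.
final_share=$(awk -v orig="$original_mb" 'BEGIN { printf "%.1f%%", 18 / orig * 100 }')
echo "final share: $final_share"
```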
See also and references#
- The documentation of git-filter-repo, the tool recommended by the Git project for efficiently rewriting large chunks of Git history.
- GitHub’s documentation on rewriting history to remove sensitive data. This is additionally useful for understanding the broader context of history rewriting as it pertains to data security.
- The Git documentation on shallow clones and sparse checkouts.
To be clear, Git doesn’t store full copies of every version or file. Common components are shared and compressed whenever possible. However, the core lesson remains: adding large files will permanently affect your users. ↩︎
A default (full) git clone takes >500 MB of disk space and cloning just the most recent commit with the --depth=1 flag takes >100 MB. The .git/ version control information alone accounts for ~470 MB of a full clone and ~50 MB of a shallow one. ↩︎
An analysis of the top 100 most starred projects on GitHub shows that the median repository only uses ~90 MB of disk space for a full clone.
This analysis includes the most popular libraries in the world, such as React, TensorFlow, Bootstrap, Visual Studio Code, the Go programming language, Electron, Kubernetes, Node.js, and many more.
In fact, the vast majority of these projects remain under 1 GB, which aligns with GitHub’s own recommendation for ideal repository size. GitHub also strongly recommends keeping repositories under 5 GB, a limit only breached by the Linux kernel here, with its 1.3 million commits and >19 years of history. Your typical project should never reach these sizes without an extremely good reason.
This is true by default, but we’ll see how to selectively ignore these files later via a Git sparse checkout. ↩︎