Description
I recently had to get a clean fork of the project again and was a bit puzzled by how much it has grown. I did a bit of analysis and want to share the following observations:
I am curious to understand why so many translated_images are generated (see the folder).
IMO, these are generated files that should not necessarily be put under version control but rather go into the output (maybe we should think about a publishing pipeline that then includes all the translations).
The same question applies to the translations. There are a lot of files; they do not cause size problems (unlike, IMO, translated_images), but do we have to store generated files in the repo? I always tell my students not to store them, because they can be reproduced via the toolchain and are mainly output (like obj, dll, or lib files, or clang-generated tables).
So here are some numbers:
- the main repo has 1 GB of data (without translated images)
- the "translated_images" folder has 3 GB! It can no longer even be displayed on the GitHub web page (you get an error message because it has more than 6000 files, and no git information is shown anymore)
- the "translated" folder has about 4000 files, while the whole rest of the project has 1800 files for the content
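For anyone who wants to reproduce numbers like these without WizTree, here is a minimal Python sketch that counts files and sums sizes per top-level folder of a checkout. The function name `folder_stats` is just made up for illustration, and skipping `.git` and symlinks is my assumption about what should count as "content".

```python
import os
from pathlib import Path


def folder_stats(root):
    """Return {folder_name: (file_count, total_bytes)} for each
    top-level directory under root, skipping the .git store."""
    stats = {}
    for entry in Path(root).iterdir():
        if not entry.is_dir() or entry.name == ".git":
            continue
        count = 0
        size = 0
        # Walk the whole subtree of this top-level folder.
        for dirpath, _dirnames, filenames in os.walk(entry):
            for name in filenames:
                path = os.path.join(dirpath, name)
                if os.path.islink(path):
                    continue  # do not follow or count symlinks
                count += 1
                size += os.path.getsize(path)
        stats[entry.name] = (count, size)
    return stats
```

Calling `folder_stats(".")` from the root of the clone and sorting the result by the byte total should make the largest folders show up first.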
I really like Co-op Translator as it saves a lot of work; however, I think we should not store its output in git.
Rather, push it into a container or some binary that we store in Artifactory or the GitHub registry.
What do you think about this?
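To make the idea a bit more concrete, here is a rough Python sketch of the packaging step only: it bundles one generated folder into a single `.tar.gz` file that a pipeline could then upload somewhere. The helper name `bundle_folder` and the archive name are hypothetical, and the actual upload (Artifactory, GitHub registry, or a release asset) is deliberately not shown, since that depends on what the maintainers prefer.

```python
import tarfile
from pathlib import Path


def bundle_folder(folder, archive_path):
    """Pack a generated output folder into one gzip-compressed tarball.

    archive_path should live outside of folder, so the archive
    does not end up inside itself.
    """
    folder = Path(folder)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # Store entries relative to the folder name, not the full path.
        tar.add(str(folder), arcname=folder.name)
    return Path(archive_path)
```

For example, `bundle_folder("translated_images", "translated_images.tar.gz")` run from the repo root would produce one file to publish, and the folder itself could then be listed in `.gitignore`.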
Below are some screenshots from WizTree which show that "translated_images" and "translation" contribute significantly to the size. My downloaded LLMs for chapter 19-slnm, the Python venv, and the git repo store itself rank above them, but these two output folders follow immediately after.
Here is a picture of the GitHub UI, which has problems displaying the folder:
https://github.com/microsoft/generative-ai-for-beginners/tree/main/translated_images
