
Using compression on source code repository clones? #1

@rolinh

Description


We note that 500'000 git source code repositories take up ~10TB of storage space as tar archives. We currently have metadata for more than 3.5 million repositories in the database and, once we are done importing more data from the GHTorrent project, we will probably have around 10 million repositories (and maybe even more).

Since 500k repositories probably give a good approximation of the average repository size (roughly 20MB per repository as a tar archive), it is probably relatively accurate to state that 10 million repositories would require ~200TB of storage space.

Hence, using compression on source code repositories seems to be a good idea with regard to storage space. However, I am worried that this might introduce too much overhead when processing data: crawld would have to decompress + untar and then tar + compress again for every update operation, for instance, and other tools would be affected as well (repotool and srctool, hence mainly the language parsers).
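
To make the extra work concrete, here is a minimal sketch in Go of the two helpers this would imply around the existing tar archives: one used only by crawld (compress after clone/update) and one used by the read-only tools (decompress before reading). It assumes the github.com/klauspost/compress/zstd package and hypothetical file paths; it is not a definitive implementation.

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTar wraps an existing tar archive in a zstd stream.
// Only crawld would need this, after a clone or an update.
func compressTar(tarPath, dstPath string) error {
	in, err := os.Open(tarPath)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	// SpeedFastest trades compression ratio for throughput,
	// which matches the "fast to compress" requirement.
	enc, err := zstd.NewWriter(out, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, in); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}

// decompressTar restores the plain tar archive so that repotool,
// srctool and the language parsers can keep reading it as before.
func decompressTar(srcPath, tarPath string) error {
	in, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer in.Close()

	dec, err := zstd.NewReader(in)
	if err != nil {
		return err
	}
	defer dec.Close()

	out, err := os.Create(tarPath)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, dec)
	return err
}

func main() {
	// Hypothetical paths, for illustration only.
	if err := compressTar("repo.tar", "repo.tar.zst"); err != nil {
		panic(err)
	}
	if err := decompressTar("repo.tar.zst", "repo.tar"); err != nil {
		panic(err)
	}
}
```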

If we ever use compression, we should aim for an algorithm that is fast at both compression and decompression, at the cost of compression ratio if necessary (note that only crawld would compress data; all the other tools would only decompress, so decompression speed is probably the more important of the two). Snappy or Zstandard are probably good candidates.
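
A quick way to settle the Snappy vs. Zstandard question would be to compress a representative repository tarball with both and time decompression, since that is the hot path. A rough sketch (assuming github.com/golang/snappy and github.com/klauspost/compress/zstd, and a sample `repo.tar` on disk; real numbers would of course come from a proper benchmark over many archives):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/golang/snappy"
	"github.com/klauspost/compress/zstd"
)

func main() {
	// Sample archive; any representative repository tarball will do.
	raw, err := os.ReadFile("repo.tar")
	if err != nil {
		panic(err)
	}

	// Compress once with each codec.
	var snapBuf bytes.Buffer
	sw := snappy.NewBufferedWriter(&snapBuf)
	sw.Write(raw)
	sw.Close()

	var zstBuf bytes.Buffer
	zw, err := zstd.NewWriter(&zstBuf, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		panic(err)
	}
	zw.Write(raw)
	zw.Close()

	fmt.Printf("snappy: %d -> %d bytes\n", len(raw), snapBuf.Len())
	fmt.Printf("zstd:   %d -> %d bytes\n", len(raw), zstBuf.Len())

	// Time decompression, which is what repotool/srctool would pay on every read.
	start := time.Now()
	io.Copy(io.Discard, snappy.NewReader(bytes.NewReader(snapBuf.Bytes())))
	fmt.Println("snappy decompress:", time.Since(start))

	start = time.Now()
	zr, err := zstd.NewReader(bytes.NewReader(zstBuf.Bytes()))
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, zr)
	zr.Close()
	fmt.Println("zstd decompress:  ", time.Since(start))
}
```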
