
Using compression on source code repository clones? #1

@rolinh

Description


We note that 500'000 git source code repositories take up ~10TB of storage space as tar archives. We currently have metadata for more than 3.5 million repositories in the database and, once we are done importing more data from the GHTorrent project, we will probably have around 10 million repositories (and maybe even more).

Since 500k repositories probably give a good approximation of the average repository size (roughly 20MB per repository as a tar archive), it is probably relatively accurate to state that 10 million repositories would require ~200TB of storage space.

Hence, using compression on source code repositories seems to be a good idea with regard to storage space. However, I am worried that this might introduce too much overhead when processing data: crawld would have to decompress + untar and then tar + compress again for every update operation, for instance, and other tools would be affected as well (repotool and srctool, hence mainly the language parsers).
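
To make the extra work concrete, here is a minimal sketch in Go of the two helpers this would imply around the existing tar archives: one used only by crawld (compress after clone/update) and one used by the read-only tools (decompress before reading). It assumes the github.com/klauspost/compress/zstd package and hypothetical file paths; it is not a definitive implementation.

```go
package main

import (
	"io"
	"os"

	"github.com/klauspost/compress/zstd"
)

// compressTar wraps an existing tar archive in a zstd stream.
// Only crawld would need this, after a clone or an update.
func compressTar(tarPath, dstPath string) error {
	in, err := os.Open(tarPath)
	if err != nil {
		return err
	}
	defer in.Close()

	out, err := os.Create(dstPath)
	if err != nil {
		return err
	}
	defer out.Close()

	// SpeedFastest trades compression ratio for throughput,
	// which matches the "fast to compress" requirement.
	enc, err := zstd.NewWriter(out, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		return err
	}
	if _, err := io.Copy(enc, in); err != nil {
		enc.Close()
		return err
	}
	return enc.Close()
}

// decompressTar restores the plain tar archive so that repotool,
// srctool and the language parsers can keep reading it as before.
func decompressTar(srcPath, tarPath string) error {
	in, err := os.Open(srcPath)
	if err != nil {
		return err
	}
	defer in.Close()

	dec, err := zstd.NewReader(in)
	if err != nil {
		return err
	}
	defer dec.Close()

	out, err := os.Create(tarPath)
	if err != nil {
		return err
	}
	defer out.Close()

	_, err = io.Copy(out, dec)
	return err
}

func main() {
	// Hypothetical paths, for illustration only.
	if err := compressTar("repo.tar", "repo.tar.zst"); err != nil {
		panic(err)
	}
	if err := decompressTar("repo.tar.zst", "repo.tar"); err != nil {
		panic(err)
	}
}
```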

If we ever use compression, we should aim for an algorithm that is fast at both compression and decompression, at the cost of compression ratio if necessary (note that only crawld would compress data; all the other tools would only decompress, so decompression speed is probably the more important of the two). Snappy or Zstandard are probably good candidates.
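
A quick way to settle the Snappy vs. Zstandard question would be to compress a representative repository tarball with both and time decompression, since that is the hot path. A rough sketch (assuming github.com/golang/snappy and github.com/klauspost/compress/zstd, and a sample `repo.tar` on disk; real numbers would of course come from a proper benchmark over many archives):

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
	"time"

	"github.com/golang/snappy"
	"github.com/klauspost/compress/zstd"
)

func main() {
	// Sample archive; any representative repository tarball will do.
	raw, err := os.ReadFile("repo.tar")
	if err != nil {
		panic(err)
	}

	// Compress once with each codec.
	var snapBuf bytes.Buffer
	sw := snappy.NewBufferedWriter(&snapBuf)
	sw.Write(raw)
	sw.Close()

	var zstBuf bytes.Buffer
	zw, err := zstd.NewWriter(&zstBuf, zstd.WithEncoderLevel(zstd.SpeedFastest))
	if err != nil {
		panic(err)
	}
	zw.Write(raw)
	zw.Close()

	fmt.Printf("snappy: %d -> %d bytes\n", len(raw), snapBuf.Len())
	fmt.Printf("zstd:   %d -> %d bytes\n", len(raw), zstBuf.Len())

	// Time decompression, which is what repotool/srctool would pay on every read.
	start := time.Now()
	io.Copy(io.Discard, snappy.NewReader(bytes.NewReader(snapBuf.Bytes())))
	fmt.Println("snappy decompress:", time.Since(start))

	start = time.Now()
	zr, err := zstd.NewReader(bytes.NewReader(zstBuf.Bytes()))
	if err != nil {
		panic(err)
	}
	io.Copy(io.Discard, zr)
	zr.Close()
	fmt.Println("zstd decompress:  ", time.Since(start))
}
```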
