This repository was archived by the owner on Jul 28, 2025. It is now read-only.

mart-r commented Nov 14, 2024

Adds a module that converts the vocab vectors from the default length (or really any length) to a smaller one.

The default vocab vector length is 300. However, we don't really make use of all this information. Experiments show that we can use much shorter vocab vectors and retain the same performance. See e.g.: https://gist.github.com/mart-r/e9db909cde1922464bcc753f54006994
Or (somewhat more comprehensively): https://gist.github.com/mart-r/21460286466d17b9f23719ba3f4dc938
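
The description above does not say which technique the module uses to shorten the vectors. As a minimal sketch of one common option, the snippet below applies a PCA-style projection with plain NumPy. The function name `downsize_vectors` and the toy data are hypothetical and are not taken from this PR or from the library's API.

```python
import numpy as np


def downsize_vectors(vectors: np.ndarray, target_dim: int) -> np.ndarray:
    """Project row vectors onto their top `target_dim` principal components.

    Hypothetical helper for illustration; not the module added in this PR.
    """
    if not 0 < target_dim <= min(vectors.shape):
        raise ValueError("target_dim must be in 1..min(n_vectors, vector_length)")
    # Centre the data so the SVD yields principal directions.
    centered = vectors - vectors.mean(axis=0, keepdims=True)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:target_dim].T


# Toy stand-in for a vocab: 1000 words with 300-dimensional vectors.
rng = np.random.default_rng(0)
toy_vocab = rng.normal(size=(1000, 300)).astype(np.float32)
small = downsize_vectors(toy_vocab, 50)
print(small.shape)  # (1000, 50)
```

The SVD route keeps the sketch free of extra dependencies. Whatever method is used, the same fitted projection would have to be applied to every vector so that the downsized vectors remain comparable with each other.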

The benefits of using a smaller vector length mainly boil down to the following (examples use a vector length of 50):

  • Smaller saved vocab on disk
    • The saved vocab can go from 314MB down to 142MB
    • The CDB size will also go down significantly
      • Because the context vectors stored within it depend on the vectors in the Vocab
      • In a MIMIC-IV trained model it went from 1.7GB to 1.3GB
      • This effect can be larger if more concepts have been trained
    • The model pack size will also change accordingly
      • Normal model (MIMIC-IV trained) zip was 1.0GB
      • Downsized model (same model) zip was 410MB
  • Potentially faster load/save times
    • Since the files will be smaller
      • Though I don't have good evidence for that
    • Loading an already unpacked model pack (the difference may well be run-to-run variance)
      • Normal: 17.4s
      • Downsized: 17.1s
    • Loading a model pack before unpacking
      • Normal: 25.7s
      • Downsized: 22.1s
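
To put the figures above on a common scale, the short sketch below computes the relative reduction for each measurement. The numbers are the ones quoted in the list; the helper name `reduction_pct` is made up for illustration, and treating 1.0GB as 1000MB is an assumption.

```python
def reduction_pct(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, as a percentage."""
    return 100.0 * (before - after) / before


# (before, after) pairs as quoted in the list above.
measurements = {
    "vocab on disk (MB)": (314.0, 142.0),
    "CDB on disk (GB)": (1.7, 1.3),
    "model pack zip (MB)": (1000.0, 410.0),  # assumes 1.0GB == 1000MB
    "load time, unpacked (s)": (17.4, 17.1),
    "load time, before unpacking (s)": (25.7, 22.1),
}

for name, (before, after) in measurements.items():
    print(f"{name}: {reduction_pct(before, after):.1f}% reduction")
```

By this measure the saved vocab and the zipped model pack shrink by more than half, while the load-time change for an already unpacked model is under 2%, which is consistent with the caveat about run-to-run variance.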

NOTE:
There might be improvements we could make here:

  • Should this be in another module?
  • Should we add a CLI for model pack conversion?

mart-r merged commit b96310b into master Nov 27, 2024
7 checks passed
mart-r deleted the CU-8696nbm9j-downsize-vocab-vectors branch January 23, 2025 09:53
mart-r restored the CU-8696nbm9j-downsize-vocab-vectors branch February 18, 2025 16:24
mart-r deleted the CU-8696nbm9j-downsize-vocab-vectors branch February 18, 2025 16:26
alhendrickson pushed a commit to CogStack/cogstack-nlp that referenced this pull request Jul 1, 2025