
feat: Add Word2Vec/word embeddings support for better semantic similarity #106

@cardmagic

Problem

LSI classification fails when test documents have no vocabulary overlap with training data. This is a fundamental limitation of bag-of-words approaches.

Example: Classifying poetry by genre fails because:

  • "Deep into that darkness peering, fearing" → stems to: dark, deep, fear, peer
  • "Once upon a midnight dreary, weary" → stems to: dreari, midnight, weari
  • Zero overlap → similarity score = 0.0 → random classification

Proposed Solution

Add optional Word2Vec/word embeddings support that can find semantic similarity even without exact word matches (a small similarity sketch follows the examples below):

  • "fear" and "dread" → similar vectors → high similarity
  • "darkness" and "midnight" → related concepts → some similarity

Implementation Options

Option 1: word2vec-rb gem (recommended)

  • Ruby gem with C extensions wrapping Google's word2vec
  • Supports loading pre-trained models AND training custom ones
  • Last updated May 2022, MIT licensed
  • https://github.com/madcato/word2vec-rb

Option 2: Pre-trained embeddings only

  • Download GloVe or Word2Vec pre-trained vectors
  • Load on demand, cache in memory (a loading sketch follows this list)
  • Simpler but requires large file downloads (~1GB for GloVe)
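
GloVe's plain-text format is simple enough to load without a native extension: each line holds a word followed by its vector components. A minimal sketch, assuming one of the published glove.*.txt files:

def load_glove(path)
  embeddings = {}
  File.foreach(path) do |line|
    parts = line.split
    word  = parts.shift
    embeddings[word] = parts.map(&:to_f)
  end
  embeddings
end

vectors = load_glove("glove.6B.100d.txt")
vectors["fear"]  # => 100-element Array of Floats, or nil if out of vocabulary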

Proposed API

# Option A: New LSI backend
lsi = Classifier::LSI.new(similarity: :word2vec)
lsi.load_embeddings("path/to/vectors.bin")  # or download pre-trained

# Option B: New classifier type
w2v = Classifier::Word2Vec.new
w2v.load_embeddings("glove.6B.100d.txt")
w2v.add("gothic" => ["darkness", "midnight", "fear"])
w2v.classify("shadows and dread")  # Works even without exact matches

Document Similarity with Word Embeddings

Convert documents to vectors by averaging word embeddings (a sketch follows the steps):

  1. Tokenize document → words
  2. Look up each word's embedding vector
  3. Average all vectors → document vector
  4. Compare document vectors with cosine similarity
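
A minimal Ruby sketch of those four steps, assuming an embeddings Hash of word => Array of Floats (e.g. from the load_glove sketch above) and a deliberately naive downcase-and-split tokenizer:

def document_vector(text, embeddings, dims)
  words = text.downcase.scan(/[a-z']+/)           # step 1: tokenize
  found = words.filter_map { |w| embeddings[w] }  # step 2: look up, skipping out-of-vocabulary words
  return Array.new(dims, 0.0) if found.empty?
  sums = found.transpose.map(&:sum)
  sums.map { |s| s / found.size }                 # step 3: element-wise mean = document vector
end

def cosine(a, b)                                  # step 4: cosine similarity
  dot  = a.zip(b).sum { |x, y| x * y }
  mags = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  mags.zero? ? 0.0 : dot / mags
end

embeddings = load_glove("glove.6B.100d.txt")
doc_a = document_vector("Deep into that darkness peering, fearing", embeddings, 100)
doc_b = document_vector("shadows and dread", embeddings, 100)
cosine(doc_a, doc_b)  # non-zero even with zero vocabulary overlap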

Considerations

  • Optional dependency: word2vec-rb should be optional, not required
  • Memory usage: Word embeddings are large (~100-300 dimensions × vocabulary size)
  • Pre-trained models: Provide helper to download common models (GloVe, Word2Vec)
  • Fallback: Gracefully fall back to bag-of-words if embeddings are unavailable (see the sketch after this list)
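
The optional dependency and fallback could use the standard rescue-LoadError pattern. The require name below is an assumption about word2vec-rb's entry point, and embedding_similarity / bag_of_words_similarity are hypothetical placeholders for the two code paths:

begin
  require 'word2vec'  # assumed entry point for word2vec-rb, not a confirmed name
  EMBEDDINGS_AVAILABLE = true
rescue LoadError
  EMBEDDINGS_AVAILABLE = false
end

def similarity(doc_a, doc_b)
  if EMBEDDINGS_AVAILABLE
    embedding_similarity(doc_a, doc_b)      # cosine over averaged word vectors
  else
    bag_of_words_similarity(doc_a, doc_b)   # existing term-overlap path
  end
end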
