
feat: Add Word2Vec/word embeddings support for better semantic similarity #106

@cardmagic

Problem

LSI classification fails when test documents have no vocabulary overlap with training data. This is a fundamental limitation of bag-of-words approaches.

Example: Classifying poetry by genre fails because:

  • "Deep into that darkness peering, fearing" → stems to: dark, deep, fear, peer
  • "Once upon a midnight dreary, weary" → stems to: dreari, midnight, weari
  • Zero overlap → similarity score = 0.0 → random classification

Proposed Solution

Add optional Word2Vec/word embeddings support that can find semantic similarity even without exact word matches (a small similarity sketch follows the examples below):

  • "fear" and "dread" → similar vectors → high similarity
  • "darkness" and "midnight" → related concepts → some similarity

Implementation Options

Option 1: word2vec-rb gem (recommended)

  • Ruby gem with C extensions wrapping Google's word2vec
  • Supports loading pre-trained models AND training custom ones
  • Last updated May 2022, MIT licensed
  • https://github.com/madcato/word2vec-rb

Option 2: Pre-trained embeddings only

  • Download GloVe or Word2Vec pre-trained vectors
  • Load on demand, cache in memory (a loading sketch follows this list)
  • Simpler but requires large file downloads (~1GB for GloVe)
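
GloVe's plain-text format is simple enough to load without a native extension: each line holds a word followed by its vector components. A minimal sketch, assuming one of the published glove.*.txt files:

def load_glove(path)
  embeddings = {}
  File.foreach(path) do |line|
    parts = line.split
    word  = parts.shift
    embeddings[word] = parts.map(&:to_f)
  end
  embeddings
end

vectors = load_glove("glove.6B.100d.txt")
vectors["fear"]  # => 100-element Array of Floats, or nil if out of vocabulary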

Proposed API

# Option A: New LSI backend
lsi = Classifier::LSI.new(similarity: :word2vec)
lsi.load_embeddings("path/to/vectors.bin")  # or download pre-trained

# Option B: New classifier type
w2v = Classifier::Word2Vec.new
w2v.load_embeddings("glove.6B.100d.txt")
w2v.add("gothic" => ["darkness", "midnight", "fear"])
w2v.classify("shadows and dread")  # Works even without exact matches

Document Similarity with Word Embeddings

Convert documents to vectors by averaging word embeddings (a sketch follows the steps):

  1. Tokenize document → words
  2. Look up each word's embedding vector
  3. Average all vectors → document vector
  4. Compare document vectors with cosine similarity
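
A minimal Ruby sketch of those four steps, assuming an embeddings Hash of word => Array of Floats (e.g. from the load_glove sketch above) and a deliberately naive downcase-and-split tokenizer:

def document_vector(text, embeddings, dims)
  words = text.downcase.scan(/[a-z']+/)           # step 1: tokenize
  found = words.filter_map { |w| embeddings[w] }  # step 2: look up, skipping out-of-vocabulary words
  return Array.new(dims, 0.0) if found.empty?
  sums = found.transpose.map(&:sum)
  sums.map { |s| s / found.size }                 # step 3: element-wise mean = document vector
end

def cosine(a, b)                                  # step 4: cosine similarity
  dot  = a.zip(b).sum { |x, y| x * y }
  mags = Math.sqrt(a.sum { |x| x * x }) * Math.sqrt(b.sum { |x| x * x })
  mags.zero? ? 0.0 : dot / mags
end

embeddings = load_glove("glove.6B.100d.txt")
doc_a = document_vector("Deep into that darkness peering, fearing", embeddings, 100)
doc_b = document_vector("shadows and dread", embeddings, 100)
cosine(doc_a, doc_b)  # non-zero even with zero vocabulary overlap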

Considerations

  • Optional dependency: word2vec-rb should be optional, not required
  • Memory usage: Word embeddings are large (~100-300 dimensions × vocabulary size)
  • Pre-trained models: Provide helper to download common models (GloVe, Word2Vec)
  • Fallback: Gracefully fall back to bag-of-words if embeddings are unavailable (see the sketch after this list)
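
The optional dependency and fallback could use the standard rescue-LoadError pattern. The require name below is an assumption about word2vec-rb's entry point, and embedding_similarity / bag_of_words_similarity are hypothetical placeholders for the two code paths:

begin
  require 'word2vec'  # assumed entry point for word2vec-rb, not a confirmed name
  EMBEDDINGS_AVAILABLE = true
rescue LoadError
  EMBEDDINGS_AVAILABLE = false
end

def similarity(doc_a, doc_b)
  if EMBEDDINGS_AVAILABLE
    embedding_similarity(doc_a, doc_b)      # cosine over averaged word vectors
  else
    bag_of_words_similarity(doc_a, doc_b)   # existing term-overlap path
  end
end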
