Optimize Document Embedding with Multi-threading #22

@longdafeng

Description

The DocumentEmbedder.embed_from_directory() method in src/rag/rag.py processes documents sequentially, so embedding a large document collection is slow. We should add multi-threading to parallelize file processing, embedding generation, and database insertion, which should reduce the total embedding time for large collections.

Proposed Solution

Add multi-threading support to the DocumentEmbedder class to process multiple files concurrently. This would involve:

  1. Using a thread pool to process files in parallel
  2. Batching embeddings and database insertions efficiently
  3. Maintaining thread safety for database operations
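A rough sketch of how the three points above could fit together, not the actual DocumentEmbedder code. The function name embed_directory_parallel and the embed_fn / insert_fn callables are hypothetical stand-ins for whatever the class uses internally to produce an embedding and to write records to the database; max_workers and batch_size are assumed tuning parameters.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path


def embed_directory_parallel(directory, embed_fn, insert_fn,
                             max_workers=4, batch_size=32):
    """Embed every file under `directory` using a thread pool.

    embed_fn(text) -> vector and insert_fn(list_of_records) are
    hypothetical callables standing in for the real embedding model
    and database client.
    """
    files = sorted(p for p in Path(directory).rglob("*") if p.is_file())

    def process(path):
        # Runs in a worker thread: read the file and compute its embedding.
        text = path.read_text(encoding="utf-8", errors="ignore")
        return {"path": str(path), "embedding": embed_fn(text)}

    batch = []      # records waiting to be written
    inserted = 0    # number of records handed to insert_fn
    failed = []     # (path, exception) pairs for files that raised

    def flush():
        nonlocal inserted
        if batch:
            # Pass a copy so clearing the buffer does not affect the caller.
            insert_fn(list(batch))
            inserted += len(batch)
            batch.clear()

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process, p): p for p in files}
        for future in as_completed(futures):
            try:
                batch.append(future.result())
            except Exception as exc:
                failed.append((futures[future], exc))
                continue
            # Inserts happen only here, in the calling thread.
            if len(batch) >= batch_size:
                flush()

    flush()  # write whatever is left after the last full batch
    return inserted, failed
```

In this sketch the workers only read files and compute embeddings, while all database writes go through the calling thread in batches, so the database client itself does not need to be thread-safe. If insertion should also run in worker threads, access to the client would need a lock or a per-thread connection instead.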

Related Code

  • src/rag/rag.py - DocumentEmbedder class (lines 328-464)
