diff --git a/README.md b/README.md new file mode 100644 index 0000000..fb556d6 --- /dev/null +++ b/README.md @@ -0,0 +1,288 @@ +# md-embed (c) 2024 web3dguy + +A Python script for processing Markdown files, generating embeddings, and storing them in a vector store. This tool allows you to clean, split, and embed Markdown documents using various methods and embedding models. +Features + + Data Cleaning: Removes duplicates and filters out unwanted content like '404' pages and lines containing the '©' symbol. + Flexible Input: Supports input from JSON files containing URLs and Markdown data, folders of Markdown files, or single Markdown files. + Document Splitting: Splits documents using Markdown headers or recursive character splitting. + Embedding Options: Supports embedding using HuggingFace or Ollama embeddings. + Vector Store Integration: Stores embeddings in a Chroma vector store for efficient retrieval and analysis. + Customizable Filters: Option to disable filters that remove specific content. + Logging: Generates logs for duplicates and removed files for better traceability. + + + +Installation +Prerequisites +```bash + Python 3.7 or higher + pip + Git (optional, for cloning the repository) +``` +Clone the Repository + +```bash + +git clone https://github.com/GATERAGE/mdmbed.git +cd mdmbed +``` +Install Required Packages + +Install the required Python packages using pip: + +```bash + +pip install -r requirements.txt +``` +Note: The requirements.txt file should list all the dependencies, such as tqdm, langchain, chromadb, huggingface, etc. +Usage + +Run the script using Python: + +```bash + +python md-embed.py [--filters-off] +``` +# Command-Line Arguments + + --filters-off: Disable filters that remove lines containing '©' and skip files containing both '404' and 'page not found'. 
+ +Upon running the script, you will be prompted to choose an input method: + + JSON Input File Containing URLs and Markdown Data + Folder of Markdown Files + Single Markdown File + +JSON Input File + +If you choose Option 1, you will be asked to provide: + + Path of the JSON input file: The file should be a JSON array of objects, each containing url and markdown keys. + Path of the output folder: The folder where cleaned Markdown files and logs will be saved. + +The script will: + + Clean the data by removing duplicates. + Save the cleaned Markdown files to the specified output folder. + Generate a file_to_url.json mapping file. + Display a summary of the processing. + +Folder of Markdown Files + +If you choose Option 2, you will be asked to provide: + + Path of the folder containing Markdown files. + +The script will: + + Load all .md files from the specified folder. + Optionally filter out unwanted content. + Proceed to document splitting. + +Single Markdown File + +If you choose Option 3, you will be asked to provide: + + Path of the Markdown file. + +The script will: + + Load the specified Markdown file. + Optionally filter out unwanted content. + Proceed to document splitting. + +Document Splitting + +After loading the documents, you will be prompted to split them: + + Split Method: Choose between markdown or recursive splitting. + Remove Links: Optionally remove links from the Markdown content. + Language: Specify the programming language or language of the content. + Additional Settings: + For Markdown Splitting: + Header Levels: Specify which header levels (#, ##, etc.) to split on. + For Recursive Splitting: + Chunk Size: Specify the maximum size of each chunk. + Chunk Overlap: Specify the number of overlapping characters between chunks. + +You will have the option to preview the split data before proceeding. 
+Embedding and Saving + +After splitting, you will be prompted to embed and save the documents: + + Embedding Method: Choose between huggingface or ollama. + HuggingFace: Enter the embedding model name (default: all-MiniLM-L6-v2). + Ollama: Enter the Ollama model name (default: nomic-embed-text). + Persist Directory: Specify the directory to save the vector store database. + Collection Name: Enter a name for the Chroma collection. + +The script will: + + Embed the documents using the chosen embedding method. + Save the embeddings to a Chroma vector store. + Display information about the saved collections. + +Examples +Example 1: Process JSON Input File + +```bash + +python md-embed.py +``` +Choose Input Method: 1 + + Enter the path of the JSON input file: ./data/input.json + Enter the path of the output folder: ./output + +Proceed through the prompts to clean data, split documents, and embed them. +Example 2: Process Folder of Markdown Files with Filters Off + +```bash + +python md-embed.py --filters-off +``` +Choose Input Method: 2 + + Enter the path of the folder containing markdown files: ./markdown_files + +Proceed through the prompts to load, split, and embed the documents. +Contributing + +Contributions are welcome! Please follow these steps: + + Fork the repository. + + Create a new branch: + +```bash + +git checkout -b feature/your-feature-name +``` +Make your changes and commit them: + +```bash +git commit -m "Add your message" +``` +Push to the branch: + +```bash +git push origin feature/your-feature-name +``` + Open a Pull Request. + +Please make sure your code adheres to the existing style and that all tests pass. +License + +This project is licensed under the MIT License. +Acknowledgments + web3dguy + LangChain for text splitting and document handling. + HuggingFace for embedding models. + Chroma for the vector store. + TQDM for progress bars. + The open-source community for continuous support and contributions. 
+ +# Markdown Processor and Embedder + +md-embed processes markdown files, cleans and prepares the data, splits the text into manageable chunks, and creates embeddings for use in vector databases (specifically ChromaDB). It supports multiple input methods and provides options for customizing the splitting and embedding process.
+ +## Features + +* **Multiple Input Methods:**
+ * JSON file containing URLs and markdown data
+ * Folder of markdown files
+ * Single markdown file
+* **Data Cleaning:**
+ * Removes duplicate entries based on URL section titles
+ * Handles encoding issues
+ * Sanitizes filenames for safe saving
+ * Optionally filters out files containing "404" and "page not found" (can be disabled)
+ * Removes lines containing the copyright symbol "©"
+* **Text Splitting:**
+ * **Markdown Header Splitting:** Splits text based on specified markdown header levels (e.g., `#`, `##`). Allows for custom header level selection. Preserves header hierarchy in metadata
+ * **Recursive Character Text Splitting:** Splits text into chunks of specified size and overlap
+ * **Link Removal:** Optionally removes markdown links, keeping only the link text
+* **Embedding Generation:**
+ * Supports **Hugging Face** embeddings (using `langchain_huggingface`). Defaults to `all-MiniLM-L6-v2`
+    * Supports **Ollama** embeddings (using `langchain_community`). Defaults to `nomic-embed-text`; requires a local Ollama server running at `http://localhost:11434`
+* **Vector Database Integration:**
+ * Uses **ChromaDB** (`langchain_chroma`) to store embeddings and associated metadata
+ * Allows specifying the collection name and persistence directory
+ * Handles large datasets by processing in batches
+* **Logging:**
+ * Comprehensive logging through the `logging` module
+* **Duplicate Logs:**
+ * Writes URLs with duplicate sections to a log
+* **Removed Files Logs:**
+    * Writes the names of files removed by the filters to a log
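The link-removal feature listed above can be illustrated with a short sketch. This is not the script's actual code: the `remove_links` name and the regular expression are assumptions made for illustration, and the pattern covers only inline links of the form `[text](url)`, leaving image links and reference-style links untouched.

```python
import re

# Assumed pattern (not taken from md-embed.py): matches inline links such as
# [text](https://example.com). The negative lookbehind skips image links,
# which start with "![".
INLINE_LINK = re.compile(r"(?<!!)\[([^\]]+)\]\([^)]*\)")


def remove_links(markdown: str) -> str:
    """Replace each inline markdown link with its visible link text."""
    return INLINE_LINK.sub(r"\1", markdown)


if __name__ == "__main__":
    sample = "See the [project page](https://example.com) for details."
    print(remove_links(sample))
```

Stripping the URL while keeping the link text preserves the sentence for embedding and keeps long URLs from using up chunk space.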
+ +## Requirements + +* Python 3.7+
+* `langchain` (various components - see import statements)
+* `chromadb`
+* `tqdm`
+* `beautifulsoup4` (only needed if you scrape the source pages yourself; this script does not use it)
+* `requests` (only needed if you scrape the source pages yourself; this script does not use it)
+ +To install the required packages, run:
+
+```bash
+pip install langchain langchain-chroma langchain-huggingface tqdm
+```
+
+If you are planning to use Ollama, you need to:
+
+* Install Ollama by following the instructions provided at Ollama's official website
+* Run an Ollama server locally on port 11434
+
+md-embed can be run from the command line. It provides a command-line interface using argparse with the following option:
+
+* `--filters-off`: Disables the "404" and "©" filters
+
+The script will then guide you through a series of interactive prompts to configure the processing:
+
+* **Input Method Selection:** Choose between JSON input, a folder of markdown files, or a single markdown file
+* **Input File/Folder/URL:** Provide the path to the input file or folder, as appropriate
+* **Output Folder (for JSON input):** Specify the directory where cleaned markdown files will be saved
+* **Data Cleaning Options:** The script will show the total entries and total duplicates
+* **Language:** Specify the primary language of the input files (e.g., "TypeScript", "Python")
+* **Splitting Method:** Choose between "markdown" (header-based splitting) and "recursive" (chunk size and overlap)
+* **Markdown Splitting Options (if applicable):**
+    * **Remove Links:** Choose whether to remove markdown links
+    * **Header Levels:** Specify which header levels to split on (e.g., "1,2,3" for #, ##, and ###). Enter "all" for all header levels
+* **Recursive Splitting Options (if applicable):**
+    * **Remove Links:** Choose whether to remove markdown links
+    * **Chunk Size:** Specify the desired chunk size (in characters)
+    * **Chunk Overlap:** Specify the desired chunk overlap (in characters)
+* **Preview Splits:** Choose whether to preview the split data ("yes", "full", or "no")
+* **Split Again:** You will be prompted to continue or modify the settings
+* **Embedding Method:** Choose between "huggingface" and "ollama"
+* **Embedding Model (Hugging Face):** Enter the Hugging Face model name (defaults to `all-MiniLM-L6-v2`)
+* **Embedding Model (Ollama):** Enter the Ollama model name (defaults to `nomic-embed-text`)
+* **Persistence Directory:** Specify the directory where the ChromaDB database will be stored
+* **Collection Name:** Choose a name for the ChromaDB collection
+
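The header-levels prompt accepts either a comma-separated list such as "1,2,3" or the word "all". Below is a minimal sketch of how such an answer might be converted into the `(marker, label)` pairs that LangChain's `MarkdownHeaderTextSplitter` takes as `headers_to_split_on`. The `parse_header_levels` helper and the `Header N` labels are hypothetical, and the script's own parsing may differ.

```python
from typing import List, Tuple


def parse_header_levels(answer: str) -> List[Tuple[str, str]]:
    """Turn "1,2,3" or "all" into (marker, label) pairs.

    Hypothetical helper for illustration; not taken from md-embed.py.
    """
    text = answer.strip().lower()
    if text == "all":
        levels = list(range(1, 7))  # markdown defines header levels 1 to 6
    else:
        # Deduplicate and sort so "2,1,2" behaves like "1,2".
        levels = sorted({int(part) for part in text.split(",") if part.strip()})
    for level in levels:
        if not 1 <= level <= 6:
            raise ValueError(f"header level out of range: {level}")
    return [("#" * level, f"Header {level}") for level in levels]


if __name__ == "__main__":
    print(parse_header_levels("1,2,3"))
```

The resulting list could then be passed as `MarkdownHeaderTextSplitter(headers_to_split_on=...)`, which records the matching header text under each label in the chunk metadata.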
+Example (JSON Input):
+```bash +python md-embed.py +``` +Follow the prompts, providing the necessary information (input file, output folder, embedding choices, etc.)
+Example (Disabling Filters):
+```bash
+python md-embed.py --filters-off
+```
+
+The script produces the following outputs:
+
+* **Cleaned Markdown Files (JSON Input):** If using JSON input, the script will save cleaned markdown files to the specified output folder
+* **ChromaDB Database:** The script will create a ChromaDB database in the specified persistence directory, containing the embeddings and metadata
+* **Logs:** The logs directory will contain logs of removed files (if any) and duplicate entries (if using JSON input)
+* **`file_to_url.json`:** JSON file that contains the original URL of each document
+
+## Error Handling
+
+The script includes error handling for various scenarios, such as:
+
+* Invalid input file/folder paths
+* File I/O errors
+* Exceptions during data cleaning, splitting, or embedding
+* Invalid user input for prompts
+
+Errors are logged using the `logging` module.
+
+## Notes
+
+* The script assumes that the input JSON data has "url" and "markdown" keys for each entry
+* The script uses `uuid4` to generate unique IDs for each document in the vector database
+* The script processes in batches to deal with a large number of splits
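The last two notes, `uuid4` IDs and batch processing, can be sketched together. This is an illustration under assumptions rather than the script's implementation: the `batched_with_ids` helper is hypothetical, and the batch size is left to the caller because the source does not state one.

```python
from typing import Iterator, List, Sequence, Tuple
from uuid import uuid4


def batched_with_ids(
    items: Sequence[str], batch_size: int
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (ids, batch) pairs, assigning a fresh uuid4 string to every item.

    Hypothetical helper; md-embed.py may organise its batching differently.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        batch = list(items[start:start + batch_size])
        ids = [str(uuid4()) for _ in batch]
        yield ids, batch


if __name__ == "__main__":
    chunks = ["chunk a", "chunk b", "chunk c", "chunk d", "chunk e"]
    for ids, batch in batched_with_ids(chunks, batch_size=2):
        # In the real pipeline each batch would be handed to the vector
        # store here, for example through an add_documents-style call.
        print(len(ids), batch)
```

Sending the splits in fixed-size batches keeps each write to the vector store bounded, and generating the IDs per item means a retried batch can be identified by the IDs it was given.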
+ + +Disclaimer: This tool is provided "as is" without warranty of any kind. Use it at your own risk. Open source or go away. diff --git a/md-embed.py b/md-embed.py index 7bc0114..50ca57b 100644 --- a/md-embed.py +++ b/md-embed.py @@ -1,3 +1,5 @@ +# mdmbed (c) 2005 w3d + import json import re import os