Webis: AI-Driven Knowledge Pipeline

中文


Webis is a modular framework for building the data layer of AI applications. It connects diverse data sources (web, SaaS, databases, etc.) to Large Language Models (LLMs) through a pipeline of collection, processing, and extraction.

🎯 Who is Webis for?

| 👥 User Type | 🚀 Use Case |
|---|---|
| Researchers | Literature reviews, data collection, research analysis |
| Data Scientists | Training data preparation, knowledge base building |
| Developers | Building AI applications, integrating RAG capabilities |
| Business Users | Market monitoring, competitive intelligence, knowledge management |
| Educators | Creating educational resources, research datasets |

✨ Key Features

For All Users

  • 🎯 5-Minute Setup: Get started in minutes with our intuitive CLI and web interface
  • 📊 Beautiful Visualizations: Interactive dashboard with charts and graphs
  • 🤖 AI Assistant: Natural language interaction with your knowledge base
  • 📄 Multi-format Support: PDFs, webpages, HTML, Markdown, CSV, JSON, DOCX

For Developers

  • 🔌 Plugin Architecture: Everything is a plugin (sources, processors, extractors)
  • 🛠️ SDK & API: Clean Python API for integration into your applications
  • 🧪 Testing Suite: Comprehensive test coverage with pytest
  • 📚 Rich Documentation: Detailed API docs and examples

For Power Users

  • 🤖 Intelligent Crawler: LLM-powered source selection and query generation
  • ⚡ RAG-Ready: Built-in cleaning, chunking, and embedding generation
  • 🔍 Advanced Search: Vector search, keyword search, and hybrid retrieval
  • 📈 Monitoring: Real-time pipeline tracking and performance metrics
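
The README does not say how hybrid retrieval combines keyword and vector results. One common approach is reciprocal rank fusion (RRF), sketched below purely as an illustration. The function name, the document IDs, and the constant `k=60` are assumptions for this example and are not taken from the Webis codebase.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from a keyword index and a vector index.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while documents found by only one retriever still appear further down.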

🚀 Quick Start

Installation

Option 1: One-Command Setup (Recommended)

# Automatic setup with conda
bash setup/conda_setup.sh

# Or with uv
bash setup/uv_setup.sh

Option 2: Manual Installation

# Clone the repository
git clone https://github.com/Narwhal-Lab/Webis.git
cd Webis

# Install the package
pip install -e .

Option 3: Docker

# Quick start with Docker
docker-compose up

# For production
docker-compose -f docker-compose.prod.yml up -d

First Run

1. Simple Web Data Collection

# Get the latest news about AI
webis run "Latest artificial intelligence news" --limit 5

2. Local Document Processing

# Extract information from a PDF
webis extract ./research.pdf --task "Extract key findings"

3. Launch Web Interface

# Open the visualizer
webis visualizer

📚 Getting Started Guides

| Guide | Description | Target Audience |
|---|---|---|
| Quick Start Guide | 5 minutes to first result | All users |
| User Guide | Complete feature walkthrough | Regular users |
| API Reference | Full API documentation | Developers |
| Plugin Development | Create custom plugins | Advanced users |
| Deployment Guide | Production deployment | System admins |

🛠️ Usage Examples

Basic Web Scraping

# Search and collect data from multiple sources
webis run "Machine learning research papers" \
  --sources semantic_scholar,arxiv \
  --limit 10 \
  --output ml_papers

Building a Knowledge Base

# Create a RAG knowledge base
webis run "Recent developments in quantum computing" \
  --rag-mode \
  --chunk-size 1000 \
  --embed-model all-MiniLM-L6-v2
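
To make the `--chunk-size` option more concrete, here is a minimal sketch of fixed-size chunking with overlap. It assumes the size is measured in characters, which the README does not specify (it could equally be tokens). The `overlap` parameter and the function name are hypothetical, and this is not the chunking code Webis actually ships.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into windows of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that text near a
    boundary keeps some of its surrounding context.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


document = "x" * 2500
chunks = chunk_text(document, chunk_size=1000, overlap=100)
print([len(c) for c in chunks])
```

Each chunk would then be passed to an embedding model such as the `all-MiniLM-L6-v2` model named in the command above.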

Custom Data Processing

# Process local files with custom schema
webis extract ./financial_reports.pdf \
  --schema ./schemas/financial_report.json \
  --output structured_data.json

Batch Processing

# Process multiple files
webis batch process ./documents/ \
  --task "Extract entities" \
  --output ./processed/

🏗️ Architecture

graph TB
  subgraph "Data Sources"
    A[Web Sources<br/>GNews, GitHub, Stack Overflow]
    B[Local Files<br/>PDF, HTML, DOCX]
    C[APIs<br/>RSS, Twitter, Slack]
  end

  subgraph "Pipeline Processing"
    D[Intelligent Selection<br/>LLM-based source choice]
    E[Content Processing<br/>Clean, Normalize]
    F[Extraction<br/>LLM-based structuring]
    G[RAG Preparation<br/>Chunk, Embed]
  end

  subgraph "Output & Storage"
    H[Structured Data<br/>JSON, CSV]
    I[Vector Store<br/>ChromaDB, FAISS]
    J[Knowledge Base<br/>RAG-ready]
  end

  K[User Interface] --> A
  K[User Interface] --> B
  A --> D
  B --> D
  C --> D
  D --> E
  E --> F
  F --> G
  G --> H
  G --> I
  G --> J
  H --> K
  I --> K
  J --> K
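
The flow in the diagram (sources, then processing, then extraction) can be pictured as a chain of functions applied to each document in turn. The sketch below is only an illustration of that idea. The `Document` class, the stage functions, and `run_pipeline` are hypothetical names, and the word count stands in for the LLM-based extraction step.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Document:
    """Hypothetical record passed between pipeline stages."""
    source: str
    text: str
    metadata: dict = field(default_factory=dict)


def clean(doc):
    # Content processing: strip HTML tags and collapse whitespace.
    no_tags = re.sub(r"<[^>]+>", " ", doc.text)
    return Document(doc.source, " ".join(no_tags.split()), dict(doc.metadata))


def extract(doc):
    # Extraction: a trivial stand-in for LLM-based structuring.
    meta = dict(doc.metadata, word_count=len(doc.text.split()))
    return Document(doc.source, doc.text, meta)


def run_pipeline(docs, stages):
    """Apply each stage, in order, to every document."""
    for stage in stages:
        docs = [stage(d) for d in docs]
    return docs


raw = [Document("web", "<p>Hello   <b>pipeline</b> world</p>")]
result = run_pipeline(raw, [clean, extract])
print(result[0].text, result[0].metadata)
```

Keeping every stage as a function from `Document` to `Document` means stages can be reordered or swapped without touching the runner.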

🔌 Plugin System

Webis is built around a powerful plugin architecture:

Data Source Plugins

  • semantic_scholar - Academic papers
  • github - GitHub repositories
  • gnews - Google News
  • reddit - Reddit discussions
  • hackernews - Hacker News

Processing Plugins

  • html_cleaner - HTML content cleaning
  • pdf_processor - PDF text extraction
  • chunking - Document chunking strategies
  • ocr - Image text extraction

Extraction Plugins

  • llm_extractor - LLM-based data extraction
  • pii_redactor - PII removal
  • sentiment_analysis - Sentiment scoring
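
One way to picture a plugin architecture like this is a registry that maps a kind and a name to a class. The sketch below is an assumption about the general pattern, not the Webis plugin API. `PLUGINS`, `register`, and `create` are hypothetical, and the `HtmlCleaner` class borrows only its name from the list above.

```python
PLUGINS = {"source": {}, "processor": {}, "extractor": {}}


def register(kind, name):
    """Class decorator that files a plugin under its kind and name."""
    if kind not in PLUGINS:
        raise ValueError(f"unknown plugin kind: {kind}")

    def decorator(cls):
        PLUGINS[kind][name] = cls
        return cls

    return decorator


@register("processor", "html_cleaner")
class HtmlCleaner:
    # Toy implementation; the real plugin's behavior is not shown in the README.
    def process(self, text):
        return text.replace("<br/>", "\n")


def create(kind, name):
    """Look up a plugin class by kind and name and instantiate it."""
    try:
        return PLUGINS[kind][name]()
    except KeyError:
        raise LookupError(f"no {kind} plugin named {name!r}") from None


cleaner = create("processor", "html_cleaner")
print(cleaner.process("line one<br/>line two"))
```

With this pattern, adding a new source, processor, or extractor only requires defining a class and decorating it; the code that looks plugins up by name stays unchanged.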

📖 Documentation

🧪 Examples

Explore our examples directory:

🤝 Community & Support

📍 Roadmap

  • v2.0.0 - Stable Release
    • Enterprise features
    • Advanced caching
    • Multi-tenant support
    • Enhanced security
  • v2.1.0 - Enhanced AI
    • Multi-modal support
    • Agent capabilities
    • Auto-scaling
  • v2.2.0 - Ecosystem
    • Marketplace for plugins
    • Integration SDKs
    • Monitoring dashboard

📜 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.


Made with ❤️ by the Webis Team
🌐 [website](https://webis.dev) | 📧 [contact](mailto:contact@webis.dev)

About

Webis is a modular, plugin-based AI data pipeline that supports data acquisition, cleaning, structured extraction, and RAG generation, solving the upstream supply problem for AI applications.
