Webis: AI-Driven Knowledge Pipeline

Webis is a modular framework for building AI applications. It connects diverse data sources (web, SaaS, databases, etc.) to Large Language Models (LLMs) through a pipeline of collection, processing, and extraction.

🎯 Who is Webis for?

| 👥 User Type | 🚀 Use Case |
| --- | --- |
| Researchers | Literature reviews, data collection, research analysis |
| Data Scientists | Training data preparation, knowledge base building |
| Developers | Building AI applications, integrating RAG capabilities |
| Business Users | Market monitoring, competitive intelligence, knowledge management |
| Educators | Creating educational resources, research datasets |

✨ Key Features

For All Users

🎯 5-Minute Setup: Get started in minutes with our intuitive CLI and web interface
📊 Beautiful Visualizations: Interactive dashboard with charts and graphs
🤖 AI Assistant: Natural language interaction with your knowledge base
📄 Multi-format Support: PDFs, webpages, HTML, Markdown, CSV, JSON, DOCX

For Developers

🔌 Plugin Architecture: Everything is a plugin (sources, processors, extractors)
🛠️ SDK & API: Clean Python API for integration into your applications
🧪 Testing Suite: Comprehensive test coverage with pytest
📚 Rich Documentation: Detailed API docs and examples

For Power Users

🤖 Intelligent Crawler: LLM-powered source selection and query generation
⚡ RAG-Ready: Built-in cleaning, chunking, and embedding generation
🔍 Advanced Search: Vector search, keyword search, and hybrid retrieval
📈 Monitoring: Real-time pipeline tracking and performance metrics
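
The hybrid retrieval listed above combines a keyword ranking with a vector ranking. This README does not say how Webis merges the two, so the sketch below uses reciprocal rank fusion (RRF), a common merging technique, purely as an illustration. The function name `reciprocal_rank_fusion`, the constant `k=60`, and the document IDs are assumptions and not part of the Webis API.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1. Higher scores rank first.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score (descending), breaking ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a keyword search and a vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)
```

Documents that appear near the top of both lists (here `doc_b` and `doc_a`) end up ahead of documents found by only one search method.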

🚀 Quick Start

Installation

Option 1: One-Command Setup (Recommended)

# Automatic setup with conda
bash setup/conda_setup.sh

# Or with uv
bash setup/uv_setup.sh

Option 2: Manual Installation

# Clone the repository
git clone https://github.com/Narwhal-Lab/Webis.git
cd Webis

# Install the package
pip install -e .

Option 3: Docker

# Quick start with Docker
docker-compose up

# For production
docker-compose -f docker-compose.prod.yml up -d

First Run

1. Simple Web Data Collection

# Get the latest news about AI
webis run "Latest artificial intelligence news" --limit 5

2. Local Document Processing

# Extract information from a PDF
webis extract ./research.pdf --task "Extract key findings"

3. Launch Web Interface

# Open the visualizer
webis visualizer

📚 Getting Started Guides

| Guide | Description | Target Audience |
| --- | --- | --- |
| Quick Start Guide | 5 minutes to first result | All users |
| User Guide | Complete feature walkthrough | Regular users |
| API Reference | Full API documentation | Developers |
| Plugin Development | Create custom plugins | Advanced users |
| Deployment Guide | Production deployment | System admins |

🛠️ Usage Examples

Basic Web Scraping

# Search and collect data from multiple sources
webis run "Machine learning research papers" \
  --sources semantic_scholar,arxiv \
  --limit 10 \
  --output ml_papers

Building a Knowledge Base

# Create a RAG knowledge base
webis run "Recent developments in quantum computing" \
  --rag-mode \
  --chunk-size 1000 \
  --embed-model all-MiniLM-L6-v2
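
The `--chunk-size 1000` flag suggests that text is split into fixed-size pieces before embedding. This README does not state the unit (characters or tokens) or whether chunks overlap, so the sketch below assumes character-based chunks with a configurable overlap. `chunk_text` and its parameters are illustrative names, not Webis functions.

```python
def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in one of the two chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = "x" * 2500
pieces = chunk_text(sample, chunk_size=1000, overlap=100)
print([len(p) for p in pieces])
```

Larger overlaps reduce the chance of splitting a fact across two chunks, at the cost of storing and embedding more text.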

Custom Data Processing

# Process local files with custom schema
webis extract ./financial_reports.pdf \
  --schema ./schemas/financial_report.json \
  --output structured_data.json
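
This README does not show what a schema file such as `financial_report.json` contains. The sketch below assumes a JSON Schema style layout with `required` and `properties` keys, and shows how extracted output could be checked against it using only the standard library. The field names (`company`, `fiscal_year`, `revenue`) and the `validate` helper are hypothetical.

```python
import json

# Hypothetical schema; the real format of financial_report.json is not shown in this README.
SCHEMA = json.loads("""
{
  "type": "object",
  "required": ["company", "fiscal_year", "revenue"],
  "properties": {
    "company": {"type": "string"},
    "fiscal_year": {"type": "integer"},
    "revenue": {"type": "number"}
  }
}
""")

# Map JSON Schema type names to Python types for a shallow check.
TYPE_MAP = {"string": str, "integer": int, "number": (int, float)}


def validate(record, schema):
    """Return a list of problems; an empty list means the record passed."""
    problems = []
    for field in schema.get("required", []):
        if field not in record:
            problems.append(f"missing field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in record:
            expected = TYPE_MAP[spec["type"]]
            value = record[field]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, expected):
                problems.append(f"wrong type for {field}: expected {spec['type']}")
    return problems


good = {"company": "Example Corp", "fiscal_year": 2023, "revenue": 1250000.0}
bad = {"company": "Example Corp", "fiscal_year": "2023"}

print(validate(good, SCHEMA))
print(validate(bad, SCHEMA))
```

This check is deliberately shallow. For full JSON Schema validation, a dedicated library such as `jsonschema` is a better fit.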

Batch Processing

# Process multiple files
webis batch process ./documents/ \
  --task "Extract entities" \
  --output ./processed/

🏗️ Architecture

graph TB
  subgraph "Data Sources"
    A[Web Sources<br/>GNews, GitHub, Stack Overflow]
    B[Local Files<br/>PDF, HTML, DOCX]
    C[APIs<br/>RSS, Twitter, Slack]
  end

  subgraph "Pipeline Processing"
    D[Intelligent Selection<br/>LLM-based source choice]
    E[Content Processing<br/>Clean, Normalize]
    F[Extraction<br/>LLM-based structuring]
    G[RAG Preparation<br/>Chunk, Embed]
  end

  subgraph "Output & Storage"
    H[Structured Data<br/>JSON, CSV]
    I[Vector Store<br/>ChromaDB, FAISS]
    J[Knowledge Base<br/>RAG-ready]
  end

  K[User Interface] --> A
  K[User Interface] --> B
  A --> D
  B --> D
  C --> D
  D --> E
  E --> F
  F --> G
  G --> H
  G --> I
  G --> J
  H --> K
  I --> K
  J --> K
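
The flow in the diagram (selection, processing, extraction, RAG preparation) can be pictured as a chain of functions that each take and return a list of documents. The sketch below is a toy model under that assumption. Every name in it (`Document`, `select`, `process`, `extract`, `prepare_for_rag`, `run_pipeline`) is hypothetical, and the LLM-based steps are replaced by trivial stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    source: str
    text: str
    fields: dict = field(default_factory=dict)
    chunks: list = field(default_factory=list)


def select(docs, query):
    # Stand-in for LLM-based source selection: keep documents that mention a query word.
    words = query.lower().split()
    return [d for d in docs if any(w in d.text.lower() for w in words)]


def process(docs):
    # Clean and normalize: collapse runs of whitespace.
    for d in docs:
        d.text = " ".join(d.text.split())
    return docs


def extract(docs):
    # Stand-in for LLM-based structuring: record a simple word count.
    for d in docs:
        d.fields["word_count"] = len(d.text.split())
    return docs


def prepare_for_rag(docs, chunk_size=20):
    # Chunk only; embedding is omitted because it needs a model.
    for d in docs:
        d.chunks = [d.text[i:i + chunk_size] for i in range(0, len(d.text), chunk_size)]
    return docs


def run_pipeline(docs, query):
    return prepare_for_rag(extract(process(select(docs, query))))


docs = [
    Document("web", "Quantum   computing\n advances in 2024"),
    Document("local", "Unrelated cooking recipe"),
]
result = run_pipeline(docs, "quantum")
print(result[0].fields, result[0].chunks)
```

Keeping every stage to the same "list in, list out" shape is what makes stages easy to swap or reorder in a design like this.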

🔌 Plugin System

Webis is built around a powerful plugin architecture:

Data Source Plugins

semantic_scholar - Academic papers
github - GitHub repositories
gnews - Google News
reddit - Reddit discussions
hackernews - Hacker News

Processing Plugins

html_cleaner - HTML content cleaning
pdf_processor - PDF text extraction
chunking - Document chunking strategies
ocr - Image text extraction

Extraction Plugins

llm_extractor - LLM-based data extraction
pii_redactor - PII removal
sentiment_analysis - Sentiment scoring
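
This README lists plugin names but does not show the plugin interface. One common way to build such a system is a name-to-callable registry, sketched below. The `PLUGINS` dict, the `register` decorator, and `run_plugins` are hypothetical. The two toy plugins borrow the names `html_cleaner` and `pii_redactor` from the lists above, but their bodies are simplified illustrations, not the Webis implementations.

```python
import re

# Hypothetical registry mapping plugin names to callables.
PLUGINS = {}


def register(name):
    """Decorator that adds a text-processing function to the registry."""
    def decorator(func):
        PLUGINS[name] = func
        return func
    return decorator


@register("html_cleaner")
def strip_tags(text):
    # Naive tag removal for illustration; real HTML needs a proper parser.
    return re.sub(r"<[^>]+>", "", text)


@register("pii_redactor")
def redact_emails(text):
    # Redacts only email-like strings; real PII removal covers far more.
    return re.sub(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", "[REDACTED]", text)


def run_plugins(text, names):
    """Apply the named plugins in order."""
    for name in names:
        text = PLUGINS[name](text)
    return text


raw = "<p>Contact <b>jane.doe@example.com</b> for details.</p>"
print(run_plugins(raw, ["html_cleaner", "pii_redactor"]))
```

Because plugins are looked up by name, a pipeline configuration can be expressed as a plain list of strings.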

📖 Documentation

🧪 Examples

Explore our examples directory (`examples/`).

🤝 Community & Support

📝 Roadmap

📜 License

This project is licensed under the Apache 2.0 License - see the LICENSE file for details.

🌟 Star History

Made with ❤️ by the Webis Team
🌐 [website](https://webis.dev) | 📧 [contact](mailto:contact@webis.dev)

