Webis is a modular framework for building AI applications. It connects diverse data sources (web, SaaS, databases, and others) to Large Language Models (LLMs) through a pipeline of collection, processing, and extraction.
| User Type | Use Case |
|---|---|
| Researchers | Literature reviews, data collection, research analysis |
| Data Scientists | Training data preparation, knowledge base building |
| Developers | Building AI applications, integrating RAG capabilities |
| Business Users | Market monitoring, competitive intelligence, knowledge management |
| Educators | Creating educational resources, research datasets |
- 5-Minute Setup: Get started in minutes with our intuitive CLI and web interface
- Beautiful Visualizations: Interactive dashboard with charts and graphs
- AI Assistant: Natural language interaction with your knowledge base
- Multi-format Support: PDFs, webpages, HTML, Markdown, CSV, JSON, DOCX
- Plugin Architecture: Everything is a plugin (sources, processors, extractors)
- SDK & API: Clean Python API for integration into your applications
- Testing Suite: Comprehensive test coverage with pytest
- Rich Documentation: Detailed API docs and examples
- Intelligent Crawler: LLM-powered source selection and query generation
- RAG-Ready: Built-in cleaning, chunking, and embedding generation
- Advanced Search: Vector search, keyword search, and hybrid retrieval
- Monitoring: Real-time pipeline tracking and performance metrics
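Hybrid retrieval, as named in the feature list, generally means merging a keyword ranking with a vector-similarity ranking. The sketch below uses reciprocal rank fusion (RRF), one common merging technique. Whether Webis uses RRF internally is not stated in this README, and the function name `reciprocal_rank_fusion` and the document IDs are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is the constant commonly used for RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from two retrievers for the same query.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that rank well in both lists (`doc_b`, `doc_a`) rise above documents that appear in only one, which is the behavior hybrid retrieval is meant to provide.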
Option 1: One-Command Setup (Recommended)
# Automatic setup with conda
bash setup/conda_setup.sh
# Or with uv
bash setup/uv_setup.sh

Option 2: Manual Installation
# Clone the repository
git clone https://github.com/Narwhal-Lab/Webis.git
cd Webis
# Install the package
pip install -e .

Option 3: Docker
# Quick start with Docker
docker-compose up
# For production
docker-compose -f docker-compose.prod.yml up -d

# Get the latest news about AI
webis run "Latest artificial intelligence news" --limit 5

# Extract information from a PDF
webis extract ./research.pdf --task "Extract key findings"

# Open the visualizer
webis visualizer

| Guide | Description | Target Audience |
|---|---|---|
| Quick Start Guide | 5 minutes to first result | All users |
| User Guide | Complete feature walkthrough | Regular users |
| API Reference | Full API documentation | Developers |
| Plugin Development | Create custom plugins | Advanced users |
| Deployment Guide | Production deployment | System admins |
# Search and collect data from multiple sources
webis run "Machine learning research papers" \
--sources semantic_scholar,arxiv \
--limit 10 \
--output ml_papers

# Create a RAG knowledge base
webis run "Recent developments in quantum computing" \
--rag-mode \
--chunk-size 1000 \
--embed-model all-MiniLM-L6-v2

# Process local files with custom schema
webis extract ./financial_reports.pdf \
--schema ./schemas/financial_report.json \
--output structured_data.json

# Process multiple files
webis batch process ./documents/ \
--task "Extract entities" \
--output ./processed/

graph TB
subgraph "Data Sources"
A[Web Sources<br/>GNews, GitHub, Stack Overflow]
B[Local Files<br/>PDF, HTML, DOCX]
C[APIs<br/>RSS, Twitter, Slack]
end
subgraph "Pipeline Processing"
D[Intelligent Selection<br/>LLM-based source choice]
E[Content Processing<br/>Clean, Normalize]
F[Extraction<br/>LLM-based structuring]
G[RAG Preparation<br/>Chunk, Embed]
end
subgraph "Output & Storage"
H[Structured Data<br/>JSON, CSV]
I[Vector Store<br/>ChromaDB, FAISS]
J[Knowledge Base<br/>RAG-ready]
end
K[User Interface] --> A
K[User Interface] --> B
A --> D
B --> D
C --> D
D --> E
E --> F
F --> G
G --> H
G --> I
G --> J
H --> K
I --> K
J --> K
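
The diagram's processing stages can be read as a chain of functions, each consuming the previous stage's output. The following is a minimal sketch of the cleaning and chunking steps only, using the Python standard library; the LLM-based selection, extraction, and embedding stages need a model and are omitted. The function names (`clean_html`, `chunk_text`, `run_pipeline`), the `overlap` parameter, and the sample HTML are assumptions for illustration and are not taken from the Webis codebase.

```python
import html
import re


def clean_html(raw):
    """Strip tags, decode entities, and collapse whitespace (naive, regex-based)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    decoded = html.unescape(no_tags)
    return re.sub(r"\s+", " ", decoded).strip()


def chunk_text(text, chunk_size=1000, overlap=100):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, to avoid a redundant tail chunk.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


def run_pipeline(raw_documents, chunk_size=1000, overlap=100):
    """Clean each raw document, then chunk it; returns (doc_index, chunk) pairs."""
    records = []
    for index, raw in enumerate(raw_documents):
        cleaned = clean_html(raw)
        for chunk in chunk_text(cleaned, chunk_size, overlap):
            records.append((index, chunk))
    return records


# Hypothetical input document.
sample = ["<p>Webis &amp; RAG</p>  <p>collect, process, extract</p>"]
records = run_pipeline(sample, chunk_size=20, overlap=5)
for doc_index, chunk in records:
    print(doc_index, repr(chunk))
```

Overlapping chunks keep a sentence that straddles a boundary visible in both neighbors, at the cost of some duplicated text in the vector store.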
Webis is built around a powerful plugin architecture:
Sources:
- semantic_scholar - Academic papers
- github - GitHub repositories
- gnews - Google News
- reddit - Reddit discussions
- hackernews - Hacker News

Processors:
- html_cleaner - HTML content cleaning
- pdf_processor - PDF text extraction
- chunking - Document chunking strategies
- ocr - Image text extraction

Extractors:
- llm_extractor - LLM-based data extraction
- pii_redactor - PII removal
- sentiment_analysis - Sentiment scoring
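
A plugin architecture of this kind is often implemented as a registry that maps a plugin name to a class, grouped by category. The sketch below illustrates that pattern in plain Python. The `register` decorator, the `PLUGINS` dictionary, the `run` method, and the class bodies are hypothetical and do not reflect Webis's actual plugin interface; only the plugin names `html_cleaner` and `pii_redactor` come from the lists above.

```python
import re

# Registry: category -> plugin name -> plugin class (hypothetical layout).
PLUGINS = {"source": {}, "processor": {}, "extractor": {}}


def register(category, name):
    """Class decorator that records a plugin under the given category and name."""
    if category not in PLUGINS:
        raise ValueError(f"unknown plugin category: {category}")

    def decorator(cls):
        PLUGINS[category][name] = cls
        return cls

    return decorator


@register("processor", "html_cleaner")
class HtmlCleaner:
    """Toy stand-in: removes anything that looks like an HTML tag."""

    def run(self, text):
        return re.sub(r"<[^>]+>", "", text)


@register("extractor", "pii_redactor")
class PiiRedactor:
    """Toy stand-in: masks email addresses only."""

    def run(self, text):
        return re.sub(r"[\w.+-]+@[\w-]+\.[\w.-]+", "[REDACTED]", text)


def run_chain(text, steps):
    """Apply plugins in order; each step is a (category, name) pair."""
    for category, name in steps:
        text = PLUGINS[category][name]().run(text)
    return text


result = run_chain(
    "<b>Contact:</b> jane@example.com",
    [("processor", "html_cleaner"), ("extractor", "pii_redactor")],
)
print(result)  # Contact: [REDACTED]
```

Looking plugins up by name at run time is what lets a command-line flag or configuration file select sources and processors without code changes.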
- User Documentation
- Plugin Development Guide
- API Reference
- Deployment Guide
- Contributing Guide
Explore our examples directory.

- Documentation
- Bug Reports
- Discussions
- Discord Chat
- Email Support
- v2.0.0 - Stable Release
  - Enterprise features
  - Advanced caching
  - Multi-tenant support
  - Enhanced security
- v2.1.0 - Enhanced AI
  - Multi-modal support
  - Agent capabilities
  - Auto-scaling
- v2.2.0 - Ecosystem
  - Marketplace for plugins
  - Integration SDKs
  - Monitoring dashboard
This project is licensed under the Apache 2.0 License - see the LICENSE file for details.
[website](https://webis.dev) | [contact](mailto:contact@webis.dev)