
🧠 InsightEngine AI

Autonomous Research Agent – From Question to Executive Report in One Click

Python LangGraph DeepSeek Streamlit DuckDuckGo BeautifulSoup License


"It doesn't just search – it plans, scrapes, cross-references, and writes the report for you."

InsightEngine AI is an autonomous research assistant powered by a LangGraph state machine and DeepSeek-V3. Give it any complex topic – it will plan a multi-angle research strategy, search the open web, scrape full page content from the top sources, cross-reference findings through an AI analyst, and produce a downloadable, citation-rich executive report. No human intervention between question and output.

✨ Features · 🏗️ Architecture · 🚀 Setup · 📄 Sample Output


📌 The Research Problem

Genuine research – the kind that produces insights instead of summaries – requires a pipeline that most tools don't offer:

  • Search engines return snippets, not understanding – you still have to visit 10+ links and synthesize manually
  • LLMs hallucinate facts when they generate from parametric memory alone – no grounding in real sources
  • Copy-pasting from multiple tabs into a coherent report takes hours of manual work
  • No planning phase – tools jump straight to retrieval without thinking about what angles to cover
  • No citation trail – you can't verify where the information came from

InsightEngine AI solves this end-to-end with a 5-node agentic pipeline: Plan → Search → Scrape → Analyze → Report – each stage executed by a specialized LangGraph node with distinct responsibilities.
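As a rough mental model of that pipeline, the five stages can be sketched as a single state dict threaded through five node functions. This is a library-free sketch with hypothetical stub bodies – the real nodes are LLM-backed LangGraph functions:

```python
# Library-free sketch of the five-stage flow: one shared state dict is
# passed through each node in order, and each node returns only the
# fields it contributes. The stub bodies below are placeholders; the
# real implementations call an LLM, a search tool, and a scraper.
from typing import Callable, Dict, List

def run_pipeline(state: Dict, nodes: List[Callable[[Dict], Dict]]) -> Dict:
    for node in nodes:
        state = {**state, **node(state)}  # merge node output into shared state
    return state

def planner(state):  return {"plan": f"1. research angles for: {state['query']}"}
def searcher(state): return {"search_results": ["Title | URL | snippet"]}
def scraper(state):  return {"scraped_content": [{"url": "https://example.com", "content": "..."}]}
def analyst(state):  return {"findings": "key facts, gaps, contradictions"}
def writer(state):   return {"report": "# Report\n\n## Executive Summary\n..."}

final = run_pipeline({"query": "fusion energy"},
                     [planner, searcher, scraper, analyst, writer])
```

In the real system, LangGraph's StateGraph replaces the plain loop and carries the same accumulating state between nodes.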


✨ Key Features

πŸ—ΊοΈ Strategic Research Planning (Node 1)

Before any search is executed, a Planner Agent analyzes the research objective and creates a structured strategy:

  • Identifies key information required to comprehensively cover the topic
  • Formulates 3–5 specific search queries to cover different angles
  • Classifies expected source types: news, academic, general knowledge
  • The full plan is visible in the UI via the "View Agent's Thinking Strategy" expander

πŸ” DuckDuckGo Web Search (Node 2)

Privacy-first web retrieval that fetches the top 5 results per search query:

  • Returns structured data: Title, URL, and Snippet per result
  • No API key required – search goes through the duckduckgo-search library
  • Results feed URLs to the scraper and provide high-level context to the analyst

πŸ•·οΈ Deep Web Scraping (Node 3)

Goes beyond snippets – autonomously visits and extracts full content from discovered URLs:

  • Regex URL extraction from search results to find promising sources
  • Scrapes the top 3 URLs per query (excluding Google, DuckDuckGo, and Bing results pages)
  • Extracts up to 4,000 characters per page for the analyst (scraper itself captures up to 10,000)
  • HTML cleaning: Removes <script> and <style> elements via BeautifulSoup4
  • Browser-mimicking headers: Sends a Chrome User-Agent string to avoid bot detection
  • 10-second timeout per request – prevents indefinite hanging on slow sites
  • Graceful error handling: failed scrapes are logged and skipped rather than crashing the run
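The URL extraction and content-limiting steps above can be sketched with stdlib-only helpers. Function names here are illustrative (the real logic lives in the scraper node and utils/web_scraper.py); the excluded domains and character limits come from the description above:

```python
import re
from urllib.parse import urlparse

# Search-engine results pages are never scraped, per the pipeline description.
EXCLUDED_DOMAINS = ("google.", "duckduckgo.", "bing.")

def extract_urls(text: str, limit: int = 3) -> list[str]:
    """Pull http(s) URLs out of raw search-result text with a regex,
    skip search-engine domains, and keep the first `limit` candidates."""
    candidates = re.findall(r"https?://[^\s\"'<>]+", text)
    picked = []
    for url in candidates:
        host = urlparse(url).netloc.lower()
        if any(d in host for d in EXCLUDED_DOMAINS):
            continue
        if url not in picked:
            picked.append(url)
        if len(picked) == limit:
            break
    return picked

def truncate_for_analyst(page_text: str, limit: int = 4_000) -> str:
    """The scraper keeps up to 10K chars per page; the analyst sees at most 4K."""
    return page_text[:limit]

urls = extract_urls("see https://example.com/a and https://www.google.com/search?q=x "
                    "plus https://example.org/b https://example.net/c https://example.edu/d")
```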

🔬 Cross-Reference Analysis (Node 4)

An Analyst Agent synthesizes all scraped content (not just snippets):

  • Receives the full {url, content} pairs from the scraper node
  • Falls back to search snippets if scraping returned nothing
  • Extracts key facts and insights from the combined corpus
  • Identifies missing information or contradictions across sources
  • Uses the ANALYZER_PROMPT persona – focused on depth and accuracy

📄 Executive Report Generation (Node 5)

A Writer Agent transforms the analysis into a structured Markdown report:

  • Full report structure: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion
  • Source attribution: All scraped URLs are appended to the findings so the writer can include references
  • Auto-saved to disk: Reports are timestamped and persisted to outputs/reports/ (e.g., research_report_20260220_181500.md)
  • Downloadable from the UI: One-click download button in the Streamlit interface

🔀 Multi-LLM Fallback

The get_llm() factory function supports three LLM providers with automatic fallback:

Priority Provider Model Required Env Var
1st DeepSeek deepseek-chat (V3) DEEPSEEK_API_KEY
2nd Anthropic claude-3-sonnet-20240229 ANTHROPIC_API_KEY
3rd OpenAI gpt-4o OPENAI_API_KEY

If no key is configured, the agent returns a clear error message in the UI instead of crashing.
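The selection logic can be sketched as a pure function over the environment. The key names and provider/model pairs come from the table above; constructing the actual LangChain chat clients is omitted, and the function name is illustrative:

```python
import os

# Priority order from the fallback table: (env var, provider, model).
_PROVIDERS = [
    ("DEEPSEEK_API_KEY",  "deepseek",  "deepseek-chat"),
    ("ANTHROPIC_API_KEY", "anthropic", "claude-3-sonnet-20240229"),
    ("OPENAI_API_KEY",    "openai",    "gpt-4o"),
]

def pick_provider(env=None):
    """Return (provider, model) for the first configured key, else None."""
    env = os.environ if env is None else env
    for key, provider, model in _PROVIDERS:
        if env.get(key):
            return provider, model
    return None  # the caller surfaces a clear UI error instead of crashing
```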

📚 Wikipedia Integration

The check_wikipedia tool provides a secondary knowledge source:

  • Uses LangChain's WikipediaQueryRun with WikipediaAPIWrapper
  • Available for definitions, historical context, and general knowledge
  • Complements the web search with authoritative encyclopedic data

🎨 Premium Streamlit Interface

A dark-mode glassmorphic UI with custom CSS:

  • Radial gradient background: #1a1c2c → #0d0e15
  • Gradient CTA buttons: linear-gradient(90deg, #4776E6, #8E54E9) with hover scale
  • Info cards: Glassmorphism cards in the sidebar (rgba(255, 255, 255, 0.03), blur(10px))
  • Gradient header: linear-gradient(90deg, #ff8a00, #e52e71) – bold orange-to-pink text
  • Status indicator: Real-time progress tracking via st.status()
    • "πŸ›°οΈ Planning strategic data nodes..."
    • "βœ… Research Synthesized!"
  • Four-phase methodology sidebar: Strategic Planning → Deep Web Research → Cognitive Synthesis → Executive Reporting

πŸ—οΈ Architecture

LangGraph State Machine

InsightEngine AI is built as a 5-node directed acyclic graph in LangGraph:

┌─────────────────────────────────────────────────────────────────┐
│                    LangGraph State Machine                      │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  1. PLANNER  │───▶│  2. SEARCHER │───▶│  3. SCRAPER  │       │
│  │              │    │              │    │              │       │
│  │ • Analyze    │    │ • DuckDuckGo │    │ • Regex URL  │       │
│  │   intent     │    │   top 5      │    │   extraction │       │
│  │ • 3-5 search │    │   results    │    │ • Top 3 URLs │       │
│  │   queries    │    │ • Structured │    │ • 4K chars   │       │
│  │ • Source     │    │   title/link │    │   per page   │       │
│  │   types      │    │   /snippet   │    │ • BS4 clean  │       │
│  └──────────────┘    └──────────────┘    └──────┬───────┘       │
│                                                 │               │
│                                                 ▼               │
│  ┌──────────────┐    ┌──────────────────────────────────┐       │
│  │  5. WRITER   │◀───│          4. ANALYST              │       │
│  │              │    │                                  │       │
│  │ • Title      │    │ • Cross-reference scraped data   │       │
│  │ • Exec       │    │ • Extract key facts              │       │
│  │   Summary    │    │ • Identify gaps                  │       │
│  │ • Key        │    │ • Falls back to snippets if      │       │
│  │   Findings   │    │   scraping returned empty        │       │
│  │ • Detailed   │    └──────────────────────────────────┘       │
│  │   Analysis   │                                               │
│  │ • Conclusion │                                               │
│  │ • Citations  │                                               │
│  │              │───▶  END (report saved + displayed)           │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

State Schema

The entire research pipeline passes a single ResearchState TypedDict through all nodes:

from typing import List, TypedDict

from langchain_core.messages import BaseMessage

class ResearchState(TypedDict):
    query: str                     # Original user question
    plan: str                      # Planner output – numbered strategy list
    search_results: List[str]      # Raw search snippets from DuckDuckGo
    scraped_content: List[dict]    # [{"url": str, "content": str}, ...]
    findings: str                  # Analyst's synthesized findings
    report: str                    # Writer's final Markdown report
    iteration: int                 # Reserved for future multi-pass logic
    messages: List[BaseMessage]    # LangChain message accumulator

Agent Personas

Three specialized prompts in prompts.py give each LLM call a distinct identity:

Persona Prompt Role
Planner PLANNER_PROMPT Creates a numbered research plan: key info needed, 3–5 search queries, source type classification
Analyzer ANALYZER_PROMPT Receives all scraped content, extracts key facts, identifies gaps, cross-references sources
Writer WRITER_PROMPT Produces the final report: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion

Tool Registry

Four LangChain @tool-decorated functions in tools.py:

Tool Input Output Source
search_web Search query string Formatted title/link/snippet blocks DuckDuckGo (5 results)
check_wikipedia Topic name Wikipedia article summary WikipediaAPIWrapper
fetch_page URL string Cleaned page text (up to 10K chars) web_scraper.scrape_url()
create_report Markdown content File path confirmation Writes to outputs/reports/
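As an illustration of the create_report behaviour described above (timestamped Markdown files under outputs/reports/), here is a stdlib-only sketch; the function name and exact signature are assumptions, not the project's actual code:

```python
import os
from datetime import datetime

def save_report(markdown: str, out_dir: str = "outputs/reports") -> str:
    """Write the report to a timestamped .md file and return its path."""
    os.makedirs(out_dir, exist_ok=True)  # auto-create the output directory
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"research_report_{stamp}.md")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markdown)
    return path
```

This yields filenames such as research_report_20260220_181500.md, matching the report-saving format described earlier.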

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • A DeepSeek API key (or an Anthropic / OpenAI key as fallback)

Step 1 – Clone

git clone https://github.com/Ismail-2001/Autonomous-Research-Intelligence.git
cd Autonomous-Research-Intelligence

Step 2 – Install Dependencies

pip install -r research_agent/requirements.txt
Full dependency list (13 packages)
Package Purpose
langchain Core framework for LLM orchestration
langchain-community DuckDuckGo + Wikipedia tool integrations
langgraph State machine graph builder (StateGraph, END)
langchain-openai DeepSeek / OpenAI LLM connector
langchain-anthropic Claude fallback LLM connector
streamlit Web application framework
duckduckgo-search Privacy-first web search API
wikipedia Wikipedia article retrieval
beautifulsoup4 HTML parsing and content extraction
python-dotenv .env file loading
tiktoken Token counting for context window management
pandas Data manipulation utilities
requests HTTP client for web scraping

Step 3 – Configure Environment

Create a .env file in the research_agent/ directory:

DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional fallbacks (checked in priority order):
# ANTHROPIC_API_KEY=your_anthropic_key
# OPENAI_API_KEY=your_openai_key

# Optional configuration:
# DEBUG=True
# LOG_LEVEL=INFO
# USER_AGENT=ResearchAgent/1.0

Step 4 – Launch

streamlit run research_agent/app.py

The app will open at http://localhost:8501.


💡 Usage

Running a Research Query

  1. Open the app in your browser
  2. Enter a complex research objective in the text input:
    • "The next 10 years of Fusion Energy commercialization"
    • "How is CRISPR being used in cancer treatment as of 2026?"
    • "Compare the economic strategies of BRICS vs G7 nations"
  3. Click "Generate Deep Report"
  4. Watch the agent progress through its 5 stages:
    • πŸ›°οΈ Planning strategic data nodes...
    • βœ… Research Synthesized!
  5. Explore the results:
    • πŸ‘οΈ View Agent's Thinking Strategy β€” expand to see the full research plan
    • πŸ“Š Key Research Findings β€” analyst's synthesized facts and insights
    • πŸ“„ Executive Summary & Full Analysis β€” the complete Markdown report
    • πŸ“₯ Download Executive Report β€” one-click .md file download

Sidebar: Methodology Panel

The left sidebar always shows the 4-phase research methodology:

  1. Strategic Planning – Agent analyzes intent and creates a search tree
  2. Deep Web Research – Autonomous browsing and extraction from verified sources
  3. Cognitive Synthesis – Cross-referencing for accuracy and depth
  4. Executive Reporting – Structured, citation-rich document generation

📄 Output Format

Every report follows a consistent professional structure:

# [Report Title]

## Executive Summary
[2-3 paragraph overview of key findings]

## Key Findings
1. **[Finding 1]** – [details with source citation]
2. **[Finding 2]** – [details with source citation]
3. ...

## Detailed Analysis
[In-depth treatment of each finding, organized by theme]

## Conclusion
[Summary of implications and recommendations]

## Sources Accessed
- [URL 1]
- [URL 2]
- [URL 3]

Reports are automatically saved to outputs/reports/ with timestamped filenames:

outputs/reports/research_report_20260220_181500.md

📂 Project Structure

Autonomous-Research-Intelligence/
│
├── research_agent/
│   │
│   ├── agent/
│   │   ├── research_agent.py   # Core LangGraph state machine
│   │   │                       #   • ResearchState TypedDict (8 fields)
│   │   │                       #   • 5 graph nodes: planner → searcher → scraper → analyst → writer
│   │   │                       #   • get_llm() multi-provider factory (DeepSeek → Anthropic → OpenAI)
│   │   │                       #   • run_research_agent() entry point
│   │   │
│   │   ├── tools.py            # 4 LangChain @tool-decorated functions
│   │   │                       #   • search_web: DuckDuckGo (5 results, structured output)
│   │   │                       #   • check_wikipedia: WikipediaAPIWrapper integration
│   │   │                       #   • fetch_page: Full page scraping via web_scraper.scrape_url()
│   │   │                       #   • create_report: Timestamped .md file writer → outputs/reports/
│   │   │
│   │   └── prompts.py          # 3 specialized agent personas
│   │                           #   • PLANNER_PROMPT: Research strategy + search query generation
│   │                           #   • ANALYZER_PROMPT: Fact extraction + gap identification
│   │                           #   • WRITER_PROMPT: Executive report structure (Title → Conclusion)
│   │
│   ├── config/
│   │   └── settings.py         # Environment configuration
│   │                           #   • Multi-provider API key loading
│   │                           #   • DEBUG, LOG_LEVEL, USER_AGENT defaults
│   │                           #   • REPORT_OUTPUT_DIR auto-creation
│   │
│   ├── utils/
│   │   ├── web_scraper.py      # Deep scraping engine
│   │   │                       #   • BeautifulSoup4 HTML parsing
│   │   │                       #   • Script/style tag removal
│   │   │                       #   • Chrome User-Agent header spoofing
│   │   │                       #   • 10s timeout, 10K char limit
│   │   │
│   │   └── text_processor.py   # Content utilities
│   │                           #   • clean_text(): whitespace normalization
│   │                           #   • chunk_text(): 4K-char chunking for LLM context
│   │
│   ├── app.py                  # Streamlit application (144 lines)
│   │                           #   • Custom CSS: radial gradient, glassmorphism, gradient buttons
│   │                           #   • Sidebar: methodology cards + branding
│   │                           #   • Main: text input → status progress → results panel
│   │                           #   • Expandable agent plan view
│   │                           #   • Download button for final report
│   │
│   ├── requirements.txt        # 13 Python dependencies
│   └── .env                    # API key configuration (gitignored)
│
└── outputs/
    └── reports/                # Auto-generated research reports (timestamped .md files)
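The text utilities listed above (clean_text and chunk_text) can be sketched in a few lines; the exact splitting strategy used by the project is an assumption here:

```python
def clean_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())

def chunk_text(text: str, chunk_size: int = 4_000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters,
    keeping scraped pages within the LLM's context budget."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```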

πŸ› οΈ Tech Stack

Category Library Purpose
Language Python 3.10+ Core runtime
LLM (Primary) DeepSeek V3 via langchain-openai Research planning, analysis, report writing
LLM (Fallback 1) Claude 3 Sonnet via langchain-anthropic Alternative LLM provider
LLM (Fallback 2) GPT-4o via langchain-openai Third-tier fallback
Orchestration LangGraph StateGraph 5-node directed acyclic state machine
Web Search duckduckgo-search Privacy-first search – no API key needed
Knowledge Base Wikipedia via langchain-community Encyclopedic context and definitions
Web Scraping BeautifulSoup4 + Requests Full page content extraction with HTML cleaning
Tokenization TikToken Token counting for context window awareness
Frontend Streamlit Interactive web UI with custom CSS injection
Config python-dotenv Secure .env loading for API keys

🌐 Deployment

Streamlit Cloud

  1. Push your repo to GitHub
  2. Go to share.streamlit.io
  3. Select your repository → set Main file path to research_agent/app.py
  4. Add Secrets in the dashboard:
    DEEPSEEK_API_KEY = "your_key_here"
  5. Click Deploy ✅

Docker

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r research_agent/requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "research_agent/app.py", "--server.port=8501", "--server.headless=true"]

Render / Railway

services:
  - type: web
    name: insightengine-ai
    runtime: python
    buildCommand: pip install -r research_agent/requirements.txt
    startCommand: streamlit run research_agent/app.py --server.port $PORT --server.headless true
    envVars:
      - key: DEEPSEEK_API_KEY
        sync: false

πŸ—ΊοΈ Roadmap

✅ Phase 1 – Core Research Pipeline (Complete)

  • 5-node LangGraph state machine (Planner → Searcher → Scraper → Analyst → Writer)
  • DuckDuckGo web search (5 results per query, structured output)
  • Deep web scraping with BeautifulSoup4 (top 3 URLs, 4K chars per page)
  • Wikipedia knowledge integration via LangChain
  • Multi-LLM fallback: DeepSeek → Anthropic → OpenAI
  • Three specialized agent personas (Planner, Analyzer, Writer)
  • Executive report generation with citation trail
  • Auto-save to outputs/reports/ with timestamps
  • Download button in Streamlit UI
  • Premium glassmorphic dark-mode interface
  • Text processor utilities (cleaning + chunking)

🔨 Phase 2 – Research Depth (Next)

  • Iterative Research Loop: Re-run the Planner → Search → Scrape cycle with refined queries based on gaps identified by the Analyst
  • Academic Search Integration: arXiv, Google Scholar, Semantic Scholar APIs for peer-reviewed sources
  • PDF Document Parsing: Extract content from research papers and reports linked in search results
  • Source Credibility Scoring: Rank sources by domain authority, publishing date, and citation count

📋 Phase 3 – Enhanced Output (Planned)

  • Multi-Format Export: PDF, DOCX, and HTML report generation alongside Markdown
  • Evidence Mapping: Visual graph showing how sources connect to key findings
  • Comparison Reports: Side-by-side analysis of two competing topics or viewpoints
  • Report History Dashboard: Browse, search, and re-open past research sessions

🔭 Phase 4 – Enterprise Features (Vision)

  • Team Workspaces: Shared research library with role-based access
  • Scheduled Research Monitoring: Weekly automated reports on evolving topics
  • RAG Integration: Upload internal documents as additional knowledge sources
  • Multi-Agent Debate: Two analyst agents argue opposing perspectives before synthesis

🤝 Contributing

Contributions are welcome! High-impact areas:

  • New search providers – add Brave Search, SerpAPI, or Google Scholar tools in tools.py
  • New agent personas – extend prompts.py with domain-specific analysts (e.g., Medical, Legal, Financial)
  • Output formats – add PDF/DOCX export in the create_report tool
  • Scraper improvements – handle JavaScript-rendered pages, pagination, or anti-bot measures
  • UI enhancements – add report history, search within past reports, or comparative views

To contribute:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit with Conventional Commits: git commit -m "feat: add arXiv search tool"
  4. Push and open a Pull Request against main

📄 License

Distributed under the MIT License. See LICENSE for details.


Built for researchers who demand depth, not just snippets.

If InsightEngine AI changed how you approach research, star ⭐ the repo.


Built with ❤️ by Ismail Sajid
