
🧠 InsightEngine AI

Autonomous Research Agent – From Question to Executive Report in One Click

Python LangGraph DeepSeek Streamlit DuckDuckGo BeautifulSoup License


"It doesn't just search – it plans, scrapes, cross-references, and writes the report for you."

InsightEngine AI is an autonomous research assistant powered by a LangGraph state machine and DeepSeek-V3. Give it any complex topic – it will plan a multi-angle research strategy, search the open web, scrape full page content from the top sources, cross-reference findings through an AI analyst, and produce a downloadable, citation-rich executive report. No human intervention between question and output.

✨ Features · 🏗️ Architecture · 🚀 Setup · 📄 Sample Output


📌 The Research Problem

Genuine research – the kind that produces insights instead of summaries – requires a pipeline that most tools don't offer:

  • Search engines return snippets, not understanding – you still have to visit 10+ links and synthesize manually
  • LLMs hallucinate facts when they generate from parametric memory alone – no grounding in real sources
  • Copy-pasting from multiple tabs into a coherent report takes hours of manual work
  • No planning phase – tools jump straight to retrieval without thinking about what angles to cover
  • No citation trail – you can't verify where the information came from

InsightEngine AI solves this end-to-end with a 5-node agentic pipeline: Plan → Search → Scrape → Analyze → Report – each stage executed by a specialized LangGraph node with distinct responsibilities.
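As a rough mental model of that pipeline, the five stages can be sketched as a single state dict threaded through five node functions. This is a library-free sketch with hypothetical stub bodies – the real nodes are LLM-backed LangGraph functions:

```python
# Library-free sketch of the five-stage flow: one shared state dict is
# passed through each node in order, and each node returns only the
# fields it contributes. The stub bodies below are placeholders; the
# real implementations call an LLM, a search tool, and a scraper.
from typing import Callable, Dict, List

def run_pipeline(state: Dict, nodes: List[Callable[[Dict], Dict]]) -> Dict:
    for node in nodes:
        state = {**state, **node(state)}  # merge node output into shared state
    return state

def planner(state):  return {"plan": f"1. research angles for: {state['query']}"}
def searcher(state): return {"search_results": ["Title | URL | snippet"]}
def scraper(state):  return {"scraped_content": [{"url": "https://example.com", "content": "..."}]}
def analyst(state):  return {"findings": "key facts, gaps, contradictions"}
def writer(state):   return {"report": "# Report\n\n## Executive Summary\n..."}

final = run_pipeline({"query": "fusion energy"},
                     [planner, searcher, scraper, analyst, writer])
```

In the real system, LangGraph's StateGraph replaces the plain loop and carries the same accumulating state between nodes.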


✨ Key Features

πŸ—ΊοΈ Strategic Research Planning (Node 1)

Before any search is executed, a Planner Agent analyzes the research objective and creates a structured strategy:

  • Identifies key information required to comprehensively cover the topic
  • Formulates 3–5 specific search queries to cover different angles
  • Classifies expected source types: news, academic, general knowledge
  • The full plan is visible in the UI via the "View Agent's Thinking Strategy" expander

πŸ” DuckDuckGo Web Search (Node 2)

Privacy-first web retrieval that fetches the top 5 results per search query:

  • Returns structured data: Title, URL, and Snippet per result
  • No API key required – search goes through the duckduckgo-search library
  • Results feed URLs to the scraper and provide high-level context to the analyst

πŸ•·οΈ Deep Web Scraping (Node 3)

Goes beyond snippets – autonomously visits and extracts full content from discovered URLs:

  • Regex URL extraction from search results to find promising sources
  • Scrapes the top 3 URLs per query (excluding Google, DuckDuckGo, and Bing results pages)
  • Extracts up to 4,000 characters per page for the analyst (scraper itself captures up to 10,000)
  • HTML cleaning: Removes <script> and <style> elements via BeautifulSoup4
  • Browser-mimicking headers: Sends a Chrome User-Agent string to avoid bot detection
  • 10-second timeout per request – prevents indefinite hanging on slow sites
  • Graceful error handling: failed scrapes are logged and skipped rather than crashing the run
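The URL extraction and content-limiting steps above can be sketched with stdlib-only helpers. Function names here are illustrative (the real logic lives in the scraper node and utils/web_scraper.py); the excluded domains and character limits come from the description above:

```python
import re
from urllib.parse import urlparse

# Search-engine results pages are never scraped, per the pipeline description.
EXCLUDED_DOMAINS = ("google.", "duckduckgo.", "bing.")

def extract_urls(text: str, limit: int = 3) -> list[str]:
    """Pull http(s) URLs out of raw search-result text with a regex,
    skip search-engine domains, and keep the first `limit` candidates."""
    candidates = re.findall(r"https?://[^\s\"'<>]+", text)
    picked = []
    for url in candidates:
        host = urlparse(url).netloc.lower()
        if any(d in host for d in EXCLUDED_DOMAINS):
            continue
        if url not in picked:
            picked.append(url)
        if len(picked) == limit:
            break
    return picked

def truncate_for_analyst(page_text: str, limit: int = 4_000) -> str:
    """The scraper keeps up to 10K chars per page; the analyst sees at most 4K."""
    return page_text[:limit]

urls = extract_urls("see https://example.com/a and https://www.google.com/search?q=x "
                    "plus https://example.org/b https://example.net/c https://example.edu/d")
```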

🔬 Cross-Reference Analysis (Node 4)

An Analyst Agent synthesizes all scraped content (not just snippets):

  • Receives the full {url, content} pairs from the scraper node
  • Falls back to search snippets if scraping returned nothing
  • Extracts key facts and insights from the combined corpus
  • Identifies missing information or contradictions across sources
  • Uses the ANALYZER_PROMPT persona – focused on depth and accuracy

📄 Executive Report Generation (Node 5)

A Writer Agent transforms the analysis into a structured Markdown report:

  • Full report structure: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion
  • Source attribution: All scraped URLs are appended to the findings so the writer can include references
  • Auto-saved to disk: Reports are timestamped and persisted to outputs/reports/ (e.g., research_report_20260220_181500.md)
  • Downloadable from the UI: One-click download button in the Streamlit interface

🔀 Multi-LLM Fallback

The get_llm() factory function supports three LLM providers with automatic fallback:

Priority Provider Model Required Env Var
1st DeepSeek deepseek-chat (V3) DEEPSEEK_API_KEY
2nd Anthropic claude-3-sonnet-20240229 ANTHROPIC_API_KEY
3rd OpenAI gpt-4o OPENAI_API_KEY

If no key is configured, the agent returns a clear error message in the UI instead of crashing.
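The selection logic can be sketched as a pure function over the environment. The key names and provider/model pairs come from the table above; constructing the actual LangChain chat clients is omitted, and the function name is illustrative:

```python
import os

# Priority order from the fallback table: (env var, provider, model).
_PROVIDERS = [
    ("DEEPSEEK_API_KEY",  "deepseek",  "deepseek-chat"),
    ("ANTHROPIC_API_KEY", "anthropic", "claude-3-sonnet-20240229"),
    ("OPENAI_API_KEY",    "openai",    "gpt-4o"),
]

def pick_provider(env=None):
    """Return (provider, model) for the first configured key, else None."""
    env = os.environ if env is None else env
    for key, provider, model in _PROVIDERS:
        if env.get(key):
            return provider, model
    return None  # the caller surfaces a clear UI error instead of crashing
```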

📚 Wikipedia Integration

The check_wikipedia tool provides a secondary knowledge source:

  • Uses LangChain's WikipediaQueryRun with WikipediaAPIWrapper
  • Available for definitions, historical context, and general knowledge
  • Complements the web search with authoritative encyclopedic data

🎨 Premium Streamlit Interface

A dark-mode glassmorphic UI with custom CSS:

  • Radial gradient background: #1a1c2c → #0d0e15
  • Gradient CTA buttons: linear-gradient(90deg, #4776E6, #8E54E9) with hover scale
  • Info cards: Glassmorphism cards in the sidebar (rgba(255, 255, 255, 0.03), blur(10px))
  • Gradient header: linear-gradient(90deg, #ff8a00, #e52e71) – bold orange-to-pink text
  • Status indicator: Real-time progress tracking via st.status()
    • "πŸ›°οΈ Planning strategic data nodes..."
    • "βœ… Research Synthesized!"
  • Four-phase methodology sidebar: Strategic Planning → Deep Web Research → Cognitive Synthesis → Executive Reporting

πŸ—οΈ Architecture

LangGraph State Machine

InsightEngine AI is built as a 5-node directed acyclic graph in LangGraph:

┌─────────────────────────────────────────────────────────────────┐
│                    LangGraph State Machine                      │
│                                                                 │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐       │
│  │  1. PLANNER  │───▶│  2. SEARCHER │───▶│  3. SCRAPER  │       │
│  │              │    │              │    │              │       │
│  │ • Analyze    │    │ • DuckDuckGo │    │ • Regex URL  │       │
│  │   intent     │    │   top 5      │    │   extraction │       │
│  │ • 3-5 search │    │   results    │    │ • Top 3 URLs │       │
│  │   queries    │    │ • Structured │    │ • 4K chars   │       │
│  │ • Source     │    │   title/link │    │   per page   │       │
│  │   types      │    │   /snippet   │    │ • BS4 clean  │       │
│  └──────────────┘    └──────────────┘    └──────┬───────┘       │
│                                                 │               │
│                                                 ▼               │
│  ┌──────────────┐    ┌──────────────────────────────────┐       │
│  │  5. WRITER   │◀───│          4. ANALYST              │       │
│  │              │    │                                  │       │
│  │ • Title      │    │ • Cross-reference scraped data   │       │
│  │ • Exec       │    │ • Extract key facts              │       │
│  │   Summary    │    │ • Identify gaps                  │       │
│  │ • Key        │    │ • Falls back to snippets if      │       │
│  │   Findings   │    │   scraping returned empty        │       │
│  │ • Detailed   │    └──────────────────────────────────┘       │
│  │   Analysis   │                                               │
│  │ • Conclusion │                                               │
│  │ • Citations  │                                               │
│  │              │───▶  END (report saved + displayed)           │
│  └──────────────┘                                               │
└─────────────────────────────────────────────────────────────────┘

State Schema

The entire research pipeline passes a single ResearchState TypedDict through all nodes:

from typing import List, TypedDict

from langchain_core.messages import BaseMessage

class ResearchState(TypedDict):
    query: str                     # Original user question
    plan: str                      # Planner output – numbered strategy list
    search_results: List[str]      # Raw search snippets from DuckDuckGo
    scraped_content: List[dict]    # [{"url": str, "content": str}, ...]
    findings: str                  # Analyst's synthesized findings
    report: str                    # Writer's final Markdown report
    iteration: int                 # Reserved for future multi-pass logic
    messages: List[BaseMessage]    # LangChain message accumulator

Agent Personas

Three specialized prompts in prompts.py give each LLM call a distinct identity:

Persona Prompt Role
Planner PLANNER_PROMPT Creates a numbered research plan: key info needed, 3–5 search queries, source type classification
Analyzer ANALYZER_PROMPT Receives all scraped content, extracts key facts, identifies gaps, cross-references sources
Writer WRITER_PROMPT Produces the final report: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion

Tool Registry

Four LangChain @tool-decorated functions in tools.py:

Tool Input Output Source
search_web Search query string Formatted title/link/snippet blocks DuckDuckGo (5 results)
check_wikipedia Topic name Wikipedia article summary WikipediaAPIWrapper
fetch_page URL string Cleaned page text (up to 10K chars) web_scraper.scrape_url()
create_report Markdown content File path confirmation Writes to outputs/reports/
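As an illustration of the create_report behaviour described above (timestamped Markdown files under outputs/reports/), here is a stdlib-only sketch; the function name and exact signature are assumptions, not the project's actual code:

```python
import os
from datetime import datetime

def save_report(markdown: str, out_dir: str = "outputs/reports") -> str:
    """Write the report to a timestamped .md file and return its path."""
    os.makedirs(out_dir, exist_ok=True)  # auto-create the output directory
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"research_report_{stamp}.md")
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(markdown)
    return path
```

This yields filenames such as research_report_20260220_181500.md, matching the report-saving format described earlier.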

🚀 Getting Started

Prerequisites

  • Python 3.10+
  • A DeepSeek API key (or an Anthropic / OpenAI key as fallback)

Step 1 – Clone

git clone https://github.com/Ismail-2001/Autonomous-Research-Intelligence.git
cd Autonomous-Research-Intelligence

Step 2 – Install Dependencies

pip install -r research_agent/requirements.txt
Full dependency list (13 packages)
Package Purpose
langchain Core framework for LLM orchestration
langchain-community DuckDuckGo + Wikipedia tool integrations
langgraph State machine graph builder (StateGraph, END)
langchain-openai DeepSeek / OpenAI LLM connector
langchain-anthropic Claude fallback LLM connector
streamlit Web application framework
duckduckgo-search Privacy-first web search API
wikipedia Wikipedia article retrieval
beautifulsoup4 HTML parsing and content extraction
python-dotenv .env file loading
tiktoken Token counting for context window management
pandas Data manipulation utilities
requests HTTP client for web scraping

Step 3 – Configure Environment

Create a .env file in the research_agent/ directory:

DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional fallbacks (checked in priority order):
# ANTHROPIC_API_KEY=your_anthropic_key
# OPENAI_API_KEY=your_openai_key

# Optional configuration:
# DEBUG=True
# LOG_LEVEL=INFO
# USER_AGENT=ResearchAgent/1.0

Step 4 – Launch

streamlit run research_agent/app.py

The app will open at http://localhost:8501.


💡 Usage

Running a Research Query

  1. Open the app in your browser
  2. Enter a complex research objective in the text input:
    • "The next 10 years of Fusion Energy commercialization"
    • "How is CRISPR being used in cancer treatment as of 2026?"
    • "Compare the economic strategies of BRICS vs G7 nations"
  3. Click "Generate Deep Report"
  4. Watch the agent progress through its 5 stages:
    • πŸ›°οΈ Planning strategic data nodes...
    • βœ… Research Synthesized!
  5. Explore the results:
    • πŸ‘οΈ View Agent's Thinking Strategy β€” expand to see the full research plan
    • πŸ“Š Key Research Findings β€” analyst's synthesized facts and insights
    • πŸ“„ Executive Summary & Full Analysis β€” the complete Markdown report
    • πŸ“₯ Download Executive Report β€” one-click .md file download

Sidebar: Methodology Panel

The left sidebar always shows the 4-phase research methodology:

  1. Strategic Planning – Agent analyzes intent and creates a search tree
  2. Deep Web Research – Autonomous browsing and extraction from verified sources
  3. Cognitive Synthesis – Cross-referencing for accuracy and depth
  4. Executive Reporting – Structured, citation-rich document generation

📄 Output Format

Every report follows a consistent professional structure:

# [Report Title]

## Executive Summary
[2-3 paragraph overview of key findings]

## Key Findings
1. **[Finding 1]** – [details with source citation]
2. **[Finding 2]** – [details with source citation]
3. ...

## Detailed Analysis
[In-depth treatment of each finding, organized by theme]

## Conclusion
[Summary of implications and recommendations]

## Sources Accessed
- [URL 1]
- [URL 2]
- [URL 3]

Reports are automatically saved to outputs/reports/ with timestamped filenames:

outputs/reports/research_report_20260220_181500.md

📂 Project Structure

Autonomous-Research-Intelligence/
│
├── research_agent/
│   │
│   ├── agent/
│   │   ├── research_agent.py   # Core LangGraph state machine
│   │   │                       #   • ResearchState TypedDict (8 fields)
│   │   │                       #   • 5 graph nodes: planner → searcher → scraper → analyst → writer
│   │   │                       #   • get_llm() multi-provider factory (DeepSeek → Anthropic → OpenAI)
│   │   │                       #   • run_research_agent() entry point
│   │   │
│   │   ├── tools.py            # 4 LangChain @tool-decorated functions
│   │   │                       #   • search_web: DuckDuckGo (5 results, structured output)
│   │   │                       #   • check_wikipedia: WikipediaAPIWrapper integration
│   │   │                       #   • fetch_page: Full page scraping via web_scraper.scrape_url()
│   │   │                       #   • create_report: Timestamped .md file writer → outputs/reports/
│   │   │
│   │   └── prompts.py          # 3 specialized agent personas
│   │                           #   • PLANNER_PROMPT: Research strategy + search query generation
│   │                           #   • ANALYZER_PROMPT: Fact extraction + gap identification
│   │                           #   • WRITER_PROMPT: Executive report structure (Title → Conclusion)
│   │
│   ├── config/
│   │   └── settings.py         # Environment configuration
│   │                           #   • Multi-provider API key loading
│   │                           #   • DEBUG, LOG_LEVEL, USER_AGENT defaults
│   │                           #   • REPORT_OUTPUT_DIR auto-creation
│   │
│   ├── utils/
│   │   ├── web_scraper.py      # Deep scraping engine
│   │   │                       #   • BeautifulSoup4 HTML parsing
│   │   │                       #   • Script/style tag removal
│   │   │                       #   • Chrome User-Agent header spoofing
│   │   │                       #   • 10s timeout, 10K char limit
│   │   │
│   │   └── text_processor.py   # Content utilities
│   │                           #   • clean_text(): whitespace normalization
│   │                           #   • chunk_text(): 4K-char chunking for LLM context
│   │
│   ├── app.py                  # Streamlit application (144 lines)
│   │                           #   • Custom CSS: radial gradient, glassmorphism, gradient buttons
│   │                           #   • Sidebar: methodology cards + branding
│   │                           #   • Main: text input → status progress → results panel
│   │                           #   • Expandable agent plan view
│   │                           #   • Download button for final report
│   │
│   ├── requirements.txt        # 13 Python dependencies
│   └── .env                    # API key configuration (gitignored)
│
└── outputs/
    └── reports/                # Auto-generated research reports (timestamped .md files)
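The text utilities listed above (clean_text and chunk_text) can be sketched in a few lines; the exact splitting strategy used by the project is an assumption here:

```python
def clean_text(text: str) -> str:
    """Collapse runs of whitespace (spaces, tabs, newlines) into single spaces."""
    return " ".join(text.split())

def chunk_text(text: str, chunk_size: int = 4_000) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size characters,
    keeping scraped pages within the LLM's context budget."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
```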

πŸ› οΈ Tech Stack

Category Library Purpose
Language Python 3.10+ Core runtime
LLM (Primary) DeepSeek V3 via langchain-openai Research planning, analysis, report writing
LLM (Fallback 1) Claude 3 Sonnet via langchain-anthropic Alternative LLM provider
LLM (Fallback 2) GPT-4o via langchain-openai Third-tier fallback
Orchestration LangGraph StateGraph 5-node directed acyclic state machine
Web Search duckduckgo-search Privacy-first search – no API key needed
Knowledge Base Wikipedia via langchain-community Encyclopedic context and definitions
Web Scraping BeautifulSoup4 + Requests Full page content extraction with HTML cleaning
Tokenization TikToken Token counting for context window awareness
Frontend Streamlit Interactive web UI with custom CSS injection
Config python-dotenv Secure .env loading for API keys

🌐 Deployment

Streamlit Cloud

  1. Push your repo to GitHub
  2. Go to share.streamlit.io
  3. Select your repository → set Main file path to research_agent/app.py
  4. Add Secrets in the dashboard:
    DEEPSEEK_API_KEY = "your_key_here"
  5. Click Deploy ✅

Docker

FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r research_agent/requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "research_agent/app.py", "--server.port=8501", "--server.headless=true"]

Render / Railway

services:
  - type: web
    name: insightengine-ai
    runtime: python
    buildCommand: pip install -r research_agent/requirements.txt
    startCommand: streamlit run research_agent/app.py --server.port $PORT --server.headless true
    envVars:
      - key: DEEPSEEK_API_KEY
        sync: false

πŸ—ΊοΈ Roadmap

✅ Phase 1 – Core Research Pipeline (Complete)

  • 5-node LangGraph state machine (Planner → Searcher → Scraper → Analyst → Writer)
  • DuckDuckGo web search (5 results per query, structured output)
  • Deep web scraping with BeautifulSoup4 (top 3 URLs, 4K chars per page)
  • Wikipedia knowledge integration via LangChain
  • Multi-LLM fallback: DeepSeek → Anthropic → OpenAI
  • Three specialized agent personas (Planner, Analyzer, Writer)
  • Executive report generation with citation trail
  • Auto-save to outputs/reports/ with timestamps
  • Download button in Streamlit UI
  • Premium glassmorphic dark-mode interface
  • Text processor utilities (cleaning + chunking)

🔨 Phase 2 – Research Depth (Next)

  • Iterative Research Loop: Re-run the Planner → Search → Scrape cycle with refined queries based on gaps identified by the Analyst
  • Academic Search Integration: arXiv, Google Scholar, Semantic Scholar APIs for peer-reviewed sources
  • PDF Document Parsing: Extract content from research papers and reports linked in search results
  • Source Credibility Scoring: Rank sources by domain authority, publishing date, and citation count

📋 Phase 3 – Enhanced Output (Planned)

  • Multi-Format Export: PDF, DOCX, and HTML report generation alongside Markdown
  • Evidence Mapping: Visual graph showing how sources connect to key findings
  • Comparison Reports: Side-by-side analysis of two competing topics or viewpoints
  • Report History Dashboard: Browse, search, and re-open past research sessions

🔭 Phase 4 – Enterprise Features (Vision)

  • Team Workspaces: Shared research library with role-based access
  • Scheduled Research Monitoring: Weekly automated reports on evolving topics
  • RAG Integration: Upload internal documents as additional knowledge sources
  • Multi-Agent Debate: Two analyst agents argue opposing perspectives before synthesis

🤝 Contributing

Contributions are welcome! High-impact areas:

  • New search providers – add Brave Search, SerpAPI, or Google Scholar tools in tools.py
  • New agent personas – extend prompts.py with domain-specific analysts (e.g., Medical, Legal, Financial)
  • Output formats – add PDF/DOCX export in the create_report tool
  • Scraper improvements – handle JavaScript-rendered pages, pagination, or anti-bot measures
  • UI enhancements – add report history, search within past reports, or comparative views

To contribute:

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/your-feature
  3. Commit with Conventional Commits: git commit -m "feat: add arXiv search tool"
  4. Push and open a Pull Request against main

📄 License

Distributed under the MIT License. See LICENSE for details.


Built for researchers who demand depth, not just snippets.

If InsightEngine AI changed how you approach research, star ⭐ the repo.


Built with ❤️ by Ismail Sajid
