> "It doesn't just search: it plans, scrapes, cross-references, and writes the report for you."
InsightEngine AI is an autonomous research assistant powered by a LangGraph state machine and DeepSeek-V3. Give it any complex topic and it will plan a multi-angle research strategy, search the open web, scrape full-page content from the top sources, cross-reference findings through an AI analyst, and produce a downloadable, citation-rich executive report. No human intervention between question and output.
Features · Architecture · Setup · Sample Output
Genuine research, the kind that produces insights instead of summaries, requires a pipeline that most tools don't offer:
- Search engines return snippets, not understanding: you still have to visit 10+ links and synthesize manually
- LLMs hallucinate facts when they generate from parametric memory alone, with no grounding in real sources
- Copy-pasting from multiple tabs into a coherent report takes hours of manual work
- No planning phase: tools jump straight to retrieval without thinking about what angles to cover
- No citation trail: you can't verify where the information came from
InsightEngine AI solves this end-to-end with a 5-node agentic pipeline: Plan → Search → Scrape → Analyze → Report, with each stage executed by a specialized LangGraph node with distinct responsibilities.
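Conceptually, the flow is a single shared state passed through five node functions in sequence. The sketch below is a simplified stand-in for the LangGraph graph: the node names and state fields follow this README, but the bodies are illustrative stubs, not the project's actual implementations.

```python
# Minimal sketch of the 5-stage pipeline as a linear pass of shared state.
# Node bodies are stubs; the real nodes call the LLM and web tools.

def planner(state):
    state["plan"] = f"1. Search: {state['query']}"
    return state

def searcher(state):
    state["search_results"] = [f"snippet for: {state['query']}"]
    return state

def scraper(state):
    state["scraped_content"] = [{"url": "https://example.com", "content": "..."}]
    return state

def analyst(state):
    state["findings"] = f"Findings from {len(state['scraped_content'])} page(s)"
    return state

def writer(state):
    state["report"] = f"# Report\n\n{state['findings']}"
    return state

# Plan -> Search -> Scrape -> Analyze -> Report
PIPELINE = [planner, searcher, scraper, analyst, writer]

def run_research_agent(query):
    state = {"query": query}
    for node in PIPELINE:
        state = node(state)
    return state
```

In the real project, LangGraph's `StateGraph` wires the same nodes with explicit edges and an `END` terminal instead of a plain loop.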
Before any search is executed, a Planner Agent analyzes the research objective and creates a structured strategy:
- Identifies key information required to comprehensively cover the topic
- Formulates 3β5 specific search queries to cover different angles
- Classifies expected source types: news, academic, general knowledge
- The full plan is visible in the UI via the "View Agent's Thinking Strategy" expander
Privacy-first web retrieval that fetches the top 5 results per search query:
- Returns structured data: Title, URL, and Snippet per result
- No API key required β uses the DuckDuckGo search API directly
- Results feed URLs to the scraper and provide high-level context to the analyst
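As an illustration of the structured output this stage produces, here is a minimal sketch of the snippet formatting. It assumes the `title`/`href`/`body` keys that `duckduckgo-search` result dicts typically carry; the project's actual tool lives in `tools.py`.

```python
# Sketch of formatting raw search results into title/link/snippet blocks.
# In the real tool the results would come from duckduckgo_search, e.g.:
#   from duckduckgo_search import DDGS
#   results = DDGS().text(query, max_results=5)

def format_results(results):
    """Turn raw result dicts into the text blocks the agent consumes."""
    blocks = []
    for r in results:
        blocks.append(
            f"Title: {r.get('title', '')}\n"
            f"URL: {r.get('href', '')}\n"
            f"Snippet: {r.get('body', '')}"
        )
    return "\n\n".join(blocks)
```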
Goes beyond snippets by autonomously visiting and extracting full content from discovered URLs:
- Regex URL extraction from search results to find promising sources
- Scrapes the top 3 URLs per query (excluding Google, DuckDuckGo, and Bing results pages)
- Extracts up to 4,000 characters per page for the analyst (scraper itself captures up to 10,000)
- HTML cleaning: Removes `<script>` and `<style>` elements via BeautifulSoup4
- Browser-mimicking headers: Sends a Chrome User-Agent string to avoid bot detection
- 10-second timeout per request prevents indefinite hanging on slow sites
- Graceful error handling: Failed scrapes are logged rather than crashing the run
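The cleaning step can be sketched without the BeautifulSoup4 dependency using the standard-library HTML parser. This is an illustrative equivalent of the behavior described above (script/style removal, character cap), not the project's `web_scraper.py`:

```python
from html.parser import HTMLParser

# Dependency-free sketch: strip <script>/<style> content, keep visible
# text, and cap the result at 10,000 characters (the scraper's limit).

class _TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def clean_html(html, max_chars=10_000):
    parser = _TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)[:max_chars]
```

The real fetch would wrap this with `requests.get(url, headers={"User-Agent": ...}, timeout=10)` and a try/except that logs failures instead of raising.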
An Analyst Agent synthesizes all scraped content (not just snippets):
- Receives the full `{url, content}` pairs from the scraper node
- Falls back to search snippets if scraping returned nothing
- Extracts key facts and insights from the combined corpus
- Identifies missing information or contradictions across sources
- Uses the `ANALYZER_PROMPT` persona, focused on depth and accuracy
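The fallback behavior can be sketched as a small helper (the name `build_analyst_corpus` is hypothetical; in the project this logic lives inside the analyst node):

```python
def build_analyst_corpus(state):
    """Prefer full scraped pages; fall back to raw snippets if scraping failed."""
    scraped = state.get("scraped_content") or []
    if scraped:
        # Label each page with its URL so the writer can cite sources later.
        return "\n\n".join(f"Source: {d['url']}\n{d['content']}" for d in scraped)
    return "\n".join(state.get("search_results") or [])
```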
A Writer Agent transforms the analysis into a structured Markdown report:
- Full report structure: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion
- Source attribution: All scraped URLs are appended to the findings so the writer can include references
- Auto-saved to disk: Reports are timestamped and persisted to `outputs/reports/` (e.g., `research_report_20260220_181500.md`)
- Downloadable from the UI: One-click download button in the Streamlit interface
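A minimal sketch of the timestamped save, mirroring the filename pattern above (in the project this is the `create_report` tool):

```python
import os
from datetime import datetime

def save_report(markdown, out_dir="outputs/reports"):
    """Persist a report with a research_report_YYYYMMDD_HHMMSS.md filename."""
    os.makedirs(out_dir, exist_ok=True)
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    path = os.path.join(out_dir, f"research_report_{stamp}.md")
    with open(path, "w", encoding="utf-8") as f:
        f.write(markdown)
    return path
```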
The `get_llm()` factory function supports three LLM providers with automatic fallback:

| Priority | Provider | Model | Required Env Var |
|---|---|---|---|
| 1st | DeepSeek | `deepseek-chat` (V3) | `DEEPSEEK_API_KEY` |
| 2nd | Anthropic | `claude-3-sonnet-20240229` | `ANTHROPIC_API_KEY` |
| 3rd | OpenAI | `gpt-4o` | `OPENAI_API_KEY` |
If no key is configured, the agent returns a clear error message in the UI instead of crashing.
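The priority logic can be sketched as follows. This simplified version only returns the provider/model pair that would be selected; the project's `get_llm()` additionally constructs the corresponding LangChain client.

```python
import os

# Env vars checked in priority order, per the table above.
PROVIDERS = [
    ("DEEPSEEK_API_KEY", "deepseek", "deepseek-chat"),
    ("ANTHROPIC_API_KEY", "anthropic", "claude-3-sonnet-20240229"),
    ("OPENAI_API_KEY", "openai", "gpt-4o"),
]

def pick_provider(env=None):
    """Return (provider, model) for the first configured key, else None."""
    env = os.environ if env is None else env
    for var, provider, model in PROVIDERS:
        if env.get(var):
            return provider, model
    return None  # the UI surfaces a clear error instead of crashing
```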
The `check_wikipedia` tool provides a secondary knowledge source:
- Uses LangChain's `WikipediaQueryRun` with `WikipediaAPIWrapper`
- Available for definitions, historical context, and general knowledge
- Complements the web search with authoritative encyclopedic data
A dark-mode glassmorphic UI with custom CSS:
- Radial gradient background: `#1a1c2c → #0d0e15`
- Gradient CTA buttons: `linear-gradient(90deg, #4776E6, #8E54E9)` with hover scale
- Info cards: Glassmorphism cards in the sidebar (`rgba(255, 255, 255, 0.03)`, `blur(10px)`)
- Gradient header: `linear-gradient(90deg, #ff8a00, #e52e71)` for bold orange-to-pink text
- Status indicator: Real-time progress tracking via `st.status()`
  - "Planning strategic data nodes..."
  - "Research Synthesized!"
- Four-phase methodology sidebar: Strategic Planning → Deep Web Research → Cognitive Synthesis → Executive Reporting
InsightEngine AI is built as a 5-node directed acyclic graph in LangGraph:
```
┌────────────────────────────────────────────────────────────────────┐
│                      LangGraph State Machine                       │
│                                                                    │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐         │
│   │  1. PLANNER  │───▶│ 2. SEARCHER  │───▶│  3. SCRAPER  │         │
│   │              │    │              │    │              │         │
│   │ • Analyze    │    │ • DuckDuckGo │    │ • Regex URL  │         │
│   │   intent     │    │   top 5      │    │   extraction │         │
│   │ • 3-5 search │    │   results    │    │ • Top 3 URLs │         │
│   │   queries    │    │ • Structured │    │ • 4K chars   │         │
│   │ • Source     │    │   title/link │    │   per page   │         │
│   │   types      │    │   /snippet   │    │ • BS4 clean  │         │
│   └──────────────┘    └──────────────┘    └──────┬───────┘         │
│                                                  │                 │
│                                                  ▼                 │
│   ┌──────────────┐    ┌──────────────────────────────────┐         │
│   │  5. WRITER   │◀───│           4. ANALYST             │         │
│   │              │    │                                  │         │
│   │ • Title      │    │ • Cross-reference scraped data   │         │
│   │ • Exec       │    │ • Extract key facts              │         │
│   │   Summary    │    │ • Identify gaps                  │         │
│   │ • Key        │    │ • Falls back to snippets if      │         │
│   │   Findings   │    │   scraping returned empty        │         │
│   │ • Detailed   │    └──────────────────────────────────┘         │
│   │   Analysis   │                                                 │
│   │ • Conclusion │                                                 │
│   │ • Citations  │                                                 │
│   │      │                                                         │
│   │      └─────▶ END (report saved + displayed)                    │
│   └──────────────┘                                                 │
└────────────────────────────────────────────────────────────────────┘
```
The entire research pipeline passes a single `ResearchState` TypedDict through all nodes:

```python
from typing import List, TypedDict
from langchain_core.messages import BaseMessage

class ResearchState(TypedDict):
    query: str                   # Original user question
    plan: str                    # Planner output: numbered strategy list
    search_results: List[str]    # Raw search snippets from DuckDuckGo
    scraped_content: List[dict]  # [{"url": str, "content": str}, ...]
    findings: str                # Analyst's synthesized findings
    report: str                  # Writer's final Markdown report
    iteration: int               # Reserved for future multi-pass logic
    messages: List[BaseMessage]  # LangChain message accumulator
```

Three specialized prompts in `prompts.py` give each LLM call a distinct identity:
| Persona | Prompt | Role |
|---|---|---|
| Planner | `PLANNER_PROMPT` | Creates a numbered research plan: key info needed, 3-5 search queries, source type classification |
| Analyzer | `ANALYZER_PROMPT` | Receives all scraped content, extracts key facts, identifies gaps, cross-references sources |
| Writer | `WRITER_PROMPT` | Produces the final report: Title → Executive Summary → Key Findings (with citations) → Detailed Analysis → Conclusion |
Four LangChain `@tool`-decorated functions in `tools.py`:

| Tool | Input | Output | Source |
|---|---|---|---|
| `search_web` | Search query string | Formatted title/link/snippet blocks | DuckDuckGo (5 results) |
| `check_wikipedia` | Topic name | Wikipedia article summary | `WikipediaAPIWrapper` |
| `fetch_page` | URL string | Cleaned page text (up to 10K chars) | `web_scraper.scrape_url()` |
| `create_report` | Markdown content | File path confirmation | Writes to `outputs/reports/` |
- Python 3.10+
- A DeepSeek API key (get one here), or an Anthropic / OpenAI key as a fallback

```shell
git clone https://github.com/Ismail-2001/Autonomous-Research-Intelligence.git
cd Autonomous-Research-Intelligence
pip install -r research_agent/requirements.txt
```

Full dependency list (13 packages):
| Package | Purpose |
|---|---|
| `langchain` | Core framework for LLM orchestration |
| `langchain-community` | DuckDuckGo + Wikipedia tool integrations |
| `langgraph` | State machine graph builder (`StateGraph`, `END`) |
| `langchain-openai` | DeepSeek / OpenAI LLM connector |
| `langchain-anthropic` | Claude fallback LLM connector |
| `streamlit` | Web application framework |
| `duckduckgo-search` | Privacy-first web search API |
| `wikipedia` | Wikipedia article retrieval |
| `beautifulsoup4` | HTML parsing and content extraction |
| `python-dotenv` | `.env` file loading |
| `tiktoken` | Token counting for context window management |
| `pandas` | Data manipulation utilities |
| `requests` | HTTP client for web scraping |
Create a `.env` file in the `research_agent/` directory:

```shell
DEEPSEEK_API_KEY=your_deepseek_api_key_here

# Optional fallbacks (checked in priority order):
# ANTHROPIC_API_KEY=your_anthropic_key
# OPENAI_API_KEY=your_openai_key

# Optional configuration:
# DEBUG=True
# LOG_LEVEL=INFO
# USER_AGENT=ResearchAgent/1.0
```

Then launch the app:

```shell
streamlit run research_agent/app.py
```

The app will open at http://localhost:8501.
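For illustration, here is a dependency-free stand-in for what `python-dotenv` does at startup; the project itself calls `load_dotenv()`, so this sketch only shows the idea (parse `KEY=VALUE` lines, skip comments, never override existing env vars):

```python
import os

def load_env(path=".env"):
    """Minimal stand-in for python-dotenv's load_dotenv()."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, malformed lines
            key, _, value = line.partition("=")
            # setdefault: real environment variables win over .env entries
            os.environ.setdefault(key.strip(), value.strip())
```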
- Open the app in your browser
- Enter a complex research objective in the text input:
- "The next 10 years of Fusion Energy commercialization"
- "How is CRISPR being used in cancer treatment as of 2026?"
- "Compare the economic strategies of BRICS vs G7 nations"
- Click "Generate Deep Report"
- Watch the agent progress through its 5 stages:
  - "Planning strategic data nodes..."
  - "Research Synthesized!"
- Explore the results:
  - View Agent's Thinking Strategy: expand to see the full research plan
  - Key Research Findings: the analyst's synthesized facts and insights
  - Executive Summary & Full Analysis: the complete Markdown report
  - Download Executive Report: one-click `.md` file download
The left sidebar always shows the 4-phase research methodology:
- Strategic Planning: Agent analyzes intent and creates a search tree
- Deep Web Research: Autonomous browsing and extraction from verified sources
- Cognitive Synthesis: Cross-referencing for accuracy and depth
- Executive Reporting: Structured, citation-rich document generation
Every report follows a consistent professional structure:
```markdown
# [Report Title]

## Executive Summary
[2-3 paragraph overview of key findings]

## Key Findings
1. **[Finding 1]**: [details with source citation]
2. **[Finding 2]**: [details with source citation]
3. ...

## Detailed Analysis
[In-depth treatment of each finding, organized by theme]

## Conclusion
[Summary of implications and recommendations]

## Sources Accessed
- [URL 1]
- [URL 2]
- [URL 3]
```

Reports are automatically saved to `outputs/reports/` with timestamped filenames:

```
outputs/reports/research_report_20260220_181500.md
```
```
Autonomous-Research-Intelligence/
│
├── research_agent/
│   │
│   ├── agent/
│   │   ├── research_agent.py   # Core LangGraph state machine
│   │   │                       #  • ResearchState TypedDict (8 fields)
│   │   │                       #  • 5 graph nodes: planner → searcher → scraper → analyst → writer
│   │   │                       #  • get_llm() multi-provider factory (DeepSeek → Anthropic → OpenAI)
│   │   │                       #  • run_research_agent() entry point
│   │   │
│   │   ├── tools.py            # 4 LangChain @tool-decorated functions
│   │   │                       #  • search_web: DuckDuckGo (5 results, structured output)
│   │   │                       #  • check_wikipedia: WikipediaAPIWrapper integration
│   │   │                       #  • fetch_page: Full page scraping via web_scraper.scrape_url()
│   │   │                       #  • create_report: Timestamped .md file writer → outputs/reports/
│   │   │
│   │   └── prompts.py          # 3 specialized agent personas
│   │                           #  • PLANNER_PROMPT: Research strategy + search query generation
│   │                           #  • ANALYZER_PROMPT: Fact extraction + gap identification
│   │                           #  • WRITER_PROMPT: Executive report structure (Title → Conclusion)
│   │
│   ├── config/
│   │   └── settings.py         # Environment configuration
│   │                           #  • Multi-provider API key loading
│   │                           #  • DEBUG, LOG_LEVEL, USER_AGENT defaults
│   │                           #  • REPORT_OUTPUT_DIR auto-creation
│   │
│   ├── utils/
│   │   ├── web_scraper.py      # Deep scraping engine
│   │   │                       #  • BeautifulSoup4 HTML parsing
│   │   │                       #  • Script/style tag removal
│   │   │                       #  • Chrome User-Agent header spoofing
│   │   │                       #  • 10s timeout, 10K char limit
│   │   │
│   │   └── text_processor.py   # Content utilities
│   │                           #  • clean_text(): whitespace normalization
│   │                           #  • chunk_text(): 4K-char chunking for LLM context
│   │
│   ├── app.py                  # Streamlit application (144 lines)
│   │                           #  • Custom CSS: radial gradient, glassmorphism, gradient buttons
│   │                           #  • Sidebar: methodology cards + branding
│   │                           #  • Main: text input → status progress → results panel
│   │                           #  • Expandable agent plan view
│   │                           #  • Download button for final report
│   │
│   ├── requirements.txt        # 13 Python dependencies
│   └── .env                    # API key configuration (gitignored)
│
└── outputs/
    └── reports/                # Auto-generated research reports (timestamped .md files)
```
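The `text_processor.py` utilities listed in the tree can be sketched as follows (illustrative, assuming simple whitespace collapsing and fixed-size character chunking):

```python
def clean_text(text):
    """Collapse all runs of whitespace into single spaces."""
    return " ".join(text.split())

def chunk_text(text, size=4_000):
    """Split cleaned page text into fixed-size chunks for the LLM context."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```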
| Category | Library | Purpose |
|---|---|---|
| Language | Python 3.10+ | Core runtime |
| LLM (Primary) | DeepSeek V3 via `langchain-openai` | Research planning, analysis, report writing |
| LLM (Fallback 1) | Claude 3 Sonnet via `langchain-anthropic` | Alternative LLM provider |
| LLM (Fallback 2) | GPT-4o via `langchain-openai` | Third-tier fallback |
| Orchestration | LangGraph `StateGraph` | 5-node directed acyclic state machine |
| Web Search | `duckduckgo-search` | Privacy-first search, no API key needed |
| Knowledge Base | Wikipedia via `langchain-community` | Encyclopedic context and definitions |
| Web Scraping | BeautifulSoup4 + Requests | Full page content extraction with HTML cleaning |
| Tokenization | TikToken | Token counting for context window awareness |
| Frontend | Streamlit | Interactive web UI with custom CSS injection |
| Config | `python-dotenv` | Secure `.env` loading for API keys |
- Push your repo to GitHub
- Go to share.streamlit.io
- Select your repository and set Main file path to `research_agent/app.py`
- Add Secrets in the dashboard:

  ```
  DEEPSEEK_API_KEY = "your_key_here"
  ```

- Click Deploy
```dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r research_agent/requirements.txt
EXPOSE 8501
CMD ["streamlit", "run", "research_agent/app.py", "--server.port=8501", "--server.headless=true"]
```

```yaml
services:
  - type: web
    name: insightengine-ai
    runtime: python
    buildCommand: pip install -r research_agent/requirements.txt
    startCommand: streamlit run research_agent/app.py --server.port $PORT --server.headless true
    envVars:
      - key: DEEPSEEK_API_KEY
        sync: false
```

- 5-node LangGraph state machine (Planner → Searcher → Scraper → Analyst → Writer)
- DuckDuckGo web search (5 results per query, structured output)
- Deep web scraping with BeautifulSoup4 (top 3 URLs, 4K chars per page)
- Wikipedia knowledge integration via LangChain
- Multi-LLM fallback: DeepSeek → Anthropic → OpenAI
- Three specialized agent personas (Planner, Analyzer, Writer)
- Executive report generation with citation trail
- Auto-save to `outputs/reports/` with timestamps
- Download button in Streamlit UI
- Premium glassmorphic dark-mode interface
- Text processor utilities (cleaning + chunking)
- Iterative Research Loop: Re-run the Planner → Search → Scrape cycle with refined queries based on gaps identified by the Analyst
- Academic Search Integration: arXiv, Google Scholar, Semantic Scholar APIs for peer-reviewed sources
- PDF Document Parsing: Extract content from research papers and reports linked in search results
- Source Credibility Scoring: Rank sources by domain authority, publishing date, and citation count
- Multi-Format Export: PDF, DOCX, and HTML report generation alongside Markdown
- Evidence Mapping: Visual graph showing how sources connect to key findings
- Comparison Reports: Side-by-side analysis of two competing topics or viewpoints
- Report History Dashboard: Browse, search, and re-open past research sessions
- Team Workspaces: Shared research library with role-based access
- Scheduled Research Monitoring: Weekly automated reports on evolving topics
- RAG Integration: Upload internal documents as additional knowledge sources
- Multi-Agent Debate: Two analyst agents argue opposing perspectives before synthesis
Contributions are welcome! High-impact areas:
- New search providers: add Brave Search, SerpAPI, or Google Scholar tools in `tools.py`
- New agent personas: extend `prompts.py` with domain-specific analysts (e.g., Medical, Legal, Financial)
- Output formats: add PDF/DOCX export in the `create_report` tool
- Scraper improvements: handle JavaScript-rendered pages, pagination, or anti-bot measures
- UI enhancements: add report history, search within past reports, or comparative views
To contribute:
- Fork the repository
- Create a feature branch: `git checkout -b feature/your-feature`
- Commit with Conventional Commits: `git commit -m "feat: add arXiv search tool"`
- Push and open a Pull Request against `main`
Distributed under the MIT License. See LICENSE for details.
Built for researchers who demand depth, not just snippets.
If InsightEngine AI changed how you approach research, star the repo.
Built with ❤️ by Ismail Sajid