AdMaven Analytics: Detection & Intelligence Suite


A comprehensive, dual-stack solution for the modern ad-tech ecosystem. This project delivers a forensic SQL investigation into network fraud and a high-throughput Python automation pipeline for competitive intelligence.

It demonstrates the transition from raw log analysis to production-grade anomaly detection and automated lead qualification.


1. Executive Summary

Challenges

  • Fraud Detection: In a high-volume ad network, distinguishing legitimate traffic from bot farms or click-spam requires more than simple aggregation. It demands statistical rigor to identify "invisible" anomalies in conversion rates (CR).
  • Competitive Intelligence: Manually identifying competitor publishers is slow and unscalable. The challenge is to automate the discovery, validation, and enrichment of thousands of domains while bypassing WAFs and detecting ad tech stacks without rendering.

Solution

Component A: Forensic SQL Analytics (Snowflake)

A series of advanced SQL queries utilizing Window Functions and Statistical Process Control (SPC).

  • Methodology: Moved beyond static thresholds to dynamic Z-Score analysis, flagging traffic sources that deviate >1.96 standard deviations from the advertiser's mean.
  • Outcome: Pinpointed specific tag IDs responsible for "Click Spam" (low CR, high volume) and "Attribution Fraud" (impossible CRs >70%).

Component B: High-Performance Python Pipeline

A robust, concurrent scraper utilizing a Map-Reduce architecture.

  • Architecture: ProcessPoolExecutor for CPU-bound parsing + asyncio for I/O-bound requests.
  • Data Engine: Integrated Polars for hash-based set filtering and DuckDB for in-memory SQL QA.
  • Intelligence: Deployed a custom "Manual ML" heuristic engine to detect ad signatures (e.g., Prebid, Google Syndication) with zero external dependencies.

2. System Architecture

(System architecture diagram)

Python Pipeline: Map-Reduce Intelligence

  1. Ingest: Loads competitor runtime domains and existing client lists via Polars for sub-millisecond set operations.
  2. Map (Parallel Execution): Spawns a dedicated process per competitor.
    • Worker: Fetches API data → Deduplicates against client list → Scrapes HTML asynchronously.
    • Enrichment: Extracts contacts (Regex) and detects ads (Heuristic/GenAI).
  3. Reduce (Aggregation): Consolidates results into a single Polars DataFrame.
  4. QA: Runs SQL integrity checks via DuckDB before CSV export (see the sketch below).
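
A condensed sketch of this flow, assuming a "domain" column in both CSVs and a placeholder fetch_and_enrich worker; the production logic lives in python_Shoval_Benjer.py:

# Minimal map-reduce sketch (illustrative, not the production pipeline).
from concurrent.futures import ProcessPoolExecutor

import duckdb
import polars as pl


def fetch_and_enrich(domain: str) -> dict:
    # Placeholder for the real worker: API fetch -> dedup -> async scrape -> enrichment.
    return {"domain": domain, "is_running_ads": False, "contact": None}


def run_pipeline() -> pl.DataFrame:
    competitors = pl.read_csv("comp_run_time_domains.csv")
    clients = pl.read_csv("our_clients.csv")

    # Ingest: anti-join drops domains already on the client list.
    targets = competitors.join(clients, on="domain", how="anti")

    # Map: domains are fanned out across worker processes.
    with ProcessPoolExecutor() as pool:
        rows = list(pool.map(fetch_and_enrich, targets["domain"].to_list()))

    # Reduce: consolidate worker outputs into a single Polars DataFrame.
    result = pl.DataFrame(rows)

    # QA: DuckDB can query the in-scope Polars frame directly for integrity checks.
    dupes = duckdb.sql(
        "SELECT domain, COUNT(*) AS n FROM result GROUP BY domain HAVING n > 1"
    ).fetchall()
    assert not dupes, f"Duplicate domains found: {dupes}"

    result.write_csv("final_output.csv")
    return result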

SQL Analytics: The Fraud Detection DAG

  1. Ingest: Raw impressions and campaigns tables.
  2. Aggregation: Hourly/Daily rollups via CTEs.
  3. Metrics: Calculation of CR, Fill Rate, and eCPM.
  4. Anomaly Detection:
    • Z-Score Calculation: (Tag_CR - Avg_Adv_CR) / StdDev_Adv_CR
    • Bot Fingerprinting: high IP density combined with device monoculture (e.g., 100% mobile traffic); see the sketch below.
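
A hedged sketch of the fingerprinting step, run through DuckDB for illustration; the impressions column names (tag_id, ip, device_type) and the thresholds are assumptions, and the real logic lives in queries_shoval_benjer.sql:

# Illustrative bot-fingerprinting query (assumed schema: impressions(tag_id, ip, device_type)).
import duckdb

SUSPECT_TAGS_SQL = """
SELECT
    tag_id,
    COUNT(*)                      AS impressions,
    COUNT(*) / COUNT(DISTINCT ip) AS ip_density,   -- impressions per unique IP
    AVG(CASE WHEN device_type = 'Mobile' THEN 1 ELSE 0 END) AS mobile_share
FROM impressions
GROUP BY tag_id
HAVING COUNT(*) / COUNT(DISTINCT ip) > 50          -- illustrative density threshold
   AND AVG(CASE WHEN device_type = 'Mobile' THEN 1 ELSE 0 END) = 1.0  -- device monoculture
ORDER BY ip_density DESC
"""

def suspect_tags(con: duckdb.DuckDBPyConnection):
    # Returns the flagged tags as a Polars DataFrame.
    return con.sql(SUSPECT_TAGS_SQL).pl()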

3. Repository Structure

.
├── queries_shoval_benjer.sql   # Snowflake SQL investigation & fraud logic
├── python_Shoval_Benjer.py     # Competitive intelligence automation pipeline
├── final_output.csv            # Generated dataset from the Python pipeline
├── AdMaven_SQL_Investigation_Report.pdf  # Detailed analytical findings
├── requirements.txt            # Dependencies (Polars, DuckDB, Loguru, etc.)
└── README.md                   # This file

4. Technical Deep Dive

4.1 SQL: Statistical Anomaly Detection

Instead of arbitrary rules (e.g., "CR < 0.1%"), I implemented a statistical framework:

CASE 
    WHEN (tag_cr - avg_cr) / NULLIF(std_cr, 0) < -1.96 THEN 'FRAUD_CONFIRMED'
    ELSE 'REVIEW_REQUIRED'
END

This dynamically adapts to each advertiser's baseline, reducing false positives for naturally low-converting verticals.
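
The same pattern is easy to prototype outside Snowflake. A minimal Polars sketch, assuming an aggregated frame with advertiser_id, tag_id, and tag_cr columns (the production version is the SQL above):

# Z-score anomaly flagging with Polars (illustrative; column names are assumptions).
import polars as pl

def flag_anomalies(df: pl.DataFrame, threshold: float = -1.96) -> pl.DataFrame:
    """Flag tags whose CR sits far below their advertiser's baseline."""
    return df.with_columns(
        # Per-advertiser window: (tag_cr - advertiser mean) / advertiser std dev.
        # Note: a zero std dev yields inf/null here; the SQL version guards with NULLIF.
        z_score=(
            (pl.col("tag_cr") - pl.col("tag_cr").mean().over("advertiser_id"))
            / pl.col("tag_cr").std().over("advertiser_id")
        )
    ).with_columns(
        verdict=pl.when(pl.col("z_score") < threshold)
        .then(pl.lit("FRAUD_CONFIRMED"))
        .otherwise(pl.lit("REVIEW_REQUIRED"))
    )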

4.2 Python: Hybrid Concurrency

The script creates a process pool in which each worker process runs its own asyncio event loop.

  • Why? HTML parsing is CPU-bound and blocks the event loop, while network requests are I/O-bound and leave the CPU idle when handled synchronously.
  • Result: Combining multiprocessing and asyncio saturates both network bandwidth and CPU cores, maximizing throughput.
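
A minimal sketch of this pattern, assuming aiohttp for the async HTTP layer; the function names here (scrape_batch, fetch_all) are illustrative, not the script's actual API:

# Hybrid concurrency sketch: processes for CPU-bound work, asyncio for I/O-bound fetching.
import asyncio
from concurrent.futures import ProcessPoolExecutor

import aiohttp


async def fetch(session: aiohttp.ClientSession, url: str) -> str:
    async with session.get(url, timeout=aiohttp.ClientTimeout(total=15)) as resp:
        return await resp.text()


async def fetch_all(urls: list[str]) -> list:
    # Failed fetches come back as exception objects instead of killing the batch.
    async with aiohttp.ClientSession() as session:
        return await asyncio.gather(*(fetch(session, u) for u in urls), return_exceptions=True)


def scrape_batch(urls: list[str]) -> list:
    # Each worker process owns its own event loop, so CPU-heavy parsing in one
    # process never stalls the fetches running in another.
    return asyncio.run(fetch_all(urls))


def scrape_all(batches: list[list[str]]) -> list:
    with ProcessPoolExecutor() as pool:
        return list(pool.map(scrape_batch, batches))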

4.3 Heuristic Ad Engine

A lightweight, low-latency classifier that stands in for heavier ML models.

  • Logic: Scans HTML for weighted tokens: {'googlesyndication': 1.5, 'prebid': 1.2, 'iframe': 0.2}.
  • Inference: Score = Sum(Weights). If Score > 2.0, the site is classified as is_running_ads=True.
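
A minimal version of that scorer, using the example weights and the 2.0 threshold described above:

# Heuristic ad detector: weighted token counting over raw HTML, no ML dependencies.
AD_SIGNATURES = {"googlesyndication": 1.5, "prebid": 1.2, "iframe": 0.2}
THRESHOLD = 2.0


def is_running_ads(html: str) -> bool:
    text = html.lower()
    score = sum(weight for token, weight in AD_SIGNATURES.items() if token in text)
    return score > THRESHOLD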

5. Setup & Usage

Installation

pip install -r requirements.txt

Configuration

Create a .env file for API keys (optional for GenAI features):

API_KEY=your_email@example.com
HF_TOKEN=your_huggingface_token

Execution

Run the Python pipeline:

python python_Shoval_Benjer.py
  • Input: comp_run_time_domains.csv, our_clients.csv
  • Output: final_output.csv

6. Key Insights (from Dashboard)

  • The "Bot Blast": Advertiser 601040 experienced a 400% traffic spike with near-zero conversions, a classic DDoS-style click flood.
  • The "Impossible Campaign": Campaign 653344 showed a 75% conversion rate across 10k+ impressions, a mathematical impossibility suggesting pixel stuffing.
  • Geographic Arbitrage: Traffic from the Philippines (PH) surged while CR dropped by 60%, indicating a shift to cheaper, lower-quality inventory sources.
