A comprehensive, dual-stack solution for the modern ad-tech ecosystem. This project delivers a forensic SQL investigation into network fraud and a high-throughput Python automation pipeline for competitive intelligence.
It demonstrates the transition from raw log analysis to production-grade anomaly detection and automated lead qualification.
- Fraud Detection: In a high-volume ad network, distinguishing legitimate traffic from bot farms or click-spam requires more than simple aggregation. It demands statistical rigor to identify "invisible" anomalies in conversion rates (CR).
- Competitive Intelligence: Manually identifying competitor publishers is slow and unscalable. The challenge is to automate the discovery, validation, and enrichment of thousands of domains while bypassing WAFs and detecting ad tech stacks without rendering.
A series of advanced SQL queries utilizing Window Functions and Statistical Process Control (SPC).
- Methodology: Moved beyond static thresholds to dynamic Z-Score analysis, flagging traffic sources that deviate >1.96 standard deviations from the advertiser's mean.
- Outcome: Pinpointed specific tag IDs responsible for "Click Spam" (low CR, high volume) and "Attribution Fraud" (impossible CRs >70%).
A robust, concurrent scraper utilizing a Map-Reduce architecture.
- Architecture:
ProcessPoolExecutorfor CPU-bound parsing +asynciofor I/O-bound requests. - Data Engine: Integrated Polars for O(1) filtering and DuckDB for in-memory QA.
- Intelligence: Deployed a custom "Manual ML" heuristic engine to detect ad signatures (e.g., Prebid, Google Syndication) with zero external dependencies.
- Ingest: Loads competitor runtime domains and existing client lists via Polars for sub-millisecond set operations.
- Map (Parallel Execution): Spawns a dedicated process per competitor.
- Worker: Fetches API data → Deduplicates against client list → Scrapes HTML asynchronously.
- Enrichment: Extracts contacts (Regex) and detects ads (Heuristic/GenAI).
- Reduce (Aggregation): Consolidates results into a single Polars DataFrame.
- QA: Runs SQL integrity checks via DuckDB before CSV export.
- Ingest: Raw
impressionsandcampaignstables. - Aggregation: Hourly/Daily rollups via CTEs.
- Metrics: Calculation of CR, Fill Rate, and eCPM.
- Anomaly Detection:
- Z-Score Calculation:
(Tag_CR - Avg_Adv_CR) / StdDev_Adv_CR - Bot Fingerprinting: High IP density + Device Monoculture (100% Mobile).
- Z-Score Calculation:
.
├── queries_shoval_benjer.sql # Snowflake SQL investigation & fraud logic
├── python_Shoval_Benjer.py # Competitive intelligence automation pipeline
├── final_output.csv # Generated dataset from the Python pipeline
├── AdMaven_SQL_Investigation_Report.pdf # Detailed analytical findings
├── requirements.txt # Dependencies (Polars, DuckDB, Loguru, etc.)
└── README.md # This file
Instead of arbitrary rules (e.g., "CR < 0.1%"), I implemented a statistical framework:
CASE
WHEN (tag_cr - avg_cr) / NULLIF(std_cr, 0) < -1.96 THEN 'FRAUD_CONFIRMED'
ELSE 'REVIEW_REQUIRED'
ENDThis dynamically adapts to each advertiser's baseline, reducing false positives for naturally low-converting verticals.
The script creates a "Process Pool" where each process manages its own "Event Loop".
- Why? Parsing HTML (CPU) blocks the Event Loop. Network requests (I/O) block the CPU.
- Result: By combining
multiprocessingandasyncio, we saturate both the Network Bandwidth and CPU Cores, achieving maximum throughput.
A lightweight, zero-latency classifier replacing heavy ML models.
- Logic: Scans HTML for weighted tokens:
{'googlesyndication': 1.5, 'prebid': 1.2, 'iframe': 0.2}. - Inference:
Score = Sum(Weights). IfScore > 2.0, the site is classified asis_running_ads=True.
pip install -r requirements.txtCreate a .env file for API keys (optional for GenAI features):
API_KEY=your_email@example.com
HF_TOKEN=your_huggingface_tokenRun the Python pipeline:
python python_Shoval_Benjer.py- Input:
comp_run_time_domains.csv,our_clients.csv - Output:
final_output.csv
- The "Bot Blast": Advertiser
601040experienced a 400% traffic spike with near-zero conversions, a classic DDoS-style click flood. - The "Impossible Campaign": Campaign
653344showed a 75% conversion rate across 10k+ impressions, a mathematical impossibility suggesting pixel stuffing. - Geographic Arbitrage: Traffic from the Philippines (PH) surged while CR dropped by 60%, indicating a shift to cheaper, lower-quality inventory sources.

