llm-judge

A Streamlit web app that uses a Groq-powered LLM (Llama 3) to act as an impartial judge for evaluating and comparing two model outputs. Supports custom criteria, presets like creativity and brand tone, and returns structured scores, explanations, and a winner. Built end-to-end with Python, Groq API, and Streamlit.

python code-evaluation a-b-testing text-evaluation groq streamlit model-benchmarking ai-automation ai-evaluation llm prompt-evaluation llama3 llm-judge output-evaluation scoring-framework

Updated Nov 24, 2025
Python
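
The judging flow described above (two outputs, a list of criteria, and a structured verdict with scores, an explanation, and a winner) might look roughly like the sketch below. It is not taken from the repository: the prompt wording, the 1-10 scale, the JSON shape, and the names `build_judge_prompt` and `parse_verdict` are all assumptions made for illustration. The call to the model is left out, so the raw reply would come from whichever LLM client is in use.

```python
import json

# Hypothetical prompt template; the real app's wording is not known.
JUDGE_TEMPLATE = (
    "You are an impartial judge. Compare two responses to the same task.\n"
    "Criteria: {criteria}\n\n"
    "Task: {task}\n\nResponse A:\n{a}\n\nResponse B:\n{b}\n\n"
    'Reply with JSON only: {{"scores": {{"A": <1-10>, "B": <1-10>}}, '
    '"explanation": "<text>", "winner": "A" | "B" | "tie"}}'
)


def build_judge_prompt(task, a, b, criteria):
    """Fill the template with the task, both outputs, and the criteria."""
    return JUDGE_TEMPLATE.format(
        criteria=", ".join(criteria), task=task, a=a, b=b
    )


def parse_verdict(raw):
    """Parse the judge's JSON reply and check it against the assumed schema."""
    data = json.loads(raw)
    scores = data["scores"]
    for key in ("A", "B"):
        # Assumed scale: integer or float scores from 1 to 10 inclusive.
        if not 1 <= scores[key] <= 10:
            raise ValueError(f"score for {key} out of range: {scores[key]}")
    if data["winner"] not in {"A", "B", "tie"}:
        raise ValueError(f"unexpected winner: {data['winner']!r}")
    return data
```

Validating the reply before displaying it is one way to guard against a judge model that returns malformed or out-of-range output.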

Padraigobrien08 / Agentic-GenAI-Capstone-Google-Kaggle

Agent QA Mentor: an agentic QA pipeline that evaluates tool-using AI agent trajectories (scores, issue codes, safety/hallucination detection), rewrites prompts with targeted fixes, and stores long-term memory for continuous improvement—plus a CI-style eval gate and demo notebook.

python qa multi-agent-systems ai-agents gemini-api qa-automation pydantic vector-database llms agentic llm-judge agent-evaluation

Updated Dec 2, 2025
Jupyter Notebook
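
A CI-style eval gate of the kind mentioned above could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: the `TrajectoryEval` record, the 0.0-1.0 score scale, the 0.7 threshold, and the `SAFETY` and `HALLUCINATION` issue codes are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryEval:
    """Hypothetical record for one evaluated agent trajectory."""
    trajectory_id: str
    score: float  # assumed scale: 0.0 (worst) to 1.0 (best)
    issue_codes: list = field(default_factory=list)


def eval_gate(evals, min_mean_score=0.7,
              blocking_codes=frozenset({"SAFETY", "HALLUCINATION"})):
    """Return (passed, reasons); the gate fails if any reason is recorded."""
    if not evals:
        return False, ["no evaluations supplied"]
    reasons = []
    mean = sum(e.score for e in evals) / len(evals)
    if mean < min_mean_score:
        reasons.append(f"mean score {mean:.2f} below {min_mean_score:.2f}")
    for e in evals:
        # Any blocking issue code fails the gate regardless of the score.
        hit = blocking_codes.intersection(e.issue_codes)
        if hit:
            reasons.append(f"{e.trajectory_id}: blocking issues {sorted(hit)}")
    return (not reasons), reasons
```

In a CI job, a script could exit with a non-zero status whenever `passed` is false, which is what would make the check behave as a gate.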

Here are 6 public repositories matching this topic...

haizelabs / verdict

DJMuRo4ever / Prompt_Eval_LLM_Judge

Anmolian / Prompt_Eval_LLM_Judge

PabloCabaleiro / pondera

syed-waleed-ahmed / LLM-as-Judge

Padraigobrien08 / Agentic-GenAI-Capstone-Google-Kaggle
