This repository implements a multi-agent, entropy-driven decoding system for large language models (LLMs), using parallel sampling and self-evaluation to generate coherent long-form text.
Rather than sampling from a single decoding configuration, multiple LLM workers propose next tokens in parallel under different temperature / top-k settings. At each step, the system selects the lowest-entropy proposal — the one the model is most confident about — to extend the text. A secondary “critic” model periodically evaluates the coherence of the generated text and rolls back recent tokens if quality degrades.
Traditional LLM decoding (greedy, sampling, top-k, nucleus, etc.) commits to a single temperature and sampling strategy. This project explores a parallel confidence-based generation loop, where each decoding job acts as an agent exploring a slightly different probabilistic regime.
Key idea:

> "Let multiple minds propose, but let entropy decide."
Each agent produces its candidate next token, entropy (uncertainty), and temperature/top-k metadata. The system selects the token with the lowest entropy — the one representing the most decisive prediction — while still benefiting from diverse search branches.
To maintain global coherence, a secondary evaluator periodically reviews the text and reverts to a stable checkpoint when coherence drops.
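In pseudocode, the control loop looks roughly like this (a minimal sketch: `propose` and `judge` are illustrative names, not the script's actual API):

```python
def generate(workers, critic, tokens, max_tokens, check_every=5, rollback=5):
    for step in range(max_tokens):
        # Every agent proposes a next token under its own (temperature, top_k).
        proposals = [worker.propose(tokens) for worker in workers]
        # Entropy decides: keep the most confident (lowest-entropy) proposal.
        token, entropy = min(proposals, key=lambda p: p[1])
        tokens.append(token)
        # Periodically let the critic veto the recent continuation.
        if (step + 1) % check_every == 0 and critic.judge(tokens) == "BAD":
            del tokens[-rollback:]  # roll back and continue from the shorter text
    return tokens
```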
- **Parallel Generation:** multiple threads run across multiple GPUs (`cuda:0`, `cuda:1`), each loading a quantized copy of the base model (e.g., `Qwen/Qwen3-0.6B`); see the worker sketch after this list.
- **Entropy-Based Selection:** each worker:
  - computes the softmax probabilities for the next token,
  - samples under its `(temperature, top_k)` configuration,
  - returns the sampled token and its entropy.

  The controller then picks the lowest-entropy token (as in the `propose_token` sketch below).
- **Self-Evaluation Loop:** every few steps, another instance of the same base model (used as a critic) judges text quality with a simple prompt:

  ```
  Text: ...
  Question: Is this text coherent and not nonsense? Reply GOOD or BAD.
  ```

  If it replies "BAD," the system rolls back several tokens and re-generates (a critic sketch follows this list).
- **Logging:** each step's results (tokens, entropy, chosen candidate, evaluator verdicts) are written to `generation_log.txt`.
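As a concrete sketch of the per-worker logic (illustrative only: the 4-bit quantization settings and function names are assumptions, not necessarily the repository's exact code):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_worker(model_id="Qwen/Qwen3-0.6B", device="cuda:0"):
    # 4-bit quantization (bitsandbytes) lets several copies fit in memory at once.
    quant = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=quant, device_map={"": device}
    )
    return tokenizer, model

@torch.no_grad()
def propose_token(model, input_ids, temperature, top_k):
    logits = model(input_ids).logits[0, -1]  # next-token logits
    probs = F.softmax(logits, dim=-1)
    # Entropy of the full distribution: low entropy = a decisive prediction.
    entropy = -(probs * torch.log(probs + 1e-12)).sum().item()
    # Sample under this worker's (temperature, top_k) configuration.
    topk_vals, topk_idx = torch.topk(logits / temperature, top_k)
    choice = torch.multinomial(F.softmax(topk_vals, dim=-1), 1)
    return topk_idx[choice].item(), entropy
```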
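And a sketch of the critic check, continuing the code above (the prompt wording follows the snippet in the list; the greedy decode and string parsing are assumptions):

```python
@torch.no_grad()
def judge(model, tokenizer, text):
    prompt = (
        f"Text: {text}\n"
        "Question: Is this text coherent and not nonsense? Reply GOOD or BAD.\n"
        "Answer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, max_new_tokens=3, do_sample=False)
    reply = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:])
    return "BAD" if "BAD" in reply.upper() else "GOOD"
```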
| Parameter | Description | Default |
|---|---|---|
| `MODEL_ID` | Base model to use | `Qwen/Qwen3-0.6B` |
| `NUM_JOBS` | Number of decoding configurations | 8 |
| `MAX_PARALLEL` | Max threads per step | 8 |
| `MAX_TOKENS` | Maximum output length | 25000 |
| `ROLLBACK` | Tokens to roll back after a "BAD" verdict | 5 |
| `CHECK_EVERY` | Evaluate after this many steps | 5 |
| `DEVICES` | List of CUDA devices | `["cuda:0", "cuda:1"]` |
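A plausible layout for these settings as module-level constants (a sketch; the values are the defaults from the table, and the exact placement inside `parallel_entropy_gen.py` may differ):

```python
# Default configuration (assumed to live at the top of parallel_entropy_gen.py).
MODEL_ID = "Qwen/Qwen3-0.6B"    # base model for workers and critic
NUM_JOBS = 8                    # distinct (temperature, top_k) configurations
MAX_PARALLEL = 8                # max threads per generation step
MAX_TOKENS = 25000              # maximum output length
ROLLBACK = 5                    # tokens reverted after a "BAD" verdict
CHECK_EVERY = 5                 # steps between critic evaluations
DEVICES = ["cuda:0", "cuda:1"]  # CUDA devices to spread workers across
```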
```bash
python parallel_entropy_gen.py
```

Example log snippet:

```
Step 42: chosen=the (entropy=1.283)
Evaluator after step 45: GOOD
Step 50: chosen=universe (entropy=0.993)
Evaluator after step 55: BAD → rolling back 5 tokens
```
- Final text: printed to the console and appended to the end of the log file.
- Log file (`generation_log.txt`): includes per-step tokens, entropy values, the chosen configuration, and evaluator feedback.
- Implement beam-style entropy fusion across time steps.
- Add semantic evaluators using larger critic models (e.g., GPT-based reward model).
- Support dynamic agent allocation based on entropy variance.
- Visualize entropy progression and rollback patterns.
```bash
pip install torch transformers accelerate
```

Optional (for multi-GPU / quantized inference):

```bash
pip install bitsandbytes
```

This project introduces a self-correcting, entropy-aware decoding framework that balances diversity and coherence without fine-tuning. It can be extended for:
- Reinforcement-learning-free text quality improvement,
- Multi-agent model orchestration,
- Online adaptive temperature control.
In essence: multiple sampling heads act as explorers, entropy acts as a judge, and a critic model ensures sanity.