
Why Do Results from MCP-Trained Models Differ Greatly Between generate_benchmarks.py and train.py #417

@wm19999

Description


Why does a model trained via MCP show such a large discrepancy between its results when evaluated with generate_benchmarks.py and the outcomes reported by train.py?
A preliminary investigation suggests the root cause may be the following logic: in generate_benchmarks.py, the model must call the complete_task tool to be considered to have finished a task, whereas train.py implements no such check. Is this the reason for the large deviation?
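
For illustration, here is a minimal sketch of the kind of success rule I mean (the message and tool-call field names are assumptions based on the OpenAI chat format, not the repo's actual code):

    # Hypothetical illustration only: a rollout counts as finished solely if the
    # assistant emitted a complete_task tool call somewhere in its messages.
    def rollout_completed(messages: list[dict]) -> bool:
        for message in messages:
            for call in message.get("tool_calls") or []:
                if call["function"]["name"] == "complete_task":
                    return True
        return False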

generate_benchmarks.py

    qwen3_4b_instruct = art.Model(
        name="qwen3-4b-instruct",
        project=server,
        inference_model_name="qwen3-4b-instruct",
        inference_base_url="http://localhost:8082/v1",
        inference_api_key="dummy",
        inference_timeout=3600,
    )

The LoRA checkpoint is served with vLLM:

    source /*****/miniconda3/bin/activate ART
    BASE_MODEL_PATH="/*****/Qwen3-4B-Instruct-2507"
    LORA_PATH="/*****/examples/mcp-rl/.art/mcp-agent-training/models/mcp-4b-001/checkpoints/0017"

    CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
        --model "$BASE_MODEL_PATH" \
        --served-model-name "mcp-4b-001-finetuned" \
        --enable-lora \
        --lora-modules mcp-4b-001="$LORA_PATH" \
        --host 0.0.0.0 \
        --port 8082 \
        --trust-remote-code \
        --enable-auto-tool-choice \
        --tool-call-parser hermes \
        --max-model-len 16384 \
        --tensor-parallel-size 2
In generate_benchmarks.py, the fine-tuned model is configured as:

    mcp_4b_001_finetuned = art.Model(
        name="mcp-4b-001-finetuned",
        project=server,
        inference_model_name="mcp-4b-001-finetuned",
        inference_base_url="http://localhost:8082/v1",
        inference_api_key="dummy",
        inference_timeout=3600,
    )
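
As a quick sanity check (not code from the repo), the model names the vLLM server actually exposes can be listed with the OpenAI client, to confirm that inference_model_name matches one of them; the LoRA adapter is served under the name given to --lora-modules, while --served-model-name covers the base model:

    # Quick sanity check: print the model ids the vLLM server exposes and
    # confirm that the name passed to art.Model appears among them.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8082/v1", api_key="dummy")
    for model in client.models.list():
        print(model.id)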

This setup works to some extent, but there is still a significant gap between the benchmark performance and the validation results obtained during training.

I would like to know: during training, is it also mandatory for the model to call the complete_task tool for a rollout to count as successfully completed? When I ran your benchmark, the trained model tended not to call complete_task to end the task, which resulted in an evaluation success rate of 0.
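
For illustration, this is a rough way to quantify how often the benchmark rollouts actually call complete_task (an ad-hoc helper assuming one OpenAI-style message list per rollout; field names are assumptions, not repo code):

    # Ad-hoc diagnostic: fraction of benchmark rollouts in which the assistant
    # ever issues a complete_task tool call.
    def complete_task_rate(trajectories: list[list[dict]]) -> float:
        def called(messages: list[dict]) -> bool:
            return any(
                call["function"]["name"] == "complete_task"
                for message in messages
                for call in (message.get("tool_calls") or [])
            )

        if not trajectories:
            return 0.0
        return sum(called(messages) for messages in trajectories) / len(trajectories)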
