
Why Do Results from MCP-Trained Models Differ Greatly Between generate_benchmarks.py and train.py #417

@wm19999

Description


Why does a model trained via MCP show such a large discrepancy between its results when evaluated with generate_benchmarks.py and the outcomes reported by train.py?
A preliminary investigation suggests the root cause may be the following logic: in generate_benchmarks.py, the model must call the complete_task tool to be considered to have finished a task, whereas train.py implements no such check. Is this the reason for the large deviation?
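
For illustration, here is a minimal sketch of the kind of success rule I mean (the message and tool-call field names are assumptions based on the OpenAI chat format, not the repo's actual code):

    # Hypothetical illustration only: a rollout counts as finished solely if the
    # assistant emitted a complete_task tool call somewhere in its messages.
    def rollout_completed(messages: list[dict]) -> bool:
        for message in messages:
            for call in message.get("tool_calls") or []:
                if call["function"]["name"] == "complete_task":
                    return True
        return False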

generate_benchmarks.py

    qwen3_4b_instruct = art.Model(
        name="qwen3-4b-instruct",
        project=server,
        inference_model_name="qwen3-4b-instruct",
        inference_base_url="http://localhost:8082/v1",
        inference_api_key="dummy",
        inference_timeout=3600,
    )

The LoRA checkpoint is served with vLLM:

    source /*****/miniconda3/bin/activate ART
    BASE_MODEL_PATH="/*****/Qwen3-4B-Instruct-2507"
    LORA_PATH="/*****/examples/mcp-rl/.art/mcp-agent-training/models/mcp-4b-001/checkpoints/0017"

    CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
        --model "$BASE_MODEL_PATH" \
        --served-model-name "mcp-4b-001-finetuned" \
        --enable-lora \
        --lora-modules mcp-4b-001="$LORA_PATH" \
        --host 0.0.0.0 \
        --port 8082 \
        --trust-remote-code \
        --enable-auto-tool-choice \
        --tool-call-parser hermes \
        --max-model-len 16384 \
        --tensor-parallel-size 2
In generate_benchmarks.py, the fine-tuned model is configured as:

    mcp_4b_001_finetuned = art.Model(
        name="mcp-4b-001-finetuned",
        project=server,
        inference_model_name="mcp-4b-001-finetuned",
        inference_base_url="http://localhost:8082/v1",
        inference_api_key="dummy",
        inference_timeout=3600,
    )
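
As a quick sanity check (not code from the repo), the model names the vLLM server actually exposes can be listed with the OpenAI client, to confirm that inference_model_name matches one of them; the LoRA adapter is served under the name given to --lora-modules, while --served-model-name covers the base model:

    # Quick sanity check: print the model ids the vLLM server exposes and
    # confirm that the name passed to art.Model appears among them.
    from openai import OpenAI

    client = OpenAI(base_url="http://localhost:8082/v1", api_key="dummy")
    for model in client.models.list():
        print(model.id)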

This setup works to some extent, but there is still a significant gap between the benchmark performance and the validation results obtained during training.

I would like to know: during training, is it also mandatory for the model to call the complete_task tool for a rollout to count as successfully completed? When I ran your benchmark, the trained model tended not to call complete_task to end the task, which resulted in an evaluation success rate of 0.
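
For illustration, this is a rough way to quantify how often the benchmark rollouts actually call complete_task (an ad-hoc helper assuming one OpenAI-style message list per rollout; field names are assumptions, not repo code):

    # Ad-hoc diagnostic: fraction of benchmark rollouts in which the assistant
    # ever issues a complete_task tool call.
    def complete_task_rate(trajectories: list[list[dict]]) -> float:
        def called(messages: list[dict]) -> bool:
            return any(
                call["function"]["name"] == "complete_task"
                for message in messages
                for call in (message.get("tool_calls") or [])
            )

        if not trajectories:
            return 0.0
        return sum(called(messages) for messages in trajectories) / len(trajectories)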
