Description
Why does a model trained on the mcp-rl example show a significant discrepancy in results when evaluated with generate_benchmarks.py, compared to the validation outcomes from train.py?
A preliminary investigation suggests the root cause may be the following logic: in generate_benchmarks.py, the model must call the complete_task tool to be considered to have finished a task, but train.py has no such requirement. Is this the reason for the large deviation in results?
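For reference, here is a minimal sketch of the kind of completion check I suspect the benchmark applies; the helper below is hypothetical, written only to illustrate the suspected logic, and is not the actual ART code:

# Hypothetical illustration: count a trajectory as finished only if it
# contains a complete_task tool call somewhere in its messages.
def trajectory_completed(messages: list[dict]) -> bool:
    for message in messages:
        for tool_call in message.get("tool_calls") or []:
            if tool_call["function"]["name"] == "complete_task":
                return True
    return False

# Under such a rule, a rollout that solves the task but never emits the
# complete_task call would still be scored as a failure (success rate 0).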
In generate_benchmarks.py, the qwen3-4b-instruct model is defined as:
qwen3_4b_instruct = art.Model(
    name="qwen3-4b-instruct",
    project=server,
    inference_model_name="qwen3-4b-instruct",
    inference_base_url="http://localhost:8082/v1",
    inference_api_key="dummy",  # placeholder API key
    inference_timeout=3600,
)
To evaluate the trained checkpoint, I serve the base model together with the LoRA adapter through vLLM's OpenAI-compatible server:
source /*****/miniconda3/bin/activate ART
BASE_MODEL_PATH="/*****/Qwen3-4B-Instruct-2507"
LORA_PATH="/*****/examples/mcp-rl/.art/mcp-agent-training/models/mcp-4b-001/checkpoints/0017"
CUDA_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.openai.api_server \
--model "$BASE_MODEL_PATH" \
--served-model-name "mcp-4b-001-finetuned" \
--enable-lora \
--lora-modules mcp-4b-001="$LORA_PATH" \
--host 0.0.0.0 \
--port 8082 \
--trust-remote-code \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--max-model-len 16384 \
--tensor-parallel-size 2
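With --enable-lora, vLLM serves the base model under the --served-model-name and the adapter under the name given in --lora-modules, so both names should be visible at the endpoint. A quick check, assuming the standard OpenAI-compatible client against the server started above:

# List the model names exposed by the vLLM OpenAI-compatible server, to
# confirm which name the benchmark's inference_model_name should target.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="dummy")
for model in client.models.list():
    print(model.id)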
Then, in generate_benchmarks.py, the fine-tuned model is defined as:
mcp_4b_001_finetuned = art.Model(
    name="mcp-4b-001-finetuned",
    project=server,
    inference_model_name="mcp-4b-001-finetuned",
    inference_base_url="http://localhost:8082/v1",
    inference_api_key="dummy",
    inference_timeout=3600,
)
While this setup works to some extent, there is still a significant gap between its performance and the validation results obtained during training.
I would like to know: during training, is the model also required to call the complete_task tool for a rollout to count as a successful completion? When I ran your benchmark, the trained model tended not to call complete_task to end the task, which resulted in an evaluation success rate of 0.
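For reference, here is a minimal way to probe the served model's tool-calling behaviour directly against the endpoint above; the complete_task schema in this snippet is an illustrative stand-in I wrote for the test, not the benchmark's actual tool definition:

# Send one chat request with a stand-in complete_task tool and inspect
# whether the model chooses to call it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8082/v1", api_key="dummy")

complete_task_tool = {
    "type": "function",
    "function": {
        "name": "complete_task",
        "description": "Signal that the task is finished.",
        "parameters": {
            "type": "object",
            "properties": {"summary": {"type": "string"}},
            "required": ["summary"],
        },
    },
}

response = client.chat.completions.create(
    model="mcp-4b-001-finetuned",
    messages=[{"role": "user", "content": "The task is done. Finish up and report the result."}],
    tools=[complete_task_tool],
)
print(response.choices[0].message.tool_calls)  # None when no tool call is emitted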