
[PZ COMPETITION] code submission by losercheems #165

Open
LoserCheems wants to merge 6 commits into FlagAI-Open:main from LoserCheems:main

Conversation


@LoserCheems LoserCheems commented Sep 12, 2025

Hi, the code has been submitted through this PR.
The checkpoints and logs have been published on HuggingFace.

- Created `ddp.yaml` for distributed training configuration.
- Added `config.yaml` for model and training parameters.
- Implemented `download.py` for dataset downloading and processing.
- Introduced `merge.py` for merging model checkpoints with various methods.
- Added `train.sh` for SLURM job submission with environment setup.
- Developed `pt_dpsk.py` for training the OpenSeek model with Hugging Face Trainer.
- Created `training_args_configs.py` for defining training argument configurations.
Copilot AI review requested due to automatic review settings September 12, 2025 07:01

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LoserCheems, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a comprehensive training infrastructure for the OpenSeek model, covering both its initial continued pre-training and its subsequent KTO-style preference alignment. It provides all necessary configuration files, data preparation scripts, and training entry points, designed for efficient execution on multi-GPU environments using accelerate and DeepSpeed. The changes enable reproducible model development from raw data to a fine-tuned checkpoint, complete with evaluation examples and merging capabilities.

Highlights

  • OpenSeek KTO Alignment Pipeline: Introduced a complete Kahneman–Tversky Optimization (KTO) alignment fine-tuning pipeline for the OpenSeek model, including scripts for asset acquisition, dataset transformation, and alignment training.
  • OpenSeek Continue Pre-training Workflow: Added a three-stage continue pre-training workflow for the OpenSeek model, encompassing asset download, mixed dataset building, distributed training, and model checkpoint merging (model soup).
  • Distributed Training Configurations: Provided accelerate configuration files for both DeepSpeed ZeRO-2 (for KTO) and DDP (for pre-training) to enable efficient multi-GPU training.
  • Data Preparation Scripts: Developed dedicated Python scripts for downloading base models and datasets, as well as transforming raw datasets into formats suitable for KTO preference alignment and pre-training with mixed ratios and sequence packing.
  • SLURM Integration: Included train.sh scripts for both pipelines to facilitate job submission on SLURM clusters, setting up necessary environment variables and launching accelerate commands.
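The mixed-ratio sampling with sequence packing highlighted above can be sketched in a few lines. This is a minimal pure-Python illustration, not the PR's actual implementation; the `pack_sequences` helper and the block size are made up for the example:

```python
from typing import Iterable, List

def pack_sequences(token_lists: Iterable[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples and re-split into fixed-size blocks.

    Leftover tokens that do not fill a final block are dropped, a common
    choice in pre-training data pipelines.
    """
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks

# Three short "documents" packed into blocks of 4 tokens:
packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
# packed == [[1, 2, 3, 4], [5, 6, 7, 8]]; the trailing token 9 is dropped.
```

Packing this way keeps every training batch at a uniform sequence length, which avoids padding waste during pre-training.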


Copilot AI left a comment


Pull Request Overview

This PR implements a comprehensive training and alignment pipeline for the OpenSeek model, adding configuration files, data processing scripts, and training components for both pretraining and preference-based fine-tuning (KTO).

  • Creates complete training infrastructure with distributed training configurations and dataset processing utilities
  • Implements both preliminary pretraining pipeline and final KTO alignment workflow
  • Adds comprehensive documentation and example configurations for model training and evaluation

Reviewed Changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.

Summary per file:

openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py: Defines PTConfig class for pretraining configuration parameters
openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py: Main pretraining script with Trainer setup and dataset processing
openseek/competition/pz/losercheems/preliminary/train.sh: SLURM batch script for distributed training launch
openseek/competition/pz/losercheems/preliminary/scripts/merge.py: Model checkpoint merging utility with multiple strategies
openseek/competition/pz/losercheems/preliminary/scripts/download.py: Dataset downloading and preprocessing script
openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml: Training configuration with model and hyperparameter settings
openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml: Distributed training configuration for Accelerate
openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py: Dataset processing utilities for mixing and tokenization
openseek/competition/pz/losercheems/preliminary/README.md: Technical documentation for the pretraining workflow
openseek/competition/pz/losercheems/final/trainer/kto.py: KTO alignment training script
openseek/competition/pz/losercheems/final/train.sh: Training launch script for KTO alignment
openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py: Dataset processing for KTO preference format
openseek/competition/pz/losercheems/final/scripts/download.py: Asset download script for KTO training
openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml: KTO training configuration
openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml: DeepSpeed ZeRO-2 configuration
openseek/competition/pz/losercheems/final/README.md: Technical documentation for KTO alignment workflow


set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig

Copilot AI Sep 12, 2025


This import will fail because 'utils' is not in the Python path. The import should be a relative import: from ..utils.training_args_configs import PTConfig or an absolute import from the package root.

Suggested change
from utils.training_args_configs import PTConfig
from ..utils.training_args_configs import PTConfig

from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig

from small_doge.processor import mix_pt_datasets

Copilot AI Sep 12, 2025


This import references 'small_doge.processor' but should import from the local processor module: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets.

Suggested change
from small_doge.processor import mix_pt_datasets
from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

model_args.model_name_or_path,
config=config,
).to(torch_dtype)
# if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)

Copilot AI Sep 12, 2025


This commented-out code should be removed as it clutters the codebase and provides no value in its current state.

Suggested change
# if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 114514

Copilot AI Sep 12, 2025


The time limit '114514' appears to be a placeholder or meme number rather than a realistic job duration. This should be set to an appropriate time limit for the training job.

Suggested change
#SBATCH -t 114514
#SBATCH -t 1-00:00:00

checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")

Copilot AI Sep 12, 2025


The checkpoint-5000 path is duplicated in the list. This will cause the same checkpoint to be loaded twice during merging, which is likely unintentional.

Suggested change
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")

@@ -0,0 +1,105 @@
from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

Copilot AI Sep 12, 2025


This import path is incorrect. It should be a relative import: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets or an absolute import from the package root.

Suggested change
from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

Comment on lines +60 to +63
def calculate_total_ratio(datasets_and_ratios):
return sum(item for item in datasets_and_ratios.values())

total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)

Copilot AI Sep 12, 2025


The calculate_total_ratio function expects a dictionary but datasets_and_ratios is a list of dictionaries. This line should be: total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios).

Suggested change
def calculate_total_ratio(datasets_and_ratios):
    return sum(item for item in datasets_and_ratios.values())

total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)
total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios)

main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

Copilot AI Sep 12, 2025


The number of processes (8) doesn't match the GPU allocation in train.sh (4 GPUs). These values should be consistent - either both should be 4 or both should be 8.

Suggested change
num_processes: 8
num_processes: 4


# Process each split of the dataset
for split_name, split_dataset in dataset.items():
split_dataset = split_dataset.select_columns(["input_ids"])

Copilot AI Sep 12, 2025


This line assumes the dataset already has an 'input_ids' column, but the dataset may not be preprocessed yet. This should be moved after the prepare_dataset call or made conditional on whether the dataset is already processed.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from transformers.trainer_utils import get_last_checkpoint

from small_doge.processor import mix_dpo_datasets

Copilot AI Sep 12, 2025


This import is unused in the code and references an external module that may not exist. The import should be removed since the dataset is loaded directly with datasets.load_from_disk.

Suggested change
from small_doge.processor import mix_dpo_datasets

@LoserCheems LoserCheems changed the title Add configuration files, scripts, and training setup for OpenSeek model [PZ COMPETITION] code submission by losercheems Sep 12, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive set of files for pre-training and KTO alignment of the OpenSeek model. The changes include scripts for data downloading and processing, configuration files for training, and the training scripts themselves, along with detailed documentation. The overall structure is good, but there are several areas for improvement, particularly concerning hardcoded paths, code clarity, and reproducibility. I've identified several critical and high-severity issues related to hardcoded absolute paths in configuration and scripts, which will prevent others from running this code without modification. There are also some logical errors, such as a duplicated checkpoint in the merge script and confusing data processing logic. I've provided specific suggestions to address these points to make the pipeline more robust and portable.

# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643

datasets_and_ratios:
- /workspace/datasets/OpenSeek-Pretrain-30B: 1.0


critical

The dataset path /workspace/datasets/OpenSeek-Pretrain-30B is hardcoded as an absolute path. This severely impacts reproducibility, as the training will fail on any machine where this exact path does not exist. Please use a relative path or an environment variable.

  - ./datasets/OpenSeek-Pretrain-30B: 1.0

Comment on lines +16 to +21
numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")


critical

The script contains a hardcoded absolute path /root/code/small-doge/datasets/AI-MO/NuminaMath-CoT for loading the dataset and a hardcoded relative path for saving. This severely impacts portability and reproducibility. These paths should be parameterized using command-line arguments (e.g., with argparse) to make the script reusable.

Suggested change
numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process dataset for KTO training.")
    parser.add_argument("--input_path", type=str, required=True, help="Path to the input dataset.")
    parser.add_argument("--output_path", type=str, required=True, help="Path to save the processed dataset.")
    args = parser.parse_args()

    numina_math_cot = load_from_disk(args.input_path)
    print(numina_math_cot)
    numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
    print(numina_math_cot)
    print(numina_math_cot["train"][0])
    numina_math_cot.save_to_disk(args.output_path)
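The `process` function that the script maps over the dataset (not shown in this excerpt) presumably emits the prompt/completion/label triples that KTO training expects. A hypothetical version is sketched below; the field names ("problem", "solution") and the labeling rule are assumptions for illustration, not the PR's actual code:

```python
def process(example):
    """Convert one record into the KTO preference format.

    Hypothetical sketch: the input field names and the labeling rule are
    assumed. KTO expects a prompt, a completion, and a boolean label
    marking the completion as desirable (True) or undesirable (False).
    """
    return {
        "prompt": example["problem"],
        "completion": example["solution"],
        "label": True,  # assumption: reference solutions treated as desirable
    }

record = {"problem": "What is 2 + 2?", "solution": "2 + 2 = 4.", "extra": "ignored"}
out = process(record)
# out == {"prompt": "What is 2 + 2?", "completion": "2 + 2 = 4.", "label": True}
```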

checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")


critical

The path to checkpoint-5000 is duplicated in the checkpoint_paths list. This is likely a copy-paste error and will cause this checkpoint's weights to be counted twice, skewing the average during the merge process.
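Beyond deleting the duplicated line, an order-preserving de-duplication guard would make the merge robust to this class of mistake. A sketch, not part of the PR:

```python
def dedupe_preserving_order(paths):
    """Drop repeated checkpoint paths while keeping first-seen order."""
    seen = set()
    unique = []
    for p in paths:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

paths = [
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-3000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-4000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",  # duplicate caught here
]
checkpoint_paths = dedupe_preserving_order(paths)
# checkpoint_paths contains each checkpoint exactly once.
```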


# Process each split of the dataset
for split_name, split_dataset in dataset.items():
split_dataset = split_dataset.select_columns(["input_ids"])


high

This line prematurely filters the dataset to only the input_ids column before calling prepare_dataset. This prevents the tokenization logic inside prepare_dataset (which expects a text field) from ever running, making the code confusing and brittle if a raw text dataset is ever used. This selection should happen inside prepare_dataset after tokenization and before packing.
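The ordering the reviewer describes can be illustrated with a minimal sketch; the `tokenize` helper and field names are hypothetical stand-ins for the real pipeline:

```python
def prepare_records(records, tokenize):
    """Tokenize raw text records if needed, then keep only input_ids.

    Selecting columns only *after* tokenization means the function works
    for both raw-text and already-tokenized inputs, which is the ordering
    the review suggests.
    """
    prepared = []
    for rec in records:
        if "input_ids" not in rec:
            rec = {**rec, "input_ids": tokenize(rec["text"])}
        prepared.append({"input_ids": rec["input_ids"]})
    return prepared

# Toy tokenizer for illustration only.
fake_tokenize = lambda text: [ord(c) % 100 for c in text]

raw = [{"text": "hi"}]                              # raw text: gets tokenized
pre = [{"input_ids": [1, 2, 3], "meta": "x"}]       # pre-tokenized: passes through
```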

args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,


high

The KTOTrainer from trl does not accept a processing_class argument. It seems you intended to pass the tokenizer, but the argument name is incorrect and it will be ignored. The tokenizer is usually handled through the DataCollator or is implicitly used by the trainer if needed. Since KTOTrainer doesn't require it in its constructor, this line can be removed.

Comment on lines +39 to +40
      final_result.json        # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics


medium

The indentation for final_result.json in the directory overview seems incorrect. It appears to be a file inside the <benchmark_name> directory, but based on the description, it should likely be at the same level as <benchmark_name>.

Suggested change
      final_result.json        # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics
    final_result.json          # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics

Comment on lines +66 to +150
# datasets_and_ratios:
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high_part_142_text_document: 0.011068
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low_part_62_text_document: 0.003577
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid_part_189_text_document: 0.007775
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high_part_76_text_document: 0.002859
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low_part_124_text_document: 0.001672
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid_part_29_text_document: 0.002339
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document: 0.005397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low_part_150_text_document: 0.004064
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid_part_444_text_document: 0.005005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document: 0.004616
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low_part_10_text_document: 0.00067
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid_part_144_text_document: 0.003429
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document: 0.00261
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low_part_133_text_document: 0.001824
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid_part_139_text_document: 0.002313
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document: 0.008237
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low_part_11_text_document: 0.002866
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid_part_97_text_document: 0.00667
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document: 0.004657
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low_part_10_text_document: 0.002005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid_part_164_text_document: 0.004317
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high_part_92_text_document: 0.011397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low_part_113_text_document: 0.006782
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid_part_563_text_document: 0.009175
# - /workspace/datasets/OpenSeek-Pretrain-100B/arxiv_007_00000_text_document: 0.006414
# - /workspace/datasets/OpenSeek-Pretrain-100B/books_016_00007_text_document: 0.004696
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-high_part_13_text_document: 0.010102
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-low_part_36_text_document: 0.011403
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-mid_part_37_text_document: 0.009674
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-high_23_text_document: 0.003755
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-low_51_text_document: 0.000499
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_118_text_document: 0.003608
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_176_text_document: 0.003623
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_256_text_document: 0.003704
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_320_text_document: 0.003733
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_32_text_document: 0.003631
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high_1_text_document: 0.002573
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low_2_text_document: 0.001638
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid_3_text_document: 0.003251
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high_2_text_document: 0.060237
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low_1_text_document: 0.089063
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid_2_text_document: 0.101376
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-high_4_text_document: 0.004598
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-low_6_text_document: 0.006857
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-mid_23_text_document: 0.00899
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-high_12_text_document: 0.013135
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-low_3_text_document: 0.01653
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-mid_5_text_document: 0.003536
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high_5_text_document: 0.006314
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low_5_text_document: 0.005978
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid_4_text_document: 0.007909
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-high_74_text_document: 0.002225
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-low_54_text_document: 0.001797
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-mid_275_text_document: 0.002042
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high_4_text_document: 0.004081
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low_2_text_document: 0.001659
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid_6_text_document: 0.012828
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high_2_text_document: 0.0568
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low_1_text_document: 0.074907
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid_1_text_document: 0.089359
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-high_13_text_document: 0.007663
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-low_9_text_document: 0.004052
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-mid_6_text_document: 0.001916
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-high_11_text_document: 0.005074
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-low_11_text_document: 0.006437
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-mid_29_text_document: 0.006406
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-high_4_text_document: 0.004
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-low_6_text_document: 0.003564
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid_3_text_document: 0.005768
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-high_part_04_text_document: 0.018165
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-low_part_10_text_document: 0.01694
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-mid_part_07_text_document: 0.016311
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0041-of-0136_text_document: 0.00687
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0125-of-0136_text_document: 0.007387
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-val_valid-0034-of-0060_text_document: 0.000143
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o_pubmedcentral_3_text_document: 0.061982
# - /workspace/datasets/OpenSeek-Pretrain-100B/stack_018_00000_text_document: 0.004229
# - /workspace/datasets/OpenSeek-Pretrain-100B/wiki_012_00000_text_document: 0.004202
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss0_part_28_text_document: 0.018171
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss1_part_59_text_document: 0.009776
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss2_part_16_text_document: 0.003725
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss0_part_192_text_document: 0.009492
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss1_part_550_text_document: 0.009236
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643


medium

This large block of commented-out datasets_and_ratios makes the configuration file noisy and difficult to read. If this is for reference or an example, it should be moved to the documentation. Otherwise, it should be removed to improve clarity.

Comment on lines +28 to +35
"""
合并多个checkpoints

Args:
checkpoint_paths: 要合并的checkpoint路径列表
output_path: 合并后模型的保存路径
merge_method: 合并方法,支持 "average", "last", "weighted_average"
"""


medium

The docstring and comments are in Chinese, which is inconsistent with the rest of the project (e.g., READMEs, other scripts) which is in English. For better maintainability and accessibility for all contributors, please translate these to English.

Suggested change
"""
合并多个checkpoints
Args:
checkpoint_paths: 要合并的checkpoint路径列表
output_path: 合并后模型的保存路径
merge_method: 合并方法支持 "average", "last", "weighted_average"
"""
"""
Merges multiple checkpoints.
Args:
checkpoint_paths: A list of checkpoint paths to be merged.
output_path: The path to save the merged model.
merge_method: The method for merging, supports "average", "last", "weighted_average".
"""
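The "average" method described in the docstring amounts to a parameter-wise mean over checkpoints (a "model soup"). A framework-free sketch with plain floats standing in for tensors; the real script presumably operates on torch state dicts:

```python
def average_state_dicts(state_dicts):
    """Parameter-wise mean of several checkpoints ("model soup" average).

    Plain floats stand in for tensors here; with torch, the comprehension
    would sum tensors and divide by len(state_dicts) instead.
    """
    keys = state_dicts[0].keys()
    return {
        k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
        for k in keys
    }

ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
merged = average_state_dicts([ckpt_a, ckpt_b])
# merged == {"w": 2.0, "b": 1.0}
```

This is also why the duplicated checkpoint-5000 entry flagged above matters: a repeated path doubles that checkpoint's weight in the mean.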

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 114514


medium

The SBATCH time limit -t 114514 appears to be a placeholder or a meme value. For professionalism and proper resource management on a SLURM cluster, this should be set to a realistic time estimate for the job, e.g., 24:00:00 for 24 hours.

Suggested change
#SBATCH -t 114514
#SBATCH -t 24:00:00

Comment on lines +207 to +209
model_config = yaml.load(
open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
)["model_config"]


medium

Using yaml.load with FullLoader is unsafe as it can execute arbitrary code. For loading simple configuration files, yaml.safe_load should be used instead to prevent potential security vulnerabilities.

Suggested change
model_config = yaml.load(
open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
)["model_config"]
model_config = yaml.safe_load(
open(config_path, "r", encoding="utf-8"),
)["model_config"]
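For completeness, a minimal example of the safe pattern, here parsing an inline string; the `model_config` key mirrors the config under review, and when reading a file a `with open(...)` context manager would also close the handle that the original one-liner leaks:

```python
import yaml  # PyYAML

# safe_load only constructs plain Python objects (dicts, lists, scalars),
# so a malicious config file cannot trigger arbitrary code execution.
config_text = """
model_config:
  hidden_size: 1024
  num_layers: 24
"""

config = yaml.safe_load(config_text)
model_config = config["model_config"]
```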

