
[PZ COMPETITION] code submission by losercheems #165

Open
LoserCheems wants to merge 6 commits into FlagAI-Open:main from LoserCheems:main

Conversation


@LoserCheems LoserCheems commented Sep 12, 2025

Hi, the code has been submitted through this PR.
The checkpoints and logs have been published on HuggingFace.

- Created `ddp.yaml` for distributed training configuration.
- Added `config.yaml` for model and training parameters.
- Implemented `download.py` for dataset downloading and processing.
- Introduced `merge.py` for merging model checkpoints with various methods.
- Added `train.sh` for SLURM job submission with environment setup.
- Developed `pt_dpsk.py` for training the OpenSeek model with Hugging Face Trainer.
- Created `training_args_configs.py` for defining training argument configurations.
Copilot AI review requested due to automatic review settings September 12, 2025 07:01

@gemini-code-assist gemini-code-assist bot left a comment


Summary of Changes

Hello @LoserCheems, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request establishes a comprehensive training infrastructure for the OpenSeek model, covering both its initial continued pre-training and its subsequent KTO-style preference alignment. It provides all necessary configuration files, data preparation scripts, and training entry points, designed for efficient execution on multi-GPU environments using accelerate and DeepSpeed. The changes enable reproducible model development from raw data to a fine-tuned checkpoint, complete with evaluation examples and merging capabilities.

Highlights

  • OpenSeek KTO Alignment Pipeline: Introduced a complete Kahneman–Tversky Optimization (KTO) alignment fine-tuning pipeline for the OpenSeek model, including scripts for asset acquisition, dataset transformation, and alignment training.
  • OpenSeek Continue Pre-training Workflow: Added a three-stage continue pre-training workflow for the OpenSeek model, encompassing asset download, mixed dataset building, distributed training, and model checkpoint merging (model soup).
  • Distributed Training Configurations: Provided accelerate configuration files for both DeepSpeed ZeRO-2 (for KTO) and DDP (for pre-training) to enable efficient multi-GPU training.
  • Data Preparation Scripts: Developed dedicated Python scripts for downloading base models and datasets, as well as transforming raw datasets into formats suitable for KTO preference alignment and pre-training with mixed ratios and sequence packing.
  • SLURM Integration: Included train.sh scripts for both pipelines to facilitate job submission on SLURM clusters, setting up necessary environment variables and launching accelerate commands.
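The mixed-ratio sampling with sequence packing highlighted above can be sketched in a few lines. This is a minimal pure-Python illustration, not the PR's actual implementation; the `pack_sequences` helper and the block size are made up for the example:

```python
from typing import Iterable, List

def pack_sequences(token_lists: Iterable[List[int]], block_size: int) -> List[List[int]]:
    """Concatenate tokenized examples and re-split into fixed-size blocks.

    Leftover tokens that do not fill a final block are dropped, a common
    choice in pre-training data pipelines.
    """
    buffer: List[int] = []
    blocks: List[List[int]] = []
    for tokens in token_lists:
        buffer.extend(tokens)
        while len(buffer) >= block_size:
            blocks.append(buffer[:block_size])
            buffer = buffer[block_size:]
    return blocks

# Three short "documents" packed into blocks of 4 tokens:
packed = pack_sequences([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
# packed == [[1, 2, 3, 4], [5, 6, 7, 8]]; the trailing token 9 is dropped.
```

Packing this way keeps every training batch at a uniform sequence length, which avoids padding waste during pre-training.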


Copilot AI left a comment


Pull Request Overview

This PR implements a comprehensive training and alignment pipeline for the OpenSeek model, adding configuration files, data processing scripts, and training components for both pretraining and preference-based fine-tuning (KTO).

  • Creates complete training infrastructure with distributed training configurations and dataset processing utilities
  • Implements both preliminary pretraining pipeline and final KTO alignment workflow
  • Adds comprehensive documentation and example configurations for model training and evaluation

Reviewed Changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.

Summary per file:

openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py: Defines PTConfig class for pretraining configuration parameters
openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py: Main pretraining script with Trainer setup and dataset processing
openseek/competition/pz/losercheems/preliminary/train.sh: SLURM batch script for distributed training launch
openseek/competition/pz/losercheems/preliminary/scripts/merge.py: Model checkpoint merging utility with multiple strategies
openseek/competition/pz/losercheems/preliminary/scripts/download.py: Dataset downloading and preprocessing script
openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml: Training configuration with model and hyperparameter settings
openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml: Distributed training configuration for Accelerate
openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py: Dataset processing utilities for mixing and tokenization
openseek/competition/pz/losercheems/preliminary/README.md: Technical documentation for the pretraining workflow
openseek/competition/pz/losercheems/final/trainer/kto.py: KTO alignment training script
openseek/competition/pz/losercheems/final/train.sh: Training launch script for KTO alignment
openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py: Dataset processing for KTO preference format
openseek/competition/pz/losercheems/final/scripts/download.py: Asset download script for KTO training
openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml: KTO training configuration
openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml: DeepSpeed ZeRO-2 configuration
openseek/competition/pz/losercheems/final/README.md: Technical documentation for KTO alignment workflow


set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig

Copilot AI Sep 12, 2025


This import will fail because 'utils' is not in the Python path. The import should be a relative import: from ..utils.training_args_configs import PTConfig or an absolute import from the package root.

Suggested change
from utils.training_args_configs import PTConfig
from ..utils.training_args_configs import PTConfig

from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig

from small_doge.processor import mix_pt_datasets

Copilot AI Sep 12, 2025


This import references 'small_doge.processor' but should import from the local processor module: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets.

Suggested change
from small_doge.processor import mix_pt_datasets
from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

model_args.model_name_or_path,
config=config,
).to(torch_dtype)
# if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)

Copilot AI Sep 12, 2025


This commented-out code should be removed as it clutters the codebase and provides no value in its current state.

Suggested change
# if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 114514

Copilot AI Sep 12, 2025


The time limit '114514' appears to be a placeholder or meme number rather than a realistic job duration. This should be set to an appropriate time limit for the training job.

Suggested change
#SBATCH -t 114514
#SBATCH -t 1-00:00:00

checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")

Copilot AI Sep 12, 2025


The checkpoint-5000 path is duplicated in the list. This will cause the same checkpoint to be loaded twice during merging, which is likely unintentional.

Suggested change
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")

@@ -0,0 +1,105 @@
from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

Copilot AI Sep 12, 2025


This import path is incorrect. It should be a relative import: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets or an absolute import from the package root.

Suggested change
from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets

Comment on lines +60 to +63
def calculate_total_ratio(datasets_and_ratios):
return sum(item for item in datasets_and_ratios.values())

total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)

Copilot AI Sep 12, 2025


The calculate_total_ratio function expects a dictionary but datasets_and_ratios is a list of dictionaries. This line should be: total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios).

Suggested change
def calculate_total_ratio(datasets_and_ratios):
    return sum(item for item in datasets_and_ratios.values())

total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)
total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios)

main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8

Copilot AI Sep 12, 2025


The number of processes (8) doesn't match the GPU allocation in train.sh (4 GPUs). These values should be consistent - either both should be 4 or both should be 8.

Suggested change
num_processes: 8
num_processes: 4


# Process each split of the dataset
for split_name, split_dataset in dataset.items():
split_dataset = split_dataset.select_columns(["input_ids"])

Copilot AI Sep 12, 2025


This line assumes the dataset already has an 'input_ids' column, but the dataset may not be preprocessed yet. This should be moved after the prepare_dataset call or made conditional on whether the dataset is already processed.

from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from transformers.trainer_utils import get_last_checkpoint

from small_doge.processor import mix_dpo_datasets

Copilot AI Sep 12, 2025


This import is unused in the code and references an external module that may not exist. The import should be removed since the dataset is loaded directly with datasets.load_from_disk.

Suggested change
from small_doge.processor import mix_dpo_datasets

@LoserCheems LoserCheems changed the title Add configuration files, scripts, and training setup for OpenSeek model [PZ COMPETITION] code submission by losercheems Sep 12, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a comprehensive set of files for pre-training and KTO alignment of the OpenSeek model. The changes include scripts for data downloading and processing, configuration files for training, and the training scripts themselves, along with detailed documentation. The overall structure is good, but there are several areas for improvement, particularly concerning hardcoded paths, code clarity, and reproducibility. I've identified several critical and high-severity issues related to hardcoded absolute paths in configuration and scripts, which will prevent others from running this code without modification. There are also some logical errors, such as a duplicated checkpoint in the merge script and confusing data processing logic. I've provided specific suggestions to address these points to make the pipeline more robust and portable.

# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643

datasets_and_ratios:
- /workspace/datasets/OpenSeek-Pretrain-30B: 1.0


critical

The dataset path /workspace/datasets/OpenSeek-Pretrain-30B is hardcoded as an absolute path. This severely impacts reproducibility, as the training will fail on any machine where this exact path does not exist. Please use a relative path or an environment variable.

  - ./datasets/OpenSeek-Pretrain-30B: 1.0

Comment on lines +16 to +21
numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")


critical

The script contains a hardcoded absolute path /root/code/small-doge/datasets/AI-MO/NuminaMath-CoT for loading the dataset and a hardcoded relative path for saving. This severely impacts portability and reproducibility. These paths should be parameterized using command-line arguments (e.g., with argparse) to make the script reusable.

Suggested change
numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")
import argparse

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Process dataset for KTO training.")
    parser.add_argument("--input_path", type=str, required=True, help="Path to the input dataset.")
    parser.add_argument("--output_path", type=str, required=True, help="Path to save the processed dataset.")
    args = parser.parse_args()

    numina_math_cot = load_from_disk(args.input_path)
    print(numina_math_cot)
    numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
    print(numina_math_cot)
    print(numina_math_cot["train"][0])
    numina_math_cot.save_to_disk(args.output_path)
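The `process` function that the script maps over the dataset (not shown in this excerpt) presumably emits the prompt/completion/label triples that KTO training expects. A hypothetical version is sketched below; the field names ("problem", "solution") and the labeling rule are assumptions for illustration, not the PR's actual code:

```python
def process(example):
    """Convert one record into the KTO preference format.

    Hypothetical sketch: the input field names and the labeling rule are
    assumed. KTO expects a prompt, a completion, and a boolean label
    marking the completion as desirable (True) or undesirable (False).
    """
    return {
        "prompt": example["problem"],
        "completion": example["solution"],
        "label": True,  # assumption: reference solutions treated as desirable
    }

record = {"problem": "What is 2 + 2?", "solution": "2 + 2 = 4.", "extra": "ignored"}
out = process(record)
# out == {"prompt": "What is 2 + 2?", "completion": "2 + 2 = 4.", "label": True}
```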

checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")


critical

The path to checkpoint-5000 is duplicated in the checkpoint_paths list. This is likely a copy-paste error and will cause this checkpoint's weights to be counted twice, skewing the average during the merge process.
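Beyond deleting the duplicated line, an order-preserving de-duplication guard would make the merge robust to this class of mistake. A sketch, not part of the PR:

```python
def dedupe_preserving_order(paths):
    """Drop repeated checkpoint paths while keeping first-seen order."""
    seen = set()
    unique = []
    for p in paths:
        if p not in seen:
            seen.add(p)
            unique.append(p)
    return unique

paths = [
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-3000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-4000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",  # duplicate caught here
]
checkpoint_paths = dedupe_preserving_order(paths)
# checkpoint_paths contains each checkpoint exactly once.
```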


# Process each split of the dataset
for split_name, split_dataset in dataset.items():
split_dataset = split_dataset.select_columns(["input_ids"])


high

This line prematurely filters the dataset to only the input_ids column before calling prepare_dataset. This prevents the tokenization logic inside prepare_dataset (which expects a text field) from ever running, making the code confusing and brittle if a raw text dataset is ever used. This selection should happen inside prepare_dataset after tokenization and before packing.
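The ordering the reviewer describes can be illustrated with a minimal sketch; the `tokenize` helper and field names are hypothetical stand-ins for the real pipeline:

```python
def prepare_records(records, tokenize):
    """Tokenize raw text records if needed, then keep only input_ids.

    Selecting columns only *after* tokenization means the function works
    for both raw-text and already-tokenized inputs, which is the ordering
    the review suggests.
    """
    prepared = []
    for rec in records:
        if "input_ids" not in rec:
            rec = {**rec, "input_ids": tokenize(rec["text"])}
        prepared.append({"input_ids": rec["input_ids"]})
    return prepared

# Toy tokenizer for illustration only.
fake_tokenize = lambda text: [ord(c) % 100 for c in text]

raw = [{"text": "hi"}]                              # raw text: gets tokenized
pre = [{"input_ids": [1, 2, 3], "meta": "x"}]       # pre-tokenized: passes through
```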

args=training_args,
train_dataset=dataset[script_args.dataset_train_split],
eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
processing_class=tokenizer,


high

The KTOTrainer from trl does not accept a processing_class argument. It seems you intended to pass the tokenizer, but the argument name is incorrect and it will be ignored. The tokenizer is usually handled through the DataCollator or is implicitly used by the trainer if needed. Since KTOTrainer doesn't require it in its constructor, this line can be removed.

Comment on lines +39 to +40
      final_result.json        # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics


medium

The indentation for final_result.json in the directory overview seems incorrect. It appears to be a file inside the <benchmark_name> directory, but based on the description, it should likely be at the same level as <benchmark_name>.

Suggested change
      final_result.json        # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics
    final_result.json          # Aggregated metrics summary
    <benchmark_name>/          # Per-benchmark JSONL + metrics

Comment on lines +66 to +150
# datasets_and_ratios:
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high_part_142_text_document: 0.011068
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low_part_62_text_document: 0.003577
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid_part_189_text_document: 0.007775
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high_part_76_text_document: 0.002859
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low_part_124_text_document: 0.001672
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid_part_29_text_document: 0.002339
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document: 0.005397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low_part_150_text_document: 0.004064
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid_part_444_text_document: 0.005005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document: 0.004616
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low_part_10_text_document: 0.00067
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid_part_144_text_document: 0.003429
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document: 0.00261
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low_part_133_text_document: 0.001824
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid_part_139_text_document: 0.002313
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document: 0.008237
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low_part_11_text_document: 0.002866
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid_part_97_text_document: 0.00667
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document: 0.004657
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low_part_10_text_document: 0.002005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid_part_164_text_document: 0.004317
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high_part_92_text_document: 0.011397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low_part_113_text_document: 0.006782
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid_part_563_text_document: 0.009175
# - /workspace/datasets/OpenSeek-Pretrain-100B/arxiv_007_00000_text_document: 0.006414
# - /workspace/datasets/OpenSeek-Pretrain-100B/books_016_00007_text_document: 0.004696
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-high_part_13_text_document: 0.010102
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-low_part_36_text_document: 0.011403
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-mid_part_37_text_document: 0.009674
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-high_23_text_document: 0.003755
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-low_51_text_document: 0.000499
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_118_text_document: 0.003608
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_176_text_document: 0.003623
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_256_text_document: 0.003704
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_320_text_document: 0.003733
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_32_text_document: 0.003631
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high_1_text_document: 0.002573
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low_2_text_document: 0.001638
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid_3_text_document: 0.003251
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high_2_text_document: 0.060237
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low_1_text_document: 0.089063
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid_2_text_document: 0.101376
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-high_4_text_document: 0.004598
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-low_6_text_document: 0.006857
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-mid_23_text_document: 0.00899
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-high_12_text_document: 0.013135
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-low_3_text_document: 0.01653
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-mid_5_text_document: 0.003536
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high_5_text_document: 0.006314
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low_5_text_document: 0.005978
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid_4_text_document: 0.007909
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-high_74_text_document: 0.002225
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-low_54_text_document: 0.001797
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-mid_275_text_document: 0.002042
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high_4_text_document: 0.004081
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low_2_text_document: 0.001659
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid_6_text_document: 0.012828
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high_2_text_document: 0.0568
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low_1_text_document: 0.074907
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid_1_text_document: 0.089359
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-high_13_text_document: 0.007663
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-low_9_text_document: 0.004052
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-mid_6_text_document: 0.001916
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-high_11_text_document: 0.005074
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-low_11_text_document: 0.006437
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-mid_29_text_document: 0.006406
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-high_4_text_document: 0.004
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-low_6_text_document: 0.003564
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid_3_text_document: 0.005768
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-high_part_04_text_document: 0.018165
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-low_part_10_text_document: 0.01694
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-mid_part_07_text_document: 0.016311
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0041-of-0136_text_document: 0.00687
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0125-of-0136_text_document: 0.007387
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-val_valid-0034-of-0060_text_document: 0.000143
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o_pubmedcentral_3_text_document: 0.061982
# - /workspace/datasets/OpenSeek-Pretrain-100B/stack_018_00000_text_document: 0.004229
# - /workspace/datasets/OpenSeek-Pretrain-100B/wiki_012_00000_text_document: 0.004202
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss0_part_28_text_document: 0.018171
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss1_part_59_text_document: 0.009776
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss2_part_16_text_document: 0.003725
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss0_part_192_text_document: 0.009492
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss1_part_550_text_document: 0.009236
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643


medium

This large block of commented-out datasets_and_ratios makes the configuration file noisy and difficult to read. If this is for reference or an example, it should be moved to the documentation. Otherwise, it should be removed to improve clarity.

Comment on lines +28 to +35
"""
合并多个checkpoints

Args:
checkpoint_paths: 要合并的checkpoint路径列表
output_path: 合并后模型的保存路径
merge_method: 合并方法,支持 "average", "last", "weighted_average"
"""


medium

The docstring and comments are in Chinese, which is inconsistent with the rest of the project (e.g., READMEs, other scripts) which is in English. For better maintainability and accessibility for all contributors, please translate these to English.

Suggested change
"""
合并多个checkpoints
Args:
checkpoint_paths: 要合并的checkpoint路径列表
output_path: 合并后模型的保存路径
merge_method: 合并方法支持 "average", "last", "weighted_average"
"""
"""
Merges multiple checkpoints.
Args:
checkpoint_paths: A list of checkpoint paths to be merged.
output_path: The path to save the merged model.
merge_method: The method for merging, supports "average", "last", "weighted_average".
"""
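The "average" method described in the docstring amounts to a parameter-wise mean over checkpoints (a "model soup"). A framework-free sketch with plain floats standing in for tensors; the real script presumably operates on torch state dicts:

```python
def average_state_dicts(state_dicts):
    """Parameter-wise mean of several checkpoints ("model soup" average).

    Plain floats stand in for tensors here; with torch, the comprehension
    would sum tensors and divide by len(state_dicts) instead.
    """
    keys = state_dicts[0].keys()
    return {
        k: sum(sd[k] for sd in state_dicts) / len(state_dicts)
        for k in keys
    }

ckpt_a = {"w": 1.0, "b": 0.0}
ckpt_b = {"w": 3.0, "b": 2.0}
merged = average_state_dicts([ckpt_a, ckpt_b])
# merged == {"w": 2.0, "b": 1.0}
```

This is also why the duplicated checkpoint-5000 entry flagged above matters: a repeated path doubles that checkpoint's weight in the mean.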

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 114514


medium

The SBATCH time limit -t 114514 appears to be a placeholder or a meme value. For professionalism and proper resource management on a SLURM cluster, this should be set to a realistic time estimate for the job, e.g., 24:00:00 for 24 hours.

Suggested change
#SBATCH -t 114514
#SBATCH -t 24:00:00

Comment on lines +207 to +209
model_config = yaml.load(
open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
)["model_config"]


medium

Using yaml.load with FullLoader is unsafe as it can execute arbitrary code. For loading simple configuration files, yaml.safe_load should be used instead to prevent potential security vulnerabilities.

Suggested change
model_config = yaml.load(
open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
)["model_config"]
model_config = yaml.safe_load(
open(config_path, "r", encoding="utf-8"),
)["model_config"]
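For completeness, a minimal example of the safe pattern, here parsing an inline string; the `model_config` key mirrors the config under review, and when reading a file a `with open(...)` context manager would also close the handle that the original one-liner leaks:

```python
import yaml  # PyYAML

# safe_load only constructs plain Python objects (dicts, lists, scalars),
# so a malicious config file cannot trigger arbitrary code execution.
config_text = """
model_config:
  hidden_size: 1024
  num_layers: 24
"""

config = yaml.safe_load(config_text)
model_config = config["model_config"]
```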

