[PZ COMPETITION] code submission by losercheems #165
LoserCheems wants to merge 6 commits into FlagAI-Open:main from
Conversation
- Created `ddp.yaml` for distributed training configuration.
- Added `config.yaml` for model and training parameters.
- Implemented `download.py` for dataset downloading and processing.
- Introduced `merge.py` for merging model checkpoints with various methods.
- Added `train.sh` for SLURM job submission with environment setup.
- Developed `pt_dpsk.py` for training the OpenSeek model with Hugging Face Trainer.
- Created `training_args_configs.py` for defining training argument configurations.
Summary of Changes
Hello @LoserCheems, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request establishes a comprehensive training infrastructure for the OpenSeek model, covering both its initial continue pre-training and subsequent KTO-style preference alignment. It provides all necessary configuration files, data preparation scripts, and training entry points, designed for efficient execution on multi-GPU environments using accelerate and DeepSpeed. The changes enable reproducible model development from raw data to a fine-tuned checkpoint, complete with evaluation examples and merging capabilities.
Highlights
- OpenSeek KTO Alignment Pipeline: Introduced a complete Kahneman–Tversky Optimization (KTO) alignment fine-tuning pipeline for the OpenSeek model, including scripts for asset acquisition, dataset transformation, and alignment training.
- OpenSeek Continue Pre-training Workflow: Added a three-stage continue pre-training workflow for the OpenSeek model, encompassing asset download, mixed dataset building, distributed training, and model checkpoint merging (model soup).
- Distributed Training Configurations: Provided `accelerate` configuration files for both DeepSpeed ZeRO-2 (for KTO) and DDP (for pre-training) to enable efficient multi-GPU training.
- Data Preparation Scripts: Developed dedicated Python scripts for downloading base models and datasets, as well as transforming raw datasets into formats suitable for KTO preference alignment and pre-training with mixed ratios and sequence packing.
- SLURM Integration: Included `train.sh` scripts for both pipelines to facilitate job submission on SLURM clusters, setting up necessary environment variables and launching `accelerate` commands.
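The "sequence packing" mentioned in the highlights can be sketched minimally: tokenized samples are concatenated and sliced into fixed-length blocks, with the trailing remainder dropped. This is a hedged illustration of the general technique, not the PR's actual implementation (function and parameter names here are hypothetical).

```python
def pack_sequences(token_lists, max_length):
    """Greedy sequence packing sketch: flatten tokenized samples into one
    stream, then cut it into fixed-size blocks; the tail remainder is dropped."""
    flat = [tok for toks in token_lists for tok in toks]
    n_blocks = len(flat) // max_length
    return [flat[i * max_length:(i + 1) * max_length] for i in range(n_blocks)]
```

Packing keeps every training batch fully dense, which matters for pre-training throughput on short, variable-length documents.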
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request by creating an issue comment (i.e. a comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |
Customization
To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
Pull Request Overview
This PR implements a comprehensive training and alignment pipeline for the OpenSeek model, adding configuration files, data processing scripts, and training components for both pretraining and preference-based fine-tuning (KTO).
- Creates complete training infrastructure with distributed training configurations and dataset processing utilities
- Implements both preliminary pretraining pipeline and final KTO alignment workflow
- Adds comprehensive documentation and example configurations for model training and evaluation
Reviewed Changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 10 comments.
| File | Description |
|---|---|
| openseek/competition/pz/losercheems/preliminary/utils/training_args_configs.py | Defines PTConfig class for pretraining configuration parameters |
| openseek/competition/pz/losercheems/preliminary/trainer/pt_dpsk.py | Main pretraining script with Trainer setup and dataset processing |
| openseek/competition/pz/losercheems/preliminary/train.sh | SLURM batch script for distributed training launch |
| openseek/competition/pz/losercheems/preliminary/scripts/merge.py | Model checkpoint merging utility with multiple strategies |
| openseek/competition/pz/losercheems/preliminary/scripts/download.py | Dataset downloading and preprocessing script |
| openseek/competition/pz/losercheems/preliminary/recipes/openseek/config.yaml | Training configuration with model and hyperparameter settings |
| openseek/competition/pz/losercheems/preliminary/recipes/accelerate_configs/ddp.yaml | Distributed training configuration for Accelerate |
| openseek/competition/pz/losercheems/preliminary/processor/pt_datasets_process.py | Dataset processing utilities for mixing and tokenization |
| openseek/competition/pz/losercheems/preliminary/README.md | Technical documentation for the pretraining workflow |
| openseek/competition/pz/losercheems/final/trainer/kto.py | KTO alignment training script |
| openseek/competition/pz/losercheems/final/train.sh | Training launch script for KTO alignment |
| openseek/competition/pz/losercheems/final/scripts/kto_datasets_process.py | Dataset processing for KTO preference format |
| openseek/competition/pz/losercheems/final/scripts/download.py | Asset download script for KTO training |
| openseek/competition/pz/losercheems/final/recipes/openseek/config.yaml | KTO training configuration |
| openseek/competition/pz/losercheems/final/recipes/accelerate_configs/zero2.yaml | DeepSpeed ZeRO-2 configuration |
| openseek/competition/pz/losercheems/final/README.md | Technical documentation for KTO alignment workflow |
    set_seed,
)
from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig
This import will fail because 'utils' is not in the Python path. The import should be a relative import: from ..utils.training_args_configs import PTConfig or an absolute import from the package root.
Suggested change:
- from utils.training_args_configs import PTConfig
+ from ..utils.training_args_configs import PTConfig
from transformers.trainer_utils import get_last_checkpoint
from utils.training_args_configs import PTConfig

from small_doge.processor import mix_pt_datasets
This import references 'small_doge.processor' but should import from the local processor module: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets.
Suggested change:
- from small_doge.processor import mix_pt_datasets
+ from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
    model_args.model_name_or_path,
    config=config,
).to(torch_dtype)
# if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)
This commented-out code should be removed as it clutters the codebase and provides no value in its current state.
Suggested change:
- # if model_args.model_name_or_path is not None and model_args.model_name_or_path.endswith("checkpoint") else causal_lm_class(config=config).to(torch_dtype)
#!/bin/bash
#SBATCH -N 1
#SBATCH -n 32
#SBATCH -t 114514
The time limit '114514' appears to be a placeholder or meme number rather than a realistic job duration. This should be set to an appropriate time limit for the training job.
Suggested change:
- #SBATCH -t 114514
+ #SBATCH -t 1-00:00:00
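For reference, Slurm's `--time` option accepts several formats (per the sbatch documentation), so the suggested `1-00:00:00` is one of several equivalent ways to request a day:

```shell
#SBATCH -t 120          # bare number: minutes
#SBATCH -t 24:00:00     # hours:minutes:seconds
#SBATCH -t 1-00:00:00   # days-hours:minutes:seconds (one day)
```

A bare `114514` is therefore interpreted as roughly 79 days of minutes, which most cluster partitions will reject outright.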
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-3000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-4000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
The checkpoint-5000 path is duplicated in the list. This will cause the same checkpoint to be loaded twice during merging, which is likely unintentional.
Suggested change:
- checkpoint_paths.append(r"./data/OpenSeek-1.4B-A0.4B/checkpoint-5000")
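Beyond deleting the duplicated line, a defensive option is to deduplicate the list before merging; `dict.fromkeys` drops repeats while preserving order. A small sketch (not the PR's code):

```python
checkpoint_paths = [
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-3000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-4000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",
    "./data/OpenSeek-1.4B-A0.4B/checkpoint-5000",  # accidental duplicate
]

# dict.fromkeys removes duplicates while preserving insertion order (Python 3.7+)
unique_paths = list(dict.fromkeys(checkpoint_paths))
```

This matters for an "average" merge in particular, since a duplicated checkpoint silently double-weights it in the soup.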
@@ -0,0 +1,105 @@
from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
This import path is incorrect. It should be a relative import: from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets or an absolute import from the package root.
Suggested change:
- from preliminary.processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
+ from ..processor.pt_datasets_process import mix_datasets_by_ratio as mix_pt_datasets
def calculate_total_ratio(datasets_and_ratios):
    return sum(item for item in datasets_and_ratios.values())

total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)
The calculate_total_ratio function expects a dictionary but datasets_and_ratios is a list of dictionaries. This line should be: total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios).
Suggested change:
- def calculate_total_ratio(datasets_and_ratios):
-     return sum(item for item in datasets_and_ratios.values())
- total_ratio = sum(calculate_total_ratio(dataset) for dataset in datasets_and_ratios)
+ total_ratio = sum(list(dataset.values())[0] for dataset in datasets_and_ratios)
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
The number of processes (8) doesn't match the GPU allocation in train.sh (4 GPUs). These values should be consistent - either both should be 4 or both should be 8.
Suggested change:
- num_processes: 8
+ num_processes: 4
# Process each split of the dataset
for split_name, split_dataset in dataset.items():
    split_dataset = split_dataset.select_columns(["input_ids"])
This line assumes the dataset already has an 'input_ids' column, but the dataset may not be preprocessed yet. This should be moved after the prepare_dataset call or made conditional on whether the dataset is already processed.
from transformers import AutoModelForCausalLM, AutoTokenizer, set_seed
from transformers.trainer_utils import get_last_checkpoint

from small_doge.processor import mix_dpo_datasets
This import is unused in the code and references an external module that may not exist. The import should be removed since the dataset is loaded directly with datasets.load_from_disk.
Suggested change:
- from small_doge.processor import mix_dpo_datasets
Code Review
This pull request introduces a comprehensive set of files for pre-training and KTO alignment of the OpenSeek model. The changes include scripts for data downloading and processing, configuration files for training, and the training scripts themselves, along with detailed documentation. The overall structure is good, but there are several areas for improvement, particularly concerning hardcoded paths, code clarity, and reproducibility. I've identified several critical and high-severity issues related to hardcoded absolute paths in configuration and scripts, which will prevent others from running this code without modification. There are also some logical errors, such as a duplicated checkpoint in the merge script and confusing data processing logic. I've provided specific suggestions to address these points to make the pipeline more robust and portable.
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643

datasets_and_ratios:
- /workspace/datasets/OpenSeek-Pretrain-30B: 1.0
The dataset path /workspace/datasets/OpenSeek-Pretrain-30B is hardcoded as an absolute path. This severely impacts reproducibility, as the training will fail on any machine where this exact path does not exist. Please use a relative path or an environment variable.
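One hedged way to avoid the hardcoded absolute path is to resolve the dataset root from an environment variable at launch, falling back to a repo-relative default. The variable name below is hypothetical:

```python
import os

# Hypothetical: resolve the dataset root from an environment variable,
# falling back to a repo-relative default so configs stay portable.
data_root = os.environ.get("OPENSEEK_DATA_ROOT", "./datasets")
dataset_path = os.path.join(data_root, "OpenSeek-Pretrain-30B")
```

Cluster-specific locations then live in the job script (`export OPENSEEK_DATA_ROOT=...`) rather than in version-controlled config.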
Suggested change:
- - /workspace/datasets/OpenSeek-Pretrain-30B: 1.0
+ - ./datasets/OpenSeek-Pretrain-30B: 1.0

numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
print(numina_math_cot)
numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
print(numina_math_cot)
print(numina_math_cot["train"][0])
numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")
The script contains a hardcoded absolute path /root/code/small-doge/datasets/AI-MO/NuminaMath-CoT for loading the dataset and a hardcoded relative path for saving. This severely impacts portability and reproducibility. These paths should be parameterized using command-line arguments (e.g., with argparse) to make the script reusable.
Suggested change:
- numina_math_cot = load_from_disk("/root/code/small-doge/datasets/AI-MO/NuminaMath-CoT")
- print(numina_math_cot)
- numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
- print(numina_math_cot)
- print(numina_math_cot["train"][0])
- numina_math_cot.save_to_disk("./datasets/AI-MO/NuminaMath-CoT-preference")
+ import argparse
+
+ if __name__ == "__main__":
+     parser = argparse.ArgumentParser(description="Process dataset for KTO training.")
+     parser.add_argument("--input_path", type=str, required=True, help="Path to the input dataset.")
+     parser.add_argument("--output_path", type=str, required=True, help="Path to save the processed dataset.")
+     args = parser.parse_args()
+
+     numina_math_cot = load_from_disk(args.input_path)
+     print(numina_math_cot)
+     numina_math_cot = numina_math_cot.map(process, num_proc=4).select_columns(["prompt", "completion", "label"])
+     print(numina_math_cot)
+     print(numina_math_cot["train"][0])
+     numina_math_cot.save_to_disk(args.output_path)
# Process each split of the dataset
for split_name, split_dataset in dataset.items():
    split_dataset = split_dataset.select_columns(["input_ids"])
This line prematurely filters the dataset to only the input_ids column before calling prepare_dataset. This prevents the tokenization logic inside prepare_dataset (which expects a text field) from ever running, making the code confusing and brittle if a raw text dataset is ever used. This selection should happen inside prepare_dataset after tokenization and before packing.
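The conditional ordering the review asks for can be sketched with a toy dict-of-columns standing in for a `datasets.Dataset` (the helper names and the tokenize callback are hypothetical): tokenize only when `input_ids` is absent, then narrow to that column.

```python
def ensure_input_ids(split, tokenize):
    """Sketch: a dict of columns stands in for a Dataset. Tokenize raw text
    only when 'input_ids' is missing, then keep just that column."""
    if "input_ids" not in split:
        split = {"input_ids": [tokenize(text) for text in split["text"]]}
    return {"input_ids": split["input_ids"]}
```

With this shape, a raw text dataset and an already-tokenized one both flow through the same code path, which is the robustness the reviewer is after.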
    args=training_args,
    train_dataset=dataset[script_args.dataset_train_split],
    eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,
    processing_class=tokenizer,
Depending on the installed trl version, KTOTrainer may not accept a processing_class argument; older releases expect the tokenizer via the tokenizer= keyword instead. Verify the constructor signature of the trl release you pin, and either rename the argument accordingly or remove it if the trainer handles the tokenizer implicitly.
    final_result.json          # Aggregated metrics summary
<benchmark_name>/              # Per-benchmark JSONL + metrics
The indentation for final_result.json in the directory overview seems incorrect. It appears to be a file inside the <benchmark_name> directory, but based on the description, it should likely be at the same level as <benchmark_name>.
Suggested change:
-     final_result.json          # Aggregated metrics summary
- <benchmark_name>/              # Per-benchmark JSONL + metrics
+ final_result.json              # Aggregated metrics summary
+ <benchmark_name>/              # Per-benchmark JSONL + metrics
# datasets_and_ratios:
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-high_part_142_text_document: 0.011068
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-low_part_62_text_document: 0.003577
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-actual-actual-mid_part_189_text_document: 0.007775
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-high_part_76_text_document: 0.002859
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-low_part_124_text_document: 0.001672
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-distill-mid_part_29_text_document: 0.002339
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-high_part_244_text_document: 0.005397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-low_part_150_text_document: 0.004064
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-diverse_qa_pairs-mid_part_444_text_document: 0.005005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-high_part_498_text_document: 0.004616
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-low_part_10_text_document: 0.00067
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-extract_knowledge-mid_part_144_text_document: 0.003429
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-high_part_86_text_document: 0.00261
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-low_part_133_text_document: 0.001824
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-knowledge_list-mid_part_139_text_document: 0.002313
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-high_part_47_text_document: 0.008237
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-low_part_11_text_document: 0.002866
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-high-synthetic-wrap_medium-mid_part_97_text_document: 0.00667
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-high_part_43_text_document: 0.004657
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-low_part_10_text_document: 0.002005
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-low-synthetic-wrap_medium-mid_part_164_text_document: 0.004317
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-high_part_92_text_document: 0.011397
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-low_part_113_text_document: 0.006782
# - /workspace/datasets/OpenSeek-Pretrain-100B/Nemotron-CC-medium-actual-actual-mid_part_563_text_document: 0.009175
# - /workspace/datasets/OpenSeek-Pretrain-100B/arxiv_007_00000_text_document: 0.006414
# - /workspace/datasets/OpenSeek-Pretrain-100B/books_016_00007_text_document: 0.004696
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-high_part_13_text_document: 0.010102
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-low_part_36_text_document: 0.011403
# - /workspace/datasets/OpenSeek-Pretrain-100B/code-mid_part_37_text_document: 0.009674
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-high_23_text_document: 0.003755
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-low_51_text_document: 0.000499
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_118_text_document: 0.003608
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_176_text_document: 0.003623
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_256_text_document: 0.003704
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_320_text_document: 0.003733
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_CC-mid_32_text_document: 0.003631
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-high_1_text_document: 0.002573
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-low_2_text_document: 0.001638
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_OpenSource-mid_3_text_document: 0.003251
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-high_2_text_document: 0.060237
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-low_1_text_document: 0.089063
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_arxiv-mid_2_text_document: 0.101376
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-high_4_text_document: 0.004598
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-low_6_text_document: 0.006857
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_code-mid_23_text_document: 0.00899
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-high_12_text_document: 0.013135
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-low_3_text_document: 0.01653
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_math-mid_5_text_document: 0.003536
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-high_5_text_document: 0.006314
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-low_5_text_document: 0.005978
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis2_wiki-mid_4_text_document: 0.007909
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-high_74_text_document: 0.002225
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-low_54_text_document: 0.001797
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_CC-mid_275_text_document: 0.002042
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-high_4_text_document: 0.004081
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-low_2_text_document: 0.001659
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_OpenSource-mid_6_text_document: 0.012828
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-high_2_text_document: 0.0568
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-low_1_text_document: 0.074907
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_arxiv-mid_1_text_document: 0.089359
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-high_13_text_document: 0.007663
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-low_9_text_document: 0.004052
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_code-mid_6_text_document: 0.001916
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-high_11_text_document: 0.005074
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-low_11_text_document: 0.006437
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_math-mid_29_text_document: 0.006406
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-high_4_text_document: 0.004
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-low_6_text_document: 0.003564
# - /workspace/datasets/OpenSeek-Pretrain-100B/cot_synthesis_wiki-mid_3_text_document: 0.005768
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-high_part_04_text_document: 0.018165
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-low_part_10_text_document: 0.01694
# - /workspace/datasets/OpenSeek-Pretrain-100B/math-mid_part_07_text_document: 0.016311
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0041-of-0136_text_document: 0.00687
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-train_train-0125-of-0136_text_document: 0.007387
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o-full-val_valid-0034-of-0060_text_document: 0.000143
# - /workspace/datasets/OpenSeek-Pretrain-100B/pes2o_pubmedcentral_3_text_document: 0.061982
# - /workspace/datasets/OpenSeek-Pretrain-100B/stack_018_00000_text_document: 0.004229
# - /workspace/datasets/OpenSeek-Pretrain-100B/wiki_012_00000_text_document: 0.004202
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss0_part_28_text_document: 0.018171
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss1_part_59_text_document: 0.009776
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-high-loss2_part_16_text_document: 0.003725
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss0_part_192_text_document: 0.009492
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss1_part_550_text_document: 0.009236
# - /workspace/datasets/OpenSeek-Pretrain-100B/zh_cc-medidum-loss2_part_71_text_document: 0.010643
| """ | ||
| 合并多个checkpoints | ||
|
|
||
| Args: | ||
| checkpoint_paths: 要合并的checkpoint路径列表 | ||
| output_path: 合并后模型的保存路径 | ||
| merge_method: 合并方法,支持 "average", "last", "weighted_average" | ||
| """ |
The docstring and comments are in Chinese, which is inconsistent with the rest of the project (e.g., READMEs, other scripts) which is in English. For better maintainability and accessibility for all contributors, please translate these to English.
| """ | |
| 合并多个checkpoints | |
| Args: | |
| checkpoint_paths: 要合并的checkpoint路径列表 | |
| output_path: 合并后模型的保存路径 | |
| merge_method: 合并方法,支持 "average", "last", "weighted_average" | |
| """ | |
| """ | |
| Merges multiple checkpoints. | |
| Args: | |
| checkpoint_paths: A list of checkpoint paths to be merged. | |
| output_path: The path to save the merged model. | |
| merge_method: The method for merging, supports "average", "last", "weighted_average". | |
| """ |
model_config = yaml.load(
    open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
)["model_config"]
Using yaml.load with FullLoader is unsafe as it can execute arbitrary code. For loading simple configuration files, yaml.safe_load should be used instead to prevent potential security vulnerabilities.
Suggested change:
- model_config = yaml.load(
-     open(config_path, "r", encoding="utf-8"), Loader=yaml.FullLoader
- )["model_config"]
+ model_config = yaml.safe_load(
+     open(config_path, "r", encoding="utf-8"),
+ )["model_config"]
Hi, the code has been submitted through this PR.
The checkpoints and logs have been published on HuggingFace.