This is the Samples Clustering Pipeline of the Knowledge Engine for Genomics (KnowEnG), an NIH BD2K Center of Excellence.
This pipeline clusters the columns of a given spreadsheet, where the spreadsheet's columns correspond to sample labels and its rows correspond to gene labels.
There are four clustering methods that one can choose from:
| Options | Method | Parameters |
|---|---|---|
| Clustering | nmf | nmf |
| Consensus Clustering | bootstrapping with nmf | cc_nmf |
| Clustering with network regularization | network-based nmf | net_nmf |
| Consensus Clustering with network regularization | bootstrapping with network-based nmf | cc_net_nmf |
Note: all of the clustering methods mentioned above use non-negative matrix factorization (NMF) as the main clustering algorithm.
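To make the role of NMF concrete, here is a minimal sketch of clustering spreadsheet columns with a plain NumPy factorization. It is not the pipeline's own implementation: the function name `nmf_cluster`, the multiplicative-update rule, and the toy data are assumptions chosen for illustration.

```python
import numpy as np

def nmf_cluster(spreadsheet, number_of_clusters, max_iterations=500, seed=0):
    """Cluster the columns (samples) of a non-negative genes x samples matrix."""
    rng = np.random.default_rng(seed)
    n_genes, n_samples = spreadsheet.shape
    eps = 1e-12  # guards against division by zero
    W = rng.random((n_genes, number_of_clusters)) + eps
    H = rng.random((number_of_clusters, n_samples)) + eps
    for _ in range(max_iterations):
        # Multiplicative updates for the Frobenius-norm objective ||X - WH||
        H *= (W.T @ spreadsheet) / (W.T @ W @ H + eps)
        W *= (spreadsheet @ H.T) / (W @ H @ H.T + eps)
    # Each sample is assigned to the factor with the largest coefficient
    return np.argmax(H, axis=0)

# Toy data (hypothetical): two gene blocks, each active in a different sample group
X = np.zeros((6, 8))
X[:3, :4] = 1.0
X[3:, 4:] = 1.0
labels = nmf_cluster(X, number_of_clusters=2)
print(labels)
```

The cluster numbering is arbitrary, so only the grouping of samples is meaningful, not the specific integer assigned to each group.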
If a phenotype data file is included, this pipeline evaluates the clustering result.
There are two evaluation methods:
| Method | Trait Type |
|---|---|
| One-way ANOVA (f_oneway) | Continuous |
| One-way chi-square test (chisquare) | Categorical |
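The sketch below shows one plausible way to choose between the two tests and run them with `scipy.stats.f_oneway` and `scipy.stats.chisquare`. The helper `evaluate_trait`, the rule "more distinct values than `threshold` means continuous", and the contingency-table construction for the chi-square case are assumptions; this README does not state how the pipeline prepares its inputs.

```python
import numpy as np
from scipy import stats

def evaluate_trait(trait_values, cluster_labels, threshold=10):
    """Pick a test from the number of distinct trait values, then run it."""
    trait_values = np.asarray(trait_values)
    cluster_labels = np.asarray(cluster_labels)
    clusters = np.unique(cluster_labels)
    categories = np.unique(trait_values)
    if len(categories) > threshold:
        # Continuous trait: compare the trait values across clusters
        groups = [trait_values[cluster_labels == c].astype(float) for c in clusters]
        statistic, pval = stats.f_oneway(*groups)
        return "f_oneway", statistic, pval
    # Categorical trait: cluster x category contingency counts (assumed layout)
    observed = np.array([[np.sum((cluster_labels == c) & (trait_values == v))
                          for v in categories] for c in clusters], dtype=float)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / observed.sum()
    dof = (len(clusters) - 1) * (len(categories) - 1)
    # ddof is set so chisquare uses the independence-test degrees of freedom
    statistic, pval = stats.chisquare(observed.ravel(), expected.ravel(),
                                      ddof=observed.size - 1 - dof)
    return "chisquare", statistic, pval

# Hypothetical phenotype data for 12 samples in 3 clusters
clusters = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 1, 2])
age = np.array([31.0, 35.5, 33.2, 52.1, 55.0, 50.3, 70.4, 72.8, 69.9, 34.1, 53.6, 71.5])
stage = np.array(["I", "I", "I", "II", "II", "II", "III", "III", "III", "I", "II", "III"])
age_result = evaluate_trait(age, clusters)
stage_result = evaluate_trait(stage, clusters)
print(age_result)
print(stage_result)
```

Here `age` has 12 distinct values and is treated as continuous, while `stage` has 3 and is treated as categorical.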
git clone https://github.com/KnowEnG-Research/Samples_Clustering_Pipeline.git
apt-get install -y python3-pip
apt-get install -y libfreetype6-dev libxft-dev
apt-get install -y libblas-dev liblapack-dev libatlas-base-dev gfortran
pip3 install pyyaml
pip3 install knpackage
pip3 install scipy==0.18.0
pip3 install numpy==1.11.1
pip3 install pandas==0.18.1
pip3 install matplotlib==1.4.2
pip3 install scikit-learn==0.17.1
cd Samples_Clustering_Pipeline
cd test
make env_setup
| Command | Option |
|---|---|
| make run_nmf | Clustering |
| make run_net_nmf | Clustering with network regularization |
| make run_cc_nmf_serial | Consensus Clustering |
| make run_cc_nmf_parallel_shared | Consensus Clustering |
| make run_cc_net_nmf_serial | Consensus Clustering with network regularization |
| make run_cc_net_nmf_parallel_shared | Consensus Clustering with network regularization |
Follow steps 1-3 above then do the following:
mkdir run_directory
cd run_directory
mkdir results_directory
Look for example run-parameter files in the Samples_Clustering_Pipeline/data/run_files directory, such as zTEMPLATE_cc_net_nmf.yml.
Change processing_method to serial or parallel, depending on your machine.
processing_method: serial
Set the data file targets to the files you want to run, and set the parameters as appropriate for your data.
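As an illustration of preparing a run file, the sketch below writes a handful of run parameters as flat `key: value` lines, which is valid YAML for simple scalars. The helper `write_run_file`, the file name `my_cc_net_nmf.yml`, and every `/path/to/...` value are placeholders, not paths from the repository; `yaml.safe_dump` from pyyaml would serve equally well.

```python
import os
import tempfile

# Placeholder values; replace the /path/to/... entries with your own files
run_parameters = {
    "method": "cc_net_nmf",
    "gg_network_name_full_path": "/path/to/STRING_experimental_gene_gene.edge",
    "spreadsheet_name_full_path": "/path/to/spreadsheet.tsv",
    "phenotype_data_full_path": "/path/to/phenotype.txt",
    "results_directory": "./results_directory",
    "processing_method": "serial",
    "number_of_clusters": 3,
    "number_of_bootstraps": 4,
}

def write_run_file(parameters, run_directory, run_file_name):
    """Write flat key: value pairs, one per line, and return the file path."""
    os.makedirs(run_directory, exist_ok=True)
    run_file_path = os.path.join(run_directory, run_file_name)
    with open(run_file_path, "w") as handle:
        for key, value in parameters.items():
            handle.write(f"{key}: {value}\n")
    return run_file_path

# A temporary directory stands in for run_directory in this sketch
run_file_path = write_run_file(run_parameters, tempfile.mkdtemp(), "my_cc_net_nmf.yml")
print(open(run_file_path).read())
```

Only a subset of the keys is shown; a real run file needs all the parameters required by the chosen method.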
- Update the PYTHONPATH environment variable
export PYTHONPATH='../src':$PYTHONPATH
- Run
python3 ../src/samples_clustering.py -run_directory ./run_dir -run_file zTEMPLATE_cc_net_nmf.yml
| Key | Value | Comments |
|---|---|---|
| method | nmf, cc_nmf, net_nmf or cc_net_nmf | Choose clustering method |
| gg_network_name_full_path | directory+gg_network_name | Path and file name of the 4-column network file |
| spreadsheet_name_full_path | directory+spreadsheet_name | Path and file name of user-supplied gene sets |
| phenotype_data_full_path | directory+phenotype_data_name | Path and file name of user-supplied phenotype data |
| threshold | 10 | Cluster evaluation: cutoff level between categorical and continuous traits |
| results_directory | directory | Directory to save the output files |
| tmp_directory | directory | Directory to save the intermediate files |
| rwr_max_iterations | 100 | Maximum number of iterations without convergence in random walk with restart |
| rwr_convergence_tolerence | 1.0e-8 | Frobenius norm tolerance of spreadsheet vector in random walk |
| rwr_restart_probability | 0.7 | alpha in V_(n+1) = alpha * N * Vn + (1-alpha) * Vo |
| rows_sampling_fraction | 0.8 | Select 80% of spreadsheet rows |
| cols_sampling_fraction | 0.8 | Select 80% of spreadsheet columns |
| number_of_bootstraps | 4 | Number of random samplings |
| number_of_clusters | 3 | Estimated number of clusters |
| nmf_conv_check_freq | 50 | Check convergence at given frequency |
| nmf_max_invariance | 200 | Maximum number of invariance |
| nmf_max_iterations | 10000 | Maximum number of iterations |
| nmf_penalty_parameter | 1400 | Penalty parameter |
| top_number_of_genes | 100 | Number of top genes selected |
| processing_method | serial or parallel or distribute | Choose processing method |
| parallelism | number of cores to use in parallel processing | Set number of cores for speed or memory |
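The random-walk parameters in the table can be read against the update rule V_(n+1) = alpha * N * Vn + (1-alpha) * Vo. The sketch below iterates that rule with NumPy; the function name, the column normalization of the network, and the toy inputs are assumptions rather than the pipeline's own code.

```python
import numpy as np

def random_walk_with_restart(network, spreadsheet, restart_probability=0.7,
                             max_iterations=100, tolerance=1.0e-8):
    """Iterate V_(n+1) = alpha * N * Vn + (1 - alpha) * Vo until the change is small."""
    column_sums = network.sum(axis=0)
    column_sums[column_sums == 0] = 1.0  # isolated nodes stay as all-zero columns
    N = network / column_sums            # each non-zero column of N sums to 1
    V0 = spreadsheet.astype(float)
    V = V0.copy()
    for iteration in range(1, max_iterations + 1):
        V_next = restart_probability * (N @ V) + (1.0 - restart_probability) * V0
        change = np.linalg.norm(V_next - V)  # Frobenius norm for a 2-D array
        V = V_next
        if change < tolerance:
            break
    return V, iteration

# Hypothetical 4-gene path network and a genes x samples spreadsheet
network = np.array([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=float)
spreadsheet = np.array([[1, 0], [0, 0], [0, 0], [0, 1]], dtype=float)
smoothed, n_iter = random_walk_with_restart(network, spreadsheet)
print(n_iter)
print(smoothed)
```

With a column-normalized network, each sample's column keeps its original total while the signal spreads to neighboring genes.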
gg_network_name = STRING_experimental_gene_gene.edge
spreadsheet_name = ProGENI_rwr20_STExp_GDSC_500.rname.gxc.tsv
phenotype_data_name = UCEC_phenotype.txt
- All four methods save the per-row variances of the genes-by-samples heatmap in a file named genes_variance_{method}_{timestamp}_viz.tsv.
| | variance |
|---|---|
| gene 1 | float |
| ... | ... |
| gene m | float |
- All four methods save the genes-by-samples heatmap in a file named genes_by_samples_heatmp_{method}_{timestamp}_viz.tsv.
| | sample 1 | ... | sample n |
|---|---|---|---|
| gene 1 | float | ... | float |
| ... | ... | ... | ... |
| gene m | float | ... | float |
- All four methods save the samples-by-samples heatmap in a file named consensus_matrix_{method}_{timestamp}_viz.tsv.
| | sample 1 | ... | sample n |
|---|---|---|---|
| sample 1 | float | ... | float |
| ... | ... | ... | ... |
| sample n | float | ... | float |
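A samples-by-samples consensus matrix is commonly built by counting how often two samples land in the same cluster across bootstraps, divided by how often both were sampled together. The sketch below follows that common construction; the function name and the input format (index/label pairs per bootstrap) are assumptions, and the pipeline's exact normalization may differ.

```python
import numpy as np

def consensus_matrix(bootstrap_labels, n_samples):
    """bootstrap_labels: list of (sample_indices, cluster_labels) pairs."""
    co_clustered = np.zeros((n_samples, n_samples))
    co_sampled = np.zeros((n_samples, n_samples))
    for sample_indices, labels in bootstrap_labels:
        sample_indices = np.asarray(sample_indices)  # unique indices per bootstrap
        labels = np.asarray(labels)
        same_cluster = (labels[:, None] == labels[None, :]).astype(float)
        block = np.ix_(sample_indices, sample_indices)
        co_clustered[block] += same_cluster
        co_sampled[block] += 1.0
    # Pairs never drawn together keep a consensus of 0
    return np.divide(co_clustered, co_sampled,
                     out=np.zeros_like(co_clustered), where=co_sampled > 0)

# Three hypothetical bootstraps over 5 samples, each drawing 4 of them
bootstraps = [
    ([0, 1, 2, 3], [0, 0, 1, 1]),
    ([0, 1, 2, 4], [1, 1, 0, 0]),
    ([1, 2, 3, 4], [0, 1, 1, 1]),
]
consensus = consensus_matrix(bootstraps, n_samples=5)
print(consensus)
```

Cluster numbers are compared only within a bootstrap, so label switching between bootstraps does not affect the result.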
- All four methods save the sample-to-cluster map in a file named samples_labeled_by_cluster_{method}_{timestamp}_viz.tsv.
| | cluster |
|---|---|
| sample 1 | int |
| ... | ... |
| sample n | int |
- All four methods save the gene scores by cluster in a file named genes_averages_by_cluster_{method}_{timestamp}_viz.tsv.
| | cluster 1 | ... | cluster k |
|---|---|---|---|
| gene 1 | float | ... | float |
| ... | ... | ... | ... |
| gene m | float | ... | float |
- All four methods save a spreadsheet of the top-ranked genes per cluster in a file named top_genes_by_cluster_{method}_{timestamp}_download.tsv.
| | cluster 1 | ... | cluster k |
|---|---|---|---|
| gene 1 | 1/0 | ... | 1/0 |
| ... | ... | ... | ... |
| gene m | 1/0 | ... | 1/0 |
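The two cluster-level gene tables above can be related with a short sketch: average each gene within a cluster, then mark the highest-scoring genes per cluster with 1 and the rest with 0. The function names and the ranking by plain cluster mean are assumptions; the pipeline's scoring may differ.

```python
import numpy as np

def genes_averages_by_cluster(spreadsheet, labels, number_of_clusters):
    """Mean of each gene (row) over the samples (columns) of each cluster."""
    # Assumes every cluster contains at least one sample
    return np.column_stack([spreadsheet[:, labels == k].mean(axis=1)
                            for k in range(number_of_clusters)])

def top_genes_by_cluster(averages, top_number_of_genes):
    """1 where a gene is among the top genes of a cluster, else 0."""
    indicator = np.zeros(averages.shape, dtype=int)
    top_rows = np.argsort(-averages, axis=0)[:top_number_of_genes, :]
    indicator[top_rows, np.arange(averages.shape[1])] = 1
    return indicator

# Hypothetical 4 genes x 4 samples spreadsheet with two clusters
spreadsheet = np.array([[5.0, 4.0, 0.0, 1.0],
                        [1.0, 2.0, 6.0, 5.0],
                        [3.0, 3.0, 3.0, 3.0],
                        [0.0, 1.0, 2.0, 9.0]])
labels = np.array([0, 0, 1, 1])
averages = genes_averages_by_cluster(spreadsheet, labels, number_of_clusters=2)
top_genes = top_genes_by_cluster(averages, top_number_of_genes=2)
print(averages)
print(top_genes)
```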
- All methods save three silhouette scores: silhouette overall score, silhouette per cluster score and silhouette per sample score, with name silhouette_{method}_{timestamp}_viz.tsv.
- silhouette overall score file: | number of clusters | silhouette score |
- silhouette per cluster score file: | ith cluster | corresponding silhouette score |
- silhouette per sample score file: | ith sample | corresponding silhouette score |
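For readers unfamiliar with the three silhouette levels, the sketch below computes per-sample scores from Euclidean distances with NumPy and then averages them per cluster and overall. The function name and the toy points are hypothetical, and the feature space in which the pipeline measures distances is not stated here, so treat this only as an illustration of the definitions.

```python
import numpy as np

def silhouette_per_sample(points, labels):
    """Silhouette (b - a) / max(a, b) per sample; needs at least two clusters."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    distances = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        if same.sum() < 2:
            continue  # singleton clusters score 0 by convention
        # a: mean distance to the other members of the sample's own cluster
        a = distances[i, same].sum() / (same.sum() - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(distances[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores[i] = (b - a) / max(a, b)
    return scores

# Hypothetical samples as rows in a 2-D feature space, two clusters
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
per_sample = silhouette_per_sample(points, labels)
per_cluster = {int(k): float(per_sample[labels == k].mean()) for k in np.unique(labels)}
overall = float(per_sample.mean())
print(per_sample, per_cluster, overall)
```

Scores near 1 indicate well-separated clusters, scores near 0 indicate overlapping clusters, and negative scores suggest misassigned samples.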
- All four methods save the sample-to-cluster map together with the phenotype data in a file named phenotypes_labeled_by_cluster_{method}_{timestamp}_viz.tsv.
| sample id | cluster | phenotype 1 | ... | phenotype k |
|---|---|---|---|---|
| sample 1 | int | mixed type | ... | mixed type |
| ... | ... | ... | ... | ... |
| sample n | int | mixed type | ... | mixed type |
- The clustering evaluation output file has the name clustering_evaluation_result_{timestamp}.tsv.
| | Measure | Trait_length_after_dropna | Sample_number_after_dropna | chi/fval | pval |
|---|---|---|---|---|---|
| sample 1 | f_oneway | int(more than threshold) | int | float | float |
| ... | ... | ... | ... | ... | ... |
| sample m | chisquare | int(less than threshold) | int | float | float |
References:
- Hofree, M., Shen, J. P., Carter, H., Gross, A. & Ideker, T. Network-based stratification of tumor mutations. Nat. Methods 10, 1108–1115 (2013).