Take in multiple inputs with different assemblies; pull chr info from bam instead of fasta#44
Conversation
…ser controlled input filtering
…s from input bam instead of fasta
…ent assembly inputs
mrvollger
left a comment
There was a problem hiding this comment.
I have two comments that apply in lots of places.
- Delete old unused code.
- We should use only the bam header instead of fai_df and the bam header.
workflow/Snakefile
Outdated
| FAI = get_fai() | ||
| REF_NAME = config["ref_name"] | ||
| EXCLUDES = get_excludes() | ||
| #REF = get_ref() |
There was a problem hiding this comment.
delete completely instead of commenting
workflow/rules/common.smk
Outdated
| raise ValueError(f"FIRE: reference file {ref} does not exist") | ||
| return os.path.abspath(ref) | ||
|
|
||
| def get_ref_old(wc): |
There was a problem hiding this comment.
I think the functions you have added with _old or _orig are not used. if so they should be removed.
workflow/rules/common.smk
Outdated
| return fai | ||
|
|
||
| def get_fai(wc): | ||
| ref = MANIFEST.loc[wc.sm, "ref"] |
There was a problem hiding this comment.
to get ref you should call get_ref. And then generally it is good to use f strings in situations like this:
ref = get_ref(wc)
fai = f"{ref}.fai"
...
workflow/rules/common.smk
Outdated
|
|
||
| bam_chr_list=[] | ||
| input_bam_path=get_input_bam(wc) | ||
| input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS) |
There was a problem hiding this comment.
you don't need extra threads here, I would drop that. Also the "rc" part can be infered and not all inputs will be cram so we dont want it. basically delete the last two args.
workflow/rules/common.smk
Outdated
| input_bam = pysam.AlignmentFile(input_bam_path, "rc", threads=MAX_THREADS) | ||
| bam_header_dict = input_bam.header.to_dict() | ||
|
|
||
| for line in bam_header_dict['SQ']: |
There was a problem hiding this comment.
The length of each chromosome is also available in the bam header. So we should use the bam header only instead of mixing fai_df and this method.
workflow/rules/track-hub.smk
Outdated
| suffix=get_hap_col_suffix, | ||
| nzooms=NZOOMS, | ||
| chrom=get_chroms()[0], | ||
| #chroms=get_chroms, |
…d removed commented unused code
This PR address two main issues:
(1) Previously, a single assembly was defined in the config file and if multiple samples were submitted in the input file, all had to be associated with that assembly. This PR makes it possible to structure an input file to have 4 columns: sample, bam, ref, ref_name. See example: config/config_4col.tbl. The yaml file can omit ref and ref_name entirely or leave them blank (example: config/config_4col.yaml). If a single reference can be used with all samples, the original format where the ref and ref_name are only indicated in config.yaml will still work.
(2) Previously, a list of chrs was pulled from the input fasta. This can be problematic if not all of these are present in the input bam. This PR modifies these names to be pulled directly from the bam instead.