We have also automatically generated a general report for the workflow, which is stored in the tutorial/ subfolder relative to the working directory of the pipeline. Take a look at the statistics in report.html. Some rules took longer to complete than others, but they were still very fast.
Throughout the pipeline, several simple plots are generated to give insights into the insertions' characteristics, such as their length and chromosomal specificity. Navigate to the results tab to explore the detected insertion lengths. It appears that some reads only contain parts of the insertion.
If you would like to explore quality control metrics, check out the multiqc.html report in the results tab. Since our data is simulated, you will probably not be too happy with it.
Now, let's examine the output files directly generated by the pipeline. Navigate to the output folder as specified in the config. To get an overview of the file structure in this directory, run tree tutorial/out/simulation_tutorial/.
Output directory structure
tutorial/out/simulation_tutorial/
├── config_settings.yml
├── final
│ ├── functional_genomics
│ │ ├── Functional_distances_to_Insertions_S1.bed
│ │ └── Functional_distances_to_Insertions_S2.bed
│ ├── localization
│ │ ├── ExactInsertions_S1.bed
│ │ ├── ExactInsertions_S2.bed
│ │ ├── Heatmap_Insertion_Chr.png
│ │ ├── Insertion_length.png
│ │ ├── InsertionPoints_S1.bed
│ │ └── InsertionPoints_S2.bed
│ └── qc
│ ├── Fragmentation
│ │ ├── Insertions
│ │ │ ├── insertions_100_S1
│ │ │ │ ├── 100_fragmentation_distribution.png
│ │ │ │ └── 100_read_match_fragmentation_distribution.png
│ │ │ └── insertions_100_S2
│ │ │ ├── 100_fragmentation_distribution.png
│ │ │ └── 100_read_match_fragmentation_distribution.png
│ │ ├── Longest_Interval
│ │ │ ├── S1
│ │ │ │ ├── Longest_interval_Read-343.png
│ │ │ │ ├── Longest_interval_Read-555.png
│ │ │ │ ├── Longest_interval_Read-561.png
│ │ │ │ ├── Longest_interval_Read-745.png
│ │ │ │ └── Longest_interval_Read-902.png
│ │ │ └── S2
│ │ │ ├── Longest_interval_Read-262.png
│ │ │ ├── Longest_interval_Read-417.png
│ │ │ ├── Longest_interval_Read-522.png
│ │ │ ├── Longest_interval_Read-682.png
│ │ │ └── Longest_interval_Read-824.png
│ │ └── Reference
│ │ ├── reference_100_S1
│ │ │ └── 100_fragmentation_distribution.png
│ │ └── reference_100_S2
│ │ └── 100_fragmentation_distribution.png
│ ├── mapq
│ │ ├── Insertions_S1_mapq.txt
│ │ ├── Insertions_S2_mapq.txt
│ │ ├── S1_mapq_plot.png
│ │ └── S2_mapq_plot.png
│ └── multiqc_report.html
└── intermediate
├── blastn
│ ├── Coordinates_100_InsertionMatches_S1.blastn
│ ├── Coordinates_100_InsertionMatches_S2.blastn
│ ├── Filtered_Annotated_100_InsertionMatches_S1.blastn
│ ├── Filtered_Annotated_100_InsertionMatches_S2.blastn
│ ├── Readnames_100_InsertionMatches_S1.txt
│ ├── Readnames_100_InsertionMatches_S2.txt
│ └── ref
│ ├── Filtered_Annotated_100_InsertionMatches_S1.blastn
│ └── Filtered_Annotated_100_InsertionMatches_S2.blastn
├── fasta
│ ├── fragments
│ │ ├── 100_Insertion_fragments.fa
│ │ ├── 100_Insertion_fragments.fa.ndb
│ │ ├── 100_Insertion_fragments.fa.nhr
│ │ ├── 100_Insertion_fragments.fa.nin
│ │ ├── 100_Insertion_fragments.fa.njs
│ │ ├── 100_Insertion_fragments.fa.not
│ │ ├── 100_Insertion_fragments.fa.nsq
│ │ ├── 100_Insertion_fragments.fa.ntf
│ │ ├── 100_Insertion_fragments.fa.nto
│ │ └── Forward_Backward_Insertion.fa
│ ├── Full_S1.fa
│ ├── Full_S2.fa
│ ├── Insertion_S1.fa
│ ├── Insertion_S2.fa
│ ├── Isolated_Reads_S1.fa
│ ├── Isolated_Reads_S2.fa
│ ├── Modified_S1.fa
│ └── Modified_S2.fa
├── functional_genomics
│ ├── Annotation_ucsc_genes_Insertions_S1.bed
│ └── Annotation_ucsc_genes_Insertions_S2.bed
├── localization
│ ├── ExactInsertions_S1.bed
│ ├── ExactInsertions_S2.bed
│ ├── Sorted_InsertionPoints_S1.bed
│ └── Sorted_InsertionPoints_S2.bed
├── log
│ ├── detection
│ │ ├── BAM_to_BED
│ │ │ ├── Postcut_S1.log
│ │ │ ├── Postcut_S2.log
│ │ │ ├── Precut_S1.log
│ │ │ └── Precut_S2.log
│ │ ├── basic_insertion_plots
│ │ │ ├── heat.log
│ │ │ └── length.log
│ │ ├── build_insertion_reference
│ │ │ └── out.log
│ │ ├── calculate_exact_insertion_coordinates
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── clean_postcut_by_maping_quality
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── collect_outputs
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── copy_config_version
│ │ │ └── out.log
│ │ ├── extract_by_length
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── fasta_insertion_reads_cmod
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── find_insertion_BLASTn
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── find_insertion_BLASTn_in_Ref
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── get_coordinates_for_fasta
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── hardcode_blast_header
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── insertion_fragmentation
│ │ │ └── out.log
│ │ ├── insertion_mapping
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── insertion_points
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── insertion_reads_cmod
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── make_blastn_DB
│ │ │ └── out.log
│ │ ├── make_fasta_without_tags
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── minimap_index
│ │ │ └── out.log
│ │ ├── Non_insertion_mapping
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── prepare_insertion
│ │ │ └── out.log
│ │ └── split_fasta_by_borders
│ │ ├── S1.log
│ │ └── S2.log
│ ├── functional_genomics
│ │ ├── annotation_overlap_insertion
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ ├── calc_distance_to_elements
│ │ │ ├── S1.log
│ │ │ └── S2.log
│ │ └── sort_insertion_file
│ │ ├── points_S1.log
│ │ ├── points_S2.log
│ │ ├── S1.log
│ │ └── S2.log
│ └── qc
│ ├── detailed_fragmentation_length_plot
│ │ ├── S1.log
│ │ └── S2.log
│ ├── extract_fastq_insertions
│ │ ├── S1.log
│ │ └── S2.log
│ ├── extract_mapping_quality
│ │ ├── S1.log
│ │ └── S2.log
│ ├── finalize_mapping_quality
│ │ ├── S1.log
│ │ └── S2.log
│ ├── fragmentation_distribution_plots
│ │ ├── fragmentation_match_distribution_S1.log
│ │ ├── fragmentation_match_distribution_S2.log
│ │ ├── fragmentation_read_match_distribution_S1.log
│ │ └── fragmentation_read_match_distribution_S2.log
│ ├── generate_mapq_plot
│ │ ├── S1.log
│ │ └── S2.log
│ ├── multiqc
│ │ └── out.log
│ ├── nanoplot
│ │ ├── S1.log
│ │ └── S2.log
│ └── read_level_fastqc
│ ├── S1.log
│ └── S2.log
├── mapping
│ ├── insertion_ref_genome.fa
│ ├── Isolated_Reads_S1.bam
│ ├── Isolated_Reads_S1.bam.bai
│ ├── Isolated_Reads_S2.bam
│ ├── Isolated_Reads_S2.bam.bai
│ ├── Postcut_S1.bed
│ ├── Postcut_S1_sorted.bam
│ ├── Postcut_S1_sorted.bam.bai
│ ├── Postcut_S1_unfiltered_sorted.bam
│ ├── Postcut_S1_unfiltered_sorted.bam.bai
│ ├── Postcut_S2.bed
│ ├── Postcut_S2_sorted.bam
│ ├── Postcut_S2_sorted.bam.bai
│ ├── Postcut_S2_unfiltered_sorted.bam
│ ├── Postcut_S2_unfiltered_sorted.bam.bai
│ ├── Precut_S1.bed
│ ├── Precut_S1_sorted.bam
│ ├── Precut_S1_sorted.bam.bai
│ ├── Precut_S2.bed
│ ├── Precut_S2_sorted.bam
│ └── Precut_S2_sorted.bam.bai
└── qc
├── fastqc
│ ├── readlevel_S1
│ │ ├── S1_read_Read-343.fastq
│ │ ├── S1_read_Read-343_fastqc.html
│ │ ├── S1_read_Read-343_fastqc.zip
│ │ ├── S1_read_Read-555.fastq
│ │ ├── S1_read_Read-555_fastqc.html
│ │ ├── S1_read_Read-555_fastqc.zip
│ │ ├── S1_read_Read-561.fastq
│ │ ├── S1_read_Read-561_fastqc.html
│ │ ├── S1_read_Read-561_fastqc.zip
│ │ ├── S1_read_Read-745.fastq
│ │ ├── S1_read_Read-745_fastqc.html
│ │ ├── S1_read_Read-745_fastqc.zip
│ │ ├── S1_read_Read-902.fastq
│ │ ├── S1_read_Read-902_fastqc.html
│ │ └── S1_read_Read-902_fastqc.zip
│ ├── readlevel_S2
│ │ ├── S2_read_Read-262.fastq
│ │ ├── S2_read_Read-262_fastqc.html
│ │ ├── S2_read_Read-262_fastqc.zip
│ │ ├── S2_read_Read-417.fastq
│ │ ├── S2_read_Read-417_fastqc.html
│ │ ├── S2_read_Read-417_fastqc.zip
│ │ ├── S2_read_Read-522.fastq
│ │ ├── S2_read_Read-522_fastqc.html
│ │ ├── S2_read_Read-522_fastqc.zip
│ │ ├── S2_read_Read-682.fastq
│ │ ├── S2_read_Read-682_fastqc.html
│ │ ├── S2_read_Read-682_fastqc.zip
│ │ ├── S2_read_Read-824.fastq
│ │ ├── S2_read_Read-824_fastqc.html
│ │ └── S2_read_Read-824_fastqc.zip
│ ├── S1_filtered.fastq
│ └── S2_filtered.fastq
├── multiqc_data
│ ├── fastqc_adapter_content_plot.txt
│ ├── fastqc_overrepresented_sequences_plot.txt
│ ├── fastqc_per_base_n_content_plot.txt
│ ├── fastqc_per_base_sequence_quality_plot.txt
│ ├── fastqc_per_sequence_gc_content_plot_Counts.txt
│ ├── fastqc_per_sequence_gc_content_plot_Percentages.txt
│ ├── fastqc_per_sequence_quality_scores_plot.txt
│ ├── fastqc_sequence_counts_plot.txt
│ ├── fastqc_sequence_duplication_levels_plot.txt
│ ├── fastqc-status-check-heatmap.txt
│ ├── fastqc_top_overrepresented_sequences_table.txt
│ ├── multiqc_citations.txt
│ ├── multiqc_data.json
│ ├── multiqc_fastqc.txt
│ ├── multiqc_general_stats.txt
│ ├── multiqc.log
│ ├── multiqc_nanostat.txt
│ ├── multiqc_software_versions.txt
│ ├── multiqc_sources.txt
│ ├── nanostat_aligned_stats_table.txt
│ └── nanostat_quality_dist.txt
├── multiqc_report.html
└── nanoplot
├── S1
│ ├── AlignedReadlengthvsSequencedReadLength_dot.html
│ ├── AlignedReadlengthvsSequencedReadLength_dot.png
│ ├── AlignedReadlengthvsSequencedReadLength_kde.html
│ ├── AlignedReadlengthvsSequencedReadLength_kde.png
│ ├── NanoPlot_20250317_1203.log
│ ├── NanoPlot-report.html
│ ├── NanoStats.txt
│ ├── Non_weightedHistogramReadlength.html
│ ├── Non_weightedHistogramReadlength.png
│ ├── Non_weightedLogTransformed_HistogramReadlength.html
│ ├── Non_weightedLogTransformed_HistogramReadlength.png
│ ├── PercentIdentityHistogramDynamic_Histogram_percent_identity.html
│ ├── PercentIdentityHistogramDynamic_Histogram_percent_identity.png
│ ├── PercentIdentityvsAlignedReadLength_dot.html
│ ├── PercentIdentityvsAlignedReadLength_dot.png
│ ├── PercentIdentityvsAlignedReadLength_kde.html
│ ├── PercentIdentityvsAlignedReadLength_kde.png
│ ├── WeightedHistogramReadlength.html
│ ├── WeightedHistogramReadlength.png
│ ├── WeightedLogTransformed_HistogramReadlength.html
│ ├── WeightedLogTransformed_HistogramReadlength.png
│ ├── Yield_By_Length.html
│ └── Yield_By_Length.png
└── S2
├── AlignedReadlengthvsSequencedReadLength_dot.html
├── AlignedReadlengthvsSequencedReadLength_dot.png
├── AlignedReadlengthvsSequencedReadLength_kde.html
├── AlignedReadlengthvsSequencedReadLength_kde.png
├── NanoPlot_20250317_1203.log
├── NanoPlot-report.html
├── NanoStats.txt
├── Non_weightedHistogramReadlength.html
├── Non_weightedHistogramReadlength.png
├── Non_weightedLogTransformed_HistogramReadlength.html
├── Non_weightedLogTransformed_HistogramReadlength.png
├── PercentIdentityHistogramDynamic_Histogram_percent_identity.html
├── PercentIdentityHistogramDynamic_Histogram_percent_identity.png
├── WeightedHistogramReadlength.html
├── WeightedHistogramReadlength.png
├── WeightedLogTransformed_HistogramReadlength.html
├── WeightedLogTransformed_HistogramReadlength.png
├── Yield_By_Length.html
└── Yield_By_Length.png
71 directories, 248 files
The sequence-guided detection of insertions is the core of the workflow. In addition to simply identifying the insertions, several other interesting parameters are automatically evaluated during the execution of the pipeline.
File: ../final/localization/ExactInsertions_{sample}.bed
Simulated S1:
chr1 270204 272451 Read-561 [257666, 291832] +
chr1 314899 323644 Read-343 [296872, 297968] +
chr1 432141 440886 Read-902 [428005, 432140] +
!!! warning
The `strand` column in `ExactInsertions_{sample}.bed` refers to the alignment of the read, not the insertion itself.
!!! info
This file is the primary output and shows the positions of the detected insertions, which are dependent on the reference. It resembles the standard [BED6](https://samtools.github.io/hts-specs/BEDv1.pdf) format with the columns: `Chromosome - Start - End - Read - Original Read Start/End - Strand`. Here, the `Original Read Start/End` column replaces the score column and illustrates the mapped position of the insertion-carrying read.
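As a minimal sketch of how one line of this file could be read programmatically, the snippet below parses the six columns described above. It assumes the columns are tab-separated (so the space inside the bracketed fifth column is harmless); the names `ExactInsertion` and `parse_exact_insertion` are hypothetical and not part of the pipeline.

```python
import ast
from typing import NamedTuple


class ExactInsertion(NamedTuple):
    chrom: str
    start: int        # insertion start on the reference
    end: int          # insertion end on the reference
    read: str         # name of the insertion-carrying read
    read_span: tuple  # mapped start/end of the read (fifth column)
    strand: str       # strand of the read alignment, not of the insertion


def parse_exact_insertion(line: str) -> ExactInsertion:
    # Assumption: columns are tab-separated.
    chrom, start, end, read, span, strand = line.rstrip("\n").split("\t")
    # "[257666, 291832]" is a plain list literal, so literal_eval can read it.
    read_start, read_end = ast.literal_eval(span)
    return ExactInsertion(chrom, int(start), int(end), read, (read_start, read_end), strand)


record = parse_exact_insertion("chr1\t270204\t272451\tRead-561\t[257666, 291832]\t+")
print(record.read, record.end - record.start)
```

Keeping the fifth column as a tuple makes it easy to compare the mapped read span with the insertion coordinates later on.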
In addition to the main output, it can be useful to examine the orientation of the insertion and the exact structure of the inserted sequence within the read.
File: ../final/qc/Fragmentation/Longest_Interval/{sample}/Longest_interval_{read}.png
S1 Read-343:
The small numbers displayed above the line represent the matching vector fragments, while the x-axis indicates the actual length in base pairs (bp) of the longest consecutive interval.
The longest detected interval of this read contained all possible 100 bp vector fragments from 0 to 87, with ambiguous 100 bp matches in the region around positions 6/7 and 55/56 of the insertion sequence. This ambiguous region of the insertion corresponds to the long terminal repeats (LTRs) of the vector construct.
!!! info
Since the underlying vector sequence FASTA is in the 5'-3' orientation, and this order is maintained in the longest-matching interval of the fragmented sequence, the insertion and the read share the same `+` orientation.
S2 Read-262:
The small numbers displayed above the line represent the borders of the matching vector fragments, while the x-axis indicates the actual length in base pairs (bp) of the interval.
The longest consecutively detected interval of this read included only a subset of all 100 bp vector fragments, resulting in a shorter insertion of approximately 2500 bp. Additionally, the fragment numbers appear to be detected in descending order.
!!! info
Since the insertion sequence FASTA is oriented in the 5'-3' direction, and this order is **not** preserved in the longest-matching interval of the fragmented sequence, the insertion in the read has a `-` orientation. This indicates that the vector sequence is located in the `-` orientation on a `+` directional read.
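The reasoning in the two examples above can be sketched in a few lines: if the fragment numbers mostly increase along the read, the insertion is in `+` orientation, and if they mostly decrease, it is in `-` orientation. This is an illustration only, not the pipeline's own implementation; the function name and the majority-vote rule (chosen so that a few ambiguous LTR matches do not flip the result) are assumptions.

```python
def insertion_orientation(fragment_ids):
    """Return '+' if fragment numbers mostly rise along the read, '-' if they mostly fall."""
    steps = [b - a for a, b in zip(fragment_ids, fragment_ids[1:])]
    rising = sum(1 for s in steps if s > 0)
    falling = sum(1 for s in steps if s < 0)
    if rising == falling:
        return None  # undetermined: too few fragments or no clear trend
    return "+" if rising > falling else "-"


# Hypothetical fragment orders, listed in the order they appear along the read.
print(insertion_orientation([0, 1, 2, 3, 4]))   # ascending order
print(insertion_orientation([30, 29, 28, 27]))  # descending order
```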
The workflow automatically assesses the quality of the input sequencing data, the alignments performed with and without fragmentation, and the fragmentation itself. This allows not only for detecting insertions but also for evaluating the likelihood of true positives and the overall effectiveness of the search strategy employed by the pipeline.
The pipeline integrates basic quality assessment tools from widely established resources, including FastQC, MultiQC, and NanoPlot. An overview of the results can be accessed via Snakemake's workflow report, which is generated using snakemake --report or directly in the output directory.
File: ../final/qc/multiqc_report.html
!!! info
The pipeline runs FastQC separately on the FASTQ of each read with a detected insertion.
!!! Hint "Further Details"
For detailed explanations of the plots provided in the report, consult the documentation of each quality control tool. To access the individual quality control results, navigate to the following directories within the output folder:
fastqc: `../intermediate/qc/fastqc/`<br>
multiqc: `../intermediate/qc/multiqc_data/`<br>
nanoplot: `../intermediate/qc/nanoplot/`<br>
The pipeline incorporates two mapping steps to improve the quality of mapping by modifying reads that contain insertions. These steps are essential for accurately localizing the insertions, making it crucial to track the mapping quality of the affected reads at each key alignment stage.
File: ../final/qc/mapq/Insertions_{sample}_mapq.txt
S1:
| Read | PrecutChr | PrecutMAPQ | PostcutChr | PostcutMAPQ | FilteredChr | FilteredMAPQ |
|---|---|---|---|---|---|---|
| Read-343 | pSLCAR-CD19-CD3z | 60 | chr1 | 44 | chr1 | 44.0 |
| Read-555 | pSLCAR-CD19-CD3z | 60 | * | 0 | | |
| Read-561 | chr1 | 60 | chr1 | 60 | chr1 | 60.0 |
| Read-745 | pSLCAR-CD19-CD3z | 60 | * | 0 | | |
| Read-902 | pSLCAR-CD19-CD3z | 60 | chr1 | 60 | chr1 | 60.0 |
The table illustrates changes in mapping quality and chromosome alignment for each read with an insertion across three stages: Precut mapping before any modifications, Postcut mapping after the reads were modified, and Filtered mapping after filtering based on mapping quality.
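To make the three stages easier to compare, the following sketch classifies each read by what happened between the Precut and Filtered mapping. It assumes the rows of `Insertions_{sample}_mapq.txt` are whitespace-separated and that reads without a Filtered entry simply have fewer fields; the function name and the category labels are hypothetical.

```python
def classify_read(fields):
    """Classify one whitespace-split row of the mapping quality table."""
    # Pad short rows: reads without a Filtered entry have only five fields.
    read, pre_chr, _, _, _, filt_chr, _ = (fields + [None] * 7)[:7]
    if filt_chr is None:
        return read, "lost after modification or filtering"
    if pre_chr != filt_chr:
        return read, f"relocated from {pre_chr} to {filt_chr}"
    return read, "unchanged"


rows = [
    "Read-343 pSLCAR-CD19-CD3z 60 chr1 44 chr1 44.0",
    "Read-555 pSLCAR-CD19-CD3z 60 * 0",
    "Read-561 chr1 60 chr1 60 chr1 60.0",
]
for row in rows:
    print(*classify_read(row.split()))
```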
!!! info
During the initial mapping of the unaltered reads, four of the five reads containing detected insertions predominantly aligned with high quality to the vector reference. After the modification (`Buffer`), in which every base of the insertion was replaced with `N`, two of these four reads mapped to a region in the reference genome, while the other two became unmappable.
!!! info
The scores from the table are automatically visualized in the plot. However, due to overlapping quality scores, some reads may be obscured by others with identical values. In the example data, this occurs with `Read-902` and `Read-561`, as well as for `Read-555` and `Read-745`.
**S1:**

The fragmentation process is a crucial step not only for detecting insertions but also for gaining a detailed understanding of the exact composition and orientation of the inserted sequence. Some aspects of fragmentation quality control align closely with the analysis of the orientation and structure of the detected insertions.
However, the analysis of the previously mentioned output files overlooks another critical factor: the existence of fragments with significant sequence similarity to other "normal" sequences in the reference FASTA.
The pipeline includes functionality to perform a BLASTN search of the fragmented insertion sequence against a pre-built version of your reference's BLAST database. To enable this feature, simply specify the blastn_db argument in the config.yml.
!!! Danger
The potential similarity of the insertion sequence to other sequences in your reference is particularly important when using the pipeline in conjunction with complex vector expression systems. For example, CAR T cell vector constructs (like our example vector [construct](../other/other_simulation.md/#insertion-sequence)) often insert sequences partially derived from human genes.
As this option is not configured for the tutorial, we can instead rely on two other automatically generated plots to gain insights into potential false-positive matches for the insertion sequence.
Directory: ../final/qc/Fragmentation/Insertions/insertions_{fragmentsize}_{sample}/
These two plots illustrate the distributions of all insertion fragments (left) and the number of fragment matches "contributed" by each read (right).
!!! info
The `Combined distribution of all 100 bp fragments` plot reveals that every fragment of the vector is represented at least four times. However, fragments `6`,`7`,`55`, and `56` are noticeably overrepresented in the reads. As mentioned in the [orientation and structure](#orientation-and-structure) section, these fragments correspond to the vector's LTRs, making their alignment ambiguous. The slight plateau observed between fragments `57` and `78` is better understood in conjunction with the second plot.
The `Contribution of reads to the total count of 100 bp fragments` plot clarifies this plateau by showing the read-specific contributions of fragments. Four reads contribute the maximum number of vector fragments, whereas `Read-561` includes only about 21 vector fragments. This leads to the slight overrepresentation of fragments `57` to `78` in the `Combined distribution of all 100 bp fragments` plot.
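The two distributions can be thought of as two different tallies over the same list of fragment matches. The sketch below shows this idea under the assumption that the matches are available as `(read_name, fragment_id)` pairs; the function name and the toy data are made up for illustration and do not come from the tutorial output.

```python
from collections import Counter


def fragment_distributions(matches):
    """matches: iterable of (read_name, fragment_id) pairs, one per fragment match."""
    matches = list(matches)
    combined = Counter(fragment for _, fragment in matches)  # matches per fragment
    per_read = Counter(read for read, _ in matches)          # matches per read
    return combined, per_read


# Toy data: Read-A carries fragments 0-2, Read-B only fragments 1-2.
toy = [("Read-A", 0), ("Read-A", 1), ("Read-A", 2), ("Read-B", 1), ("Read-B", 2)]
combined, per_read = fragment_distributions(toy)
print(dict(combined))
print(dict(per_read))
```

A read that contributes only part of the fragments, like `Read-B` here, raises the combined count of exactly those fragments, which is the kind of plateau described above.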
!!! Attention
Observations like these can help to determine the most accurate `MinInsertionLength` threshold in the `config.yml`.
!!! Hint "Further Details"
For the example data, we selected only a very small portion of the reference genome to generate reads. This is why there are no additional "off-target" fragment matches within our reads. Since the vector construct contains several human-derived components in its architecture, a real sequencing dataset would likely result in a more complex barplot.
As mentioned before, the safest way to identify potential misleading fragment matches in advance is to provide a human `BLASTN` reference database to the pipeline. The vector fragments are then automatically aligned against this reference, and the resulting plots offer an overview of the vector regions that are highly likely to appear, even in the absence of an actual insertion.
<details><summary> S1 Barplots when provided a `BLASTN` reference: </summary>

The bar plots now illustrate which vector fragments are likely to produce false positives. When comparing these fragments with the structure of the [construct](../other/other_simulation.md/#insertion-sequence), you can identify three main regions of fragment matches: fragments `22`–`25` correspond to the EF-1a core promoter, fragments `35`–`39` align with CD28 and CD247, and fragments `56`–`57` represent the 3'LTR. These are all (to some extent) human components in the vector architecture that we can also anticipate detecting with the pipeline by using the vector genome as the target sequence.
</details>
Typically, identifying the genomic localization of an insertion is just the starting point. A basic yet essential functionality for annotating the detected insertion sites is included in the pipeline through the functional_genomics.smk rule collection. The pipeline can work with different user-defined BED annotation files, which are provided in the config.yml as annotate_{key} entries.
For the tutorial, we have defined only one annotation file in the config.yml, which simply contains the known genes located in our specified reference FASTA. For details on generating this file, refer to this. The pipeline compares the locations of the insertions with the entries in the provided annotation file and reports the closest match of each insertion with each annotation, thus producing the file below.
File: ../final/functional_genomics/Functional_distances_to_Insertions_{sample}.bed
S1:
| InsertionChromosome | InsertionStart | InsertionEnd | InsertionRead | InsertionOrig | InsertionStrand | AnnotationChromosome | AnnotationStart | AnnotationEnd | AnnotationID | AnnotationScore | AnnotationStrand | AnnotationSource | Distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| chr1 | 270204 | 270205 | Read-561 | [257666, 291832] | + | chr1 | 266854 | 268655 | ENSG00000286448 | . | + | annotate_ucsc_genes | -1550 |
| chr1 | 314899 | 314900 | Read-343 | [296872, 297968] | + | chr1 | 360056 | 366052 | ENSG00000236601 | . | + | annotate_ucsc_genes | 45157 |
| chr1 | 432141 | 432142 | Read-902 | [428005, 432140] | + | chr1 | 450739 | 451678 | OR4F29 | . | - | annotate_ucsc_genes | 18598 |
!!! info The header above was only added to make the interpretation of the output easier. Your own output will be without the column names.
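The `Distance` column can be reproduced from the coordinates in the table with a short calculation. The sketch below is only an illustration that matches the three rows shown above; it assumes half-open BED intervals, counts directly adjacent intervals as distance 1, and uses a negative sign when the annotation lies at lower coordinates than the insertion. Which tool and options the pipeline uses internally to compute this column is not confirmed here, and `signed_distance` is a hypothetical name.

```python
def signed_distance(ins_start, ins_end, ann_start, ann_end):
    """Distance between two half-open BED intervals; 0 if they overlap."""
    if ann_end <= ins_start:   # annotation lies before the insertion
        return -(ins_start - ann_end + 1)
    if ann_start >= ins_end:   # annotation lies after the insertion
        return ann_start - ins_end + 1
    return 0                   # intervals overlap


rows = [
    (270204, 270205, 266854, 268655),  # Read-561
    (314899, 314900, 360056, 366052),  # Read-343
    (432141, 432142, 450739, 451678),  # Read-902
]
for row in rows:
    print(signed_distance(*row))
```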
A good first step toward customizing the pipeline for your own research question is to add a rule that visualizes this table. Check out the advanced usage for more on this.
!!! Hint "Further Details"
The reads for this tutorial are artificially generated based on the first `50kb` of sequence from human chromosome 1. The regions at the beginning of chromosomes (near the centromeres and telomeres) are typically less gene-dense compared to the more gene-rich areas toward the middle of the chromosomes. This relative scarcity of coding genes also makes these regions less accessible for the integration of lentiviral-based vector systems, thus reducing the biological plausibility of our simulated data.
The workflow generates numerous additional files beyond those previously listed. Most of these files are quite easy to understand once you are familiar with the pipeline's functionality. They are typically not essential for most use cases unless debugging is required or you integrate custom downstream rules into the analysis.
Directory: ../intermediate/
!!! info Here is a list of each subdirectory and a description of what to find in them:
**`blastn/`**
- `Filtered_Annotated_{fragmentsize}_InsertionMatches_{sample}.blastn`: Results from the BLASTn searches after filtering
- `Coordinates_{fragmentsize}_InsertionMatches_{sample}.blastn`: Dictionary of the identified FASTA coordinates based on insertions in the reads
- `Readnames_{fragmentsize}_InsertionMatches_{sample}.txt`: Names of insertion-carrying reads
- **`ref/`**: `BLASTN` matches of vector fragments with provided ref blastdb (empty files if no `blast_db` provided)
**`fasta/`**
- **`fragments/`**: Constructed `BLASTN` database based on the query insertion
- `Modified_{sample}.fa`: Modified FASTA file of input BAM (read modification dependent on `Buffer`, `Split`, or `Join`)
- `Full_{sample}.fa`: Unmodified FASTA file of input BAM
- `Insertion_{sample}.fa`: Detected insertion sequences extracted from the reads
- `Isolated_Reads_{sample}.fa`: Isolated reads with insertions
**`functional_genomics/`**
- `Annotation_ucsc_genes_Insertions_{sample}.bed`: Insertions with exact annotation matches (distance=0). Based on `bedtools intersect`.
**`localization/`**
- `ExactInsertions_{sample}.bed`: File as in final output
- `Sorted_InsertionPoints_{sample}.bed`: Exact points of insertion (`stop = start + 1`)
**`log/`**
- See [Error handling](../other.md/#log-files)
**`mapping/`**
- `insertion_ref_genome.fa`: Genome used for mapping. Consists of the user-defined reference genome and the insertion reference sequence.
- `Isolated_Reads_{sample}.bam`: Isolated reads with insertions
- `Precut_{sample}_sorted.bam`: Unmodified reads after reference mapping
- `Postcut_{sample}_unfiltered_sorted.bam`: (Modified) reads after reference mapping
- `Postcut_{sample}_sorted.bam`: (Modified) Reads passing the quality filter after reference mapping
- `Postcut_{sample}.bed`: Genomic locations of aligned reads
**`qc/`**
- **`fastqc/`**: Fastqc input and raw output
- **`multiqc_data/`**: Multiqc raw output
- **`nanoplot/`**: Nanoplot raw output
- `multiqc_report.html`: Report as in final output




