Skip to content

Softwares used in the fada module

bedtools_intersect_cnvkit

Exports CNVs that include the majority of the target gene from a VCF based on a design BED file, effectively filtering CNVs by genomic intervals of interest.

Snakemake Rule

rule bedtools_intersect_cnvkit:
    input:
        left="cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.CNVS.vcf.gz",
        right=config.get("reference", {}).get("design_genes_bed", ""),
    output:
        vcf="cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.gene.CNVS.vcf",
    params:
        extra=f"-F 0.6 -u -header {config.get('bedtools_intersect_cnvkit', {}).get('extra', '')}",
    log:
        "cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.gene.CNVS.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.gene.CNVS.vcf.benchmark.tsv",
            config.get("bedtools_intersect_cnvkit", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("bedtools_intersect_cnvkit", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("bedtools_intersect_cnvkit", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("bedtools_intersect_cnvkit", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("bedtools_intersect_cnvkit", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("bedtools_intersect_cnvkit", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("bedtools_intersect_cnvkit", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("bedtools_intersect_cnvkit", {}).get("container", config["default_container"])
    message:
        "{rule}: export CNVS that include the majority of the target gene from {input.left} based on {input.right}"
    wrapper:
        "v1.32.0/bio/bedtools/intersect"

↔ input / output files

Rule parameters Key Value Description
input left "cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.CNVS.vcf.gz" CNVkit vcf with all CNVS
_ _ right config.get("reference", {}).get("design_genes_bed", "") bed file with target genes
output vcf "cnv_sv/cnvkit_vcf/{sample}_{type}.annotate_cnv.refseq_genes.bcftools_view.gene.CNVS.vcf" CNVkit vcf with only CNVS that include the whole target gene

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

cnvkit_segment

Segments copy number ratio data (log2 bins) using CNVkit, incorporating SNP information for more accurate segmentation and germline HMM modeling.

Snakemake Rule

rule cnvkit_segment:
    input:
        log2_bins="cnv_sv/cnvkit_batch/{sample}/{sample}_{type}.cnr",
        snps="snv_indels/deepvariant/{sample}_{type}.fix_af.bcftools_view.SNPS.vcf.gz",
    output:
        segment=temp("cnv_sv/cnvkit_segment/{sample}_{type}.cns"),
    params:
        extra=config.get("cnvkit_segment", {}).get("extra", ""),
        method=config.get("cnvkit_segment", {}).get("method", "hmm-germline"),
    log:
        "cnv_sv/cnvkit_segment/{sample}_{type}.segmetrics.cns.log",
    benchmark:
        repeat(
            "cnv_sv/cnvkit_segment/{sample}_{type}.segmetrics.cns.benchmark.tsv",
            config.get("cnvkit_segment", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("cnvkit_segment", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("cnvkit_segment", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("cnvkit_segment", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("cnvkit_segment", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("cnvkit_segment", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("cnvkit_segment", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("cnvkit_segment", {}).get("container", config["default_container"])
    message:
        "{rule}: segment {input.log2_bins} with cnvkit using SNPs from {input.snps}"
    shell:
        "cnvkit.py segment "
        "-v {input.snps} "
        "-m {params.method} "
        "-o {output.segment} "
        "{params.extra} {input.log2_bins}  > {log} 2>&1 "

↔ input / output files

Rule parameters Key Value Description
input log2_bins "cnv_sv/cnvkit_batch/{sample}/{sample}_{type}.cnr" Bin-level log2 ratios file from cnvkit
_ _ snps "snv_indels/deepvariant/{sample}_{type}.fix_af.bcftools_view.SNPS.vcf.gz" SNPs vcf file to use for segmentation
output segment "cnv_sv/cnvkit_segment/{sample}_{type}.cns" Segmented log2 ratios file from cnvkit

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

picard_bed_to_interval_list

Converts a genomic design BED file into a Picard-style interval list, using the reference FASTA and sequence dictionary for header information.

Snakemake Rule

rule picard_bed_to_interval_list:
    input:
        bed=config.get("reference", {}).get("design_bed", ""),
        reference=config["reference"]["fasta"],
        dict="qc/picard_create_sequence_dictionary/reference.dict",
    output:
        interval_list=temp("qc/picard_bed_to_interval_list/targets.interval_list"),
    params:
        extra=config.get("picard_bed_to_interval_list", {}).get("extra", ""),
    log:
        "qc/picard_bed_to_interval_list/targets.interval_list.log",
    benchmark:
        repeat(
            "qc/picard_bed_to_interval_list/targets.interval_list.benchmark.tsv",
            config.get("picard_bed_to_interval_list", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("picard_bed_to_interval_list", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("picard_bed_to_interval_list", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("picard_bed_to_interval_list", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("picard_bed_to_interval_list", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("picard_bed_to_interval_list", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("picard_bed_to_interval_list", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("picard_bed_to_interval_list", {}).get("container", config["default_container"])
    message:
        "{rule}: create interval_list file from {input.bed}"
    wrapper:
        "0.79.0/bio/picard/bedtointervallist"

↔ input / output files

Rule parameters Key Value Description
input bed config.get("reference", {}).get("design_bed", "") design bed file (aka. target bed file)
reference config["reference"]["fasta"] reference genome fasta file
_ _ dict "qc/picard_create_sequence_dictionary/reference.dict" sequence dictionary file for the reference genome
output interval_list "qc/picard_bed_to_interval_list/targets.interval_list" picard interval_list file for design regions

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

picard_create_sequence_dictionary

Generates a sequence dictionary (SAM/BAM header format) from a reference FASTA file, required by various Picard and GATK tools.

Snakemake Rule

rule picard_create_sequence_dictionary:
    input:
        fasta=config["reference"]["fasta"],
    output:
        dict_file=temp("qc/picard_create_sequence_dictionary/reference.dict"),
    params:
        extra=config.get("picard_create_sequence_dictionary", {}).get("extra", ""),
    log:
        "qc/picard_create_sequence_dictionary/reference.dict.log",
    benchmark:
        repeat(
            "qc/picard_create_sequence_dictionary/reference.dict.benchmark.tsv",
            config.get("picard_create_sequence_dictionary", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("picard_create_sequence_dictionary", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("picard_create_sequence_dictionary", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("picard_create_sequence_dictionary", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("picard_create_sequence_dictionary", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("picard_create_sequence_dictionary", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("picard_create_sequence_dictionary", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("picard_create_sequence_dictionary", {}).get("container", config["default_container"])
    message:
        "{rule}: create a sequence dictionary file from {input}"
    wrapper:
        "0.79.0/bio/picard/createsequencedictionary"

↔ input / output files

Rule parameters Key Value Description
input fasta config["reference"]["fasta"] reference genome fasta file
output dict_file "qc/picard_create_sequence_dictionary/reference.dict" sequence dictionary file for the reference genome

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

export_qc_bedtools_intersect

Intersects mosdepth per-base coverage data with an exon BED file to identify and export low-coverage regions across targeted exons.

Snakemake Rule

rule export_qc_bedtools_intersect:
    input:
        left="qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz",
        coverage_csi="qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz.csi",
        right=config["reference"]["exon_bed"],
    output:
        results=temp("qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.bed"),
    params:
        extra=config.get("export_qc_bedtools_intersect", {}).get("extra", ""),
    log:
        "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.log",
    benchmark:
        repeat(
            "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.benchmark.tsv",
            config.get("export_qc_bedtools_intersect", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_bedtools_intersect", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_bedtools_intersect", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("export_qc_bedtools_intersect", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_bedtools_intersect", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_bedtools_intersect", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_bedtools_intersect", {}).get("container", config["default_container"])
    message:
        "{rule}: export low cov regions from {input.left} based on {input.right}"
    wrapper:
        "v1.32.0/bio/bedtools/intersect"

↔ input / output files

Rule parameters Key Value Description
input left "qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz" per-base coverage file from mosdepth
coverage_csi "qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz.csi" index file for per-base.bed.gz file
_ _ right config["reference"]["exon_bed"] design bed used to only look at coverage based on bedfile
output results "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.bed" bed file with coverage per base for design file

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

export_qc_bedtools_intersect_pgrs

Similar to the above, but specifically intersects per-base coverage with a PGRS (polyguanosine-rich sequences) BED file to assess coverage in complex regions.

Snakemake Rule

rule export_qc_bedtools_intersect_pgrs:
    input:
        left="qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz",
        coverage_csi="qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz.csi",
        right=config["reference"]["pgrs_bed"],
    output:
        results=temp("qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.bed"),
    params:
        extra=config.get("export_qc_bedtools_intersect_pgrs", {}).get("extra", ""),
    log:
        "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.log",
    benchmark:
        repeat(
            "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.benchmark.tsv",
            config.get("export_qc_bedtools_intersect_pgrs", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_bedtools_intersect_pgrs", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_bedtools_intersect_pgrs", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_bedtools_intersect_pgrs", {}).get(
            "mem_per_cpu", config["default_resources"]["mem_per_cpu"]
        ),
        partition=config.get("export_qc_bedtools_intersect_pgrs", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_bedtools_intersect_pgrs", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_bedtools_intersect_pgrs", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_bedtools_intersect_pgrs", {}).get("container", config["default_container"])
    message:
        "{rule}: export low cov regions from {input.left} based on {input.right}"
    wrapper:
        "v1.32.0/bio/bedtools/intersect"

↔ input / output files

Rule parameters Key Value Description
input left "qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz" per-base coverage file from mosdepth
coverage_csi "qc/mosdepth_bed_exon/{sample}_{type}.per-base.bed.gz.csi" index file for per-base.bed.gz file
_ _ right config["reference"]["pgrs_bed"] design bed used to only look at coverage based on bedfile, in this case pgrs positions
output results "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.bed" bed file with coverage per base for design file

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

export_qc_xlsx_tc_report

A comprehensive Python script that aggregates various QC metrics—including mosdepth summaries, coverage thresholds, and read statistics—into a single Excel (XLSX) report for Twist Cancer designs.

Snakemake Rule

rule export_qc_xlsx_tc_report:
    input:
        mosdepth_summary="qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.summary.txt",
        mosdepth_thresholds="qc/mosdepth_bed_exon/{sample}_{type}.thresholds.bed.gz",
        mosdepth_regions="qc/mosdepth_bed_exon/{sample}_{type}.regions.bed.gz",
        mosdepth_perbase="qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.bed",
        pgrs_coverage="qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.bed",
        design_bed=config["reference"]["exon_bed"],
        pgrs_bed=config["reference"]["pgrs_bed"],
        reads_summary="qc/extract_read_metrics/{sample}_{type}.summary.tsv",
        wanted_transcripts=config["reference"]["wanted_transcripts"],
    output:
        results=temp("qc/xlsx_report/{sample}_{type}.xlsx"),
    params:
        extra=config.get("export_qc_xlsx_report", {}).get("extra", ""),
        coverage_thresholds=config["mosdepth_bed"]["thresholds"],
        sequenceid=config["sequenceid"],
    log:
        "qc/xlsx_report/{sample}_{type}.xlsx.log",
    benchmark:
        repeat(
            "qc/xlsx_report/{sample}_{type}.xlsx.benchmark.tsv",
            config.get("export_qc_xlsx_report", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("export_qc_xlsx_report", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("export_qc_xlsx_report", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("export_qc_xlsx_report", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("export_qc_xlsx_report", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("export_qc_xlsx_report", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("export_qc_xlsx_report", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("export_qc_xlsx_report", {}).get("container", config["default_container"])
    message:
        "{rule}: collecting qc values into {output} for the twist cancer design"
    script:
        "../scripts/export_qc_xlsx_tc_report.py"

↔ input / output files

Rule parameters Key Value Description
input mosdepth_summary "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.summary.txt" mosdepth bed summary file
mosdepth_thresholds "qc/mosdepth_bed_exon/{sample}_{type}.thresholds.bed.gz" Mosdepth bed thresholds file
mosdepth_regions "qc/mosdepth_bed_exon/{sample}_{type}.regions.bed.gz" mosdepth bed coverage per region file
mosdepth_perbase "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.per-base.exon_bed.bed" mosdepth bed per-base result file subsampled into exons in export_qc_bedtools_intersect output
pgrs_coverage "qc/mosdepth_bed_exon/{sample}_{type}.mosdepth.pgrs_cov.bed" mosdepth per-base file from export_qc_bedtools_intersect_pgrs output
design_bed config["reference"]["exon_bed"] design bed defined in config-file
pgrs_bed config["reference"]["pgrs_bed"] bedfile with PGRS score SNPs
reads_summary "qc/extract_read_metrics/{sample}_{type}.summary.tsv" reads summary stats with duplication rate (generated by calculate_read_metrics.py)
_ _ wanted_transcripts config["reference"]["wanted_transcripts"] path to txt-file in bedformat of transcripts of interest
output results "qc/xlsx_report/{sample}_{type}.xlsx" .xlsx file with summarized QC-values per sample

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

extract_read_metrics

Calculates and extracts summary statistics from BAM files, including read lengths, qualities, and duplicate counts to assess library quality.

Snakemake Rule

rule extract_read_metrics:
    input:
        bam="prealignment/pbmarkdup/{sample}_{type}.bam",
    output:
        reads_metrics=temp("qc/extract_read_metrics/{sample}_{type}.metrics.tsv"),
        reads_summary=temp("qc/extract_read_metrics/{sample}_{type}.summary.tsv"),
    params:
        extra=config.get("extract_read_metrics", {}).get("extra", ""),
    log:
        "qc/extract_read_metrics/{sample}_{type}.metrics.log",
    benchmark:
        repeat(
            "extract_read_metrics/{sample}_{type}.metrics.benchmark.tsv",
            config.get("extract_read_metrics", {}).get("benchmark_repeats", 1),
        )
    threads: config.get("extract_read_metrics", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("extract_read_metrics", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("extract_read_metrics", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("extract_read_metrics", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("extract_read_metrics", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("extract_read_metrics", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("extract_read_metrics", {}).get("container", config["default_container"])
    message:
        "{rule}: extract read metrics from {input.bam}"
    script:
        "../scripts/calculate_read_metrics.py"

↔ input / output files

Rule parameters Key Value Description
input bam "prealignment/pbmarkdup/{sample}_{type}.bam" unmapped bam file with duplicates marked by pbmardup
output reads_metrics "qc/extract_read_metrics/{sample}_{type}.metrics.tsv" tab-delimited file with summary stats for each read
_ _ reads_summary "qc/extract_read_metrics/{sample}_{type}.summary.tsv" tab-delimited file with summary stats over all the reads in reads_metrics

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

strdust

Calls Short Tandem Repeats (STRs) from BAM files using the STRdust tool, guided by a reference FASTA and a catalog of repeat loci.

Snakemake Rule

rule strdust:
    input:
        bam="alignment/minimap2_align/{sample}_{type}.bam",
        bai="alignment/minimap2_align/{sample}_{type}.bam.bai",
        fasta=config.get("reference", {}).get("fasta", ""),
        repeats=config.get("reference", {}).get("str_bed", ""),
    output:
        vcf="cnv_sv/strdust/{sample}_{type}.vcf",
    params:
        extra=config.get("strdust", {}).get("extra", ""),
        sample=lambda wildcards: f"{wildcards.sample}_{wildcards.type}",
    log:
        "cnv_sv/strdust/{sample}_{type}.output.log",
    benchmark:
        repeat("cnv_sv/strdust/{sample}_{type}.output.benchmark.tsv", config.get("strdust", {}).get("benchmark_repeats", 1))
    threads: config.get("strdust", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("strdust", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("strdust", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("strdust", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("strdust", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("strdust", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("strdust", {}).get("container", config["default_container"])
    message:
        "{rule}: Call STRs in {input.bam} using STRdust"
    shell:
        "STRdust "
        "{params.extra} "
        "-t {threads} "
        "-R {input.repeats} "
        "--sample {params.sample} "
        "{input.fasta} "
        "{input.bam} > {output.vcf} 2> {log}"

↔ input / output files

Rule parameters Key Value Description
input bam "alignment/minimap2_align/{sample}_{type}.bam" bam file with alignend long-reads
bai "alignment/minimap2_align/{sample}_{type}.bam.bai" index for the bam file
fasta config.get("reference", {}).get("fasta", "") reference genome fasta file
_ _ repeats config.get("reference", {}).get("str_bed", "") bed file of str regions for strdust
output vcf "cnv_sv/strdust/{sample}_{type}.vcf" vcf file with str calls from strdust

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

strkit_call

An alternative STR caller that performs locus-specific genotyping from long-read BAM files using the strkit tool.

Snakemake Rule

rule strkit_call:
    input:
        bam="alignment/minimap2_align/{sample}_{type}.bam",
        bai="alignment/minimap2_align/{sample}_{type}.bam.bai",
        fasta=config.get("reference", {}).get("fasta", ""),
        repeats=config.get("strkit_call", {}).get("bed", ""),
    output:
        vcf="cnv_sv/strkit_call/{sample}_{type}.vcf",
    params:
        extra=config.get("strkit_call", {}).get("extra", ""),
        sample_id=lambda wildcards: f"{wildcards.sample}_{wildcards.type}",
    log:
        "cnv_sv/strkit_call/{sample}_{type}.vcf.log",
    benchmark:
        repeat(
            "cnv_sv/strkit_call/{sample}_{type}.output.benchmark.tsv", config.get("strkit_call", {}).get("benchmark_repeats", 1)
        )
    threads: config.get("strkit_call", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("strkit_call", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("strkit_call", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("strkit_call", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("strkit_call", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("strkit_call", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("strkit_call", {}).get("container", config["default_container"])
    message:
        "{rule}: Call STRs in {input.bam} using strkit"
    shell:
        "strkit call "
        "{input.bam} "
        "--ref {input.fasta} "
        "--sample-id {params.sample_id} "
        "--loci {input.repeats} "
        "--vcf {output.vcf} "
        "--processes {threads} &> {log}"

↔ input / output files

Rule parameters Key Value Description
input bam "alignment/minimap2_align/{sample}_{type}.bam" bam file with alignend long-reads
bai "alignment/minimap2_align/{sample}_{type}.bam.bai" index for the bam file
fasta config.get("reference", {}).get("fasta", "") reference genome fasta file
_ _ repeats config.get("strkit_call", {}).get("bed", "") bed file of str regions for strkit
output vcf "cnv_sv/strkit_call/{sample}_{type}.vcf" vcf file with str calls from strkit

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time

add_vcf_ref

A utility script that adds reference genome information to the header of a VCF file, ensuring compatibility with downstream tools that require the reference path.

Snakemake Rule

rule add_vcf_ref:
    input:
        vcf="{file}.vcf.gz",
        ref=config["reference"]["fasta"],
    output:
        vcf="{file}_GRCh38.vcf",
    params:
        extra=config.get("add_vcf_ref", {}).get("extra", ""),
    log:
        "fada/add_vcf_ref/{file}_GRCh38.vcf.gz.log",
    benchmark:
        repeat("fada/add_vcf_ref/{file}_GRCh38.vcf.gz.benchmark.tsv", config.get("add_vcf_ref", {}).get("benchmark_repeats", 1))
    threads: config.get("add_vcf_ref", {}).get("threads", config["default_resources"]["threads"])
    resources:
        mem_mb=config.get("add_vcf_ref", {}).get("mem_mb", config["default_resources"]["mem_mb"]),
        mem_per_cpu=config.get("add_vcf_ref", {}).get("mem_per_cpu", config["default_resources"]["mem_per_cpu"]),
        partition=config.get("add_vcf_ref", {}).get("partition", config["default_resources"]["partition"]),
        threads=config.get("add_vcf_ref", {}).get("threads", config["default_resources"]["threads"]),
        time=config.get("add_vcf_ref", {}).get("time", config["default_resources"]["time"]),
    container:
        config.get("add_vcf_ref", {}).get("container", config["default_container"])
    message:
        "{rule}: Add reference to the header of the {input.vcf}"
    script:
        "../scripts/add_ref.py"

↔ input / output files

Rule parameters Key Value Description
input vcf "{file}.vcf.gz" Input vcf file
_ _ ref config["reference"]["fasta"] reference fasta file path
output vcf "{file}_GRCh38.vcf" vcf with reference genome path in vcf-header

🔧 Configuration

Software settings (config.yaml)

Key Type Description
benchmark_repeats integer set number of times benchmark should be repeated
container string name or path to docker/singularity container
extra string parameters that should be forwarded

Resources settings (resources.yaml)

Key Type Description
mem_mb integer max memory in MB to be available
mem_per_cpu integer memory in MB used per cpu
partition string partition to use on cluster
threads integer number of threads to be available
time string max execution time