生信分析进阶5 - 全外显子组变异检测和ANNOVAR注释Snakemake分析流程

news2024/11/23 12:36:26

基于yaml或ini配置文件,配置文件包含例如样本名称、参考基因组版本、exon capture bed文件路径、参考基因组路径和ANNOVAR注释文件等信息。

基于该流程可以实现全外显测序的fastq文件输入到得到最终变异VCF文件。

1. Snakemake分析流程基础软件安装

# conda安装
conda install -c bioconda snakemake -y
conda install -c bioconda fastqc -y
conda install -c bioconda trim-galore -y
conda install -c bioconda bwa -y
conda install -c bioconda samtools -y

# GATK 4.2.1版本
conda install -c bioconda gatk -y

ANNOVAR注释软件安装和使用参考文章:
https://www.jianshu.com/p/461f2cd47564

ANNOVAR注释全外显子有些坑和经常会报错的问题,
本人后续有空会写下对全外显子注释的ANNOVAR软件安装和使用

2. AgilentSureSelect 外显子序列捕获流程图

外显子捕获基本原理参考下图。
AgilentSureSelect

3. GATK SNP和INDEL分析基本流程

GATK

4. Snakemake全外显子分析流程

其中regions_bed配置为外显子捕获所采用试剂盒对应的bed文件路径。

##################### parser configfile #########################
import os
configfile: "config.yaml"
# configfile: "config.ini"

sample = config["sample"]
genome = config["genome"]
#################################################################
PROJECT_DATA_DIR = config["project_data_dir"]
PROJECT_RUN_DIR = config["project_run_dir"]

os.system("mkdir -p {0}".format(PROJECT_RUN_DIR))

######################### Output #################################
ALL_FASTQC_RAW_ZIP1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip"
ALL_FASTQC_RAW_HTML1 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html"
ALL_FASTQC_RAW_ZIP2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip"
ALL_FASTQC_RAW_HTML2 = PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"

ALL_CELAN_FASTQ1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"
ALL_CELAN_FASTQ2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"

ALL_SORTED_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
ALL_SORTED_BAM_BAI1 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
ALL_BAM_FLAGSTAT = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"

ALL_EXON_INTERVAL_BED = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
ALL_STAT_TXT = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
ALL_MKDUP_BAM = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
ALL_METRICS = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
ALL_SORTED_BAM_BAI2 = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"

ALL_INSERT_SIZE_HISTOGRAM_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"

ALL_RECAL_DATA_TABLE = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
ALL_BQSR_SORTED_BAM = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
ALL_depthOfcoverage = PROJECT_RUN_DIR + "/gatk/depthOfcoverage"

ALL_RAW_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
All_SNP_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
ALL_SNP_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
ALL_SNP_FILTER_VCF_STAT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"

All_INDEL_SELECT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
ALL_INDEL_FILTER_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"

ALL_SNP_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
ALL_INDEL_AVINPUT = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"

ALL_MERG_FILTERE_VCF = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"

ALL_SNPANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
ALL_INDELANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
ALL_ALLANNO = PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"

QUALIMAP_REPORT = PROJECT_RUN_DIR + "/gatk/bam_QC/" + sample + "_bamQC_report.pdf"


######################## Workflow #################################
rule all:
    input:
        ALL_FASTQC_RAW_ZIP1,
        ALL_FASTQC_RAW_HTML1,
        ALL_FASTQC_RAW_ZIP2,
        ALL_FASTQC_RAW_HTML2,
        ALL_SORTED_BAM,
        ALL_SORTED_BAM_BAI1,
        ALL_BAM_FLAGSTAT,
        ALL_EXON_INTERVAL_BED,
        ALL_STAT_TXT,
        ALL_MKDUP_BAM,
        ALL_METRICS,
        ALL_SORTED_BAM_BAI2,
        ALL_RECAL_DATA_TABLE,
        ALL_BQSR_SORTED_BAM,
        ALL_RAW_VCF,
        All_SNP_SELECT,
        All_INDEL_SELECT,
        ALL_SNP_FILTER_VCF,
        ALL_INDEL_FILTER_VCF,
        ALL_MERG_FILTERE_VCF,
        ALL_SNPANNO,
        ALL_INDELANNO,
        ALL_ALLANNO


rule fastqc_raw:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.zip",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_1_fastqc.html",
        PROJECT_RUN_DIR + "/fastqc/raw_data/" + sample + "_2_fastqc.html"
    threads: 8
    params:
        raw_data_dir = PROJECT_RUN_DIR + "/fastqc/raw_data",
        clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data",	
        align_dir = PROJECT_RUN_DIR + "/align", 
        gatk_dir = PROJECT_RUN_DIR + "/gatk",
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        tmp_dir = PROJECT_RUN_DIR + "/tmp",
	fastqc_report_dir = PROJECT_RUN_DIR + "/report/fastqc_report",
	align_report_dir = PROJECT_RUN_DIR + "/report/align_report"
    shell:
        """
        mkdir -p {params.raw_data_dir}
        mkdir -p {params.clean_data_dir}
        mkdir -p {params.align_dir}
        mkdir -p {params.gatk_dir}
        mkdir -p {params.vcf_dir}
        mkdir -p {params.tmp_dir}
        fastqc -f fastq --extract -t {threads} -o {params.raw_data_dir} {input.r1} {input.r2}        
	"""

rule fastqc_clean:
    input:
        r1 = PROJECT_DATA_DIR + "/" + sample + "_1.fq.gz",
        r2 = PROJECT_DATA_DIR + "/" + sample + "_2.fq.gz"
    output:
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz"),
        temp(PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz")
    threads: 8
    params:
        quality = 20,
        length = 20,
	    clean_data_dir = PROJECT_RUN_DIR + "/fastqc/clean_data"
    shell:
        """
        trim_galore --paired --quality {params.quality} --length {params.length} -o {params.clean_data_dir} {input.r1} {input.r2}
	    """

# BWA mem比对
rule bwa_mem:
    input:
        reads1 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_1_val_1.fq.gz",
        reads2 = PROJECT_RUN_DIR + "/fastqc/clean_data/" + sample + "_2_val_2.fq.gz"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    threads:8
    params:
        hg = config["hg_fa"]
    shell:
        """
        bwa mem -t {threads} -M -R "@RG\\tID:{sample}\\tPL:ILLUMINA\\tLB:{sample}\\tSM:{sample}" {params.hg} {input.reads1} {input.reads2}|samtools sort -@ {threads} -o {output}
        """

rule bam_index_1:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

rule flagstat:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam.flagstat"
    threads: 1
    shell:
        "samtools flagstat {input} > {output}"

# 创建reference fasta 字典,GATK需要用到
# rule CreateSequenceDictionary:
#     input:
#         "/reference/hg19/ucsc.hg19.fa"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".hg19.dict"
#         # /GATK_bundle/hg19/ucsc.hg19.dict
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk_dir = PROJECT_RUN_DIR + "/" + sample + "/gatk"
#     shell:
#         """
#         mkdir -p {params.gatk_dir}
#         gatk CreateSequenceDictionary -R {input} -O {ouptut}
#         """ 

rule CreateExonIntervalBed:
    input:
        regions_bed = config["regions_bed"],
        hg_dict = config["hg_dict"]
    output:
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    threads: 1
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk BedToIntervalList -I {input.regions_bed} -O {output.Exon_Interval_bed} -SD {input.hg_dict}
        """ 

# 外显子区域覆盖度
rule exon_region:
    input:
        bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam",
        Exon_Interval_bed = PROJECT_RUN_DIR + "/gatk/" + sample + ".Exon.Interval.bed"
    output:
        stat_txt = PROJECT_RUN_DIR + "/gatk/" + sample + ".stat.txt"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk CollectHsMetrics -BI {input.Exon_Interval_bed} -TI {input.Exon_Interval_bed} -I {input.bam} -O {output.stat_txt}
        """ 

# 标记PCR重复序列
rule mark_duplicates:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.bam"
    output:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        metrics = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.bam.metrics"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" MarkDuplicates -I {input} -O {output.mkdup_bam} -M {output.metrics}
        """

rule bam_index_2:
    input:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam"
    output:
        PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai"
    threads: 2
    shell:
        "samtools index {input} > {output}"

# 插入片段分布统计
#rule InsertSizeStat:
#    input:
#        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.bam"
#    output:
#        insert_size_metrics_txt = temp(PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_metrics.txt"),
#        insert_size_histogram_pdf = PROJECT_RUN_DIR + "/gatk/" + sample + ".insert_size_histogram.pdf"
#    threads:1
#    params:
#        picard = "/public/software/picard.jar",
#        M = 0.5
#    shell:
#        """
#       java -jar {params.picard} CollectInsertSizeMetrics \
#        I={input} \
#        O={output.insert_size_metrics_txt} \
#        H={output.insert_size_histogram_pdf} \
#        M=0.5
#        """

# 重新校正碱基质量值
# 计算出所有需要进行重校正的read和特征值,输出为一份校准表文件.recal_data.table
rule BaseRecalibrator:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        mkdup_bam_index = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam.bai",
        vcf_1000G_phase1_indels_sites = config["vcf_1000G_phase1_indels_sites"],
        vcf_Mills_and_1000G_gold_standard_indels_sites = config["vcf_Mills_and_1000G_gold_standard_indels_sites"],
        vcf_dbsnp = config["vcf_dbsnp"]
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table"
    threads: 8
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        --known-sites {input.vcf_1000G_phase1_indels_sites} \
        --known-sites {input.vcf_Mills_and_1000G_gold_standard_indels_sites} \
        --known-sites {input.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
        
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" BaseRecalibrator -R genome/hg19.fa -I ./TKQX.sorted.mkdup.bam --known-sites /GATK_bundle/hg19/1000G_phase1.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/Mills_and_1000G_gold_standard.indels.hg19.sites.vcf --known-sites /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.recal_data.table
# java -jar CreateSequenceDictionary.jar R= Homo_sapiens_assembly18.fasta O= Homo_sapiens_assembly18.dict

# 利用校准表文件重新调整原来BAM文件中的碱基质量值,使用这个新的质量值重新输出新的BAM文件
rule ApplyBQSR:
    input:
        mkdup_bam = PROJECT_RUN_DIR + "/align/" + sample + ".sorted.mkdup.bam",
        recal_data = PROJECT_RUN_DIR + "/gatk/" + sample + ".recal_data.table",
        regions_bed = config["regions_bed"]
        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
    output:
        PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"]
        #hg_fa = "genome/hg19.fa"
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR \
        -R {params.hg_fa} \
        -I {input.mkdup_bam} \
        -bqsr {input.recal_data} \
        -L {input.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" ApplyBQSR -R genome/hg19.fa -I ./test_sample.sorted.mkdup.bam -bqsr ./test_sample.recal_data.table -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./test_sample.sorted.mkdup.BQSR.bam

# # 校正碱基图
# rule recal_data_plot:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/gatk/" + sample + ".recal_data.table.plot"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         gatk = "/public/software/anaconda3/envs/PGS_1M/bin/gatk"
#     shell:
#         """
#         gatk AnalyzeCovariates -bqsr {input} -plots {output}
#         """
# gatk AnalyzeCovariates -bqsr ./test_sample.recal_data.table -plots ./test_sample.recal_data.table.plot

# 统计测序深度和覆盖度
#rule depthOfcoverage:
#    input:
#        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
#    output:
#        PROJECT_RUN_DIR + "/gatk/depthOfcoverage"
#    threads:1
#    params:
#        #hg19_fa = "genome/hg19.fa",
#        hg_fa = config["hg_fa"],
#        bamdst = "/public/software/bamdst/bamdst",
#        regions_bed = config["regions_bed"]
#        #regions_bed = "/GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed"
#    shell:
#        """
#        {params.bamdst} -p {params.regions_bed} -o {output} {input.BQSR_mkdup_bam}
#        """
# """
# gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage  \
# -R {params.hg19} \
# -o {output} \
# -I {input.BQSR_mkdup_bam} \
# -L {input.regions_bed} \
# --omitDepthOutputAtEachBase --omitIntervalStatistics \
# -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100
# """
#gatk --java-options "-Xmx20G -Djava.io.tmpdir=./" DepthOfCoverage -R genome/hg19.fa -o ./TKQX -I ./TKQX.sorted.mkdup.BQSR.bam -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed --omitDepthOutputAtEachBase --omitIntervalStatistics -ct 1 -ct 5 -ct 10 -ct 20 -ct 30 -ct 50 -ct 100

# /public/software/bamdst/bamdst -p /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -o ./depth_coverage ./TKQX.sorted.mkdup.BQSR.bam

# # 单样本VCF calling
rule vcf_calling:
    input:
        BQSR_mkdup_bam = PROJECT_RUN_DIR + "/gatk/" + sample + ".sorted.mkdup.BQSR.bam"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    threads: 4
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        hg_fa = config["hg_fa"],
        vcf_dbsnp = config["vcf_dbsnp"],
        regions_bed = config["regions_bed"]
    shell:
        """
        gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller \
        -R {params.hg_fa} \
        -I {input.BQSR_mkdup_bam} \
        -D {params.vcf_dbsnp} \
        -L {params.regions_bed} \
        -O {output}
        """
# gatk --java-options "-Xmx30G -Djava.io.tmpdir=./" HaplotypeCaller -R genome/hg19.fa -I ./TKQX.sorted.mkdup.BQSR.bam -D /GATK_bundle/hg19/dbsnp_138.hg19.vcf -L /GATK_bundle/bed/SureSelectAllExonV6/S07604514_Regions.bed -O ./TKQX.raw.vcf

# 变异质控和过滤,对raw data进行质控,剔除假阳性的标记
rule vcf_select_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type SNP -V {input} -O {output}
        """

rule vcf_filter_SNP:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        MQ = 40.0,
        FS = 60.0,
        SOR = 3.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || MQ < {params.MQ} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "PASS" \
        -O {output}
        """
# --filter-expression "QD < 2.0|| MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS"‘
# gatk VariantFiltration -V TKQX.snp.vcf --filter-expression "QD < 2.0|| MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS" -O TKQX.snp.filter.vcf

# rule snp_frequency_stat:
#     input:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf"
#     output:
#         PROJECT_RUN_DIR + "/" + sample + "/vcf/" + sample + ".snp.filter.vcf_stat.csv"
#     threads: 1
#     params:
#         gatk4 = "/public/software/gatk-4.0.6.0/gatk",
#         snp_frequency = "/public/analysis/pipeline/WES/scripts/snp_frequency.py"
#     shell:
#         """
#         python {params.snp_frequency} {input}
#         """

rule vcf_select_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".raw.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk"
    shell:
        """
        gatk SelectVariants -select-type INDEL -V {input} -O {output}
        """

rule vcf_filter_InDel:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    threads: 2
    params:
        gatk4 = "/public/software/gatk-4.0.6.0/gatk",
        QD = 2.0,
        FS = 200.0,
        SOR = 10.0,
        MQRankSum = -12.5,
        ReadPosRankSum = -8.0
    shell:
        """
        gatk VariantFiltration \
        -V {input} \
        --filter-expression "QD < {params.QD} || FS > {params.FS} || SOR > {params.SOR} || MQRankSum < {params.MQRankSum} || ReadPosRankSum < {params.ReadPosRankSum}" --filter-name "PASS" \
        -O {output}
        """
# --filter-expression "QD < 2.0 || FS > 200.0 || SOR > 10.0|| MQRankSum < -12.5 || ReadPosRankSum < -8.0" --filter-name "PASS"

rule merge_vcf:
    input:
        snp_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel_filter_vcf = PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    threads: 1
    params:
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        gatk MergeVcfs \
        -I {input.snp_filter_vcf} \
        -I {input.indel_filter_vcf} \
        -O {output}
        """

rule transfer_avinput:
    input:
        snp = PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.filter.vcf",
        indel= PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.filter.vcf",
        merge = PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.filter.vcf"
    output:
        snp = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"),
        indel = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"),
        merge = temp(PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput")
    threads: 4
    params:
        convert2annovar = "/public/software/annovar/convert2annovar.pl"
    shell:
        """
        perl {params.convert2annovar} -format vcf4 {input.snp} > {output.snp}
        perl {params.convert2annovar} -format vcf4 {input.indel} > {output.indel}
        perl {params.convert2annovar} -format vcf4 {input.merge} > {output.merge}
        """
# SNP注释
rule snp_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".snp.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_snpanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_snpanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """
        
# INDEL注释
rule indel_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".indel.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_indelanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_indelanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf",
        perl = "/home/miniconda3/pkgs/perl-5.26.2-h36c2ea0_1008/bin/perl"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,gnomad211_exome,exac03,esp6500siv2_all,ALL.sites.2015_08 \
        -operation g,r,f,f,f,f,f -nastring . -csvout
        """

#
rule merge_annotate:
    input:
        PROJECT_RUN_DIR + "/vcf/" + sample + ".merge.avinput"
    output:
        PROJECT_RUN_DIR + "/vcf/" + sample + "_allanno." + genome + "_multianno.csv"
    threads: 2
    params:
        table_annovar = "/public/software/annovar/table_annovar.pl",
        humandb = "/public/software/annovar/humandb/",
        prefix = sample + "_allanno",
        genome = config["genome"],
        vcf_dir = PROJECT_RUN_DIR + "/vcf"
    shell:
        """
        cd {params.vcf_dir}
        perl {params.table_annovar} {input} {params.humandb} -buildver {params.genome} -out {params.prefix} \
        -remove -protocol refGene,cytoBand,clinvar_20221231,avsnp147,exac03,gnomad211_exome,SAS.sites.2015_08,esp6500siv2_all,intervar_20180118,dbnsfp42c \
        -operation g,r,f,f,f,f,f,f,f,f -nastring . -csvout
        """

生信分析进阶文章推荐

生信分析进阶1 - HLA分析的HLA区域reads提取及bam转换fastq

生信分析进阶2 - 利用GC含量的Loess回归矫正reads数量

生信分析进阶3 - pysam操作bam文件统计unique reads和mapped reads高级技巧合辑

生信分析进阶4 - 比对结果的FLAG和CIGAR信息含义与BAM文件指定区域提取

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1829182.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Vue3【二十一】Vue 路由模式(createWebHashHistory /createWebHistory )和RouterLink写法

Vue3【二十一】Vue 路由模式&#xff08;createWebHashHistory /createWebHistory &#xff09;和RouterLink写法 Vue3【二十一】Vue 路由模式和普通组件目录结构 createWebHistory history模式&#xff1a;url不带#号&#xff0c;需要后端做url适配 适合销售项目 利于seo crea…

pytorch学习笔记6

想要找一些官方的小工具数据集&#xff0c;可以进入pytorch官网&#xff0c;DOCS-》pytorch下拉至libraries&#xff0c;点击torchversion&#xff0c;调整版本至0.9.0就可以找到相应的一些数据集&#xff0c;训练集 ctrlp可以看一个函数中需要设置哪些参数 下载数据集可以参考…

C/C++中内存开辟与柔性数组

C/C中内存的开辟 在C中&#xff0c;我们都知道有三个区&#xff1a; 1. 栈区&#xff08;stack&#xff09;&#xff1a;在执行函数时&#xff0c;函数内局部变量的存储单元都可以在栈上创建&#xff0c;函数执行结 束时这些存储单元自动被释放。栈内存分配运算内置于处理器的指…

云和运维(SRE)的半生缘-深读实证02

这个标题不算太夸张&#xff0c;云计算和很多IT岗位都有缘&#xff0c;但是和运维&#xff08;SRE&#xff09;岗位的缘分最深。 “深读实证”系列文章都会结合一些外部事件&#xff0c;点明分析《云计算行业进阶指南》书中的内容。本次分享介绍了下列内容&#xff1a; 我以运维…

Git学习记录v1.0

1、常用操作 git clonegit configgit branchgitt checkoutgit statusgit addgit commitgit pushgit pullgit loggit tag 1.1 git clone 从git服务器拉取代码 git clone https://gitee.com/xxx/studyJava.git1.2 git config 配置开发者用户名和邮箱 git config user.name …

数值分析笔记(二)函数插值

函数插值 已知函数 f ( x ) f(x) f(x)在区间[a,b]上n1个互异节点 { x i } i 0 n \{{x_i}\}_{i0}^{n} {xi​}i0n​处的函数值 { y i } i 0 n \{{y_i}\}_{i0}^{n} {yi​}i0n​&#xff0c;若函数集合 Φ \Phi Φ中函数 ϕ ( x ) \phi(x) ϕ(x)满足条件 ϕ ( x i ) y i ( i …

决策树概念

图例 概念 决策树基本上就是对经验的总结 决策树的构成&#xff0c;分为两个阶段。构造和剪枝 构造 概念 构造就是生成一颗完整的决策树。构造的过程就是选择什么属性作为节点的过程 构造过程&#xff0c;会存在3种节点 根节点&#xff1a;就是树的最顶端&#xff0c;最…

基于STM32和人工智能的自动驾驶小车系统

目录 引言环境准备自动驾驶小车系统基础代码实现&#xff1a;实现自动驾驶小车系统 4.1 数据采集模块4.2 数据处理与分析4.3 控制系统4.4 用户界面与数据可视化应用场景&#xff1a;自动驾驶应用与优化问题解决方案与优化收尾与总结 1. 引言 随着人工智能和嵌入式系统技术的…

竟然与 package-lock.json 更新有关!部分用户 H5 页面白屏问题!

一.问题 1 场景 现象 接到部分用户反馈进入xxx H5 页面空白&#xff1b; 研发测日志里问题用户的线上页面URL地址可以正常访问&#xff0c;没有复现问题&#xff01;&#xff01;&#xff01; 定位问题 监控平台和客户端日志报错&#xff1a; SyntaxError: Unexpected toke…

pc repair

pc repair 修理电脑&#xff0c;换配件

数字化转型,不做是等死,做了是找死

“ 有不少人调侃说&#xff1a;数字化转型&#xff0c;不做是等死&#xff0c;做了是找死。如果你是一个老板&#xff0c;你会怎么选择呢&#xff0c;下面我来剖析一下。” 我按照“做正确的事&#xff0c;正确的做事”来分析数字化转型&#xff0c;再通过抓痛点和流程再造两项…

MySQL经典面试题:谈一谈你对事务的理解

文章目录 &#x1f4d1;事务事务的基本概念回滚开启事务的sql语句 事务的基本特性总结一下涉及到的三个问题 ☁️结语 &#x1f4d1;事务 事务的基本概念 事务是用来解决一类特定场景的问题的&#xff0c;在有些场景中&#xff0c;完成某个操作&#xff0c;需要多个sql配合完…

HCIA 16 构建 IPv6 网络基础配置

IPv6&#xff08;Internet Protocol Version 6&#xff09;也被称为 IPng&#xff08;IP Next Generation&#xff09;。由 Internet 工程任务组 IETF&#xff08;Internet Engineering Task Force&#xff09;设计&#xff0c;是 IPv4下一代版本。 相比较于 IPv4&#xff0c;I…

第 6 章: Spring 中的 JDBC

JDBC 的全称是 Java Database Connectivity&#xff0c;是一套面向关系型数据库的规范。虽然数据库各有不同&#xff0c;但这些数据库都提供了基于 JDBC 规范实现的 JDBC 驱动。开发者只需要面向 JDBC 接口编程&#xff0c;就能在很大程度上规避数据库差异带来的问题。Java 应用…

【Linux】进程间通信1——管道概念,匿名管道

1.进程间通信介绍 进程是计算机系统分配资源的最小单位&#xff08;严格说来是线程&#xff09;。每个进程都有自己的一部分独立的系统资源&#xff0c;彼此是隔离的。为了能使不同的进程互相访问资源并进行协调工作&#xff0c;才有了进程间通信。 进程间通信&#xff0c;顾名…

STM32CubeMX配置-看门狗配置

一、简介 MCU为STM32G070&#xff0c;LSI为32K&#xff0c;看门狗IWDG配置为4S溢出&#xff0c;则配置是设置分频为32分频&#xff0c;重装载值为3000。 二、IWDG配置 1.外设配置 2.时钟配置 3.生成代码 HAL_IWDG_Refresh(&hiwdg); //喂狗

ADS基础教程21 - 电磁仿真(EM)模型的远场和场可视化

模型的远场和场可视化 一、引言二、操作步骤1.定义参数2.执行远场视图&#xff08;失败案例&#xff09;3.重新仿真提取参数 三、总结 一、引言 本文介绍电磁仿真模型的远场和场可视化。 二、操作步骤 1.定义参数 1&#xff09;在Layout视图&#xff0c;工具栏中点击EM调出…

Autosar诊断-FIM模块功能介绍

文章目录 前言一、FIM模块概述二、FID概念介绍Event ID和DTC之间的关系Event ID与FID之间的关系FIM数据结构三、FiM模块与SW-C模块交互关系四、FIM模块函数调用关系FiM功能模块作用过程前言 Autosar诊断的主体为UDS(Unified Diagnostic Services)协议,即统一的诊断服务,是…

力扣191. 位1的个数

Problem: 191. 位1的个数 文章目录 题目描述思路复杂度Code 题目描述 思路 题目规定数值的范围不会超过32位整形数 1.定义统计个数的变量oneCount&#xff1b;由于每次与给定数字求与的变量mask初始化为1 2.for循环从0~32&#xff0c;每一次拿mask与给定数字求与运算&#xff…

鸿蒙求职面试内容总结——6月3日ZR的FS项目

最近接到了一些公司的入职面试邀约&#xff0c;这里略去公司的和项目的名字&#xff0c;做一些整理分享。 一、长列表如何实现部分渲染&#xff0c;使用的是哪一个API 在鸿蒙系统中&#xff0c;可以使用List组件来实现长列表的部分渲染。List组件支持使用条件渲染、循环渲染、…