Table of Contents
- Purpose
- Implementation Steps
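Purpose
NGS gave me a large number of reads. Some of them are paired reads that I want to merge based on their overlap, but because I had trimmed the sequence IDs, they would not run through PANDAseq.
In short, the goal is simple: take reads that share an overlapping region and join them into one longer sequence.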
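To make the goal concrete, here is a minimal sketch of overlap-based merging for a single read pair. It is only an illustration of the idea, not what PANDAseq or SPAdes actually do: the function name `merge_pair` and the toy sequences are hypothetical, only exact overlaps are accepted, and base qualities are ignored.

```bash
# merge_pair FWD REV [MIN_OVERLAP]
# Reverse-complements REV, then looks for the longest exact overlap between
# the end of FWD and the start of the reverse-complemented REV.
# Prints the merged sequence, or returns 1 if no overlap >= MIN_OVERLAP exists.
# Assumes uppercase A/C/G/T/N only (hypothetical helper, not part of any tool).
merge_pair() {
  awk -v fwd="$1" -v rev="$2" -v min="${3:-5}" '
    function revcomp(s,    i, out) {
      out = ""
      for (i = length(s); i >= 1; i--) out = out comp[substr(s, i, 1)]
      return out
    }
    BEGIN {
      comp["A"] = "T"; comp["C"] = "G"; comp["G"] = "C"; comp["T"] = "A"; comp["N"] = "N"
      rc = revcomp(rev)
      max = (length(fwd) < length(rc)) ? length(fwd) : length(rc)
      # Try the longest candidate overlap first
      for (n = max; n >= min; n--) {
        if (substr(fwd, length(fwd) - n + 1) == substr(rc, 1, n)) {
          print fwd substr(rc, n + 1)
          exit 0
        }
      }
      exit 1
    }'
}

# Toy pair taken from the two ends of a 20-nt fragment, overlapping by 8 nt
merge_pair AAACCCGGGTTTAC TCGTACGTAAACCC
# prints AAACCCGGGTTTACGTACGA
```

Real mergers tolerate mismatches in the overlap and use quality scores to resolve them, so this sketch is only useful for seeing what "merging by overlap" means.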
Implementation Steps
conda install -c bioconda spades
spades.py -h
# Take a look at what my files look like
head all_reads_map_to_plasmid_1.fq all_reads_map_to_plasmid_2.fq
# Some reads in my files may have no mate, so I remove them up front
seqkit pair -1 all_reads_map_to_plasmid_1.fq -2 all_reads_map_to_plasmid_2.fq --id-regexp '^(\S+)\/[12]'
# This produces the following files
# [INFO] 254 paired-end reads saved to all_reads_map_to_plasmid_1.paired.fq and all_reads_map_to_plasmid_2.paired.fq
# Run SPAdes to do the assembly
spades.py -1 all_reads_map_to_plasmid_1.paired.fq -2 all_reads_map_to_plasmid_2.paired.fq -o ./test
# The final result is shown below
# It is definitely not the result I wanted
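The pairing step above keeps only reads whose ID, with the trailing /1 or /2 removed, appears in both files. As a rough cross-check of that count, the sketch below does the same comparison with plain awk. The function names `fastq_pair_keys` and `count_shared_keys` and the toy read names are hypothetical, and the sketch assumes 4-line FASTQ records with headers ending in /1 or /2; it only approximates what the `--id-regexp` above matches.

```bash
# Print the pairing key of each record in a 4-line-per-record FASTQ file:
# the first whitespace-free token of the header, without the leading "@"
# and without a trailing "/1" or "/2".
fastq_pair_keys() {
  awk 'NR % 4 == 1 { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id); print id }' "$1"
}

# Count the distinct pairing keys that occur in both FASTQ files.
count_shared_keys() {
  awk '
    FNR % 4 != 1 { next }                      # only look at header lines
    { id = $1; sub(/^@/, "", id); sub(/\/[12]$/, "", id) }
    NR == FNR { seen[id] = 1; next }           # first file: remember keys
    (id in seen) && !(id in counted) { counted[id] = 1; n++ }
    END { print n + 0 }
  ' "$1" "$2"
}

# Toy data with hypothetical read names: only r2 has both mates
tmp=$(mktemp -d)
printf '@r1/1\nACGT\n+\nIIII\n@r2/1\nACGT\n+\nIIII\n' > "$tmp/toy_1.fq"
printf '@r2/2\nTTTT\n+\nIIII\n@r3/2\nGGGG\n+\nIIII\n' > "$tmp/toy_2.fq"
count_shared_keys "$tmp/toy_1.fq" "$tmp/toy_2.fq"
# prints 1
rm -rf "$tmp"
```

Run on the two original files, the number printed should agree with the read count that seqkit reports, provided the headers follow the /1 and /2 convention.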
For reference, the abridged output of spades.py -h, with my notes after each #:
SPAdes genome assembler v3.13.1
Usage: /opt/miniconda3/bin/spades.py [options] -o <output_dir>
Basic options:
-o <output_dir> directory to store all the resulting files (required)
--sc this flag is required for MDA (single-cell) data # assemble single-cell sequencing data
--meta this flag is required for metagenomic sample data # assemble metagenomic sequencing data
--rna this flag is required for RNA-Seq data # assemble transcriptome (RNA-Seq) data
--plasmid runs plasmidSPAdes pipeline for plasmid detection # assemble plasmids
--iontorrent this flag is required for IonTorrent data
--test runs SPAdes on toy dataset
-h/--help prints this usage message
-v/--version prints version
Input data:
--12 <filename> file with interlaced forward and reverse paired-end reads # interleaved paired-end reads (fastq)
-1 <filename> file with forward paired-end reads # forward paired-end reads (fastq)
-2 <filename> file with reverse paired-end reads # reverse paired-end reads (fastq)
-s <filename> file with unpaired reads # unpaired reads (fastq)
--merged <filename> file with merged forward and reverse paired-end reads # merged paired-end reads (fastq)
(19 options omitted here)
--sanger <filename> file with Sanger reads # hybrid assembly together with Sanger reads
--pacbio <filename> file with PacBio reads # hybrid assembly together with PacBio reads
--nanopore <filename> file with Nanopore reads # hybrid assembly together with Nanopore reads
Pipeline options:
--only-error-correction runs only read error correction (without assembling) # error correction only
--only-assembler runs only assembling (without read error correction) # assembly only
--careful tries to reduce number of mismatches and short indels
# Runs the MismatchCorrector module to fix mismatches and short indels in the assembly. Recommended.
--continue continue run from the last available check-point
(3 options omitted here)
Advanced options:
--dataset <filename> file with dataset description in YAML format
-t/--threads <int> number of threads [default: 16] # number of CPU cores/threads
-m/--memory <int> RAM limit for SPAdes in Gb (terminates if exceeded) [default: 250]
# SPAdes needs a lot of memory! If the hardware allows, set -m 500 or even higher.
--tmp-dir <dirname> directory for temporary files [default: <output_dir>/tmp]
-k <int,int,...> comma-separated list of k-mer sizes (must be odd and less than 128) [default: 'auto']
# k-mer sizes; several can be given: -k 33,43,55,63,73,89
--cov-cutoff <float> coverage cutoff value (a positive float number, or 'auto', or 'off') [default: 'off']
--phred-offset <33 or 64> PHRED quality offset in the input reads (33 or 64) [default: auto-detect]
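The help text above says the values passed to -k must be odd and less than 128. A small pre-flight check along those lines can catch a typo before a run starts. The function name `valid_kmer_list` is hypothetical, and the sketch checks only those two conditions from the help text; any other constraint SPAdes may enforce is not covered.

```bash
# valid_kmer_list LIST
# Returns 0 if LIST is a comma-separated list of integers that are all odd
# and less than 128, as the -k help text requires; returns 1 otherwise.
# Hypothetical helper, not part of SPAdes.
valid_kmer_list() {
  printf '%s\n' "$1" | awk -F',' '
    NF == 0 { bad = 1 }                        # empty list
    {
      for (i = 1; i <= NF; i++)
        if ($i !~ /^[0-9]+$/ || $i % 2 == 0 || $i >= 128) bad = 1
    }
    END { exit bad + 0 }'
}

# The list from the note above passes the check
if valid_kmer_list 33,43,55,63,73,89; then
  echo "k-mer list looks valid"
fi
# prints: k-mer list looks valid
```

A list such as 32,55 (even value) or 129 (too large) makes the function return 1, so it can guard a spades.py call in a wrapper script.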