生物信息-快速序列比对-edlib包安装与使用方法-python和dll

news2024/12/29 10:21:19

安装

下载：https://github.com/Martinsos/edlib

cd build && cmake -D CMAKE_BUILD_TYPE=Release .. && make

在这里插入图片描述

升级cmake

https://cmake.org/download/ 下载tar.gz包，我这里下载的是：
在这里插入图片描述
拷贝到 /home/xxx/app 下解压：

sudo tar xf cmake-3.25.0-rc2-linux-x86_64.tar.gz

此时/home/xxx/app/cmake-3.25.0-rc2-linux-x86_64/bin 下的cmake就是我们需要的，建立软链接：

sudo ln -sf /home/xxx/app/cmake-3.25.0-rc2-linux-x86_64/bin/cmake /usr/bin/cmake

检查版本：cmake --version

cmake version 3.25.0-rc2

CMake suite maintained and supported by Kitware (kitware.com/cmake).

linux下调用可执行程序

cd edlib-master/build/bin

time ./edlib-aligner src.txt tgt.txt -m HW -p -f NICE > st1.txt

real    0m0.002s
user    0m0.002s
sys     0m0.000s

其中src有826个碱基，tgt有1917个碱基，st1.txt部分内容如图所示：
在这里插入图片描述
中间省略

参数说明：

XXX@XXX-pc:~/Desktop/edlib-master/build/bin$ ./edlib-aligner

Usage: ./edlib-aligner [options...] <queries.fasta> <target.fasta>
Options:
        -s  If specified, there will be no score or alignment output (silent mode).
        -m HW|NW|SHW  Alignment mode that will be used. [default: NW]
        -n N  Score will be calculated only for N best sequences (best = with smallest score). If N = 0 then all sequences will be calculated. Specifying small N can make total calculation much faster. [default: 0]
        -k K  Sequences with score > K will be discarded. Smaller k, faster calculation. If -1, no sequences will be discarded. [default: -1]
        -p  If specified, alignment path will be found and printed. This may significantly slow down the calculation.
        -l  If specified, start locations will be found and printed. Each start location corresponds to one end location. This may somewhat slow down the calculation, but is still faster then finding alignment path and does not consume any extra memory.
        -f NICE|CIG_STD|CIG_EXT  Format that will be used to print alignment path, can be used only with -p. NICE will give visually attractive format, CIG_STD will  give standard cigar format and CIG_EXT will give extended cigar format. [default: NICE]
        -r N  Core part of calculation will be repeated N times. This is useful only for performance measurement, when single execution is too short to measure. [default: 1]

其中这个cigar，是排列的字符串形式描述。解释方法如下：

字符	含义
X	mismatch (extended format)
I	insertion
D	deletion
=	match
M	mismatch (standard format)

所以如果想得到cigar，可以用如下的命令：

xxx@xxx-pc:~/Desktop/edlib-master/build/bin$ time ./edlib-aligner src.txt tgt.txt -m HW -p -f CIG_EXT
Using HW alignment mode.
Reading queries...
Read 1 queries, 826 residues total.
Reading target fasta file...
Read target, 1917 residues.

Comparing queries to target...

Query #0 (826 residues): score = 13
Cigar:
6X1=2X1I2=1X1=1X3=1I23=1D784=

Cpu time of searching: 0.000217

real    0m0.002s
user    0m0.000s
sys     0m0.001s

所以上面示例的cigar可以解释为：6X (6个mismatch) 1= (1个match) 2X (2个mismatch) 。。。784= (784个match)
正对应这一段的匹配图：

T: TTATATATC-CCTTTGTT-GGGTTTATACCACTTAGCTTGAAAGAGGTTA (872 - 919)
         |   || | ||| ||||||||||||||||||||||| |||||||
Q: GGGGGCAGGACCGTAGTTAGGGTTTATACCACTTAGCTTGAA-GAGGTTA (0 - 48)

python版

https://pypi.org/project/edlib/

如果对c++版本有要求参见上面的cmake。pip直接安装，页面有示例
举个例子：

import edlib
# source和target使用的分别是826个和1917个碱基的序列，内容略
source = "XXX...XXXXX"
target = "xxx......XX"
alignment = edlib.align(source, target, mode="HW", task="path")
print(alignment)
print("-------------------------------------------")
s = t = ""
# - 表示gap
print(edlib.getNiceAlignment(alignment, s, t, "-"))

# 输出
# locations表示匹配的位置
{'editDistance': 13, 'alphabetLength': 4, 'locations': [(872, 1696)], 'cigar': '6X1=2X1I2=1X1=1X3=1I23=1D784='}
-------------------------------------------
{'query_aligned': '-', 'matched_aligned': '......|..-||.|.|||-|||||||||||||||||||||||-||中间省略||||||||||||||||||', 'target_aligned': '--'}