欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://spike.blog.csdn.net/article/details/132025401
在 AlphaFold2 中,使用 HHblits 算法搜索 BFD 与 UniRef30,输出 bfd_uniref_hits.a3m 文件。MMseqs2 优化搜索速度之后,比较使用 MMseqs2 算法与 HHblits 算法之间,BFD 与 UniRef30的搜索结果差异。
构建 MMseqs2
搜索库使用 ffindex2fasta
工具 (来源于 MMseqs v1),将 BFD
与 UniRef30
转换成 fasta
文件,即:
$MMDIR/bin/ffindex2fasta resultDB resultDB.fasta
转换之后 BFD 是 1.5T,UniRef30 是 160G。
测试 Case 来源于 CASP15,即:T1104-D1_A117.fasta
、T1137s8-D1_A251.fasta
、T1188-D1_A573.fasta
、T1157s1_A1029.fasta
,序列长度100、250、500、1000等。
1. AF2 MSA
bfd_uniref_hits.a3m
中序列数量,一般而言,序列越长,可搜索出的序列数量越多,即:
T1104-D1_A117
: 333 (包括自己)T1137s8-D1_A251
:1687T1188-D1_A573
:3648T1157s1_A1029
:3383
即,来自于 HHblits 搜索的日志 wc -l
:
666 bfd_uniref_hits.a3m # T1104-D1_A117
3374 bfd_uniref_hits.a3m # T1137s8-D1_A251
7296 bfd_uniref_hits.a3m # T1188-D1_A573
6766 bfd_uniref_hits.a3m # T1157s1_A1029
其中, T1104-D1_A117
的 bfd_uniref_hits.a3m
数据如下,一部分来自BFD,一部分来自UniRef30,即:
>A
QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG
>tr|A0A256LDH3|A0A256LDH3_9LACO Streptococcin A-M57 OS=Lactobacillus taiwanensis GN=CBF70_06415 PE=4 SV=1
VDNDPEVAVVAQELEKIFANGVSQENLNRYVLKNFSNKELTVAEKELDVNYNPFslqskndnnslhsvsvygwnnlgqCMYNKIKDEFFAMVNIGVIVKYAKKKAWKELAKVVIRFAKGAGVRTNAAIVAAQLAVWAVQCG
>tr|F7SG70|F7SG70_LACJH Uncharacterized protein OS=Lactobacillus johnsonii pf01 GN=PF01_01547 PE=4 SV=1
VDNDPEVAIAAQELEKIFVDVVSQKNLNRYALKNFSNKELTVAEKELDVNYNPFslqlkndnnslhsvsvyg---------------------------------------------------------------
>tr|Q6DRR6|Q6DRR6_STRPY Streptococcin A-M57 OS=Streptococcus pyogenes GN=scnM57 PE=4 SV=1
IEEQRQIDEVAAVLEKMFADGVTEENLKQYAQANYSEEELIIADNELNTNLSQIqdenaimykvdwgalgnCMANKIKDELLAMISVGTIIKYAQKKAWKELAKIVIKYVAKAGVKTNAALIAGQLAIWGLQCG
>tr|A0A1C0USB4|A0A1C0USB4_STREE Streptococcin A-M57 OS=Streptococcus pneumoniae GN=A4260_03625 PE=4 SV=1
TQEQKQIDDVANVLEQMFRNGVNEKNFTEYVYKNFSQKDIALAENELETNINNPydrvpwdemggCIAGKIRDEFFAMINVSLIVKYAQKKAWSELAKVVLRFVKANGLKTNIYIIAGQLAIWAVQCG
>tr|A0A133S201|A0A133S201_STRMT Uncharacterized protein OS=Streptococcus mitis GN=HMPREF3228_00364 PE=4 SV=1
----------------MFRNGVNEKNFTECVYKNFSQKDIALAENKLETNINNLydrvpwdemggCIARKIREEFFAMTNVSLTVRYA----------------------------------------
>tr|R2N7C7|R2N7C7_ENTFC Uncharacterized protein OS=Enterococcus faecium EnGen0191 GN=SSI_02965 PE=4 SV=1
QIKNSEVDTVAQGLEQMFSNGVSEENFKNYVNANFSSEEITKSEKELDVNLSNTsspiqarvnwnglgqCMANKIKDEFFAMINVGAIVAAAQKKAWKELAMTVLIFAKANGLKTNALIVAGQLAVWAVQCG
>UniRef100_A0A2N8LAA9 Uncharacterized protein n=1 Tax=Streptococcus penaeicida TaxID=1765960 RepID=A0A2N8LAA9_9STRE
-ISQSEVDAVAIEFEKLFSNGIiiSGNNYSinyDYLNNNYTSEEIQAFINLMASSELSPissgrrkrsissfvvCMKDKAVSDIADMFKVSAFVSFVQRKAWKEAAKFAVSWLAKNGIKRNVAATAALLSWYGIQCA
>UniRef100_A0A2X3SLP5 Uncharacterized protein n=2 Tax=Streptococcus equi TaxID=1336 RepID=A0A2X3SLP5_STRSZ
-TNtsSQEVDQVAQALELMFDNNVSTSNFKKYVNNNFSDSEIAIAELELESRISNSrsefrvawnemggCIAGKIRDEFFAMISVGTIVKYAQKKAWKELALVVLKFVKANGLKTNAIIVAGQLALWAVQCG
>UniRef100_A0A2X4H7F6 Uncharacterized protein n=4 Tax=Streptococcus uberis TaxID=1349 RepID=A0A2X4H7F6_STRUB
-VSQTEIESVASEFEQLFTKGIiiSGNNYTfnhDYLTNNYTSDEIQAFIHLMDSTDLSPtfskrrkrsigsfavCMKDKAVSDIADMFKVGAFVSFVQRKAWKEAAKFAVSWLAKNGIKRNVAATAALLSWYGIQCA
其中,tr
开头的来自于 BFD,UniRef100
开头的来自于 UniRef30
。
具体而言:
>UniRef100
是 250 条,UniRef30>tr|
+>SRR
是 81 条,BFD- 其他 1 条,BFD
- 合计 332 条
序列示例如下:
>UniRef100_A0A2W5AUB7 Uncharacterized protein n=1 Tax=Streptococcus pyogenes TaxID=1314 RepID=A0A2W5AUB7_STRPY
-----------------------------MIGDLLKSKEVLLKKTARQTSLSQiQdddvimyktdwnalgsCMANKIKDELLAMISIGTIITYAKRKAWKELATIVIKYVAKAGVRTHAAFIAGQLAIWGLQCG
>SoimicmetaTmtLPB_FD_contig_61_1212631_length_209_multi_1_in_0_out_0_1 # 2 # 76 # 1 # ID=3042713_1;partial=10;start_type=Edge;rbs_motif=None;rbs_spacer=None;gc_cont=0.693
---NEEIEQLAADLEFLMEEAAiydEKGKVVnfdfDLLEERFgYVLELEMLKEEIEAYnattegd--NDeiqlfswksCMISALKGHFGvALIEValtGGLWSYLEKKAYKEAAKLLLKI----GIEGNVIGLTAFLTWYSVDCI
>tr|Q4V1Z0|Q4V1Z0_BACCZ Uncharacterized protein OS=Bacillus cereus (strain ZK / E33L) GN=pE33L466_0102 PE=4 SV=1
-----LVQAVAAQLKFVVEEAAvkdKHGRVVdiDMIENKYgKTEELEQLRQEIQRVNTPPgyedpfkqeteavgnCIERKLIANYVEVLSVgflGSIIANITNKEYELAARKMIKL----GVKGNLISLAGQLAWYLGTCI
>SRR5690606_14355373
-----QIEELAAQLEFLMEEALiiENGERTfdfEKIENEFgkeVKDEIKMLTVDAQVWqvqpgaitlaanqPWKDCMVGAIKDHFGvALVTAaleGGLWAYLEKKAYKEAAKLLVKF----AVGTNAVGIAGTLIYYGGKCT
结论:在T1104-D1_A117
的 bfd_uniref_hits.a3m
数据中,UniRef30
的搜索数量略大于 BFD
,与数据库大小相反。
BFD (Big Fantastic Database) 是由 Martin Steinegger 博士构建的大型蛋白质序列数据库,包含超过 65 亿个蛋白质家族和 220 亿个蛋白质序列,覆盖了多种来源的蛋白质数据,如 UniProt、Metaclust、SRC 和 MERC。BFD 蛋白质数据库的目的是为了提供一个高质量、高覆盖率、低冗余的蛋白质序列资源,用于蛋白质结构预测、功能注释、进化分析等领域。
BFD 蛋白质数据库的构建过程是这样的:
- 从三大序列库 (GenBank \ EMBL \ DDBJ) 中采集 24 亿条蛋白质序列,使用 MMSeqs2 / Linclust 工具进行聚类,根据序列相似度和排列覆盖率的标准,将相似的序列归为一个家族,并且选取一个代表序列。
- 从 Metaclust 序列库中,补充了一些较长的序列,与已有的代表序列进行比对,将满足条件的序列加入相应的家族,将不满足条件的序列单独聚类。
- 对每个家族进行多序列比对 (MSA) 和隐马尔可夫模型 (HMM) 的构建,得到 BFD 蛋白质数据库。
2. MMseqs2 MSA
测试 BFD 脚本:
bash mmseqs2_search.sh test_fasta/T1104-D1_A117.fasta test_fasta/T1104-D1_A117-bfd-i3s8.a3m msa_databases/af2_msa_mmseqs/bfd_mmseqs/bfd_db 3 8 T1104-D1_A117-bfd-i3s8
测试效果:
[Info] Time taken to execute commands is 1974 seconds. (32.9 min)
154 test_fasta/T1104-D1_A117-bfd-i3s8.a3m
测试 UniRef30 脚本:
bash mmseqs2_search.sh test_fasta/T1104-D1_A117.fasta test_fasta/T1104-D1_A117-uniref30-i3s8.a3m msa_databases/af2_msa_mmseqs/uniref30_mmseqs/uniref30_db 3 8 T1104-D1_A117-uniref30-i3s8
测试效果:
[Info] Time taken to execute commands is 1266 seconds.
194 test_fasta/T1104-D1_A117-uniref30-i3s8.a3m
测试脚本:
# params
#=========================================
mmseqs=mmseqs
tmp=tmp_my
query_fasta=$1 # fasta
a3m_db=$2 # MSA
target_db=$3 # DB Path
num_iterations=$4 # 迭代轮次
sensitivity=$5 # 敏感度
out_tag=$6 # 唯一标识, 避免多进程搜索干扰
target_db_index=${target_db}.idx
query_db=$tmp/${out_tag}_queryDB
result_db=$tmp/${out_tag}_res # 文件
result_db_realign=$tmp/${out_tag}_res_realign
result_db_realign_filter=$tmp/${out_tag}_res_realign_filter
tmp_db=$tmp/${out_tag}_tmp
#=========================================
echo "[Info] query_fasta: $1"
echo "[Info] a3m_db: $2"
echo "[Info] target_db: $3"
echo "[Info] num_iterations: $4"
echo "[Info] sensitivity: $5"
echo "[Info] out_tag: $6"
mkdir -p $tmp
time_start=$(date +%s)
#=========================================
$mmseqs createdb ${query_fasta} ${query_db}
#$mmseqs search ${query_db} ${target_db} ${result_db} ${tmp_db} --db-load-mode 2 --num-iterations ${num_iterations} -s ${sensitivity} --max-seqs 10000 -e 0.1 -a --threads 16 --mpi-runner "mpirun --allow-run-as-root -np 8"
$mmseqs search ${query_db} ${target_db} ${result_db} ${tmp_db} --db-load-mode 2 --num-iterations ${num_iterations} -s ${sensitivity} --max-seqs 10000 -e 0.1 -a --threads 16
$mmseqs align ${query_db} ${target_db_index} ${result_db} ${result_db_realign} --db-load-mode 2 -e 10 --max-accept 100000 --alt-ali 10 -a --threads 16
$mmseqs filterresult ${query_db} ${target_db_index} ${result_db_realign} ${result_db_realign_filter} --db-load-mode 2 --qid 0 --qsc 0.8 --diff 0 --max-seq-id 1.0 --filter-min-enable 100 --threads 16
$mmseqs result2msa ${query_db} ${target_db_index} ${result_db_realign_filter} ${a3m_db} --msa-format-mode 6 --db-load-mode 2 --filter-msa 1 --filter-min-enable 1000 --diff 3000 --qid 0.0,0.2,0.4,0.6,0.8,1.0 --qsc 0 --max-seq-id 0.95 --threads 16
$mmseqs rmdb ${result_db_realign_filter}
$mmseqs rmdb ${result_db}
$mmseqs rmdb ${result_db_realign}
#=========================================
time_end=$(date +%s)
time_take=$(( time_end - time_start ))
echo "[Info] MMseqs2 path is ${mmseqs} ."
echo "[Info] Time taken to execute commands is ${time_take} seconds."
参数网格搜索的多进程脚本:
#!/usr/bin/env python
# -- coding: utf-8 --
"""
Copyright (c) 2022. All rights reserved.
Created by C. L. Wang on 2023/7/31
"""
import os
import subprocess
import sys
from multiprocessing import Pool
from time import time
from tqdm import tqdm
p = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
if p not in sys.path:
sys.path.append(p)
from myutils.project_utils import mkdir_if_not_exist, traverse_dir_files, time_elapsed, create_empty_file
from root_dir import ROOT_DIR, DATA_DIR
def process(params):
"""
单进程
"""
sh_file, fasta_path, output_path, db_path, i_idx, s_idx, out_name = params
# 输出文件
s_time = time()
res = subprocess.check_output([
"bash", sh_file,
fasta_path,
output_path,
db_path,
str(i_idx),
str(s_idx),
out_name, # 用于缓存临时文件
], shell=False)
# print(f"[Info] res: \n{res.decode()}")
print(f"[Info] time: {time_elapsed(s_time, time())}, output_path: {output_path}")
def main():
"""
参数网格搜索的脚本
"""
output_dir = os.path.join(DATA_DIR, "results")
fasta_dir = os.path.join(ROOT_DIR, "data", "fasta")
mkdir_if_not_exist(output_dir)
print(f"[Info] 输出文件夹: {output_dir}")
print(f"[Info] 输入 fasta 文件夹: {fasta_dir}")
path_list = traverse_dir_files(fasta_dir, "fasta")
print(f"[Info] fasta 数量: {len(path_list)}")
bfd_db = "msa_databases/af2_msa_mmseqs/bfd_mmseqs/bfd_db"
uniref30_db = "msa_databases/af2_msa_mmseqs/uniref30_mmseqs/uniref30_db"
sh_file = os.path.join(ROOT_DIR, "search.sh")
params_list = []
# 遍历次数 4x2x3x8 = 192
for fasta_path in path_list:
base_name = os.path.basename(fasta_path).split(".")[0]
# print(f"[Info] fasta_path: {fasta_path}")
output_fasta_dir = os.path.join(output_dir, base_name)
mkdir_if_not_exist(output_fasta_dir)
max_i, max_s = 3, 8
for db in ["bfd", "uniref30"]:
if db == "bfd": # 数据库 设置
db_path = bfd_db
elif db == "uniref30":
db_path = uniref30_db
else:
raise Exception(f"db name error: {db}")
for i_idx in range(1, max_i+1):
for s_idx in range(1, max_s+1):
out_name = f"{base_name}-{db}-i{i_idx}s{s_idx}"
output_path = os.path.join(output_fasta_dir, f"{out_name}.a3m")
create_empty_file(output_path)
params_list.append((sh_file, fasta_path, output_path, db_path, i_idx, s_idx, out_name))
print(f"[Info] 推理次数: {len(params_list)}")
# 多进程
n_proc = min(len(params_list), 20)
pool = Pool(processes=n_proc)
list(tqdm(pool.imap(process, params_list), desc="[Info] run"))
pool.close()
pool.join()
# 单进程测试
# for params in tqdm(params_list, desc="[Info] run"):
# process(params)
print("[Info] 全部处理完成!")
if __name__ == '__main__':
main()
其他
PyCharm 其他版本下载
下载地址:PyCharm - 2022.3
参考
- CSDN - python执行shell脚本的几种方法
- GitHub - user guide error and unknown output format
- GitHub - MMseqs