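0. Overview:
Foldseek is a tool developed by Martin Steinegger (author of MMseqs2 and Linclust) at Seoul National University, South Korea, for rapidly searching large protein structure databases for structurally similar proteins. It can compute the structural similarity between two proteins, perform protein structure alignment, and, combined with MMseqs2 and Linclust, cluster proteins by structure.
The goal of this post is to use Foldseek to convert a protein's PDB structure into a 3Di sequence written in the 3Di alphabet, and to obtain the protein's 3Di embedding matrix at the same time.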
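1. Download and install Foldseek:
Following the installation instructions on GitHub (https://github.com/steineggerlab/foldseek), first determine the architecture of your Linux machine, then run the corresponding download and install command.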
# Linux AVX2 build (check using: cat /proc/cpuinfo | grep avx2)
wget https://mmseqs.com/foldseek/foldseek-linux-avx2.tar.gz; tar xvzf foldseek-linux-avx2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux SSE2 build (check using: cat /proc/cpuinfo | grep sse2)
wget https://mmseqs.com/foldseek/foldseek-linux-sse2.tar.gz; tar xvzf foldseek-linux-sse2.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Linux ARM64 build
wget https://mmseqs.com/foldseek/foldseek-linux-arm64.tar.gz; tar xvzf foldseek-linux-arm64.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# MacOS
wget https://mmseqs.com/foldseek/foldseek-osx-universal.tar.gz; tar xvzf foldseek-osx-universal.tar.gz; export PATH=$(pwd)/foldseek/bin/:$PATH
# Conda installer (Linux and macOS)
conda install -c conda-forge -c bioconda foldseek
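The choice between the prebuilt archives above can also be scripted. The following is a minimal sketch, not part of Foldseek: the function name `pick_build` is hypothetical, and it only mirrors the `grep avx2` / `grep sse2` checks from the comments above by looking for those flags in the text of /proc/cpuinfo. The archive names are the ones listed in the install commands.

```python
BASE_URL = "https://mmseqs.com/foldseek"


def pick_build(cpu_flags: str, machine: str, system: str = "Linux") -> str:
    """Return the name of the prebuilt archive matching the platform.

    cpu_flags: text of /proc/cpuinfo (may be empty on non-Linux systems)
    machine:   value of platform.machine(), e.g. "x86_64" or "aarch64"
    system:    value of platform.system(), e.g. "Linux" or "Darwin"
    """
    if system == "Darwin":
        return "foldseek-osx-universal.tar.gz"
    if machine in ("aarch64", "arm64"):
        return "foldseek-linux-arm64.tar.gz"
    flags = set(cpu_flags.split())
    # Prefer AVX2 when available, otherwise fall back to SSE2.
    if "avx2" in flags:
        return "foldseek-linux-avx2.tar.gz"
    if "sse2" in flags:
        return "foldseek-linux-sse2.tar.gz"
    raise RuntimeError("no matching prebuilt Foldseek archive for this CPU")


def download_url(cpu_flags: str, machine: str, system: str = "Linux") -> str:
    """Build the full download URL for the selected archive."""
    return f"{BASE_URL}/{pick_build(cpu_flags, machine, system)}"
```

On a Linux machine, one way to call it is `download_url(open("/proc/cpuinfo").read(), platform.machine(), platform.system())` after `import platform`; the result can then be passed to `wget` as in the commands above.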
2. Convert PDB to 3Di with Foldseek
From the directory that contains the foldseek binary, run:
Command: ./foldseek structureto3didescriptor --help
usage: foldseek structureto3didescriptor <i:PDB|mmCIF[.gz]> ... <i:PDB|mmCIF[.gz]> <o:3didescriptor> [options]
By Martin Steinegger <martin.steinegger@snu.ac.kr>
options: misc:
--mask-bfactor-threshold FLOAT mask residues for seeding if b-factor < thr [0,100] [0.000]
--file-include STR Include file names based on this regex [.*]
--file-exclude STR Exclude file names based on this regex [^$]
common:
--threads INT Number of CPU-cores used (all by default) [40]
-v INT Verbosity level: 0: quiet, 1: +errors, 2: +warnings, 3: +info [3]
expert:
--chain-name-mode INT Add chain to name:
0: auto
1: always add
[0]
--write-mapping INT write _mapping file containing mapping from internal id to taxonomic identifier [0]
--coord-store-mode INT Coordinate storage mode:
1: C-alpha as float
2: C-alpha as difference (uint16_t) [2]
--write-lookup INT write .lookup file containing mapping from internal id, fasta id and file number [1]
--tar-include STR Include file names based on this regex [.*]
--tar-exclude STR Exclude file names based on this regex [^$]
examples:
Convert PDB/mmCIF/tar[.gz] files to a db
references:
- van Kempen, M., Kim, S.S., Tumescheit, C., Mirdita, M., Lee, J., Gilchrist, C.L.M., Söding, J., and Steinegger, M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
As the help text above shows, a single protein PDB file can be converted to 3Di with the command:
./foldseek structureto3didescriptor prot.pdb res_prot.3di --threads 1
(This converts prot.pdb to res_prot.3di using a single thread.)
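When many structures need converting, the same command can be driven from Python. The sketch below is only a thin wrapper around the command line shown above; the function names `build_3di_command` and `run_3di` are hypothetical, and it assumes the `foldseek` executable is reachable at the path you pass in. Accepting several input files follows the usage line in the help text, which lists more than one input before the output.

```python
import subprocess
from typing import List, Sequence


def build_3di_command(
    inputs: Sequence[str],
    output: str,
    threads: int = 1,
    foldseek: str = "foldseek",
) -> List[str]:
    """Assemble the argument list for `foldseek structureto3didescriptor`."""
    if not inputs:
        raise ValueError("at least one input structure is required")
    if threads < 1:
        raise ValueError("threads must be a positive integer")
    return [
        foldseek,
        "structureto3didescriptor",
        *inputs,
        output,
        "--threads",
        str(threads),
    ]


def run_3di(
    inputs: Sequence[str],
    output: str,
    threads: int = 1,
    foldseek: str = "foldseek",
) -> None:
    """Run the conversion; raises CalledProcessError if foldseek fails."""
    cmd = build_3di_command(inputs, output, threads=threads, foldseek=foldseek)
    subprocess.run(cmd, check=True)
```

For example, `run_3di(["prot.pdb"], "res_prot.3di", threads=1, foldseek="./foldseek")` issues the same command as the one above. Passing the arguments as a list avoids shell quoting problems with unusual file names.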
Example output:
Converting the human protein structure A1IGU5.pdb to A1IGU5.3di gives the following (partial) result:
3. Extract the 3Di tokens and the 3Di embedding from the 3Di output
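import numpy as np


def deal3DiRes(threeDifile):
    """Read the first record of a .3di file; return (3Di sequence, n x 10 matrix)."""
    with open(threeDifile) as inF:
        for line in inF:
            fields = line.strip().split("\t")
            # 3Di tokens: second-to-last tab-separated column
            token_3di = fields[-2].strip()
            # 3Di embedding: last column, a comma-separated list of floats
            matrix_3di = np.array(fields[-1].strip().split(","), dtype=float)
            # Each residue is represented by a vector of length 10, so the
            # flat list in the last column reshapes into an n x 10 matrix.
            matrix_3di_reshape = matrix_3di.reshape(-1, 10)
            # Only the first record in the file is used.
            return token_3di, matrix_3di_reshape
    raise ValueError(f"no records found in {threeDifile}")


if __name__ == "__main__":
    res = deal3DiRes(threeDifile="../VirusHumanProt3DiFiles/Human3Di/A1IGU5.3di")
    print(res[0])  # 3Di sequence
    print(res[1])  # 3Di matrix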
Processing A1IGU5.3di gives the following output:
DDDDDDDDDPDPPPPVVVVVVVLLVVLLVQLVVLLVVVLVVLVVLLVLLCCVVPQLLVLVVVDDPVLSCLLCPVSVLVSVLSVVLSVQLVVLVVPSVCNLLSNLVSLVVCLVSLLVRLLRLLLSLVVNVVSLVVQVVVVVSVVSQQVSQCVSPVVCPPVRSVSSSCSSVVVLVCPLVSLVSSLVSDDCPDPSNVSSVVSSVSSVVSNVSSVLSSLLSVLLVVFLPPDPDDPVVVVVPDDPVVVVLVVQLVVLVVCCVVVVDPADDDPLVVVLVVLLVVLLVVLVVQLVVLVVVLVVLVVVLVDQPLPDDPVPPDAPVPLVSVLSVCCSVPLSVVLSVLCCVQQNVLSVVLNSVSSSVVRLVVVLSSLSSLQVVQVVCCVVPVDDDPVSVVSNVSNVSSVVSSSVVSVVSSVVSVVSVVVSVVSVVVSVVVSVVVSVVSVVVSLVPHPCSPPDPVRVVVVVVVVVVVVVVVVVVVVVVVLVPDDFPDLDDDDVPCPVQVVVVCVVANLQFKKFQRAFDDDDDDQDDGDHGGQIWGFPACADPVRHNQWTWIDSSPDIGIDGPVRIDRRDDDPPVVNVCVVVPDDDDDDDDDDDDDDDDDDDDDPPFKKFFCDWDDDDDPQADTHHHRAIKDFPACAPPVRHNQWTFIDGPNDTHIDGSVRMDTDDDDDPDDDDDDD
[[ 2.629e-316 1.156e-316 2.629e-316 ... 1.482e-323 -1.661e+001
2.872e+013]
[ 7.838e-001 6.043e-001 7.838e-001 ... 3.854e+000 1.000e+000
6.931e-001]
[ 6.043e-001 1.280e-001 6.043e-001 ... 3.703e+000 1.000e+000
6.931e-001]
...
[ 3.958e-001 3.127e-001 -1.000e+000 ... 3.815e+000 -1.000e+000
-6.931e-001]
[ 3.945e-001 3.958e-001 -1.000e+000 ... 3.853e+000 -1.000e+000
-6.931e-001]
[ 0.000e+000 0.000e+000 0.000e+000 ... 0.000e+000 0.000e+000
0.000e+000]]
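The function in section 3 stops after the first line of the file. If a .3di file holds several records (for example, one per chain or per input structure), a generator that walks every line may be more convenient. This is a sketch under assumptions rather than a description of the official format: it assumes each non-empty line is one record, that the first tab-separated column is the record name, and that the last two columns are laid out as in section 3. The name `iter_3di_records` is hypothetical.

```python
from typing import Iterator, TextIO, Tuple

import numpy as np


def iter_3di_records(
    handle: TextIO, width: int = 10
) -> Iterator[Tuple[str, str, np.ndarray]]:
    """Yield (name, 3Di sequence, n x width matrix) for every record.

    Assumed layout per line (tab-separated): the name in the first column,
    the 3Di sequence in the second-to-last column, and the flattened
    comma-separated descriptor values in the last column.
    """
    for line_no, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            raise ValueError(f"line {line_no}: expected at least 3 columns")
        name = fields[0].strip()
        tokens = fields[-2].strip()
        values = np.array(fields[-1].strip().split(","), dtype=float)
        if values.size % width != 0:
            raise ValueError(
                f"line {line_no}: {values.size} values is not a multiple of {width}"
            )
        yield name, tokens, values.reshape(-1, width)
```

It can be used as `with open(path) as fh: for name, tokens, matrix in iter_3di_records(fh): ...`. If the layout assumption holds, `matrix.shape[0]` should equal `len(tokens)`, which makes a cheap sanity check before using the matrix downstream.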
References:
[1]. van Kempen M, Kim SS, Tumescheit C, Mirdita M, Lee J, Gilchrist CLM, Söding J, and Steinegger M. Fast and accurate protein structure search with Foldseek. Nature Biotechnology, doi:10.1038/s41587-023-01773-0 (2023)
[2]. Barrio-Hernandez I, Yeo J, Jänes J, Mirdita M, Gilchrist CLM, Wein T, Varadi M, Velankar S, Beltrao P, and Steinegger M. Clustering predicted structures at the scale of the known protein universe. Nature, doi:10.1038/s41586-023-06510-w (2023)
[3]. https://github.com/steineggerlab/foldseek