欢迎关注我的CSDN:https://spike.blog.csdn.net/
本文地址:https://spike.blog.csdn.net/article/details/132427617
结构模版 (Template) 是一种已知的蛋白质结构,可以作为 AlphaFold2 蛋白质结构预测的参考,AlphaFold2 可以从多个数据库中搜索和选择最合适的模板,也可以使用自定义的模板。AlphaFold2 使用了一种基于深度学习的方法,利用模板和序列对齐信息来预测蛋白质的三维结构,还可以预测多聚体蛋白质的结构,这是一种由多个亚基组成的复杂蛋白质。
Monomer 模版搜索的结果文件是 pdb_hits.hhr
,而 Multimer 是以 sto 格式结尾,搜索算法不同导致的。
测试 Case 是 T1104-D1_A117
,来源于 CASP15。模版搜索结果的处理逻辑,位于 alphafold/data/pipeline.py
中:
templates_result
就是模版特征,是 AF2 的三大特征 (Sequence | MSA | Template)之一。
即:
# ...
# 加载模版文件
if os.path.isfile(pdb_hits_out_path): # avoid to search templates
with open(pdb_hits_out_path, "r") as f:
pdb_templates_result = f.read()
logging.info("[CL] use saved template. %s", pdb_hits_out_path)
# ...
# 获取pdb命中结果
pdb_template_hits = self.template_searcher.get_template_hits(
output_string=pdb_templates_result, input_sequence=input_sequence)
# ...
# 解析模版结果
templates_result = self.template_featurizer.get_templates(
query_sequence=input_sequence,
hits=pdb_template_hits)
# ...
日志:
[CL] use saved template. mydata/T1104-D1_A117/msas/pdb_hits.hhr
1. get_template_hits() — 解析 hhr 文件
获取模版在 PDB 库的命中结果,主要是解析 pdb_hits.hhr
文件。逻辑位于 alphafold/data/tools/hhsearch.py
,即:
def get_template_hits(self,
output_string: str,
input_sequence: str) -> Sequence[parsers.TemplateHit]:
"""Gets parsed template hits from the raw string output by the tool."""
del input_sequence # Used by hmmseach but not needed for hhsearch.
return parsers.parse_hhr(output_string)
调用解析 hhr 的函数 parse_hhr()
,逻辑位于alphafold/data/parsers.py
,即
def parse_hhr(hhr_string: str) -> Sequence[TemplateHit]:
"""Parses the content of an entire HHR file."""
lines = hhr_string.splitlines()
# Each .hhr file starts with a results table, then has a sequence of hit
# "paragraphs", each paragraph starting with a line 'No <hit number>'. We
# iterate through each paragraph to parse each hit.
block_starts = [i for i, line in enumerate(lines) if line.startswith('No ')]
hits = []
if block_starts:
block_starts.append(len(lines)) # Add the end of the final block.
for i in range(len(block_starts) - 1):
hits.append(_parse_hhr_hit(lines[block_starts[i]:block_starts[i + 1]]))
return hits
插入日志逻辑,即:
from absl import logging
logging.set_verbosity(logging.INFO)
for hit in hits:
logging.info(f"[CL] hit: {hit}")
日志如下:
[CL] hit: TemplateHit(index=5, name='2KW0_A CcmH protein; oxidoreductase, cytochrome c maturation; NMR {Escherichia coli}', aligned_cols=38, sum_probs=30.8, query='AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEE', hit_sequence='DLRQKVYELMQEGKSKKEIVDYMVARYGNFVTYDPPLT', indices_query=[8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45], indices_hit=[43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80])
其中,pdb_hits.hhr
如下:
Query A
Match_columns 117
No_of_seqs 22 out of 29
Neff 2.94667
Searched_HMMs 80799
Date Tue May 9 10:08:54 2023
Command /root/miniconda3/envs/alphafold/bin/hhsearch -i /tmp/tmp7qe78wbc/query.a3m -o /tmp/tmp7qe78wbc/output.hhr -maxseq 1000000 -d /nfs_baoding/chenlong/af2_data_dir/pdb70/pdb70
No Hit Prob E-value P-value Score SS Cols Query HMM Template HMM
1 2HL7_A Cytochrome C-type bioge 31.3 41 0.00051 24.1 0.0 34 9-42 47-80 (84)
2 3D3M_A Eukaryotic translation 26.6 56 0.00069 24.7 0.0 33 9-41 3-35 (168)
3 3JVN_A Acetyltransferase (E.C. 25.8 59 0.00073 21.0 0.0 31 70-100 94-124 (166)
4 3OAO_A uncharacterized protein 24.7 64 0.00079 23.7 0.0 43 11-53 56-98 (147)
5 2KW0_A CcmH protein; oxidoredu 23.4 70 0.00087 23.3 0.0 38 9-46 44-81 (90)
6 3F0A_A N-ACETYLTRANSFERASE; N- 22.9 73 0.00091 20.4 0.0 39 2-40 11-53 (162)
7 5LXU_A Transcription factor LU 22.4 76 0.00094 20.3 0.0 26 11-36 29-54 (57)
8 2DXQ_A Acetyltransferase; acet 22.0 78 0.00097 19.6 0.0 25 75-99 92-116 (150)
9 2DXQ_B Acetyltransferase; acet 22.0 78 0.00097 19.6 0.0 25 75-99 92-116 (150)
10 5UN0_3 Proteasome Activator; A 19.9 92 0.0011 25.3 0.0 24 78-101 88-111 (251)
No 1
>2HL7_A Cytochrome C-type biogenesis protein CcmH; Three-helices bundle, OXIDOREDUCTASE; HET: PG4; 1.7A {Pseudomonas aeruginosa}
Probab=31.28 E-value=41 Score=24.06 Aligned_cols=34 Identities=9% Similarity=0.188 Sum_probs=27.6 Template_Neff=6.200
Q A 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISS 42 (117)
Q Consensus 9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs~ 42 (117)
..-.++.+|...|-|++-+.+|+.+.|++.=+..
T Consensus 47 ~~R~~I~~~l~~G~s~~eI~~~~v~~YG~~IL~~ 80 (84)
T 2HL7_A 47 DLRKQIYGQLQQGKSDGEIVDYMVARYGDFVRYK 80 (84)
T ss_dssp HHHHHHHHHHHHTCCHHHHHHHHHHHHTTTCEEC
T ss_pred HHHHHHHHHHHCCCCHHHHHHHHHHHHCccceeC
Confidence 3445678899999999999999999998765443
No 2
>3D3M_A Eukaryotic translation initiation factor 4; HEAT repeat domain, Structural Genomics; 1.9A {Homo sapiens}
Probab=26.65 E-value=56 Score=24.68 Aligned_cols=33 Identities=12% Similarity=0.157 Sum_probs=27.2 Template_Neff=8.500
Q A 9 AVAKGLEEMYANGVTEDNFKNYVKNNFAQQEIS 41 (117)
Q Consensus 9 ~VA~~LEkMF~nGVse~Nf~~Yv~~Nfs~~EIs 41 (117)
.|-.+|++++..|-+.+.+.+|+.+|.+++...
T Consensus 3 ~~~~~L~~~l~~~~~~~~i~~wi~~~v~~~~~~ 35 (168)
T 3D3M_A 3 KLEKELLKQIKLDPSPQTIYKWIKDNISPKLHV 35 (168)
T ss_pred HHHHHHHHHHhhCCCHHHHHHHHHHhCCHHHcC
Confidence 355689999999999999999999998876543
# ...
2. get_templates() — 提取模版特征
核心逻辑位于 alphafold/data/templates.py
中,get_templates()
循环处理 hit
,调用 _process_single_hit()
,即:
def get_templates(
self,
query_sequence: str,
hits: Sequence[parsers.TemplateHit]) -> TemplateSearchResult:
"""Computes the templates for given query sequence (more details above)."""
# ...
for hit in sorted(hits, key=lambda x: x.sum_probs, reverse=True):
# We got all the templates we wanted, stop processing hits.
if num_hits >= self._max_hits:
break
result = _process_single_hit(
query_sequence=query_sequence,
hit=hit,
mmcif_dir=self._mmcif_dir,
max_template_date=self._max_template_date,
release_dates=self._release_dates,
obsolete_pdbs=self._obsolete_pdbs,
strict_error_check=self._strict_error_check,
kalign_binary_path=self._kalign_binary_path)
#...
核心逻辑位于 _process_single_hit()
中,生成单个模版特征,具体在 _extract_template_features()
函数中,生成特征,即:
# ...
mapping = _build_query_to_hit_index_mapping(
hit.query, hit.hit_sequence, hit.indices_hit, hit.indices_query,
query_sequence)
# ...
template_sequence = hit.hit_sequence.replace('-', '')
# ...
cif_path = os.path.join(mmcif_dir, hit_pdb_code + '.cif')
# ...
parsing_result = mmcif_parsing.parse(
file_id=hit_pdb_code, mmcif_string=cif_string)
# ...
# 核心逻辑
features, realign_warning = _extract_template_features(
mmcif_object=parsing_result.mmcif_object,
pdb_id=hit_pdb_code,
mapping=mapping,
template_sequence=template_sequence,
query_sequence=query_sequence,
template_chain_id=hit_chain_id,
kalign_binary_path=kalign_binary_path)
# ...
插入日志逻辑,即:
from absl import logging
logging.set_verbosity(logging.INFO)
logging.info(f"[CL] hit_pdb_code: {hit_pdb_code}, hit_chain_id: {hit_chain_id}, cif_path: {cif_path}, query_sequence: {query_sequence}")
日志如下:
[CL] hit_pdb_code: 2kw0, hit_chain_id: A, cif_path: af2_data_dir/pdb_mmcif/mmcif_files/2kw0.cif, query_sequence: QLEDSEVEAVAKGLEEMYANGVTEDNFKNYVKNNFAQQEISSVEEELNVNISDSCVANKIKDEFFAMISISAIVKAAQKKAWKELAVTVLRFAKANGLKTNAIIVAGQLALWAVQCG