同源搜索,多序列比对等都是常用的方式,但是有很多的软件可以实现这些同源搜索和多序列比对,但是不同的软件输出的文件格式却是不完全一致,有熟悉的FASTA
格式的,也有A2M, A3M,stockholm
等格式。
详细介绍:
https://github.com/soedinglab/hh-suite/wiki#multiple-sequence-alignment-formats
A3M格式文件(.a3m)示例:
- 每个序列都以 > 开头的行开始,并包含序列的标识信息。
- 在序列标识行之后,是与该序列相关的比对信息,通常使用字母来表示氨基酸或核酸。‘-’表示缺失,小写字母表示插入。
Stockholm格式文件(.sto)示例:
import dataclasses
from typing import Sequence, Tuple
import string
import collections
# Sequence 表示序列类型,内部的 Sequence[int] 表示整数序列。
# DeletionMatrix 表示一个由整数组成的二维数组。
DeletionMatrix = Sequence[Sequence[int]]
### 1. 定义Msa类
# Python中,dataclass 是一个装饰器(Decorator),用于创建称为数据类(data class)的类。
# dataclass 装饰器自动生成一些特殊方法,如 __init__、__repr__、__eq__ 等,
# 减少了编写这些方法的样板代码。
@dataclasses.dataclass(frozen=True)
class Msa:
"""Class representing a parsed MSA file."""
## 初始化参数
sequences: Sequence[str]
deletion_matrix: DeletionMatrix
descriptions: Sequence[str]
# __post_init__ 是Python数据类(data class)中的特殊方法,
# 用于在创建数据类的实例之后进行进一步的初始化操作
def __post_init__(self):
if not (len(self.sequences) ==
len(self.deletion_matrix) ==
len(self.descriptions)):
raise ValueError(
'All fields for an MSA must have the same length. '
f'Got {len(self.sequences)} sequences, '
f'{len(self.deletion_matrix)} rows in the deletion matrix and '
f'{len(self.descriptions)} descriptions.')
def __len__(self):
return len(self.sequences)
def truncate(self, max_seqs: int):
return Msa(sequences=self.sequences[:max_seqs],
deletion_matrix=self.deletion_matrix[:max_seqs],
descriptions=self.descriptions[:max_seqs])
m_seq = ["AAALLL","AT-LAL","S-ALLI"] # 多序列比对后的数据
m_del_matrix = [[0,0,0,0,0,0],[0,0,0,0,0,0],[0,0,0,0,0,0]]
m_descriptions = ["seq1","seq2","seq3"]
# 实例化
test_msa = Msa(m_seq, m_del_matrix, m_descriptions)
print(test_msa)
print(len(test_msa))
# 去除msa第三条序列
print(test_msa.truncate(2))
### 2. 定义函数,解析fasta格式字符串
def parse_fasta(fasta_string: str) -> Tuple[Sequence[str], Sequence[str]]:
"""Parses FASTA string and returns list of strings with amino-acid sequences.
Arguments:
fasta_string: The string contents of a FASTA file.
Returns:
A tuple of two lists:
* A list of sequences.
* A list of sequence descriptions taken from the comment lines. In the
same order as the sequences.
"""
sequences = []
descriptions = []
index = -1
for line in fasta_string.splitlines():
line = line.strip()
if line.startswith('>'):
index += 1
descriptions.append(line[1:]) # Remove the '>' at the beginning.
sequences.append('')
continue
elif not line:
continue # Skip blank lines.
sequences[index] += line
return sequences, descriptions
with open("test_aln.a3m") as f:
a3m_string = f.read()
sequences, description = parse_fasta(a3m_string)
print(sequences)
print(description)
## 多序列比对a3m格式:
## 1. 每个序列都以 > 开头的行开始,并包含序列的标识信息。
## 2.在序列标识行之后,是与该序列相关的比对信息,通常使用字母来表示氨基酸或核酸。
## ‘-’表示缺失,小写字母表示插入。
### 3.定义函数,解析a3m格式的msa字符串,生成Msa实例,该函数调用parse_fasta函数
def parse_a3m(a3m_string: str) -> Msa:
"""Parses sequences and deletion matrix from a3m format alignment.
Args:
a3m_string: The string contents of a a3m file. The first sequence in the
file should be the query sequence.
Returns:
A tuple of:
* A list of sequences that have been aligned to the query. These
might contain duplicates.
* The deletion matrix for the alignment as a list of lists. The element
at `deletion_matrix[i][j]` is the number of residues deleted from
the aligned sequence i at residue position j.
* A list of descriptions, one per sequence, from the a3m file.
"""
sequences, descriptions = parse_fasta(a3m_string)
deletion_matrix = []
for msa_sequence in sequences:
deletion_vec = []
deletion_count = 0
for j in msa_sequence:
if j.islower():
deletion_count += 1
else:
deletion_vec.append(deletion_count)
deletion_count = 0
deletion_matrix.append(deletion_vec)
# Make the MSA matrix out of aligned (deletion-free) sequences.
# string.ascii_lowercase, string模块提供的字符串常量,包含了所有小写字母的 ASCII 字符
# str.maketrans 是 Python 字符串方法,用于创建一个字符映射表(translation table),
# ''换成''并删除string.ascii_lowercase
deletion_table = str.maketrans('', '', string.ascii_lowercase)
# str.translate 使用映射表执行字符转换(删除小写字母)
aligned_sequences = [s.translate(deletion_table) for s in sequences]
return Msa(sequences=aligned_sequences,
deletion_matrix=deletion_matrix,
descriptions=descriptions)
with open("test_aln.a3m") as f:
a3m_string = f.read()
msa1 = parse_a3m(a3m_string)
print(msa1)
### 4.定义函数, 解析stockholm格式的msa字符串,生成Msa实例
def parse_stockholm(stockholm_string: str) -> Msa:
"""Parses sequences and deletion matrix from stockholm format alignment.
Args:
stockholm_string: The string contents of a stockholm file. The first
sequence in the file should be the query sequence.
Returns:
A tuple of:
* A list of sequences that have been aligned to the query. These
might contain duplicates.
* The deletion matrix for the alignment as a list of lists. The element
at `deletion_matrix[i][j]` is the number of residues deleted from
the aligned sequence i at residue position j.
* The names of the targets matched, including the jackhmmer subsequence
suffix.
"""
## 有序字典,保持多序列比对中的序列顺序
name_to_sequence = collections.OrderedDict()
for line in stockholm_string.splitlines():
line = line.strip()
# 去除空行和注释行
if not line or line.startswith(('#', '//')):
continue
name, sequence = line.split()
if name not in name_to_sequence:
name_to_sequence[name] = ''
name_to_sequence[name] += sequence
msa = []
deletion_matrix = []
query = ''
keep_columns = []
for seq_index, sequence in enumerate(name_to_sequence.values()):
## 第一行为query序列
if seq_index == 0:
# Gather the columns with gaps from the query
query = sequence
keep_columns = [i for i, res in enumerate(query) if res != '-']
# Remove the columns with gaps in the query from all sequences.
aligned_sequence = ''.join([sequence[c] for c in keep_columns])
msa.append(aligned_sequence)
# Count the number of deletions w.r.t. query.
deletion_vec = []
deletion_count = 0
# query序列相对于每一个同源序列,氨基酸位置的缺失情况,累加连续缺失
for seq_res, query_res in zip(sequence, query):
if seq_res != '-' or query_res != '-':
if query_res == '-':
deletion_count += 1
else:
deletion_vec.append(deletion_count)
deletion_count = 0
deletion_matrix.append(deletion_vec)
return Msa(sequences=msa,
deletion_matrix=deletion_matrix,
descriptions=list(name_to_sequence.keys()))
with open("test_aln.sto") as f:
stockholm_string = f.read()
print(stockholm_string)
msa2 = parse_stockholm(stockholm_string)
print(msa2)
## 注:parse_stockholm 和 parse_a3m 函数生成Msa对象中,
## deletion_matrix中在查询序列deletion位置填上缺失的个数,
## 下一个氨基酸位置的0跳过,所以总长度相等
## 如函数输入msa中第一条序列(query序列)为:“A--CE-H”, 则函数输出的第一条序列为:“ACEH”,
## deletion_matrix的第一个元素为:[0,2,0,1]