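1 Problem:
I read the PDBBind v2019 dataset and tried to convert the mol2 file of every ligand into its SMILES string. More than 1,000 of them ran into problems.
The main problem is the message 'warning - O.co2 with non C.2 or S.o2 neighbor'.
2 Cause: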
Phosphate group - warning O.co2 with non C.2 or S.o2 neighbor
[Rdkit-discuss] Phosphate containing mol2 files
Since the mol2 format is a bit of an inconsistent mess, where different toolkits/packages use different dialects of the format (or different meanings for the atom types), we chose to support the dialect generated by corina.
In short, the mol2 format comes in more than one dialect, and RDKit supports only one of them: the dialect generated by Corina.
from rdkit import Chem
Chem.MolFromMol2File()
Documentation:
Docstring:
MolFromMol2File( (str)molFileName [, (bool)sanitize=True [, (bool)removeHs=True [, (bool)cleanupSubstructures=True]]]) -> Mol :
Construct a molecule from a Tripos Mol2 file.
NOTE:
The parser expects the atom-typing scheme used by Corina.
Atom types from Tripos' dbtranslate are less supported.
Other atom typing schemes are unlikely to work.
ARGUMENTS:
- fileName: name of the file to read
- sanitize: (optional) toggles sanitization of the molecule.
Defaults to true.
- removeHs: (optional) toggles removing hydrogens from the molecule.
This only make sense when sanitization is done.
Defaults to true.
- cleanupSubstructures: (optional) toggles standardizing some
substructures found in mol2 files.
Defaults to true.
RETURNS:
a Mol object, None on failure.
C++ signature :
class RDKit::ROMol * __ptr64 MolFromMol2File(char const * __ptr64 [,bool=True [,bool=True [,bool=True]]]) Type: function
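The ligands that trigger this warning can be found without RDKit by looking at the atom types directly. Below is a minimal sketch, assuming the standard Tripos record layout (atom type in the sixth column of ATOM records, the two bonded atom ids in the second and third columns of BOND records) and one molecule per file. The function names and the two toy molecules are made up for illustration, and the check only mirrors the wording of the warning; it is not RDKit's actual implementation.

```python
def read_mol2_atoms_bonds(lines):
    """Collect atom types and bonded neighbors from mol2 text lines."""
    atom_types = {}  # atom id -> SYBYL atom type, e.g. "O.co2"
    neighbors = {}   # atom id -> list of bonded atom ids
    section = None
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("@<TRIPOS>"):
            section = line[len("@<TRIPOS>"):]
            continue
        fields = line.split()
        if section == "ATOM" and len(fields) >= 6:
            atom_id = int(fields[0])
            atom_types[atom_id] = fields[5]
            neighbors.setdefault(atom_id, [])
        elif section == "BOND" and len(fields) >= 4:
            a, b = int(fields[1]), int(fields[2])
            neighbors.setdefault(a, []).append(b)
            neighbors.setdefault(b, []).append(a)
    return atom_types, neighbors


def suspicious_oco2(lines):
    """Return ids of O.co2 atoms bonded to anything other than C.2 or S.o2."""
    atom_types, neighbors = read_mol2_atoms_bonds(lines)
    allowed = {"C.2", "S.o2"}
    return [
        atom_id
        for atom_id, atom_type in atom_types.items()
        if atom_type == "O.co2"
        and any(atom_types.get(n) not in allowed for n in neighbors[atom_id])
    ]


# Hand-written toy fragments (hydrogens omitted), not taken from PDBBind.
PHOSPHATE = """@<TRIPOS>ATOM
1 P1 0.0 0.0 0.0 P.3 1 LIG 0.0
2 O1 1.5 0.0 0.0 O.co2 1 LIG 0.0
3 O2 -0.5 1.4 0.0 O.co2 1 LIG 0.0
4 O3 -0.5 -0.7 1.2 O.co2 1 LIG 0.0
5 O4 -0.5 -0.7 -1.2 O.3 1 LIG 0.0
6 C1 -1.9 -0.7 -1.2 C.3 1 LIG 0.0
@<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 2
4 1 5 1
5 5 6 1
"""

ACETATE = """@<TRIPOS>ATOM
1 C1 0.0 0.0 0.0 C.3 1 LIG 0.0
2 C2 1.5 0.0 0.0 C.2 1 LIG 0.0
3 O1 2.1 1.1 0.0 O.co2 1 LIG 0.0
4 O2 2.1 -1.1 0.0 O.co2 1 LIG 0.0
@<TRIPOS>BOND
1 1 2 1
2 2 3 ar
3 2 4 ar
"""

print(suspicious_oco2(PHOSPHATE.splitlines()))  # O.co2 atoms on phosphorus
print(suspicious_oco2(ACETATE.splitlines()))    # carboxylate, nothing flagged
```

For a real ligand, the same function can be fed `open(path).read().splitlines()`. A non-empty result means the file uses O.co2 in a way the Corina dialect does not, as in the phosphate case discussed in the links above.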
3 Solution
After searching for a while, I simply switched to Open Babel to read the files.
from openbabel import pybel
pybel.readfile()
Documentation:
Required parameters:
format - see the informats variable for a list of available
input formats
filename
Optional parameters:
opt - a dictionary of format-specific options
For format options with no parameters, specify the
value as None.
You can access the first molecule in a file using the next() method
of the iterator (or the next() keyword in Python 3):
mol = readfile("smi", "myfile.smi").next() # Python 2
mol = next(readfile("smi", "myfile.smi")) # Python 3
You can make a list of the molecules in a file using:
mols = list(readfile("smi", "myfile.smi"))
You can iterate over the molecules in a file as shown in the
following code snippet:
>>> atomtotal = 0
>>> for mol in readfile("sdf", "head.sdf"):
...     atomtotal += len(mol.atoms)
...
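Applied to the problem above, reading a single ligand with Open Babel and extracting its SMILES might look like the sketch below. The helper names are hypothetical. The sketch relies on `mol.write("smi")` returning a record of the form SMILES, whitespace, title, and it treats a record that starts with whitespace as an empty SMILES, which is an assumption about how an empty molecule is written.

```python
def first_smiles_field(record):
    """Return the SMILES part of a 'SMILES<whitespace>title' record, or None."""
    # A record that is empty or starts with whitespace has no SMILES field.
    if not record or record[0].isspace():
        return None
    return record.split()[0]


def mol2_to_smiles(mol2_path):
    """Return the SMILES of the first molecule in a mol2 file, or None."""
    # Imported lazily so first_smiles_field works without Open Babel installed.
    from openbabel import pybel

    try:
        mol = next(pybel.readfile("mol2", mol2_path))
    except (StopIteration, IOError):
        # Empty or unreadable file
        return None
    return first_smiles_field(mol.write("smi"))


print(first_smiles_field("CCO\tethanol\n"))
```

Calling `mol2_to_smiles` on a ligand file then returns either a SMILES string or `None`, which makes the unreadable files easy to count.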
4 Related code:
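import numpy as np
from openbabel import pybel
from tqdm import tqdm  # a bare `import tqdm` would make tqdm(chunks) fail

path = './v2019-other-PL/'

def process_chunk(chunk):
    # Fill the 'Smiles' column for every row of this chunk
    for i, row in chunk.iterrows():
        file_name = path + row["id"] + "/" + row["id"] + "_ligand.sdf"
        try:
            for mol in pybel.readfile('sdf', file_name):
                # str(mol) is the SMILES followed by the title; keep only the SMILES
                smiles = str(mol).split()[0]
                chunk.at[i, 'Smiles'] = smiles
                print(row["id"], ":", smiles)
        except Exception:
            # Skip ligands that cannot be read or converted
            pass
    return chunk

# df_pro is a pandas DataFrame that was processed earlier
df_imputation = df_pro.copy()
chunks = np.array_split(df_imputation, 100)
out_smiles = []
for chunk in tqdm(chunks):
    out_chunks = process_chunk(chunk)
    out_smiles.append(out_chunks)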
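The loop above silently skips every ligand that fails, so the number of failures is never known. A small sketch of one way to record them is shown below. `collect_smiles`, `fake_convert` and the ids are hypothetical and only stand in for the real conversion; in practice `convert` would be a function that builds the file path from the id and calls pybel.

```python
def collect_smiles(ids, convert):
    """Apply convert(id) to each id; return (id -> SMILES, list of failed ids)."""
    smiles, failed = {}, []
    for pdb_id in ids:
        try:
            result = convert(pdb_id)
        except Exception:
            # Treat any conversion error as a failure for this id
            result = None
        if result:
            smiles[pdb_id] = result
        else:
            failed.append(pdb_id)
    return smiles, failed


# Stand-in converter: one good entry, one empty result, one missing id.
known = {"1abc": "CCO", "2xyz": ""}

def fake_convert(pdb_id):
    return known[pdb_id]  # raises KeyError for ids that are not in `known`

smiles, failed = collect_smiles(["1abc", "2xyz", "3bad"], fake_convert)
print(smiles)
print(failed)
```

Keeping the failed ids in a list makes it possible to inspect those ligands afterwards instead of losing them.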
References:
Reading various small molecules with RDKit
A Python script for batch-converting .sdf files to SMILES in a structured data table