1 Problem:

While reading the PDBBind v2019 dataset, I tried to convert every ligand's mol2 file into its SMILES string. More than roughly 1,000 of them failed to convert.

The main problem is the warning 'warning - O.co2 with non C.2 or S.o2 neighbor'.

2 Cause:

Phosphate group - warning O.co2 with non C.2 or S.o2 neighbor
[Rdkit-discuss] Phosphate containing mol2 files

Since the mol2 format is a bit of an inconsistent mess, where different toolkits/packages use different dialects of the format (or different meanings for the atom types), we chose to support the dialect generated by corina.

In short, the mol2 format has more than one dialect, and RDKit targets one of them: the dialect generated by Corina.
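To see which atoms trigger the warning in a given file, the mol2 text can be scanned directly. The following is a minimal sketch and not part of RDKit: the function name `find_bad_oco2` and the toy phosphate fragment are hypothetical, and the parser assumes the usual Tripos record layout (`@<TRIPOS>ATOM` lines as id, name, x, y, z, type; `@<TRIPOS>BOND` lines as id, origin, target, type). It mirrors the wording of the warning by reporting every `O.co2` atom bonded to something other than `C.2` or `S.o2`; RDKit's own check may differ in detail.

```python
def find_bad_oco2(mol2_text):
    """Return (atom_id, atom_name, neighbor_type) for each suspicious O.co2 atom."""
    atom_types = {}   # atom id -> Tripos atom type
    atom_names = {}   # atom id -> atom name
    bonds = []        # (origin atom id, target atom id)
    section = None
    for line in mol2_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("@<TRIPOS>"):
            section = line[len("@<TRIPOS>"):].upper()
            continue
        fields = line.split()
        if section == "ATOM" and len(fields) >= 6:
            atom_id = int(fields[0])
            atom_names[atom_id] = fields[1]
            atom_types[atom_id] = fields[5]
        elif section == "BOND" and len(fields) >= 4:
            bonds.append((int(fields[1]), int(fields[2])))

    allowed = {"C.2", "S.o2"}  # neighbor types named in the warning text
    bad = []
    for a, b in bonds:
        # A bond is undirected, so check both ends for an O.co2 atom.
        for oxygen, neighbor in ((a, b), (b, a)):
            if atom_types.get(oxygen) == "O.co2":
                neighbor_type = atom_types.get(neighbor)
                if neighbor_type not in allowed:
                    bad.append((oxygen, atom_names[oxygen], neighbor_type))
    return bad


# Hypothetical phosphate-like fragment, used only for illustration.
EXAMPLE = """\
@<TRIPOS>MOLECULE
phosphate_fragment
4 3
SMALL
NO_CHARGES

@<TRIPOS>ATOM
1 P1 0.0 0.0 0.0 P.3
2 O1 1.5 0.0 0.0 O.co2
3 O2 -1.5 0.0 0.0 O.co2
4 O3 0.0 1.5 0.0 O.3
@<TRIPOS>BOND
1 1 2 1
2 1 3 1
3 1 4 1
"""

for atom_id, name, neighbor_type in find_bad_oco2(EXAMPLE):
    print(f"O.co2 atom {atom_id} ({name}) is bonded to {neighbor_type}")
```

On the toy fragment, both `O.co2` atoms are reported because their only neighbor is typed `P.3`, which matches the phosphate case discussed in the links above.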

from rdkit import Chem
Chem.MolFromMol2File(): documentation

Docstring:
MolFromMol2File( (str)molFileName [, (bool)sanitize=True [, (bool)removeHs=True [, (bool)cleanupSubstructures=True]]]) -> Mol :
Construct a molecule from a Tripos Mol2 file.

  NOTE:
    The parser expects the atom-typing scheme used by Corina.
    Atom types from Tripos' dbtranslate are less supported.
    Other atom typing schemes are unlikely to work.

  ARGUMENTS:
                                  
    - fileName: name of the file to read

    - sanitize: (optional) toggles sanitization of the molecule.
      Defaults to true.

    - removeHs: (optional) toggles removing hydrogens from the molecule.
      This only make sense when sanitization is done.
      Defaults to true.

    - cleanupSubstructures: (optional) toggles standardizing some 
      substructures found in mol2 files.
      Defaults to true.

  RETURNS:

    a Mol object, None on failure.



C++ signature :
    class RDKit::ROMol * __ptr64 MolFromMol2File(char const * __ptr64 [,bool=True [,bool=True [,bool=True]]])
Type:      function

3 Solution

After searching for quite a while, I simply switched to Open Babel to read the files.

from openbabel import pybel
pybel.readfile(): documentation

Required parameters:
   format - see the informats variable for a list of available
            input formats
   filename

Optional parameters:
   opt    - a dictionary of format-specific options
            For format options with no parameters, specify the
            value as None.

You can access the first molecule in a file using the next() method
of the iterator (or the next() keyword in Python 3):
    mol = readfile("smi", "myfile.smi").next() # Python 2
    mol = next(readfile("smi", "myfile.smi"))  # Python 3

You can make a list of the molecules in a file using:
    mols = list(readfile("smi", "myfile.smi"))

You can iterate over the molecules in a file as shown in the
following code snippet:
>>> atomtotal = 0
>>> for mol in readfile("sdf", "head.sdf"):
...     atomtotal += len(mol.atoms)
...
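One way to combine the two readers is to try RDKit first and fall back to Open Babel only when RDKit fails. The sketch below is an assumption about how that could be wired up, not the code used in this post: the helper names `smiles_from_rdkit`, `smiles_from_openbabel`, and `first_smiles` are hypothetical. It relies only on `Chem.MolFromMol2File` returning None on failure (as the docstring above states), on `Chem.MolToSmiles`, and on `pybel.readfile` together with `Molecule.write("smi")`.

```python
def smiles_from_rdkit(path):
    """Try RDKit's mol2 reader; return None if parsing fails."""
    from rdkit import Chem  # imported lazily so the helper stays optional
    mol = Chem.MolFromMol2File(path)
    return Chem.MolToSmiles(mol) if mol is not None else None


def smiles_from_openbabel(path):
    """Read the first molecule with Open Babel and return its SMILES."""
    from openbabel import pybel
    mol = next(pybel.readfile("mol2", path), None)
    if mol is None:
        return None
    # write("smi") returns "SMILES<tab>title"; keep only the SMILES part.
    fields = mol.write("smi").split()
    return fields[0] if fields else None


def first_smiles(path, converters=(smiles_from_rdkit, smiles_from_openbabel)):
    """Return (converter_name, smiles) from the first converter that succeeds."""
    for convert in converters:
        try:
            smiles = convert(path)
        except Exception as exc:
            # A missing library or unreadable file just moves on to the next reader.
            print(f"{convert.__name__} failed on {path}: {exc}")
            continue
        if smiles:
            return convert.__name__, smiles
    return None, None
```

Calling `first_smiles` on a ligand mol2 path returns the name of the reader that succeeded together with the SMILES, or `(None, None)` if every reader failed, which makes it easy to count how many ligands needed the fallback.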

4 Related code:

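from openbabel import pybel
from tqdm import tqdm  # import the callable, not the module
import numpy as np
path = './v2019-other-PL/'
def process_chunk(chunk):
    # Fill the 'Smiles' column for every row in this chunk.
    for i, row in chunk.iterrows():
        # This reads each complex's <id>_ligand.sdf file.
        file_name = path + row["id"] + "/" + row["id"] + "_ligand.sdf"
        try:
            for mol in pybel.readfile('sdf', file_name):
                # str(mol) is "SMILES<tab>title"; keep only the SMILES part.
                smiles = str(mol).split()[0]
                chunk.at[i, 'Smiles'] = smiles
                print(row["id"], ":", smiles)
        except Exception as err:
            # Report unreadable files instead of silently skipping them.
            print(row["id"], ": failed:", err)
    return chunk
# df_pro: a pandas DataFrame prepared earlier, with an "id" column
df_imputation = df_pro.copy()
chunks = np.array_split(df_imputation, 100)
out_smiles = []
for chunk in tqdm(chunks):
    out_chunks = process_chunk(chunk)
    out_smiles.append(out_chunks)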