使用Python和NLTK进行NLP分析的高级指南

在本文中，将利用数据集来比较和分析自然语言。

本文涵盖的基本构建块是：

WordNet和同义词集
相似度比较
树和树岸
命名实体识别

WordNet和同义词集

WordNet是NLTK中的大型词汇数据库语料库。WordNet维护与名词，动词，形容词，副词，同义词，反义词等相关的单词的认知同义词（通常称为同义词集）。

WordNet是一个非常有用的文本分析工具。根据许多许可（从开源到商业），它可用于多种语言（中文，英语，日语，俄语，西班牙语等）。第一个WordNet是由普林斯顿大学在类似MIT的许可下为英语创建的。

一个单词通常根据其含义和词性与多个同义词集相关联。每个同义词集通常提供以下属性：还有其他一些属性，您可以在中的nltk/corpus/reader/wordnet.py源文件中找到它们<your python install>/Lib/site-packages。

某些代码可能有助于解决这个问题。

这个辅助函数：

def synset_info(synset):
    print("Name", synset.name())
    print("POS:", synset.pos())
    print("Definition:", synset.definition())
    print("Examples:", synset.examples())
    print("Lemmas:", synset.lemmas())
    print("Antonyms:", [lemma.antonyms() for lemma in synset.lemmas() if len(lemma.antonyms()) > 0])
    print("Hypernyms:", synset.hypernyms())
    print("Instance Hypernyms:", synset.instance_hypernyms())
    print("Part Holonyms:", synset.part_holonyms())
    print("Part Meronyms:", synset.part_meronyms())
    print()

synsets = wordnet.synsets('code')

如下所示

5 synsets:
Name code.n.01
POS: n
Definition: a set of rules or principles or laws (especially written ones)
Examples: []
Lemmas: [Lemma('code.n.01.code'), Lemma('code.n.01.codification')]
Antonyms: []
Hypernyms: [Synset('written_communication.n.01')]
Instance Hpernyms: []
Part Holonyms: []
Part Meronyms: []

...

Name code.n.03
POS: n
Definition: (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions
Examples: []
Lemmas: [Lemma('code.n.03.code'), Lemma('code.n.03.computer_code')]
Antonyms: []
Hypernyms: [Synset('coding_system.n.01')]
Instance Hpernyms: []
Part Holonyms: []
Part Meronyms: []

...

Name code.v.02
POS: v
Definition: convert ordinary language into code
Examples: ['We should encode the message for security reasons']
Lemmas: [Lemma('code.v.02.code'), Lemma('code.v.02.encipher'), Lemma('code.v.02.cipher'), Lemma('code.v.02.cypher'), Lemma('code.v.02.encrypt'), Lemma('code.v.02.inscribe'), Lemma('code.v.02.write_in_code')]
Antonyms: []
Hypernyms: [Synset('encode.v.01')]
Instance Hpernyms: []
Part Holonyms: []
Part Meronyms: []

同义词集和引理遵循可以可视化的树结构：

def hypernyms(synset):
    return synset.hypernyms()

synsets = wordnet.synsets('soccer')
for synset in synsets:
    print(synset.name() + " tree:")
    pprint(synset.tree(rel=hypernyms))
    print()

code.n.01 tree:
[Synset('code.n.01'),
 [Synset('written_communication.n.01'),
   ...

code.n.02 tree:
[Synset('code.n.02'),
 [Synset('coding_system.n.01'),
   ...

code.n.03 tree:
[Synset('code.n.03'),
   ...

code.v.01 tree:
[Synset('code.v.01'),
 [Synset('tag.v.01'),
   ...

code.v.02 tree:
[Synset('code.v.02'),
 [Synset('encode.v.01'),
   ...

WordNet不能涵盖所有单词及其信息（今天大约有170,000个英语单词，而最新版本的WordNet则大约有155,000个单词），但这是一个很好的起点。在学习了此构建基块的概念之后，如果发现它不足以满足您的需求，则可以迁移到另一个。或者，您可以构建自己的WordNet！

自己尝试

使用Python库，从开放源代码下载Wikipedia的页面，并列出所有单词的同义词集和引理。

相似度比较

相似度比较是一个标识两个文本之间相似度的构件。它在搜索引擎，聊天机器人等中具有许多应用程序。

例如，“足球”和“足球”这两个词是否相关？

syn1 = wordnet.synsets('football')
syn2 = wordnet.synsets('soccer')

# A word may have multiple synsets, so need to compare each synset of word1 with synset of word2
for s1 in syn1:
    for s2 in syn2:
        print("Path similarity of: ")
        print(s1, '(', s1.pos(), ')', '[', s1.definition(), ']')
        print(s2, '(', s2.pos(), ')', '[', s2.definition(), ']')
        print("   is", s1.path_similarity(s2))
        print()

Path similarity of:
Synset('football.n.01') ( n ) [ any of various games played with a ball (round or oval) in which two teams try to kick or carry or propel the ball into each other's goal ]
Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
   is 0.5

Path similarity of:
Synset('football.n.02') ( n ) [ the inflated oblong ball used in playing American football ]
Synset('soccer.n.01') ( n ) [ a football game in which two teams of 11 players try to kick or head a ball into the opponents' goal ]
   is 0.05

单词的最高路径相似性得分是0.5，表示它们密切相关。

那么“代码”和“错误”呢？这些词在计算机科学中的相似度得分是：

Path similarity of:
Synset('code.n.01') ( n ) [ a set of rules or principles or laws (especially written ones) ]
Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
   is 0.1111111111111111
...
Path similarity of:
Synset('code.n.02') ( n ) [ a coding system used for transmitting messages requiring brevity or secrecy ]
Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
   is 0.09090909090909091
...
Path similarity of:
Synset('code.n.03') ( n ) [ (computer science) the symbolic arrangement of data or instructions in a computer program or the set of such instructions ]
Synset('bug.n.02') ( n ) [ a fault or defect in a computer program, system, or machine ]
   is 0.09090909090909091

这些是最高的相似性评分，表明它们是相关的。

NLTK提供了多个相似性评分器，例如：

路径相似度
lch_similarity
wup_similarity
res_similarity
jcn_similarity
lin_similarity

树和树岸

使用NLTK，您可以树形形式表示文本的结构，以帮助进行文本分析。

这是一个例子：

预处理并带有词性（POS）标记的简单文本：

import nltk

text = "I love open source"
# Tokenize to words
words = nltk.tokenize.word_tokenize(text)
# POS tag the words
words_tagged = nltk.pos_tag(words)

您必须定义语法以将文本转换为树形结构。本示例使用基于Penn Treebank标签的简单语法。

# A simple grammar to create tree
grammar = "NP: {<JJ><NN>}"

接下来，使用语法创建树：

# Create tree
parser = nltk.RegexpParser(grammar)
tree = parser.parse(words_tagged)
pprint(tree)

这将产生：

Tree('S', [('I', 'PRP'), ('love', 'VBP'), Tree('NP', [('open', 'JJ'), ('source', 'NN')])])

您可以通过图形更好地看到它。

tree.draw()

这种结构有助于正确解释文本的含义。例如，在此文本中标识主题：

subject_tags = ["NN", "NNS", "NP", "NNP", "NNPS", "PRP", "PRP$"]
def subject(sentence_tree):
    for tagged_word in sentence_tree:
        # A crude logic for this case -  first word with these tags is considered subject
        if tagged_word[1] in subject_tags:
            return tagged_word[0]

print("Subject:", subject(tree))

它显示“ I”是主题：

Subject: I

这是适用于大型应用程序的基本文本分析构建块。例如，当用户说“从1月1日起为我的妈妈Jane预订从伦敦飞往纽约的航班”时，使用此代码块的聊天机器人可以将请求解释为：

动作：书
什么：飞行
旅行者：简
来自：伦敦
到纽约
日期：1月1日（明年）

树库是指带有预标记树的语料库。开源，有条件的免费使用和商业树库可用于多种语言。英文最常用的是Penn Treebank，摘自《华尔街日报》，其子集包含在NLTK中。使用树库的一些方法：

words = nltk.corpus.treebank.words()
print(len(words), "words:")
print(words)

tagged_sents = nltk.corpus.treebank.tagged_sents()
print(len(tagged_sents), "sentences:")
print(tagged_sents)

100676 words:['Pierre', 'Vinken', ',', '61', 'years', 'old', ',', ...]

3914 sentences:[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ...]

See tags in a sentence:

sent0 = tagged_sents[0]
pprint(sent0)
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'),...

Create a grammar to convert this to a tree:

grammar = '''
    Subject: {<NNP><NNP>}
    SubjectInfo: {<CD><NNS><JJ>}
    Action: {<MD><VB>}
    Object: {<DT><NN>}
    Stopwords: {<IN><DT>}
    ObjectInfo: {<JJ><NN>}
    When: {<NNP><CD>}
'''
parser = nltk.RegexpParser(grammar)
tree = parser.parse(sent0)
print(tree)

(S
  (Subject Pierre/NNP Vinken/NNP)
  ,/,
  (SubjectInfo 61/CD years/NNS old/JJ)
  ,/,
  (Action will/MD join/VB)
  (Object the/DT board/NN)
  as/IN
  a/DT
  (ObjectInfo nonexecutive/JJ director/NN)
  (Subject Nov./NNP)
  29/CD
  ./.)

See it graphically: