Lecture notes: C2W1.Auto-correct
Table of Contents
- Vocabulary Creation
- Imports and Data
- Preprocessing
- Create Vocabulary
- Method 1: Set
- Method 2: Dictionary with word counts
- Visualization
- Ungraded Exercise
- Candidates from String Edits
- Imports and Data
- Splits
- Delete Edit
- Ungraded Exercise
Vocabulary Creation
Create a vocabulary from a tiny corpus.
Imports and Data
Import the packages:
# imports
import re # regular expression library; for tokenization of words
from collections import Counter # collections library; counter: dict subclass for counting hashable objects
import matplotlib.pyplot as plt # for data visualization
The corpus is just a single sentence:
# the tiny corpus of text !
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK'
print(text)
print('string length : ',len(text))
Output:
red pink pink blue blue yellow ORANGE BLUE BLUE PINK
string length : 52
Preprocessing
Because the text contains no special characters, the preprocessing can stay simple:
# convert all letters to lower case
text_lowercase = text.lower()
print(text_lowercase)
print('string length : ',len(text_lowercase))
Output:
red pink pink blue blue yellow orange blue blue pink
string length : 52
# some regex to tokenize the string to words and return them in a list
words = re.findall(r'\w+', text_lowercase)
print(words)
print('count : ',len(words))
Output:
['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']
count : 10
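The tiny corpus has no punctuation, so it does not show how `\w+` treats special characters. The sketch below runs the same pattern on a hypothetical sample sentence (not part of the lab) to illustrate the behavior: punctuation is dropped, an apostrophe splits a word in two, and digits and underscores count as word characters.

```python
import re

# Hypothetical sample text (an assumption for illustration, not from the lab)
sample = "Red, pink... and BLUE! It's 2 colors_too."

# Same approach as above: lower-case first, then extract runs of word characters
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['red', 'pink', 'and', 'blue', 'it', 's', '2', 'colors_too']
```

For real text, a tokenizer that keeps contractions together may be preferable; the pattern above is only the simplest option.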
Create Vocabulary
Method 1: Set
# create vocab
vocab = set(words)
print(vocab)
print('count : ',len(vocab))
Output (a set is unordered, so the order may differ between runs):
{'red', 'pink', 'orange', 'blue', 'yellow'}
count : 5
Method 2: Dictionary with word counts
Using dict.get:
# create vocab including word count
counts_a = dict()
for w in words:
    counts_a[w] = counts_a.get(w,0)+1
print(counts_a)
print('count : ',len(counts_a))
Output:
{'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}
count : 5
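Using Counter:
# create vocab including word count using collections.Counter
counts_b = Counter(words)
print(counts_b)
print('count : ',len(counts_b))
Output: the same counts, wrapped in a Counter and printed from most to least frequent:
Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})
count : 5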
Visualization
# barchart of sorted word counts
d = {'blue': counts_b['blue'], 'pink': counts_b['pink'], 'red': counts_b['red'], 'yellow': counts_b['yellow'], 'orange': counts_b['orange']}
plt.bar(range(len(d)), list(d.values()), align='center', color=d.keys())
_ = plt.xticks(range(len(d)), list(d.keys()))
Output: [Figure: bar chart of the word counts, one bar per color, ordered from most to least frequent]
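The dictionary d above lists the colors by hand in sorted order. As a minimal sketch, the same ordering can be derived from the counts with Counter.most_common(), so the chart stays sorted when the corpus changes. The names d_sorted, labels, and heights are illustrative assumptions, not part of the lab.

```python
from collections import Counter

words = ['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']
counts_b = Counter(words)

# most_common() returns (word, count) pairs sorted from highest to lowest count
d_sorted = dict(counts_b.most_common())
labels = list(d_sorted.keys())     # x-axis tick labels (also usable as bar colors)
heights = list(d_sorted.values())  # bar heights
print(labels)
print(heights)
```

These two lists can then be passed to plt.bar(range(len(labels)), heights, color=labels) and plt.xticks(range(len(labels)), labels), which only works here because every word happens to be a valid matplotlib color name.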
Ungraded Exercise
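Note that counts_b, returned by collections.Counter above, is printed in order of word count (the Counter itself stores words in first-seen order; its printed form lists the most frequent first).
Modify the text of the tiny corpus so that a new color appears between pink and red in counts_b.
Do you need to re-run all the cells, or only specific ones?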
# modify the text variable
text = 'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK'
# re-run only the lines below to update counts_b
text_lowercase = text.lower()
words = re.findall(r'\w+', text_lowercase)
counts_b = Counter(words)
print(counts_b)
print('count : ', len(counts_b))
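To make the ordering concrete, the sketch below compares the order in which a Counter is iterated with the order in which it is ranked by count, using the modified corpus. The variable names insertion_order and by_frequency are illustrative assumptions.

```python
import re
from collections import Counter

text = 'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK'
words = re.findall(r'\w+', text.lower())
counts_b = Counter(words)

# Iterating a Counter follows first-seen (insertion) order, like a regular dict
insertion_order = list(counts_b)

# most_common() ranks the words by count, which is also how print(counts_b) displays them
by_frequency = [w for w, _ in counts_b.most_common()]

print(insertion_order)
print(by_frequency)
```

With two occurrences, green ranks after pink (3) and ahead of red (1) in the frequency ranking, even though red is seen first in the text.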
Candidates from String Edits
Imports and Data
No packages need to be imported, and the data is a single word:
# data
word = 'dearz' # 🦌
Splits
Find all the ways to split a word into two parts.
# splits with a loop
splits_a = []
for i in range(len(word)+1):
    splits_a.append([word[:i],word[i:]])
for i in splits_a:
    print(i)
Output:
['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']
The same can be done with a list comprehension:
# same splits, done using a list comprehension
splits_b = [(word[:i], word[i:]) for i in range(len(word) + 1)]
for i in splits_b:
    print(i)
Output: the same splits, printed as tuples rather than lists.
Delete Edit
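For each split in the splits list, delete the first letter of the second half (R).
Doing this for every split effectively deletes each possible letter of the original word in turn.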
# deletes with a loop
splits = splits_a
deletes = []
print('word : ', word)
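# loop over the splits and skip any whose right half is empty
for L,R in splits:
    if R: # R is non-empty, so drop its first character
        deletes.append(L + R[1:]) # keep the candidate so that deletes is actually filled
        print(L + R[1:], ' <-- delete ', R[0])
Output: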
word : dearz
earz <-- delete d
darz <-- delete e
derz <-- delete a
deaz <-- delete r
dear <-- delete z
The following breaks down how a single delete works:
# breaking it down
print('word : ', word)
one_split = splits[0]
print('first item from the splits list : ', one_split)
L = one_split[0]
R = one_split[1]
print('L : ', L)
print('R : ', R)
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])
Output:
word : dearz
first item from the splits list : ['', 'dearz']
L :
R : dearz
*** now implicit delete by excluding the leading letter ***
L + R[1:] : earz <-- delete d
A list comprehension does the same thing more concisely:
# deletes with a list comprehension
splits = splits_a
deletes = [L + R[1:] for L, R in splits if R]
print(deletes)
print('*** which is the same as ***')
for i in deletes:
    print(i)
Output:
['earz', 'darz', 'derz', 'deaz', 'dear']
*** which is the same as ***
earz
darz
derz
deaz
dear
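The split and delete steps above can be combined into one reusable helper. This is a minimal sketch; the function name delete_edits is a hypothetical choice, not something defined in the lab.

```python
def delete_edits(word):
    """Return every string obtained by deleting exactly one letter from word.

    Hypothetical helper that wraps the split and delete steps shown above.
    """
    # All (left, right) splits, including the empty left and empty right halves
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Dropping the first letter of each non-empty right half deletes one letter
    return [L + R[1:] for L, R in splits if R]

print(delete_edits('dearz'))
# ['earz', 'darz', 'derz', 'deaz', 'dear']
```

A word of length n yields n delete candidates, and a word with repeated adjacent letters yields duplicates, so wrapping the result in set() may be useful.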
Ungraded Exercise
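The operations above produce deletes, the list of candidate strings created by the delete edit.
The next step is to filter this list for candidate words that appear in a vocabulary.
Given the example vocabulary below, can you think of a way to create the list of candidate words?
['dean', 'deer', 'dear', 'fries', 'and', 'coke']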
vocab = ['dean','deer','dear','fries','and','coke']
edits = list(deletes)
print('vocab : ', vocab)
print('edits : ', edits)
candidates=[]
### START CODE HERE ###
#candidates = ?? # hint: 'set.intersection'
#candidates = list(set(edits) & set(vocab))
candidates = list(set(edits).intersection(set(vocab)))
### END CODE HERE ###
print('candidate words : ', candidates)
Note: besides splits and deletes, there are other edit types, such as insert, replace, and switch. They are not implemented here and are left for the reader to complete.
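One possible way to fill in the remaining edit types is sketched below. It follows the same split-based pattern as the delete edit. The function name edit_candidates, the letters parameter, and the choice of lowercase ASCII as the alphabet are all assumptions made for illustration and may differ from the course's graded solution.

```python
import string

def edit_candidates(word, letters=string.ascii_lowercase):
    """Return the set of strings one edit away from word (hypothetical helper)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # delete: drop the first letter of the right half
    deletes = [L + R[1:] for L, R in splits if R]
    # switch: swap the first two letters of the right half
    switches = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    # replace: substitute the first letter of the right half with a different letter
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters if c != R[0]]
    # insert: add one letter between the left and right halves
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + switches + replaces + inserts)

vocab = {'dean', 'deer', 'dear', 'fries', 'and', 'coke'}
candidates = edit_candidates('dearz') & vocab
print(candidates)
# {'dear'}
```

Only dear is a single edit away from dearz; dean and deer would each need two edits, so they do not appear among the candidates.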