C2W1.LAB.Vocabulary Creation+Candidates from String Edits

news2025/2/23 0:07:42

Lecture notes: C2W1.Auto-correct

Table of contents

Vocabulary Creation
- Imports and Data
- Preprocessing
- Create Vocabulary
  - Method 1: Using a set
  - Method 2: Dictionary with word counts
  - Visualization
- Ungraded Exercise
Candidates from String Edits
- Imports and Data
- Splits
- Delete Edit
- Ungraded Exercise


Vocabulary Creation

Create a vocabulary from a small corpus.

Imports and Data

Import the packages:

# imports
import re # regular expression library; for tokenization of words
from collections import Counter # collections library; counter: dict subclass for counting hashable objects
import matplotlib.pyplot as plt # for data visualization

The corpus is just a single line of text:

# the tiny corpus of text ! 
text = 'red pink pink blue blue yellow ORANGE BLUE BLUE PINK' 
print(text)
print('string length : ',len(text))

Output:
red pink pink blue blue yellow ORANGE BLUE BLUE PINK
string length : 52

Preprocessing

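Because the corpus contains no special characters, preprocessing is simple: convert the text to lowercase, then tokenize it into words.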

# convert all letters to lower case
text_lowercase = text.lower()
print(text_lowercase)
print('string length : ',len(text_lowercase))

Output:
red pink pink blue blue yellow orange blue blue pink
string length : 52

# some regex to tokenize the string to words and return them in a list
words = re.findall(r'\w+', text_lowercase)
print(words)
print('count : ',len(words))

Output:
['red', 'pink', 'pink', 'blue', 'blue', 'yellow', 'orange', 'blue', 'blue', 'pink']
count : 10
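
The corpus here has no punctuation, so it may not be obvious what the regex does with it. The sketch below applies the same pattern to a made-up sentence (the sample string is an assumption for illustration, not part of the lab) to show that punctuation and whitespace are treated as separators.

```python
import re

# Hypothetical sample text with punctuation; not from the lab corpus
sample = 'Red, pink... and BLUE!'

# \w+ matches runs of letters, digits and underscores, so punctuation
# and whitespace act as separators and are dropped from the result
tokens = re.findall(r'\w+', sample.lower())
print(tokens)  # ['red', 'pink', 'and', 'blue']
```

One consequence of this pattern is that an apostrophe also splits a word, so a contraction such as "don't" would come out as two tokens.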

Create Vocabulary

Method 1: Using a set

# create vocab
vocab = set(words)
print(vocab)
print('count : ',len(vocab))

Output (a set is unordered, so the order of the elements may vary between runs):
{'red', 'pink', 'orange', 'blue', 'yellow'}
count : 5

Method 2: Dictionary with word counts

Using dict.get:

# create vocab including word count
counts_a = dict()
for w in words:
    counts_a[w] = counts_a.get(w,0)+1
print(counts_a)
print('count : ',len(counts_a))

Output:
{'red': 1, 'pink': 3, 'blue': 4, 'yellow': 1, 'orange': 1}
count : 5

Using Counter:

# create vocab including word count using collections.Counter
counts_b = dict()
counts_b = Counter(words)
print(counts_b)
print('count : ',len(counts_b))

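The counts are the same as above, but a Counter prints its entries ordered by count, highest first:
Counter({'blue': 4, 'pink': 3, 'red': 1, 'yellow': 1, 'orange': 1})
count : 5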
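
Beyond counting, a Counter offers a couple of conveniences that a plain dict does not. The short sketch below rebuilds the counts from the tokenized word list (repeated here so the block runs on its own) and shows two of them; the variable names are chosen for illustration.

```python
from collections import Counter

# Same tokens as the lab's word list, repeated so this block is self-contained
words = ['red', 'pink', 'pink', 'blue', 'blue',
         'yellow', 'orange', 'blue', 'blue', 'pink']
counts = Counter(words)

# most_common(n) returns the n highest counts as (word, count) pairs
top_two = counts.most_common(2)
print(top_two)  # [('blue', 4), ('pink', 3)]

# A Counter returns 0 for a missing key instead of raising KeyError
print(counts['green'])  # 0
```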

Visualization

# barchart of sorted word counts
d = {'blue': counts_b['blue'], 'pink': counts_b['pink'], 'red': counts_b['red'], 'yellow': counts_b['yellow'], 'orange': counts_b['orange']}
plt.bar(range(len(d)), list(d.values()), align='center', color=d.keys())
_ = plt.xticks(range(len(d)), list(d.keys()))

Output:
[Figure: bar chart of the word counts for blue, pink, red, yellow and orange]

Ungraded Exercise

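The counts_b object returned by collections.Counter above is printed in order of word frequency.
Modify the text of the tiny corpus so that a new color appears between pink and red in counts_b.

Do you need to rerun all the cells, or only specific ones?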

# modify the text variable
text = 'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK'

# rerun the code below to update counts_b
text_lowercase = text.lower()
words = re.findall(r'\w+', text_lowercase)
counts_b = Counter(words)
print(counts_b)
print('count : ', len(counts_b))
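
Only the steps that depend on text need to be rerun: lowercasing, tokenizing and counting. One way to make that dependency explicit is to wrap the three steps in a function. The sketch below assumes a helper named count_words, which is not part of the lab.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase, tokenize and count the words in text (hypothetical helper)."""
    return Counter(re.findall(r'\w+', text.lower()))

counts_new = count_words(
    'red pink green pink green blue blue yellow ORANGE BLUE BLUE PINK')

# Words ordered by count, highest first; green (2) lands between pink (3) and red (1)
ordered = [w for w, _ in counts_new.most_common()]
print(ordered)
```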

Candidates from String Edits

Imports and Data

No imports are needed, and the data is a single word:

# data
word = 'dearz' # 🦌

Splits

Find all the ways to split a word into two parts.

# splits with a loop
splits_a = []
for i in range(len(word)+1):
    splits_a.append([word[:i],word[i:]])

for i in splits_a:
    print(i)

Output:
['', 'dearz']
['d', 'earz']
['de', 'arz']
['dea', 'rz']
['dear', 'z']
['dearz', '']

The same splits can be produced with a list comprehension:

# same splits, done using a list comprehension
splits_b = [(word[:i], word[i:]) for i in range(len(word) + 1)]

for i in splits_b:
    print(i)

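The splits are the same as above, except that each one is now a tuple and prints with parentheses, e.g. ('', 'dearz').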

Delete Edit

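Delete the first letter of the right-hand part of each split in the splits list.
In effect, this deletes each possible letter of the original word, one at a time.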

# deletes with a loop
splits = splits_a
deletes = []

print('word : ', word)
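# iterate over the splits and check that the right-hand part is not empty
for L,R in splits:
    if R: # if the right-hand part is not empty, print the result of deleting its first character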
        print(L + R[1:], ' <-- delete ', R[0])

Output:
word : dearz
earz <-- delete d
darz <-- delete e
derz <-- delete a
deaz <-- delete r
dear <-- delete z

The breakdown below shows how a single delete works:

# breaking it down
print('word : ', word)
one_split = splits[0]
print('first item from the splits list : ', one_split)
L = one_split[0]
R = one_split[1]
print('L : ', L)
print('R : ', R)
print('*** now implicit delete by excluding the leading letter ***')
print('L + R[1:] : ',L + R[1:], ' <-- delete ', R[0])

Output:
word : dearz
first item from the splits list : ['', 'dearz']
L :
R : dearz
*** now implicit delete by excluding the leading letter ***
L + R[1:] : earz <-- delete d

Of course, a list comprehension does the same thing more concisely:

# deletes with a list comprehension
splits = splits_a
deletes = [L + R[1:] for L, R in splits if R]

print(deletes)
print('*** which is the same as ***')
for i in deletes:
    print(i)

Output:
['earz', 'darz', 'derz', 'deaz', 'dear']
*** which is the same as ***
earz
darz
derz
deaz
dear
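
The split and delete steps above can be combined into one reusable function. This is a minimal sketch; the name delete_letter and the second test word are assumptions for illustration rather than something the lab defines here.

```python
def delete_letter(word):
    """Return every string obtained by deleting exactly one letter from word."""
    # All (left, right) splits, including the empty left and empty right parts
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    # Dropping the first character of each non-empty right part deletes one letter
    return [L + R[1:] for L, R in splits if R]

dearz_deletes = delete_letter('dearz')
print(dearz_deletes)  # ['earz', 'darz', 'derz', 'deaz', 'dear']

# A word with a repeated letter yields duplicate results
deer_deletes = delete_letter('deer')
print(deer_deletes)  # ['eer', 'der', 'der', 'dee']
```

Because repeated letters produce duplicates, wrapping the result in set() is a reasonable choice when only the distinct candidates matter.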

Ungraded Exercise

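The steps above produce deletes, a list of candidate strings created by the delete edit.
The next step is to filter this list for candidate words that appear in the vocabulary.
Given the example vocabulary below, can you think of a way to create the list of candidate words?
['dean', 'deer', 'dear', 'fries', 'and', 'coke']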

vocab = ['dean','deer','dear','fries','and','coke']
edits = list(deletes)

print('vocab : ', vocab)
print('edits : ', edits)

candidates=[]

### START CODE HERE ###
#candidates = ??  # hint: 'set.intersection'
#candidates = list(set(edits) & set(vocab))
candidates = list(set(edits).intersection(set(vocab)))
### END CODE HERE ###

print('candidate words : ', candidates)
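
For reference, the whole exercise can be run as one self-contained sketch. It recomputes the deletes directly by index instead of going through the splits list, and sorts the result so the output order is deterministic; both choices are made here for illustration and are not required by the lab.

```python
word = 'dearz'
vocab = ['dean', 'deer', 'dear', 'fries', 'and', 'coke']

# Delete the letter at each position i in turn
deletes = [word[:i] + word[i + 1:] for i in range(len(word))]

# Keep only the edits that are real vocabulary words
candidates = sorted(set(deletes) & set(vocab))
print(candidates)  # ['dear']
```

Only 'dear' is one delete away from 'dearz'. The other similar-looking vocabulary words, 'dean' and 'deer', cannot be reached by a single delete.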