- 🍨 This post is a learning log from the 🔗365天深度学习训练营 (365-day deep learning training camp)
- 🍖 Original author: K同学啊

Preface
Last week I worked through English text classification; this week is a hands-on Chinese text classification task.
1. Reading the Data
import pandas as pd

# train.csv is tab-separated with no header row: column 0 = text, column 1 = label
train_data = pd.read_csv('train.csv', sep='\t', header=None)
print(train_data.head())
This reads the CSV file containing the training data. The data is stored in train_data, and the first few rows are printed for a quick check.
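Since sep='\t' and header=None are assumptions about the file layout, a quick shape check is a useful sanity test. The two sample rows in the comments below are hypothetical illustrations, not actual dataset contents:

# Expected layout of train.csv (tab-separated, no header); illustrative rows only:
#   还有双鸭山到淮阴的汽车票吗13号的    Travel-Query
#   随便播放一首专辑阁楼里的佛里的歌    Music-Play
print(train_data.shape)         # (num_samples, 2)
print(train_data[1].nunique())  # number of distinct labels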
2. A Custom Data Iterator
def coustom_data_iter(texts, labels):
    # Yield (text, label) pairs one at a time
    for x, y in zip(texts, labels):
        yield x, y

train_iter = coustom_data_iter(train_data[0].values[:], train_data[1].values[:])
This defines a generator function, coustom_data_iter, that yields (text, label) pairs. Note that, being a generator, it can only be consumed once, which is why it is rebuilt before training below.
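A minimal sanity check of the iterator; the label string here is hypothetical and only for illustration:

# Hypothetical sample pair, just to show the iterator's behavior
it = coustom_data_iter(["还有双鸭山到淮阴的汽车票吗13号的"], ["Travel-Query"])
print(next(it))  # ('还有双鸭山到淮阴的汽车票吗13号的', 'Travel-Query')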
3. Vocabulary Construction and Text Processing
from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
import jieba

# Use jieba's lcut (returns a list of tokens) as the tokenizer
tokenizer = jieba.lcut

def yield_tokens(data_iter):
    for text, _ in data_iter:
        yield tokenizer(text)

vocab = build_vocab_from_iterator(yield_tokens(train_iter), specials=["<unk>"])
vocab.set_default_index(vocab["<unk>"])  # map out-of-vocabulary tokens to <unk>

# Pipelines needed by collate_batch and predict below
label_name = list(set(train_data[1].values[:]))
text_pipeline = lambda x: vocab(tokenizer(x))   # text  -> list of token indices
label_pipeline = lambda x: label_name.index(x)  # label -> class index
The text is tokenized with the jieba tokenizer, and the vocabulary vocab is built from the tokenized results; <unk> stands for unknown (out-of-vocabulary) words. text_pipeline maps a sentence to a list of token indices, and label_pipeline maps a label string to its index in label_name.
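A quick look at what the pipelines produce; the token split and indices in the comments are illustrative, since the actual values depend on the training corpus:

tokens = tokenizer("还有双鸭山到淮阴的汽车票吗")
print(tokens)        # e.g. ['还有', '双鸭山', '到', '淮阴', '的', '汽车票', '吗']
print(vocab(tokens)) # e.g. [14, 5738, 7, 3210, 3, 1625, 51] (illustrative indices)
print(text_pipeline("还有双鸭山到淮阴的汽车票吗") == vocab(tokens))  # True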
4. Data Loading and Batching
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def collate_batch(batch):
    label_list, text_list, offsets = [], [], [0]
    for (_text, _label) in batch:
        label_list.append(label_pipeline(_label))
        processed_text = torch.tensor(text_pipeline(_text), dtype=torch.int64)
        text_list.append(processed_text)
        offsets.append(processed_text.size(0))  # record each sample's token count
    label_list = torch.tensor(label_list, dtype=torch.int64)
    text_list = torch.cat(text_list)  # flatten the batch into one long index tensor
    offsets = torch.tensor(offsets[:-1]).cumsum(dim=0)  # start position of each sample
    return text_list.to(device), label_list.to(device), offsets.to(device)

dataloader = DataLoader(train_iter, batch_size=8, shuffle=False, collate_fn=collate_batch)
collate_batch converts a batch of (text, label) pairs into the format the model expects (one flat tensor of token indices plus per-sample offsets), and the data loader dataloader applies it to the training data.
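To see how the offsets are derived, here is a toy computation with made-up sample lengths:

import torch

lengths = [3, 2, 4]  # hypothetical token counts of three samples
offsets = torch.tensor([0] + lengths[:-1]).cumsum(dim=0)
print(offsets)  # tensor([0, 3, 5]): where each sample starts in the flat tensor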
5. Model Definition
from torch import nn

class TextClassificationModel(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_class):
        super(TextClassificationModel, self).__init__()
        # EmbeddingBag looks up and averages the embeddings of each "bag" of tokens
        self.embedding = nn.EmbeddingBag(vocab_size, embed_dim, sparse=False)
        self.fc = nn.Linear(embed_dim, num_class)
        self.init_weights()

    def init_weights(self):
        initrange = 0.5
        self.embedding.weight.data.uniform_(-initrange, initrange)
        self.fc.weight.data.uniform_(-initrange, initrange)
        self.fc.bias.data.zero_()

    def forward(self, text, offsets):
        embedded = self.embedding(text, offsets)
        return self.fc(embedded)
A simple text classification model: an nn.EmbeddingBag layer, which averages the token embeddings of each sample (so no padding is needed), followed by a fully connected nn.Linear layer.
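The later sections call model without showing its construction; the sketch below instantiates it, with embed_dim = 64 as an assumed hyperparameter:

num_class = len(label_name)  # number of target classes
vocab_size = len(vocab)
embed_dim = 64               # assumed embedding size
model = TextClassificationModel(vocab_size, embed_dim, num_class).to(device)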
6. Training and Evaluation Functions
import time

def train(dataloader):
    model.train()  # switch to training mode
    total_acc, train_loss, total_count = 0, 0, 0
    log_interval = 50
    start_time = time.time()
    for idx, (text, label, offsets) in enumerate(dataloader):
        predicted_label = model(text, offsets)
        optimizer.zero_grad()
        loss = criterion(predicted_label, label)
        loss.backward()
        # Clip gradients to a max norm of 0.1 to stabilize training
        torch.nn.utils.clip_grad_norm_(model.parameters(), 0.1)
        optimizer.step()
        total_acc += (predicted_label.argmax(1) == label).sum().item()
        train_loss += loss.item()
        total_count += label.size(0)
        if idx % log_interval == 0 and idx > 0:
            print('| epoch {:1d} | {:4d}/{:4d} batches '
                  '| train_acc {:4.3f} train_loss {:4.5f}'.format(
                      epoch, idx, len(dataloader),
                      total_acc / total_count, train_loss / total_count))
            total_acc, train_loss, total_count = 0, 0, 0
            start_time = time.time()

def evaluate(dataloader):
    model.eval()  # switch to evaluation mode
    total_acc, total_loss, total_count = 0, 0, 0
    with torch.no_grad():  # no gradients needed for evaluation
        for idx, (text, label, offsets) in enumerate(dataloader):
            predicted_label = model(text, offsets)
            loss = criterion(predicted_label, label)
            total_acc += (predicted_label.argmax(1) == label).sum().item()
            total_loss += loss.item()
            total_count += label.size(0)
    return total_acc / total_count, total_loss / total_count
This section defines the training function train and the evaluation function evaluate. The training function runs the forward and backward passes and updates the parameters (with gradient clipping), while the evaluation function measures model performance on the validation set without computing gradients.
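The accuracy bookkeeping in both functions relies on argmax over the class dimension; a toy example with made-up logits:

import torch

logits = torch.tensor([[0.2, 1.5, -0.3],  # hypothetical outputs: 2 samples, 3 classes
                       [2.0, 0.1,  0.4]])
labels = torch.tensor([1, 0])
correct = (logits.argmax(1) == labels).sum().item()
print(correct / labels.size(0))  # 1.0 -- both predictions match their label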
7. Dataset Splitting and Training
from torch.utils.data.dataset import random_split
from torchtext.data.functional import to_map_style_dataset

EPOCHS = 10
LR = 5
BATCH_SIZE = 64

criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=LR)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, 1.0, gamma=0.1)  # decay lr by 10x per step
total_accu = None

# Rebuild the iterator (the earlier generator was exhausted while building the vocab)
train_iter = coustom_data_iter(train_data[0].values[:], train_data[1].values[:])
train_dataset = to_map_style_dataset(train_iter)

# 80/20 train/validation split; the sizes must sum exactly to len(train_dataset)
num_train = int(len(train_dataset) * 0.8)
split_train_, split_valid_ = random_split(train_dataset,
                                          [num_train, len(train_dataset) - num_train])

train_dataloader = DataLoader(split_train_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)
valid_dataloader = DataLoader(split_valid_, batch_size=BATCH_SIZE,
                              shuffle=True, collate_fn=collate_batch)

for epoch in range(1, EPOCHS + 1):
    epoch_start_time = time.time()
    train(train_dataloader)
    val_acc, val_loss = evaluate(valid_dataloader)
    lr = optimizer.state_dict()['param_groups'][0]['lr']

    # Step the scheduler only when validation accuracy stops improving
    if total_accu is not None and total_accu > val_acc:
        scheduler.step()
    else:
        total_accu = val_acc

    print('-' * 69)
    print('| epoch {:1d} | time: {:4.2f}s | '
          'valid_acc {:4.3f} valid_loss {:4.3f} | lr {:4.6f}'.format(
              epoch, time.time() - epoch_start_time, val_acc, val_loss, lr))
    print('-' * 69)
The dataset is split into training and validation sets with corresponding data loaders. The model is then trained and validated, and the scheduler lowers the learning rate whenever validation accuracy fails to beat the best value seen so far.
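A small standalone check of how StepLR with gamma=0.1 decays the learning rate on each scheduler.step() call; the throwaway parameter below exists only to build an optimizer for the demo:

import torch

p = torch.nn.Parameter(torch.zeros(1))  # throwaway parameter for the demo
opt = torch.optim.SGD([p], lr=5.0)
sch = torch.optim.lr_scheduler.StepLR(opt, 1.0, gamma=0.1)
opt.step()
sch.step()
print(opt.state_dict()['param_groups'][0]['lr'])  # 0.5 -- each step divides lr by 10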
8. Model Prediction
def predict(text, text_pipeline):
    with torch.no_grad():
        text = torch.tensor(text_pipeline(text))
        # A single sample is one "bag", so its only offset is 0
        output = model(text, torch.tensor([0]))
        return output.argmax(1).item()

# "Are there still bus tickets from Shuangyashan to Huaiyin on the 13th?"
ex_text_str = "还有双鸭山到淮阴的汽车票吗13号的"

model = model.to("cpu")  # run inference on the CPU
print("Predicted class: %s" % label_name[predict(ex_text_str, text_pipeline)])
This defines a prediction function, predict, that classifies a single input text; the offsets tensor [0] tells EmbeddingBag that the whole index sequence is one sample. Finally, the example text is passed to the function and the predicted class name is printed.
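To predict several texts at once, the same flat-tensor-plus-offsets layout as collate_batch applies; the input sentences below are hypothetical:

texts = ["查询明天北京的天气", "播放周杰伦的歌"]  # hypothetical inputs
ids = [torch.tensor(text_pipeline(t), dtype=torch.int64) for t in texts]
offsets = torch.tensor([0] + [x.size(0) for x in ids[:-1]]).cumsum(dim=0)
with torch.no_grad():
    out = model(torch.cat(ids), offsets)
print([label_name[i] for i in out.argmax(1).tolist()])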
Results
Summary
This project implements an end-to-end text classification pipeline: reading the data, tokenization and vocabulary construction, data loading and batching, model definition, and training with validation, reaching a final accuracy of 89%. This week I became familiar with the workflow for processing text data, which lays a solid foundation for future study.