通过训练NLP制作一个自己的简易输入法

news2024/11/6 7:36:00

最近开始研究NLP,然后根据手写CV UP主的视频,写了一个N Gram的NLP模型,算是该领域里的hello world吧。然后我又添加了几行代码实现了一个非常简易的输入法

项目说明:

数据集可以自创,导入txt文件即可;

单词联想功能:输入前两个单词,预测(联想)第三个单词【也就是输入法中的提升功能】


本人有关NLP学习记录可以参考下面的博客【会持续更新】

NLP(自然语言处理)学习记录_爱吃肉的鹏的博客-CSDN博客

目录

数据集

N Gram网络结构

模型训练 

输入法测试


数据集

我这里的数据集是个非常非常简易的,仅供学习用的。数据集格式为txt格式。支持中文。

我这里是一段chatgpt帮我写的一段话,我用来做数据集。

Introduction:
Artificial Intelligence (AI) is revolutionizing the world as we know it. With its ability to mimic human intelligence and perform tasks that were once deemed impossible for machines, AI has become a driving force behind innovation across various industries. In this article, we will delve into the fascinating world of AI, exploring its applications, challenges, and the potential it holds for the future.
Section 1: Understanding AI.
Artificial Intelligence refers to the development of computer systems that can perform tasks that typically require human intelligence. These tasks include speech recognition, decision-making, problem-solving, and even visual perception. AI systems are designed to learn from data, adapt to new information, and improve their performance over time. Machine Learning, a subset of AI, enables computers to learn from experience and make predictions or take actions without explicit programming.
Section 2: Applications of AI.
AI has found extensive applications in various fields. In healthcare, AI is being used to assist in medical diagnosis, drug discovery, and personalized treatment plans. In finance, AI algorithms analyze vast amounts of data to detect fraud, predict market trends, and optimize investment strategies. AI-powered virtual assistants like Siri and Alexa have become ubiquitous, enhancing our daily lives with voice recognition and natural language processing capabilities.
Section 3: Challenges and Ethical Considerations.
While AI brings immense potential, it also poses significant challenges and ethical considerations. One concern is job displacement, as AI and automation may replace certain human roles. Striking a balance between technological advancement and employment opportunities becomes crucial. Additionally, AI systems must be developed with transparency and accountability to ensure they make fair and unbiased decisions. Ethical dilemmas arise when AI is used in areas like facial recognition, privacy invasion, or autonomous weapons.
Section 4: The Future of AI.
The future of AI holds boundless possibilities. Advancements in AI research, such as Deep Learning and Neural Networks, continue to push the boundaries of what machines can achieve. AI is expected to revolutionize transportation with autonomous vehicles, transform manufacturing through robotics, and enhance customer experiences with personalized recommendations. However, it is vital to address concerns like data privacy, algorithmic bias, and regulations to foster responsible AI development.
Section 5: NLP and AI Training.
Natural Language Processing (NLP), a branch of AI, focuses on the interaction between computers and human language. NLP enables machines to understand, interpret, and generate human language, facilitating applications like language translation, sentiment analysis, and chatbots. Training NLP networks requires vast amounts of text data, annotated with corresponding labels or outcomes. The collected data is used to train models using algorithms like recurrent neural networks (RNNs) or transformer models like GPT-3.
Conclusion:
Artificial Intelligence is reshaping our world in unprecedented ways, with endless opportunities and challenges. As we continue to explore the realms of AI, it is crucial to ensure responsible development, ethical considerations, and regulations that safeguard the well-being of society. With ongoing advancements, AI has the potential to augment human capabilities and revolutionize industries, opening doors to a future where machines and humans collaborate harmoniously.
def NLP_Sentence(file_path):
    # test_sentence = """When forty winters shall besiege thy brow,
    # And dig deep trenches in thy beauty's field,
    # Thy youth's proud livery so gazed on now,
    # Will be a totter'd weed of small worth held:
    # Then being asked, where all thy beauty lies,
    # Where all the treasure of thy lusty days;
    # To say, within thine own deep sunken eyes,
    # Were an all-eating shame, and thriftless praise.
    # How much more praise deserv'd thy beauty's use,
    # If thou couldst answer 'This fair child of mine
    # Shall sum my count, and make my old excuse,'
    # Proving his beauty by succession thine!
    # This were to be new made when thou art old,
    # And see thy blood warm when thou feel'st it cold.""".split()  # 按空格进行单词的划分
    # print(type(test_sentence))

    with open(file_path,'r',encoding='utf-8') as f:
        test_sentence = f.read()
    test_sentence = test_sentence.split()
    print(test_sentence)
    return test_sentence


def build_dataset(test_sentence):
    # 建立数据集
    trigram = [((test_sentence[i], test_sentence[i + 1]), test_sentence[i + 2]) for i in range(len(test_sentence) - 2)]
    vocb = set(test_sentence)  # 通过set将重复的单词去掉  这个vocb顺序是随机的
    word_to_idx = {word: i for i, word in enumerate(vocb)}  # 将每个单词编码,即用数字来表示每个单词,只有这样才能够传入nn.Embedding得到词向量。
    idx_to_word = {word_to_idx[word]: word for word in
                   word_to_idx}  # 返回的是{0: 'Will', 1: 'praise.', 2: 'by', 3: 'worth',是word_to_idx步骤的逆操作
    return word_to_idx, idx_to_word, trigram

NLP_Sentence函数是读取数据集txt文件,单词之间用空格划分。build_dataset是数据集的处理,trgram是将三个单词划分为一组,前两个单词用来训练,后一个单词作为标签。这就好比CV中,前两个单词是特征,后一个单词是Label,就是一个简单的分类任务而已。

vocb是去除重复单词。

word_to_idx:是将词转为索引,也就是编码操作。

idx_to_word:解码操作


N Gram网络结构

vocb_size:单词的长度

context_size:上下文的窗口长度

import torch.nn as nn
import torch.nn.functional as F


# 定义 N Gram模型
class NgramModel(nn.Module):
    def __init__(self, vocb_size, context_size, n_dim):
        super(NgramModel, self).__init__()
        self.n_word = vocb_size  # 已经去除了重复的单词,有97个单词
        self.context_size = context_size
        self.n_dim = n_dim
        self.embedding = nn.Embedding(self.n_word, self.n_dim)  # (97,10)
        self.linear1 = nn.Linear(self.context_size * self.n_dim, 128)  # 全连接层(20,128) 每次输入两个词有contest_size*n_dim个维度(20维度)
        self.linear2 = nn.Linear(128, self.n_word)  # (128,97)

    def forward(self, x):
        emb = self.embedding(x)
        emb = emb.view(1, -1)  # 输出的是1行 len(x) / 1 列 就是按行平铺开
        out = self.linear1(emb)
        out = F.relu(out)
        out = self.linear2(out)
        log_prob = F.log_softmax(out)  # 得出每个词的词嵌入(词属性)的概率值
        return log_prob

模型训练 

CONTEXT_SIZE是上下文窗口长度,这里默认为2,表示根据输入的前两个词预测后一个,如果测试的时候要修改,要重训练的,不然会报维度错误。

test_sentence是数据集,训练好的权重命名为myNLP.pth.

#coding utf-8
import torch
from torch import optim
import torch.nn as nn
from torch.autograd import Variable
from NLP_dataset import NLP_Sentence, build_dataset
from NLP_model import NgramModel


CONTEXT_SIZE = 2  # 表示想由前面的几个单词来预测这个单词,这里设置为2,就是说我们希望通过这个单词的前两个单词来预测这一个单词
test_sentence = NLP_Sentence('data2.txt')
word_to_idx, idx_to_word, trigram = build_dataset(test_sentence)
# train
ngrammodel = NgramModel(len(word_to_idx), CONTEXT_SIZE, 10).cuda()
criterion = nn.NLLLoss()
optimizer = optim.SGD(ngrammodel.parameters(), lr=1e-3)
for epoch in range(300):
     print('epoch: {}'.format(epoch+1))
     print('*'*10)
     running_loss = 0
     total_samples_epoch = 0
     correct_predictions_epoch = 0
     for data in trigram:
         # we use 'word' to represent the two words forward the predict word, we use 'label' to represent the predict word
         word, label = data  # word = (When,forty) label=winters
         word = Variable(torch.LongTensor([word_to_idx[e] for e in word])).cuda()
         label = Variable(torch.LongTensor([word_to_idx[label]])).cuda()
         # forward
         out = ngrammodel(word)  # word为索引形式
         loss = criterion(out, label)
         running_loss += loss.item()

         _, predicted = torch.max(out.data, 1)
         total_samples_epoch += label.size(0)
         correct_predictions_epoch += (predicted == label).sum().item()

         # backward
         optimizer.zero_grad()
         loss.backward()
         optimizer.step()
     epoch_accuracy = correct_predictions_epoch / total_samples_epoch
     print('loss: {:.6f}'.format(running_loss / len(word_to_idx)))
     print('accuracy: {:.2%}'.format(epoch_accuracy))
torch.save(ngrammodel.state_dict(), "myNLP.pth")

输入法测试

import torch
from torch.autograd import Variable
from NLP_model import NgramModel
from NLP_dataset import NLP_Sentence, build_dataset


test_sentence = NLP_Sentence('data2.txt')
word_to_idx, idx_to_word, trigram = build_dataset(test_sentence)
CONTEXT_SIZE = 2
# predict
while True:
    word = input("请输入单词,按回车\n")
    word_ = word.split()
    word_ = Variable(torch.LongTensor([word_to_idx[i] for i in word_]))
    model = NgramModel(len(word_to_idx), CONTEXT_SIZE, 10)
    pretrained = torch.load('myNLP.pth')
    model.load_state_dict(pretrained)
    model.eval()
    out = model(word_)
    _, predict_label = torch.max(out, 1)
    predict_word = idx_to_word[predict_label.item()]
    print("最大概率词语:", predict_word)
    prob, pre_labels = out.sort(descending=True)  # 从大到小排序
    pre_labels = pre_labels.squeeze(0)[:10]
    print_dict = {}
    # for idx in pre_labels:
    for i, idx in enumerate(pre_labels):
        idx = idx.item()
        try:
            print_dict[str(i)] = idx_to_word[idx]
        except:
            continue
    print(print_dict)
    idx_input = input("你可选择以下序号补充你的输出\n")
    print('{} {}'.format(word, print_dict[idx_input]))

比如输入以下内容(the word):

请输入单词,按回车
the world

 会根据你的输入自动预测你第三个词,只需要输入编号即可补全。

 最大概率词语: programming.
{'0': 'programming.', '1': 'development.', '2': 'discovery,', '3': 'across', '4': 'can', '5': 'refers', '6': 'typically', '7': 'fascinating', '8': 'language', '9': 'or'}
你可选择以下序号补充你的输出
8
the world language


以上就是一个简易办的用NLP实习的输入法【就是用来给枯燥的学习过程消遣一下哈哈~】 

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/585224.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

K8s in Action 阅读笔记——【6】Volumes: attaching disk storage to containers

K8s in Action 阅读笔记——【6】Volumes: attaching disk storage to containers 在前三章中,我们介绍了Pods以及它们与ReplicationControllers、ReplicaSets、DaemonSets、Jobs和Services等Kubernetes资源的交互。现在,我们将回到Pod内部,…

【java 基础一】 纯语法基础记录

一、基础 1.1 变量 Java 变量是程序中存储数据的容器。 在 Java 中,变量需要先声明再使用,并且必须先声明再赋值。 声明变量:声明变量时需要指定变量的类型、名称和初始值。例如,声明一个整型变量可以如下所示: in…

水处理施工方案合集

编制说明及工程简介 (一) 编制说明 本施工组织设计是依据建设单位提供的招标文件、施工图、同类工程施工资料和国家有关施工规范及验收标准进行编制的。本施工组织设计针对本工程施工中的关键点、难点及其处理措施,主要施工方法,施工组织部署&#xff…

如何理解 CRM 客户关系管理系统?

如何理解 CRM 客户关系管理系统? CRM的作用不止是工具,根据用户和对象的不同,CRM创造的价值也是不同的。 打个比方: 有的企业是用CRM来管理客户信息;有的企业是用CRM来获取流量的;有的企业是用CRM来做销售管理的;还…

Springboot整合OSS并实现文件上传和下载

目录 一.OSS服务器开通并创建账户 二.Springboot整合OSS 1.创建springboot项目 2.整合OSS 三.postman测试 一.OSS服务器开通并创建账户 参考阿里云OSS的使用(全程请登陆)_zhz小白的博客-CSDN博客https://blog.csdn.net/zhouhengzhe/article/details/112077301 二.Springb…

深度学习笔记之循环神经网络(八)LSTM的轻量级变体——门控循环单元(GRU)

深度学习笔记之LSTM轻量级变体——门控循环单元[GRU] 引言回顾: LSTM \text{LSTM} LSTM的前馈计算过程 LSTM \text{LSTM} LSTM的问题 GRU \text{GRU} GRU的前馈计算过程 GRU \text{GRU} GRU的优势 引言 上一节介绍了从反向传播过程的角度认识 LSTM \text{LSTM} LST…

【29】核心易中期刊推荐——计算语言学人工智能(AI)技术

🚀🚀🚀NEW!!!核心易中期刊推荐栏目来啦 ~ 📚🍀 核心期刊在国内的应用范围非常广,核心期刊发表论文是国内很多作者晋升的硬性要求,并且在国内属于顶尖论文发表,具有很高的学术价值。在中文核心目录体系中,权威代表有CSSCI、CSCD和北大核心。其中,中文期刊的数…

使用javascript-obfuscator给js文件加密

一、安装javascript-obfuscator包 npm install javascript-obfuscator -g二、默认配置直接压缩文件 javascript-obfuscator miniprogram/src/utils/utils_create_sign.js --output miniprogram/src/utils/create_sign.js三、根据配置文件压缩文件 3.1、创建mixs.json配置文…

前端阿里云OSS直传,微信小程序版本

前言: 网络上许多的文章资料,全是使用阿里云官方的SDK,ali-oss插件去做直传。可是各位素未谋面的朋友要注意,这个SDK它支持web环境使用,也就是PC端浏览器。 当项目环境切换到微信小程序,是无法使用这种方…

Power BI Embedded自动缩放容量,为公司每个月节省上万元

哎,不知道今年行情怎么就这样了,大厂一边大批毕业生,一边大量招人。难道是今年的新人便宜?就连道哥(吴翰清,阿里P10,中国顶级黑客)都从阿里离职了,当年年少不懂学计算机&…

Nodejs批量处理图片小工具:批量修改图片信息

小工具一:批量修改文件夹里面的图片名称 步骤: 1.安装nodejs。 2.根据需要修改editFileName(filePath, formatName)函数的参数,也可以不改,直接将renameFile.js和img文件夹放在同一个目录下。 3.在renameFile.js目录下开启终端…

SQL报错this is incompatible with sql_mode=only_full_group_by

一、bug记录 1.1.bug截图 1.2.sql语句 SELECT id,batch_no,if_code,channel_mch_no,bill_date,bill_type,currency,order_id, channel_order_no,channel_amount,channel_fee_amount,channel_success_at, channel_user,channel_state,org_pay_order_id,channel_refund_amoun…

曾经由盛转衰的骈文,却引领后人在文质兼美的创作之路上坚定前行

又叫骈体文,是和散文相对应的一种文体,它兴起于汉末,形成于魏晋,最盛行于南北朝,在初唐、中唐、唐末、五代、宋初时也盛极一时。古人语:两马并驾为骈,所以骈文最大的特点是用对偶的手法&#xf…

Fiddler抓包工具之fiddler设置手机端抓包

fiddler设置手机端抓包 安卓手机抓包 第一步:配置电脑和安卓的相关设置 1、手机和fiddler位于同一个局域网内;首先从fiddler处获取到ip地址和端口号: ,点击online,最后一行就是ip地址 2、路径:Tools》O…

数据库基础——10.子查询

这篇文章来讲一下数据库的子查询 目录 1. 需求分析与问题解决 1.1 实际问题 1.2 子查询的基本使用 1.3 子查询的分类 2. 单行子查询 2.1 单行比较操作符 2.2 代码示例 2.3 HAVING 中的子查询 2.4 CASE中的子查询 2.5 子查询中的空值问题 2.5 非法使用子查询​编辑…

数字IC验证高频面试问题整理(附答案)

后台有同学私信想要验证的面试题目,这不就来了~ Q1.权重约束中”:”和 /”的区别 : 操作符表示值范围内的每一个值的权重是相同的,比如[1:3]:40,表示1,2,3取到的概率为40/120; :/操作符表示权重要平均分到值范围内的…

spring security(密码编码器、授权,会话)

目录 密码编码器 授权决策 AffirmativeBased ConsensusBased UnanimousBased 授权 web授权 HttpSecurity常用方法及说明 方法授权 会话控制 会话超时 安全会话cookie 密码编码器 Spring Security为了适应多种多样的加密类型,又做了抽象,D…

虚拟机配置

配置虚拟机网络 创建虚拟机 20G 4G内存 初始化用户名和密码 zhao 123456 克隆拷贝2个虚拟机 配置内存为2G 修改主机名和固定IP hostnamectl set-hostname node1 hostnamectl set-hostname node2 vim /etc/sysconfig/network-scripts/ifcfg-ens33 systemctl stop network s…

渗透测试辅助工具箱

0x01 说明 渗透测试辅助工具箱 运行条件:jdk8 双击即可运行 反弹shell,命令生成器,自动编码,输入对应IP端口即可,实现一劳永逸,集成一些小工具,辅助渗透,提高效率 输入框说明 L…

TDengine 报错 failed to connect to server, reason: Unable to establish connection

一、前文 TDengine 入门教程——导读 二、遇到问题 taos 命令行(CLI)连接不上,进不去。 [rootiZ2ze30dygwd6yh7gu6lskZ ~]# taos Welcome to the TDengine Command Line Interface, Client Version:3.0.0.1 Copyright (c) 2022 by TDengine…