Get Started with Amazon Web Services from Scratch in 5 Minutes: the NLP Text Understanding AI Service

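Introduction:

Welcome to Xiao Li Ge's (小李哥) new learning series on Amazon Web Services (AWS) cloud computing. It is aimed at developers with no background in cloud computing or AWS: with this article, you can learn a classic AWS service development architecture from scratch in about 5 minutes.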

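Each day I introduce a cutting-edge cloud development or architecture solution built on the AWS cloud platform, to help you quickly learn best practices for AWS, one of the most popular cloud platforms worldwide, and apply them in your daily work. This installment shows how to use Amazon Comprehend to train a custom entity recognition model that understands domain-specific terms, and how to analyze text in real time through the model's API endpoint. The architecture diagram for this solution is shown below: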

Background knowledge for this solution

What is Amazon Comprehend?

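Amazon Comprehend is a natural language processing (NLP) service from Amazon Web Services that helps users extract valuable insights and information from unstructured text. Using machine learning, Comprehend can automatically identify entities, key phrases, sentiment, language, and more, which helps businesses analyze customer feedback, social media content, articles, and other text data.

Comprehend has a wide range of use cases. For example, it can be used for sentiment analysis, helping businesses understand how customers feel about a product or service, or for text classification, automatically sorting documents or content into predefined categories. Comprehend can also recognize entities such as people, places, and organizations, and extract the main topics of a text, which supports information management and decision-making.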
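As a rough illustration of the capabilities described above, the sketch below runs four of Comprehend's built-in detection APIs on one piece of text. The helper name `analyze_text` is hypothetical and not part of the original walkthrough. The function expects a Boto3 Comprehend client to be passed in (for example one created with `boto3.client('comprehend')`), so no AWS call is made until you supply real credentials.

```python
def analyze_text(comprehend, text, language_code="en"):
    """Run Comprehend's built-in detectors on one text and collect the results.

    `comprehend` is assumed to be a Boto3 Comprehend client;
    `analyze_text` itself is a hypothetical helper for illustration.
    """
    # Language detection does not need a language code.
    languages = comprehend.detect_dominant_language(Text=text)["Languages"]
    # The other detectors need to know which language the text is in.
    sentiment = comprehend.detect_sentiment(Text=text, LanguageCode=language_code)
    entities = comprehend.detect_entities(Text=text, LanguageCode=language_code)
    phrases = comprehend.detect_key_phrases(Text=text, LanguageCode=language_code)
    return {
        "languages": [lang["LanguageCode"] for lang in languages],
        "sentiment": sentiment["Sentiment"],
        "entities": [(e["Text"], e["Type"]) for e in entities["Entities"]],
        "key_phrases": [p["Text"] for p in phrases["KeyPhrases"]],
    }
```

These built-in detectors only know general entity types such as people, places, and organizations, which is why the rest of this article trains a custom model for AWS-specific terms.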

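What is Amazon Comprehend's custom entity recognition feature?

Recognizing custom entities:

Custom Entity Recognition lets users train a model, based on their specific business needs, to recognize custom entities in text, such as product names, contract clauses, or industry-specific terms, so that key information is extracted more precisely.

Improving the accuracy of data analysis:

With customized entity recognition, businesses can analyze and classify text data more accurately, which improves the precision of data processing and supports deeper business insights and decisions.

Integrating with existing workflows:

Custom Entity Recognition can be integrated into existing business processes and applications without complex setup or coding, so it can be deployed quickly and start delivering business value.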

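What this solution covers

1. Use Amazon Comprehend to create a custom text understanding model that recognizes specific domain terms

2. Use the model's API endpoint to analyze text in real time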

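Step-by-step project setup

1. First, sign in to the AWS console and open the S3 service.

2. Create an S3 bucket named "databucket-us-west-2-172330051".

3. Upload the dataset used to train Amazon Comprehend.

The dataset consists of two files. The first, "documents.txt", contains the raw text used for model training.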

"Pearson Boosts Security and Productivity Using Amazon Elasticsearch Service"
"2020"
"Global educational media company Pearson needed a more efficient way to analyze and gain insights from its log data. With a number of teams in various locations using Elasticsearch\u2014the popular open-source tool for search and log analytics\u2014Pearson found that keeping track of log data and managing updates led to high operating costs. Faced with this, as well as increasingly complex security log management and analysis, the company found a solution on Amazon Web Services (AWS). Pearson quickly saw improvements by migrating from its self-managed open-source Elasticsearch architecture to Amazon Elasticsearch Service, a fully managed service that makes it easy to deploy, secure, and run Elasticsearch cost effectively at scale. Rather than spending considerable time and resources on managing the Elasticsearch clusters on its own, Pearson used the managed Amazon Elasticsearch Service as part of its initiative to modernize its products. "

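The second file, "annotations.csv", labels spans of the text file. For example, the label "AWS_SERVICE" marks a span as the name of an AWS service, which helps the model learn from the text file. Each row gives the file name, the line number, the begin and end character offsets, and the entity type.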

File,Line,Begin Offset,End Offset,Type
documents.txt,0,47,75,AWS_SERVICE
documents.txt,2,167,180,AWS_SERVICE
documents.txt,2,453,479,AWS_SERVICE
documents.txt,2,590,610,AWS_SERVICE
documents.txt,2,860,888,AWS_SERVICE
documents.txt,5,17,45,AWS_SERVICE
documents.txt,7,0,26,JOB_TITLE
documents.txt,7,31,56,JOB_TITLE
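Offsets in the annotations file are easy to get wrong, so it can be worth checking them before uploading. The following is a minimal sketch, not part of the original walkthrough: the helper name `check_annotations` is hypothetical, and it assumes that Begin Offset is inclusive and End Offset is exclusive, which is what the first row of the sample data suggests.

```python
import csv
import io


def check_annotations(document_lines, annotations_csv):
    """Return (line, type, text) for each annotation and fail on bad offsets.

    Assumes Begin Offset is inclusive and End Offset is exclusive
    (an assumption based on the first row of the sample data).
    """
    spans = []
    for row in csv.DictReader(io.StringIO(annotations_csv)):
        line_no = int(row["Line"])
        begin, end = int(row["Begin Offset"]), int(row["End Offset"])
        if line_no >= len(document_lines):
            raise ValueError(f"annotation points at missing line {line_no}")
        line = document_lines[line_no]
        if not 0 <= begin < end <= len(line):
            raise ValueError(f"offsets {begin}-{end} fall outside line {line_no}")
        # Slice out the annotated text so a human can eyeball it.
        spans.append((line_no, row["Type"], line[begin:end]))
    return spans


# Example with the first line of documents.txt (without the surrounding quotes)
lines = ["Pearson Boosts Security and Productivity Using Amazon Elasticsearch Service"]
rows = "File,Line,Begin Offset,End Offset,Type\ndocuments.txt,0,47,75,AWS_SERVICE\n"
print(check_annotations(lines, rows))
```

For the first sample row, the extracted span is "Amazon Elasticsearch Service", which matches the AWS_SERVICE label. Printing the extracted text for every row makes an off-by-one offset visible immediately.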

4. Next, open the Amazon Comprehend service.

5. Click "Launch" to start using Comprehend.

6. Open the "Custom entity recognition" page on the left, then click "Create New Model" on the right to create a custom model.

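7. Name the model "aws-entity-recognizer", set the version to 1, and add the custom entity types defined in our annotations file: "AWS_SERVICE" and "JOB_TITLE".

8. Select the dataset type "Using annotations and training docs", then add the annotations file and the text file that we uploaded to S3.

9. Add an IAM role for the model so that it can access the data in the S3 bucket.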

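10. Click Create to start training. Wait until training finishes and the model enters the Trained state, then click "aws-entity-recognizer" to open the model.

11. Click Performance to view the model's evaluation scores. All of the metrics are 100, which meets our needs.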

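12. Next, create an API endpoint for the model so that it can be called through the API.

13. Name the endpoint "aws-entity-recognizer-endpoint", select custom entity recognition as the model type, and choose the model we just trained, "aws-entity-recognizer". Click Create.

14. Click "Real-time analysis" in the left menu, select Custom as the analysis model type, choose the model we just created as the analysis model, enter the text to analyze, and click Analyze.

15. The result shows that our custom Comprehend model correctly identified "Amazon HealthLake" as an AWS service with a confidence score of 0.99, so the analysis is highly accurate.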
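The real-time analysis in the last two steps can also be done from code. The sketch below is one possible way to do it, under stated assumptions: `detect_custom_entities` and `min_score` are hypothetical names, `comprehend` is a Boto3 Comprehend client that you pass in, and `endpoint_arn` is the ARN of the endpoint created in step 13. When an endpoint ARN is supplied, `detect_entities` uses the custom model behind that endpoint instead of the built-in one.

```python
def detect_custom_entities(comprehend, endpoint_arn, text, min_score=0.5):
    """Call a custom entity recognizer endpoint and keep confident matches.

    `comprehend` is assumed to be a Boto3 Comprehend client and
    `endpoint_arn` the ARN of an endpoint that is already in service.
    """
    # With EndpointArn set, Comprehend uses the custom model's language,
    # so no LanguageCode is passed here.
    response = comprehend.detect_entities(Text=text, EndpointArn=endpoint_arn)
    return [
        (entity["Text"], entity["Type"], entity["Score"])
        for entity in response["Entities"]
        if entity["Score"] >= min_score  # drop low-confidence matches
    ]
```

The `min_score` filter is a design choice rather than a requirement: it simply discards entities whose confidence score falls below the chosen threshold.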

How do you create an Amazon Comprehend custom entity recognition model with Python code?

The following Python example uses the AWS Boto3 SDK to create an Amazon Comprehend custom entity recognition model and then create an API endpoint for it.

import boto3
import time

# Create the Comprehend client
comprehend = boto3.client('comprehend', region_name='us-east-1')

# Define the S3 paths and the model name
training_data_s3_uri = 's3://your-bucket-name/training-data/'
model_name = 'MyCustomEntityModel'
data_access_role_arn = 'arn:aws:iam::YOUR_ACCOUNT_ID:role/YourComprehendRole'

# Create the custom entity recognition model
def create_entity_recognition_model():
    try:
        response = comprehend.create_entity_recognizer(
            RecognizerName=model_name,
            DataAccessRoleArn=data_access_role_arn,
            InputDataConfig={
                'EntityTypes': [{'Type': 'AWS_SERVICE'}, {'Type': 'JOB_TITLE'}],
                'Documents': {'S3Uri': training_data_s3_uri},
                'Annotations': {'S3Uri': 's3://your-bucket-name/annotations/'}
            },
            LanguageCode='en'
        )
        recognizer_arn = response['EntityRecognizerArn']
        print(f'Entity recognizer {model_name} created successfully. ARN: {recognizer_arn}')
        return recognizer_arn
    except Exception as e:
        print(f'Error creating entity recognizer: {str(e)}')
        return None

# Poll the training status until the model reaches a terminal state
def check_training_status(recognizer_arn):
    while True:
        response = comprehend.describe_entity_recognizer(
            EntityRecognizerArn=recognizer_arn
        )
        status = response['EntityRecognizerProperties']['Status']
        print(f'Model status: {status}')
        # Terminal states; a failed training job is reported as IN_ERROR
        if status in ['TRAINED', 'TRAINED_WITH_WARNING', 'IN_ERROR', 'STOPPED']:
            return status
        time.sleep(600)  # Check every 10 minutes

# Create the endpoint
def create_endpoint(recognizer_arn):
    try:
        endpoint_name = 'MyComprehendEndpoint'
        response = comprehend.create_endpoint(
            EndpointName=endpoint_name,
            ModelArn=recognizer_arn,
            DesiredInferenceUnits=1
        )
        print(f'Endpoint {endpoint_name} created successfully. ARN: {response["EndpointArn"]}')
    except Exception as e:
        print(f'Error creating endpoint: {str(e)}')

if __name__ == '__main__':
    # Step 1: create the entity recognition model
    recognizer_arn = create_entity_recognition_model()
    if recognizer_arn:
        # Step 2: wait for training to finish
        status = check_training_status(recognizer_arn)
        # Step 3: create the endpoint only if training succeeded
        if status == 'TRAINED':
            create_endpoint(recognizer_arn)
        else:
            print(f'Training ended with status {status}; endpoint not created.')

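Code walkthrough

Creating the custom entity recognition model:

The custom function create_entity_recognition_model calls the create_entity_recognizer API to create the custom entity recognition model.
Input configuration: InputDataConfig contains the entity types, the S3 path of the documents, and the S3 path of the annotations.
LanguageCode specifies the model's language; here we use English.

Creating the endpoint:

The create_endpoint function calls the create_endpoint API to create an endpoint for real-time processing and recognition of entities in text.
DesiredInferenceUnits specifies the number of inference units, which controls processing throughput.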
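Endpoint creation is asynchronous: the create_endpoint call returns while the endpoint is still being provisioned, so it cannot serve requests right away. As a possible follow-up step, the sketch below polls `describe_endpoint` until the endpoint is ready. The helper name `wait_for_endpoint` and its `poll_seconds`, `max_polls`, and `sleep` parameters are hypothetical additions for illustration, and `comprehend` is assumed to be a Boto3 Comprehend client passed in by the caller.

```python
import time


def wait_for_endpoint(comprehend, endpoint_arn, poll_seconds=60,
                      max_polls=30, sleep=time.sleep):
    """Poll describe_endpoint until the endpoint is IN_SERVICE or FAILED.

    Returns True when the endpoint is ready, False if creation failed,
    and raises TimeoutError if it is still not ready after max_polls checks.
    """
    for _ in range(max_polls):
        response = comprehend.describe_endpoint(EndpointArn=endpoint_arn)
        status = response["EndpointProperties"]["Status"]
        if status == "IN_SERVICE":
            return True
        if status == "FAILED":
            return False
        # Still CREATING or UPDATING: wait before checking again.
        sleep(poll_seconds)
    raise TimeoutError(f"endpoint {endpoint_arn} not ready after {max_polls} polls")
```

Capping the number of polls avoids waiting forever if the endpoint never becomes ready, and passing `sleep` as a parameter makes the helper easy to exercise in tests without real delays.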