Detecting plagiarism with Elasticsearch (Part 2)


In my previous article, "Detecting plagiarism with Elasticsearch (Part 1)", I showed how to detect plagiarized articles. This is useful in many real-world settings: my own articles on CSDN are often quoted or copied, sometimes without any attribution, which is unfair to authors, and the same technique is relevant to many blogging platforms. That said, the setup described in that article may not be easy for every developer to get running. In today's post I therefore use a local deployment and walk through everything in a jupyter notebook, so that developers can run the complete flow step by step.

Installation

Installing Elasticsearch and Kibana

If you do not have your own Elasticsearch and Kibana installed yet, please refer to the following articles:

  • How to install Elasticsearch on Linux, MacOS, and Windows

  • Kibana: How to install Kibana from the Elastic Stack on Linux, MacOS, and Windows

Please choose Elastic Stack 8.x when installing. During the installation, the output shows the password for the elastic superuser and the HTTPS certificate information; keep them handy, as we will need both below.

To be able to upload models, the cluster must have a Platinum subscription or an active trial license.

Uploading the models

Note: if you upload the models from the command line as shown here, you do not need to upload them again in the notebook code below and can skip those steps.

You can refer to the earlier article "Elasticsearch: Talk to your favorite Christmas songs using an NLP question-answering model". We use the following command to upload the OpenAI detection model:

eland_import_hub_model --url https://elastic:o6G_pvRL=8P*7on+o6XH@localhost:9200 \
	--hub-model-id roberta-base-openai-detector \
	--task-type text_classification \
	--ca-cert /Users/liuxg/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt \
	--start

In the command above, adjust the certificate path and the Elasticsearch endpoint to match your own deployment.

We can view the freshly uploaded model in Kibana:

Next, we upload the text embedding model in the same way.

eland_import_hub_model --url https://elastic:o6G_pvRL=8P*7on+o6XH@localhost:9200 \
	--hub-model-id sentence-transformers/all-mpnet-base-v2 \
	--task-type text_embedding \
	--ca-cert /Users/liuxg/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt \
	--start	

To make it easy to follow along, the code can be downloaded from:

git clone https://github.com/liu-xiao-guo/elasticsearch-labs

The jupyter notebook can be found at:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ ls
plagiarism_detection_es_self_managed.ipynb

Running the code

Next, we run the notebook. First, install the required Python packages:

pip3 install elasticsearch==8.11
pip3 -q install eland elasticsearch sentence_transformers transformers torch==2.1.0

Before running the code, set the following environment variables:

export ES_USER="elastic"
export ES_PASSWORD="o6G_pvRL=8P*7on+o6XH"
export ES_ENDPOINT="localhost"

We also need to copy the Elasticsearch certificate into the current directory:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ cp ~/elastic/elasticsearch-8.11.0/config/certs/http_ca.crt .
$ ls
http_ca.crt                                plagiarism_detection_es_self_managed.ipynb
plagiarism_detection_es.ipynb

Import the packages:

from elasticsearch import Elasticsearch, helpers
from elasticsearch.client import MlClient
from eland.ml.pytorch import PyTorchModel
from eland.ml.pytorch.transformers import TransformerModel
from urllib.request import urlopen
import json
from pathlib import Path
import os

Connecting to Elasticsearch

elastic_user=os.getenv('ES_USER')
elastic_password=os.getenv('ES_PASSWORD')
elastic_endpoint=os.getenv("ES_ENDPOINT")

url = f"https://{elastic_user}:{elastic_password}@{elastic_endpoint}:9200"
client = Elasticsearch(url, ca_certs = "./http_ca.crt", verify_certs = True)
 
print(client.info())
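
Uploading and running these models requires more than the Basic license (see the note in the installation section). If your local cluster is still on Basic, here is a minimal sketch, using the license APIs of the Python client, to check the current level and start a 30-day trial:

# Check the current license level; ML model deployment needs Platinum or a trial
license_info = client.license.get()
print("Current license:", license_info["license"]["type"])

# Start a 30-day trial if we are still on the basic license.
# acknowledge=True confirms that we accept the license change.
if license_info["license"]["type"] == "basic":
    resp = client.license.post_start_trial(acknowledge=True)
    print(resp)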

Uploading the detector model

hf_model_id = 'roberta-base-openai-detector'
tm = TransformerModel(model_id=hf_model_id, task_type="text_classification")

# Set the model ID as it is named in Elasticsearch
es_model_id = tm.elasticsearch_model_id()

# Download the model from Hugging Face
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Load the model into Elasticsearch
ptm = PyTorchModel(client, es_model_id)
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)

# Start the model
s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)
s.body

We can check it in Kibana:
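
Besides Kibana, the deployment can be confirmed from the notebook itself; a small sketch using the trained-model stats API:

# Verify that the uploaded model is deployed and started
stats = client.ml.get_trained_models_stats(model_id=es_model_id)
for model_stats in stats["trained_model_stats"]:
    deployment = model_stats.get("deployment_stats", {})
    print(model_stats["model_id"], deployment.get("state"))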

Uploading the text embedding model

hf_model_id = 'sentence-transformers/all-mpnet-base-v2'
tm = TransformerModel(model_id=hf_model_id, task_type="text_embedding")

# Set the model ID as it is named in Elasticsearch
es_model_id = tm.elasticsearch_model_id()

# Download the model from Hugging Face
tmp_path = "models"
Path(tmp_path).mkdir(parents=True, exist_ok=True)
model_path, config, vocab_path = tm.save(tmp_path)

# Load the model into Elasticsearch
ptm = PyTorchModel(client, es_model_id)
ptm.import_model(model_path=model_path, config_path=None, vocab_path=vocab_path, config=config)

# Start the model
s = MlClient.start_trained_model_deployment(client, model_id=es_model_id)
s.body

Again, we can check it in Kibana:

Creating the source index

client.indices.create(
    index="plagiarism-docs",
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "abstract": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "url": {
                "type": "keyword"
            },
            "venue": {
                "type": "keyword"
            },
            "year": {
                "type": "keyword"
            }
        }
    }
)

We can view it in Kibana:
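
If you prefer to stay in the notebook, a quick way to verify the mapping directly:

# Inspect the mapping of the newly created source index
print(client.indices.get_mapping(index="plagiarism-docs"))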

Creating the checker ingest pipeline

client.ingest.put_pipeline(
    id="plagiarism-checker-pipeline",
    processors=[
        {
            "inference": {  # run inference against the data ingested through the pipeline
                "model_id": "roberta-base-openai-detector",  # text classification model ID
                "target_field": "openai-detector",  # target field for the inference results
                "field_map": {  # maps document field names to the model's known field names
                    "abstract": "text_field"  # for NLP models, the input field is typically text_field
                }
            }
        },
        {
            "inference": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",  # text embedding model ID
                "target_field": "abstract_vector",  # target field for the inference results
                "field_map": {
                    "abstract": "text_field"
                }
            }
        }
    ]
)

We can view it in Kibana:
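
The pipeline definition can likewise be fetched from the notebook to confirm it was registered:

# Confirm that the ingest pipeline exists
print(client.ingest.get_pipeline(id="plagiarism-checker-pipeline"))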

Creating the plagiarism checker index

client.indices.create(
    index="plagiarism-checker",
    mappings={
        "properties": {
            "title": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "abstract": {
                "type": "text",
                "fields": {
                    "keyword": {
                        "type": "keyword"
                    }
                }
            },
            "url": {
                "type": "keyword"
            },
            "venue": {
                "type": "keyword"
            },
            "year": {
                "type": "keyword"
            },
            "abstract_vector.predicted_value": {  # inference results field: <target_field>.predicted_value
                "type": "dense_vector",
                "dims": 768,  # embedding size of all-mpnet-base-v2
                "index": "true",
                "similarity": "dot_product"  # similarity function used when indexing vectors for approximate kNN search
            }
        }
    }
)

We can view it in Kibana:
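
Before reindexing, it is worth double-checking that the dense_vector field was mapped as intended; a quick sketch:

# Confirm the mapping, especially the 768-dim dense_vector field
mapping = client.indices.get_mapping(index="plagiarism-checker")
print(mapping["plagiarism-checker"]["mappings"])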

Writing the source documents

First, download the dataset from https://public.ukp.informatik.tu-darmstadt.de/reimers/sentence-transformers/datasets/emnlp2016-2018.json into the current directory:

$ pwd
/Users/liuxg/python/elasticsearch-labs/supporting-blog-content/plagiarism-detection-with-elasticsearch
$ ls
emnlp2016-2018.json                        plagiarism_detection_es.ipynb
http_ca.crt                                plagiarism_detection_es_self_managed.ipynb
models

As shown above, emnlp2016-2018.json is the file we just downloaded.

# Load data into a JSON object
with open('emnlp2016-2018.json') as f:
    data_json = json.load(f)
 
print(f"Successfully loaded {len(data_json)} documents")

def create_index_body(doc):
    """ Generate the body for an Elasticsearch document. """
    return {
        "_index": "plagiarism-docs",
        "_source": doc,
    }

# Prepare the documents to be indexed
documents = [create_index_body(doc) for doc in data_json]

# Use helpers.bulk to index
helpers.bulk(client, documents)

print("Done indexing documents into `plagiarism-docs` source index")

We can view the documents in Kibana:
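
The document count can also be confirmed from the notebook; it should match the number of documents loaded from the JSON file:

# The source index should contain every document we just bulk-indexed
print(client.count(index="plagiarism-docs")["count"])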

Reindexing through the ingest pipeline

client.reindex(
    wait_for_completion=False,
    source={
        "index": "plagiarism-docs"
    },
    dest={
        "index": "plagiarism-checker",
        "pipeline": "plagiarism-checker-pipeline"
    }
)

Above, we set wait_for_completion=False, making the reindex an asynchronous operation, so we need to wait a while for it to finish. We can track its progress by checking the document counts, as in the sketch below:
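
A minimal polling sketch, assuming the reindex eventually copies every source document:

import time

# Poll until the checker index holds as many documents as the source index
source_count = client.count(index="plagiarism-docs")["count"]
while True:
    dest_count = client.count(index="plagiarism-checker")["count"]
    print(f"{dest_count}/{source_count} documents reindexed")
    if dest_count >= source_count:
        break
    time.sleep(10)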

Once the counts match, the reindex is complete. Next, let's examine the documents in the plagiarism-checker index:
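
A quick sketch that fetches one document and checks that the pipeline added both inference fields:

# Fetch one enriched document: it should now carry the classifier output
# under 'openai-detector' and a 768-dim embedding under 'abstract_vector'
resp = client.search(index="plagiarism-checker", size=1)
doc = resp["hits"]["hits"][0]["_source"]
print(doc["title"])
print(doc["openai-detector"]["predicted_value"])
print(len(doc["abstract_vector"]["predicted_value"]))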

Checking for duplicated text

Direct plagiarism

model_text = 'Understanding and reasoning about cooking recipes is a fruitful research direction towards enabling machines to interpret procedural text. In this work, we introduce RecipeQA, a dataset for multimodal comprehension of cooking recipes. It comprises of approximately 20K instructional recipes with multiple modalities such as titles, descriptions and aligned set of images. With over 36K automatically generated question-answer pairs, we design a set of comprehension and reasoning tasks that require joint understanding of images and text, capturing the temporal flow of events and making sense of procedural knowledge. Our preliminary results indicate that RecipeQA will serve as a challenging test bed and an ideal benchmark for evaluating machine comprehension systems. The data and leaderboard are available at http://hucvl.github.io/recipeqa.'

response = client.search(index='plagiarism-checker', size=1,
    knn={
        "field": "abstract_vector.predicted_value",
        "k": 9,
        "num_candidates": 974,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": model_text
            }
        }
    }
)

for hit in response['hits']['hits']:
    score = hit['_score']
    title = hit['_source']['title']
    abstract = hit['_source']['abstract']
    openai = hit['_source']['openai-detector']['predicted_value']
    url = hit['_source']['url']

    if score > 0.9:
        print(f"\nHigh similarity detected! This might be plagiarism.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore:{score}\n")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

    elif score < 0.7:
        print(f"\nLow similarity detected. This might not be plagiarism.")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

    else:
        print(f"\nModerate similarity detected.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore:{score}\n")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

ml_client = MlClient(client)

model_id = 'roberta-base-openai-detector'  # OpenAI text classification model

document = [
    {
        "text_field": model_text
    }
]

ml_response = ml_client.infer_trained_model(model_id=model_id, docs=document)

predicted_value = ml_response['inference_results'][0]['predicted_value']

if predicted_value == 'Fake':
    print("Note: The text query you entered may have been generated by AI.\n")

Similar text - paraphrase plagiarism

model_text = 'Comprehending and deducing information from culinary instructions represents a promising avenue for research aimed at empowering artificial intelligence to decipher step-by-step text. In this study, we present CuisineInquiry, a database for the multifaceted understanding of cooking guidelines. It encompasses a substantial number of informative recipes featuring various elements such as headings, explanations, and a matched assortment of visuals. Utilizing an extensive set of automatically crafted question-answer pairings, we formulate a series of tasks focusing on understanding and logic that necessitate a combined interpretation of visuals and written content. This involves capturing the sequential progression of events and extracting meaning from procedural expertise. Our initial findings suggest that CuisineInquiry is poised to function as a demanding experimental platform.'

response = client.search(index='plagiarism-checker', size=1,
    knn={
        "field": "abstract_vector.predicted_value",
        "k": 9,
        "num_candidates": 974,
        "query_vector_builder": {
            "text_embedding": {
                "model_id": "sentence-transformers__all-mpnet-base-v2",
                "model_text": model_text
            }
        }
    }
)

for hit in response['hits']['hits']:
    score = hit['_score']
    title = hit['_source']['title']
    abstract = hit['_source']['abstract']
    openai = hit['_source']['openai-detector']['predicted_value']
    url = hit['_source']['url']

    if score > 0.9:
        print(f"\nHigh similarity detected! This might be plagiarism.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore:{score}\n")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

    elif score < 0.7:
        print(f"\nLow similarity detected. This might not be plagiarism.")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

    else:
        print(f"\nModerate similarity detected.")
        print(f"\nMost similar document: '{title}'\n\nAbstract: {abstract}\n\nurl: {url}\n\nScore:{score}\n")

        if openai == 'Fake':
            print("This document may have been created by AI.\n")

ml_client = MlClient(client)

model_id = 'roberta-base-openai-detector'  # OpenAI text classification model

document = [
    {
        "text_field": model_text
    }
]

ml_response = ml_client.infer_trained_model(model_id=model_id, docs=document)

predicted_value = ml_response['inference_results'][0]['predicted_value']

if predicted_value == 'Fake':
    print("Note: The text query you entered may have been generated by AI.\n")

The complete code can be downloaded here: https://github.com/liu-xiao-guo/elasticsearch-labs/blob/main/supporting-blog-content/plagiarism-detection-with-elasticsearch/plagiarism_detection_es_self_managed.ipynb
