Quick Start: Importing a 140-Million-Record Chinese Knowledge Graph into Nebula Graph

news2025/7/17 1:47:17

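1. The largest Chinese knowledge graph to date

Yener has open-sourced OwnThink (link: https://github.com/ownthink/KnowledgeGraphData), billed as the largest Chinese knowledge graph to date, with about 140 million records. The data is stored as triples, mixing the (entity, attribute, value) and (entity, relation, entity) forms, and is distributed in CSV format.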

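2. Cleaning duplicate data

The source code of a simple cleaning tool is available at https://github.com/jievince/rdf-converter; download it and compile it yourself. The tool writes the converted vertex data to vertex.csv and the edge data to edge.csv. Testing turned up a large number of duplicate vertices, so the tool also deduplicates. After full deduplication there are roughly 46 million vertices and roughly 140 million edges.

Alternatively, download the already deduplicated data directly from https://www.kaggle.com/datasets/littlewey/nebula-ownthink-property-graph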
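
To make the conversion step concrete, here is a minimal Python sketch of the idea, not the rdf-converter's actual code. It assumes every triple becomes one edge whose name property is the middle element, and that both ends of the triple become vertices. The names stand_in_vid and convert are hypothetical. The hash is a placeholder and is not Nebula's hash() function, so IDs produced by this sketch will not match hash("...") in nGQL.

```python
import csv
import hashlib


def stand_in_vid(name: str) -> int:
    """Map a name to a signed 64-bit integer.

    Assumption: placeholder hash only. It is NOT Nebula's hash() function,
    so these IDs cannot be looked up with hash("...") in nGQL.
    """
    digest = hashlib.blake2b(name.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big", signed=True)


def convert(triples, vertex_out, edge_out, vid_of=stand_in_vid):
    """Write deduplicated vertex rows (vid, name) and edge rows (src, dst, name).

    Returns the number of distinct vertices and distinct edges written.
    """
    seen_vertices = set()
    seen_edges = set()
    vertex_writer = csv.writer(vertex_out)
    edge_writer = csv.writer(edge_out)
    for subject, predicate, obj in triples:
        src, dst = vid_of(subject), vid_of(obj)
        # Both ends of the triple become vertices, written once each.
        for vid, name in ((src, subject), (dst, obj)):
            if vid not in seen_vertices:
                seen_vertices.add(vid)
                vertex_writer.writerow([vid, name])
        # An edge is identified by its endpoints plus its name.
        edge_key = (src, dst, predicate)
        if edge_key not in seen_edges:
            seen_edges.add(edge_key)
            edge_writer.writerow([src, dst, predicate])
    return len(seen_vertices), len(seen_edges)
```

Keeping every ID in an in-memory set is fine for a sample, but at tens of millions of vertices it needs a lot of RAM; for the full dataset the compiled tool or the pre-cleaned download is the more practical route.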

3. Preparing the schema and other metadata

CREATE SPACE is conceptually close to CREATE DATABASE in MySQL.

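# Create the test space
CREATE SPACE test(partition_num=20,replica_factor=1,vid_type=INT64);
# Switch to the test space (space creation is asynchronous; if USE fails, wait about 20 seconds, two default heartbeat cycles, and retry)
USE test;
# Create the vertex type (entity)
CREATE TAG entity(name string);
# Create the edge type (relation)
CREATE EDGE relation(name string);
# Show the properties of the entity tag
DESCRIBE TAG entity;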

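4. Importing the data with nebula-importer

Download the import tool from https://github.com/vesoft-inc/nebula-importer/releases

The config.yaml below can be used as-is; for the syntax, see the documentation in the GitHub repository.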

client:
  version: v3
  address: "127.0.0.1:9669"
  user: root
  password: nebula
  concurrencyPerAddress: 10
  reconnectInitialInterval: 1s
  retry: 3
  retryInitialInterval: 1s

manager:
  spaceName: test
  batch: 128
  readerConcurrency: 50
  importerConcurrency: 512
  statsInterval: 10s
log:
  level: INFO
  console: true
  files:
    - logs/nebula-importer.log

sources:
  - path: ./vertex.csv
    failDataPath: ./err/vertex.csv
    csv:
      delimiter: ","
      withHeader: false
      withLabel: false
    tags:
    - name: entity
      id:
        type: "INT"
        index: 0
      props:
        - name: "name"
          type: "STRING"
          index: 1
  - path: ./edge.csv
    failDataPath: ./err/edge.csv
    batch: 256
    csv:
      delimiter: ","
      withHeader: false
      withLabel: false
    edges:
    - name: relation
      src:
        id:
          type: "INT"
          index: 0
      dst:
        id:
          type: "INT"
          index: 1
      props:
        - name: "name"
          type: "STRING"
          index: 2
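
Before starting a long import, it can be worth checking that a sample of each CSV file matches the column layout this config reads: an integer ID in column 0 of vertex.csv, and integer IDs in columns 0 and 1 of edge.csv. The Python sketch below is one way to do that. The helper name check_rows is hypothetical, and the assumption that rows have exactly 2 columns (vertices) or 3 columns (edges) is read off the config, not guaranteed by the importer.

```python
import csv


def check_rows(lines, int_columns, expected_width, limit=1000):
    """Inspect up to `limit` CSV rows and return (line_number, reason) per bad row."""
    problems = []
    for line_number, row in enumerate(csv.reader(lines), start=1):
        if line_number > limit:
            break
        if len(row) != expected_width:
            problems.append(
                (line_number, f"expected {expected_width} columns, got {len(row)}")
            )
            continue
        for index in int_columns:
            try:
                value = int(row[index])
            except ValueError:
                problems.append((line_number, f"column {index} is not an integer"))
                break
            # vid_type=INT64 means IDs must fit in a signed 64-bit integer.
            if not -2**63 <= value < 2**63:
                problems.append(
                    (line_number, f"column {index} is outside the INT64 range")
                )
                break
    return problems
```

For the files in this article, the calls would look like check_rows(open("vertex.csv", encoding="utf-8", newline=""), [0], 2) and check_rows(open("edge.csv", encoding="utf-8", newline=""), [0, 1], 3); an empty list means the sampled rows look consistent with the config.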

nebula-importer -c config.yaml

Then wait for the import to finish.

5. A first look at queries

5.1 Edge types and vertex properties directly connected to Yao Ming

GO FROM hash("姚明[中国篮球协会主席、中职联公司董事长]") OVER relation YIELD relation.name AS Name, $$.entity.name AS Value;
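
The same query can be issued from a program. The Python sketch below builds the statement for an arbitrary entity name and, optionally, runs it through the nebula3-python client. The helper names quote_ngql_string, neighbors_query and run_query are hypothetical. The connection defaults are copied from the config.yaml above. The client import sits inside run_query so the string helpers work even when the client is not installed.

```python
def quote_ngql_string(text: str) -> str:
    """Wrap text in double quotes, escaping backslashes and double quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def neighbors_query(entity_name: str, edge_type: str = "relation", tag: str = "entity") -> str:
    """Build the one-hop GO statement used in this section."""
    return (
        f"GO FROM hash({quote_ngql_string(entity_name)}) OVER {edge_type} "
        f"YIELD {edge_type}.name AS Name, $$.{tag}.name AS Value;"
    )


def run_query(statement, host="127.0.0.1", port=9669,
              user="root", password="nebula", space="test"):
    """Execute a statement and return (Name, Value) pairs.

    Assumes nebula3-python is installed and graphd is reachable.
    """
    from nebula3.gclient.net import ConnectionPool
    from nebula3.Config import Config

    pool = ConnectionPool()
    if not pool.init([(host, port)], Config()):
        raise RuntimeError("could not connect to graphd")
    session = pool.get_session(user, password)
    try:
        for stmt in (f"USE {space};", statement):
            result = session.execute(stmt)
            if not result.is_succeeded():
                raise RuntimeError(result.error_msg())
        names = [v.as_string() for v in result.column_values("Name")]
        values = [v.as_string() for v in result.column_values("Value")]
        return list(zip(names, values))
    finally:
        session.release()
        pool.close()
```

Calling run_query(neighbors_query("姚明[中国篮球协会主席、中职联公司董事长]")) should reproduce the query above, provided the vertex IDs in vertex.csv were generated with the same hash that nGQL's hash() computes.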
