Lecture 18 Information Extraction

news2025/1/14 19:41:03

Contents

      • Named Entity Recognition
      • Relation Extraction
      • Other IE Tasks
      • Conclusion

  • information extraction
    • Given this: “Brasilia, the Brazilian capital, was founded in 1960.”
    • Obtain this:
      • capital(Brazil, Brasilia)
      • founded(Brasilia, 1960)
    • Main goal: turn text into structured data
  • applications
    • Stock analysis
      • Gather information from news and social media
      • Summarise texts into a structured format
      • Decide whether to buy/sell at current stock price
    • Medical research
      • Obtain information from articles about diseases and treatments
      • Decide which treatment to apply for new patient
  • how
    • Two steps:
      • Named Entity Recognition (NER): find out entities such as “Brasilia” and “1960”
      • Relation Extraction: use context to find the relation between “Brasilia” and “1960” (“founded”)
  • machine learning in IE
    • Named Entity Recognition (NER): sequence models such as RNNs, HMMs or CRFs.
    • Relation Extraction: mostly classifiers, either binary or multi-class.
    • This lecture: how to frame these two tasks in order to apply sequence labellers and classifiers.

Named Entity Recognition

  • typical entity tags (types of tags to use depend on domains)
    • PER(people): people, characters
    • ORG(organisation): companies, sports teams
    • LOC(natural location): regions, mountains, seas
    • GPE(man-made locations): countries, states, provinces (in some tagsets these are labelled as LOC)
    • FAC(facility): bridges, buildings, airports
    • VEH(vehicle): planes, trains, cars
    • Tag-set is application-dependent: some domains deal with specific entities e.g. proteins and genes
  • NER as sequence labelling
    • NE tags can be ambiguous:

      • “Washington” can be a person, location or political entity
    • Similar problem when doing POS tagging

      • possible solution: incorporate context
    • Can we use a sequence tagger for this (e.g. HMM)?

      • Not directly, as entities can span multiple tokens (multiple words)
      • Solution: modify the tag set
    • IO(inside,outside) tagging

      • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
      • ‘I-ORG’ represents a token that is inside an entity (ORG in this case).
      • All tokens which are not part of an entity get the ‘O’ tag (for outside).
      • Cannot differentiate between:
        • a single entity with multiple tokens
        • multiple adjacent entities with single tokens
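The ambiguity can be illustrated with a small Python sketch. The helper `decode_io` and the example sentence are hypothetical, introduced only for illustration: the decoder groups consecutive tokens carrying the same I- tag into one entity, which is all an IO scheme allows.

```python
def decode_io(tokens, tags):
    """Group consecutive tokens that share the same I-<TYPE> tag into one entity."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        tag_type = tag[2:] if tag.startswith("I-") else None
        if tag_type is not None and tag_type == current_type:
            # Same type as the previous token: IO has to treat it as a continuation.
            current_tokens.append(token)
            continue
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))
        current_tokens = [token] if tag_type else []
        current_type = tag_type
    if current_tokens:
        entities.append((" ".join(current_tokens), current_type))
    return entities


# Illustrative sentence: "John" and "Mary" are two separate people.
tokens = ["He", "showed", "John", "Mary", "'s", "letter"]
tags = ["O", "O", "I-PER", "I-PER", "O", "O"]
io_entities = decode_io(tokens, tags)
print(io_entities)
```

The decoder returns a single "John Mary" entity, because nothing in the IO tags marks where one entity ends and the next begins.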
    • IOB(beginning) tagging

      • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.

      • B-ORG represents the beginning of an ORG entity.

      • If the entity has more than one token, subsequent tags are represented as I-ORG.

      • example: annotate the following sentence with NER tags(IOB)

        • “Steve Jobs founded Apple Inc. in 1976”, tagset: PER, ORG, LOC, TIME
          • [B-PER Steve] [I-PER Jobs] [O founded] [B-ORG Apple] [I-ORG Inc.] [O in] [B-TIME 1976]
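A minimal sketch of how annotated entity spans could be converted into IOB tags. The function name `spans_to_iob` and the span format (start index, exclusive end index, type) are assumptions made for this example, not part of any particular toolkit.

```python
def spans_to_iob(tokens, spans):
    """Convert (start, end_exclusive, type) token spans into one IOB tag per token."""
    tags = ["O"] * len(tokens)
    for start, end, ent_type in spans:
        tags[start] = f"B-{ent_type}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{ent_type}"          # remaining tokens of the entity
    return tags


tokens = ["Steve", "Jobs", "founded", "Apple", "Inc.", "in", "1976"]
spans = [(0, 2, "PER"), (3, 5, "ORG"), (6, 7, "TIME")]
iob_tags = spans_to_iob(tokens, spans)
print(list(zip(tokens, iob_tags)))

# Two adjacent single-token entities stay distinct, unlike with IO tagging.
adjacent_tags = spans_to_iob(["John", "Mary"], [(0, 1, "PER"), (1, 2, "PER")])
```

Because every entity starts with its own B- tag, adjacent entities of the same type can be told apart when the tags are decoded.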
    • NER as sequence labelling

      • Given such tagging scheme, we can train any sequence labelling model
      • In theory, HMMs can be used, but discriminative models such as CRFs are preferred (HMMs cannot easily incorporate new features)
    • NER

      • features

        • Example: L’Occitane
        • Prefix/suffix:
          • L / L’ / L’O / L’Oc / …
          • e / ne / ane / tane / …
        • Word shape:
          • X’Xxxxxxxx / X’Xx
          • XXXX-XX-XX (date!)
        • POS tags / syntactic chunks: many entities are nouns or noun phrases.
        • Presence in a gazetteer: lists of entities, such as place names, people’s names and surnames, etc.
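The features listed above could be computed per token roughly as follows. This is a sketch under assumptions: the function names and the toy `GAZETTEER` set are made up for illustration, a straight apostrophe is used in the example token, and digits are mapped to `X` to match the date shape shown in the notes.

```python
import re


def word_shape(token):
    """Map uppercase letters and digits to 'X', lowercase letters to 'x'."""
    out = []
    for ch in token:
        if ch.isupper() or ch.isdigit():
            out.append("X")
        elif ch.islower():
            out.append("x")
        else:
            out.append(ch)  # keep punctuation such as apostrophes and hyphens
    return "".join(out)


def short_shape(token):
    """Collapse runs of the same shape character, e.g. 'Xxxx' becomes 'Xx'."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))


# Toy gazetteer with illustrative entries only.
GAZETTEER = {"L'Occitane", "Brasilia"}


def token_features(token, max_affix=4):
    """Build a feature dictionary of the kind a CRF-style tagger could consume."""
    feats = {
        "shape": word_shape(token),
        "short_shape": short_shape(token),
        "in_gazetteer": token in GAZETTEER,
    }
    for n in range(1, min(max_affix, len(token)) + 1):
        feats[f"prefix_{n}"] = token[:n]
        feats[f"suffix_{n}"] = token[-n:]
    return feats


features = token_features("L'Occitane")
date_shape = word_shape("2007-07-02")
```

POS tags and syntactic chunks would come from separate taggers and are left out of this sketch.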
      • classifier (figure not reproduced)

      • deep learning for NER

        • The state-of-the-art approach as of Lample et al. (2016) uses LSTMs with character and word embeddings

Relation Extraction

  • relation extraction

    • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
    • Traditionally framed as triple (a relation and two entities) extraction:
      • unit(American Airlines, AMR Corp.)
      • spokesman(Tim Wagner, American Airlines)
    • Key question: do we know all the possible relations?
      • map relations to a closed set of relations
      • unit(American Airlines, AMR Corp.) → subsidiary
      • spokesman(Tim Wagner, American Airlines) → employment
  • methods

    • If we have access to a fixed relation database:

      • Rule-based
      • Supervised
      • Semi-supervised
      • Distant supervision
    • If no restrictions on relations:

      • Unsupervised
      • Sometimes referred to as “OpenIE”
    • rule-based relation extraction

      • “Agar is a substance prepared from a mixture of red algae such as Gelidium, for laboratory or industrial use.”
      • Identify linguistic patterns in the sentence
      • [NP red algae] such as [NP Gelidium]
      • NP0 such as NP1 → hyponym(NP1, NP0)
      • hyponym(Gelidium, red algae)
      • Lexico-syntactic patterns: high precision but low recall (there are too many linguistic patterns to enumerate them all), and manual effort is required
      • more rules (figure not reproduced)
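The “NP0 such as NP1” rule can be sketched with a regular expression. The sketch assumes noun phrases have already been bracketed as `[NP ...]` by a chunker, as in the example above; the pattern and function names are hypothetical.

```python
import re

# Matches "[NP x] such as [NP y]" in a sentence whose noun phrases are pre-bracketed.
SUCH_AS = re.compile(r"\[NP ([^\]]+)\] such as \[NP ([^\]]+)\]")


def extract_hyponyms(chunked_sentence):
    """Apply the rule NP0 such as NP1 -> hyponym(NP1, NP0)."""
    return [
        ("hyponym", m.group(2), m.group(1))
        for m in SUCH_AS.finditer(chunked_sentence)
    ]


chunked = (
    "Agar is a substance prepared from a mixture of [NP red algae] "
    "such as [NP Gelidium], for laboratory or industrial use."
)
hyponyms = extract_hyponyms(chunked)
print(hyponyms)
```

Each additional lexico-syntactic pattern needs its own hand-written rule, which is where the manual effort and the low recall come from.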
    • supervised relation extraction

      • Assume a corpus with annotated relations
      • Two steps (if only one step, class imbalance problem: most entities have no relations!)
        • First, find if an entity pair is related or not (binary classification)
          • For each sentence, gather all possible entity pairs
          • Annotated pairs are considered positive examples
          • Non-annotated pairs are taken as negative examples
        • Second, for pairs predicted as positive, use a multiclass classifier (e.g. SVM) to obtain the relation
        • example
          • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
          • First:
            • (American Airlines, AMR Corp.) → positive
            • (American Airlines, Tim Wagner) → positive
            • (AMR Corp., Tim Wagner) → negative
          • Second:
            • (American Airlines, AMR Corp.) → subsidiary
            • (American Airlines, Tim Wagner) → employment
      • features
        • [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
        • (American Airlines, Tim Wagner) → employment
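A minimal sketch of how the training data for the two steps could be assembled from one annotated sentence. The variable and function names are assumptions, and pairs are treated as unordered for simplicity, whereas a real system would also need to model the direction of each relation.

```python
from itertools import combinations

# Entities and gold relations for the example sentence above.
entities = ["American Airlines", "AMR Corp.", "Tim Wagner"]
annotated = {
    ("American Airlines", "AMR Corp."): "subsidiary",
    ("American Airlines", "Tim Wagner"): "employment",
}


def binary_examples(entities, annotated):
    """Step 1 data: every entity pair, labelled positive if it is annotated."""
    examples = []
    for pair in combinations(entities, 2):
        label = "positive" if pair in annotated else "negative"
        examples.append((pair, label))
    return examples


def multiclass_examples(binary, annotated):
    """Step 2 data: only the positive pairs, labelled with their relation type."""
    return [(pair, annotated[pair]) for pair, label in binary if label == "positive"]


step_one = binary_examples(entities, annotated)
step_two = multiclass_examples(step_one, annotated)
```

At prediction time the second classifier would only see pairs that the first classifier marked as positive, so it is trained on a far more balanced set of labels than a single-step classifier would be.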
    • semi-supervised relation extraction

      • Annotated corpora are very expensive to create

      • Use seed tuples to bootstrap a classifier (use seed to find more training data)

      • steps:

        1. Given seed tuple: hub(Ryanair, Charleroi)
        2. Find sentences containing terms in seed tuples
          • Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport
        3. Extract general patterns
          • [ORG], which uses [LOC] as a hub
        4. Find new tuples with these patterns
          • hub(Jetstar, Avalon)
        5. Add these new tuples to existing tuples and repeat step 2
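The loop above can be sketched in a few lines of Python. This is a toy version under stated assumptions: entities are approximated as single capitalised tokens instead of NER output, a pattern is just the text between and after the two arguments, and the second corpus sentence is invented so that the pattern has something new to match.

```python
import re


def extract_pattern(sentence, arg1, arg2):
    """Step 3: return the (middle, right) context around a known tuple, or None."""
    match = re.search(
        re.escape(arg1) + r"(.+?)" + re.escape(arg2) + r"([^,.]*)", sentence
    )
    return (match.group(1), match.group(2)) if match else None


def apply_pattern(pattern, sentence):
    """Step 4: find new argument pairs that occur in the same context."""
    middle, right = pattern
    # Assumption: an entity is a single capitalised token (stand-in for NER).
    regex = r"([A-Z]\w+)" + re.escape(middle) + r"([A-Z]\w+)" + re.escape(right)
    return [(m.group(1), m.group(2)) for m in re.finditer(regex, sentence)]


def bootstrap(corpus, seeds, rounds=2):
    tuples = set(seeds)
    for _ in range(rounds):
        patterns = set()
        for arg1, arg2 in tuples:  # step 2: sentences containing a known tuple
            for sentence in corpus:
                pattern = extract_pattern(sentence, arg1, arg2)
                if pattern is not None:
                    patterns.add(pattern)
        for pattern in patterns:  # steps 4 and 5: harvest and add new tuples
            for sentence in corpus:
                tuples.update(apply_pattern(pattern, sentence))
    return tuples


corpus = [
    "Budget airline Ryanair, which uses Charleroi as a hub, scrapped all "
    "weekend flights out of the airport.",
    "Jetstar, which uses Avalon as a hub, announced new routes.",  # invented
]
hub_tuples = bootstrap(corpus, {("Ryanair", "Charleroi")})
print(sorted(hub_tuples))
```

The sketch accepts every extracted pattern; a practical system would score patterns and keep only the confident ones.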
      • issues

        • Extracted tuples deviate from original relation over time
          • semantic drift(deviate from original relation)
            • Pattern: [NP] has a {NP}* hub at [LOC]
            • Sydney has a ferry hub at Circular Quay
              • hub(Sydney, Circular Quay)
            • More erroneous patterns extracted from this tuple…
            • Should only accept patterns with high confidence
        • Difficult to evaluate (no labels for newly extracted tuples)
        • Extracted general patterns tend to be very noisy
    • distant supervision

      • Semi-supervised methods assume the existence of seed tuples to mine new tuples

      • Can we mine new tuples directly?

      • Distant supervision obtains new tuples from a range of sources:

        • DBpedia
        • Freebase
      • Generates massive training sets, enabling the use of richer features, with no risk of semantic drift
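A minimal sketch of how a knowledge base could be used to label sentences automatically. The toy `kb` dictionary stands in for a resource such as DBpedia or Freebase and only contains tuples taken from these notes; plain substring matching replaces proper entity linking.

```python
def distant_label(sentences, kb):
    """Label a sentence with a relation whenever it mentions both KB arguments."""
    examples = []
    for sentence in sentences:
        for (arg1, arg2), relation in kb.items():
            # Simplification: substring matching instead of entity linking.
            if arg1 in sentence and arg2 in sentence:
                examples.append((sentence, arg1, arg2, relation))
    return examples


# Toy knowledge base (illustrative only).
kb = {
    ("Ryanair", "Charleroi"): "hub",
    ("United", "Chicago"): "hub",
}
sentences = [
    "Budget airline Ryanair, which uses Charleroi as a hub, scrapped all "
    "weekend flights out of the airport.",
    "United has a hub in Chicago, which is the headquarters of United "
    "Continental Holdings.",
    "Agar is a substance prepared from a mixture of red algae such as Gelidium.",
]
training_examples = distant_label(sentences, kb)
labels = [(a, b, rel) for _, a, b, rel in training_examples]
```

The automatically labelled sentences can then be used to train a supervised relation classifier. The labels are noisy, since a sentence may mention both entities without expressing the relation.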

    • unsupervised relation extraction

      • No fixed or closed set of relations
      • Relations are sub-sentence spans, usually containing a verb
      • “United has a hub in Chicago, which is the headquarters of United Continental Holdings.”
        • “has a hub in”(United, Chicago)
        • “is the headquarters of”(Chicago, United Continental Holdings)
      • Main problem: there are so many relation forms that mapping them into canonical forms is difficult
    • evaluation

      • NER: F1-measure at the entity level.
      • Relation Extraction with known relation set: F1-measure
      • Relation Extraction with unknown relations: much harder to evaluate
        • Usually need some human evaluation
        • Massive datasets used in these settings are impractical to evaluate manually (use samples)
        • Can only obtain (approximate) precision, not recall (too many possible relations!)
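Entity-level F1 for NER can be sketched as follows, assuming entities are represented as (start, end_exclusive, type) token spans and that a prediction only counts when both boundaries and type match exactly. The function name and the predicted spans are made up for illustration.

```python
def entity_f1(gold, predicted):
    """Precision, recall and F1 over exact-match entity spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Token indices for: American Airlines , a unit of AMR Corp. , immediately
# matched the move , spokesman Tim Wagner said .
gold = {(0, 2, "ORG"), (6, 8, "ORG"), (15, 17, "PER")}
# Hypothetical system output that truncates "AMR Corp." to "AMR".
predicted = {(0, 2, "ORG"), (6, 7, "ORG"), (15, 17, "PER")}

precision, recall, f1 = entity_f1(gold, predicted)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Under exact matching, the truncated "AMR" span counts as both a false positive and a missed gold entity, so it lowers precision and recall alike.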

Other IE Tasks

  • temporal expression extraction

    “[TIME July 2, 2007]: A fare increase initiated [TIME last week] by UAL Corp’s United Airlines was matched by competitors over [TIME the weekend], marking the second successful fare increase in [TIME two weeks].”

    • Anchoring: when is “last week”?
      • “last week” → 2007-W26
    • Normalisation: mapping expressions to canonical forms.
      • July 2, 2007 → 2007-07-02
    • Mostly rule-based approaches
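Both operations can be sketched with Python's standard `datetime` module. The function names are assumptions, the sketch covers only the two expressions from the example, and `%B` parsing of month names assumes an English locale.

```python
from datetime import datetime, timedelta


def normalise_date(expression):
    """Normalisation: map 'July 2, 2007' to the canonical form '2007-07-02'."""
    return datetime.strptime(expression, "%B %d, %Y").date()


def anchor_last_week(reference_date):
    """Anchoring: resolve 'last week' relative to the document date (ISO week)."""
    year, week, _ = (reference_date - timedelta(weeks=1)).isocalendar()
    return f"{year}-W{week:02d}"


document_date = normalise_date("July 2, 2007")
canonical = document_date.isoformat()
last_week = anchor_last_week(document_date)
print(canonical, last_week)
```

A rule-based system would pair each recognised expression type ("last week", "the weekend", "two weeks") with a small resolution rule of this kind.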
  • event extraction

    • “American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said].”
    • Very similar to NER, including the annotation and learning methods, but with different tags.
    • Event ordering: detect how a set of events happened in a timeline.
      • Involves both event extraction and temporal expression extraction.

Conclusion

  • Information Extraction is a vast field with many different tasks and applications
    • Named Entity Recognition
    • Relation Extraction
    • Event Extraction
  • Machine learning methods involve classifiers and sequence labelling models.
