Contents
- Named Entity Recognition
- Relation Extraction
- Other IE Tasks
- Conclusion
- information extraction
- Given this: “Brasilia, the Brazilian capital, was founded in 1960.”
- Obtain this:
- capital(Brazil, Brasilia)
- founded(Brasilia, 1960)
- Main goal: turn text into structured data
- applications
- Stock analysis
- Gather information from news and social media
- Summarise texts into a structured format
- Decide whether to buy/sell at current stock price
- Medical research
- Obtain information from articles about diseases and treatments
- Decide which treatment to apply for new patient
- how
- Two steps:
- Named Entity Recognition (NER): find entities such as “Brasilia” and “1960”
- Relation Extraction: use context to find the relation between “Brasilia” and “1960” (“founded”)
- machine learning in IE
- Named Entity Recognition (NER): sequence models such as RNNs, HMMs or CRFs.
- Relation Extraction: mostly classifiers, either binary or multi-class.
- This lecture: how to frame these two tasks in order to apply sequence labellers and classifiers.
Named Entity Recognition
- typical entity tags (types of tags to use depend on domains)
- PER(people): people, characters
- ORG(organisation): companies, sports teams
- LOC(natural location): regions, mountains, seas
- GPE(man-made locations): countries, states, provinces (in some tagsets this is labelled as LOC)
- FAC(facility): bridges, buildings, airports
- VEH(vehicle): planes, trains, cars
- Tag-set is application-dependent: some domains deal with specific entities e.g. proteins and genes
- NER as sequence labelling
- NE tags can be ambiguous:
- “Washington” can be a person, location or political entity
- A similar problem occurs in POS tagging
- possible solution: incorporate context
- Can we use a sequence tagger for this (e.g. an HMM)?
- Not directly, as entities can span multiple tokens (words)
- Solution: modify the tag set
- IO (inside, outside) tagging
- [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
- ‘I-ORG’ represents a token that is inside an entity (ORG in this case).
- All tokens which are not entities get the ‘O’ tag (for outside).
- Cannot differentiate between:
- a single entity with multiple tokens
- multiple entities with single tokens
- IOB (inside, outside, beginning) tagging
- [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
- B-ORG represents the beginning of an ORG entity.
- If the entity has more than one token, subsequent tags are represented as I-ORG.
- example: annotate the following sentence with NER tags (IOB)
- Steve Jobs founded Apple Inc. in 1976. Tagset: PER, ORG, LOC, TIME
- [B-PER Steve] [I-PER Jobs] [O founded] [B-ORG Apple] [I-ORG Inc.] [O in] [B-TIME 1976]
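The IOB scheme above can be sketched as a small conversion function (the helper name and the `(start, end_exclusive, TYPE)` span format are illustrative assumptions, not from the lecture):

```python
# Minimal sketch: convert entity spans to per-token IOB tags.

def to_iob(tokens, entities):
    """entities: list of (start, end_exclusive, TYPE) spans."""
    tags = ["O"] * len(tokens)
    for start, end, etype in entities:
        tags[start] = f"B-{etype}"          # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # subsequent entity tokens
    return tags

tokens = ["Steve", "Jobs", "founded", "Apple", "Inc.", "in", "1976"]
entities = [(0, 2, "PER"), (3, 5, "ORG"), (6, 7, "TIME")]
print(to_iob(tokens, entities))
# → ['B-PER', 'I-PER', 'O', 'B-ORG', 'I-ORG', 'O', 'B-TIME']
```

Because every entity now starts with a B- tag, adjacent single-token entities and one multi-token entity get different tag sequences, which plain IO tagging cannot distinguish.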
- NER as sequence labelling
- Given such a tagging scheme, we can train any sequence labelling model
- In theory, HMMs can be used, but discriminative models such as CRFs are preferred (HMMs cannot easily incorporate arbitrary features)
- NER features
- Example: L’Occitane
- Prefix/suffix:
- L / L’ / L’O / L’Oc / …
- e / ne / ane / tane / …
- Word shape:
- X’Xxxxxxxx / X’Xx
- XXXX-XX-XX (date!)
- POS tags / syntactic chunks: many entities are nouns or noun phrases.
- Presence in a gazetteer: lists of entities, such as place names, people’s names and surnames, etc.
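The affix and word-shape templates above might be sketched as follows. The function names are illustrative, and word shape follows the slide’s convention (uppercase letters and digits map to ‘X’, lowercase to ‘x’):

```python
import re

# Sketch of NER feature templates: prefixes/suffixes and word shape.

def word_shape(token):
    shape = re.sub(r"[A-Z0-9]", "X", token)   # uppercase & digits → X
    return re.sub(r"[a-z]", "x", shape)        # lowercase → x

def affix_features(token, max_len=4):
    feats = {}
    for k in range(1, max_len + 1):
        feats[f"prefix{k}"] = token[:k]        # L / L' / L'O / L'Oc
        feats[f"suffix{k}"] = token[-k:]       # e / ne / ane / tane
    return feats

print(word_shape("L'Occitane"))
print(word_shape("2007-07-02"))                # → XXXX-XX-XX (date!)
```

In a CRF these features would be computed for each token (plus its neighbours) and combined with POS tags and gazetteer lookups.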
- classifier
- deep learning for NER
- A state-of-the-art approach uses LSTMs with character and word embeddings (Lample et al. 2016)
Relation Extraction
- relation extraction
- [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
- Traditionally framed as triple extraction (a relation and two entities):
- unit(American Airlines, AMR Corp.)
- spokesman(Tim Wagner, American Airlines)
- Key question: do we know all the possible relations?
- map relations to a closed set of relations
- unit(American Airlines, AMR Corp.) → subsidiary
- spokesman(Tim Wagner, American Airlines) → employment
- methods
If we have access to a fixed relation database:
- Rule-based
- Supervised
- Semi-supervised
- Distant supervision
- If there are no restrictions on relations:
- Unsupervised
- Sometimes referred to as “OpenIE”
- rule-based relation extraction
- “Agar is a substance prepared from a mixture of red algae such as Gelidium, for laboratory or industrial use.”
- identify linguistic patterns in the sentence
- [NP red algae] such as [NP Gelidium]
- NP0 such as NP1 → hyponym(NP1, NP0)
- hyponym(Gelidium, red algae)
- Lexico-syntactic patterns: high precision, low recall (unlikely to cover all patterns — there are too many!), manual effort required
- more rules
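A toy version of the “NP0 such as NP1” pattern can be run over plain text. Real systems match over NP chunks; the regex below approximates an NP with two adjacent words and is only an illustration:

```python
import re

# Toy lexico-syntactic pattern: "NP0 such as NP1" → hyponym(NP1, NP0).
# "(\w+ \w+)" crudely approximates the NP chunk before "such as".
pattern = re.compile(r"(\w+ \w+) such as (\w+)")

sentence = ("Agar is a substance prepared from a mixture of red algae "
            "such as Gelidium, for laboratory or industrial use.")

for np0, np1 in pattern.findall(sentence):
    print(f"hyponym({np1}, {np0})")
# → hyponym(Gelidium, red algae)
```

The fragility of such hand-written patterns is exactly the low-recall problem noted above: each surface variation (“such NP as”, “NP including”, …) needs its own rule.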
- supervised relation extraction
- Assume a corpus with annotated relations
- Two steps (a single step would suffer from class imbalance: most entity pairs have no relation!)
- First, find if an entity pair is related or not (binary classification)
- For each sentence, gather all possible entity pairs
- Annotated pairs are considered positive examples
- Non-annotated pairs are taken as negative examples
- Second, for pairs predicted as positive, use a multiclass classifier (e.g. SVM) to obtain the relation
- example
- [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
- First:
- (American Airlines, AMR Corp.) → positive
- (American Airlines, Tim Wagner) → positive
- (AMR Corp., Tim Wagner) → negative
- Second:
- (American Airlines, AMR Corp.) → subsidiary
- (American Airlines, Tim Wagner) → employment
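The first (binary) step can be sketched as generating one training example per entity pair in the sentence (function and variable names are illustrative):

```python
from itertools import combinations

# Step 1 sketch: every entity pair in a sentence becomes a training
# example; annotated pairs are positive, all others negative.

def pair_examples(entities, annotated_pairs):
    examples = []
    for e1, e2 in combinations(entities, 2):
        related = (e1, e2) in annotated_pairs or (e2, e1) in annotated_pairs
        examples.append(((e1, e2), "positive" if related else "negative"))
    return examples

entities = ["American Airlines", "AMR Corp.", "Tim Wagner"]
annotated = {("American Airlines", "AMR Corp."),
             ("American Airlines", "Tim Wagner")}
for pair, label in pair_examples(entities, annotated):
    print(pair, label)
# → ('American Airlines', 'AMR Corp.') positive
# → ('American Airlines', 'Tim Wagner') positive
# → ('AMR Corp.', 'Tim Wagner') negative
```

Pairs predicted positive would then go to the second, multiclass classifier to receive a relation label such as `subsidiary` or `employment`.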
- features
- [ORG American Airlines], a unit of [ORG AMR Corp.], immediately matched the move, spokesman [PER Tim Wagner] said.
- (American Airlines, Tim Wagner) → employment
- semi-supervised relation extraction
- Annotated corpora are very expensive to create
- Use seed tuples to bootstrap a classifier (use seeds to find more training data)
- steps:
- Given seed tuple: hub(Ryanair, Charleroi)
- Find sentences containing terms in seed tuples
- Budget airline Ryanair, which uses Charleroi as a hub, scrapped all weekend flights out of the airport
- Extract general patterns
- [ORG], which uses [LOC] as a hub
- Find new tuples with these patterns
- hub(Jetstar, Avalon)
- Add these new tuples to existing tuples and repeat step 2
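One round of this loop might look like the following toy sketch. The corpus, seed tuple and pattern are illustrative; a real system would induce the pattern from the seed sentence rather than hard-code it:

```python
import re

# Toy bootstrapping round: a pattern generalised from the seed sentence
# ("[ORG], which uses [LOC] as a hub") is matched against the corpus
# to find new hub(ORG, LOC) tuples.

corpus = [
    "Budget airline Ryanair, which uses Charleroi as a hub, "
    "scrapped all weekend flights out of the airport.",
    "Jetstar, which uses Avalon as a hub, added new routes.",
]

tuples = {("Ryanair", "Charleroi")}                  # seed tuple
pattern = re.compile(r"(\w+), which uses (\w+) as a hub")

for sentence in corpus:
    m = pattern.search(sentence)
    if m:
        tuples.add((m.group(1), m.group(2)))         # new tuple found

print(sorted(tuples))
# → [('Jetstar', 'Avalon'), ('Ryanair', 'Charleroi')]
```

In the full loop, newly found tuples like `hub(Jetstar, Avalon)` would be used to extract further patterns, which is where semantic drift can creep in.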
- issues
- Extracted tuples deviate from original relation over time
- semantic drift (deviation from the original relation)
- Pattern: [NP] has a {NP}* hub at [LOC]
- Sydney has a ferry hub at Circular Quay
- hub(Sydney, Circular Quay)
- More erroneous patterns are then extracted from this tuple…
- Should only accept patterns with high confidences
- Difficult to evaluate(no labels for new extracted tuples)
- Extracted general patterns tend to be very noisy
- distant supervision
- Semi-supervised methods assume the existence of seed tuples to mine new tuples
- Can we mine new tuples directly?
- Distant supervision obtains new tuples from a range of sources:
- DBpedia
- Freebase
- Generates massive training sets, enabling the use of richer features, with no risk of semantic drift
unsupervised relation extraction
- No fixed or closed set of relations
- Relations are sub-sentences, usually containing a verb
- “United has a hub in Chicago, which is the headquarters of United Continental Holdings.”
- “has a hub in”(United, Chicago)
- “is the headquarters of”(Chicago, United Continental Holdings)
- Main problem: there are many surface forms for the same relation! Relations must be mapped to canonical forms
- evaluation
- NER: F1-measure at the entity level.
- Relation Extraction with known relation set: F1-measure
- Relation Extraction with unknown relations: much harder to evaluate
- Usually need some human evaluation
- Massive datasets used in these settings are impractical to evaluate manually (use samples)
- Can only obtain (approximate) precision, not recall (too many possible relations!)
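Entity-level F1 compares whole (span, type) units rather than individual token tags, so a partially correct span counts as a full miss. A minimal sketch (the data is illustrative):

```python
# Entity-level F1 sketch: gold and predicted entities are compared as
# whole (start, end, type) units.

def entity_f1(gold, predicted):
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)                       # exact matches only
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = {(0, 2, "PER"), (3, 5, "ORG"), (6, 7, "TIME")}
pred = {(0, 2, "PER"), (3, 4, "ORG"), (6, 7, "TIME")}  # ORG span too short
print(round(entity_f1(gold, pred), 2))  # → 0.67
```

Here the truncated ORG span costs both a false negative and a false positive, which is why entity-level scores are stricter than token-level ones.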
Other IE Tasks
- temporal expression extraction
“[TIME July 2, 2007]: A fare increase initiated [TIME last week] by UAL Corp’s United Airlines was matched by competitors over [TIME the weekend], marking the second successful fare increase in [TIME two weeks].”
- Anchoring: when is “last week”?
- “last week” → 2007-W26
- Normalisation: mapping expressions to canonical forms.
- July 2, 2007 → 2007-07-02
- Mostly rule-based approaches
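Rule-based normalisation and anchoring might be sketched like this. The rules and the month table are toy assumptions, nowhere near a full temporal-expression normaliser:

```python
import re
from datetime import date, timedelta

MONTHS = {"July": 7}  # toy table; a real system covers all months

def normalise(expr, anchor):
    """Map a temporal expression to a canonical form, using `anchor`
    (the document date) to resolve relative expressions."""
    m = re.match(r"(\w+) (\d+), (\d{4})", expr)
    if m:  # absolute date, e.g. "July 2, 2007"
        return f"{m.group(3)}-{MONTHS[m.group(1)]:02d}-{int(m.group(2)):02d}"
    if expr == "last week":  # relative: anchor to the document date
        year, week, _ = (anchor - timedelta(weeks=1)).isocalendar()
        return f"{year}-W{week:02d}"
    return None

anchor = date(2007, 7, 2)  # the article's [TIME July 2, 2007]
print(normalise("July 2, 2007", anchor))  # → 2007-07-02
print(normalise("last week", anchor))     # → 2007-W26
```

The second rule shows why anchoring matters: “last week” is meaningless without the document date, and the same expression normalises differently in a different article.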
- event extraction
- “American Airlines, a unit of AMR Corp., immediately [EVENT matched] [EVENT the move], spokesman Tim Wagner [EVENT said].”
- Very similar to NER but with different tags; annotation and learning methods are similar.
- Event ordering: detect how a set of events happened in a timeline.
- Involves both event extraction and temporal expression extraction.
Conclusion
- Information Extraction is a vast field with many different tasks and applications
- Named Entity Recognition
- Relation Extraction
- Event Extraction
- Machine learning methods involve classifiers and sequence labelling models.