TextRank is a graph-based natural language processing algorithm commonly used for automatic text summarization. It works much like the PageRank algorithm behind Google's search engine: the text units (here, sentences) form the nodes of a graph, the edges are weighted by how similar two units are, and each unit's importance is estimated from the "votes" it receives from the units linked to it.
Automatic summarization means automatically extracting the key sentences from an article. What is a key sentence? A human understands it as a sentence that captures the central idea of the article; a machine can only approximate that understanding by defining a weighting scheme, scoring every sentence, and returning the top-ranked ones.
1. Introduction to TextRank
For details, see the blog post: 利用TextRank算法提取摘要关键词以及Java实现 (CSDN blog).
2. The TextRank Automatic Summarization Algorithm
TextRank-based automatic summarization takes five steps:
- Split the original text into individual sentences.
- Index the words and vectorize each sentence with a bag-of-words representation (word vectors).
- Compute the similarity between every pair of sentence vectors and store the values in a similarity matrix.
- Build the transition matrix M from the similarity matrix and iteratively update each sentence's TextRank (TR) score (see the formula sketch right after this list).
- Take a number of top-ranked sentences as the final summary.
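For reference, the iterative update in step 4 corresponds to the weighted PageRank formula, which is exactly what the solve() method of the implementation below computes; here d is the damping factor, w_{ji} is the similarity between sentences j and i, and TR(S_i) is the score of sentence i:

TR(S_i) = (1 - d) + d \cdot \sum_{j \neq i} \frac{w_{ji}}{\sum_{k \neq j} w_{jk}} \cdot TR(S_j)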
3. Java Implementation of TextRank Summarization
The Java implementation of TextRank summarization below is based mainly on the version in the open-source toolkit HanLP, described at:
TextRank算法自动摘要的Java实现-码农场
The main change is replacing its word segmenter with the ANSJ segmenter, which makes segmentation more accurate and easier to customize.
3.1 Word segmentation dependency:
<dependency>
    <groupId>org.ansj</groupId>
    <artifactId>ansj_seg</artifactId>
    <version>5.1.6</version>
</dependency>
For usage details, see: 开源中文分词Ansj的简单使用 (CSDN blog).
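Only two Ansj calls are needed in this article: ToAnalysis.parse(...) to segment a string and Term.getName() to read each word back. A minimal standalone sketch (the class name AnsjDemo is just for illustration):

import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.List;

public class AnsjDemo {
    public static void main(String[] args) {
        // Segment a sentence; each Term carries the word and its part of speech
        List<Term> terms = ToAnalysis.parse("这类算法在有限的时间内终止").getTerms();
        for (Term term : terms) {
            System.out.print(term.getName() + " "); // print the surface form of each word
        }
    }
}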
3.2 BM25: Search Relevance Scoring Algorithm
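For reference, the sim() method below computes the standard BM25 score of a query sentence Q against sentence S_i, with sentences playing the role of documents:

score(Q, S_i) = \sum_{q \in Q} \mathrm{IDF}(q) \cdot \frac{f(q, S_i) \cdot (k_1 + 1)}{f(q, S_i) + k_1 \cdot \left(1 - b + b \cdot \frac{|S_i|}{avgdl}\right)}, \qquad \mathrm{IDF}(q) = \ln\frac{D - n(q) + 0.5}{n(q) + 0.5}

where D is the number of sentences, n(q) is the number of sentences containing q, f(q, S_i) is the frequency of q in S_i, |S_i| is the length of S_i, and avgdl is the average sentence length; k_1 = 1.5 and b = 0.75, as in the code.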
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
/**
* Search relevance scoring algorithm (BM25)
*
* @author hankcs
*/
public class BM25 {
/**
* Number of sentences in the document
*/
int D;
/**
* Average sentence length in the document
*/
double avgdl;
/**
* The document, split into [sentence[word]] form
*/
List<List<String>> docs;
/**
* Per sentence, the term frequency of each word
*/
Map<String, Integer>[] f;
/**
* For every word in the document, the number of sentences it appears in
*/
Map<String, Integer> df;
/**
* IDF
*/
Map<String, Double> idf;
/**
* Tuning parameter k1
*/
final static float k1 = 1.5f;
/**
* Tuning parameter b
*/
final static float b = 0.75f;
public BM25(List<List<String>> docs) {
this.docs = docs;
D = docs.size();
for (List<String> sentence : docs) {
avgdl += sentence.size();
}
avgdl /= D;
f = new Map[D];
df = new TreeMap<String, Integer>();
idf = new TreeMap<String, Double>();
init();
}
/**
* Initialize all internal parameters at construction time
*/
private void init() {
int index = 0;
for (List<String> sentence : docs) {
Map<String, Integer> tf = new TreeMap<String, Integer>();
for (String word : sentence) {
Integer freq = tf.get(word);
freq = (freq == null ? 0 : freq) + 1;
tf.put(word, freq);
}
f[index] = tf;
for (Map.Entry<String, Integer> entry : tf.entrySet()) {
String word = entry.getKey();
Integer freq = df.get(word);
freq = (freq == null ? 0 : freq) + 1;
df.put(word, freq);
}
++index;
}
for (Map.Entry<String, Integer> entry : df.entrySet()) {
String word = entry.getKey();
Integer freq = entry.getValue();
idf.put(word, Math.log(D - freq + 0.5) - Math.log(freq + 0.5));
}
}
public double sim(List<String> sentence, int index) {
double score = 0;
for (String word : sentence) {
if (!f[index].containsKey(word)) continue;
int d = docs.get(index).size();
Integer wf = f[index].get(word);
score += (idf.get(word) * wf * (k1 + 1)
/ (wf + k1 * (1 - b + b * d
/ avgdl)));
}
return score;
}
public double[] simAll(List<String> sentence) {
double[] scores = new double[D];
for (int i = 0; i < D; ++i) {
scores[i] = sim(sentence, i);
}
return scores;
}
}
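A minimal usage sketch, not part of the original code (the class name BM25Demo and the toy word lists are made up for illustration): build a BM25 instance from a segmented document and query one sentence against all sentences, which is exactly how the TextRankSentence class below uses it.

import java.util.Arrays;
import java.util.List;

public class BM25Demo {
    public static void main(String[] args) {
        // A toy "document": three sentences, already segmented into words
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("算法", "在", "有限", "时间", "内", "终止"),
                Arrays.asList("算法", "的", "结果", "取决于", "输入"),
                Arrays.asList("无限", "算法", "没有", "终止", "条件"));
        BM25 bm25 = new BM25(docs);
        // Similarity of the first sentence to every sentence in the document
        double[] scores = bm25.simAll(docs.get(0));
        System.out.println(Arrays.toString(scores));
    }
}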
3.3 TextRank Automatic Summarization
import com.timerchina.nlp.segment.SplitWords;
import com.timerchina.nlp.stopword.StopWord;
import com.timerchina.nlp.word.WOD;
import org.ansj.domain.Term;
import org.ansj.splitWord.analysis.ToAnalysis;
import java.util.*;
/**
* TextRank automatic summarization
*
* @author hankcs
*/
public class TextRankSentence {
static {
// Initialize the segmentation filtering code (stop words, etc.); implement this yourself
StopWord.init();
}
/**
* Damping factor, usually set to 0.85
*/
final static double d = 0.85;
/**
* Maximum number of iterations
*/
final static int max_iter = 200;
final static double min_diff = 0.001;
/**
* Number of sentences in the document
*/
int D;
/**
* The document, split into [sentence[word]] form
*/
List<List<String>> docs;
/**
* Final ranking sorted by score: score <-> sentence index
*/
TreeMap<Double, Integer> top;
/**
* How similar each sentence is to every other sentence
*/
double[][] weight;
/**
* Sum of each sentence's similarities to the other sentences
*/
double[] weight_sum;
/**
* Converged weights after iteration
*/
double[] vertex;
/**
* BM25 similarity scorer
*/
BM25 bm25;
public TextRankSentence(List<List<String>> docs) {
this.docs = docs;
bm25 = new BM25(docs);
D = docs.size();
weight = new double[D][D];
weight_sum = new double[D];
vertex = new double[D];
top = new TreeMap<Double, Integer>(Collections.reverseOrder());
solve();
}
private void solve() {
int cnt = 0;
for (List<String> sentence : docs) {
double[] scores = bm25.simAll(sentence);
// System.out.println(Arrays.toString(scores));
weight[cnt] = scores;
weight_sum[cnt] = sum(scores) - scores[cnt]; // subtract the self-similarity; a sentence is always most similar to itself
vertex[cnt] = 1.0;
++cnt;
}
for (int iter = 0; iter < max_iter; ++iter) {
double[] m = new double[D];
double max_diff = 0;
for (int i = 0; i < D; ++i) {
m[i] = 1 - d;
for (int j = 0; j < D; ++j) {
if (j == i || weight_sum[j] == 0) continue;
m[i] += (d * weight[j][i] / weight_sum[j] * vertex[j]);
}
double diff = Math.abs(m[i] - vertex[i]);
if (diff > max_diff) {
max_diff = diff;
}
}
vertex = m;
if (max_diff <= min_diff) break;
}
// sort the sentences by their final scores
for (int i = 0; i < D; ++i) {
top.put(vertex[i], i);
}
}
/**
* Get the top-ranked key sentences
*
* @param size how many sentences to return
* @return indices of the key sentences
*/
public int[] getTopSentence(int size) {
Collection<Integer> values = top.values();
size = Math.min(size, values.size());
int[] indexArray = new int[size];
Iterator<Integer> it = values.iterator();
for (int i = 0; i < size; ++i) {
indexArray[i] = it.next();
}
return indexArray;
}
/**
* Simple summation
*
* @param array the values to sum
* @return the sum
*/
private static double sum(double[] array) {
double total = 0;
for (double v : array) {
total += v;
}
return total;
}
public static void main(String[] args) {
String document = "算法可大致分为基本算法、数据结构的算法、数论算法、计算几何的算法、图的算法、动态规划以及数值分析、加密算法、排序算法、检索算法、随机化算法、并行算法、厄米变形模型、随机森林算法。\n" +
"算法可以宽泛的分为三类,\n" +
"一,有限的确定性算法,这类算法在有限的一段时间内终止。他们可能要花很长时间来执行指定的任务,但仍将在一定的时间内终止。这类算法得出的结果常取决于输入值。\n" +
"二,有限的非确定算法,这类算法在有限的时间内终止。然而,对于一个(或一些)给定的数值,算法的结果并不是唯一的或确定的。\n" +
"三,无限的算法,是那些由于没有定义终止定义条件,或定义的条件无法由输入的数据满足而不终止运行的算法。通常,无限算法的产生是由于未能确定的定义终止条件。";
System.out.println(document.length());
System.out.println(TextRankSentence.getTopSentenceList(document, 5));
System.out.println(TextRankSentence.getSummary(document, 50));
}
/**
* Split the article into individual sentences
*
* @param document the input article
* @return the list of sentences
*/
static List<String> splitSentence(String document) {
List<String> sentences = new ArrayList<String>();
for (String line : document.split("[\r\n]")) {
line = line.trim();
if (line.length() == 0) continue;
for (String sent : line.split("[,,。::??!!;;【】\\s]")) {
sent = sent.trim();
if (sent.length() == 0) continue;
sentences.add(sent);
}
}
return sentences;
}
/**
* Convert the sentence list into a segmented document
*
* @param sentenceList the list of sentences
* @return the document as [sentence[word]]
*/
private static List<List<String>> convertSentenceListToDocument(List<String> sentenceList) {
List<List<String>> docs = new ArrayList<List<String>>(sentenceList.size());
for (String sentence : sentenceList) {
List<Term> termList = ToAnalysis.parse(sentence).getTerms();
StopWord.cleanerPAS(termList); // filter out punctuation; implement this yourself
StopWord.filter(termList); // filter out stop words; implement this yourself
List<String> wordList = new LinkedList<String>();
for (Term term : termList) {
wordList.add(term.getName());
}
docs.add(wordList);
}
return docs;
}
/**
* One-call convenience API
*
* @param document the target document
* @param size number of key sentences wanted
* @return the list of key sentences
*/
public static List<String> getTopSentenceList(String document, int size) {
List<String> sentenceList = splitSentence(document);
List<List<String>> docs = convertSentenceListToDocument(sentenceList);
TextRankSentence textRank = new TextRankSentence(docs);
int[] topSentence = textRank.getTopSentence(size);
List<String> resultList = new LinkedList<String>();
for (int i : topSentence) {
resultList.add(sentenceList.get(i));
}
return resultList;
}
/**
* One-call convenience API
*
* @param document the target document
* @param max_length desired summary length in characters
* @return the summary text
*/
public static String getSummary(String document, int max_length) {
List<String> sentenceList = splitSentence(document);
int sentence_count = sentenceList.size();
int document_length = document.length();
int sentence_length_avg = document_length / sentence_count;
int size = max_length / sentence_length_avg + 1;
List<List<String>> docs = convertSentenceListToDocument(sentenceList);
TextRankSentence textRank = new TextRankSentence(docs);
int[] topSentence = textRank.getTopSentence(size);
List<String> resultList = new LinkedList<String>();
for (int i : topSentence) {
resultList.add(sentenceList.get(i));
}
resultList = permutation(resultList, sentenceList);
resultList = pick_sentences(resultList, max_length);
return join("。", resultList);
}
public static List<String> permutation(List<String> resultList, List<String> sentenceList) {
int index_buffer_x;
int index_buffer_y;
String sen_x;
String sen_y;
int length = resultList.size();
// bubble sort derivative
for (int i = 0; i < length; i++)
for (int offset = 0; offset < length - i; offset++) {
sen_x = resultList.get(i);
sen_y = resultList.get(i + offset);
index_buffer_x = sentenceList.indexOf(sen_x);
index_buffer_y = sentenceList.indexOf(sen_y);
// if the order of the two sentences in resultList does not match their order in sentenceList, swap them
if (index_buffer_x > index_buffer_y) {
resultList.set(i, sen_y);
resultList.set(i + offset, sen_x);
}
}
return resultList;
}
public static List<String> pick_sentences(List<String> resultList, int max_length) {
int length_counter = 0;
int length_buffer;
int length_jump;
List<String> resultBuffer = new LinkedList<String>();
for (int i = 0; i < resultList.size(); i++) {
length_buffer = length_counter + resultList.get(i).length();
if (length_buffer <= max_length) {
resultBuffer.add(resultList.get(i));
length_counter += resultList.get(i).length();
} else if (i < (resultList.size() - 1)) {
length_jump = length_counter + resultList.get(i + 1).length();
if (length_jump <= max_length) {
resultBuffer.add(resultList.get(i + 1));
length_counter += resultList.get(i + 1).length();
i++;
}
}
}
return resultBuffer;
}
public static String join(String delimiter, Collection<String> stringCollection) {
StringBuilder sb = new StringBuilder(stringCollection.size() * (16 + delimiter.length()));
for (String str : stringCollection) {
sb.append(str).append(delimiter);
}
return sb.toString();
}
}
Output:
[这类算法在有限的时间内终止, 这类算法在有限的一段时间内终止, 是那些由于没有定义终止定义条件]
这类算法在有限的一段时间内终止。这类算法在有限的时间内终止。是那些由于没有定义终止定义条件。