[Machine Learning] Cluster Analysis of AAAI Conference Papers

news2024/10/6 0:37:49

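Experiment 5: Cluster Analysis of AAAI Conference Papers

This experiment uses the AAAI 2014 accepted-paper data. The task is to implement or call an unsupervised clustering algorithm and to get familiar with clustering methods.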

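1 Task description

Countless academic conferences of all sizes are held around the world every year, and they publish a very large number of papers. A single large computer science conference can publish several hundred papers covering many research directions. Clustering papers by subject and content helps people find the papers they need efficiently. The data for this case study comes from about 400 papers published at AAAI 2014. It is publicly available from UCI and includes the title, authors, keywords, abstract, and other fields. The goal is to build reasonable feature vectors from this information to represent the papers, and to implement or call a clustering algorithm to cluster them. Finally, the clusters can be inspected to see what kind of papers each one contains and whether any themes emerge.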

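1.1 Basic requirements:

  1. Convert the text into vectors, then implement or call an unsupervised clustering algorithm to cluster the papers, for example into 10 clusters (existing toolkits such as sklearn may be used);
  2. Inspect the papers in each cluster and tune the algorithm until the result looks reasonable;
  3. Unsupervised clustering has no labels, so its quality is hard to assess. There is therefore no hard metric; getting the pipeline to run is enough. The main goal is to get a feel for clustering algorithms, and the task is fairly simple.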

1.2 Extended requirement:

  1. Reduce the dimensionality of the text vectors and visualize the clustering result as a scatter plot.

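Note: group and topic cannot really be treated as labels, because

  1. an author may have picked one group/topic at submission time even though the paper is also related, or even more related, to another group/topic;
  2. a paper can have several groups and topics, so using them as labels would place some papers in several classes at once; this kind of clustering is not considered here;
  3. group and topic take many distinct values, whereas clustering usually aims at a fixed number of clusters such as 5, 10, or 20;
  4. interested students can think about how to use the group and topic information to evaluate the unsupervised clustering quantitatively; this is not required.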
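Point 4 can be sketched as follows. This is only one possible approach, under the assumption that the first group listed for each paper is used as a weak reference label; the group strings, the cluster labels, and the helper `first_group` are all made up for illustration. Agreement is measured with normalized mutual information (NMI) and the adjusted Rand index (ARI) from sklearn.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score


def first_group(groups_field: str) -> str:
    """Return the first group listed for a paper ('' if the field is empty)."""
    return groups_field.split("\n")[0].strip()


# Toy stand-ins for data_df['groups'] and for the labels returned by a clusterer.
groups = [
    "Game Theory and Economic Paradigms (GTEP)",
    "Game Theory and Economic Paradigms (GTEP)\nMultiagent Systems (MAS)",
    "Vision (VIS)",
    "Vision (VIS)",
    "NLP and Text Mining (NLPTM)",
    "NLP and Text Mining (NLPTM)",
]
cluster_labels = [0, 0, 1, 1, 2, 1]

# Weak reference labels: one group name per paper.
reference = [first_group(g) for g in groups]

# Both scores are invariant to how the clusters are numbered.
nmi = normalized_mutual_info_score(reference, cluster_labels)
ari = adjusted_rand_score(reference, cluster_labels)
print(f"NMI = {nmi:.3f}, ARI = {ari:.3f}")
```

Because of points 1 to 3 above, such scores are only a rough indication: a low value may reflect noisy group assignments rather than a poor clustering.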

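1.3 Hints:

  1. Dimensionality reduction of high-dimensional vectors aims to remove highly correlated feature dimensions while keeping the most useful information, so that the data is represented by lower-dimensional vectors; common methods include PCA and t-SNE;
  2. Dimensionality reduction and clustering are two different things. Clustering can be run either on the high-dimensional vectors before reduction or on the low-dimensional vectors after it, and the results can differ substantially;
  3. When clustering is done on the high-dimensional vectors, it is normal for points of the same cluster to end up apart in the reduced visualization. They may be close together in the high-dimensional space, and some information is lost by the reduction.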
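Hint 2 can be checked with a small experiment. The sketch below uses synthetic data from `make_blobs` as a stand-in for the paper vectors (an assumption, not the AAAI data), runs KMeans once on the original 50-dimensional points and once on their 2-D PCA projection, and compares the two label sets with the adjusted Rand index.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the document vectors: 200 points in 50 dimensions.
X, _ = make_blobs(n_samples=200, n_features=50, centers=4,
                  cluster_std=6.0, random_state=0)

# Project to 2 dimensions with PCA.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

# Cluster before and after the reduction with identical settings.
labels_high = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
labels_low = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_2d)

# ARI = 1 means both clusterings agree exactly (up to renaming the clusters).
agreement = adjusted_rand_score(labels_high, labels_low)
print(f"agreement between high-dim and 2-D clustering: ARI = {agreement:.3f}")
```

How far the score falls below 1 depends on how much structure the projection discards, which is the information loss described in hint 3.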
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
import re
import nltk
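import sklearn
import sklearn.manifold # imported explicitly so that sklearn.manifold.TSNE resolves
import seaborn as sns # plotting
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from scipy import sparse # sparse matrices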

RANDOM_STATE = 2023

2 Loading the data

data_df = pd.read_csv('./data/[UCI] AAAI-14 Accepted Papers - Papers.csv') # read the csv file into a pandas DataFrame
data_df.head(5) 
index | title | authors | groups | keywords | topics | abstract
0 | Kernelized Bayesian Transfer Learning | Mehmet Gönen and Adam A. Margolin | Novel Machine Learning Algorithms (NMLA) | cross-domain learning\ndomain adaptation\nkern... | APP: Biomedical / Bioinformatics\nNMLA: Bayesi... | Transfer learning considers related but distin...
1 | "Source Free" Transfer Learning for Text Class... | Zhongqi Lu, Yin Zhu, Sinno Pan, Evan Xiang, Yu... | AI and the Web (AIW)\nNovel Machine Learning A... | Transfer Learning\nAuxiliary Data Retrieval\nT... | AIW: Knowledge acquisition from the web\nAIW: ... | Transfer learning uses relevant auxiliary data...
2 | A Generalization of Probabilistic Serial to Ra... | Haris Aziz and Paul Stursberg | Game Theory and Economic Paradigms (GTEP) | social choice theory\nvoting\nfair division\ns... | GTEP: Game Theory\nGTEP: Social Choice / Voting | The probabilistic serial (PS) rule is one of t...
3 | Lifetime Lexical Variation in Social Media | Liao Lizi, Jing Jiang, Ying Ding, Heyan Huang ... | NLP and Text Mining (NLPTM) | Generative model\nSocial Networks\nAge Prediction | AIW: Web personalization and user modeling\nNL... | As the rapid growth of online social media att...
4 | Hybrid Singular Value Thresholding for Tensor ... | Xiaoqin Zhang, Zhengyuan Zhou, Di Wang and Yi Ma | Knowledge Representation and Reasoning (KRR)\n... | tensor completion\nlow-rank recovery\nhybrid s... | KRR: Knowledge Representation (General/Other)\... | In this paper, we study the low-rank tensor co...

Inspect the DataFrame:

data_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 398 entries, 0 to 397
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   title     398 non-null    object
 1   authors   398 non-null    object
 2   groups    396 non-null    object
 3   keywords  398 non-null    object
 4   topics    394 non-null    object
 5   abstract  398 non-null    object
dtypes: object(6)
memory usage: 18.8+ KB

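The output above shows that data_df contains missing values (in groups and topics), which need to be handled.

# stack() turns the boolean DataFrame into a Series indexed by (row, column); [lambda x: x] keeps only the True entries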
data_df.isnull().stack()[lambda x: x]
211  groups    True
340  groups    True
344  topics    True
364  topics    True
365  topics    True
388  topics    True
dtype: bool

Fill the missing values with empty strings:

data_df = data_df.fillna('') # replace missing values with empty strings
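The missing-value check and the fill step can be tried on a small frame first. The sketch below uses a made-up three-row DataFrame (`toy_df`, with invented cell values) rather than the real CSV, and also shows a per-column count via `isna().sum()` as a complement to the stacked view.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for data_df; the NaN cells mimic the missing groups/topics.
toy_df = pd.DataFrame({
    "title": ["paper A", "paper B", "paper C"],
    "groups": ["GTEP", np.nan, "VIS"],
    "topics": [np.nan, "topic X", "topic Y"],
})

# Number of missing cells in each column.
missing_per_column = toy_df.isna().sum()

# (row, column) positions of the missing cells, same idea as isnull().stack()[lambda x: x].
mask = toy_df.isna().stack()
missing_cells = list(mask[mask].index)

# Replace every missing cell with an empty string so string concatenation works later.
filled = toy_df.fillna("")

print(missing_per_column.to_dict())
print(missing_cells)
```

Filling with an empty string matters for the next step: concatenating a string column with a missing value would otherwise make the whole concatenated cell missing.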

3 Text vectorization

3.1 Simple text vectorization

Concatenate all fields of each paper into one string and vectorize the text with a TF-IDF model:

paper_df = data_df['title']+' '+data_df['authors']+' '+data_df['groups']+' '\
+data_df['keywords']+' '+data_df['topics']+' '+data_df['abstract']

paper_df

Output:

0      Kernelized Bayesian Transfer Learning Mehmet G...
1      "Source Free" Transfer Learning for Text Class...
2      A Generalization of Probabilistic Serial to Ra...
3      Lifetime Lexical Variation in Social Media Lia...
4      Hybrid Singular Value Thresholding for Tensor ...
                             ...                        
393    Mapping Users Across Networks by Manifold Alig...
394    Compact Aspect Embedding For Diversified Query...
395    Contraction and Revision over DL-Lite TBoxes Z...
396    Zero Pronoun Resolution as Ranking Chen Chen a...
397    Supervised Transfer Sparse Coding Maruan Al-Sh...
Length: 398, dtype: object
vectorizer = TfidfVectorizer(max_df=0.9, min_df=10)
X_simple = vectorizer.fit_transform(paper_df)
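The `max_df` and `min_df` arguments control which terms survive into the vocabulary. A minimal sketch on a made-up four-sentence corpus (not the paper data) shows the effect: a float `max_df` drops terms that occur in more than that fraction of documents, and an integer `min_df` drops terms that occur in fewer than that many documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny toy corpus; the real input would be paper_df.
docs = [
    "we study transfer learning for text classification",
    "we study transfer learning with auxiliary data",
    "we study voting rules and social choice",
    "we analyse social choice and fair division",
]

# No filtering: every token matched by the default pattern becomes a feature.
full = TfidfVectorizer().fit(docs)

# Keep only terms found in at least 2 documents and in at most 60% of them.
filtered = TfidfVectorizer(min_df=2, max_df=0.6).fit(docs)

print(len(full.vocabulary_), sorted(full.vocabulary_))
print(len(filtered.vocabulary_), sorted(filtered.vocabulary_))
```

With 398 papers, `max_df=0.9` removes terms present in more than 90% of the papers, and `min_df=10` removes terms present in fewer than 10 papers, which is what keeps X_simple reasonably small.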

3.2 More elaborate text vectorization

  1. Split the author string into individual names
def author_tokenizer(text): 
    authors = re.split(r"\s+and\s+|\s*,\s*", text) # split on commas or on the word "and"
    return authors

authors = data_df['authors'][1]
author_split = author_tokenizer(authors)
print(authors,'\n',author_split)

Output:

Zhongqi Lu, Yin Zhu, Sinno Pan, Evan Xiang, Yujing Wang and Qiang Yang 
 ['Zhongqi Lu', 'Yin Zhu', 'Sinno Pan', 'Evan Xiang', 'Yujing Wang', 'Qiang Yang']
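  2. Tokenize the other text fields, remove stop words, and stem the tokens
# Requires the NLTK tokenizer models ('punkt', or 'punkt_tab' on recent NLTK releases)
# and the 'stopwords' corpus, which can be fetched with nltk.download(...)
def text_tokenizer(text):
    # tokenize
    words = nltk.tokenize.word_tokenize(text)
    # remove stop words
    stop_words = set(nltk.corpus.stopwords.words('english'))
    words = [word for word in words if word.lower() not in stop_words]
    # stem
    stemmer = nltk.stem.PorterStemmer()
    words = [stemmer.stem(word) for word in words]
    return words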

abstracts=data_df['abstract'][1]
abstracts_split = text_tokenizer(abstracts)
print(abstracts,'\n',abstracts_split)

Output:

Transfer learning uses relevant auxiliary data to help the learning task in a target domain where labeled data are usually insufficient to train an accurate model. Given appropriate auxiliary data, researchers have proposed many transfer learning models. How to find such auxiliary data, however, is of little research in the past. In this paper, we focus on this auxiliary data retrieval problem, and propose a transfer learning framework that effectively selects helpful auxiliary data from an open knowledge space (e.g. the World Wide Web). Because there is no need of manually selecting auxiliary data for different target domain tasks, we call our framework Source Free Transfer Learning (SFTL). For each target domain task, SFTL framework iteratively queries for the helpful auxiliary data based on the learned model and then updates the model using the retrieved auxiliary data. We highlight the automatic constructions of queries and the robustness of the SFTL framework. Our experiments on the 20 NewsGroup dataset and the Google search snippets dataset suggest that the new framework is capable to have the comparable performance to those state-of-the-art methods with dedicated selections of auxiliary data. 
 ['transfer', 'learn', 'use', 'relev', 'auxiliari', 'data', 'help', 'learn', 'task', 'target', 'domain', 'label', 'data', 'usual', 'insuffici', 'train', 'accur', 'model', '.', 'given', 'appropri', 'auxiliari', 'data', ',', 'research', 'propos', 'mani', 'transfer', 'learn', 'model', '.', 'find', 'auxiliari', 'data', ',', 'howev', ',', 'littl', 'research', 'past', '.', 'paper', ',', 'focu', 'auxiliari', 'data', 'retriev', 'problem', ',', 'propos', 'transfer', 'learn', 'framework', 'effect', 'select', 'help', 'auxiliari', 'data', 'open', 'knowledg', 'space', '(', 'e.g', '.', 'world', 'wide', 'web', ')', '.', 'need', 'manual', 'select', 'auxiliari', 'data', 'differ', 'target', 'domain', 'task', ',', 'call', 'framework', 'sourc', 'free', 'transfer', 'learn', '(', 'sftl', ')', '.', 'target', 'domain', 'task', ',', 'sftl', 'framework', 'iter', 'queri', 'help', 'auxiliari', 'data', 'base', 'learn', 'model', 'updat', 'model', 'use', 'retriev', 'auxiliari', 'data', '.', 'highlight', 'automat', 'construct', 'queri', 'robust', 'sftl', 'framework', '.', 'experi', '20', 'newsgroup', 'dataset', 'googl', 'search', 'snippet', 'dataset', 'suggest', 'new', 'framework', 'capabl', 'compar', 'perform', 'state-of-the-art', 'method', 'dedic', 'select', 'auxiliari', 'data', '.']

Column names:

data_df.columns

Output:

Index(['title', 'authors', 'groups', 'keywords', 'topics', 'abstract'], dtype='object')

Build the TF-IDF matrices:

vectorizer_authour = TfidfVectorizer(tokenizer = author_tokenizer)
vectorizer_text = TfidfVectorizer(tokenizer = text_tokenizer)
X_authours = vectorizer_authour.fit_transform(data_df['authors'].tolist()) 
X_title = vectorizer_text.fit_transform(data_df['title'].tolist()) 
X_groups = vectorizer_text.fit_transform(data_df['groups'].tolist()) 
X_keywords = vectorizer_text.fit_transform(data_df['keywords'].tolist()) 
X_topics = vectorizer_text.fit_transform(data_df['topics'].tolist()) 

vectorizer_texts = TfidfVectorizer(max_df=0.9, min_df=5, tokenizer = text_tokenizer)
X_abstract = vectorizer_texts.fit_transform(data_df['abstract'].tolist()) 

print(f'X_title:{X_title.shape}')
print(f'X_authours:{X_authours.shape}')
print(f'X_groups:{X_groups.shape}')
print(f'X_keywords:{X_keywords.shape}')
print(f'X_topics:{X_topics.shape}')
print(f'X_abstract:{X_abstract.shape}')

Output:

X_title:(398, 1124)
X_authours:(398, 1105)
X_groups:(398, 64)
X_keywords:(398, 1051)
X_topics:(398, 305)
X_abstract:(398, 1042)

Concatenate the sparse matrices column-wise:

X_passage = sparse.hstack([X_title, X_authours, X_groups, X_keywords, X_topics, X_abstract]) # concatenate the sparse blocks
print(X_passage.shape)
(398, 4691)
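The printed width is the sum of the six block widths: 1124 + 1105 + 64 + 1051 + 305 + 1042 = 4691. The mechanics of `sparse.hstack` can be seen on two tiny hand-written blocks (toy values, not TF-IDF output). Passing `format="csr"` is an optional choice made here so that rows can be sliced directly.

```python
import numpy as np
from scipy import sparse

# Two toy feature blocks for the same 3 documents (the row counts must match).
X_a = sparse.csr_matrix(np.array([[1.0, 0.0],
                                  [0.0, 2.0],
                                  [3.0, 0.0]]))
X_b = sparse.csr_matrix(np.array([[0.0, 0.0, 4.0],
                                  [5.0, 0.0, 0.0],
                                  [0.0, 6.0, 0.0]]))

# Column-wise concatenation: the result has 2 + 3 = 5 columns.
X_all = sparse.hstack([X_a, X_b], format="csr")

print(X_all.shape)
print(X_all.toarray())
```

Each block keeps its own TF-IDF scaling after concatenation, so a block with many columns (such as the titles or authors) contributes more dimensions to the distance computation than a small one (such as the groups).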

4 Clustering

4.1 Simple clustering

Cluster X_simple directly with plain KMeans:

k = 5 # assume 5 clusters
model = KMeans(n_clusters=k, init='k-means++', max_iter=100, n_init=1)
model.fit(X_simple)
labels = model.labels_
data_df['label'] = labels
labels

Output:

array([1, 1, 4, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 0, 1, 4, 1, 1, 1, 1, 3,
       2, 1, 0, 1, 1, 2, 1, 1, 4, 0, 1, 1, 4, 3, 1, 4, 1, 4, 1, 3, 1, 0,
       4, 3, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 3, 4, 1, 1, 4, 1, 3, 1, 1, 4,
       3, 4, 1, 3, 4, 2, 1, 1, 1, 1, 3, 4, 1, 4, 1, 1, 1, 1, 1, 1, 1, 4,
       1, 1, 1, 1, 0, 1, 1, 2, 1, 4, 1, 1, 3, 1, 1, 1, 3, 2, 4, 0, 1, 3,
       4, 2, 1, 3, 1, 2, 1, 4, 1, 1, 1, 1, 1, 0, 4, 1, 1, 0, 1, 0, 1, 3,
       1, 1, 4, 4, 1, 1, 0, 1, 3, 1, 1, 1, 1, 1, 0, 1, 0, 4, 1, 1, 0, 2,
       1, 2, 1, 0, 1, 1, 1, 4, 3, 1, 2, 1, 4, 3, 0, 2, 3, 4, 0, 3, 3, 1,
       1, 2, 4, 3, 3, 4, 1, 1, 3, 2, 1, 0, 4, 4, 4, 4, 2, 1, 1, 3, 0, 4,
       2, 1, 2, 0, 1, 1, 3, 3, 0, 1, 1, 1, 1, 1, 3, 1, 1, 1, 0, 1, 0, 1,
       1, 1, 1, 1, 4, 1, 3, 1, 1, 1, 3, 1, 1, 4, 1, 2, 3, 0, 2, 3, 1, 1,
       1, 1, 1, 4, 1, 0, 1, 1, 2, 1, 4, 1, 1, 1, 0, 1, 1, 1, 1, 4, 1, 1,
       1, 4, 0, 1, 1, 1, 4, 1, 4, 2, 1, 1, 1, 2, 1, 3, 1, 0, 1, 2, 2, 1,
       1, 3, 1, 1, 1, 3, 2, 1, 3, 4, 1, 1, 1, 1, 1, 1, 1, 4, 4, 1, 1, 4,
       0, 1, 1, 3, 0, 4, 2, 0, 1, 4, 1, 2, 4, 3, 1, 1, 3, 3, 3, 1, 1, 1,
       4, 1, 1, 2, 2, 1, 4, 4, 2, 1, 3, 0, 4, 4, 1, 0, 0, 4, 3, 1, 1, 1,
       3, 1, 3, 1, 3, 0, 1, 4, 1, 1, 1, 1, 1, 1, 4, 1, 1, 1, 4, 2, 1, 1,
       3, 1, 3, 0, 1, 1, 0, 1, 1, 3, 1, 1, 2, 2, 1, 2, 4, 0, 1, 1, 1, 3,
       1, 1])

Summarize what each cluster contains:

data_df[data_df['label']==4][['title', 'groups', 'topics']]
index | title | groups | topics
2 | A Generalization of Probabilistic Serial to Ra... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting
16 | Multi-Organ Exchange: The Whole is Greater tha... | Applications (APP)\nGame Theory and Economic P... | APP: Biomedical / Bioinformatics\nGTEP: Auctio...
30 | The Computational Rise and Fall of Fairness | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
34 | Lazy Defenders Are Almost Optimal Against Dili... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Imperfect Information
37 | Game-theoretic Resource Allocation for Protect... | Applications (APP)\nGame Theory and Economic P... | APP: Security and Privacy\nGTEP: Game Theory\n...
39 | A Strategy-Proof Online Auction with Time Disc... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems
44 | Simultaneous Cake Cutting | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
57 | Solving Imperfect Information Games Using Deco... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
60 | Online (Budgeted) Social Choice | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Votin...
65 | Fixing a Balanced Knockout Tournament | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting
67 | Incomplete Preferences in Single-Peaked Electo... | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting\nGTEP: Imperfect ...
70 | A Control Dichotomy for Pure Scoring Rules | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
77 | Biased Games | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium
79 | Preference Elicitation and Interview Minimizat... | Game Theory and Economic Paradigms (GTEP)\nMul... | APP: Computational Social Science\nGTEP: Socia...
87 | Minimising Undesired Task Costs in Multi-robot... | Multiagent Systems (MAS)\nRobotics (ROB) | GTEP: Auctions and Market-Based Systems\nMAS: ...
97 | Congestion Games for V2G-Enabled EV Charging | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with...
106 | Evolutionary dynamics of learning algorithms o... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Adversarial Learning\nGTEP: Equilibrium\...
110 | A Game-theoretic Analysis of Catalog Optimization | Game Theory and Economic Paradigms (GTEP)\nKno... | GTEP: Auctions and Market-Based Systems\nGTEP:...
117 | Automatic Game Design via Mechanic Generation | Game Playing and Interactive Entertainment (GPIE) | GPIE: AI in Game Design\nGPIE: Procedural Cont...
124 | False-Name Bidding and Economic Efficiency in ... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:...
134 | Mechanism Design for Scheduling with Uncertain... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:...
135 | Robust Winners and Winner Determination Polici... | Game Theory and Economic Paradigms (GTEP)\nMul... | APP: Computational Social Science\nGTEP: Socia...
149 | Regret Transfer and Parameter Optimization | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
161 | Trading Multiple Indivisible Goods with Indiff... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Votin...
166 | Item Bidding for Combinatorial Public Projects | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Coordination and Coll...
171 | Increasing VCG revenue by decreasing the quali... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nMAS: ...
178 | Theory of Cooperation in Complex Social Networks | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Coordination and Coll...
181 | Prices Matter for the Parameterized Complexity... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin...
188 | Incentives for Truthful Information Elicitatio... | Game Theory and Economic Paradigms (GTEP)\nHum... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
189 | Equilibria in Epidemic Containment Games | Applications (APP)\nComputational Sustainabili... | APP: Security and Privacy\nCSAI: Modeling the ...
190 | Beat the Cheater: Computing Game-Theoretic Str... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
191 | A Characterization of the Single-Peaked Single... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin...
197 | Efficient buyer groups for prediction-of-use e... | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with...
224 | On Detecting Nearly Structured Preference Prof... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Social Choice / Voting
233 | Betting Strategies, Market Selection, and the ... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems
245 | Leveraging Fee-Based, Imperfect Advisors in Hu... | Humans and AI (HAI) | HAI: Human-Computer Interaction
252 | On the Structure of Synergies in Cooperative G... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory
261 | On the Incompatibility of Efficiency and Strat... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Social Choice / Voting
265 | Regret-based Optimization and Preference Elici... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory
270 | Modal Ranking: A Uniquely Robust Voting Rule | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
272 | Extending Tournament Solutions | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
295 | On Computing Optimal Strategies in Open List P... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Social Choice / Votin...
303 | Envy-Free Division of Sellable Goods | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:...
304 | Potential-Aware Imperfect-Recall Abstraction w... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Imperfect Information
307 | Voting with Rank Dependent Scoring Rules | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:...
313 | Incentivizing High-quality Content from Hetero... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
317 | New Models for Competitive Contagion | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium
320 | Approximate Equilibrium and Incentivizing Soci... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Coordination and Coll...
330 | Internally Stable Kidney Exchange | Multiagent Systems (MAS) | GTEP: Auctions and Market-Based Systems\nMAS: ...
336 | Strategyproof exchange with multiple private e... | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:...
337 | Mechanism design for mobile geo-location adver... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Auctions and Market-Based Systems\nGTEP:...
342 | A Multiarmed Bandit Incentive Mechanism for Cr... | Computational Sustainability and AI (CSAI)\nGa... | CSAI: Modeling the interactions of agents with...
343 | Binary Aggregation by Selection of the Most Re... | Game Theory and Economic Paradigms (GTEP)\nKno... | GTEP: Social Choice / Voting\nKRR: Preferences...
347 | Bounding the Support Size in Extensive Form Ga... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...
359 | The Fisher Market Game: Equilibrium and Welfare | Game Theory and Economic Paradigms (GTEP) | GTEP: Auctions and Market-Based Systems\nGTEP:...
366 | On the Axiomatic Characterization of Runoff Vo... | Game Theory and Economic Paradigms (GTEP) | GTEP: Social Choice / Voting
370 | Solving Zero-Sum Security Games in Discretized... | Game Theory and Economic Paradigms (GTEP)\nMul... | GTEP: Game Theory\nGTEP: Equilibrium\nMAS: Mul...
390 | Using Response Functions to Measure Strategy S... | Game Theory and Economic Paradigms (GTEP) | GTEP: Game Theory\nGTEP: Equilibrium\nGTEP: Im...

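Inspecting each cluster in this way gives roughly the following picture:

  • 0: mainly VIS (computer vision) papers
  • 1: mainly AIW and web-related papers
  • 2: mainly NMLA and algorithm papers
  • 3: mainly GTEP (game theory) papers
  • 4: mainly AIW and NLP papers

Note that the KMeans call in this subsection does not set random_state, so the cluster numbers change from run to run; in the output printed above, for example, the GTEP papers ended up in cluster 4.

These results show that simple clustering does split the papers into several groups, but the groups still overlap.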

# create a TSNE object that reduces to 2 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=2, random_state=RANDOM_STATE, init="random")

# fit_transform on X_simple returns the reduced array, stored in X_tsne
X_tsne = tsne.fit_transform(X_simple)

sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=labels, palette="deep") # scatter plot


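[Figure: 2-D t-SNE scatter plot of X_simple, colored by the k = 5 cluster labels]

The plot above shows that simple clustering does find clusters, but they still overlap.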

4.2 More elaborate clustering

Cluster the X_passage matrix built in Section 3.2 into 10 clusters:

model = KMeans(n_clusters=10,  init='k-means++', max_iter=100, n_init=1, random_state=RANDOM_STATE) # KMeans clustering
model.fit(X_passage)
labels = model.labels_
data_df['label'] = labels
labels
array([2, 4, 3, 4, 0, 0, 8, 2, 4, 0, 5, 6, 2, 4, 6, 8, 3, 4, 4, 0, 1, 9,
       3, 2, 2, 8, 4, 5, 4, 8, 3, 5, 2, 2, 7, 9, 2, 7, 8, 7, 2, 1, 2, 6,
       3, 9, 2, 4, 4, 5, 4, 4, 4, 4, 2, 1, 9, 7, 1, 2, 3, 2, 4, 8, 4, 3,
       9, 3, 8, 9, 3, 9, 2, 2, 8, 8, 9, 7, 8, 3, 1, 4, 4, 0, 8, 1, 2, 3,
       8, 2, 2, 4, 6, 1, 2, 5, 0, 7, 8, 4, 9, 4, 2, 4, 6, 5, 7, 6, 8, 9,
       7, 5, 0, 9, 3, 5, 4, 7, 2, 0, 4, 2, 6, 6, 3, 8, 2, 6, 2, 9, 2, 1,
       6, 2, 3, 3, 8, 0, 6, 4, 9, 2, 6, 4, 2, 8, 5, 2, 6, 7, 2, 0, 6, 3,
       2, 5, 4, 6, 2, 4, 2, 3, 9, 2, 5, 8, 3, 1, 9, 3, 9, 3, 6, 9, 9, 2,
       8, 5, 3, 9, 9, 3, 8, 0, 1, 5, 4, 5, 3, 7, 7, 3, 5, 2, 2, 6, 6, 3,
       5, 2, 5, 6, 8, 4, 4, 9, 6, 8, 0, 1, 2, 2, 6, 4, 8, 4, 6, 2, 6, 0,
       4, 2, 8, 1, 3, 2, 4, 4, 4, 8, 6, 8, 2, 7, 4, 5, 3, 6, 5, 8, 2, 4,
       2, 2, 4, 4, 8, 5, 2, 2, 5, 0, 7, 2, 2, 4, 5, 0, 2, 2, 2, 3, 4, 0,
       2, 7, 5, 4, 1, 4, 3, 2, 3, 5, 2, 2, 8, 5, 2, 9, 4, 6, 2, 5, 5, 0,
       2, 9, 2, 4, 1, 9, 5, 2, 9, 3, 9, 4, 2, 2, 2, 8, 2, 3, 7, 8, 4, 3,
       9, 8, 5, 1, 6, 7, 5, 6, 2, 7, 2, 5, 7, 9, 8, 6, 9, 1, 9, 2, 2, 2,
       3, 2, 8, 5, 5, 8, 7, 3, 5, 8, 1, 6, 7, 3, 8, 6, 6, 7, 9, 0, 0, 4,
       9, 2, 1, 2, 9, 6, 2, 7, 8, 4, 6, 8, 2, 2, 3, 4, 1, 0, 7, 5, 2, 8,
       9, 4, 1, 6, 2, 8, 6, 0, 9, 9, 6, 4, 5, 5, 2, 5, 7, 6, 2, 1, 4, 9,
       4, 2])
data_df[data_df['label']==9][['title', 'groups', 'topics']]
index | title | groups | topics
21 | The Complexity of Reasoning with FODD and GFODD | Knowledge Representation and Reasoning (KRR) | KRR: Automated Reasoning and Theorem Proving\n...
35 | PREGO: An Action Language for Belief-Based Cog... | Knowledge Representation and Reasoning (KRR) | KRR: Action, Change, and Causality\nKRR: Knowl...
45 | Recovering from Selection Bias in Causal and S... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nRU: Bayesi...
56 | A Parameterized Complexity Analysis of General... | Game Playing and Interactive Entertainment (GP... | GTEP: Social Choice / Voting\nKRR: Computation...
66 | Querying Inconsistent Description Logic Knowle... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Computational Complexity...
69 | Knowledge Graph Embedding by Translating on Hy... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Knowledge Representation (General/Other)\...
71 | Fast consistency checking of very large real-w... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Geometric, Spatial, and Temporal Reasonin...
76 | The Computational Complexity of Structure-Base... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nKRR: Compu...
100 | A Tractable Approach to ABox Abduction over De... | Knowledge Representation and Reasoning (KRR) | KRR: Description Logics\nKRR: Diagnosis and Ab...
109 | Reasoning on LTL on Finite Traces: Insensitivi... | Knowledge Representation and Reasoning (KRR) | AIW: AI for web services: semantic description...
113 | Programming by Example using Least General Gen... | Applications (APP)\nHeuristic Search and Optim... | APP: Intelligent User Interfaces\nAPP: Other A...
129 | Using Model-Based Diagnosis to Improve Softwar... | Applications (APP)\nKnowledge Representation a... | APP: Other Applications\nKRR: Automated Reason...
140 | Confident Reasoning on Raven’s Progressive Mat... | Knowledge Representation and Reasoning (KRR) | KRR: Geometric, Spatial, and Temporal Reasonin...
162 | SenticNet 3: A Common and Common-Sense Knowled... | Cognitive Systems (CS)\nKnowledge Representati... | CS: Conceptual inference and reasoning\nKRR: C...
168 | Backdoors to Planning | Knowledge Representation and Reasoning (KRR)\n... | KRR: Computational Complexity of Reasoning\nPS...
170 | Datalog Rewritability of Disjunctive Datalog P... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Automated Reasoning and ...
173 | The Most Uncreative Examinee: A First Step tow... | Knowledge Representation and Reasoning (KRR) | KRR: Automated Reasoning and Theorem Proving
174 | Acquiring Commonsense Knowledge for Sentiment ... | Human-Computation and Crowd Sourcing (HCC)\nKn... | HCC: Domain-specific implementation challenges...
179 | Explanation-Based Approximate Weighted Model C... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Logic Programming\nRU: Probabilistic Infe...
180 | A Knowledge Compilation Map for Ordered Real-V... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR...
205 | A reasoner for the RCC-5 and RCC-8 calculi ext... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Computational Complexity of Reasoning\nKR...
279 | Computing General First-order Parallel and Pri... | Knowledge Representation and Reasoning (KRR) | KRR: Common-Sense Reasoning\nKRR: Nonmonotonic...
287 | Data Quality in Ontology-based Data Access: Th... | Knowledge Representation and Reasoning (KRR) | APP: Other Applications\nKRR: Ontologies\nKRR:...
291 | Diagnosing Analogue Linear Systems Using Dynam... | Knowledge Representation and Reasoning (KRR) | KRR: Diagnosis and Abductive Reasoning
294 | Elementary Loops Revisited | Knowledge Representation and Reasoning (KRR) | KRR: Logic Programming
296 | Joint Morphological Generation and Syntactic L... | NLP and Knowledge Representation (NLPKR) | NLPKR: Natural Language Processing (General/Ot...
308 | Implementing GOLOG in Answer Set Programming | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nKRR: Logic...
321 | Qualitative Reasoning with Modelica Models | Applications (APP)\nKnowledge Representation a... | APP: Other Applications\nKRR: Knowledge Repres...
324 | Pathway Specification and Comparative Queries:... | Knowledge Representation and Reasoning (KRR) | APP: Biomedical / Bioinformatics\nKRR: Knowled...
326 | Testable Implications of Linear Structural Equ... | Knowledge Representation and Reasoning (KRR)\n... | KRR: Action, Change, and Causality\nRU: Graphi...
348 | Exploiting Support Sets for Answer Set Program... | Knowledge Representation and Reasoning (KRR) | KRR: Ontologies\nKRR: Description Logics\nKRR:...
352 | Local-To-Global Consistency Implies Tractabili... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR...
356 | Exploring the Boundaries of Decidable Verifica... | Knowledge Representation and Reasoning (KRR) | KRR: Action, Change, and Causality\nKRR: Geome...
374 | Managing Change in Graph-structured Data Using... | Knowledge Representation and Reasoning (KRR) | KRR: Computational Complexity of Reasoning\nKR...
382 | Coactive Learning for Locally Optimal Problem ... | Humans and AI (HAI)\nKnowledge Representation ... | HCC: Active learning from imperfect human labe...
383 | Large Scale Analogical Reasoning | Cognitive Systems (CS)\nKnowledge Representati... | CS: Conceptual inference and reasoning\nCS: St...
395 | Contraction and Revision over DL-Lite TBoxes | Knowledge Representation and Reasoning (KRR) | KRR: Belief Change\nKRR: Description Logics\nK...

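Inspecting each cluster shows that every cluster now has a fairly clear theme:

  • 0: mainly VIS and other vision-related papers
  • 1: mainly AIW and ROB papers
  • 2: mainly NMLA (machine learning) papers
  • 3: mainly GTEP (game theory) papers
  • 4: mainly AIW and social-network papers
  • 5: mainly SCS and HSO (search) papers
  • 6: mainly PS and CS (planning and strategy) papers
  • 7: mainly GTEP papers
  • 8: mainly APP and MLA papers
  • 9: mainly KRR (knowledge representation and reasoning) papers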
# create a TSNE object that reduces to 2 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=2, random_state=RANDOM_STATE, init="random")

# fit_transform on X_passage returns the reduced array, stored in X_tsne
X_tsne = tsne.fit_transform(X_passage)

sns.scatterplot(x=X_tsne[:,0], y=X_tsne[:,1], hue=labels, palette="deep") # scatter plot


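[Figure: 2-D t-SNE scatter plot of X_passage, colored by the 10 cluster labels]

The plot above shows that, after tokenizing the authors separately and stemming the other text fields, the clusters are better separated.

5 Analysis of the clustering quality

This section looks at how different values of k affect the clustering, and which k works best on this data set.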

k_range = range(5,15)
label_dict = {}
for k in k_range:
    model = KMeans(n_clusters=k,  init='k-means++', max_iter=100, n_init=1, random_state=RANDOM_STATE)
    model.fit(X_passage)
    labels = model.labels_
    label_dict[k]=labels
label_dict[7]
array([0, 0, 5, 6, 3, 3, 3, 0, 6, 3, 4, 0, 0, 6, 2, 3, 5, 6, 6, 3, 3, 1,
       5, 0, 0, 3, 6, 4, 6, 6, 5, 4, 0, 0, 5, 1, 0, 5, 6, 5, 0, 1, 0, 3,
       5, 1, 0, 3, 3, 4, 6, 6, 6, 3, 0, 6, 1, 5, 3, 0, 5, 0, 6, 3, 6, 5,
       1, 5, 3, 1, 5, 1, 0, 0, 6, 3, 1, 5, 3, 5, 6, 3, 6, 3, 3, 6, 0, 5,
       3, 0, 0, 6, 2, 3, 0, 4, 3, 5, 3, 3, 1, 6, 0, 6, 1, 4, 5, 2, 3, 1,
       5, 4, 3, 1, 5, 4, 6, 3, 0, 3, 3, 0, 3, 2, 5, 3, 0, 2, 0, 1, 0, 1,
       1, 0, 5, 5, 3, 3, 2, 3, 1, 0, 3, 6, 0, 3, 4, 0, 2, 5, 0, 3, 3, 5,
       0, 4, 6, 2, 0, 6, 0, 5, 1, 0, 4, 3, 5, 6, 2, 5, 1, 5, 2, 1, 1, 0,
       3, 4, 5, 1, 1, 5, 3, 3, 1, 4, 3, 6, 5, 5, 5, 5, 4, 0, 0, 1, 2, 5,
       4, 0, 4, 2, 3, 3, 6, 1, 2, 3, 3, 6, 0, 3, 1, 3, 3, 6, 2, 0, 2, 3,
       6, 3, 3, 3, 5, 0, 3, 6, 3, 3, 1, 3, 0, 5, 6, 4, 5, 2, 4, 3, 0, 3,
       0, 0, 6, 6, 3, 4, 0, 0, 4, 3, 5, 0, 0, 6, 2, 3, 0, 0, 0, 5, 6, 3,
       0, 5, 4, 0, 6, 6, 5, 0, 5, 4, 0, 0, 3, 4, 0, 1, 3, 2, 0, 4, 4, 3,
       0, 1, 0, 6, 6, 1, 4, 0, 1, 5, 3, 6, 0, 0, 0, 3, 0, 5, 5, 0, 6, 5,
       1, 3, 4, 1, 2, 5, 4, 2, 0, 5, 0, 4, 5, 1, 3, 1, 1, 1, 1, 0, 0, 0,
       5, 0, 3, 4, 4, 3, 5, 5, 4, 3, 1, 2, 5, 5, 3, 2, 2, 5, 1, 3, 3, 6,
       1, 0, 1, 0, 1, 2, 0, 5, 3, 3, 1, 3, 0, 0, 5, 6, 6, 3, 5, 4, 0, 3,
       1, 3, 6, 2, 3, 3, 2, 3, 1, 1, 3, 3, 4, 4, 0, 4, 5, 2, 0, 6, 6, 1,
       3, 0])
# create a 2 x 5 grid of subplots
fig, axes = plt.subplots(2, 5, figsize=(25, 10))

# fill the grid with the 10 plots, one per k
for k, label in label_dict.items():
    row, col = divmod(k-5, 5)  # row and column of the subplot for this k
    ax = axes[row, col]
    
    sns.scatterplot(x=X_tsne[:, 0], y=X_tsne[:, 1], hue=label, palette="deep", ax=ax)
    ax.set_title("cluster = %d" % k)

# adjust the subplot layout
plt.tight_layout()
plt.show()


[Figure: 2 x 5 grid of 2-D t-SNE scatter plots of X_passage, one per k from 5 to 14]

# create a TSNE object that reduces to 3 dimensions, with RANDOM_STATE as the random seed
tsne = sklearn.manifold.TSNE(n_components=3, random_state=RANDOM_STATE, init="random")

# fit_transform on X_passage returns the reduced array, stored in X_tsne
X_tsne = tsne.fit_transform(X_passage)

# create one large figure holding 10 3-D subplots
fig, axes = plt.subplots(2, 5, figsize=(25, 10), subplot_kw={'projection': '3d'})

# fill the figure with the 10 plots, one per k
for k, ax in zip(label_dict.keys(), axes.flatten()):
    # 3-D scatter plot, colored by the cluster labels for this k
    ax.scatter(X_tsne[:, 0], X_tsne[:, 1], X_tsne[:, 2], c=label_dict[k], cmap='Dark2')
    # set the title and its font size
    ax.set_title("cluster = %d" % k, fontsize=16)

# adjust the subplot layout
plt.tight_layout()
plt.show()


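[Figure: 2 x 5 grid of 3-D t-SNE scatter plots of X_passage, one per k from 5 to 14]

As the 2-D and 3-D plots show, none of the KMeans runs with k from 5 to 14 produces a particularly clean clustering. Judged by eye, k = 7 looks slightly better than the others.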
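The comparison above is visual. A numeric complement, not used in the original experiment, is the mean silhouette coefficient, computed for each k on the vectors that were clustered. The sketch below uses synthetic `make_blobs` data as a stand-in for X_passage, so the numbers say nothing about the AAAI papers; it only illustrates the procedure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for X_passage: 300 points drawn around 4 centers.
X, _ = make_blobs(n_samples=300, n_features=20, centers=4,
                  cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    # Mean silhouette coefficient in [-1, 1]; higher means tighter, better separated clusters.
    scores[k] = silhouette_score(X, labels)

# Pick the k with the highest score.
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k = {k}: silhouette = {s:.3f}")
print("best k:", best_k)
```

On the real data the loop would run over `range(5, 15)` with X_passage in place of X. Scores on sparse, high-dimensional TF-IDF vectors are often low for every k, so the relative ordering is more informative than the absolute values.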
