《斯坦福数据挖掘教程·第三版》读书笔记(英文版) Chapter 6 Frequent Itemsets

news2025/1/11 18:08:38

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 6 Frequent Itemsets

The market-basket model of data is used to describe a common form of many-many relationship between two kinds of objects. On the one hand, we have items, and on the other we have baskets, sometimes called “transactions.” Each basket consists of a set of items (an itemset), and usually we assume that the number of items in a basket is small – much smaller than the total number of items. The number of baskets is usually assumed to be very large, bigger than what can fit in main memory. The data is assumed to be represented in a file consisting of a sequence of baskets. In terms of the distributed file system, the baskets are the objects of the file, and each basket is of type “set of items.”

We assume there is a number s, called the support threshold. If I is a set of items, the support for I is the number of baskets for which I is a subset. We say I is frequent if its support is s or more.

Suppose that we set our threshold at s = 3 s = 3 s=3. Then there are five frequent singleton itemsets: {dog}, {cat}, {and}, {a}, and {training}.

在这里插入图片描述

A doubleton cannot be frequent unless both items in the set are frequent by themselves.

Applications of frequent-itemset analysis is not limited to market baskets. The same model can be used to mine many other kinds of data. Some examples are:

  1. Related concepts: Let items be words, and let baskets be documents. A basket/document contains those items/words that are present in the document. If we look for sets of words that appear together in many documents, the sets will be dominated by the most common words (stop words). There, even though the intent was to find snippets that talked about cats and dogs, the stop words “and” and “a” were prominent among the frequent itemsets. However, if we ignore all the most common words, then we would hope to find among the frequent pairs some pairs of words that represent a joint concept.
  2. Plagiarism: Let the items be documents and the baskets be sentences. An item/document is “in” a basket/sentence if the sentence is in the document. This arrangement appears backwards, but it is exactly what we need, and we should remember that the relationship between items and baskets is an arbitrary many-many relationship. That is, “in” need not have its conventional meaning: “part of.” In this application, we look for pairs of items that appear together in several baskets. If we find such a pair, then we have two documents that share several sentences in common. In practice, even one or two sentences in common is a good indicator of plagiarism.
  3. Biomarkers: Let the items be of two types – biomarkers such as genes or blood proteins, and diseases. Each basket is the set of data about a patient: their genome and blood-chemistry analysis, as well as their medical history of disease. A frequent itemset that consists of one disease and one or more biomarkers suggests a test for the disease.

Thus, we define the interest of an association rule I → j I → j Ij to be the difference between its confidence and the fraction of baskets that contain j.

There is another approach to storing counts that may be more appropriate, depending on the fraction of the possible pairs of items that actually appear in some basket. We can store counts as triples [ i , j , c ] [i, j, c] [i,j,c], meaning that the count of pair { i , j } \{i, j\} {i,j}, with i < j i < j i<j, is c c c. A data structure, such as a hash table with i i i and j j j as the search key, is used so we can tell if there is a triple for a given i i i and j j j and, if so, to find it quickly. We call this approach the triples method of storing counts.

the A-Priori Algorithm, one pass is taken for each set-size k. If no frequent itemsets of a certain size are found, then monotonicity tells us there can be no larger frequent itemsets, so we can stop.

The pattern of moving from one size k to the next size k + 1 k + 1 k+1 can be summarized as follows. For each size k, there are two sets of itemsets:

  1. C k C_k Ck is the set of candidate itemsets of size k – the itemsets that we must count in order to determine whether they are in fact frequent.
  2. L k L_k Lk is the set of truly frequent itemsets of size k k k. The pattern of moving from one set to the next and one size to the next is suggested by Fig. 6.4.

在这里插入图片描述

Summary of Chapter 6

  • Market-Basket Data: This model of data assumes there are two kinds of entities: items and baskets. There is a many–many relationship between items and baskets. Typically, baskets are related to small sets of items, while items may be related to many baskets.
  • Frequent Itemsets: The support for a set of items is the number of baskets containing all those items. Itemsets with support that is at least some threshold are called frequent itemsets.
  • Association Rules: These are implications that if a basket contains a certain set of items I, then it is likely to contain another particular item j as well. The probability that j is also in a basket containing I is called the confidence of the rule. The interest of the rule is the amount by which the confidence deviates from the fraction of all baskets that contain j.
  • The Pair-Counting Bottleneck: To find frequent itemsets, we need to examine all baskets and count the number of occurrences of sets of a certain size. For typical data, with a goal of producing a small number of itemsets that are the most frequent of all, the part that often takes the most main memory is the counting of pairs of items. Thus, methods for finding frequent itemsets typically concentrate on how to minimize the main memory needed to count pairs.
  • Triangular Matrices: While one could use a two-dimensional array to count pairs, doing so wastes half the space, because there is no need to count pair { i , j } \{i, j\} {i,j} in both the i-j and j-i array elements. By arranging the pairs ( i , j ) (i, j) (i,j) for which i < j i < j i<j in lexicographic order, we can store only the needed counts in a one-dimensional array with no wasted space, and yet be able to access the count for any pair efficiently.
  • Storage of Pair Counts as Triples: If fewer than 1/3 of the possible pairs actually occur in baskets, then it is more space-efficient to store counts of pairs as triples ( i , j , c ) (i, j, c) (i,j,c), where c is the count of the pair { i , j } \{i, j\} {i,j}, and i < j i < j i<j. An index structure such as a hash table allows us to find the triple for ( i , j ) (i, j) (i,j) efficiently.
  • Monotonicity of Frequent Itemsets: An important property of itemsets is that if a set of items is frequent, then so are all its subsets. We exploit this property to eliminate the need to count certain itemsets by using its contrapositive: if an itemset is not frequent, then neither are its supersets.
  • The A-Priori Algorithm for Pairs: We can find all frequent pairs by making two passes over the baskets. On the first pass, we count the items themselves, and then determine which items are frequent. On the second pass, we count only the pairs of items both of which are found frequent on the first pass. Monotonicity justifies our ignoring other pairs.
  • Finding Larger Frequent Itemsets: A-Priori and many other algorithms allow us to find frequent itemsets larger than pairs, if we make one pass over the baskets for each size itemset, up to some limit. To find the frequent itemsets of size k, monotonicity lets us restrict our attention to only those itemsets such that all their subsets of size k − 1 k − 1 k1 have already been found frequent.
  • The PCY Algorithm: This algorithm improves on A-Priori by creating a hash table on the first pass, using all main-memory space that is not needed to count the items. Pairs of items are hashed, and the hash-table buckets are used as integer counts of the number of times a pair has hashed to that bucket. Then, on the second pass, we only have to count pairs of frequent items that hashed to a frequent bucket (one whose count is at least the support threshold) on the first pass.
  • The Multistage Algorithm: We can insert additional passes between the first and second pass of the PCY Algorithm to hash pairs to other, independent hash tables. At each intermediate pass, we only have to hash pairs of frequent items that have hashed to frequent buckets on all previous passes.
  • The Multihash Algorithm: We can modify the first pass of the PCY Algorithm to divide available main memory into several hash tables. On the second pass, we only have to count a pair of frequent items if they hashed to frequent buckets in all hash tables.
  • Randomized Algorithms: Instead of making passes through all the data, we may choose a random sample of the baskets, small enough that it is possible to store both the sample and the needed counts of itemsets in main memory. The support threshold must be scaled down in proportion. We can then find the frequent itemsets for the sample, and hope that it is a good representation of the data as whole. While this method uses at most one pass through the whole dataset, it is subject to false positives (itemsets that are frequent in the sample but not the whole) and false negatives (itemsets that are frequent in the whole but not the sample).
  • The SON Algorithm: An improvement on the simple randomized algorithm is to divide the entire file of baskets into segments small enough that all frequent itemsets for the segment can be found in main memory. Candidate itemsets are those found frequent for at least one segment. A second pass allows us to count all the candidates and find the exact collection of frequent itemsets. This algorithm is especially appropriate in a
    MapReduce setting.
  • Toivonen’s Algorithm: This algorithm starts by finding frequent itemsets in a sample, but with the threshold lowered so there is little chance of missing an itemset that is frequent in the whole. Next, we examine the entire file of baskets, counting not only the itemsets that are frequent in the sample, but also, the negative border – itemsets that have not been found frequent, but all their immediate subsets are. If no member of the
    negative border is found frequent in the whole, then the answer is exact. But if a member of the negative border is found frequent, then the whole process has to repeat with another sample.
  • Frequent Itemsets in Streams: If we use a decaying window with constant c c c, then we can start counting an item whenever we see it in a basket. We start counting an itemset if we see it contained within the current basket, and all its immediate proper subsets already are being counted. As the window is decaying, we multiply all counts by 1 − c 1 − c 1c and eliminate those that are less than 1 / 2 1/2 1/2.

END

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/545616.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

YOLOv8 独家原创改进:独家首发最新原创EfficiCLNMS改进点,改进有效可以直接当做自己的原创改进点来写,新的增强预测帧

💡该教程为属于《芒果书》📚系列,包含大量的原创首发改进方式, 所有文章都是全网首发原创改进内容🚀 💡本篇文章为YOLOv8改进:独家首发最新EfficiCL-NMS改进点,新的增强预测帧率。 💡对自己数据集改进有效的话,可以直接当做自己的原创改进点来写!!!改进点先到…

小白量化《穿云箭集群量化》(7) 巡航导弹策略

小白量化《穿云箭集群量化》&#xff08;7&#xff09; 巡航导弹策略 量化交易策略比较有名的是网格策略&#xff0c;网格策略的缺点是对网格定义不容易&#xff0c;另外通过网格穿越交易也不是最优价格。 穿云箭量化平台提供了巡航导弹策略&#xff0c;可以利用巡航导弹技术自…

无效数据大揭秘——你不知道的那些坑!

进行数据管理时&#xff0c;无效数据可能会对生产力和决策质量造成严重的影响。如何发现和处理无效数据变得愈发重要。一起来唠唠各位大佬是如何处理的&#xff1f; ⭐ 什么是无效数据&#xff1f;⭐ 如何处理无效数据&#xff1f;⭐ 如何减少无效数据&#xff1f;⭐ 无效数据管…

Python入门(十一)while循环(一)

while循环&#xff08;一&#xff09; 1.简介2.使用while循环3.让用户选择何时退出4.使用标志5.使用break退出循环6.在循环中使用continue7.避免无限循环 作者&#xff1a;xiou 1.简介 for循环用于针对集合中的每个元素都执行一个代码块&#xff0c;而while循环则不断运行&am…

css3:精灵图sprite的使用

文章目录 精灵图sprite简介原理优缺点实例通过精灵图实现一个导航栏 精灵图sprite 简介 CSS精灵技术&#xff08;也称CSS Sprites、CSS雪碧&#xff09;&#xff0c;简单来说就是从一张有各种小图标的大图上截取下来一个小图标来使用。 正因为只要加载一张大图片&#xff0c;…

Restful路径下编写controller层及其增删改查

前置&#xff1a;需要先创建好项目&#xff0c;并且使用mabtis根据数据表生成好代码 mybatis plus自动生成代码&#xff08;代码生成器&#xff09;_wa1ttinG的博客-CSDN博客 一、controller层定义 controller层就是和用户打交道&#xff0c;直接与前端进行交互。可调用service…

安全中级1-nginx_host与php处理不同绕过

一、nginx配置证书 1.生成一个ssl.key密钥 openssl genrsa -des3 -out ssl.key 2096 2.创建一个key的目录,并将ssl.key放入到key目录下 mkdir key mv ssl.key key/ cd key 3.将ssl.key修改为xxx.key mv ssl.key xxx.key 4.创建ssl.key密钥 openssl rsa -in xxx.key -out ssl.…

【计算机组成原理】实验二

文章目录 实验二 运算器实验一、实验目的二、实验原理三、运算器功能编码四、设置初始状态任务一 算术运算任务二 逻辑运算任务三 移位运算任务四 进位控制与零标志 实验二 运算器实验 一、实验目的 完成算术、逻辑、移位运算实验&#xff0c;熟悉ALU运算类型的控制位运用。…

华为OD机试真题 Java 实现【硬件产品销售方案】【2023Q1 200分】

一、题目描述 某公司目前推出了AI开发者套件、AI加速卡、AI加速模块、AI服务器、智能边缘多种硬件产品&#xff0c;每种产品包含若干个型号。 现某合作厂商要采购金额为amount元的硬件产品搭建自己的AI基座。 假设当前库存有N种产品&#xff0c;每种产品的库存量充足&#x…

Python爬虫urllib的基础使用详解

文章目录 1、urllib的使用response 服务器返回的数据&#xff1a;一个类型&#xff0c;六个方法urllib.request.urlretrieve(url,filename) 请求下载网页 请求下载图片 请求下载视频 2、请求对象的定制3.编解码post请求方式ajax的get请求ajax的post请求cookie模拟登录使用handl…

策略模式+单例模式(go)

《设计模式&#xff1a;可复用面向对象软件的基础》 策略允许算法独立于使用它的客户端而变化。 算法可以互相替换&#xff0c;不影响客户端 一个类定义多种行为&#xff0c;并且这些行为在这个类操作中以多个条件形式出现。将相关的条件移入它们各自的Strategy类中以替代这些…

Tomcat 的maxConnections、maxThreads、acceptCount 之间的关系

1. 在springboot 项目配置方式 server:port: 8081servlet:context-path: "/account"tomcat:threads:max: 200min-spare: 10max-connections: 8192accept-count: 100connection-timeout: 10000min-spare &#xff1a;最小线程数 最小工作空闲线程数&#xff0c;默认1…

ChatGPT 自定义提示词模板提升使用效率

相关文章推荐&#xff1a; 《提问的艺术&#xff1a;如何通过提示词让 ChatGPT 更准确地理解你的问题&#xff1f;》 《这些免费插件&#xff0c;让你的 ChatGPT 效率爆炸》 一、背景 现在 ChatGPT 异常火爆&#xff0c;很多人都在体验甚至购买 ChatGPT Plus。 现在使用 ChatG…

ProtoBuf之启航

"How do we do we do go refarming~" 如何理解序列化和反序列化? 序列化: 把 对象 转换为 字节序列 的过程 称为对象的序列化。 反序列化: 把 字节序列 恢复为 对象 的过程 称为对象的反序列化。 这两对反义词从概念上来说并不难理解&#xff0c;但是为什么有事没…

Xcode 14.3 和 iOS 16.4 为 SwiftUI 带来了哪些新的功能?

0. 概览 今年年初&#xff0c;Apple 推出了最新的 Xcode 14.3 以及对应的 iOS 16.4 。 与此同时&#xff0c;它们对目前最新的 SwiftUI 4.0 也添加了一些新功能&#xff1a; sheet 弹窗后部视图&#xff08;Interact with a view Behind a sheet&#xff09;可交互&#xff…

高级树结构

二叉排序树 左子树中所有结点的值&#xff0c;均小于其根结点的值。 右子树中所有结点的值&#xff0c;均大于其根结点的值。 二叉搜索树的子树也是二叉搜索树。 注意&#xff1a; 1.二叉查找树不能插入重复元素 2.中序遍历是一个递增的数列 3.高度越小查询效率越高 二叉排序…

RK3568平台开发系列讲解(驱动基础篇)RK 看门狗的使用

🚀返回专栏总目录 文章目录 一、简介二、DTS配置三、使用沉淀、分享、成长,让自己和他人都能有所收获!😄 📢本篇将对RK 看门狗的使用进行学习。 文章目录 一、简介二、DTS配置三、使用一、简介 看门狗(watchdog)是一个定时器,启动之后会开始计时。系统或者软件需要…

一个非常sb的报错……idea创建项目初始化失败……

今天在用idea创建项目时报错项目初始化失败&#xff1b; 一开始以为是配置原因&#xff0c;但后面尝试创建空项目都失败…… 觉得可能跟版本什么的无关&#xff0c;尝试重启、更新系统后&#xff0c;试着以管理员身份运行idea&#xff0c;问题解决了……………… 如果有报错信…

C语言——结构体初阶

哈喽&#xff0c;大家好&#xff0c;今天我们来学习C语言中的结构体&#xff0c;今天主要学习初阶结构体&#xff0c;后期我们将继续学习结构体进阶。 目录 1.结构体类型的声明 1.1 结构体的基础知识 1.2 结构的声明 1.3 结构成员的类型 1.4 结构体变量的定义和初始化 2. …