《斯坦福数据挖掘教程·第三版》读书笔记(英文版)Chapter 4 Mining Data Streams

news2025/1/4 11:53:57

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 4 Mining Data Streams

在这里插入图片描述

💡 Skip this chapter due to its difficulty and for me, it is hard to understand.

Summary of Chapter 4

  • The Stream Data Model: This model assumes data arrives at a processing engine at a rate that makes it infeasible to store everything in active storage. One strategy to dealing with streams is to maintain summaries of the streams, sufficient to answer the expected queries about the data. A second approach is to maintain a sliding window of the most recently
    arrived data.
  • Sampling of Streams: To create a sample of a stream that is usable for a class of queries, we identify a set of key attributes for the stream. By hashing the key of any arriving stream element, we can use the hash value to decide consistently whether all or none of the elements with that key will become part of the sample.
  • Bloom Filters: This technique allows us to filter streams so elements that belong to a particular set are allowed through, while most nonmembers are deleted. We use a large bit array, and several hash functions. Members of the selected set are hashed to buckets, which are bits in the array, and those bits are set to 1. To test a stream element for membership, we hash the element to a set of bits using each of the hash functions, and only
    accept the element if all these bits are 1.
  • Counting Distinct Elements: To estimate the number of different elements appearing in a stream, we can hash elements to integers, interpreted as binary numbers. 2 raised to the power that is the longest sequence of 0’s seen in the hash value of any stream element is an estimate of the number of different elements. By using many hash functions and combining these estimates, first by taking averages within groups, and then taking the median of the averages, we get a reliable estimate.
  • Moments of Streams: The kth moment of a stream is the sum of the k-th powers of the counts of each element that appears at least once in the stream. The 0th moment is the number of distinct elements, and the 1st moment is the length of the stream.
  • Estimating Second Moments: A good estimate for the second moment, or surprise number, is obtained by choosing a random position in the stream, taking twice the number of times this element appears in the stream from that position onward, subtracting 1, and multiplying by the length of the stream. Many random variables of this type can be combined like the estimates for counting the number of distinct elements, to produce a reliable estimate of the second moment.
  • Estimating Higher Moments: The technique for second moments works for kth moments as well, as long as we replace the formula 2 x − 1 2x − 1 2x1 (where x is the number of times the element appears at or after the selected position) by x k − ( x − 1 ) k x^k − (x − 1)^k xk(x1)k.
  • Estimating the Number of 1’s in a Window: We can estimate the number of 1’s in a window of 0’s and 1’s by grouping the 1’s into buckets. Each bucket has a number of 1’s that is a power of 2; there are one or two buckets of each size, and sizes never decrease as we go back in time. If we record only the position and size of the buckets, we can represent the
    contents of a window of size N with O ( l o g 2 2 N ) O(log_2^2 N) O(log22N) space.
  • Answering Queries About Numbers of 1’s: If we want to know the approximate numbers of 1’s in the most recent k elements of a binary stream, we find the earliest bucket B that is at least partially within the last k positions of the window and estimate the number of 1’s to be the sum of the sizes of each of the more recent buckets plus half the size of B. This estimate can never be off by more that 50% of the true count of 1’s.
  • Closer Approximations to the Number of 1’s: By changing the rule for how many buckets of a given size can exist in the representation of a binary window, so that either r or r − 1 of a given size may exist, we can assure that the approximation to the true number of 1’s is never off by more than 1/r.
  • Exponentially Decaying Windows: Rather than fixing a window size, we can imagine that the window consists of all the elements that ever arrived in the stream, but with the element that arrived t time units ago weighted by e − c t e^{−ct} ect for some time-constant c. Doing so allows us to maintain certain summaries of an exponentially decaying window easily. For instance, the weighted sum of elements can be recomputed, when a new element arrives, by multiplying the old sum by 1 − c and then adding the new element.
  • Maintaining Frequent Elements in an Exponentially Decaying Window: We can imagine that each item is represented by a binary stream, where 0 means the item was not the element arriving at a given time, and 1 means that it was. We can find the elements whose sum of their binary stream is at least 1/2. When a new element arrives, multiply all recorded sums by 1 minus the time constant, add 1 to the count of the item that just arrived, and delete from the record any item whose sum has fallen below 1/2.

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/484627.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

【微机原理】半导体存储器

目录 一.半导体存储器的分类 二、半导体存储器性能指标 三、半导体存储器的结构 一.半导体存储器的分类 半导体存储器的分类方法有很多种。 1.按器件原理来分:有双极型存储器和MOS型存储器。 双极型:速度快、集成度低、功耗大MOS型:速度慢、集…

“ 探索迷局:解密广度寻路算法 “

专栏文章,自下而上 数据结构与算法——二叉搜索树 数据结构与算法——深度寻路算法 数据结构与算法——二叉树实现表达式树 数据结构与算法——树(三指针描述一棵树) 数据结构与算法——栈和队列<也不过如此> 数据结构与算法——八…

C++的智能指针

文章目录 1. 内存泄漏1.1 什么是内存泄漏1.2 内存泄漏分类 2. 为什么需要智能指针3. 智能指针的使用及原理3.1 RAII3.2 使用RAII思想设计的SmartPtr类3.3 让SmartPtr像指针一样3.3 SmartPtr的拷贝3.4 auto_ptr3.5 unique_ptr3.6 shared_ptr3.6.1 shared_ptr的循环引用3.6.2 wea…

MYSQL-数据库管理(上)

一、数据库概述 一、数据库基本概念 1.1 数据 1) 描述事物的符号记录称为数据(Data)。数字、文字、图形、图像、声音、档案记录等 都是数据。 2)数据是以“记录”的形式按照统一的格式进行存储的,而不是杂乱无章的。…

机器学习之分类决策树与回归决策树—基于python实现

大家好,我是带我去滑雪! 本期为大家介绍决策树算法,它一种基学习器,广泛应用于集成学习,用于大幅度提高模型的预测准确率。决策树在分区域时,会考虑特征向量对响应变量的影响,且每次仅使用一个分…

vs编译生成动态库

说明 windows版本,vs2019 创建一个动态库 新建一c项目,创建一个dll类型项目。 在头文件中添加一个mylib.h文件: #pragma once#ifndef MYLIB_H #define MYLIB_Hextern "C" __declspec(dllexport) void Hello(); extern "C…

UG NX二次开发(C++)-建模-修改NXObject或者Feature的颜色(一)

文章目录 1、前言2、在UG NX中修改Feature的颜色操作3、采用NXOpen(C)实现3.1 创建修改特征的方法3.2 调用ModifyFeatureColor方法3.3 测试结果 1、前言 在UG NX中,改变NXObject和Feature的操作是不相同的,所以其二次开发的代码也不一样,我们…

优化问题的拉格朗日Lagrange对偶法原理

首先我们定义一般形式的求解x的优化问题: 表示优化的目标函数,上述为最小优化,实际上最大优化可以改写为的形式表示第i个不等式约束表示等式约束 1. Lagrange对偶问题 上述优化问题的拉格朗日Lagrange对偶法求解,是将上述带约束…

Baumer工业相机堡盟工业相机如何联合BGAPISDK和Halcon实现图像的直方图算法增强(C#)

Baumer工业相机堡盟工业相机如何联合BGAPISDK和Halcon实现图像的直方图算法增强(C#) Baumer工业相机Baumer工业相机使用图像算法增加图像的技术背景Baumer工业相机通过BGAPI SDK联合Halcon使用直方图图像增强算法1.引用合适的类文件2.BGAPI SDK在图像回调…

车牌识别系统完整商用级别设计流程

简介 车牌识别(License Plate Recognition)是一种通过计算机视觉技术识别和提取车辆车牌上字符信息的技术。它在交通管理、智慧停车、安防监控等领域有着广泛的应用。 本项目将带完整的了解车牌识别系统设计思路,以及实现流程。 算法部分应…

图书管理系统【控制台+MySQL】(Java课设)

系统类型 控制台类型Mysql数据库存储数据 使用范围 适合作为Java课设!!! 部署环境 jdk1.8Mysql8.0Idea或eclipsejdbc 运行效果 本系统源码地址:https://download.csdn.net/download/qq_50954361/87737294 更多系统资源库地…

机器学习 --- 绪论

第1关:什么是机器学习 任务描述 ​ 本关任务:根据本节课所学知识完成本关所设置的选择题。 相关知识 ​ 为了完成本关任务,你需要掌握: 什么是机器学习。 什么是机器学习 ​ 相信大家一定都非常喜欢吃西…

【Python】【进阶篇】21、Django Admin数据表可视化

目录 21、Django Admin数据表可视化1. 创建超级用户2. 将Model注册到管理后台1)在admin.py文件中声明 3. django_admin_log数据表 21、Django Admin数据表可视化 在《Django Admin后台管理系统》介绍过 Django 的后台管理系统是为了方便站点管理人员对数据表进行操作。Django …

【Java面试八股文】数据库篇

导航: 【黑马Java笔记踩坑汇总】JavaSEJavaWebSSMSpringBoot瑞吉外卖SpringCloud黑马旅游谷粒商城学成在线MySQL高级篇设计模式牛客面试题 目录 请你说说MySQL索引,以及它们的好处和坏处 请你说说MySQL的索引是什么结构,为什么不用哈希表 请你说说数据库索引的底…

R语言 | 编写自己的函数

目录 一、正式编写程序 二、设计第一个函数 三、函数也是一个对象 四、程序代码的简化 五、return()函数的功能 六、省略函数的大括号 七、传递多个参数函数的应用 7.1 设计可传递2个参数的函数 7.2 函数参数的默认值 7.3 3点参数“…”的使用 八、函数也可以作为参数 …

《针灸》笔记(倪海厦先生人纪系列针灸篇)

程序员上了年纪,各种职业病就来了,人工智能成为好工具的同时,自己的时间也多了一些。 了解他才能判断他的真伪,没学过就认为中医是糟粕的,请划走。 学到什么记什么,线上线下齐下手,自用笔记&…

【Python基础入门学习】Python高级变量你了解多少?

认识高级变量 1. 列表 list1.1 列表的定义1.2 列表常用操作关键字、函数和方法 1.3 循环遍历1.4 列表嵌套1.5 应用场景 2. 元组 tuple2.1 元组的定义2.2 元组常用操作2.3 应用场景 3. 字典 dictionary3.1 字典的含义3.2 字典常用操作3.3 应用场景 4. 字符串 string4.1 字符串的…

基于PyQt5的图形化界面开发——模拟医院管理系统

基于PyQt5的图形化界面开发——模拟医院管理系统 0. 前言1. 需求分析2. 挂号界面的思路、UI界面代码及相应触发函数2.1 思路分析2.2 ui_guahao.py2.3 相应的触发函数代码 3. 就诊界面的思路、UI界面代码及相应触发函数3.1 思路分析3.2 ui_jiuzhen.py3.3 相关触发函数&#xff1…

如何制定专属于自己的个人目标

文章目录 前言一、自问自答二、制定目标的原则1.明确性2.衡量性3.可实现性4.相关性5.时限性 三、对目标进行计划1.5W2H1.做什么(what to do it)2.为什么做(why to do it)3.何时做(when to do it)4.何地做(w…

8b/10b编码方式(详细)总结附实例快速理解

目录 前言一、简介二、转换过程2.1 背景2.2 具体转换过程 三、其他相关链接1、PCI总线及发展历程总结2、PCIe物理层总结-PCIE专题知识(一)3、PCIe数据链路层图文总结-PCIe专题知识(二) 前言 本文主要通过图文方式介绍8b/10b编码&…