《斯坦福数据挖掘教程·第三版》读书笔记(英文版) Chapter 1 Data Mining

news2024/12/24 20:25:11

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 1 Data Mining

Now, statisticians view data mining as the construction of a statistical model, that is, an underlying distribution from which the visible data is drawn.

However, machine learning can be uncompetitive in situations where we can describe the goals of the mining more directly. Another problem with some machine-learning methods is that they often yield a model that, while it may be quite accurate, is not explainable.

Most other approaches to modeling can be described as either :

  1. Summarizing the data succinctly and approximately, or
  2. Extracting the most prominent features of the data and ignoring the rest.

Some of the important kinds of feature extraction from large-scale data that we shall study are:

  1. Frequent Itemsets. This model makes sense for data that consists of “baskets” of small sets of items.

  2. Similar Items. Often, your data looks like a collection of sets, and the objective is to find pairs of sets that have a relatively large fraction of their elements in common.

Suppose you have a certain amount of data, and you look for events of a certain type within that data. You can expect events of this type to occur, even if the data is completely random, and the number of occurrences of these events will grow as the size of the data grows. These occurrences are “bogus,” in the sense that they have no cause other than that random data will always have some number of unusual features that look significant but aren’t.

Bonferroni’s principle, that helps us avoid treating random occurrences as if they were real. Calculate the expected number of occurrences of the events you are looking for, on the assumption that data is random. If this number is significantly larger than the number of real instances you hope to find, then you must expect almost anything you find to be bogus, i.e., a statistical artifact rather than evidence of what you are looking for.

A hash function h takes a hash-key value as an argument and produces
a bucket number as a result. The bucket number is an integer, normally in the range 0 to B − 1, where B is the number of buckets. Hash-keys can be of any type. There is an intuitive property of hash functions that they “randomize” hash-keys. To be precise, if hash-keys are drawn randomly from a reasonable population of possible hash-keys, then h will send approximately equal numbers of hash-keys to each of the B buckets.
在这里插入图片描述

For more complex data types, we can extend the idea used for converting strings to integers, recursively.

  • For a type that is a record, each of whose components has its own type, recursively convert the value of each component to an integer, using the algorithm appropriate for the type of that component. Sum the integers for the components, and convert the integer sum to buckets by dividing by B.
  • For a type that is an array, set, or bag of elements of some one type,
    convert the values of the elements’ type to integers, sum the integers, and divide by B.

A hash table used as an index; phone numbers are hashed to buckets, and the entire record is placed in the bucket whose number is the hash value of the phone.

Some other useful approximations follow from the Taylor expansion of e:
e x = ∑ i = 0 ∞ x i i ! e^x=\sum_{i=0}^{\infty}\frac{x^i}{i!} ex=i=0i!xi

There are many phenomena that relate two variables by a power law, that is, a linear relationship between the logarithms of the variables.

Thus, a power law can be written as y = c x a y = cx^a y=cxa for some constants a a a and c c c.

Often, the existence of power laws with values of the exponent higher than 1 are explained by the Matthew effect. In the biblical Book of Matthew, there is a verse about “the rich get richer.” Many phenomena exhibit this behavior, where getting a high value of some property causes that very property to increase. For example, if a Web page has many links in, then people are more likely to find the page and may choose to link to it from one of their pages as well. As another example, if a book is selling well on Amazon, then it is likely to be advertised when customers go to the Amazon site. Some of these people will choose to buy the book as well, thus increasing the sales of this book.

Summary of Chapter 1

  • Data Mining: This term refers to applying the powerful tools of computer science to solve problems in science, industry, and many other application areas. Frequently, the key to a successful application is building a model of the data, that is, a summary or relatively succinct representation of the most relevant features of the data.
  • Bonferroni’s Principle: If we are willing to view as an interesting feature of data something of which many instances can be expected to exist in random data, then we cannot rely on such features being significant. This observation limits our ability to mine data for features that are not sufficiently rare in practice.
  • TF.IDF: The measure called TF.IDF lets us identify words in a collection of documents that are useful for determining the topic of each document. A word has high TF.IDF score in a document if it appears in relatively few documents, but appears in this one, and when it appears in a document it tends to appear many times.
  • Hash Functions: A hash function maps hash-keys of some data type to integer bucket numbers. A good hash function distributes the possible hash-key values approximately evenly among buckets. Any data type can be the domain of a hash function.
  • Indexes: An index is a data structure that allows us to store and retrieve data records efficiently, given the value in one or more of the fields of the record. Hashing is one way to build an index.
  • Storage on Disk: When data must be stored on disk (secondary memory), it takes very much more time to access a desired data item than if the same data were stored in main memory. When data is large, it is important that algorithms strive to keep needed data in main memory.
  • Power Laws: Many phenomena obey a law that can be expressed as y = c x a y = cx^a y=cxa for some power a, often around −2. Such phenomena include the sales of the x-th most popular book, or the number of in-links to the x-th most popular page.

END

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/472826.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

docker上部署程序后无法连接数据库的问题

咱就是说,这个问题差点给我劝退docker。下面说下环境情况。 装了个javaweb程序容器,装了个数据库容器,javaweb容器就是链接不上数据库。 咱也是跟着菜鸟教程的容器互联步骤简历网络链接: 并且启动时增加--networkxxx 都加入到了…

socket通信之select

多线程版本的的socket的server端 #include "socket.hpp" #include <iostream> #include <string> #include <memory>DWORD WINAPI threadProc(LPVOID lp) {SOCKET sClient *(SOCKET*)(lp);while (true) {char buff[1024] { 0 };int result rec…

数据结构初阶 —— 树(堆)

目录 一&#xff0c;堆 堆的概念 向下调整法&#xff08;数组&#xff09; 向上调整法&#xff08;数组&#xff09; 堆的创建&#xff08;建堆&#xff09; 堆的实现 一&#xff0c;堆 堆的概念 如有个关键码的集合K{&#xff0c;&#xff0c;&#xff0c;...&#xf…

Anaconda你不得不知道的若干知识点

Anaconda你不得不知道的若干知识点 1. 查看所有的环境变量2. 加载tensorflow在jupyter中会挂掉怎么办&#xff1f;3. Excel xlsx file&#xff1b; not supported两种解决办法4. (unicode error)5. 统计pandas二维表中的某列的重复值法一&#xff1a;df.loc[:,col_name].value_…

Fabric测试与基础

Fabric官网:Introduction — hyperledger-fabricdocs main documentation 1.测试网络 ./network.sh up #启动./network.sh down #关闭 2.Fabric核心模块 peer:主节点模块&#xff0c;负责存储区块链数据&#xff0c;运行维护链码 orderer:交易打包、排序模块 cryptogen:组织…

【机器学习分支】重要性采样(Importance sampling)学习笔记

重要性采样&#xff08;importance sampling&#xff09;是一种用于估计概率密度函数期望值的常用蒙特卡罗积分方法。其基本思想是利用一个已知的概率密度函数来生成样本&#xff0c;从而近似计算另一个概率密度函数的期望值。 想从复杂概率分布中采样的一个主要原因是能够使用…

uniapp 抖音授权登录、发布、分享 Ba-Aweme

简介&#xff08;下载地址&#xff09; Ba-Aweme 是一个集成抖音的uniapp插件&#xff0c;支持抖音授权登录&#xff0c;发布图片、视频&#xff0c;分享到联系人群组&#xff0c;直接拍摄等。自带选择图片和选择视频方法。 注意&#xff1a; 使用前&#xff0c;先到抖音开放…

LabVIEW CompactRIO 开发指南 3 选择CompactRIO编程模式

第二章 选择CompactRIO编程模式 第一章中介绍的CompactRIO架构为我们提供了通过LabVIEW FPGA定制FPGA硬件或使用NI CompactRIO扫描模式来实现I/O的选项。如果计算机上有LabVIEW Real-Time和LabVIEW FPGA&#xff0c;那么当向LabVIEW项目添加CompactRIO目标时&#xff0c;将提…

BM48-数据流中的中位数

题目 如何得到一个数据流中的中位数&#xff1f;如果从数据流中读出奇数个数值&#xff0c;那么中位数就是所有数值排序之后位于中间的数值。如果从数据流中读出偶数个数值&#xff0c;那么中位数就是所有数值排序之后中间两个数的平均值。我们使用Insert()方法读取数据流&…

【软考网络管理员】2023年软考网管初级常见知识考点(3)- 网络体系结构

【写在前面】也是趁着五一假期前再写几篇分享类的文章给大家&#xff0c;希望看到我文章能给软考网络管理员备考的您带来一些帮助&#xff0c;5月27号也是全国计算机软件考试统一时间&#xff0c;也就不用去各个地方找资料和代码了。紧接着我就把我整理的一些资料分享给大家哈&…

【Python】selenium工具

目录 1. 安装 2. 测试 3. 无头浏览器 4. 元素定位 5. 页面滑动 6. 按键、填写登录表单 7. 页面切换 Selenium是Web的自动化测试工具&#xff0c;为网站自动化测试而开发&#xff0c;Selenium可以直接运行在浏览器上&#xff0c;它支持所有主流的浏览器&#xff0c;可以接…

【Linux - Shell常用命令】- 判断文件是否存在、去掉文件后缀

目录 一、判断文件是否存在1.1 判断目录是否存在1.2 判断文件是否存在1.3 其他文件类型判断 二、字符串截取&#xff08;去掉文件后缀&#xff09;2.1 获取文件后缀2.2 获取文件前缀 一、判断文件是否存在 1.1 判断目录是否存在 将下面代码保存为dirExist.sh &#xff0c;运行…

隐私权限是什么

导读&#xff1a; 隐私权在现代社会对于人们而言是重要的人格权&#xff0c;而随着互联网技术的发展&#xff0c;实践中侵犯隐私权的行为很常见。那么隐私权限是什么&#xff1f;侵犯隐私权的行为有哪些&#xff1f;侵犯他人隐私权要负什么法律责任&#xff1f;接下来将由找法…

Linux-修改虚拟机为静态IP 和 主机名

一、设置虚拟机的IP为静态的 一般情况下&#xff0c;NAT网络连接模式下&#xff0c;DHCP动态分配IP地址的&#xff0c;但这样在每次访问连接Linux虚拟机时&#xff0c;都要先去查询ip地址&#xff0c;很麻烦&#xff0c;干脆就将虚拟机IP地址写死&#xff0c;也就是设置为静态…

数据结构初阶 —— 树(二叉树)

目录 一&#xff0c;二叉树 特殊二叉树 二叉树的性质 二叉树的存储结构 二&#xff0c;二叉树链式结构 二叉树的遍历&#xff08;四种&#xff09; 二叉树接口 试题 一&#xff0c;二叉树 由一个根节点&#xff0c;加上两颗左二叉树和右二叉树组成&#xff0c;可以为空…

python cms建站教程:Wagtail建站(一、安装与基本使用)

最近有个建站的项目&#xff0c;因为python比较熟&#xff0c;为了快速建站想着用cms&#xff0c;但发现网上关于python cms的教程很少&#xff0c;于是自己试着写一个。建站工具采用Wagtail&#xff0c;是一款基于Django框架的cms&#xff0c;自己照着文档摸索了一番&#xff…

线性调频Z变换 CZT

文章目录 【1. 原理】【2. z k z_k zk​ 所在的路径】【3. CZT的实现步骤】【4. CZT的特点 】【5. CZT的应用】5.1 通过 CZT 变换求 DFT5.2 对信号的频谱进行细化分析5.3 求解Z变换X(z)的零、极点5.4 使用CZT进行Keystone变换 【6.相关文献】 线性调频Z变换&#xff08;chirp …

FL Studio中文版V21的主要功能与下载教程

FL Studio21最新版是流行的数字音频工作站(DAW)其最新版本FL Studio 21,主要功能和下载教程如下: FL Studio21中文版功能介绍: 1. 全新界面:采用简洁现代的设计风格,工具栏和菜单进行重组,更加直观。提供智能提示与工作流指导,易于学习和操作。 2. 多显示器支持:可以在不同屏…

野火STM32电机系列(六)Cubemx配置ADC规则和注入通道

前文已经配置了GPIO、编码器 本节讲解CubeMXADC规则和注入通道 本文adc注入通道采用定时器触发&#xff0c;因此在上文定时器配置的基础上进行 常规信号&#xff08;温度等&#xff09;使用带DMA的常规通道连续采样 注入采样由定时器触发&#xff0c;采集电机三相电流&…

科大版中国版ChatGPT来啦!抢先体验

随着文心一言、通义千问等国内顶尖级ChatGPT大模型相继问世&#xff0c;具有语言理解和生成能力的人工智能正在引领行业创新发展。作为人工智能公司中的佼佼者&#xff0c;科大讯飞也开始加入到这场竞争中来。 4月20日&#xff0c;科大讯飞宣布即将于5月6日正式发布其最新的“…