Reading Notes (English Version) on "Stanford Data Mining Tutorial, 3rd Edition": Chapter 2 MapReduce and the New Software Stack

news2024/11/24 14:55:18

Source: the publicly available English book and slides (PPT) that correspond to "Stanford Data Mining Tutorial, 3rd Edition"

Chapter 2 MapReduce and the New Software Stack

A computing cluster is a large collection of commodity hardware: conventional processors (“compute nodes”) connected by Ethernet cables or inexpensive switches.

The software stack begins with a new form of file system, called a “distributed file system,” which features much larger units than the disk blocks in a conventional operating system. Distributed file systems also provide replication of data or redundancy to protect against the frequent media failures that occur when data is distributed over thousands of low-cost compute nodes.
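
To make the chunk-and-replica idea concrete, here is a minimal sketch rather than a description of how any real distributed file system works. The 64 MB chunk size follows the figure quoted in the chapter summary; the replication factor of 3, the node names, and the round-robin placement in the hypothetical `place_chunks` helper are assumptions chosen only for illustration. Rack awareness is deliberately ignored.

```python
import math

CHUNK_SIZE_MB = 64   # chunk size quoted in the chapter summary
REPLICATION = 3      # assumed replication factor (illustrative only)


def place_chunks(file_size_mb, nodes, chunk_size_mb=CHUNK_SIZE_MB,
                 replication=REPLICATION):
    """Split a file into chunks and assign each chunk to distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_chunks = math.ceil(file_size_mb / chunk_size_mb)
    placement = {}
    for chunk_id in range(num_chunks):
        # Round-robin over consecutive nodes, so the replicas of one
        # chunk never land on the same node.
        placement[chunk_id] = [
            nodes[(chunk_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement


def surviving_chunks(placement, failed_node):
    """Return ids of chunks that keep at least one replica after a failure."""
    return [
        chunk_id
        for chunk_id, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]


if __name__ == "__main__":
    layout = place_chunks(200, ["n0", "n1", "n2", "n3"])
    print(layout)
    print(surviving_chunks(layout, "n1"))
```

With three replicas per chunk, the loss of any single node leaves every chunk readable, which is the property the redundancy is meant to provide.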

Compute nodes are mounted on racks, perhaps 8–64 to a rack. The nodes on a single rack are connected by a network, typically gigabit Ethernet. There can be many racks of compute nodes, and racks are connected by another level of network or a switch. The bandwidth of inter-rack communication is somewhat greater than that of the intra-rack Ethernet, but given the number of pairs of nodes that might need to communicate between racks, this bandwidth may be essential.


Some important calculations take minutes or even hours on thousands of compute nodes. If we had to abort and restart the computation every time one component failed, then the computation might never complete successfully.
The solution to this problem takes two forms:

  1. Files must be stored redundantly. If we did not duplicate the file at several compute nodes, then if one node failed, all its files would be unavailable until the node is replaced. If we did not back up the files at all, and the disk crashes, the files would be lost forever.
  2. Computations must be divided into tasks, such that if any one task fails to execute to completion, it can be restarted without affecting other tasks. This strategy is followed by the MapReduce programming system.
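
The second point can be illustrated with a small single-process sketch. It is not how MapReduce schedules work; `run_with_restarts` and `make_flaky_task` are hypothetical helpers, and a node failure is simulated by raising `RuntimeError`. The point it shows is that a failed task is rerun on its own, while tasks that already finished are left alone.

```python
def run_with_restarts(tasks, max_attempts=3):
    """Run each task independently; rerun only the tasks that fail.

    `tasks` maps a task name to a zero-argument callable.
    Returns (results, attempts), both keyed by task name.
    """
    results = {}
    attempts = {}
    for name, task in tasks.items():
        for attempt in range(1, max_attempts + 1):
            attempts[name] = attempt
            try:
                results[name] = task()
                break  # this task is done; other tasks are unaffected
            except RuntimeError:
                continue  # simulated failure: restart just this task
        else:
            raise RuntimeError(f"task {name!r} failed {max_attempts} times")
    return results, attempts


def make_flaky_task(value, failures):
    """Hypothetical task that fails `failures` times, then returns `value`."""
    state = {"remaining": failures}

    def task():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("simulated node failure")
        return value

    return task


if __name__ == "__main__":
    demo = {"a": make_flaky_task(1, 0), "b": make_flaky_task(2, 2)}
    print(run_with_restarts(demo))
```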


Summary of Chapter 2

  • Cluster Computing: A common architecture for very large-scale applications is a cluster of compute nodes (processor chip, main memory, and disk). Compute nodes are mounted in racks, and the nodes on a rack are connected, typically by gigabit Ethernet. Racks are also connected by a high-speed network or switch.
  • Distributed File Systems: An architecture for very large-scale file systems has developed recently. Files are composed of chunks of about 64 megabytes, and each chunk is replicated several times, on different compute nodes or racks.
  • MapReduce: This programming system allows one to exploit parallelism inherent in cluster computing, and manages the hardware failures that can occur during a long computation on many nodes. Many Map tasks and many Reduce tasks are managed by a Master process. Tasks on a failed compute node are rerun by the Master.
  • The Map Function: This function is written by the user. It takes a collection of input objects and turns each into zero or more key-value pairs. Keys are not necessarily unique.
  • The Reduce Function: A MapReduce programming system sorts all the key-value pairs produced by all the Map tasks, forms all the values associated with a given key into a list and distributes key-list pairs to Reduce tasks. Each Reduce task combines the elements on each list, by applying the function written by the user. The results produced by all the Reduce tasks form the output of the MapReduce process.
  • Reducers: It is often convenient to refer to the application of the Reduce function to a single key and its associated value list as a “reducer.”
  • Hadoop: This programming system is an open-source implementation of a distributed file system (HDFS, the Hadoop Distributed File System) and MapReduce (Hadoop itself). It is available through the Apache Foundation.
  • Managing Compute-Node Failures: MapReduce systems support restart of tasks that fail because their compute node, or the rack containing that node, fails. Because Map and Reduce tasks deliver their output only after they finish (the blocking property), it is possible to restart a failed task without concern for possible repetition of the effects of that task. It is necessary to restart the entire job only if the node at which the Master executes fails.
  • Applications of MapReduce: While not all parallel algorithms are suitable for implementation in the MapReduce framework, there are simple implementations of matrix-vector and matrix-matrix multiplication. Also, the principal operators of relational algebra are easily implemented in MapReduce.
  • Workflow Systems: MapReduce has been generalized to systems that support any acyclic collection of functions, each of which can be instantiated by any number of tasks, each responsible for executing that function on a portion of the data.
  • Spark: This popular workflow system introduces Resilient, Distributed Datasets (RDD’s) and a language in which many common operations on RDD’s can be written. Spark has a number of efficiencies, including lazy evaluation of RDD’s to avoid secondary storage of intermediate results and the recording of lineage for RDD’s so they can be reconstructed as needed.
  • TensorFlow: This workflow system is specifically designed to support machine-learning. Data is represented as multidimensional arrays, or tensors, and built-in operations perform many powerful operations, such as linear algebra and model training.
  • Recursive Workflows: When implementing a recursive collection of functions, it is not always possible to preserve the ability to restart any failed task, because recursive tasks may have produced output that was consumed by another task before the failure. A number of schemes for checkpointing parts of the computation to allow restart of single tasks, or restart all tasks from a recent point, have been proposed.
  • Communication-Cost: Many applications of MapReduce or similar systems do very simple things for each task. Then, the dominant cost is usually the cost of transporting data from where it is created to where it is used. In these cases, efficiency of a MapReduce algorithm can be estimated by calculating the sum of the sizes of the inputs to all the tasks.
  • Multiway Joins: It is sometimes more efficient to replicate tuples of the relations involved in a join and have the join of three or more relations computed as a single MapReduce job. The technique of Lagrangean multipliers can be used to optimize the degree of replication for each of the participating relations.
  • Star Joins: Analytic queries often involve a very large fact table joined with smaller dimension tables. These joins can always be done efficiently by the multiway-join technique. An alternative is to distribute the fact table and replicate the dimension tables permanently, using the same strategy as would be used if we were taking the multiway join of the fact table and every dimension table.
  • Replication Rate and Reducer Size: It is often convenient to measure communication by the replication rate, which is the communication per input. Also, the reducer size is the maximum number of inputs associated with any reducer. For many problems, it is possible to derive a lower bound on replication rate as a function of the reducer size.
  • Representing Problems as Graphs: It is possible to represent many problems that are amenable to MapReduce computation by a graph in which nodes represent inputs and outputs. An output is connected to all the inputs that are needed to compute that output.
  • Mapping Schemas: Given the graph of a problem, and given a reducer size, a mapping schema is an assignment of the inputs to one or more reducers so that no reducer is assigned more inputs than the reducer size permits, and yet for every output there is some reducer that gets all the inputs needed to compute that output. The requirement that there be a mapping schema for any MapReduce algorithm is a good expression of what makes MapReduce algorithms different from general parallel computations.
  • Matrix Multiplication by MapReduce: There is a family of one-pass MapReduce algorithms that performs multiplication of n × n matrices with the minimum possible replication rate $r = \frac{2n^2}{q}$, where q is the reducer size. On the other hand, a two-pass MapReduce algorithm for the same problem with the same reducer size can use up to a factor of n less communication.
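
The Map, grouping, and Reduce steps summarized above can be sketched as a single-process simulation. This is only an illustration of the data flow, not Hadoop's API: `map_reduce`, `word_count_map`, `sum_reduce`, and `matrix_vector` are hypothetical names, there is no Master and no parallelism, and the matrix is assumed to be given as `(i, j, m_ij)` triples with the whole vector `v` available to every Map call.

```python
from collections import defaultdict


def word_count_map(document):
    """User-written Map: turn one input into zero or more (key, value) pairs."""
    for word in document.lower().split():
        yield word, 1


def sum_reduce(key, values):
    """User-written Reduce: combine the value list of a single key."""
    return key, sum(values)


def map_reduce(inputs, map_fn, reduce_fn):
    """Simulate one MapReduce job in a single process."""
    # Map phase: apply the user's Map function to every input object.
    pairs = [pair for item in inputs for pair in map_fn(item)]
    # Grouping: sort the pairs and collect the values of each key in a list.
    groups = defaultdict(list)
    for key, value in sorted(pairs):
        groups[key].append(value)
    # Reduce phase: one "reducer" per key and its associated value list.
    return [reduce_fn(key, values) for key, values in groups.items()]


def matrix_vector(entries, v):
    """Matrix-vector product; `entries` holds (i, j, m_ij) triples."""
    def mv_map(entry):
        i, j, m_ij = entry
        yield i, m_ij * v[j]  # key is the row index

    return map_reduce(entries, mv_map, sum_reduce)


if __name__ == "__main__":
    print(map_reduce(["the cat sat", "the cat ran"], word_count_map, sum_reduce))
    print(matrix_vector([(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)], [5, 6]))
```

The same `map_reduce` driver serves both examples because only the user-written Map and Reduce functions change between problems.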

END
