《斯坦福数据挖掘教程·第三版》读书笔记(英文版)Chapter 8 Advertising on the Web

news2025/1/10 16:41:24

来源:《斯坦福数据挖掘教程·第三版》对应的公开英文书和PPT

Chapter 8 Advertising on the Web

There are several factors that must be considered in evaluating ads:

  1. The position of the ad in a list has great influence on whether or not it is clicked. The first on the list has by far the highest probability, and the probability drops off exponentially as the position increases.
  2. The ad may have attractiveness that depends on the query terms. For example, an ad for a used convertible would be more attractive if the search query includes the term “convertible,” even though it might be a valid response to queries that look for that make of car, without specifying whether or not a convertible is wanted.
  3. All ads deserve the opportunity to be shown until their click probability can be approximated closely. If we start all ads out with a click probability of 0, we shall never show them and thus never learn whether or not they are attractive ads.

There will be some constant c less than 1, such that on any input, the result of a particular on-line algorithm is at least c times the result of the optimum off-line algorithm. The constant c, if it exists, is called the competitive ratio for the on-line algorithm.

A matching is a subset of the edges such that no node is an end of two or more edges. A matching is said to be perfect if every node appears in the matching. Note that a matching can only be perfect if the left and right sets are of the same size. A matching that is as large as any other matching for the graph in question is said to be maximal.

Suppose M o M_o Mo is a maximal matching, and M g M_g Mg is the matching that the greedy algorithm produces. Let L be the set of left nodes that are matched in M o M_o Mo but not in M g M_g Mg. Let R be the set of right nodes that are connected by edges to any node in L. We claim that every node in R is matched in M g M_g Mg. Suppose not; in particular, suppose node r in R is not matched in M g M_g Mg. Then the greedy algorithm will eventually consider some edge ( ℓ , r ) (ℓ, r) (,r), where ℓ ℓ is in L. At that time, neither end of this edge is matched, because we have supposed that neither ℓ ℓ nor r is ever matched by the greedy algorithm. That observation contradicts the definition of how the greedy algorithm works; that is, the greedy algorithm would indeed match ( ℓ , r ) (ℓ, r) (,r). We conclude that every node in R is matched in M g M_g Mg.
Now, we know several things about the sizes of sets and matchings.

  1. ∣ M o ∣ ≤ ∣ M g ∣ + ∣ L ∣ |M_o| ≤ |M_g|+|L| MoMg+L, since among the nodes on the left, only nodes in L can be matched in M o M_o Mo but not M g M_g Mg.
  2. ∣ L ∣ ≤ ∣ R ∣ |L| ≤ |R| LR, because in M o M_o Mo, all the nodes in L were matched.
  3. ∣ R ∣ ≤ ∣ M g ∣ |R| ≤ |M_g| RMg, because every node in R is matched in M g M_g Mg.

Definition of the Adwords Problem

Given:

  1. A set of bids by advertisers for search queries.
  2. A click-through rate for each advertiser-query pair.
  3. A budget for each advertiser. We shall assume budgets are for a month, although any unit of time could be used.
  4. A limit on the number of ads to be displayed with each search query.

Respond to each search query with a set of advertisers such that:

  1. The size of the set is no larger than the limit on the number of ads per query.
  2. Each advertiser has bid on the search query.
  3. Each advertiser has enough budget left to pay for the ad if it is clicked upon.

The revenue of a selection of ads is the total value of the ads selected, where the value of an ad is the product of the bid and the click-through rate for the ad and query. The merit of an on-line algorithm is the total revenue obtained over a month (the time unit over which budgets are assumed to apply). We shall try to measure the competitive ratio for algorithms, that is, the minimum total revenue for that algorithm, on any sequence of search queries, divided by the revenue of the optimum off-line algorithm for the same sequence of search queries.

There is a simple improvement to the greedy algorithm that gives a competitive ratio of 3/4 for the simple case of Section 8.4.3. This algorithm, called the Balance Algorithm, assigns a query to the advertiser who bids on the query and has the largest remaining budget. Ties may be broken arbitrarily.

When there are many advertisers, the competitive ratio for the Balance Algorithm can be under 3/4, but not too far below that fraction. The worst case for Balance is as follows.

  1. There are N advertisers, A 1 , A 2 , . . . , A N A_1, A_2, . . . , A_N A1,A2,...,AN .
  2. Each advertiser has a budget B = N ! B = N! B=N!.
  3. There are N queries q 1 , q 2 , . . . , q N q_1, q_2, . . . , q_N q1,q2,...,qN .
  4. Advertiser A i A_i Ai bids on queries q 1 , q 2 , . . . , q i q_1, q_2, . . . , q_i q1,q2,...,qi and no other queries.
  5. The query sequence consists of N rounds. The ith round consists of B occurrences of query q i q_i qi and nothing else.

The competitive ratio is 1 − 1 e 1 − \frac{1}{e} 1e1 , or approximately 0.63.

There are two new elements in the representation.

First, we shall include a status with each list of words. The status is an integer indicating how many of the first words on the list have been matched by the current document. When a bid is stored in the index, its status is always 0.

Second, while the order of words could be lexicographic, we can lower the amount of work by ordering words rarest-first. However, since the number of different words that can appear in emails is essentially unlimited, it is not feasible to order all words in this manner. As a compromise, we might identify the n most common words on the Web or in a sample of the stream of documents we are processing. Here, n might be a hundred thousand or a million. These n words are sorted by frequency, and they occupy the end of the list, with the most frequent words at the very end. All words not among the n most frequent can be assumed equally infrequent and ordered lexicographically. Then, the words of any document can be ordered. If a word does not appear on the list of n frequent words, place it at the front of the order, lexicographically. Those words in the document that do appear on the list of most frequent words appear after the infrequent words, in the reverse order of frequency (i.e., with the most frequent words of the documents ordered last).

The bids are stored in a hash-table, whose hash key is the first word of the bid, in the order explained above. The record for the bid will also include information about what to do when the bid is matched. The status is 0 and need not be stored explicitly. There is another hash table, whose job is to contain copies of those bids that have been partially matched. These bids have a status that is at least 1, but less than the number of words in the set. If the status is i, then the hash-key for this hash table is the **(i + 1)**st word. The arrangement of hash tables is suggested by Fig. 8.5. To process a document, do the following.

  1. Sort the words of the document in the order discussed above. Eliminate duplicate words.

  2. For each word w, in the sorted order:
    (i) Using w as the hash-key for the table of partially matched bids, find those bids having w as key.
    (ii) For each such bid b, if w is the last word of b, move b to the table of matched bids.
    (iii) If w is not the last word of b, add 1 to b’s status, and rehash b using the word whose position is one more than the new status, as the hash-key.
    (iv) Using w as the hash key for the table of all bids, find those bids for which w is their first word in the sorted order.
    (v) For each such bid b, if there is only one word on its list, copy it to the table of matched bids.

    (vi) If b consists of more than one word, add it, with status 1, to the table of partially matched bids, using the second word of b as the hash-key.

  3. Produce the list of matched bids as the output.

在这里插入图片描述

The benefit of the rarest-first order should now be visible. A bid is only copied to the second hash table if its rarest word appears in the document. In comparison, if lexicographic order was used, more bids would be copied to the second hash table. By minimizing the size of that table, we not only reduce the amount of work in steps 2(i)–2(iii), but we make it more likely that this entire table can be kept in main memory.

Summary of Chapter 8

  • Targeted Advertising: The big advantage that Web-based advertising has over advertising in conventional media such as newspapers is that Web advertising can be selected according to the interests of each individual user. This advantage has enabled many Web services to be supported entirely by advertising revenue.
  • On- and Off-Line Algorithms: Conventional algorithms that are allowed to see all their data before producing an answer are called off-line. An on-line algorithm is required to make a response to each element in a stream immediately, with knowledge of only the past, not the future elements in the stream.
  • Greedy Algorithms: Many on-line algorithms are greedy, in the sense that they select their action at every step by minimizing some objective function.
  • Competitive Ratio: We can measure the quality of an on-line algorithm by minimizing, over all possible inputs, the value of the result of the online algorithm compared with the value of the result of the best possible off-line algorithm.
  • Bipartite Matching: This problem involves two sets of nodes and a set of edges between members of the two sets. The goal is to find a maximal matching – as large a set of edges as possible that includes no node more than once.
  • On-Line Solution to the Matching Problem: One greedy algorithm for finding a match in a bipartite graph (or any graph, for that matter) is to order the edges in some way, and for each edge in turn, add it to the match if neither of its ends are yet part of an edge previously selected for the match. This algorithm can be proved to have a competitive ratio of 1/2; that is, it never fails to match at least half as many nodes as the best off-line algorithm matches.
  • Search Ad Management: A search engine receives bids from advertisers on certain search queries. Some ads are displayed with each search query, and the search engine is paid the amount of the bid only if the queryer clicks on the ad. Each advertiser can give a budget, the total amount they are willing to pay for clicks in a month.
  • The Adwords Problem: The data for the adwords problem is a set of bids by advertisers on certain search queries, together with a total budget for each advertiser and information about the historical click-through rate for each ad for each query. Another part of the data is the stream of search queries received by the search engine. The objective is to select on-line a fixed-size set of ads in response to each query that will maximize the revenue to the search engine.
  • Simplified Adwords Problem: To see some of the nuances of ad selection, we considered a simplified version in which all bids are either 0 or 1, only one ad is shown with each query, and all advertisers have the same budget. Under this model the obvious greedy algorithm of giving the ad placement to anyone who has bid on the query and has budget remaining can be shown to have a competitive ratio of 1/2.
  • The Balance Algorithm: This algorithm improves on the simple greedy algorithm. A query’s ad is given to the advertiser who has bid on the query and has the largest remaining budget. Ties can be broken arbitrarily.
  • Competitive Ratio of the Balance Algorithm: For the simplified adwords model, the competitive ratio of the Balance Algorithm is 3/4 for the case of two advertisers and 1 − 1 e 1−\frac{1}{e} 1e1, or about 63% for any number of advertisers.
  • The Balance Algorithm for the Generalized Adwords Problem: When bidders can make differing bids, have different budgets, and have different click-through rates for different queries, the Balance Algorithm awards an ad to the advertiser with the highest value of the function Ψ = x ( 1 − e − f ) . Ψ = x(1−e^{−f}). Ψ=x(1ef). Here, x is the product of the bid and the click-through rate for that advertiser and query, and f is the fraction of the advertiser’s budget that remains unspent.
  • Implementing an Adwords Algorithm: The simplest version of the implementation serves in situations where the bids are on exactly the set of words in the search query. We can represent a query by the list of its words, in sorted order. Bids are stored in a hash table or similar structure, with a hash key equal to the sorted list of words. A search query can then be matched against bids by a straightforward lookup in the table.
  • Matching Word Sets Against Documents: A harder version of the adwords-implementation problem allows bids, which are still small sets of words as in a search query, to be matched against larger documents, such as emails or tweets. A bid set matches the document if all the words appear in the document, in any order and not necessarily adjacent.
  • Hash Storage of Word Sets: A useful data structure stores the words of each bid set in the order rarest-first. Documents have their words sorted in the same order. Word sets are stored in a hash table with the first word, in the rarest-first order, as the key.
  • Processing Documents for Bid Matches: We process the words of the document rarest-first. Word sets whose first word is the current word are copied to a temporary hash table, with the second word as the key. Sets already in the temporary hash table are examined to see if the word that is their key matches the current word, and, if so, they are rehashed using their next word. Sets whose last word is matched are copied to the output.

END

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/558799.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

Linkage Mapper 之 Barrier Mapper 功能解析(含实际案例分析)

✅创作者:陈书予 🎉个人主页:陈书予的个人主页 🍁陈书予的个人社区,欢迎你的加入: 陈书予的社区 🌟专栏地址: Linkage Mapper解密数字世界链接 大草原松鸡"(Greater Sage Grouse)的Lek Kernals(即HCAs),以及连接、连接障碍和恢复机会效果图: 图片

python+Django音乐播放器网站系统0tr3w

音乐网站系统的后台开发目标是以信息管理系统的管理和开发方法,用目前现有的新技术进行系统开发,提供后台管理员高度友好的界面操作以及迅捷的信息处理。而前台的开发目标是以用户的需求作为主导,提供对用户而言非常友好的界面操作环境以及完…

实时频谱-1.1基本概念

RF信号 RF(射频)是Radio Frequency的缩写,表示可以辐射到空间的电磁频率,频率范围从300KHz~30GHz之间。 中频 IF(intermediate frequency),用来在中频衡量AM或FM调谐器抑制外来干扰的能力,数…

基于springboot在线外卖系统

末尾获取源码 开发语言:Java Java开发工具:JDK1.8 后端框架:SpringBoot 前端:Vue 数据库:MySQL5.7和Navicat管理工具结合 服务器:Tomcat8.5 开发软件:IDEA / Eclipse 是否Maven项目:…

You Only Look Once: 革命性目标检测算法论文解析

You Only Look Once 全论文完整翻译 You Only Look Once: Unified, Real-Time Object Detection 摘要 我们介绍了一种名为YOLO的新型目标检测方法。在目标检测的先前工作中,人们将分类器重新应用于执行检测任务。相反,我们将目标检测视为一个回归问题&a…

ChatGPT 的议论文究竟写的怎么样?111 位高中教师告诉你答案

夕小瑶科技说 原创 作者 | 小戏、Python 在 OpenAI GPT-4 发布时发布的《GPT-4 Technical Report》中,其中很吸引人眼球的一部分是 GPT-4 应用于教育领域的出色表现,通过让 GPT-4 去完成美国的 AP 课程及考试,来评估 GPT-4 在多个学科中的性…

WIN提权补丁提权,at,sc,psexes提权

win提权分为web和本地提权 web提权就是getshell后,权限是网站权限,要进行提权 本地提权是本地用户进行提权 本地用户的权限大于网站权限,所以本地提权成功概率比web提权概率大 因为我们做渗透测试,一般都是从网站入侵。所以大…

OpenAI最新iOS版ChatGPT下载使用手册:三步快速下载,支持语音输入和历史聊天记录重新对话(免费、比网页端响应快、亲测可用)

目录 前言ChatGPT移动端与网页端相比的优势步骤一:注册美区Apple id账号步骤二:苹果手机切换appstore id步骤三:下载ChatGPT IOS移动版APP畅玩ChatGPT APP体验总结其它资料下载 ! 前言 北京时间5月19日凌晨,OpenAI重…

散点图(Scatter Plot)

目录 1、散点图 2、随机数据分布 1、散点图 散点图是数据集中的每个值都由点表示的图 Matplotlib 模块有一种绘制散点图的方法,它需要两个长度相同的数组,一个数组用于 x 轴的值,另一个数组用于 y 轴的值 x [5,7,8,7,2,17,2,9,4,11,12,9…

2023最新网络安全面试题大全

2023年快过去一半了,不知道小伙伴们有没有找到自己心仪的工作呀【doge】,本文总结了常见的安全岗位面试题,方便各位复习。祝各位事业顺利,财运亨通。在网络安全的道路上越走越远! 所有的资料都整理成了PDF&#xff0c…

Netty实战(七)

EventLoop和线程模型 一、什么是线程模型二、EventLoop 接口2.14 Netty 4 中的 I/O 和事件处理 三、任务调度3.1 JDK 的任务调度 API3.2 使用 EventLoop 调度任务 四、实现细节4.1 线程管理4.2 EventLoop/线程的分配4.2.1 异步传输4.2.2 .阻塞传输 一、什么是线程模…

Java基础学习---3、堆、GC

1、堆 1.1 概述 1.1.1 堆空间结构 1.1.2 堆空间工作机制 新创建的对象会放在Eden区当Eden区中已使用的空间达到一定比例,会触发Minor GC每一次在Minor GC中没有被清理掉的对象就成了幸存者。幸存者对象会被转移到幸存者区幸存者区分成from区和to区from区快满的时…

如何提高软件复用度,降低项目开发成本?

1、代码基线管控策略 理想的代码复用是我们建立一条主干代码,持续维护下去。面对客户的新需求,需要我们拉一条临时分支来满足客户需求,然后将稳定后的临时分支代码成果回归到主干。这样我们所有的研发成果都可以在一个代码分支上进行追溯&…

FreeRTOS学习之路,以STM32F103C8T6为实验MCU(序章——浅谈单片机以及FreeRTOS)

学习之路主要为FreeRTOS操作系统在STM32F103(STM32F103C8T6)上的运用,采用的是标准库编程的方式,使用的IDE为KEIL5。 注意!!!本学习之路可以通过购买STM32最小系统板以及部分配件的方式进行学习…

论文解读 | 透过窥镜: 透明容器内物体的神经三维重建

原创 | 文 BFT机器人 随着虚拟现实和虚拟世界技术的发展,博物馆藏品的数字化是一个越来越受关注的新兴话题。世界上许多著名的博物馆都在为网上展览建立自己的数字馆藏。 在这些藏品中,有一种特殊而重要的藏品昆虫、人体组织、水生生物和其他易碎的标本需…

ZooKeeper(一):基础介绍

文章目录 什么是 ZooKeeper?ZooKeeper 发展历史ZooKeeper 应用场景ZooKeeper 服务的使用ZooKeeper 数据模型data tree 接口znode 分类 总结 什么是 ZooKeeper? ZooKeeper 是一个分布式的,开放源码的分布式应用程序协同服务。ZooKeeper 的设计…

docker-compose安装nacos 2.2.1及配置

目录 官网 创建存储目录 创建数据库 application.properties配置(重要) docker-compose.yml 启动 登录 下面是安装nacos 2.2.1版本的方法,有一些变化 官网 GitHub - alibaba/nacos: an easy-to-use dynamic service discovery, configu…

SCTracker 跟踪论文阅读笔记

SCTracker 跟踪论文阅读笔记 SCTracker: Multi-object tracking with shape and confidence constraints 论文链接 (未开源状态) 论文主要更新点围绕shape constraint and confidence两点来展开: 首先论证在跟踪匹配的过程中D-box(检测框)与T-box(预测框)需要有一定…

今日的CSS小案例

个人名片: 😊作者简介:一名大一在校生,web前端开发专业 🤡 个人主页:几何小超 🐼座右铭:懒惰受到的惩罚不仅仅是自己的失败,还有别人的成功。 🎅**学习目…

第一篇、基于Arduino uno,获取dht11温湿度传感器的温度信息和湿度信息——结果导向

0、结果 说明:先来看看串口调试助手显示的结果,如果是你想要的,可以接着往下看。 1、外观 说明:虽然dht11温湿度模块形态各异,但是代码都是适用的,因为它们的模块都是一样的。 2、连线 说明:…