设计亚马逊按销售排名功能

news2025/7/6 8:25:56

1：定义 Use Cases 和约束

Use cases

作用域内的Use Case

Service 通过目录计算过去一周内最受欢迎的产品
User 通过目录去View过去周内最受欢迎的产品
Service 有高可用

作用域外

整个电商网站
- 设计组件（只是计算销售排名）

约束和假设

Traffic 不是平均分布的
类目会被存进多个不同目录
类目不能改变目录
没有子目录，比如: foo/bar/baz
结果会被定时更新
- 更受换一个的产品或许需要更频繁的更新
1000 万产品
1000 目录
10 亿交易 / 月
1000 亿请求 / 月
100：1读写比率

计算使用量

每个 trasnaction 的 Size
- created_at - 5 bytes
- product_id - 8 bytes
- category_id - 4 bytes
- seller_id - 8 bytes
- buyer_id - 8 bytes
- quantity - 4 bytes
- total_price - 5 bytes
- Total: ~40 bytes
每个月有 40 GB 的新 trasnaction 内容

40 字节 / transaction * 10 亿 transaction / 月
1.44 TB 新 transaction 内容 / 3 年
假设大部分是新的交易，而不是更新已经存在的

平均 400 transaction / s
平均 40000 读请求 / s

便利转换指南：

每个月有250万秒
1 request / s = 250 万请求 / 月
40 request / s = 1 亿请求每月
400 request / s = 10 亿请求每月

2：创建一个High Level设计

Design

3：设计核心组件

Use case: Service 通过category计算过去周内最受欢迎的产品

我们可以我们可以存储二进制 Sales API server的 log 文件进受管理的独享存储，比如 Amazon S3, 而不是管理我们自己的分布式文件系统

我们将假设这是一个简单的 log entry, tab delimited:

timestamp   product_id  category_id    qty     total_price   seller_id    buyer_id
t1          product1    category1      2       20.00         1            1
t2          product1    category2      2       20.00         2            2
t2          product1    category2      1       10.00         2            3
t3          product2    category1      3        7.00         3            4
t4          product3    category2      7        2.00         4            5
t5          product4    category1      1        5.00         5            6

Sales Rank Service 会使用 MapReduce, 使用这个 Sales API Sever的 log 文件作为输入，而且结果会被写进聚合表 sales_rank. 我们应该讨论数据库是选用SQL 还是NoSQL.

我们将使用多步骤的 MapReduce:

Step1: 转移数据到 `(category, produce_id), sum(quantity)
Step2: 执行分布式排序

class SalesRanker(MRJob):

    def within_past_week(self, timestamp):
        """Return True if timestamp is within past week, False otherwise."""
        ...

    def mapper(self, _ line):
        """Parse each log line, extract and transform relevant lines.

        Emit key value pairs of the form:

        (category1, product1), 2
        (category2, product1), 2
        (category2, product1), 1
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        timestamp, product_id, category_id, quantity, total_price, seller_id, \
            buyer_id = line.split('\t')
        if self.within_past_week(timestamp):
            yield (category_id, product_id), quantity

    def reducer(self, key, value):
        """Sum values for each key.

        (category1, product1), 2
        (category2, product1), 3
        (category1, product2), 3
        (category2, product3), 7
        (category1, product4), 1
        """
        yield key, sum(values)

    def mapper_sort(self, key, value):
        """Construct key to ensure proper sorting.

        Transform key and value to the form:

        (category1, 2), product1
        (category2, 3), product1
        (category1, 3), product2
        (category2, 7), product3
        (category1, 1), product4

        The shuffle/sort step of MapReduce will then do a
        distributed sort on the keys, resulting in:

        (category1, 1), product4
        (category1, 2), product1
        (category1, 3), product2
        (category2, 3), product1
        (category2, 7), product3
        """
        category_id, product_id = key
        quantity = value
        yield (category_id, quantity), product_id

    def reducer_identity(self, key, value):
        yield key, value

    def steps(self):
        """Run the map and reduce steps."""
        return [
            self.mr(mapper=self.mapper,
                    reducer=self.reducer),
            self.mr(mapper=self.mapper_sort,
                    reducer=self.reducer_identity),
        ]

这个结果将变成下面的 sorted list, 我们可以插入 sales_rank 表中：

(category1, 1), product4
(category1, 2), product1
(category1, 3), product2
(category2, 3), product1
(category2, 7), product3

sales_rank table 会有如下的结构:

id int NOT NULL AUTO_INCREMENT
category_id int NOT NULL
total_sold int NOT NULL
product_id int NOT NULL
PRIMARY KEY(id)
FOREIGN KEY(category_id) REFERENCES Categories(id)
FOREIGN KEY(product_id) REFERENCES Products(id)

我们将创建一个index在id, category_id, and product_id 上去加速查询（log 时间而不是扫描整张表）
，而且放进数据去内存。从内存中序列化的读取数据需要250微妙，当从SSD读取需要4倍，从磁盘读取需要80倍。

Use Case: User 通过目录查看过去一周内最受欢迎的产品

Client 发送请求到Web Server
Web Server 转发请求到 Read API server
Read API server 从数据库表 sales_rank 重读取数据

我们可以使用如下Rest API:

$ curl https://amazon.com/api/v1/popular?category_id=1234

Response:

{
    "id": "100",
    "category_id": "1234",
    "total_sold": "100000",
    "product_id": "50",
},
{
    "id": "53",
    "category_id": "1234",
    "total_sold": "90000",
    "product_id": "200",
},
{
    "id": "75",
    "category_id": "1234",
    "total_sold": "80000",
    "product_id": "3",
}