文章目录
- 一、单一聚合
- 二、聚合管道
- 2.1 $match / $project / $count
- 2.2 $limit / $skip / $sort
- 2.3 $group 分组查询
- 2.4 $unwind 展开数组
- 2.5 $lookup 左外连接
- 2.6 $bucket 存储桶
- 三、MapReduce
提示:以下是本篇文章正文内容,MongoDB 系列学习将会持续更新
一、单一聚合
单一作用聚合:提供了对常见聚合过程的简单访问,操作都从单个集合聚合文档。
函数 | 描述 |
---|---|
db.collection.estimatedDocumentCount() | 忽略查询条件,返回集合或视图中所有文档的计数 |
db.collection.count() | 返回与find()集合或视图的查询匹配的文档计数,等同于 db.collection.find(query).count() |
db.collection.distinct() | 在单个集合或视图中查找指定字段的不同值,并在数组中返回结果 |
// 检索book集合中所有文档的计数, 效率高
db.book.estimatedDocumentCount()
// 结果同上,忽略查询条件
db.book.estimatedDocumentCount({type:"travel"})
// 统计所有文档数量
db.book.count()
// 统计符合条件的文档数量
db.book.count({favCount:{$gt:50}})
// 返回不同type的数组
db.book.distinct("type")
// 返回收藏数大于90的文档不同type的数组
db.book.distinct("type",{favCount:{$gt:90}})
注意:在分片群集上,如果存在孤立文档或正在进行块迁移,则 db.collection.count() 没有查询谓词可能导致计数不准确。要避免这些情况,请在分片群集上使用 db.collection.aggregate() 方法。
回到目录…
二、聚合管道
官方文档:Aggregation Pipeline Stages — MongoDB Manual
pipeline = [$stage1, $stage2, ...$stageN];
db.collection.aggregate(pipeline, {options})
pipelines 一组数据聚合阶段。除
$out
、$Merge
和$geonear
阶段之外,每个阶段都可以在管道中出现多次。
options 可选,聚合操作的其他参数。包含:查询计划、是否使用临时文件、 游标、最大操作时间、读写策略、强制索引等等
数据准备:
var tags = ["nosql","mongodb","document","developer","popular"];
var types = ["technology","sociality","travel","novel","literature"];
var books=[];
for(var i=0;i<50;i++){
var typeIdx = Math.floor(Math.random()*types.length);
var tagIdx = Math.floor(Math.random()*tags.length);
var tagIdx2 = Math.floor(Math.random()*tags.length);
var favCount = Math.floor(Math.random()*100);
var username = "xx00"+Math.floor(Math.random()*10);
var age = 20 + Math.floor(Math.random()*15);
var book = {
_id: i+1,
title: "book-"+i,
type: types[typeIdx],
tag: [tags[tagIdx],tags[tagIdx2]],
favCount: favCount,
author: {name:username,age:age}
};
books.push(book)
}
db.book2.insertMany(books);
2.1 $match / $project / $count
$match 条件筛选
$match
用于对文档进行筛选,之后可以在得到的文档子集上做聚合。(一般放在最前面,也可以根据需求放到后面做筛选)。
db.book2.aggregate([{$match:{type:"novel"}}])
$project 字段投影
//投影操作, 将原始字段投影成指定名称, 如将集合中的 title 投影成 name
db.book2.aggregate([{$project:{name:"$title"}}])
//$project 可以灵活控制输出文档的格式,也可以剔除不需要的字段
db.book2.aggregate([{$project:{_id:0,name:"$title",type:1}}])
$count 计数
> db.book2.aggregate([{$count:"all_count"}])
{ "all_count" : 50 }
// $match筛选出type匹配的文档;$count返回聚合管道中过滤后的文档计数,并将值分配给novel_count
> db.book2.aggregate([{$match:{type:"novel"}},{$count:"novel_count"}])
{ "novel_count" : 14 }
回到目录…
2.2 $limit / $skip / $sort
$limit / $skip
db.book2.find().limit(5)
db.book2.aggregate([{$limit:5}])
db.book2.find().skip(5)
db.book2.aggregate([{$skip:5}])
db.book2.find().skip(5).limit(5)
db.book2.aggregate([{$skip:5},{$limit:5}])
$sort
db.book2.find().sort({favCount:1})
db.book2.aggregate([{$sort:{favCount:1}}])
回到目录…
2.3 $group 分组查询
按指定的表达式对文档进行分组,并将每个不同分组的文档输出到下一个阶段。
{$group: {_id: <expression>, <field1>: {<accumulator1>: <expression1>}, ...} }
_id
表示分组条件 (必填),指定为 null 时以整个文档为一组;剩余的计算字段是可选的,并使用运算符进行计算。
$group
阶段的内存限制为 100M。默认情况下,如果stage超过此限制会产生错误。但是,要允许处理大型数据集,请将allowDiskUse
选项设置为 true 以启用 $group 操作以写入临时文件。
accumulator 操作符:
名称 | 描述 | 类比 SQL |
---|---|---|
$avg | 计算均值。 | avg |
$first | 返回每组第一个文档,如果有排序,按照排序,如果没有按照默认的存储的顺序的第一个文档。 | limit 0,1 |
$last | 返回每组最后一个文档,如果有排序,按照排序,如果没有按照默认的存储的顺序的最后个文档。 | - |
$max | 根据分组,获取集合中所有文档对应值得最大值。 | max |
$min | 根据分组,获取集合中所有文档对应值得最小值。 | min |
$push | 将指定的表达式的值添加到一个数组中。 | - |
$addToSet | 将表达式的值添加到一个集合中 (无重复值,无序)。 | - |
$sum | 计算总和 | sum |
$stdDevPop | 返回输入值的总体标准偏差(population standard deviation) | - |
$stdDevSamp | 返回输入值的样本标准偏差(the sample standard deviation) | - |
示例一:统计文档中的书籍总量、收藏总数、平均收藏量
db.book2.aggregate([
{$group:{_id:null,bookCount:{$sum:1},favCount:{$sum:"$favCount"},favAvg:{$avg:"$favCount"}}}
])
{ "_id" : null, "bookCount" : 50, "favCount" : 2392, "favAvg" : 47.84 }
示例二:统计每个作者的book收藏总数
db.book2.aggregate([
{$group:{_id:"$author.name",favCount:{$sum:"$favCount"}}}
])
{ "_id" : "xx007", "favCount" : 259 }
{ "_id" : "xx009", "favCount" : 160 }
{ "_id" : "xx005", "favCount" : 229 }
{ "_id" : "xx008", "favCount" : 302 }
{ "_id" : "xx001", "favCount" : 300 }
{ "_id" : "xx002", "favCount" : 311 }
{ "_id" : "xx006", "favCount" : 174 }
{ "_id" : "xx003", "favCount" : 160 }
{ "_id" : "xx004", "favCount" : 373 }
{ "_id" : "xx000", "favCount" : 124 }
示例三:统计每个作者的每本book的收藏数
db.book2.aggregate([
{$group:{_id:{author:"$author.name",title:"$title"},favCount:{$sum:"$favCount"}}}
])
{ "_id" : { "author" : "xx006", "title" : "book-42" }, "favCount" : 77 }
{ "_id" : { "author" : "xx002", "title" : "book-33" }, "favCount" : 70 }
{ "_id" : { "author" : "xx002", "title" : "book-3" }, "favCount" : 44 }
{ "_id" : { "author" : "xx007", "title" : "book-34" }, "favCount" : 50 }
{ "_id" : { "author" : "xx004", "title" : "book-46" }, "favCount" : 84 }
{ "_id" : { "author" : "xx008", "title" : "book-4" }, "favCount" : 64 }
{ "_id" : { "author" : "xx009", "title" : "book-8" }, "favCount" : 45 }
Type "it" for more
示例四:每个作者的book的type合集
db.book2.aggregate([
{$group:{_id:"$author.name",types:{$addToSet:"$type"}}}
])
{ "_id" : "xx007", "types" : [ "sociality", "travel", "literature", "novel", "technology" ] }
{ "_id" : "xx009", "types" : [ "literature", "novel" ] }
{ "_id" : "xx005", "types" : [ "travel", "novel", "literature", "sociality" ] }
{ "_id" : "xx008", "types" : [ "literature", "sociality", "technology" ] }
{ "_id" : "xx001", "types" : [ "novel", "sociality", "technology" ] }
{ "_id" : "xx002", "types" : [ "literature", "novel", "sociality", "technology" ] }
{ "_id" : "xx006", "types" : [ "novel" ] }
{ "_id" : "xx003", "types" : [ "literature", "novel", "technology" ] }
{ "_id" : "xx004", "types" : [ "sociality", "travel", "literature", "novel" ] }
{ "_id" : "xx000", "types" : [ "literature", "sociality" ] }
示例五:统计每个分类的book文档数量
db.book2.aggregate([
{$group:{_id:"$type",bookCount:{$sum:1}}},
{$sort:{bookCount:1}}
])
{ "_id" : "travel", "bookCount" : 4 }
{ "_id" : "technology", "bookCount" : 9 }
{ "_id" : "literature", "bookCount" : 11 }
{ "_id" : "sociality", "bookCount" : 12 }
{ "_id" : "novel", "bookCount" : 14 }
回到目录…
2.4 $unwind 展开数组
可以将数组拆分为单独的文档。
> db.book2.aggregate([{$unwind:"$tag"}])
{ "_id" : 1, "title" : "book-0", "type" : "literature", "tag" : "developer", "favCount" : 76, "author" : { "name" : "xx003", "age" : 30 } }
{ "_id" : 1, "title" : "book-0", "type" : "literature", "tag" : "document", "favCount" : 76, "author" : { "name" : "xx003", "age" : 30 } }
{ "_id" : 2, "title" : "book-1", "type" : "sociality", "tag" : "document", "favCount" : 41, "author" : { "name" : "xx004", "age" : 27 } }
{ "_id" : 2, "title" : "book-1", "type" : "sociality", "tag" : "document", "favCount" : 41, "author" : { "name" : "xx004", "age" : 27 } }
{ "_id" : 3, "title" : "book-2", "type" : "novel", "tag" : "nosql", "favCount" : 17, "author" : { "name" : "xx009", "age" : 30 } }
{ "_id" : 3, "title" : "book-2", "type" : "novel", "tag" : "mongodb", "favCount" : 17, "author" : { "name" : "xx009", "age" : 30 } }
{ "_id" : 4, "title" : "book-3", "type" : "literature", "tag" : "popular", "favCount" : 44, "author" : { "name" : "xx002", "age" : 29 } }
{ "_id" : 4, "title" : "book-3", "type" : "literature", "tag" : "popular", "favCount" : 44, "author" : { "name" : "xx002", "age" : 29 } }
Type "it" for more
示例一:姓名为xx006的作者的book的tag数组拆分为多个文档
db.book2.aggregate([
{$match:{"author.name":"xx006"}},
{$unwind:"$tag"}
])
{ "_id" : 16, "title" : "book-15", "type" : "novel", "tag" : "popular", "favCount" : 97, "author" : { "name" : "xx006", "age" : 30 } }
{ "_id" : 16, "title" : "book-15", "type" : "novel", "tag" : "popular", "favCount" : 97, "author" : { "name" : "xx006", "age" : 30 } }
{ "_id" : 43, "title" : "book-42", "type" : "novel", "tag" : "developer", "favCount" : 77, "author" : { "name" : "xx006", "age" : 28 } }
{ "_id" : 43, "title" : "book-42", "type" : "novel", "tag" : "popular", "favCount" : 77, "author" : { "name" : "xx006", "age" : 28 } }
示例二:每个作者的book的tag合集
db.book2.aggregate([
{$unwind:"$tag"},
{$group:{_id:"$author.name",tags:{$addToSet:"$tag"}}}
])
{ "_id" : "xx003", "tags" : [ "document", "mongodb", "developer", "nosql" ] }
{ "_id" : "xx004", "tags" : [ "nosql", "developer", "popular", "mongodb", "document" ] }
{ "_id" : "xx007", "tags" : [ "document", "mongodb", "popular", "nosql", "developer" ] }
{ "_id" : "xx008", "tags" : [ "document", "mongodb", "popular", "developer", "nosql" ] }
{ "_id" : "xx006", "tags" : [ "popular", "developer" ] }
{ "_id" : "xx009", "tags" : [ "nosql", "developer", "popular", "mongodb", "document" ] }
{ "_id" : "xx005", "tags" : [ "nosql", "popular", "mongodb", "document" ] }
{ "_id" : "xx001", "tags" : [ "document", "mongodb", "popular", "nosql", "developer" ] }
{ "_id" : "xx002", "tags" : [ "document", "popular", "developer", "nosql" ] }
{ "_id" : "xx000", "tags" : [ "document", "developer", "nosql" ] }
示例三:统计tag标签的热度排行 (根据favCount总量排行即为热度)
db.book2.aggregate([
{$unwind:"$tag"},
{$group:{_id:"$tag",favCount:{$sum:"$favCount"}}},
{$sort:{favCount:-1}}
])
{ "_id" : "popular", "favCount" : 1421 }
{ "_id" : "document", "favCount" : 977 }
{ "_id" : "developer", "favCount" : 893 }
{ "_id" : "nosql", "favCount" : 850 }
{ "_id" : "mongodb", "favCount" : 643 }
回到目录…
2.5 $lookup 左外连接
Mongodb3.2 新增,主要用来实现多表关联查询。每个输入待处理的文档,经过 $lookup 阶段的处理,输出的新文档中会包含一个新生成的数组。数组列存放的数据是来自被 Join 集合的适配文档 (如果没有即为[ ])。
db.collection.aggregate([{
$lookup: {
from: "<等待被Join的集合>",
localField: "<源集合中的match值>",
foreignField: "<待Join的集合的match值>",
as: "<输出文档的新增值命名>"
}
})
数据准备:
db.customer.insert({_id:1,name:"customer1",phone:"13112345678",address:"test1"})
db.customer.insert({_id:2,name:"customer2",phone:"13112345679",address:"test2"})
db.order.insert({_id:1,orderCode:"order001",customerId:1,price:200})
db.order.insert({_id:2,orderCode:"order002",customerId:2,price:400})
db.orderItem.insert({_id:1,productName:"apples",qutity:2,orderId:1})
db.orderItem.insert({_id:2,productName:"oranges",qutity:2,orderId:1})
db.orderItem.insert({_id:3,productName:"mangoes",qutity:2,orderId:1})
db.orderItem.insert({_id:4,productName:"apples",qutity:2,orderId:2})
db.orderItem.insert({_id:5,productName:"oranges",qutity:2,orderId:2})
db.orderItem.insert({_id:6,productName:"mangoes",qutity:2,orderId:2})
示例一:联表查询用户以及对应的订单信息
db.customer.aggregate([
{$lookup: {
from: "order",
localField: "_id",
foreignField: "customerId",
as: "customerOrder"
}
}
]).pretty()
// 结果如下:
{
"_id" : 1,
"name" : "customer1",
"phone" : "13112345678",
"address" : "test1",
"customerOrder" : [
{
"_id" : 1,
"orderCode" : "order001",
"customerId" : 1,
"price" : 200
}
]
}
{
"_id" : 2,
"name" : "customer2",
"phone" : "13112345679",
"address" : "test2",
"customerOrder" : [
{
"_id" : 2,
"orderCode" : "order002",
"customerId" : 2,
"price" : 400
}
]
}
示例二:联表查询订单信息以及对应的商品项
db.order.aggregate([
{$lookup: {
from: "orderItem",
localField: "_id",
foreignField: "orderId",
as: "orderItem"
}
}
]).pretty()
// 结果如下:
{
"_id" : 1,
"orderCode" : "order001",
"customerId" : 1,
"price" : 200,
"orderItem" : [
{
"_id" : 1,
"productName" : "apples",
"qutity" : 2,
"orderId" : 1
},
{
"_id" : 2,
"productName" : "oranges",
"qutity" : 2,
"orderId" : 1
},
{
"_id" : 3,
"productName" : "mangoes",
"qutity" : 2,
"orderId" : 1
}
]
}
{
"_id" : 2,
"orderCode" : "order002",
"customerId" : 2,
"price" : 400,
"orderItem" : [
{
"_id" : 4,
"productName" : "apples",
"qutity" : 2,
"orderId" : 2
},
{
"_id" : 5,
"productName" : "oranges",
"qutity" : 2,
"orderId" : 2
},
{
"_id" : 6,
"productName" : "mangoes",
"qutity" : 2,
"orderId" : 2
}
]
}
回到目录…
2.6 $bucket 存储桶
将传入文档分类为组 (称为存储桶),指定表达式和存储桶边界并输出文档到每个存储桶,结果至少包含一个输入文档。
$bucket 阶段的 RAM 限制为 100 MB。如果阶段超过此限制,会返回一个错误。
{
$bucket: {
groupBy: <expression>, //指定字段,用来对文档进行分组
boundaries: [ <lowerbound1>, <lowerbound2>, ... ], //_id值,指定每个存储桶的边界
default: <literal>, //可选的,指定边界外的_id值
output: { //可选的, 除_id字段外,指定输出文档中要包含的字段
<output1>: { <$accumulator expression> },
...
<outputN>: { <$accumulator expression> }
}
}
}
示例:统计book收藏数在[0,10),[10,60),[60,80),[80,100),[100,+∞)区间内的文档数
db.book2.aggregate([{
$bucket:{
groupBy:"$favCount",
boundaries:[0,10,60,80,100],
default:"other",
output:{"bookCount":{$sum:1}}
}
}])
{ "_id" : 0, "bookCount" : 5 }
{ "_id" : 10, "bookCount" : 28 }
{ "_id" : 60, "bookCount" : 10 }
{ "_id" : 80, "bookCount" : 7 }
回到目录…
三、MapReduce
MapReduce 操作将大量的数据处理工作拆分成多个线程并行处理,然后将结果合并在一起。MongoDB 提供的 Map-Reduce 非常灵活,对于大规模数据分析也相当实用。
MapReduce 具有两个阶段:
- 将具有相同 Key 的文档数据整合在一起的 map 阶段
- 组合 map 操作的结果进行统计输出的 reduce 阶段
db.collection.mapReduce(
function() {emit(key,value);}, //map函数,将数据拆分成键值对,交给reduce函数
function(key,values) {return reduceFunction}, //reduce函数,根据键将值做统计运算
{
out: <collection>, //可选,将结果汇入指定表
query: <document>, //可选,筛选数据的条件,筛选的数据送入map
sort: <document>, //排序完后,送入map
limit: <number>, //限制送入map的文档数
finalize: <function>, //可选,修改reduce的结果后进行输出
scope: <document>, //可选,指定map、reduce、finalize的全局变量
jsMode: <boolean>, //可选,默认false。在mapreduce过程中是否将数据转换成bson格式
verbose: <boolean>, //可选,是否在结果中显示时间,默认false
bypassDocumentValidation: <boolean> //可选,是否略过数据校验
}
)
示例:统计type为travel的不同作者的book文档收藏数
db.book2.mapReduce(
function(){emit(this.author.name,this.favCount)},
function(key,values){return Array.sum(values)},
{
query:{type:"travel"},
out: "books_favCount"
}
)
db.books_favCount.find()
{ "_id" : "xx007", "value" : 57 }
{ "_id" : "xx005", "value" : 95 }
{ "_id" : "xx004", "value" : 58 }
我们使用聚合管道也能实现, MongoDB5.0开始,map-reduce已被弃用。
db.book2.aggregate([
{$match:{type:"travel"}},
{$group:{_id:"$author.name",favCount:{$sum:"$favCount"}}}
])
{ "_id" : "xx007", "favCount" : 57 }
{ "_id" : "xx005", "favCount" : 95 }
{ "_id" : "xx004", "favCount" : 58 }
回到目录…
总结:
提示:这里对文章进行总结:
本文是对MongoDB的学习,主要学习了MongoDB的聚合操作,重点掌握了 $group分组查询, $unwind展开数组, $lookup左外连接等聚合管道,并且实现了一些应用场景。之后的学习内容将持续更新!!!