Hadoop综合项目——二手房统计分析（Hive篇）

news2025/7/9 12:42:08

Hadoop综合项目——二手房统计分析（Hive篇）

文章目录

Hadoop综合项目——二手房统计分析（Hive篇）
- 0、写在前面
- 1、Hive统计分析
- - 1.1 本地数据/HDFS数据导入到Hive
  - 1.2 楼龄超过20年的二手房比例
  - 1.3 四大一线城市各楼层地段的平均价格
  - 1.4 采光较好的二手房比例
  - 1.5 二手房面积大小各个区间的比例
  - 1.6 各大宣传标签的二手房占比
  - 1.7 统计四大一线城市二手房关注人数
  - 1.8 四大一线城市二手房规格比例
- 2、数据及源代码
- 3、总结

在这里插入图片描述

0、写在前面

Windows版本：Windows10
Linux版本：Ubuntu Kylin 16.04
JDK版本：Java8
Hadoop版本：Hadoop-2.7.1
Hive版本：Hive1.2.2
IDE：IDEA 2020.2.3
IDE：Pycharm 2021.1.3
IDE：Eclipse3.8

1、Hive统计分析

使用Hive做了7个不同的统计分析，更好地展示二手房的情况

1.1 本地数据/HDFS数据导入到Hive

先创建数据库db_ke_house以及表tb_ke_house，再从HDFS导入数据

hive> create database if not exists db_ke_house;
OK
Time taken: 0.163 seconds
hive> use db_ke_house;
OK
hive> create table if not exists tb_ke_house (
    > id int,
    > city string,
    > title string,
    > houseInfo array<string>,
    > followInfo array<string>,
    > positionInfo string,
    > total int,
    > unitPrice double,
    > tag string,
    > crawler_time timestamp
    > )
    > row format delimited fields terminated by '\t'
    > collection items terminated by '|';
OK
Time taken: 0.061 seconds
hive> load data inpath '/Ke_House/tb_house.txt' into table tb_ke_house;
Loading data to table db_ke_house.tb_ke_house
Table db_ke_house.tb_ke_house stats: [numFiles=1, totalSize=3385429]
OK
Time taken: 0.223 seconds

查看表数据

hive> select * from tb_ke_house limit 5;

1.2 楼龄超过20年的二手房比例

分析目的：

楼龄的长短是顾客是否购买房子的决定因素之一，楼龄越短，房子越新，顾客购买的欲望也越大。

代码：

with t1 as (
select city, count(city) totalcnt from tb_ke_house where array_contains(houseinfo, "未透露年份建") = false group by city
), t2 as (
select city, 2022 - cast(substring(houseinfo[2],0,length(houseinfo[2])-2) as int) as ycnt 
from tb_ke_house where houseinfo[2] != "未透露年份建"
), t3 as (
select city, count(ycnt) as cnt from t2 where ycnt >= 20 group by city
)
select t1.city, cnt / totalcnt rate from t1 join t3 on t1.city = t3.city group by t1.city, cnt, totalcnt;

运行情况：

tp1

在这里插入图片描述

结果：

在这里插入图片描述

1.3 四大一线城市各楼层地段的平均价格

分析目的：

价格是顾客心中最重要的一个因素，也是顾客是否购买房子的决定因素之一，性价比好的房子，顾客购买欲望可能更大。

代码：

select city,layer, avg(total/size) from (
select city,houseinfo[0] layer,cast(substring(houseinfo[4],0,length(houseinfo[4])-2) as int) as size, total from tb_ke_house 
) t
group by city,layer;

运行情况：
结果：

1.4 采光较好的二手房比例

分析目的：

阳光是否充同样是顾客心中最重要的一个因素，采光好的房源，顾客更倾向于购买。

代码：

with t1 as 
select city, sum(if (title regexp('.*采光.*') = true, 1, 0)) cnt1 from tb_ke_house group by city
), t2 as (
select city, count(city) totalcnt from tb_ke_house group by city
)
select t1.city, cnt1/totalcnt rate1 from t1 join t2 on t1.city = t2.city roup by t1.city, cnt1, totalcnt;

运行情况：

tp2

结果：

1.5 二手房面积大小各个区间的比例

分析目的：

顾客对于房子的大小各有不同，为了更好满足顾客要求，面积大小的统计展示是必不可少的。

代码：

with t2 as (
select city, size,
case when size < 100 then '<100'
when size < 200 then '100~200'
when size < 300 then '200~300'
when size < 500 then '300~500'
else 'other' 
end sizerange
from (
select id, city, substring(houseInfo[4], 0, length(houseinfo[4]) - 2) as size from tb_ke_house
) t1
), t3 as (
select city, count(city) totalcnt from tb_ke_house group by city
)
select t2.city, sizerange, count(sizerange)/totalcnt as rate from t2 join t3 on t2.city = t3.city
group by t2.city, sizerange,totalcnt

运行情况：
结果：

1.6 各大宣传标签的二手房占比

分析目的：

吸引顾客购买的一大来源便是房子的优势，好的宣传标签可以促进房子的销量。

代码：

with t1 as (
select city, count(city) totalcnt from tb_ke_house group by city
), t2 as (
select city, count(tag) cnt, if(tag != '', tag, "未知标签") as newTag
from tb_ke_house group by city, tag
)
select t2.city, newTag, cnt / totalcnt as rate from t2 join t1 on t2.city = t1.city
group by t2.city, newTag,cnt, totalcnt

运行情况：

在这里插入图片描述

结果：

在这里插入图片描述

1.7 统计四大一线城市二手房关注人数

分析目的：

房子的关注数可以体现出顾客此段房源的热爱程度，有利于开发商合理规划房源。

代码：

select city, sum(size) from (
select city, cast(substring(followinfo[0], 0, length(followinfo[0]) - 3) as int) as size
from tb_ke_house
)t
group by city

运行情况：

在这里插入图片描述

结果：

1.8 四大一线城市二手房规格比例

分析目的：

房子规格大小直接影响到顾客对于房源的好感度，合理的规格可以促进顾客的购买欲望。

代码：

with t1 as (
select city, count(city) totalcnt from tb_ke_house group by city
), t2 as (
select city, houseinfo[3] as scale, count(houseinfo[3]) as size
from tb_ke_house group by city, houseinfo[3]
)
select t2.city, scale, size / totalcnt as rate from t1 join t2 on t1.city = t2.city
group by t2.city, scale, size, totalcnt