Experiment code: https://download.csdn.net/download/Amzmks/87396462
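First, read the table.
Drop the columns that contain null values and the columns that are completely identical.
Pick out the numeric columns.
Convert the numeric columns from string to float.
Use a variance threshold to keep only the numeric columns whose values vary noticeably.
Pick out the text columns.
Strip any leading and trailing whitespace from the text values.
Take the id, numeric, and text columns and concatenate them.
Then export the data.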
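The preprocessing steps above can be sketched with pandas roughly as follows. This is only an illustration under stated assumptions, not the code from the download link: the tiny inline table, the columns `funded_copy`, `policy_code`, `purpose` and `note`, and the threshold value are all made up. "Identical columns" is read here as duplicated columns, and the variance filter is done with plain pandas rather than a dedicated feature-selection library.

```python
import io
import pandas as pd

# Hypothetical stand-in for the real loan table. Everything is read as text,
# which mirrors the string-to-float conversion step.
raw = io.StringIO(
    "id,loan_amnt,funded_amnt,funded_copy,policy_code,grade,purpose,note\n"
    "1,10000,10000,10000,1, A ,car ,x\n"
    "2,12000,11500,11500,1,B, house,\n"
    "3,8000,8000,8000,1, C,car,y\n"
    "4,15000,15000,15000,1,A ,medical,z\n"
)
df = pd.read_csv(raw, dtype=str)

# Drop columns that contain nulls, then drop columns that duplicate another one
# (assumption: "identical" means duplicated columns).
df = df.dropna(axis=1)
df = df.loc[:, ~df.T.duplicated()]

id_col = df[["id"]]
rest = df.drop(columns=["id"])

# A column counts as numeric if every value parses as a number.
converted = rest.apply(pd.to_numeric, errors="coerce")
numeric_cols = [c for c in rest.columns if converted[c].notna().all()]
numeric = converted[numeric_cols].astype(float)

# Variance threshold: keep numeric columns whose population variance exceeds it.
threshold = 0.0  # assumed value; 0.0 only removes constant columns
variances = numeric.var(ddof=0)
selected = numeric.loc[:, variances > threshold]

# Text columns: strip leading and trailing whitespace.
text = rest.drop(columns=numeric_cols).apply(lambda s: s.str.strip())

# Put id, numeric, and text columns back together and export.
clean = pd.concat([id_col, selected, text], axis=1)
csv_text = clean.to_csv(index=False)  # pass a file path instead to write to disk
print(csv_text)
```

In this toy table the constant `policy_code` column is removed by the variance filter, `funded_copy` by the duplicate check, and `note` by the null check.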
Compute the correlation (Pearson correlation coefficient).
Look at the describe summary statistics.
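A minimal sketch of these two steps with pandas, assuming a small made-up table; `int_rate` is a hypothetical column added only so that a negative correlation shows up.

```python
import pandas as pd

# Hypothetical numeric sample; loan_amnt and funded_amnt follow the names used here.
df = pd.DataFrame({
    "loan_amnt": [10000.0, 12000.0, 8000.0, 15000.0, 20000.0],
    "funded_amnt": [10000.0, 11500.0, 8000.0, 15000.0, 19000.0],
    "int_rate": [13.5, 11.0, 15.2, 9.8, 7.4],
})

# Pairwise Pearson correlation between all numeric columns.
corr = df.corr(method="pearson")

# count, mean, std, min, quartiles, and max for every numeric column.
stats = df.describe()

print(corr.round(3))
print(stats)
```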
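Histogram of the loan_amnt column.
The y-axis is the number of loan_amnt values that fall in each interval. For example, if 2000 values fall in the interval (10000, 10500), the bar for that interval has height 2000 (the numbers are only an illustration).
Takeaway: a histogram shows how one (numeric) column is distributed across intervals, that is, how many values fall in each interval.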
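To make the bar heights concrete, here is a small sketch that computes the counts a histogram would draw. The data is synthetic and the bin count of 10 is an arbitrary choice.

```python
import numpy as np

# Synthetic stand-in for the loan_amnt column.
rng = np.random.default_rng(0)
loan_amnt = rng.uniform(1000, 35000, size=1000)

# counts[i] is the number of values between edges[i] and edges[i + 1];
# these counts are exactly the bar heights of the histogram.
counts, edges = np.histogram(loan_amnt, bins=10)

for left, right, n in zip(edges[:-1], edges[1:], counts):
    print(f"{left:8.0f} to {right:8.0f}: {n}")
```

Plotting libraries do the same binning internally before drawing the bars.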
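This figure shows how the loan_amnt and funded_amnt columns each relate to grade. Since the two columns have nearly the same values, the plots look the same; they will differ once you swap in whichever other columns you need to write up in your report.
The two scatter plots show the correlation between these two columns.
Takeaway: a scatter plot shows the relationship between two (numeric) columns, say A and B: what value B takes when A takes a given value.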
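A scatter plot of two columns can be sketched with matplotlib like this. The data is synthetic, and the rule that funded_amnt is slightly below loan_amnt is an assumption made only to imitate two nearly equal columns.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
loan_amnt = rng.uniform(1000, 35000, size=200)
# Assumption: the funded amount is the loan amount minus a small random shortfall.
funded_amnt = loan_amnt - rng.uniform(0, 500, size=200)

fig, ax = plt.subplots()
# Each point is one row: x is its loan_amnt, y is its funded_amnt.
points = ax.scatter(loan_amnt, funded_amnt, s=8)
ax.set_xlabel("loan_amnt")
ax.set_ylabel("funded_amnt")
# plt.show()  # uncomment in an interactive session
```

Because the two columns are almost equal, the points fall close to a straight diagonal line, which is what a strong positive correlation looks like.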
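The line charts show how each of these two columns relates to grade. For example, the rows whose grade is A correspond to the orange part, and the y-axis is the number of values in that column whose grade is A.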
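One way to get the numbers behind such a per-grade chart is to bin the column and count rows per grade in each bin. This is a guess at how the chart could be built, not necessarily how the original figure was produced; the data and the bin edges are made up.

```python
import pandas as pd

# Hypothetical sample with a numeric column and a grade column.
df = pd.DataFrame({
    "loan_amnt": [5000, 7000, 12000, 13000, 18000, 6000, 14000, 19000],
    "grade": ["A", "A", "A", "B", "B", "B", "C", "C"],
})

# Bin loan_amnt, then count how many rows of each grade fall in each bin.
bins = [0, 10000, 20000]  # assumed bin edges
df["amnt_bin"] = pd.cut(df["loan_amnt"], bins=bins)
counts = pd.crosstab(df["amnt_bin"], df["grade"])
print(counts)
```

Each column of `counts` is one grade, so calling `counts.plot()` would draw one line per grade with the count on the y-axis.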
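The heatmap shows the correlation between every pair of columns. For example, the color of the second square in the first row is the correlation between id and amnt_inv. The diagonal is each column's correlation with itself, which is 1.
Values lie in the interval [-1, 1]: -1 means the two columns are perfectly negatively correlated (the larger A is, the smaller B is), 1 means perfectly positively correlated, and 0 means no linear correlation.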
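A correlation heatmap can be sketched as below. The columns `a` to `d` are synthetic and constructed so that all three cases (positive, negative, roughly none) appear. The drawing uses matplotlib's `imshow`, which may differ from the plotting call used for the original figure.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
base = rng.normal(size=300)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.3, size=300),   # strongly positive with a
    "c": -base + rng.normal(scale=0.3, size=300),  # strongly negative with a
    "d": rng.normal(size=300),                     # roughly unrelated to a
})
corr = df.corr(method="pearson")

fig, ax = plt.subplots()
# Fix the color scale to [-1, 1] so colors are comparable between figures.
image = ax.imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns)
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(image, ax=ax, label="Pearson r")
# plt.show()  # uncomment in an interactive session
```

The matrix is symmetric, so the square in row 1, column 2 has the same color as the one in row 2, column 1.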
Wikipedia:
In statistics, the Pearson correlation coefficient (PCC, pronounced /ˈpɪərsən/) ― also known as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), the bivariate correlation,[1] or colloquially simply as the correlation coefficient[2] ― is a measure of linear correlation between two sets of data. It is the ratio between the covariance of two variables and the product of their standard deviations; thus it is essentially a normalised measurement of the covariance, such that the result always has a value between −1 and 1. As with covariance itself, the measure can only reflect a linear correlation of variables, and ignores many other types of relationship or correlation. As a simple example, one would expect the age and height of a sample of teenagers from a high school to have a Pearson correlation coefficient significantly greater than 0, but less than 1 (as 1 would represent an unrealistically perfect correlation).
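The definition quoted above (covariance divided by the product of the standard deviations) can be written out directly. The helper name `pearson_r` and the sample values are made up for illustration.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1/n factors of the covariance and the standard deviations cancel out.
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson_r(x, [2.0, 4.0, 6.0, 8.0, 10.0])    # y rises with x
r_down = pearson_r(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # y falls as x rises
# y = x**2 on a symmetric range: clearly related, but not linearly.
r_curve = pearson_r([-2.0, -1.0, 0.0, 1.0, 2.0], [4.0, 1.0, 0.0, 1.0, 4.0])

print(r_up, r_down, r_curve)
```

The last case illustrates the caveat in the quote: the coefficient only measures linear correlation, so a perfect quadratic relationship can still give a value of 0.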
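The 3D figure can be viewed in a Jupyter or PyCharm environment. It is a scatter plot and reads the same way as the 2D scatter plots above: each point's x, y, and z are the values of three columns, so you can see how the three columns are distributed together.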
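A 3D scatter plot of three columns can be sketched with matplotlib as follows. The data is synthetic, and `int_rate` is a hypothetical third column chosen only for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
loan_amnt = rng.uniform(1000, 35000, size=150)
funded_amnt = loan_amnt - rng.uniform(0, 500, size=150)  # assumed relationship
int_rate = rng.uniform(5, 25, size=150)                  # hypothetical third column

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
# Each point is one row; x, y, and z are the values of the three columns.
ax.scatter(loan_amnt, funded_amnt, int_rate, s=8)
ax.set_xlabel("loan_amnt")
ax.set_ylabel("funded_amnt")
ax.set_zlabel("int_rate")
# plt.show()  # in an interactive session the figure can be rotated with the mouse
```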