Python Machine Learning Basics (1): Ways to Load a Dataset

2025/7/4 18:42:55

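Several ways to load a dataset

Iris practice file (this file has a flaw: the values in the index and Species columns are strings wrapped in double quotes, which causes some practical problems when loading it and goes to show that pandas is still the easiest option)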

"index","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
"5",5,3.6,1.4,0.2,"setosa"
"6",5.4,3.9,1.7,0.4,"setosa"
"7",4.6,3.4,1.4,0.3,"setosa"
"8",5,3.4,1.5,0.2,"setosa"
"9",4.4,2.9,1.4,0.2,"setosa"
"10",4.9,3.1,1.5,0.1,"setosa"

Download link (iris)

https://download.csdn.net/download/qq_27437073/88475286?spm=1001.2014.3001.5503

1. Loading the dataset with csv

Code:

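import csv
import numpy as np

path = r"D:\DevelopWorkSpace\vsCodeWorkSpaces\加载数据集示例\iris.csv"
# newline='' is what the csv module recommends when opening a file for reading
with open(path, 'r', newline='', encoding='utf-8') as f:
    reader = csv.reader(f, delimiter=',')
    headers = next(reader)          # first row holds the column names
    data = np.array(list(reader))   # remaining rows; every field is still a string

print(headers)
print(data[0:5])  # first 5 rows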

Output:

['index', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species']
[['1' '5.1' '3.5' '1.4' '0.2' 'setosa']
 ['2' '4.9' '3' '1.4' '0.2' 'setosa']  
 ['3' '4.7' '3.2' '1.3' '0.2' 'setosa']
 ['4' '4.6' '3.1' '1.5' '0.2' 'setosa']
 ['5' '5' '3.6' '1.4' '0.2' 'setosa']]

Code explanation:

path = r"D:\DevelopWorkSpace\vsCodeWorkSpaces\加载数据集示例\iris.csv"

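(1) The r prefix makes the path a raw string, so the backslashes in the Windows path are kept as-is instead of being read as escape sequences.

(2) reader = csv.reader(f,delimiter = ',')
delimiter = ',' sets the character that separates the fields of the dataset.

(3) next(reader) returns the next row. Because it is called before anything else is read, it returns the header row, and the rows that follow are the data.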
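The csv approach can be tried without the downloaded file. The sketch below is only an illustration: it feeds the first three rows shown at the top of this post through io.StringIO (the in-memory sample and the variable names features and species are assumptions made for the example, not part of the original code). It shows that csv.reader strips the double quotes, but that every field is still a string and has to be converted before it can be used numerically.

```python
import csv
import io

import numpy as np

# First rows of the iris file shown above, kept in memory for the example.
SAMPLE = '''"index","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
'''

reader = csv.reader(io.StringIO(SAMPLE), delimiter=",")
headers = next(reader)   # header row, quotes already removed by csv.reader
rows = list(reader)      # data rows; every field is a str

# Columns 1-4 are the measurements; convert them from str to float.
features = np.array([row[1:5] for row in rows], dtype=float)
# Column 5 is the species name, kept as plain strings.
species = [row[5] for row in rows]

print(headers[1:5])
print(features)
print(species)
```

Keeping the numeric columns and the label column in separate containers avoids the all-string array that np.array(data) produces in the code above.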

2. Loading the dataset with numpy

Code:

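from numpy import loadtxt

def add_two(x):
    # Convert the quoted species name into a numeric class label.
    # Depending on the NumPy version, the converter receives bytes or str.
    transText = x.decode('utf-8') if isinstance(x, bytes) else x
    transText = transText.strip().strip('"')
    labels = {'setosa': 1.0, 'versicolor': 2.0, 'virginica': 3.0}
    return labels[transText]  # raises KeyError for an unknown species

path = r"D:\DevelopWorkSpace\vsCodeWorkSpaces\加载数据集示例\iris.csv"
# The with block closes the file once loading is finished
with open(path, 'r') as datapath:
    data = loadtxt(datapath, delimiter=",", skiprows=1, usecols=(1, 2, 3, 4, 5), converters={5: add_two})
print(data.shape)
print(data[:3])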

Output:

(150, 5)
[[5.1 3.5 1.4 0.2 1. ]
 [4.9 3.  1.4 0.2 1. ]
 [4.7 3.2 1.3 0.2 1. ]]

Code explanation:

loadtxt(datapath, delimiter=",",skiprows=1,usecols = (1,2,3,4,5),converters={5:add_two})

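(1) skiprows=1 skips the first row (the header).

(2) usecols = (1,2,3,4,5) loads only columns 1 to 5. Column numbers start at 0, so the quoted index column (column 0) is left out.

(3) converters={5:add_two} passes every value in column 5 (Species, the sixth column of the file) through the add_two function, which turns the species name into a number.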
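The same three parameters can be tried on an in-memory sample. This is a minimal sketch rather than the original code: the sample rows are the first three rows shown at the top of this post, and the helper name species_to_label and the LABELS dictionary are made up for the illustration. The converter accepts both bytes and str because, as far as I know, which of the two loadtxt passes to a converter depends on the NumPy version and on the encoding argument.

```python
import io

import numpy as np

SAMPLE = '''"index","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
'''

# Hypothetical label mapping, the same numbering the post uses.
LABELS = {"setosa": 1.0, "versicolor": 2.0, "virginica": 3.0}


def species_to_label(field):
    """Turn a quoted species name such as '"setosa"' into a float label."""
    text = field.decode("utf-8") if isinstance(field, bytes) else field
    return LABELS[text.strip().strip('"')]


data = np.loadtxt(
    io.StringIO(SAMPLE),
    delimiter=",",
    skiprows=1,                      # skip the header row
    usecols=(1, 2, 3, 4, 5),         # leave out the quoted index column 0
    converters={5: species_to_label},
)

# Split the loaded array into measurements and integer class labels.
features = data[:, :4]
labels = data[:, 4].astype(int)

print(data.shape)
print(features)
print(labels)
```

Note that the column number in converters refers to the column in the file (5), while in the loaded array the label ends up in the last position (index 4) because column 0 was not loaded.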

3. Loading the dataset with pandas

Code:

from pandas import read_csv
from pandas import set_option
path=r"D:\DevelopWorkSpace\vsCodeWorkSpaces\加载数据集示例\iris.csv"
data = read_csv(path)

# Shape of the data: (150, 6), i.e. 150 rows and 6 columns
print(data.shape)

# Show the first 10 rows
print(data[:10])

# Show the first 50 rows
#print(data.head(50))

# Show the data type of each column
print(data.dtypes)
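# Overview of set_option parameters
# pd.set_option('display.max_rows',xxx) # maximum number of rows shown
# pd.set_option('display.min_rows',xxx) # minimum number of rows shown
# pd.set_option('display.max_columns',xxx) # maximum number of columns shown
# pd.set_option('display.max_colwidth',xxx) # maximum number of characters per column
# pd.set_option('display.precision',2) # floating-point precision
# pd.set_option('display.float_format','{:,}'.format) # comma as thousands separator
# pd.set_option('display.float_format', '{:,.2f}'.format) # fixed number of decimals
# pd.set_option('display.float_format', '{:.2f}%'.format) # percent formatting
# pd.set_option('plotting.backend', 'altair') # change the plotting backend
# pd.set_option('display.max_info_columns', 200) # maximum number of columns in info() output
# pd.set_option('display.max_info_rows', 5) # row threshold above which info() skips the null count
# pd.describe_option() # list all options with their descriptions
# pd.reset_option('all') # reset all options
set_option('display.max_colwidth', 100)
# Use the full option name; the short form 'precision' is rejected by recent pandas versions
set_option('display.precision', 2)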
# Summary statistics
# count
# mean
# standard deviation
# minimum
# maximum
# 25%
# median, i.e. 50%
# 75%
print(data.describe())

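# Look at how the values are distributed
# (grouped by Petal.Length here; group by 'Species' to see the class distribution)
# Petal.Length  attribute name
# 1.0     1     value   number of occurrences
# 1.1     1
count_class = data.groupby('Petal.Length').size()
#count_class2 = data.groupby('Petal.Length')
print(count_class)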

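# Correlation between attributes
# coefficient = 1: perfect positive correlation between the variables
# coefficient = -1: perfect negative correlation between the variables
# coefficient = 0: no linear correlation between the variables
# numeric_only=True (pandas 1.5 or later) leaves out the string column Species,
# which otherwise raises an error on pandas 2.0 and later
correlations = data.corr(method='pearson', numeric_only=True)
print(correlations)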

Output:

(150, 6)
   index  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width Species
0      1           5.1          3.5           1.4          0.2  setosa
1      2           4.9          3.0           1.4          0.2  setosa
2      3           4.7          3.2           1.3          0.2  setosa
3      4           4.6          3.1           1.5          0.2  setosa
4      5           5.0          3.6           1.4          0.2  setosa
5      6           5.4          3.9           1.7          0.4  setosa
6      7           4.6          3.4           1.4          0.3  setosa
7      8           5.0          3.4           1.5          0.2  setosa
8      9           4.4          2.9           1.4          0.2  setosa
9     10           4.9          3.1           1.5          0.1  setosa
index             int64
Sepal.Length    float64
Sepal.Width     float64
Petal.Length    float64
Petal.Width     float64
Species          object
dtype: object
        index  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
count  150.00        150.00       150.00        150.00       150.00
mean    75.50          5.84         3.06          3.76         1.20
std     43.45          0.83         0.44          1.77         0.76
min      1.00          4.30         2.00          1.00         0.10
25%     38.25          5.10         2.80          1.60         0.30
50%     75.50          5.80         3.00          4.35         1.30
75%    112.75          6.40         3.30          5.10         1.80
max    150.00          7.90         4.40          6.90         2.50
Petal.Length
1.0     1
1.1     1
1.2     2
1.3     7
1.4    13
1.5    13
1.6     7
1.7     4
1.9     2
3.0     1
3.3     2
3.5     2
3.6     1
3.7     1
3.8     1
3.9     3
4.0     5
4.1     3
4.2     4
4.3     2
4.4     4
4.5     8
4.6     3
4.7     5
4.8     4
4.9     5
5.0     4
5.1     8
5.2     2
5.3     2
5.4     2
5.5     3
5.6     6
5.7     3
5.8     3
5.9     2
6.0     2
6.1     3
6.3     1
6.4     1
6.6     1
6.7     2
6.9     1
dtype: int64
              index  Sepal.Length  Sepal.Width  Petal.Length  Petal.Width
index          1.00          0.72        -0.40          0.88         0.90
Sepal.Length   0.72          1.00        -0.12          0.87         0.82
Sepal.Width   -0.40         -0.12         1.00         -0.43        -0.37
Petal.Length   0.88          0.87        -0.43          1.00         0.96
Petal.Width    0.90          0.82        -0.37          0.96         1.00
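
In the statistics and the correlation table above, the index column is treated as if it were a measurement, although it is only a row counter. One possible way around this, sketched below under assumptions, is to pass index_col to read_csv so that the column becomes the row index. The in-memory sample (the first five rows shown at the top of this post) and the names df, X and y are chosen for the illustration and are not part of the original code.

```python
import io

import pandas as pd

SAMPLE = '''"index","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width","Species"
"1",5.1,3.5,1.4,0.2,"setosa"
"2",4.9,3,1.4,0.2,"setosa"
"3",4.7,3.2,1.3,0.2,"setosa"
"4",4.6,3.1,1.5,0.2,"setosa"
"5",5,3.6,1.4,0.2,"setosa"
'''

# index_col="index" turns the row counter into the DataFrame index,
# so it no longer shows up in describe() or corr().
df = pd.read_csv(io.StringIO(SAMPLE), index_col="index")

# Class distribution: how many rows belong to each species.
class_counts = df["Species"].value_counts()

# Separate the four measurements (X) from the class label (y).
X = df.drop(columns="Species")
y = df["Species"]

print(df.dtypes)
print(class_counts)
print(X.describe())
```

read_csv removes the double quotes around the index and Species values on its own, which is why no converter function is needed here.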