Typical Python Methods for Splitting Machine Learning Sample Data


| Date | Author | Version | Note |
| --- | --- | --- | --- |
| 2023.08.16 | Dog Tao | V1.0 | Document completed. |

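Contents

  • Typical Python Methods for Splitting Machine Learning Sample Data
    • Categories of Sample Data
      • Training Data
      • Validation Data
      • Test Data
    • numpy.ndarray Data
      • Direct Splitting
      • Cross-Validation
        • Using `KFold`
        • Using `RepeatedKFold`
        • Using `cross_val_score`
    • torch.tensor Data
      • Direct Splitting
        • Using TensorDataset
        • Using Slicing
      • Cross-Validation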

Categories of Sample Data

In machine learning and deep learning, the data used to develop a model can be divided into three distinct sets: training data, validation data, and test data. Understanding the differences among them and their distinct roles is crucial for effective model development and evaluation.

Training Data

  • Purpose: The training data is used to train the model. It’s the dataset the algorithm will learn from.
  • Usage: The model parameters are adjusted or “learned” using this data. For example, in a neural network, weights are adjusted using backpropagation on this data.
  • Fraction: Typically, a significant majority of the dataset is allocated to training (e.g., 60%-80%).
  • Issues: Overfitting can be a concern if the model becomes too specialized to the training data, leading it to perform poorly on unseen data.

Validation Data

  • Purpose: The validation data is used to tune the model’s hyperparameters and make decisions about the model’s structure (e.g., choosing the number of hidden units in a neural network or the depth of a decision tree).
  • Usage: After training on the training set, the model is evaluated on the validation set, and adjustments to the model (like changing hyperparameters) are made based on this evaluation. The process might be iterative.
  • Fraction: Often smaller than the training set, typically 10%-20% of the dataset.
  • Issues: Overfitting to the validation set can happen if you make too many adjustments based on the validation performance. This phenomenon is sometimes called “validation set overfitting”; it can be seen as a mild form of information leakage from the validation set into model selection.

Test Data

  • Purpose: The test data is used to evaluate the model’s final performance after training and validation. It provides an unbiased estimate of model performance in real-world scenarios.
  • Usage: Only for evaluation. The model does not “see” this data during training or hyperparameter tuning. Once the model is finalized, it is tested on this dataset to gauge its predictive performance.
  • Fraction: Typically, 10%-20% of the dataset.
  • Issues: To preserve the unbiased nature of the test set, it should never be used to make decisions about the model. If it’s used in this way, it loses its purpose, and one might need a new test set.

Note: The exact percentages mentioned can vary based on the domain, dataset size, and specific methodologies. In practice, strategies like k-fold cross-validation might be used, where the dataset is split into k subsets, and the model is trained and validated multiple times, each time using a different subset as the validation set and the remaining data as the training set.

In summary, the distinction among training, validation, and test data sets is crucial for robust model development, avoiding overfitting, and ensuring that the model will generalize well to new, unseen data.
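
The three-way split described above can be sketched with two successive calls to scikit-learn's train_test_split. The 60/20/20 ratio, the toy arrays, and the variable names are illustrative assumptions, not a recommendation from the original text.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 2 features, binary labels (illustrative only)
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First hold out 20% of all samples as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Then take 25% of the remaining 80 samples as validation (20% of the total)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))
```

The test set is split off first so that nothing done later with the training and validation data can touch it.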


numpy.ndarray Data

Direct Splitting

To split numpy.ndarray data into a training set and validation set, you can use the train_test_split function provided by the sklearn.model_selection module.

Here’s a brief explanation followed by an example:

  • Function Name: train_test_split()

  • Parameters:

    1. arrays: Sequence of indexables with the same length. Can be any data type.
    2. test_size: If float, should be between 0.0 and 1.0, representing the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples.
    3. train_size: The proportion (float) or absolute number (int) of samples to include in the training split. If not provided, it is set to the complement of test_size.
    4. random_state: Seed for reproducibility.
    5. shuffle: Whether to shuffle before splitting. Default is True.
    6. stratify: If not None, the data is split in a stratified fashion using this as the class labels.
  • Returns: Split arrays.

Example:

Let’s split an example dataset into a training set (80%) and a validation set (20%):

import numpy as np
from sklearn.model_selection import train_test_split

# Sample data
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # 100 labels, binary classification

# Split the data
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set size:", len(X_train))
print("Validation set size:", len(X_val))
  • If you want the split to be reproducible (i.e., get the same split each time you run the code), set the random_state to any integer value.
  • If you’re working with imbalanced datasets and want to ensure that the class distribution is the same in both the training and validation sets, you can use the stratify parameter. Setting stratify=y will ensure that the splits have the same class distribution as the original dataset.
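
To make the effect of stratify concrete, here is a small sketch on a deliberately imbalanced toy dataset (90 negatives, 10 positives). The data and variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 samples of class 0, 10 samples of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the 90/10 class ratio in both resulting subsets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print("Positives in training set:", int(y_train.sum()), "of", len(y_train))
print("Positives in validation set:", int(y_val.sum()), "of", len(y_val))
```

Without stratify, a random 20-sample validation set could easily end up with no positive samples, or with far too many.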

Cross-Validation

Using KFold

For performing n-fold cross-validation on numpy.ndarray data, you can use the KFold class from the sklearn.model_selection module.

Here’s how you can use n-fold cross-validation:

  • Class Name: KFold

  • Parameters of KFold:

    1. n_splits: Number of folds.
    2. shuffle: Whether to shuffle the data before splitting into batches.
    3. random_state: Seed used by the random number generator for reproducibility.

Example:

Let’s say you want 5-fold cross-validation:

import numpy as np
from sklearn.model_selection import KFold

# Sample data
X = np.random.rand(100, 5)  # 100 samples, 5 features
y = np.random.randint(0, 2, 100)  # 100 labels, binary classification

kf = KFold(n_splits=5, shuffle=True, random_state=42)

for train_index, val_index in kf.split(X):
    X_train, X_val = X[train_index], X[val_index]
    y_train, y_val = y[train_index], y[val_index]
    print("Training set size:", len(X_train))
    print("Validation set size:", len(X_val))
    print("---")
  • Each iteration in the loop gives you a different split of training and validation data.
  • The training and validation indices are generated based on the size of X.
  • If you want the split to be reproducible (i.e., get the same split each time you run the code), set the random_state parameter.
  • In case you want stratified k-fold cross-validation (where the folds are made by preserving the percentage of samples for each class), use StratifiedKFold instead of KFold. This can be particularly useful for imbalanced datasets.
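
A minimal sketch of the StratifiedKFold variant mentioned above, again on an assumed imbalanced toy dataset. Unlike KFold.split, StratifiedKFold.split also needs the labels y so that it can balance the classes across folds.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Imbalanced toy data: 90 samples of class 0, 10 samples of class 1
X = np.random.rand(100, 5)
y = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

positives_per_fold = []
for train_index, val_index in skf.split(X, y):  # note: y is required here
    y_val = y[val_index]
    positives_per_fold.append(int(y_val.sum()))
    print("Validation size:", len(val_index), "positives:", int(y_val.sum()))
```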

Using RepeatedKFold

RepeatedKFold repeats K-Fold cross-validation several times, with a different randomization in each repetition. For each repetition, it splits the dataset into k folds and performs k-fold cross-validation. This yields one score per fold per repetition, which can give a more stable estimate of the model’s performance than a single k-fold run.

Parameters:

  • n_splits: Number of folds.
  • n_repeats: Number of times cross-validator needs to be repeated.
  • random_state: Random seed for reproducibility.

Example:

import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
y = np.array([1, 2, 3, 4])

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)

for train_index, test_index in rkf.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
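
As a quick check of how many train/test splits the example above produces, the following sketch counts them. With n_splits=2 and n_repeats=2, the total is the product of the two.

```python
import numpy as np
from sklearn.model_selection import RepeatedKFold

X = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])

rkf = RepeatedKFold(n_splits=2, n_repeats=2, random_state=42)

# get_n_splits reports the total number of splits: n_splits * n_repeats
n_total = rkf.get_n_splits(X)
splits = list(rkf.split(X))

print("Total splits:", n_total)
print("Splits actually generated:", len(splits))
```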

Using cross_val_score

cross_val_score evaluates a score by cross-validation. It’s a quick utility that wraps both the steps of splitting the dataset and evaluating the estimator’s performance.

Parameters:

  • estimator: The object to use to fit the data.
  • X: The data to fit.
  • y: The target variable for supervised learning problems.
  • cv: Cross-validation strategy.
  • scoring: A string (see model evaluation documentation) or a scorer callable object/function.

Example:

Here’s an example using RepeatedKFold with cross_val_score for a simple regression model:

from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, RepeatedKFold

# Generate a sample dataset
X, y = make_regression(n_samples=1000, n_features=20, noise=0.1)

# Define the model
model = LinearRegression()

# Define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

# Evaluate the model
scores = cross_val_score(model, X, y, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)

# Summary of performance
# The scores are negated MAE values, so flip the sign to report a positive MAE
mae = -scores
print('Mean MAE: %.3f (%.3f)' % (mae.mean(), mae.std()))

In the above example:

  • cross_val_score is used to evaluate the performance of a LinearRegression model using the mean absolute error (MAE) metric.
  • We employ a 10-fold cross-validation strategy that is repeated 3 times, as specified by RepeatedKFold.
  • The scores from all these repetitions and folds are aggregated into the scores array.

Note:

  • In the scoring parameter, the ‘neg_mean_absolute_error’ is used because in sklearn, the convention is to maximize the score, so loss functions are represented with negative values (the closer to 0, the better).
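
To illustrate what cross_val_score wraps, the sketch below compares it against a hand-written loop over the same folds. It assumes a plain KFold splitter with a fixed random_state so that both approaches see identical splits; the dataset size and names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_score

X, y = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# One-line version: returns negated MAE values, one per fold
scores = cross_val_score(
    LinearRegression(), X, y, scoring='neg_mean_absolute_error', cv=cv
)

# Manual version: fit a fresh model on each training fold, score on the validation fold
manual_mae = []
for train_index, val_index in cv.split(X):
    model = LinearRegression().fit(X[train_index], y[train_index])
    y_pred = model.predict(X[val_index])
    manual_mae.append(mean_absolute_error(y[val_index], y_pred))
manual_mae = np.array(manual_mae)

# Negating the scores recovers the ordinary (positive) MAE of the manual loop
print(np.allclose(-scores, manual_mae))
```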

torch.tensor Data

Direct Splitting

Using TensorDataset

To split a tensor into training and validation sets, you can use the random_split function from torch.utils.data. It operates on Dataset objects, so plain tensors need to be wrapped in a TensorDataset first.

Here’s how you can do it:

  1. Wrap your tensor in a TensorDataset:
    Before using random_split, you might need to wrap your tensors in a TensorDataset so they can be treated as a dataset.

  2. Use random_split to divide the dataset:
    The random_split function requires two arguments: the dataset you’re splitting and a list of lengths for each resulting subset.

Here’s an example using random_split:

import torch
from torch.utils.data import TensorDataset, random_split

# Sample tensor data
X = torch.randn(1000, 10)  # 1000 samples, 10 features each
Y = torch.randint(0, 2, (1000,))  # 1000 labels

# Wrap tensors in a dataset
dataset = TensorDataset(X, Y)

# Split into 80% training (800 samples) and 20% validation (200 samples)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

print(len(train_dataset))  # 800
print(len(val_dataset))    # 200

Once you’ve split your data into training and validation sets, you can easily load them in batches using DataLoader if needed.

  • The random_split method does not actually make a deep copy of the dataset. Instead, it returns Subset objects that internally have indices to access the original dataset. This makes the splitting operation efficient in terms of memory.

  • Each time you call random_split, the split will be different because the function shuffles the indices. If you want reproducibility, set the random seed using torch.manual_seed() before calling random_split, or pass a seeded torch.Generator through its generator argument.
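
A minimal sketch of a reproducible split, using the generator argument of random_split with a seeded torch.Generator. The seed value 42 and the helper function name are arbitrary choices for illustration.

```python
import torch
from torch.utils.data import TensorDataset, random_split

X = torch.randn(100, 10)
Y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, Y)

def split_with_seed(seed):
    # A dedicated generator keeps the split independent of the global RNG state
    generator = torch.Generator().manual_seed(seed)
    return random_split(dataset, [80, 20], generator=generator)

train_a, val_a = split_with_seed(42)
train_b, val_b = split_with_seed(42)

# The same seed yields the same indices in both calls
print(train_a.indices == train_b.indices)
```

A local generator also avoids changing the global seed, which may matter if other parts of the program rely on it.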

The resulting subsets from random_split can be directly passed to DataLoader to create training and validation loaders:

from torch.utils.data import DataLoader

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

This allows you to efficiently iterate over the batches of data during training and validation.
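
As an illustration of what iterating over such a loader yields, the sketch below collects the batch sizes for a 100-sample dataset (an assumed size) with batch_size=32. Each iteration returns one tensor per tensor in the underlying TensorDataset.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=32, shuffle=False)

batch_sizes = []
for X_batch, Y_batch in loader:
    # X_batch has shape [batch, 10]; Y_batch has shape [batch]
    batch_sizes.append(X_batch.shape[0])

# The last batch is smaller because 100 is not a multiple of 32
print(batch_sizes)
```

If a smaller final batch is a problem for the model, DataLoader accepts drop_last=True to discard it.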

If you have a TensorDataset and you want to retrieve all the data pairs from it, you can simply iterate over the dataset. Each iteration will give you a tuple where each element of the tuple corresponds to a tensor in the TensorDataset.

Here’s an example:

import torch
from torch.utils.data import TensorDataset

# Sample tensor data
X = torch.randn(100, 10)  # 100 samples, 10 features each
Y = torch.randint(0, 2, (100,))  # 100 labels

# Wrap tensors in a dataset
dataset = TensorDataset(X, Y)

# Get all data pairs
data_pairs = [data for data in dataset]

# If you want to get them separately
X_data, Y_data = zip(*data_pairs)

# Convert back to tensors if needed
X_data = torch.stack(X_data)
Y_data = torch.stack(Y_data)

print(X_data.shape)  # torch.Size([100, 10])
print(Y_data.shape)  # torch.Size([100])

In the code above:

  • We first create a TensorDataset from sample data.
  • Then, we use list comprehension to retrieve all data pairs from the dataset.
  • Finally, we separate the features and labels using the zip function, and then convert them back to tensors.

The zip(*data_pairs) expression is a neat Python trick that involves unpacking and transposing pairs (or tuples) of data.

To break it down:

  1. zip function: This is a built-in Python function that allows you to iterate over multiple lists (or other iterable objects) in parallel. For example, if you have two lists a = [1,2,3] and b = [4,5,6], calling zip(a,b) will yield pairs (1,4), (2,5), and (3,6).

  2. The * unpacking operator: When used in a function call, it unpacks a list (or tuple) into individual elements. For instance, if you have func(*[1,2,3]), it’s the same as calling func(1,2,3).

When you use them together as in zip(*data_pairs), you’re doing the following:

  • Unpacking the data_pairs: This treats the list of tuples in data_pairs as separate arguments to zip.
  • Transposing with zip: Since each element of data_pairs is a tuple of (X, Y), using zip effectively transposes the data, separating all the X’s from the Y’s.

Here’s a simple example to illustrate:

data_pairs = [(1, 'a'), (2, 'b'), (3, 'c')]
x_data, y_data = zip(*data_pairs)
print(x_data)  # Outputs: (1, 2, 3)
print(y_data)  # Outputs: ('a', 'b', 'c')

In the context of our previous discussion, this operation allowed us to efficiently separate the feature tensors from the label tensors in the TensorDataset.
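
As an alternative sketch to iterating and re-stacking, a TensorDataset exposes its wrapped tensors through the tensors attribute, and the Subset objects returned by random_split keep their sample positions in indices. The variable names below are illustrative.

```python
import torch
from torch.utils.data import TensorDataset, random_split

X = torch.randn(100, 10)
Y = torch.randint(0, 2, (100,))
dataset = TensorDataset(X, Y)

# The wrapped tensors are available directly, without any copying or stacking
X_data, Y_data = dataset.tensors

# For a Subset, index the parent dataset with the subset's index list
train_subset, val_subset = random_split(dataset, [80, 20])
X_train, Y_train = dataset[train_subset.indices]

print(X_data.shape, Y_data.shape)
print(X_train.shape, Y_train.shape)
```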

Using Slicing

To split a PyTorch tensor into training and validation sets, you can use simple slicing. Here’s a straightforward way to do this:

  1. Decide on a split ratio (e.g., 80% training and 20% validation).
  2. Shuffle the tensor (optional, but often a good idea).
  3. Split the tensor based on the desired ratio.

Here’s an example using an 80-20 split:

import torch

# Sample data
X = torch.randn(1000, 10)  # 1000 samples, 10 features each
Y = torch.randint(0, 2, (1000,))

# Shuffle data
indices = torch.randperm(X.size(0))
X = X[indices]
Y = Y[indices]

# Split ratios
train_size = int(0.8 * X.size(0))
val_size = X.size(0) - train_size

# Split data
X_train = X[:train_size]
Y_train = Y[:train_size]
X_val = X[train_size:]
Y_val = Y[train_size:]

print(X_train.size())
print(Y_train.size())
print(X_val.size())
print(Y_val.size())

In this example:

  • We first shuffled the data by generating a permutation of indices with torch.randperm().
  • We then split the data based on the desired ratio (in this case, 80-20).
  • The resulting tensors (X_train, Y_train, X_val, Y_val) represent the training and validation sets respectively.

This method works well when you have independent and identically distributed data. If you need to perform stratified sampling (e.g., you want to ensure the training and validation sets have similar class distributions), consider using utilities from libraries like scikit-learn to generate the splits, and then index into the PyTorch tensor using those splits.
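
One way to sketch the stratified approach suggested above is to let scikit-learn split an array of sample indices and then use those indices on the tensors. The imbalanced toy labels and the names are assumptions for illustration.

```python
import numpy as np
import torch
from sklearn.model_selection import train_test_split

# Toy tensors with imbalanced labels: 90 zeros and 10 ones
X = torch.randn(100, 4)
Y = torch.cat([torch.zeros(90, dtype=torch.long), torch.ones(10, dtype=torch.long)])

# Split the sample indices, stratified by the labels
indices = np.arange(len(Y))
train_idx, val_idx = train_test_split(
    indices, test_size=0.2, stratify=Y.numpy(), random_state=0
)

# Convert the index arrays to tensors and use them to index the data
train_idx = torch.as_tensor(train_idx, dtype=torch.long)
val_idx = torch.as_tensor(val_idx, dtype=torch.long)
X_train, Y_train = X[train_idx], Y[train_idx]
X_val, Y_val = X[val_idx], Y[val_idx]

print("Positives in training set:", int(Y_train.sum()))
print("Positives in validation set:", int(Y_val.sum()))
```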

The torch.randperm(n) function generates a random permutation of integers from 0 to n-1. This is particularly useful for shuffling data. Let’s break down the function torch.randperm(X.size(0)):

  1. X.size(0):

    • This retrieves the size of the first dimension of tensor X.
    • If X is a 2D tensor with shape [samples, features], then X.size(0) will return the number of samples.
  2. torch.randperm(...):

    • This generates a 1D tensor containing a random permutation of the integers from 0 to n-1, where n is the input argument.
    • The result is effectively a shuffled sequence of integers in the range [0, n-1].

In the context of splitting data into training and validation sets, the random permutation ensures that the data is shuffled randomly before the split, so that the training and validation sets are likely to be representative of the overall dataset.
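
The short sketch below checks the permutation property and shows one way to make the shuffle reproducible by passing a seeded torch.Generator to torch.randperm. The seed value is arbitrary.

```python
import torch

# A seeded generator makes the permutation reproducible
perm_a = torch.randperm(10, generator=torch.Generator().manual_seed(0))
perm_b = torch.randperm(10, generator=torch.Generator().manual_seed(0))

# Every integer from 0 to 9 appears exactly once
print(sorted(perm_a.tolist()) == list(range(10)))

# The same seed gives the same permutation
print(torch.equal(perm_a, perm_b))
```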

Cross-Validation

To perform n-fold cross-validation on PyTorch tensor data, you can use the KFold class from sklearn.model_selection. Here’s a step-by-step guide:

  1. Convert the PyTorch tensor to numpy arrays using the .numpy() method (for tensors that live on a GPU or require gradients, call .detach().cpu() first).
  2. Use KFold from sklearn.model_selection to generate training and validation indices.
  3. Use these indices to split your PyTorch tensor data into training and validation sets.
  4. Train and validate your model using these splits.

Let’s see a practical example:

import torch
from sklearn.model_selection import KFold

# Sample tensor data
X = torch.randn(100, 10)  # 100 samples, 10 features each
Y = torch.randint(0, 2, (100,))  # 100 labels

# Convert tensor to numpy
X_np = X.numpy()
Y_np = Y.numpy()

# Number of splits
n_splits = 5
kf = KFold(n_splits=n_splits)

for train_index, val_index in kf.split(X_np):
    # Convert indices to tensor
    train_index = torch.tensor(train_index)
    val_index = torch.tensor(val_index)

    X_train, X_val = X[train_index], X[val_index]
    Y_train, Y_val = Y[train_index], Y[val_index]
    
    # Now, you can train and validate your model using X_train, X_val, Y_train, Y_val

Note:

  • The KFold class provides indices which we then use to slice our tensor and obtain the respective training and validation sets.
  • In the example above, we’re performing a 5-fold cross-validation on the data. Each iteration provides a new training-validation split.

If you want to shuffle the data before splitting, you can set the shuffle parameter of KFold to True.
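
Putting the pieces together, the following sketch runs shuffled 5-fold cross-validation over a TensorDataset and builds a DataLoader pair for each fold using torch.utils.data.Subset. The batch size, seed, and names are illustrative assumptions, and the training step itself is left as a placeholder comment.

```python
import numpy as np
import torch
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset, TensorDataset

dataset = TensorDataset(torch.randn(100, 10), torch.randint(0, 2, (100,)))

# shuffle=True with a fixed random_state gives shuffled but reproducible folds
kf = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
for fold, (train_index, val_index) in enumerate(kf.split(np.arange(len(dataset)))):
    # Subset selects samples from the dataset by position, without copying data
    train_loader = DataLoader(Subset(dataset, train_index.tolist()), batch_size=32, shuffle=True)
    val_loader = DataLoader(Subset(dataset, val_index.tolist()), batch_size=32, shuffle=False)

    # Train on train_loader and evaluate on val_loader here
    fold_sizes.append((len(train_loader.dataset), len(val_loader.dataset)))

print(fold_sizes)
```

KFold only needs the number of samples to generate indices, so splitting an index array avoids converting the tensors to numpy at all.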
