Feature Engineering

news2025/4/11 3:14:12

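Feature engineering

Numerical variables
- standardization
- log_transformation (to bring the distribution closer to normal)
- polynomial features
Categorical variables
- OrdinalEncoder
- onehot encoder
Numerical features derived from categories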

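The code below follows Abhishek Thakur's "30 Days of ML" course on Kaggle (the videos are also on Bilibili). Unfortunately I could not find the original source code; if you need it, you can "borrow" it from the notebooks other people have uploaded.
First, we import the libraries we need.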

import numpy as np 
import pandas as pd 
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error

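The data contain both numerical and categorical variables (the screenshot of the data from the original post is not reproduced here). We process each kind in turn below.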

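Numerical variables

standardization

Subtract the mean and divide by the standard deviation (StandardScaler). The scaler is fitted on the training split only, and the same statistics are reused for the validation and test splits.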

    scaler = preprocessing.StandardScaler()
    xtrain[numerical_cols] = scaler.fit_transform(xtrain[numerical_cols])
    xvalid[numerical_cols] = scaler.transform(xvalid[numerical_cols])
    xtest[numerical_cols] = scaler.transform(xtest[numerical_cols])
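
To make the effect concrete, here is a minimal sketch on toy data. The column names `cont0` and `cont1` and all values are made up for illustration; only the `StandardScaler` calls come from the snippet above.

```python
import numpy as np
import pandas as pd
from sklearn import preprocessing

# Toy data; column names and values are assumptions for illustration only.
xtrain = pd.DataFrame({"cont0": [1.0, 2.0, 3.0, 4.0],
                       "cont1": [10.0, 20.0, 30.0, 40.0]})
xvalid = pd.DataFrame({"cont0": [2.5, 5.0], "cont1": [25.0, 50.0]})
numerical_cols = ["cont0", "cont1"]

scaler = preprocessing.StandardScaler()
# fit_transform learns the mean and standard deviation from the training split
xtrain[numerical_cols] = scaler.fit_transform(xtrain[numerical_cols])
# transform reuses the training statistics instead of recomputing them
xvalid[numerical_cols] = scaler.transform(xvalid[numerical_cols])

# After scaling, each training column has mean 0 and (population) std 1
train_means = xtrain[numerical_cols].mean().to_numpy()
train_stds = xtrain[numerical_cols].std(ddof=0).to_numpy()
```

The validation columns are generally not exactly zero-mean after this step, which is expected: they are measured against the training distribution.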

log_transformation (to bring the distribution closer to normal)

A log transformation is one of several methods you can use to transform a dataset so that a relationship becomes linear.
This can reveal structure in the data that is not obvious at first. For example, when the original data show a nonlinear relationship, taking the natural log of the response variable can turn it into a linear one, which becomes visible when the scatterplots and residual plots of the two versions are compared side by side.

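A residual plot puts the independent variable of the regression on the horizontal axis and the residual on the vertical axis, with one point per observation. If the points scatter randomly above and below the line residual = 0, the regression line fits the observations well; otherwise the fit is poor. When the residuals are small in absolute value and scatter randomly around the zero line, the fit is good, which indicates a clear linear relationship between x and y.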
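
As a rough numerical counterpart to comparing the plots, the sketch below fits a straight line to a synthetic response before and after taking its log and compares how much variance the line explains. The data, the growth rate 0.8, the noise level, and the helper `r_squared` are all assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
# Synthetic response that grows exponentially in x, with multiplicative noise
y = np.exp(0.8 * x) * rng.lognormal(mean=0.0, sigma=0.1, size=x.size)

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (slope * x + intercept)
    return 1.0 - residuals.var() / y.var()

r2_raw = r_squared(x, y)          # line fitted to the raw response
r2_log = r_squared(x, np.log(y))  # line fitted to the logged response
```

For data of this shape, `r2_log` should come out clearly higher than `r2_raw`, which is the numerical version of the residuals losing their curved pattern.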

log1p(x) computes ln(1 + x), so it is also defined at x = 0.

for col in numerical_cols:
    df[col] = np.log1p(df[col])
    df_test[col] = np.log1p(df_test[col])
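
A small sketch of what `np.log1p` does to a skewed column, and how `np.expm1` undoes it. The column name `cont0` and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed, non-negative column
df = pd.DataFrame({"cont0": [0.0, 1.0, 10.0, 100.0, 1000.0]})

skew_before = df["cont0"].skew()
logged = np.log1p(df["cont0"])   # ln(1 + x); maps 0 to 0 instead of -inf
skew_after = logged.skew()
restored = np.expm1(logged)      # inverse transform: exp(y) - 1
```

Note that `log1p` only makes sense for values greater than -1; columns with large negative values need a different treatment.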

polynomial features

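With two features (a, b), a degree=2 polynomial expansion produces (1, a, b, a^2, ab, b^2), and so on for higher degrees. interaction_only=True keeps only products of distinct features, so pure powers such as a^2 and b^2 are dropped; include_bias=False drops the constant column 1.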

poly = preprocessing.PolynomialFeatures(degree=3, interaction_only=True, include_bias=False)
train_poly = poly.fit_transform(df[numerical_cols])
# reuse the fitted transformer on the test set instead of refitting
test_poly = poly.transform(df_test[numerical_cols])

Categorical variables

OrdinalEncoder

Maps each category directly to an integer code; for example, a becomes 0 and b becomes 1.

    ordinal_encoder = preprocessing.OrdinalEncoder()
    # fit on the training split only, then reuse the learned mapping
    xtrain[object_cols] = ordinal_encoder.fit_transform(xtrain[object_cols])
    xvalid[object_cols] = ordinal_encoder.transform(xvalid[object_cols])
    xtest[object_cols] = ordinal_encoder.transform(xtest[object_cols])

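One pitfall I noticed while writing this: calling fit_transform separately on each split can produce different category-to-integer mappings, whereas fitting once and then calling transform guarantees that the codes correspond one to one across splits.
If you want to encode a single Series, use LabelEncoder (which is designed for target labels).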
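
The mapping mismatch is easy to reproduce. In this sketch the column name `cat0` and the category values are made up; the validation split deliberately lacks the category "a".

```python
import pandas as pd
from sklearn import preprocessing

xtrain = pd.DataFrame({"cat0": ["a", "b", "c"]})
xvalid = pd.DataFrame({"cat0": ["b", "c"]})  # "a" does not occur here

# Refitting on the validation split: "b" sorts first there, so it gets code 0
refit = preprocessing.OrdinalEncoder().fit_transform(xvalid[["cat0"]])

# Fitting once on train and reusing the encoder: "b" keeps its training code 1
encoder = preprocessing.OrdinalEncoder()
encoder.fit(xtrain[["cat0"]])
consistent = encoder.transform(xvalid[["cat0"]])

print(refit.ravel())       # codes from the refitted encoder
print(consistent.ravel())  # codes from the training-fitted encoder
```

The same category ends up with different codes in the two approaches, so a model trained on the training codes would misread the refitted validation data.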

onehot encoder

One-hot encoding: each category becomes its own 0/1 column.

    # sparse_output replaces the older sparse argument in scikit-learn 1.2+
    ohe = preprocessing.OneHotEncoder(sparse_output=False, handle_unknown="ignore")
    xtrain_ohe = ohe.fit_transform(xtrain[object_cols])
    xvalid_ohe = ohe.transform(xvalid[object_cols])
    xtest_ohe = ohe.transform(xtest[object_cols])
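
The sketch below shows what handle_unknown="ignore" does with a category that was never seen during fitting. The column `cat0` and its values are hypothetical. To stay independent of the scikit-learn version, it keeps the default sparse output and converts it with `.toarray()`.

```python
import pandas as pd
from sklearn import preprocessing

xtrain = pd.DataFrame({"cat0": ["a", "b", "a"]})
xvalid = pd.DataFrame({"cat0": ["b", "z"]})  # "z" never appears in training

ohe = preprocessing.OneHotEncoder(handle_unknown="ignore")
# One output column per category seen in training: here "a" and "b"
xtrain_ohe = ohe.fit_transform(xtrain[["cat0"]]).toarray()
# The unseen category "z" is encoded as a row of zeros instead of raising
xvalid_ohe = ohe.transform(xvalid[["cat0"]]).toarray()
```

Without handle_unknown="ignore", the transform call on the validation split would raise an error for the unseen value.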

Numerical features derived from categories

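df.groupby(col)[col].transform("count")  # keeps the original shape: one value per row
df.groupby(col)[col].agg("count")  # one value per category ("count" is just an example)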
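
A minimal sketch of the difference between the two calls, on a made-up frame. The column names `cat0` and `cont0`, and the choice of "count" and "mean" as aggregations, are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"cat0": ["a", "b", "a", "a"],
                   "cont0": [1.0, 10.0, 3.0, 5.0]})

# transform broadcasts the group statistic back to every row,
# so the result can be assigned directly as a new feature column
df["cat0_count"] = df.groupby("cat0")["cat0"].transform("count")
df["cat0_cont0_mean"] = df.groupby("cat0")["cont0"].transform("mean")

# agg collapses each group to a single value, indexed by category
per_category = df.groupby("cat0")["cont0"].agg("mean")
```

As with the encoders, statistics like these are best computed on the training split and then mapped onto the validation and test splits, so that no information leaks from the held-out data.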
