金融贷款批准预测项目

news2024/11/20 13:23:34

注意:本文引用自专业人工智能社区Venus AI

更多AI知识请参考原站 ([www.aideeplearning.cn])

在金融服务行业,贷款审批是一项关键任务,它不仅关系到资金的安全,还直接影响到金融机构的运营效率和风险管理。传统的审批流程往往依赖于人工审核,这不仅效率低下,而且容易受到主观判断的影响。为了解决这些问题,我们引入了一种基于机器学习的贷款预测模型,旨在提高贷款审批的准确性和效率。

项目背景

在当前的金融市场中,违约率的不断波动对贷款审批流程提出了新的挑战。传统方法往往无法有效预测和管理这些风险,因此需要一种更智能、更可靠的方法来评估贷款申请。通过使用机器学习,我们可以从大量历史数据中学习并识别违约的潜在风险,这不仅能提高贷款批准的准确性,还能大大降低金融机构的损失。

经过训练的模型将用于预测新的贷款申请是否有高风险。这将帮助金融机构在贷款批准过程中做出更加明智的决策,减少不良贷款的比例,提高整体的财务健康状况。

数据集

我们项目使用的数据集包括了广泛的客户特征,这些特征反映了贷款申请者的财务状况和背景。具体包括:

  1. 性别(Gender):申请人的性别。
  2. 婚姻状况(Married):申请人的婚姻状态。
  3. 受抚养人数(Dependents):申请人负责抚养的人数。
  4. 教育背景(Education):申请人的教育水平。
  5. 是否自雇(Self_Employed):申请人是否拥有自己的业务。
  6. 申请人收入(ApplicantIncome):申请人的月收入。
  7. 共同申请人收入(CoapplicantIncome):与申请人一同申请贷款的人的月收入。
  8. 贷款金额(LoanAmount):申请的贷款总额。
  9. 贷款期限(Loan_Amount_Term):预期的还款期限。
  10. 信用历史(Credit_History):申请人的信用记录。
  11. 财产区域(Property_Area):申请人财产所在的地理位置。

模型和依赖库

Models:

  1. RandomForestRegressor
  2. Decision Tree Regression
  3. logistic regression

Libraries:

  1. matplotlib==3.7.1
  2. numpy==1.24.3
  3. pandas==2.0.2
  4. scikit_learn==1.2.2
  5. seaborn==0.13.0

代码实现

金融贷款批准预测

项目背景

在金融领域,贷款审批是向任何人提供贷款之前需要执行的一项至关重要的任务。 这确保了批准的贷款将来可以收回。 然而,要确定一个人是否适合贷款或违约者,就很难确定有助于做出决定的性格和特征。

在这些情况下,使用机器学习的贷款预测模型成为非常有用的工具,可以根据过去的数据来预测该人是否违约。

我们获得了两个数据集(训练和测试),其中包含过去的交易,其中包括客户的一些特征以及显示客户是否违约的标签。 我们建立了一个模型,可以在训练数据集上执行,并可以预测贷款是否应获得批准。

About Data:

导入库并加载数据

#Impoting libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df_train = pd.read_csv("train_u6lujuX_CVtuZ9i.csv")
df_test = pd.read_csv("test_Y3wMUE5_7gLdaTN.csv")
df_train.head()
Loan_IDGenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_Status
0LP001002MaleNo0GraduateNo58490.0NaN360.01.0UrbanY
1LP001003MaleYes1GraduateNo45831508.0128.0360.01.0RuralN
2LP001005MaleYes0GraduateYes30000.066.0360.01.0UrbanY
3LP001006MaleYes0Not GraduateNo25832358.0120.0360.01.0UrbanY
4LP001008MaleNo0GraduateNo60000.0141.0360.01.0UrbanY
df_test.head()
Loan_IDGenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_Area
0LP001015MaleYes0GraduateNo57200110.0360.01.0Urban
1LP001022MaleYes1GraduateNo30761500126.0360.01.0Urban
2LP001031MaleYes2GraduateNo50001800208.0360.01.0Urban
3LP001035MaleYes2GraduateNo23402546100.0360.0NaNUrban
4LP001051MaleNo0Not GraduateNo3276078.0360.01.0Urban
#shape of data
df_train.shape
(614, 13)
#data summary
df_train.describe()
ApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_History
count614.000000614.000000592.000000600.00000564.000000
mean5403.4592831621.245798146.412162342.000000.842199
std6109.0416732926.24836985.58732565.120410.364878
min150.0000000.0000009.00000012.000000.000000
25%2877.5000000.000000100.000000360.000001.000000
50%3812.5000001188.500000128.000000360.000001.000000
75%5795.0000002297.250000168.000000360.000001.000000
max81000.00000041667.000000700.000000480.000001.000000
df_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614 entries, 0 to 613
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Loan_ID            614 non-null    object 
 1   Gender             601 non-null    object 
 2   Married            611 non-null    object 
 3   Dependents         599 non-null    object 
 4   Education          614 non-null    object 
 5   Self_Employed      582 non-null    object 
 6   ApplicantIncome    614 non-null    int64  
 7   CoapplicantIncome  614 non-null    float64
 8   LoanAmount         592 non-null    float64
 9   Loan_Amount_Term   600 non-null    float64
 10  Credit_History     564 non-null    float64
 11  Property_Area      614 non-null    object 
 12  Loan_Status        614 non-null    object 
dtypes: float64(4), int64(1), object(8)
memory usage: 62.5+ KB

数据清洗

# 检测空值
df_train.isna().sum()
Loan_ID               0
Gender               13
Married               3
Dependents           15
Education             0
Self_Employed        32
ApplicantIncome       0
CoapplicantIncome     0
LoanAmount           22
Loan_Amount_Term     14
Credit_History       50
Property_Area         0
Loan_Status           0
dtype: int64

有很多空值,Credit_History 的最大值为 50。

去除所有空值

# Dropping all the null values
drop_list = ['Gender','Married','Dependents','Self_Employed','LoanAmount','Loan_Amount_Term','Credit_History']
for col in drop_list:
 df_train = df_train[~df_train[col].isna()]
df_train.isna().sum()
Loan_ID              0
Gender               0
Married              0
Dependents           0
Education            0
Self_Employed        0
ApplicantIncome      0
CoapplicantIncome    0
LoanAmount           0
Loan_Amount_Term     0
Credit_History       0
Property_Area        0
Loan_Status          0
dtype: int64

Loan_ID 列没用,这里删除它

# dropping Loan_ID
df_train.drop(columns='Loan_ID',axis=1, inplace=True)
df_train.shape
(480, 12)
#data summary
df_train.describe()
ApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_History
count480.000000480.000000480.000000480.000000480.000000
mean5364.2312501581.093583144.735417342.0500000.854167
std5668.2512512617.69226780.50816465.2124010.353307
min150.0000000.0000009.00000036.0000000.000000
25%2898.7500000.000000100.000000360.0000001.000000
50%3859.0000001084.500000128.000000360.0000001.000000
75%5852.5000002253.250000170.000000360.0000001.000000
max81000.00000033837.000000600.000000480.0000001.000000

数据分析(EDA)

df_train.head()
GenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_Status
1MaleYes1GraduateNo45831508.0128.0360.01.0RuralN
2MaleYes0GraduateYes30000.066.0360.01.0UrbanY
3MaleYes0Not GraduateNo25832358.0120.0360.01.0UrbanY
4MaleNo0GraduateNo60000.0141.0360.01.0UrbanY
5MaleYes2GraduateYes54174196.0267.0360.01.0UrbanY
#distribution of Churn data
sns.displot(data=df_train,x='Loan_Status')
<seaborn.axisgrid.FacetGrid at 0x1f54d853bb0>

数据集是不平衡的,但是不是非常严重

自变量相对于因变量的分布.

# 设置分类特征
categorical_features=list(df_train.columns)
numeical_features = list(df_train.describe().columns)
for elem in numeical_features:
 categorical_features.remove(elem)
categorical_features = categorical_features[:-1]
categorical_features
['Gender',
 'Married',
 'Dependents',
 'Education',
 'Self_Employed',
 'Property_Area']
# Set categorical and numerical features
categorical_features = list(df_train.columns)
numerical_features = list(df_train.describe().columns)
for elem in numerical_features:
    categorical_features.remove(elem)
categorical_features.remove('Loan_Status')  # Assuming 'Loan_Status' is not a feature to plot

# Determine the layout of subplots
n_cols = 2  # Can be adjusted based on preference
n_rows = (len(categorical_features) + 1) // n_cols

# Create a grid of subplots
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(12, n_rows * 4))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plot each bar chart
for i, col in enumerate(categorical_features):
    df_train.groupby([col, 'Loan_Status']).size().unstack().plot(kind='bar', stacked=True, ax=axes[i])
    axes[i].set_title(f'Total count of Loan_Status grouped by {col}')
    axes[i].set_ylabel('Count')

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

从上面的图中观察到的结果:

  • 与女性相比,男性获得贷款批准的比例更高。
  • 与非毕业生相比,贷款审批对毕业生更有利。

  • 与受雇者相比,个体经营者获得贷款批准的机会较少。

  • 城乡结合部的贷款批准率最高。

让我们看看按因变量分组的连续自变量

numerical_features = df_train.describe().columns

# Determine the layout of subplots
n_cols = 2  # Adjust based on preference
n_rows = (len(numerical_features) + 1) // n_cols

# Create a grid of subplots
fig, axes = plt.subplots(nrows=n_rows, ncols=n_cols, figsize=(12, n_rows * 4))

# Flatten the axes array for easy iteration
axes = axes.flatten()

# Plot each boxplot
for i, col in enumerate(numerical_features):
    sns.boxplot(x='Loan_Status', y=col, data=df_train, ax=axes[i])
    axes[i].set_title(f'Distribution of {col} grouped by Loan_Status')

# Adjust layout and display the plot
plt.tight_layout()
plt.show()

我们可以在数据中观察到很多异常值。

从上面的箱线图中无法得出任何正确的结论。

相关性分析

 ## Correlation between variables
plt.figure(figsize=(15,8))
correlation = df_train.corr()
sns.heatmap((correlation), annot=True, cmap='coolwarm')
<Axes: >

没有观察到任何显着的相关性。

数据预处理

df_train.head()
GenderMarriedDependentsEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryProperty_AreaLoan_Status
1MaleYes1GraduateNo45831508.0128.0360.01.0RuralN
2MaleYes0GraduateYes30000.066.0360.01.0UrbanY
3MaleYes0Not GraduateNo25832358.0120.0360.01.0UrbanY
4MaleNo0GraduateNo60000.0141.0360.01.0UrbanY
5MaleYes2GraduateYes54174196.0267.0360.01.0UrbanY
df_train['Property_Area'].value_counts()
Semiurban    191
Urban        150
Rural        139
Name: Property_Area, dtype: int64
df_train['Credit_History'].value_counts()
1.0    410
0.0     70
Name: Credit_History, dtype: int64
df_train['Dependents'].value_counts()
0     274
2      85
1      80
3+     41
Name: Dependents, dtype: int64

使用标签编码将分类列转换为数字

#Label encoding for some categorical features
df_train_new = df_train.copy()
label_col_list = ['Married','Self_Employed']
for col in label_col_list:
  df_train_new=df_train_new.replace({col:{'Yes':1,'No':0}})
df_train_new=df_train_new.replace({'Gender':{'Male':1,'Female':0}})
df_train_new=df_train_new.replace({'Education':{'Graduate':1,'Not Graduate':0}})
df_train_new=df_train_new.replace({'Loan_Status':{'Y':1,'N':0}})

对于其余的分类特征,我们将进行一种热编码:

#one hot encoding
df_train_new = pd.get_dummies(df_train_new, columns=["Dependents","Property_Area"])
df_train_new.head()
GenderMarriedEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryLoan_StatusDependents_0Dependents_1Dependents_2Dependents_3+Property_Area_RuralProperty_Area_SemiurbanProperty_Area_Urban
1111045831508.0128.0360.01.000100100
2111130000.066.0360.01.011000001
3110025832358.0120.0360.01.011000001
4101060000.0141.0360.01.011000001
5111154174196.0267.0360.01.010010001

标准化连续变量。

#standardize continuous features
from scipy.stats import zscore
df_train_new[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']]=df_train_new[['ApplicantIncome','CoapplicantIncome','LoanAmount','Loan_Amount_Term']].apply(zscore) 
df_train_new.head()
GenderMarriedEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryLoan_StatusDependents_0Dependents_1Dependents_2Dependents_3+Property_Area_RuralProperty_Area_SemiurbanProperty_Area_Urban
11110-0.137970-0.027952-0.2080890.2755421.000100100
21111-0.417536-0.604633-0.9790010.2755421.011000001
31100-0.4911800.297100-0.3075620.2755421.011000001
410100.112280-0.604633-0.0464460.2755421.011000001
511110.0093190.9999781.5202450.2755421.010010001
# Repositioning the dependent variable to last index
last_column = df_train_new.pop('Loan_Status')
df_train_new.insert(16, 'Loan_Status', last_column)
df_train_new.head()
GenderMarriedEducationSelf_EmployedApplicantIncomeCoapplicantIncomeLoanAmountLoan_Amount_TermCredit_HistoryDependents_0Dependents_1Dependents_2Dependents_3+Property_Area_RuralProperty_Area_SemiurbanProperty_Area_UrbanLoan_Status
11110-0.137970-0.027952-0.2080890.2755421.001001000
21111-0.417536-0.604633-0.9790010.2755421.010000011
31100-0.4911800.297100-0.3075620.2755421.010000011
410100.112280-0.604633-0.0464460.2755421.010000011
511110.0093190.9999781.5202450.2755421.000100011

数据处理完毕,准备训练模型

数据集划分

由于我们的数据仅用于训练,其他数据可用于测试。 我们仍然会进行训练测试分割,因为测试数据没有标记,并且有必要根据未见过的数据评估模型。

X= df_train_new.iloc[:,:-1]
y= df_train_new.iloc[:,-1]
from sklearn.model_selection import train_test_split 
X_train, X_test, y_train, y_test = train_test_split( X,y , test_size = 0.2, random_state = 0) 
print(X_train.shape)
print(X_test.shape)
(384, 16)
(96, 16)
y_train.value_counts()
1    271
0    113
Name: Loan_Status, dtype: int64
y_test.value_counts()
1    61
0    35
Name: Loan_Status, dtype: int64

对训练数据进行逻辑回归拟合

#Importing and fitting Logistic regression
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(fit_intercept=True, max_iter=10000,random_state=0)
lr.fit(X_train, y_train)

LogisticRegression

LogisticRegression(max_iter=10000, random_state=0)
# Get the model coefficients
lr.coef_
array([[ 0.23272114,  0.57128602,  0.26384918, -0.24617035,  0.15924191,
        -0.14703758, -0.19280038, -0.16392914,  2.97399665, -0.18202629,
        -0.27741114,  0.17256535,  0.28601466, -0.30275813,  0.64592912,
        -0.3440284 ]])
#model intercept
lr.intercept_
array([-2.1943974])

评价训练模型的性能

# Get the predicted probabilities
train_preds = lr.predict_proba(X_train)
test_preds = lr.predict_proba(X_test)
test_preds
array([[0.23916396, 0.76083604],
       [0.24506751, 0.75493249],
       [0.04933527, 0.95066473],
       [0.20146124, 0.79853876],
       [0.2347122 , 0.7652878 ],
       [0.05817427, 0.94182573],
       [0.17668886, 0.82331114],
       [0.21352909, 0.78647091],
       [0.39015173, 0.60984827],
       [0.1902079 , 0.8097921 ],
       [0.20590091, 0.79409909],
       [0.184445  , 0.815555  ],
       [0.80677694, 0.19322306],
       [0.23024539, 0.76975461],
       [0.23674387, 0.76325613],
       [0.32409412, 0.67590588],
       [0.08612609, 0.91387391],
       [0.20502754, 0.79497246],
       [0.71006169, 0.28993831],
       [0.05818474, 0.94181526],
       [0.16546532, 0.83453468],
       [0.1191243 , 0.8808757 ],
       [0.16412334, 0.83587666],
       [0.14471253, 0.85528747],
       [0.49082632, 0.50917368],
       [0.37484189, 0.62515811],
       [0.20042593, 0.79957407],
       [0.07289182, 0.92710818],
       [0.10696878, 0.89303122],
       [0.27313905, 0.72686095],
       [0.07661587, 0.92338413],
       [0.07911086, 0.92088914],
       [0.32357856, 0.67642144],
       [0.24855278, 0.75144722],
       [0.25736849, 0.74263151],
       [0.10330185, 0.89669815],
       [0.27934665, 0.72065335],
       [0.23504431, 0.76495569],
       [0.37235234, 0.62764766],
       [0.82612173, 0.17387827],
       [0.25597195, 0.74402805],
       [0.07027974, 0.92972026],
       [0.21138903, 0.78861097],
       [0.30656929, 0.69343071],
       [0.12859877, 0.87140123],
       [0.22422238, 0.77577762],
       [0.19222405, 0.80777595],
       [0.33904961, 0.66095039],
       [0.21169609, 0.78830391],
       [0.12783677, 0.87216323],
       [0.21562742, 0.78437258],
       [0.1003408 , 0.8996592 ],
       [0.39205576, 0.60794424],
       [0.10298106, 0.89701894],
       [0.34917087, 0.65082913],
       [0.31848606, 0.68151394],
       [0.46697536, 0.53302464],
       [0.83005638, 0.16994362],
       [0.84749511, 0.15250489],
       [0.82240763, 0.17759237],
       [0.08938059, 0.91061941],
       [0.38214865, 0.61785135],
       [0.62202628, 0.37797372],
       [0.1124887 , 0.8875113 ],
       [0.29371977, 0.70628023],
       [0.12829643, 0.87170357],
       [0.30152976, 0.69847024],
       [0.12669798, 0.87330202],
       [0.07601492, 0.92398508],
       [0.06068026, 0.93931974],
       [0.05461916, 0.94538084],
       [0.10209121, 0.89790879],
       [0.20592351, 0.79407649],
       [0.56190874, 0.43809126],
       [0.19828342, 0.80171658],
       [0.20171019, 0.79828981],
       [0.11960918, 0.88039082],
       [0.25602438, 0.74397562],
       [0.18013843, 0.81986157],
       [0.37225288, 0.62774712],
       [0.21781716, 0.78218284],
       [0.10365239, 0.89634761],
       [0.29076172, 0.70923828],
       [0.59602673, 0.40397327],
       [0.39435357, 0.60564643],
       [0.40070233, 0.59929767],
       [0.88224869, 0.11775131],
       [0.22235351, 0.77764649],
       [0.1765423 , 0.8234577 ],
       [0.75247369, 0.24752631],
       [0.20366031, 0.79633969],
       [0.85207477, 0.14792523],
       [0.3873617 , 0.6126383 ],
       [0.12318258, 0.87681742],
       [0.06667711, 0.93332289],
       [0.17440779, 0.82559221]])
# Get the predicted classes
train_class_preds = lr.predict(X_train)
test_class_preds = lr.predict(X_test)
train_class_preds
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       0, 0, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,
       1, 1, 1, 0, 1, 0, 1, 1, 1, 0], dtype=int64)

准确率

from sklearn.metrics import accuracy_score, confusion_matrix ,classification_report
# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)
The accuracy on train data is  0.8229166666666666
The accuracy on test data is  0.7604166666666666

由于我们的数据有些不平衡,准确性可能不是一个好的指标。 让我们使用 roc_auc 分数。

# Get the roc_auc scores
train_roc_auc = accuracy_score(y_train,train_class_preds)
test_roc_auc = accuracy_score(y_test,test_class_preds)

print("The accuracy on train data is ", train_roc_auc)
print("The accuracy on test data is ", test_roc_auc)
The accuracy on train data is  0.8229166666666666
The accuracy on test data is  0.7604166666666666
# Other evaluation metrics for train data
print(classification_report(train_class_preds,y_train))
              precision    recall  f1-score   support

           0       0.45      0.89      0.60        57
           1       0.98      0.81      0.89       327

    accuracy                           0.82       384
   macro avg       0.71      0.85      0.74       384
weighted avg       0.90      0.82      0.84       384

# Other evaluation metrics for train data
print(classification_report(y_test,test_class_preds))
              precision    recall  f1-score   support

           0       1.00      0.34      0.51        35
           1       0.73      1.00      0.84        61

    accuracy                           0.76        96
   macro avg       0.86      0.67      0.68        96
weighted avg       0.83      0.76      0.72        96

训练集和测试集上的混淆矩阵

# Get the confusion matrix for trained data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on trained data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()


# Get the confusion matrix for test data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on test data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
[[ 51  62]
 [  6 265]]

[[12 23]
 [ 0 61]]
[Text(0, 0.5, 'Notapproved'), Text(0, 1.5, 'approved')]

决策树

#Importing libraries
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV
# applying GreadsearchCV to identify best parameters
decision_tree = DecisionTreeClassifier()
tree_para = {'criterion':['gini','entropy'],'max_depth':[4,5,6,7,8,9,10,11,12,15,20,30,40,50,70,90,120,150]}
clf = GridSearchCV(decision_tree, tree_para, cv=5)
clf.fit(X_train, y_train)

clf.best_params_
{'criterion': 'gini', 'max_depth': 4}
#applying decision tree classifier
dt = DecisionTreeClassifier(criterion='gini',max_depth=4,random_state=0)
dt.fit(X_train, y_train)

train_class_preds = dt.predict(X_train)
test_class_preds = dt.predict(X_test)

Accuracy Score

# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)
The accuracy on train data is  0.8463541666666666
The accuracy on test data is  0.71875

roc_auc score

# Get the roc_auc scores
train_roc_auc = accuracy_score(y_train,train_class_preds)
test_roc_auc = accuracy_score(y_test,test_class_preds)

print("The accuracy on train data is ", train_roc_auc)
print("The accuracy on test data is ", test_roc_auc)
The accuracy on train data is  0.8463541666666666
The accuracy on test data is  0.71875
# Other evaluation metrics for train data
print(classification_report(train_class_preds,y_train))
              precision    recall  f1-score   support

           0       0.54      0.90      0.67        68
           1       0.97      0.84      0.90       316

    accuracy                           0.85       384
   macro avg       0.76      0.87      0.79       384
weighted avg       0.90      0.85      0.86       384
# Other evaluation metrics for train data
print(classification_report(y_test,test_class_preds))
              precision    recall  f1-score   support

           0       0.70      0.40      0.51        35
           1       0.72      0.90      0.80        61

    accuracy                           0.72        96
   macro avg       0.71      0.65      0.66        96
weighted avg       0.72      0.72      0.70        96

Confusion matrix on trained and test data

# Get the confusion matrix for trained data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on trained data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

# Get the confusion matrix for test data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on test data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
[[ 61  52]
 [  7 264]]

[[14 21]
 [ 6 55]]
[Text(0, 0.5, 'Notapproved'), Text(0, 1.5, 'approved')]

随机森林

# applying Random forrest classifier with Hyperparameter tuning
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
grid_values = {'n_estimators':[50, 80,  100], 'max_depth':[4,5,6,7,8,9,10]}
rf_gd = GridSearchCV(rf, param_grid = grid_values, scoring = 'roc_auc', cv=5)

# Fit the object to train dataset
rf_gd.fit(X_train, y_train)

train_class_preds = rf_gd.predict(X_train)
test_class_preds = rf_gd.predict(X_test)

Accuracy Score

# Get the accuracy scores
train_accuracy = accuracy_score(train_class_preds,y_train)
test_accuracy = accuracy_score(test_class_preds,y_test)

print("The accuracy on train data is ", train_accuracy)
print("The accuracy on test data is ", test_accuracy)
The accuracy on train data is  0.890625
The accuracy on test data is  0.75

roc_auc Score

# Get the roc_auc scores
train_roc_auc = accuracy_score(y_train,train_class_preds)
test_roc_auc = accuracy_score(y_test,test_class_preds)

print("The accuracy on train data is ", train_roc_auc)
print("The accuracy on test data is ", test_roc_auc)
The accuracy on train data is  0.890625
The accuracy on test data is  0.75

Confusion Matrix

# Get the confusion matrix for trained data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_train, train_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax) #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on trained data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()

# Get the confusion matrix for test data

labels = ['Notapproved', 'approved']
cm = confusion_matrix(y_test, test_class_preds)
print(cm)

ax= plt.subplot()
sns.heatmap(cm, annot=True, ax = ax); #annot=True to annotate cells

# labels, title and ticks
ax.set_xlabel('Predicted labels')
ax.set_ylabel('True labels')
ax.set_title('Confusion Matrix on test data')
ax.xaxis.set_ticklabels(labels)
ax.yaxis.set_ticklabels(labels)
plt.show()
[[ 72  41]
 [  1 270]]

[[13 22]
 [ 2 59]]

  • 最佳 roc_auc 分数源于随机森林分类器,因此随机森林是该模型的最佳预测模型。

代码与数据集下载

详情请见金融贷款批准预测项目-VenusAI (aideeplearning.cn)

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.coloradmin.cn/o/1570276.html

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈,一经查实,立即删除!

相关文章

LIN总线基础

文章目录 1 什么是LIN 总线&#xff1f;1.1 LIN总线的历史 2.LIN总线的特点3. 应用4 LIN总线基本概念4.1 LIN报文帧结构4.1.1 主节点与从节点4.1.2 调度表4.1.3网络管理4.1.4 帧头结构4.1.4.1 电平 1 什么是LIN 总线&#xff1f; LIN(Local Interconnect Network)是一种低成本…

linux 搭建Samba服务

Samba简介 SAMBA是⼀个实现不同操作系统之间⽂件共享和打印机共享的⼀种SMB协议的免费软件&#xff0c; SMB(Server Message block)协议是window下所使⽤的⽂件共享协议&#xff0c;我们在linux系统或 者其类unix系统当中可以通过samba服务来实现SMB功能。 &#xff08;1&…

Linux离线安装python3(源码编译)

1、下载python包 下载python3.9.6的源码包 python下载 下载后&#xff0c;解压&#xff0c;目录如下&#xff1a; -rw-------. 1 root root 1454 Aug 26 2023 anaconda-ks.cfg -rw-r--r--. 1 root root 25640094 Apr 4 21:52 Python-3.9.6.tgz drwxrwxr…

LC 96.不同的二叉搜索树

96.不同的二叉搜索树 给你一个整数 n &#xff0c;求恰由 n 个节点组成且节点值从 1 到 n 互不相同的 二叉搜索树 有多少种&#xff1f;返回满足题意的二叉搜索树的种数。 示例 1&#xff1a; 输入&#xff1a; n 3 输出&#xff1a; 5 示例 2&#xff1a; 输入&#xff1a;…

配置启动nacos,保姆级教程

下载nacos 下载链接 https://github.com/alibaba/nacos/releases进去下拉&#xff0c;找到下载版本信息。 下载后如图所示。 配置数据库 在我们的conf文件夹中有一个nacos-mysql的数据库文件 我们需要导入数据库&#xff0c;可通过工具Navicat等进行导入。 会有一下几张表…

【面试经典150 | 动态规划】交错字符串

文章目录 写在前面Tag题目来源解题思路方法一&#xff1a;动态规划 写在最后 写在前面 本专栏专注于分析与讲解【面试经典150】算法&#xff0c;两到三天更新一篇文章&#xff0c;欢迎催更…… 专栏内容以分析题目为主&#xff0c;并附带一些对于本题涉及到的数据结构等内容进行…

JAVA毕业设计133—基于Java+Springboot+Vue的网上宠物店商城管理系统(源代码+数据库+12000字论文)

毕设所有选题&#xff1a; https://blog.csdn.net/2303_76227485/article/details/131104075 基于JavaSpringbootVue的网上宠物店商城管理系统(源代码数据库12000字论文)133 一、系统介绍 本项目前后端分离&#xff0c;分为管理员、用户两种角色 1、用户&#xff1a; 注册…

聚合DNS管理系统v1.0全新发布 域名解析管理系统

内容目录 一、详细介绍二、效果展示1.部分代码2.效果图展示 三、学习资料下载 一、详细介绍 聚合DNS管理系统可以实现在一个网站内管理多个平台的域名解析&#xff0c;目前已支持的域名平台有&#xff1a;阿里云、腾讯云、华为云、西部数码、CloudFlare。本系统支持多用户&…

数据挖掘中的PCA和KMeans:Airbnb房源案例研究

目录 一、PCA简介 二、数据集概览 三、数据预处理步骤 四、PCA申请 五、KMeans 聚类 六、PCA成分分析 七、逆变换 八、质心分析 九、结论 十、深入探究 10.1 第 1 步&#xff1a;确定 PCA 组件的最佳数量 10.2 第 2 步&#xff1a;使用 9 个组件重做 PCA 10.3 解释 PCA 加载和特…

高度不同的流体瀑布css实现方法

商城商品列表 实现瀑布流展示&#xff0c;通过flex或grid实现会导致每行中的列高度一致&#xff0c;无法达到错落有致的感觉&#xff1b; 为此需要用到&#xff1a; CSS columns 属性 columns 属性是一个简写属性&#xff0c;用于设置列宽和列数。 CSS 语法 columns: column-wi…

网络攻防中json序列化漏洞案例,fastjson远程命令执行漏洞原理

网络攻防中json序列化漏洞案例,fastjson远程命令执行漏洞原理。 网络攻防中的JSON序列化漏洞是指当应用程序使用JSON(JavaScript Object Notation)格式来序列化和反序列化对象时,由于不当处理或不安全的编程实践,导致攻击者能够执行恶意操作的安全漏洞。这些操作可能包括远…

成为不可替代的人,优秀到不能被忽视

一、资料描述 本套个人成长资料&#xff0c;大小30.89M&#xff0c;共有21个文件。 二、资料目录 00发刊词 成为不可替代的人.pdf 01累死你的不是工作&#xff0c;是错的职场价值观.pdf 02教你选好行业&#xff0c;远离裁员降薪.pdf 03如何选对公司&#xff0c;让高薪升值…

AJAX —— 学习(一)

目录 一、原生 AJAX &#xff08;一&#xff09;AJAX 介绍 1.理解 2.作用 3.最大的优势 4.应用例子 &#xff08;二&#xff09;XML 介绍 1.理解 2.作用 &#xff08;三&#xff09;AJAX 的特点 1.优点 2.缺点 二、HTTP 协议 &#xff08;一&#xff09;HTTP 介…

【攻防世界】ics-05

php://filter 伪协议查看源码 preg_replace 函数漏洞 1.获取网页源代码。多点点界面&#xff0c;发现点云平台设备维护中心时&#xff0c;页面发生变化。 /?pageindex 输入什么显示什么&#xff0c;有回显。 用php://filter读取网页源代码 ?pagephp://filter/readconvert.…

docker安装、调试qsign签名服务器

go-cqhttp 在 Docker 里早就部署好了&#xff0c;由于没有搭建 qsign 签名服务器&#xff0c;所以迟迟不敢上线。今天终于搞定了在 Docker 下安装 qsign 签名服务器了。这次用的docker市场里找到的镜像&#xff0c;下次找时间制作一个自己的镜像。 1 拉取和运行镜像&#xff1a…

轻薄本没有独立显卡如何运行stable diffusion

众所周知&#xff0c;Stable Diffusion WebUI 使用 GPU 模式运行。 一&#xff1a;检查自己显卡 打开任务管理器或者winR 输入dxdiag 查看自己显卡状态 很明显一般轻薄本只会带有集显&#xff0c;不能满足stable diffusion要求所以我们可以使用cup来运行stable diffusion 在…

转让1000万不良资产处置公司包变更需要多久

随着经济的发展和市场的变化&#xff0c;越来越多的公司需要进行资产的处置和转让&#xff0c;以提言公司的效益和盈利。山东不良资产处置公司作为一家专门服务于资产处置和转让的企业&#xff0c;其流程与变更步骤对于资产转让的操作至关重要。可以致电咨询我或者来公司面谈。…

电荷泵如何实现升压原来

电荷泵如何实现升压原来 某芯片自举栅极驱动内部原理图迪克森电荷泵 某芯片自举栅极驱动内部原理图 迪克森电荷泵 迪克森电荷泵&#xff08;Dickson Charge Pump&#xff09;是一种电压倍增器电路&#xff0c;可以将低电压升高到较高电压&#xff0c;相对于其他电压升压电路&a…

网站建设 之 发布ios

首先将forceDev改为false 然后执行npm run build:ios 然后用xocode安装到手机上进行测试 ##Version&#xff08;应用程序发布版本号&#xff09; 对应的就是CFBundleShortVersionString。该版本的版本号是三个时期分隔的整数组成的字符串&#xff1a; 第一个整数代表重大修…

Pytoch安装记录

使用pycharm 1、CUDA的安装 官网&#xff1a;CUDA Toolkit Archive | NVIDIA Developer 选择对应的版本 选择对应的版本进行下载&#xff1a; 有3个多G cuda的安装需要注意&#xff0c;如果没有安装vs&#xff0c;则需要选择自定义安装&#xff0c;在自定义的安装中取消 安…