Enterprise-grade hands-on project: automated bankruptcy prediction with pycaret

news2024/11/17 15:36:27

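This article is part of a hands-on data mining series. Unlike the earlier case studies, this one uses automated machine learning for the model building and tuning work, so we can see first-hand how much convenience it brings and how well it performs.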

1. Introduction


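This article is a hands-on data mining case study. It explores in detail data collected from the Taiwan Economic Journal for the years 1999 to 2009, looks at what useful insights data exploration can surface, and determines which model predicts company bankruptcy most accurately.

Company bankruptcy is defined according to the business regulations of the Taiwan Stock Exchange.

The modeling uses the automated machine learning library pycaret. pycaret is an open-source, low-code machine learning library written in Python that automates the machine learning workflow. It is worth exploring the library further to get a better understanding of what it can do.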

Set up the environment and load the data

import pandas as pd  
import numpy as np  
import math  
import matplotlib.pyplot as plt  
import seaborn as sns  
  
bankruptcy_df = pd.read_csv("Bankruptcy.csv")    

bankruptcy_df.head()  


2. Understanding the data

bankruptcy_df.info()
<class 'pandas.core.frame.DataFrame'>  
RangeIndex: 6819 entries, 0 to 6818  
Data columns (total 96 columns):  
 #   Column                                                    Non-Null Count  Dtype    
---  ------                                                    --------------  -----  
 0   Bankrupt?                                                 6819 non-null   int64    
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64  
 2    ROA(A) before interest and % after tax                   6819 non-null   float64  
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64  
 4    Operating Gross Margin                                   6819 non-null   float64  
 5    Realized Sales Gross Margin                              6819 non-null   float64  
 6    Operating Profit Rate                                    6819 non-null   float64  
 7    Pre-tax net Interest Rate                                6819 non-null   float64  
 8    After-tax net Interest Rate                              6819 non-null   float64  
 9    Non-industry income and expenditure/revenue              6819 non-null   float64  
 10   Continuous interest rate (after tax)                     6819 non-null   float64  
 11   Operating Expense Rate                                   6819 non-null   float64  
 12   Research and development expense rate                    6819 non-null   float64  
 13   Cash flow rate                                           6819 non-null   float64  
 14   Interest-bearing debt interest rate                      6819 non-null   float64  
 15   Tax rate (A)                                             6819 non-null   float64  
 16   Net Value Per Share (B)                                  6819 non-null   float64  
 17   Net Value Per Share (A)                                  6819 non-null   float64  
 18   Net Value Per Share (C)                                  6819 non-null   float64  
 19   Persistent EPS in the Last Four Seasons                  6819 non-null   float64  
 20   Cash Flow Per Share                                      6819 non-null   float64  
 21   Revenue Per Share (Yuan ¥)                               6819 non-null   float64  
 22   Operating Profit Per Share (Yuan ¥)                      6819 non-null   float64  
 23   Per Share Net profit before tax (Yuan ¥)                 6819 non-null   float64  
 24   Realized Sales Gross Profit Growth Rate                  6819 non-null   float64  
 25   Operating Profit Growth Rate                             6819 non-null   float64  
 26   After-tax Net Profit Growth Rate                         6819 non-null   float64  
 27   Regular Net Profit Growth Rate                           6819 non-null   float64  
 28   Continuous Net Profit Growth Rate                        6819 non-null   float64  
 29   Total Asset Growth Rate                                  6819 non-null   float64  
 30   Net Value Growth Rate                                    6819 non-null   float64  
 31   Total Asset Return Growth Rate Ratio                     6819 non-null   float64  
 32   Cash Reinvestment %                                      6819 non-null   float64  
 33   Current Ratio                                            6819 non-null   float64  
 34   Quick Ratio                                              6819 non-null   float64  
 35   Interest Expense Ratio                                   6819 non-null   float64  
 36   Total debt/Total net worth                               6819 non-null   float64  
 37   Debt ratio %                                             6819 non-null   float64  
 38   Net worth/Assets                                         6819 non-null   float64  
 39   Long-term fund suitability ratio (A)                     6819 non-null   float64  
 40   Borrowing dependency                                     6819 non-null   float64  
 41   Contingent liabilities/Net worth                         6819 non-null   float64  
 42   Operating profit/Paid-in capital                         6819 non-null   float64  
 43   Net profit before tax/Paid-in capital                    6819 non-null   float64  
 44   Inventory and accounts receivable/Net value              6819 non-null   float64  
 45   Total Asset Turnover                                     6819 non-null   float64  
 46   Accounts Receivable Turnover                             6819 non-null   float64  
 47   Average Collection Days                                  6819 non-null   float64  
 48   Inventory Turnover Rate (times)                          6819 non-null   float64  
 49   Fixed Assets Turnover Frequency                          6819 non-null   float64  
 50   Net Worth Turnover Rate (times)                          6819 non-null   float64  
 51   Revenue per person                                       6819 non-null   float64  
 52   Operating profit per person                              6819 non-null   float64  
 53   Allocation rate per person                               6819 non-null   float64  
 54   Working Capital to Total Assets                          6819 non-null   float64  
 55   Quick Assets/Total Assets                                6819 non-null   float64  
 56   Current Assets/Total Assets                              6819 non-null   float64  
 57   Cash/Total Assets                                        6819 non-null   float64  
 58   Quick Assets/Current Liability                           6819 non-null   float64  
 59   Cash/Current Liability                                   6819 non-null   float64  
 60   Current Liability to Assets                              6819 non-null   float64  
 61   Operating Funds to Liability                             6819 non-null   float64  
 62   Inventory/Working Capital                                6819 non-null   float64  
 63   Inventory/Current Liability                              6819 non-null   float64  
 64   Current Liabilities/Liability                            6819 non-null   float64  
 65   Working Capital/Equity                                   6819 non-null   float64  
 66   Current Liabilities/Equity                               6819 non-null   float64  
 67   Long-term Liability to Current Assets                    6819 non-null   float64  
 68   Retained Earnings to Total Assets                        6819 non-null   float64  
 69   Total income/Total expense                               6819 non-null   float64  
 70   Total expense/Assets                                     6819 non-null   float64  
 71   Current Asset Turnover Rate                              6819 non-null   float64  
 72   Quick Asset Turnover Rate                                6819 non-null   float64  
 73   Working capitcal Turnover Rate                           6819 non-null   float64  
 74   Cash Turnover Rate                                       6819 non-null   float64  
 75   Cash Flow to Sales                                       6819 non-null   float64  
 76   Fixed Assets to Assets                                   6819 non-null   float64  
 77   Current Liability to Liability                           6819 non-null   float64  
 78   Current Liability to Equity                              6819 non-null   float64  
 79   Equity to Long-term Liability                            6819 non-null   float64  
 80   Cash Flow to Total Assets                                6819 non-null   float64  
 81   Cash Flow to Liability                                   6819 non-null   float64  
 82   CFO to Assets                                            6819 non-null   float64  
 83   Cash Flow to Equity                                      6819 non-null   float64  
 84   Current Liability to Current Assets                      6819 non-null   float64  
 85   Liability-Assets Flag                                    6819 non-null   int64    
 86   Net Income to Total Assets                               6819 non-null   float64  
 87   Total assets to GNP price                                6819 non-null   float64  
 88   No-credit Interval                                       6819 non-null   float64  
 89   Gross Profit to Sales                                    6819 non-null   float64  
 90   Net Income to Stockholder's Equity                       6819 non-null   float64  
 91   Liability to Equity                                      6819 non-null   float64  
 92   Degree of Financial Leverage (DFL)                       6819 non-null   float64  
 93   Interest Coverage Ratio (Interest expense to EBIT)       6819 non-null   float64  
 94   Net Income Flag                                          6819 non-null   int64    
 95   Equity to Liability                                      6819 non-null   float64  
dtypes: float64(93), int64(3)  
memory usage: 5.0 MB
bankruptcy_df.shape  
(6819, 96)  
bankruptcy_df.describe()  

3. Data exploration and cleaning

3.1 Handling missing values

bankruptcy_df.columns[bankruptcy_df.isna().any()]  
Index([], dtype='object')  

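The result shows that the dataset is complete: no column contains missing values.

.any() asks whether a column contains at least one missing value, while its counterpart .all() asks whether every value in the column is missing.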
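The difference between .any() and .all() is easiest to see on a small frame. The sketch below uses a made-up three-column table (the column names and values are hypothetical, not taken from the bankruptcy data).

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): "a" is complete, "b" has one gap, "c" is entirely missing.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [1.0, np.nan, 3.0],
    "c": [np.nan, np.nan, np.nan],
})

mask = toy.isna()                                        # True wherever a value is missing
cols_with_any_missing = list(toy.columns[mask.any()])    # columns with at least one gap
cols_all_missing = list(toy.columns[mask.all()])         # columns where every value is missing
missing_per_column = mask.sum().to_dict()                # number of gaps per column

print(cols_with_any_missing)
print(cols_all_missing)
print(missing_per_column)
```

On real data the per-column count is usually the most useful of the three, because it shows how much would be lost by dropping rows or columns.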

Cleaning up the column names
def clean_col_names(col_name):  
    col_name = (  
        col_name.strip()  
        .replace("?", "_")  
        .replace("(", "_")  
        .replace(")", "_")  
        .replace(" ", "_")  
        .replace("/", "_")  
        .replace("-", "_")  
        .replace("__", "_")  
        .replace("'", "")  
        .lower()  
    )  
    return col_name  
  
bank_columns = list(bankruptcy_df.columns)  
bank_columns = [clean_col_names(col_name) for col_name in bank_columns]  
bankruptcy_df.columns = bank_columns  
display(bankruptcy_df.columns)
Index(['bankrupt_', 'roa_c_before_interest_and_depreciation_before_interest',  
       'roa_a_before_interest_and_%_after_tax',  
       'roa_b_before_interest_and_depreciation_after_tax',  
       'operating_gross_margin', 'realized_sales_gross_margin',  
       'operating_profit_rate', 'pre_tax_net_interest_rate',  
       'after_tax_net_interest_rate',  
       'non_industry_income_and_expenditure_revenue',  
       'continuous_interest_rate_after_tax_', 'operating_expense_rate',  
       'research_and_development_expense_rate', 'cash_flow_rate',  
       'interest_bearing_debt_interest_rate', 'tax_rate_a_',  
       'net_value_per_share_b_', 'net_value_per_share_a_',  
       'net_value_per_share_c_', 'persistent_eps_in_the_last_four_seasons',  
       'cash_flow_per_share', 'revenue_per_share_yuan_¥_',  
       'operating_profit_per_share_yuan_¥_',  
       'per_share_net_profit_before_tax_yuan_¥_',  
       'realized_sales_gross_profit_growth_rate',  
       'operating_profit_growth_rate', 'after_tax_net_profit_growth_rate',  
       'regular_net_profit_growth_rate', 'continuous_net_profit_growth_rate',  
       'total_asset_growth_rate', 'net_value_growth_rate',  
       'total_asset_return_growth_rate_ratio', 'cash_reinvestment_%',  
       'current_ratio', 'quick_ratio', 'interest_expense_ratio',  
       'total_debt_total_net_worth', 'debt_ratio_%', 'net_worth_assets',  
       'long_term_fund_suitability_ratio_a_', 'borrowing_dependency',  
       'contingent_liabilities_net_worth', 'operating_profit_paid_in_capital',  
       'net_profit_before_tax_paid_in_capital',  
       'inventory_and_accounts_receivable_net_value', 'total_asset_turnover',  
       'accounts_receivable_turnover', 'average_collection_days',  
       'inventory_turnover_rate_times_', 'fixed_assets_turnover_frequency',  
       'net_worth_turnover_rate_times_', 'revenue_per_person',  
       'operating_profit_per_person', 'allocation_rate_per_person',  
       'working_capital_to_total_assets', 'quick_assets_total_assets',  
       'current_assets_total_assets', 'cash_total_assets',  
       'quick_assets_current_liability', 'cash_current_liability',  
       'current_liability_to_assets', 'operating_funds_to_liability',  
       'inventory_working_capital', 'inventory_current_liability',  
       'current_liabilities_liability', 'working_capital_equity',  
       'current_liabilities_equity', 'long_term_liability_to_current_assets',  
       'retained_earnings_to_total_assets', 'total_income_total_expense',  
       'total_expense_assets', 'current_asset_turnover_rate',  
       'quick_asset_turnover_rate', 'working_capitcal_turnover_rate',  
       'cash_turnover_rate', 'cash_flow_to_sales', 'fixed_assets_to_assets',  
       'current_liability_to_liability', 'current_liability_to_equity',  
       'equity_to_long_term_liability', 'cash_flow_to_total_assets',  
       'cash_flow_to_liability', 'cfo_to_assets', 'cash_flow_to_equity',  
       'current_liability_to_current_assets', 'liability_assets_flag',  
       'net_income_to_total_assets', 'total_assets_to_gnp_price',  
       'no_credit_interval', 'gross_profit_to_sales',  
       'net_income_to_stockholders_equity', 'liability_to_equity',  
       'degree_of_financial_leverage_dfl_',  
       'interest_coverage_ratio_interest_expense_to_ebit_', 'net_income_flag',  
       'equity_to_liability'],  
      dtype='object')
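The cleaned names above still contain characters such as % and ¥, and some end with an underscore. If stricter names are wanted, a regular expression is one option. This is only a sketch: clean_col_name_regex is a hypothetical helper, not part of the original notebook, and it produces different names from the ones used in the rest of this article (for example bankrupt instead of bankrupt_), so it cannot be swapped in without renaming the later references.

```python
import re

def clean_col_name_regex(col_name: str) -> str:
    """Hypothetical alternative cleaner: keep only lowercase letters, digits and underscores."""
    name = col_name.strip().lower()
    name = name.replace("'", "")               # "stockholder's" -> "stockholders"
    name = re.sub(r"[^a-z0-9]+", "_", name)    # any run of other characters -> a single "_"
    return name.strip("_")                     # drop leading and trailing underscores

raw_names = [
    " ROA(A) before interest and % after tax",
    " Revenue Per Share (Yuan ¥)",
    " Net Income to Stockholder's Equity",
    "Bankrupt?",
]
cleaned = [clean_col_name_regex(name) for name in raw_names]
print(cleaned)
```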
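Counting and plotting the target variable

The goal of this step is to check whether the target variable is balanced. If it is not, the imbalance needs targeted handling.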

class_bar=sns.countplot(data=bankruptcy_df,x="bankrupt_")  
ax = plt.gca()  
for p in ax.patches:  
        ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+500))  
class_bar  
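The bar chart gives a visual impression of the balance; the same check can also be done numerically. The sketch below uses a made-up target column (3 positives in 100 rows, purely illustrative) rather than the real data.

```python
import pandas as pd

# Hypothetical toy target: 3 bankrupt companies out of 100.
toy_target = pd.Series([1] * 3 + [0] * 97, name="bankrupt_")

counts = toy_target.value_counts()                  # absolute count per class
shares = toy_target.value_counts(normalize=True)    # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()       # majority size divided by minority size

print(counts.to_dict())
print(shares.to_dict())
print(round(imbalance_ratio, 1))
```

Applied to the real frame, the same three lines on bankruptcy_df["bankrupt_"] would report how far the classes are from a 50/50 split.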

3.2 Feature distributions

Checking for skew
# Return true/false if skewed  
import scipy.stats  
skew_df = pd.DataFrame(bankruptcy_df.select_dtypes(np.number).columns, columns = ['Feature'])  
  
skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(bankruptcy_df[feature]))   
  
skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)    
# Magnitude of the skew, regardless of its direction  
skew_df['Skewed'] = skew_df['Absolute Skew'] >= 0.5  
with pd.option_context("display.max_rows", 1000):  
    display(skew_df)
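For readers who want to see what the skew number measures, here is a small sketch of the moment-based definition. As far as I know, scipy.stats.skew with its default settings returns the plain moment ratio (skew_biased below), while pandas' Series.skew(), which is used later in section 5.2, returns a sample-size-adjusted version (skew_adjusted below), so the two can differ slightly on the same column. The function names and the sample are made up for illustration.

```python
import numpy as np

def skew_biased(x):
    """Moment ratio g1 = m3 / m2**1.5, using population (divide-by-n) moments."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    m2 = np.mean(d ** 2)
    m3 = np.mean(d ** 3)
    return m3 / m2 ** 1.5

def skew_adjusted(x):
    """Sample-size-adjusted skew G1 = g1 * sqrt(n * (n - 1)) / (n - 2)."""
    n = len(x)
    return skew_biased(x) * np.sqrt(n * (n - 1)) / (n - 2)

sample = [1.0, 2.0, 3.0, 4.0, 10.0]      # one large value on the right: positive skew
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]    # symmetric around 3: zero skew

print(skew_biased(sample), skew_adjusted(sample))
print(skew_biased(symmetric))
```

A positive value means a longer right tail and a negative value a longer left tail, which is why the article compares the absolute value with 0.5.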

Visualizing the distributions
cols = list(bankruptcy_df.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_df[cols[i]], ax = ax[i // ncols, i % ncols])  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()  


Listing the skewed features
query_skew=skew_df.query("Skewed == True")["Feature"]  
with pd.option_context("display.max_rows", 1000):  
    display(query_skew)
0                                             bankrupt_  
2                 roa_a_before_interest_and_%_after_tax  
3      roa_b_before_interest_and_depreciation_after_tax  
4                                operating_gross_margin  
5                           realized_sales_gross_margin  
6                                 operating_profit_rate  
7                             pre_tax_net_interest_rate  
8                           after_tax_net_interest_rate  
9           non_industry_income_and_expenditure_revenue  
10                  continuous_interest_rate_after_tax_  
11                               operating_expense_rate  
12                research_and_development_expense_rate  
13                                       cash_flow_rate  
14                  interest_bearing_debt_interest_rate  
15                                          tax_rate_a_  
16                               net_value_per_share_b_  
17                               net_value_per_share_a_  
18                               net_value_per_share_c_  
19              persistent_eps_in_the_last_four_seasons  
20                                  cash_flow_per_share  
21                            revenue_per_share_yuan_¥_  
22                   operating_profit_per_share_yuan_¥_  
23              per_share_net_profit_before_tax_yuan_¥_  
24              realized_sales_gross_profit_growth_rate  
25                         operating_profit_growth_rate  
26                     after_tax_net_profit_growth_rate  
27                       regular_net_profit_growth_rate  
28                    continuous_net_profit_growth_rate  
29                              total_asset_growth_rate  
30                                net_value_growth_rate  
31                 total_asset_return_growth_rate_ratio  
32                                  cash_reinvestment_%  
33                                        current_ratio  
34                                          quick_ratio  
35                               interest_expense_ratio  
36                           total_debt_total_net_worth  
37                                         debt_ratio_%  
38                                     net_worth_assets  
39                  long_term_fund_suitability_ratio_a_  
40                                 borrowing_dependency  
41                     contingent_liabilities_net_worth  
42                     operating_profit_paid_in_capital  
43                net_profit_before_tax_paid_in_capital  
44          inventory_and_accounts_receivable_net_value  
45                                 total_asset_turnover  
46                         accounts_receivable_turnover  
47                              average_collection_days  
48                       inventory_turnover_rate_times_  
49                      fixed_assets_turnover_frequency  
50                       net_worth_turnover_rate_times_  
51                                   revenue_per_person  
52                          operating_profit_per_person  
53                           allocation_rate_per_person  
57                                    cash_total_assets  
58                       quick_assets_current_liability  
59                               cash_current_liability  
60                          current_liability_to_assets  
61                         operating_funds_to_liability  
62                            inventory_working_capital  
63                          inventory_current_liability  
64                        current_liabilities_liability  
65                               working_capital_equity  
66                           current_liabilities_equity  
67                long_term_liability_to_current_assets  
68                    retained_earnings_to_total_assets  
69                           total_income_total_expense  
70                                 total_expense_assets  
71                          current_asset_turnover_rate  
72                            quick_asset_turnover_rate  
73                       working_capitcal_turnover_rate  
74                                   cash_turnover_rate  
75                                   cash_flow_to_sales  
76                               fixed_assets_to_assets  
77                       current_liability_to_liability  
78                          current_liability_to_equity  
79                        equity_to_long_term_liability  
81                               cash_flow_to_liability  
83                                  cash_flow_to_equity  
84                  current_liability_to_current_assets  
85                                liability_assets_flag  
86                           net_income_to_total_assets  
87                            total_assets_to_gnp_price  
88                                   no_credit_interval  
89                                gross_profit_to_sales  
90                    net_income_to_stockholders_equity  
91                                  liability_to_equity  
92                    degree_of_financial_leverage_dfl_  
93    interest_coverage_ratio_interest_expense_to_ebit_  
95                                  equity_to_liability  
Name: Feature, dtype: object

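Next, down-sample until bankrupt and non-bankrupt companies are split 50/50 in the sample. After that, check the skew again to decide whether a log transform is needed, and analyze the correlation matrix.

3.3 Down-sampling

First, down-sample the dataset to a target ratio of bankrupt vs non-bankrupt = 50 vs 50.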

bankruptcy_df2 = bankruptcy_df.sample(frac=1) #Shuffle Bankruptcy df  
  
bankruptcy_df_b = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 1]  
bankruptcy_df_nb = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 0][:220]  
  
bankruptcy_subdf_comb = pd.concat([bankruptcy_df_b,bankruptcy_df_nb])  
bankruptcy_subdf = bankruptcy_subdf_comb.sample(frac=1,random_state=42)  
  
bankruptcy_subdf  

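Plot the class counts again to check the positive and negative samples.

sns.countplot(x=bankruptcy_subdf["bankrupt_"])  

The sample now contains 220 randomly selected non-bankrupt companies and the 220 bankrupt companies.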
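The first shuffle in the code above has no random_state, so a different set of non-bankrupt companies is drawn on every run. If reproducibility matters, one possible variant is sketched below. It is an assumption-laden alternative rather than the article's method: downsample_to_balance is a hypothetical helper and the toy frame is made up.

```python
import pandas as pd

def downsample_to_balance(df, target_col, random_state=42):
    """Randomly keep as many rows of every class as the rarest class has."""
    n_min = int(df[target_col].value_counts().min())
    balanced = df.groupby(target_col).sample(n=n_min, random_state=random_state)
    # Shuffle so the classes are interleaved, then renumber the rows.
    return balanced.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Hypothetical toy data: 5 bankrupt rows and 95 non-bankrupt rows.
toy = pd.DataFrame({"bankrupt_": [1] * 5 + [0] * 95, "feature": range(100)})

balanced = downsample_to_balance(toy, "bankrupt_")
balanced_again = downsample_to_balance(toy, "bankrupt_")
print(balanced["bankrupt_"].value_counts().to_dict())
```

Because the seed is fixed, calling the helper twice returns the same rows, which makes later results easier to compare between runs.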

4. Feature engineering

bankruptcy_subdf2 = bankruptcy_subdf.drop(["net_income_flag"],axis=1)  
bankruptcy_subdf2.shape  

(440, 95)  
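The article does not state why net_income_flag is dropped. One common reason for removing a column before modeling is that it holds the same value in every row and therefore carries no information; whether that is the case here is an assumption that can be checked with nunique(). The sketch below shows the check on a made-up frame.

```python
import pandas as pd

# Hypothetical toy data: the flag column has the same value in every row.
toy = pd.DataFrame({
    "bankrupt_": [0, 1, 0, 1],
    "net_income_flag": [1, 1, 1, 1],
    "debt_ratio": [0.1, 0.9, 0.2, 0.8],
})

n_unique = toy.nunique()                                # distinct values per column
constant_cols = list(n_unique[n_unique <= 1].index)     # columns with a single value
toy_reduced = toy.drop(columns=constant_cols)

print(constant_cols)
print(toy_reduced.shape)
```

A constant column also produces NaN in a correlation matrix, so removing it keeps the heatmap in the next section clean.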

4.1 Correlation matrix

fig = plt.figure(figsize=(30,20))  
ax1 = fig.add_subplot(1,1,1)  
sns.heatmap(bankruptcy_subdf2.corr(),ax=ax1,cmap="coolwarm")  
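
4.1.1 Finding the features most correlated with bankruptcy

From a basic understanding of failing companies, bankrupt firms tend to have few assets, high debt, low profitability and little cash flow. We can analyze the dataset along these lines.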

# Note: columns[:-1] leaves the last column (equity_to_liability) out of the correlation  
corr=bankruptcy_subdf2[bankruptcy_subdf2.columns[:-1]].corr()['bankrupt_'][:]  
  
corr_df = pd.DataFrame(corr)  
  
print("Correlations to Bankruptcy:")  
# Series.items() is used because iteritems() was removed in pandas 2.0  
for index, row in corr_df["bankrupt_"].items():  
    if row!=1.0 and row>=0.5:  
        print(f'Positive Correlation: {index}')  
    elif row!=1.0 and row<=-0.5:  
        print(f'Negative Correlation: {index}')  
Correlations to Bankruptcy:  
Negative Correlation: roa_c_before_interest_and_depreciation_before_interest  
Negative Correlation: roa_b_before_interest_and_depreciation_after_tax  
Negative Correlation: net_value_per_share_b_  
Negative Correlation: net_value_per_share_a_  
Negative Correlation: net_value_per_share_c_  
Negative Correlation: persistent_eps_in_the_last_four_seasons  
Negative Correlation: per_share_net_profit_before_tax_yuan_¥_  
Positive Correlation: debt_ratio_%  
Negative Correlation: net_worth_assets  
Negative Correlation: net_profit_before_tax_paid_in_capital  
Negative Correlation: total_income_total_expense  

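What these features mean

  • roa_c_before_interest_and_depreciation_before_interest (ROA(C) before interest and depreciation before interest): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.

  • roa_a_before_interest_and_after_tax (ROA(A) before interest and % after tax): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.

  • roa_b_before_interest_and_depreciation_after_tax (ROA(B) before interest and depreciation after tax): return on total assets. The lower the return on total assets, the higher the bankruptcy risk.

  • debt_ratio (debt ratio): liabilities as a share of total assets. The higher the value, the larger the liabilities relative to assets and the higher the bankruptcy risk.

  • net_worth_assets (net worth to assets): the lower the net worth, the higher the bankruptcy risk.

  • retained_earnings_to_total_assets (retained earnings to total assets): the lower the retained earnings, the higher the bankruptcy risk.

  • total_income_total_expense (total income to total expense): the lower the ratio of income to expense, the higher the bankruptcy risk.

  • net_income_to_total_assets (net income to total assets): the lower the net income, the higher the bankruptcy risk.

Judging from the results, the features associated with a higher risk of default appear consistent with this background knowledge.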
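The loop used to print the correlations can also be written as a filter and a sort, which makes it easier to see the direction of each relationship at a glance. This is a sketch on made-up data; strong_correlations is a hypothetical helper and the 0.5 threshold simply mirrors the one used above.

```python
import pandas as pd

def strong_correlations(df, target_col, threshold=0.5):
    """Correlation of every feature with the target, kept when |r| >= threshold."""
    corr = df.corr()[target_col].drop(target_col)
    strong = corr[corr.abs() >= threshold]
    return strong.sort_values()          # most negative first, most positive last

# Hypothetical toy data: debt rises and profitability falls for bankrupt rows.
toy = pd.DataFrame({
    "bankrupt_":  [0, 0, 0, 0, 1, 1, 1, 1],
    "debt_ratio": [0.10, 0.20, 0.15, 0.25, 0.80, 0.90, 0.85, 0.70],
    "roa":        [0.60, 0.55, 0.65, 0.50, 0.10, 0.20, 0.15, 0.05],
    "noise":      [1, 2, 1, 2, 1, 2, 1, 2],
})

strong = strong_correlations(toy, "bankrupt_")
print(strong)
```

Because the target is binary, these Pearson coefficients measure how far apart the two class means are relative to the spread of each feature; they say nothing about causation.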

4.2 Visualizing feature distributions after down-sampling

# Visualisation of distributions after sub-sampling  
cols = list(bankruptcy_subdf2.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols])  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()  


4.3 Box plots of all features

plt.figure(figsize=(30,20))  
boxplot=sns.boxplot(data=bankruptcy_subdf2,orient="h")  
boxplot.set(xscale="log")  
plt.show()  


4.4 Handling outliers

quartile1 = bankruptcy_subdf2.quantile(q=0.25,axis=0)  
# display(quartile1)  
quartile3 = bankruptcy_subdf2.quantile(q=0.75,axis=0)  
# display(quartile3)  
IQR = quartile3 -quartile1  
lower_limit = quartile1-1.5*IQR  
upper_limit = quartile3+1.5*IQR  
  
lower_limit = lower_limit.drop(["bankrupt_"])  
upper_limit = upper_limit.drop(["bankrupt_"])  
# print(lower_limit)  
# print(" ")  
# print(upper_limit)  
  
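# Note: as written, this mask keeps the rows that contain at least one value outside  
# the IQR fences (423 of 440 here). Prefixing the mask with ~ would drop those rows  
# instead, but would leave only 17 rows.  
bankruptcy_subdf2_out = bankruptcy_subdf2[((bankruptcy_subdf2<lower_limit) | (bankruptcy_subdf2>upper_limit)).any(axis=1)]  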
display(bankruptcy_subdf2_out.shape)  
display(bankruptcy_subdf2.shape)  
(423, 95)  
  
(440, 95)  
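With this many features, almost every row has at least one value outside the fences, so filtering whole rows is a blunt tool. A gentler option, shown here only as a sketch and not as the article's method, is to clip each column to its own IQR fences so that no rows are lost. clip_to_iqr_fences is a hypothetical helper and the toy data is made up.

```python
import pandas as pd

def clip_to_iqr_fences(df, k=1.5):
    """Clip every column to [Q1 - k*IQR, Q3 + k*IQR], computed per column."""
    q1 = df.quantile(0.25)
    q3 = df.quantile(0.75)
    iqr = q3 - q1
    # axis=1 aligns the per-column limits with the frame's columns.
    return df.clip(lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1)

# Hypothetical toy data: "x" has one extreme value, "y" has none.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 100.0],
    "y": [10.0, 11.0, 12.0, 13.0, 14.0],
})

clipped = clip_to_iqr_fences(toy)
print(clipped)
```

When applying this to the real frame, the target column should be left out of the clipping, in the same way the article drops bankrupt_ from the limits.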

Make an extra copy of the table for the analysis and processing that follow.

bankruptcy_subdf3 = bankruptcy_subdf2_out.copy()  
bankruptcy_subdf3  

Visualize the distributions after down-sampling and the outlier step.

# Visualisation of distributions after sub-sampling after outlier removal  
cols = list(bankruptcy_subdf3.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf3[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()  


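5 Data preprocessing

5.1 Feature encoding

All categorical values in the base data are already encoded, so no columns need to be encoded again here. In real-world work this step is almost always required, and encoding techniques are especially important and worth mastering.

5.2 Log transform

This step reduces the skew in the data.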

# Log transform to remove skews  
target = bankruptcy_subdf3['bankrupt_']  
bankruptcy_subdf4 = bankruptcy_subdf3.drop(["bankrupt_"],axis=1)  
  
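def log_trans(data):  
    # Note: columns are transformed in place, so the frame passed in is modified too  
    for col in data:  
        skew = data[col].skew()  
        # Transform only columns whose skew magnitude is at least 0.5  
        if skew >= 0.5 or skew <= -0.5:  
            data[col] = np.log1p(data[col])  
    return data  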
  
bankruptcy_subdf4_log = log_trans(bankruptcy_subdf4)  
bankruptcy_subdf4_log.head()  
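To see why a log transform helps with right-skewed features, the sketch below compares the skew of a synthetic sample before and after np.log1p. The data is generated, not taken from the bankruptcy set, and skewness is a small hypothetical helper. Keep in mind that np.log1p is only defined for inputs greater than -1, so columns with values at or below -1 would need a different treatment.

```python
import numpy as np

def skewness(x):
    """Moment-based skew: third central moment over the 1.5th power of the variance."""
    d = x - x.mean()
    return np.mean(d ** 3) / np.mean(d ** 2) ** 1.5

# Synthetic right-skewed sample (log-normal), with a fixed seed for repeatability.
rng = np.random.default_rng(0)
right_skewed = rng.lognormal(mean=0.0, sigma=1.0, size=2000)

skew_before = skewness(right_skewed)
skew_after = skewness(np.log1p(right_skewed))   # log1p(x) = log(1 + x)

print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transform compresses large values much more than small ones, which pulls in the long right tail. It does not help left-skewed columns in the same way.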

5.2.1 Box plots of the log-transformed data
plt.figure(figsize=(30,20))  
boxplot=sns.boxplot(data=bankruptcy_subdf4_log,orient="h")  
boxplot.set(xscale="log")  
plt.show()  


5.2.2 Visualizing the distributions after the log transform
# Distributions after down-sampling, the outlier step and the log transform  
compare_subdf2 = bankruptcy_subdf2.drop(["bankrupt_"],axis=1)  
  
cols = list(bankruptcy_subdf4.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf4_log[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()  
print("Red represents distributions after log transforms, green represents before log transform")  

Red shows the distributions after the log transform, green the distributions before it.

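6 Building models with Pycaret

The models are built with the automated machine learning framework pycaret. If you have not installed it yet, the following command will install it.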

pip install -U --ignore-installed --pre pycaret  

pycaret takes care of splitting the data into training and test sets automatically.

from pycaret.classification import *  
exp_name = setup(data = bankruptcy_subdf4,  target = bankruptcy_subdf3["bankrupt_"])  

compare_models()  
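To make the automatic split less of a black box, here is a plain pandas sketch of a stratified train/test split, where each class contributes the same fraction of rows to the training set. It does not reproduce pycaret's internal logic; stratified_split is a hypothetical helper, and the 70/30 ratio and the toy data are assumptions for illustration.

```python
import pandas as pd

def stratified_split(df, target_col, train_size=0.7, random_state=42):
    """Sample the same fraction of every class for training; the rest is the test set."""
    train = df.groupby(target_col).sample(frac=train_size, random_state=random_state)
    test = df.drop(train.index)
    return train, test

# Hypothetical balanced toy data: 10 rows per class.
toy = pd.DataFrame({"bankrupt_": [0] * 10 + [1] * 10, "feature": range(20)})

train, test = stratified_split(toy, "bankrupt_")
print(len(train), len(test))
print(train["bankrupt_"].value_counts().to_dict())
```

Stratifying matters most on small samples like the 440-row subset used here, where a purely random split could leave the test set noticeably unbalanced.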

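Pycaret shows that the three most accurate models are

  • LightGBM classifier

  • Gradient boosting classifier (GBC)

  • XGBoost classifier

Next, these three models go through hyperparameter tuning.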

6.1 Cross-validating the selected models

LightGBM
print("LGBM Model")  
lgb_clf = create_model("lightgbm")  
lgb_clf_scoregrid = pull()  
LGBM Model  

GBC
print("GBC Model")  
gbc_clf = create_model("gbc")  
gbc_clf_scoregrid = pull()  
GBC Model  

XGBoost
print("XGB Model")  
xgb_clf = create_model("xgboost")  
xgb_clf_scoregrid = pull()  
XGB Model  
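The score grids returned by pull() end with a Mean and a Std row that summarize the per-fold scores. The sketch below shows the idea behind stratified k-fold cross-validation and how such summary rows can be formed. It is not pycaret's implementation: stratified_kfold_indices is a hypothetical helper, the fold scores are made-up numbers, and the population standard deviation is used as an assumption.

```python
import numpy as np

def stratified_kfold_indices(y, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep the class mix equal across folds."""
    y = np.asarray(y)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for cls in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == cls))
        fold_of[idx] = np.arange(len(idx)) % n_splits   # deal rows out to folds in turn
    for k in range(n_splits):
        yield np.flatnonzero(fold_of != k), np.flatnonzero(fold_of == k)

# Hypothetical balanced labels: 10 rows per class.
y = np.array([0] * 10 + [1] * 10)
folds = list(stratified_kfold_indices(y, n_splits=5))

# Hypothetical per-fold accuracies, only to show how the summary rows arise.
fold_scores = np.array([0.80, 0.85, 0.90, 0.85, 0.80])
mean_score = fold_scores.mean()
std_score = fold_scores.std()

print(f"Mean {mean_score:.4f}  Std {std_score:.4f}")
```

A large Std relative to the differences between models is a sign that the ranking of the models may not be stable, which is worth keeping in mind when reading the tables in the next section.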

7 Hyperparameter tuning with Pycaret

7.1 Model tuning

LightGBM
print("Before Tuning")  
print(lgb_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
lgb_clf = tune_model(lgb_clf,choose_better=True)  
print(lgb_clf)  
Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929  
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048  

GBC
print("Before Tuning")  
print(gbc_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
gbc_clf = tune_model(gbc_clf,choose_better=True)  
print(gbc_clf)  
Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                         
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691  
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198  

XGBoost
print("Before Tuning")  
print(xgb_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
xgb_clf = tune_model(xgb_clf,choose_better = True)  
print(xgb_clf)  
Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852  
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187  
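The calls above pass choose_better=True, which means the tuned model replaces the original only when it actually scores better. The sketch below illustrates that idea together with a simple random search over a parameter grid. Everything in it is hypothetical: toy_cv_score stands in for a real cross-validated metric, the grid values are made up, and this is not how pycaret implements tune_model internally.

```python
import random

def toy_cv_score(params):
    """Hypothetical stand-in for a cross-validated score; best at max_depth=6, learning_rate=0.1."""
    return 1.0 - 0.01 * abs(params["max_depth"] - 6) - abs(params["learning_rate"] - 0.1)

def random_search(score_fn, grid, n_iter=10, seed=0):
    """Try n_iter random combinations from the grid and return the best one found."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        candidate = {name: rng.choice(values) for name, values in grid.items()}
        score = score_fn(candidate)
        if score > best_score:
            best_params, best_score = candidate, score
    return best_params, best_score

grid = {"max_depth": [2, 4, 6, 8, 10], "learning_rate": [0.01, 0.05, 0.1, 0.3]}

baseline_params = {"max_depth": 6, "learning_rate": 0.1}
baseline_score = toy_cv_score(baseline_params)
tuned_params, tuned_score = random_search(toy_cv_score, grid)

# Same idea as choose_better=True: keep the tuned candidate only if it beats the baseline.
final_params = tuned_params if tuned_score > baseline_score else baseline_params
print(final_params)
```

In this toy setup the baseline already sits at the optimum, so the search cannot improve on it and the baseline is kept. That is a useful reminder that tuning does not always change the model.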

7.2 Model ensembling

  1. Bagging & Boosting

  2. Blending

  3. Stacking
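Of the three approaches, blending is the simplest to show by hand: several models vote on each prediction. The sketch below contrasts soft voting (average the predicted probabilities) with hard voting (majority of the predicted labels). The probabilities are made-up numbers, and the code illustrates the general technique rather than pycaret's internals.

```python
import numpy as np

# Hypothetical predicted bankruptcy probabilities from three models for four companies.
proba_a = np.array([0.90, 0.40, 0.20, 0.95])
proba_b = np.array([0.80, 0.60, 0.10, 0.40])
proba_c = np.array([0.70, 0.70, 0.30, 0.40])

stacked = np.vstack([proba_a, proba_b, proba_c])    # shape: (models, companies)

# Soft voting: average the probabilities, then apply the 0.5 threshold.
soft_proba = stacked.mean(axis=0)
soft_vote = (soft_proba >= 0.5).astype(int)

# Hard voting: each model casts a label vote, and at least 2 of 3 votes wins.
hard_vote = ((stacked >= 0.5).sum(axis=0) >= 2).astype(int)

print(soft_vote.tolist())
print(hard_vote.tolist())
```

The two schemes disagree on the last company: one very confident model outweighs two mildly negative ones under soft voting, but is outvoted under hard voting. Stacking goes one step further and trains a second model to learn how to combine the base predictions.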

LightGBM
# Original  
print(lgb_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original against bagged and boosted  
  
# Bagged  
lgb_clf = ensemble_model(lgb_clf,fold =5,choose_better = True)  
# Boosted  
lgb_clf = ensemble_model(lgb_clf,method="Boosting",choose_better = True)  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929  
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048

GBC
# Original  
print(gbc_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original against bagged and boosted  
  
# Bagged  
gbc_clf = ensemble_model(gbc_clf,fold =5,choose_better = True)  
# Boosted  
gbc_clf = ensemble_model(gbc_clf,method="Boosting",choose_better = True)  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691  
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198

XGBoost
# Original  
print(xgb_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original and boosted against bagged and boosted  
  
# Bagged  
xgb_clf = ensemble_model(xgb_clf,fold =5,choose_better = True)  
# Boosted  
xgb_clf = ensemble_model(xgb_clf,method="Boosting",choose_better = True)  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852  
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187

7.2.1 Blend Models
blend_models([lgb_clf, gbc_clf, xgb_clf],choose_better=True)  

7.2.2 Stacking
# stack_models expects the base estimators as a list; xgb is left out because of some issues  
stacker = stack_models([lgb_clf, gbc_clf])  

print(stacker)  

8 Model evaluation

# evaluate_model(lgb_clf)  
# evaluate_model(gbc_clf)  
# evaluate_model(xgb_clf)  

8.1 ROC-AUC

plot_model(stacker, plot = 'auc')     
# Stacked classifier from ensembling  
plot_model(lgb_clf, plot = 'auc')     
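# For lgb, the bagged ensemble performed best and was selected  
plot_model(gbc_clf, plot = 'auc')     
# For gbc, the boosted ensemble performed best and was selected  
plot_model(xgb_clf, plot = 'auc')     
# For xgb, the base classifier still performed best after tuning and ensembling, so it was selected  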
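The AUC shown in these plots has a simple interpretation: it is the probability that a randomly chosen bankrupt company receives a higher score than a randomly chosen non-bankrupt one. The sketch below computes it directly from that definition. roc_auc is a hypothetical helper written for illustration, and the labels and scores are made up.

```python
def roc_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical labels and predicted scores.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(y_true, y_score)
print(auc)
```

The pairwise loop is quadratic in the number of samples, which is fine for a 440-row subset but too slow for large data, where a rank-based formula is normally used instead.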

8.2 Confusion matrix

plot_model(stacker,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(lgb_clf,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(gbc_clf,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(xgb_clf,  
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
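The metrics in the earlier score grids (Accuracy, Recall, Prec., F1, Kappa, MCC) can all be derived from the four cells of a confusion matrix. The sketch below shows the formulas on made-up counts. binary_metrics is a hypothetical helper, and it assumes that none of the denominators is zero.

```python
import math

def binary_metrics(tp, fp, fn, tn):
    """Common binary classification metrics from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn)             # share of actual positives that were found
    precision = tp / (tp + fp)          # share of predicted positives that were right
    f1 = 2 * precision * recall / (precision + recall)
    # Cohen's kappa: agreement beyond what chance alone would produce.
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
    kappa = (accuracy - p_chance) / (1 - p_chance)
    # Matthews correlation coefficient.
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return {"accuracy": accuracy, "recall": recall, "precision": precision,
            "f1": f1, "kappa": kappa, "mcc": mcc}

# Hypothetical counts: 45 bankrupt and 55 non-bankrupt companies in a test set.
metrics = binary_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For bankruptcy prediction, recall deserves particular attention, because a missed bankruptcy (a false negative) is usually more costly than a false alarm.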

8.3 Learning curves

plot_model(stacker, plot = 'learning')  

plot_model(lgb_clf, plot = 'learning')  

That's all!
