[Python Machine Learning] Algorithm Chains and Pipelines: Using Grid Search to Choose Which Model to Use

2026/3/14 22:08:38

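We can take the combination of GridSearchCV and Pipeline one step further: the search can also cover the actual steps performed in the pipeline (for example, whether to use StandardScaler or MinMaxScaler). Every additional choice multiplies the number of candidates that have to be fitted, so the search space grows quickly and this should be considered carefully.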
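As a minimal sketch of searching over a pipeline step itself, the snippet below treats the scaler as one more entry in the grid. The step names 'scaler' and 'svm', the variable names, the C values, and the choice of the breast cancer data are illustrative assumptions, not part of the original example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

# The scaler passed here is only a placeholder; the grid replaces it.
scaler_pipe = Pipeline([('scaler', StandardScaler()), ('svm', SVC())])

# Using the step name 'scaler' as a key swaps the whole step.
scaler_grid = {
    'scaler': [StandardScaler(), MinMaxScaler()],
    'svm__C': [0.1, 1, 10],
}

scaler_search = GridSearchCV(scaler_pipe, scaler_grid, cv=5)
scaler_search.fit(X, y)

# Name of the scaler class used by the best candidate.
print(type(scaler_search.best_params_['scaler']).__name__)
```

Two scalers times three values of C already gives six candidates, each fitted once per cross-validation fold.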

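Trying every possible solution is usually not a viable machine learning strategy. Still, here is an example: comparing RandomForestClassifier and SVC on the breast cancer dataset. We know that SVC may need the data to be scaled, so we also search over whether to use StandardScaler or no preprocessing at all. RandomForestClassifier, as we know, needs no preprocessing. We start by defining the pipeline, naming the steps explicitly. We need two steps: one for preprocessing, followed by a classifier. We can instantiate it with SVC and StandardScaler: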

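from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Two explicitly named steps: a preprocessing step followed by a classifier.
pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])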
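Because the steps are named, each step and each of its hyperparameters can be addressed by name. The sketch below is only an illustration of that naming scheme using get_params and set_params. The variable name demo_pipe is hypothetical.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

demo_pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])

# Step names are parameters themselves; nested parameters use '<step>__<param>'.
params = demo_pipe.get_params()
print('classifier' in params, 'classifier__C' in params)

# Setting a transformer step to None makes the pipeline skip it.
demo_pipe.set_params(preprocessing=None)
print(demo_pipe.named_steps['preprocessing'])
```

These are the same keys that a parameter grid for this pipeline can use.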

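Now we can define the param_grid to search over. We want classifier to be either RandomForestClassifier or SVC. Because the two classifiers have different parameters to tune and need different preprocessing, we use a list of search grids, one dictionary per classifier.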

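To assign an estimator to a step, we use the step name as the parameter name. To skip a step in the pipeline, we set that step to None (scikit-learn 0.21 and later also accept the string 'passthrough'):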

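from sklearn.ensemble import RandomForestClassifier

param_grid = [
    {
        # SVC: try it with and without scaling, and tune gamma and C.
        'classifier': [SVC()],
        'preprocessing': [StandardScaler(), None],
        'classifier__gamma': [0.001, 0.01, 0.1, 1, 10, 100],
        'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    },
    {
        # Random forest: no preprocessing, tune max_features only.
        'classifier': [RandomForestClassifier(n_estimators=100)],
        'preprocessing': [None],
        'classifier__max_features': [1, 2, 3],
    },
]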
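To make the size of this search space concrete, the following sketch counts the candidates with ParameterGrid from sklearn.model_selection. The grid is restated so the snippet runs on its own. The names values, grid_to_count, n_candidates, and n_folds are illustrative, and five folds are assumed.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import ParameterGrid
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

values = [0.001, 0.01, 0.1, 1, 10, 100]
grid_to_count = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__gamma': values, 'classifier__C': values},
    {'classifier': [RandomForestClassifier(n_estimators=100)],
     'preprocessing': [None], 'classifier__max_features': [1, 2, 3]},
]

# SVC grid: 2 * 6 * 6 = 72 candidates; random forest grid: 3 candidates.
n_candidates = len(ParameterGrid(grid_to_count))
n_folds = 5

print(n_candidates)                # 75
print(n_candidates * n_folds + 1)  # 376 fits, counting the final refit
```

Each dictionary in the list is expanded separately, so the totals add across dictionaries but multiply within one.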

Now we instantiate the grid search and run it on the dataset:

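from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split

cancer = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, random_state=0)

# 5-fold cross-validation over every candidate in param_grid.
grid = GridSearchCV(pipe, param_grid=param_grid, cv=5)
grid.fit(X_train, y_train)
print('Best params: {}'.format(grid.best_params_))
print('Best cross-validation score: {}'.format(grid.best_score_))
print('Test set accuracy of the best model: {}'.format(grid.score(X_test, y_test)))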
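Beyond the single best candidate, it can help to see how each classifier fared. The sketch below reads cv_results_ and keeps the best mean cross-validation score per classifier type. It deliberately uses a reduced grid (small_grid, a hypothetical name, with fewer values and a fixed random_state) so that it runs quickly, so its numbers are not those of the full search above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipe = Pipeline([('preprocessing', StandardScaler()), ('classifier', SVC())])
small_grid = [
    {'classifier': [SVC()], 'preprocessing': [StandardScaler(), None],
     'classifier__C': [0.1, 1, 10]},
    {'classifier': [RandomForestClassifier(n_estimators=50, random_state=0)],
     'preprocessing': [None], 'classifier__max_features': [1, 3]},
]
grid = GridSearchCV(pipe, small_grid, cv=5).fit(X_train, y_train)

# Best mean cross-validation score reached by each classifier type.
best_by_model = {}
for params, score in zip(grid.cv_results_['params'],
                         grid.cv_results_['mean_test_score']):
    name = type(params['classifier']).__name__
    best_by_model[name] = max(score, best_by_model.get(name, 0.0))

for name, score in sorted(best_by_model.items()):
    print('{}: {:.3f}'.format(name, score))

# The refitted winning pipeline, including whichever classifier was selected.
winner = grid.best_estimator_
print(type(winner.named_steps['classifier']).__name__)
```

The largest value in best_by_model matches grid.best_score_, and best_estimator_ is the corresponding pipeline refitted on the whole training set.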