Model Evaluation and Optimization
2026-05-20
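Training a model is only the beginning. Measuring model quality rigorously and improving performance systematically are the core skills of engineering practice.
1. Classification metrics
1.1 Confusion matrix
The basis of every classification metric:
| | Predicted positive | Predicted negative |
|---|---|---|
| Actually positive | TP (true positive) | FN (false negative, a miss) |
| Actually negative | FP (false positive, a false alarm) | TN (true negative) |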
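1.2 Core metrics
Accuracy:
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$
Intuitive but dangerous: when classes are severely imbalanced (say, 1% positives), predicting every sample as negative still reaches 99% accuracy.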
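The accuracy trap can be checked with a few lines of arithmetic. The sketch below uses made-up labels (990 negatives, 10 positives) and a "model" that always predicts the negative class; the data is purely illustrative.

```python
# Hypothetical, made-up labels: 990 negatives and 10 positives (1% positive rate).
y_true = [0] * 990 + [1] * 10
# A degenerate "model" that always predicts the negative class.
y_pred = [0] * len(y_true)

# Accuracy: share of samples where prediction and label agree.
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall: share of actual positives that were predicted positive.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy = {accuracy:.2f}")
print(f"recall   = {recall:.2f}")
```

Accuracy comes out at 0.99 while recall is 0: the model never finds a single positive case.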
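Precision: of everything predicted positive, the fraction that is actually positive
$$\text{Precision} = \frac{TP}{TP + FP}$$
Focuses on false alarms: can the positives I predict be trusted?
Recall: of everything actually positive, the fraction that I found
$$\text{Recall} = \frac{TP}{TP + FN}$$
Focuses on misses: how many of the real positives did I catch?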
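F1 Score: the harmonic mean of Precision and Recall (the harmonic mean is pulled toward the smaller of the two values, so F1 is high only when both are high)
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
Fβ Score: used when Recall and Precision should not be weighted equally (β > 1 puts more weight on Recall, β < 1 on Precision)
$$F_\beta = (1 + \beta^2) \cdot \frac{\text{Precision} \times \text{Recall}}{\beta^2 \times \text{Precision} + \text{Recall}}$$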
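The formulas above can be evaluated directly from the four confusion-matrix counts. This is a minimal sketch: the function name `classification_metrics`, the example counts, and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
def classification_metrics(tp, fp, fn, tn, beta=1.0):
    """Compute accuracy, precision, recall and F-beta from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    # Convention assumed here: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    b2 = beta ** 2
    denom = b2 * precision + recall
    f_beta = (1 + b2) * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f_beta

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
acc, prec, rec, f1 = classification_metrics(tp=80, fp=20, fn=40, tn=860)
_, _, _, f2 = classification_metrics(tp=80, fp=20, fn=40, tn=860, beta=2.0)

print(f"accuracy={acc:.3f} precision={prec:.3f} recall={rec:.3f}")
print(f"F1={f1:.3f} F2={f2:.3f}")
```

Because recall (0.667) is lower than precision (0.8) in this example, F2 (which weights recall more) ends up below F1.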
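1.3 The Precision vs Recall trade-off
The core question: which costs more, a miss or a false alarm?
| Scenario | Prioritize | Rationale |
|---|---|---|
| Cancer screening | Recall | Better a false alarm than a missed real case |
| Spam filtering | Precision | Better to let some spam through than to flag a legitimate email |
| Fraud detection | Depends on cost | Driven by the ratio of false-alarm loss to missed-fraud loss |
| Recommender systems | Precision | Recommendations must be relevant, not scattershot |
| Earthquake early warning | Recall | Misses must be avoided as far as possible |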
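One common lever for this trade-off is the decision threshold applied to a model's predicted scores. The sketch below uses made-up labels and scores and a hypothetical helper `precision_recall_at`; it only illustrates the direction of the effect, not a recommended threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    y_pred = [1 if s >= threshold else 0 for s in scores]
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Made-up data: 4 positives followed by 6 negatives, with their model scores.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

p_high, r_high = precision_recall_at(y_true, scores, threshold=0.5)
p_low, r_low = precision_recall_at(y_true, scores, threshold=0.3)

print(f"threshold 0.5: precision={p_high:.3f} recall={r_high:.3f}")
print(f"threshold 0.3: precision={p_low:.3f} recall={r_low:.3f}")
```

Lowering the threshold from 0.5 to 0.3 raises recall from 0.5 to 1.0 on this toy data, while precision drops from 2/3 to 4/7. A recall-first application such as screening would lean toward the lower threshold.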
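1.4 ROC curve and AUC
ROC curve: traced by sweeping the decision threshold from 0 to 1 and plotting FPR on the x-axis against TPR (= Recall) on the y-axis.
$$\text{FPR} = \frac{FP}{FP + TN}, \qquad \text{TPR} = \frac{TP}{TP + FN}$$
AUC (Area Under Curve): the area under the ROC curve.
- AUC = 1: a perfect model
- AUC = 0.5: random guessing
- AUC < 0.5: worse than random (are the predictions inverted?)
What AUC means: pick one positive and one negative sample at random; AUC is the probability that the positive sample receives the higher score.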
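The probabilistic reading of AUC can be verified by brute force: enumerate every (positive, negative) pair and count how often the positive wins. The sketch below uses a hypothetical helper `pairwise_auc` and made-up scores; counting ties as half a win is the usual convention and is assumed here.

```python
from itertools import product

def pairwise_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))

# Made-up data: 4 positives followed by 6 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
scores = [0.9, 0.7, 0.45, 0.35, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05]

auc_value = pairwise_auc(y_true, scores)
# Flipping the sign of every score reverses the ranking.
auc_inverted = pairwise_auc(y_true, [-s for s in scores])

print(f"AUC = {auc_value:.3f}")
print(f"AUC with inverted scores = {auc_inverted:.3f}")
```

Of the 24 pairs, 21 are ranked correctly, giving 0.875. Inverting the scores yields 0.125, which is why an AUC well below 0.5 usually points to flipped predictions or labels rather than a useless model.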
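from sklearn.metrics import (confusion_matrix, classification_report,
                             roc_auc_score, roc_curve,
                             precision_recall_curve, average_precision_score)
import matplotlib.pyplot as plt
# Assumes a fitted binary classifier `model`, a held-out split X_test / y_test,
# and hard predictions y_pred = model.predict(X_test)
# Full evaluation report
print(classification_report(y_test, y_pred,
                            target_names=['negative', 'positive']))
# ROC AUC is computed from the predicted probability of the positive class, not from hard labels
y_prob = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
plt.plot(fpr, tpr, label=f'ROC (AUC = {auc:.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='Random guess')
plt.xlabel('FPR (false positive rate)')
plt.ylabel('TPR (recall)')
plt.title('ROC curve')
plt.legend()
When should you use a PR curve instead of ROC?
When classes are severely imbalanced, the ROC curve can look overly optimistic: FPR = FP / (FP + TN), and with a very large number of true negatives the FPR stays low even when false positives outnumber true positives, so the curve hugs the top-left corner. The PR curve reflects performance on the minority class more faithfully.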
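A small worked example makes the gap concrete. The counts below are invented for illustration (100 positives against 10,000 negatives).

```python
# Made-up imbalanced test set: 100 positives, 10,000 negatives.
n_pos, n_neg = 100, 10_000

# Suppose the model finds 80 of the positives and raises 100 false alarms.
tp, fp = 80, 100
fn = n_pos - tp
tn = n_neg - fp

tpr = tp / (tp + fn)        # recall: 80 / 100
fpr = fp / (fp + tn)        # 100 / 10,000
precision = tp / (tp + fp)  # 80 / 180

print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}  (looks excellent on a ROC plot)")
print(f"Precision = {precision:.3f}    (more than half of the alerts are false)")
```

The ROC point (FPR 0.01, TPR 0.80) sits close to the top-left corner, yet precision is only about 0.44. The PR curve exposes this because precision does not involve TN at all.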
precision, recall, _ = precision_recall_curve(y_test, y_prob)
ap = average_precision_score(y_test, y_prob)
plt.plot(recall, precision, label=f'PR (AP = {ap:.3f})')
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.title('Precision-Recall curve')
2. Regression metrics
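| Metric | Formula | Notes |
|---|---|---|
| MAE (mean absolute error) | mean(\|y - ŷ\|) | Intuitive; less sensitive to outliers than MSE |
| MSE (mean squared error) | mean((y - ŷ)²) | Penalizes large errors heavily; commonly used as a loss function |
| RMSE (root mean squared error) | √MSE | Same units as the target, so easier to interpret |
| R² (coefficient of determination) | 1 - SS_res/SS_tot | At most 1, and closer to 1 is better; can be negative |
| MAPE (mean absolute percentage error) | mean(\|y - ŷ\| / \|y\|) × 100% | Relative error, comparable across scales; undefined when y = 0 |
Understanding R²: the proportion of the target's variance that the model explains. R² = 0.8 means the model explains 80% of the variation in the data.
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
R² can be negative (when the model does worse than always predicting the mean), which usually signals a serious problem with the model.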
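To see where the three regimes of R² come from, the definition can be evaluated by hand. The helper name `r2` and the toy data are assumptions made for this sketch.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets and three hypothetical sets of predictions.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
close_fit = [1.1, 1.9, 3.2, 3.8, 5.1]      # small errors
mean_only = [3.0] * 5                      # always predicts the mean of y_true
reversed_fit = [5.0, 4.0, 3.0, 2.0, 1.0]   # trend is the wrong way round

r2_close = r2(y_true, close_fit)
r2_mean = r2(y_true, mean_only)
r2_reversed = r2(y_true, reversed_fit)

print(f"close fit:    R² = {r2_close:.3f}")
print(f"mean only:    R² = {r2_mean:.3f}")
print(f"reversed fit: R² = {r2_reversed:.3f}")
```

The mean-only baseline scores exactly 0, and the reversed predictions score -3 because their squared error (40) is four times the total sum of squares (10).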
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np
mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")
3. Overfitting and underfitting
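3.1 Bias-variance decomposition
The expected prediction error decomposes as:
$$E\left[(y - \hat{f}(x))^2\right] = \text{Bias}^2[\hat{f}(x)] + \text{Var}[\hat{f}(x)] + \sigma^2$$
- Bias: the gap between the expected prediction and the true value; it captures the model's systematic error
- Variance: how much predictions fluctuate across different training sets; it captures sensitivity to the training data
- Irreducible error σ²: noise in the data itself, which cannot be removed
| Symptom | Cause | Remedies |
|---|---|---|
| High bias (underfitting) | Model too simple | Increase model complexity, add features, reduce regularization |
| High variance (overfitting) | Model too complex | Add data, regularize, use Dropout, reduce features |
| Both high | Noisy data or uninformative features | Feature engineering, try a different model |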
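The decomposition can be estimated empirically by refitting the same model on many simulated training sets. The sketch below is one way to do that under stated assumptions: the true function sin(2πx), the noise level, the polynomial degrees, and the helper name `bias_variance` are all illustrative choices, not part of the original text.

```python
import numpy as np

def bias_variance(degree, n_repeats=200, n_train=30, noise_sd=0.3, seed=0):
    """Estimate average squared bias and variance of a polynomial fit by simulation."""
    rng = np.random.default_rng(seed)

    def true_f(x):
        return np.sin(2 * np.pi * x)

    x_train = np.linspace(0.0, 1.0, n_train)
    x_test = np.linspace(0.0, 1.0, 50)
    preds = np.empty((n_repeats, x_test.size))
    for i in range(n_repeats):
        # Draw a fresh noisy training set from the same underlying function.
        y_train = true_f(x_train) + rng.normal(0.0, noise_sd, size=n_train)
        coefs = np.polyfit(x_train, y_train, deg=degree)
        preds[i] = np.polyval(coefs, x_test)
    # Squared bias: gap between the average prediction and the truth.
    bias_sq = np.mean((preds.mean(axis=0) - true_f(x_test)) ** 2)
    # Variance: spread of predictions across training sets.
    variance = np.mean(preds.var(axis=0))
    return bias_sq, variance

bias_lo, var_lo = bias_variance(degree=1)   # simple model
bias_hi, var_hi = bias_variance(degree=7)   # flexible model

print(f"degree 1: bias² = {bias_lo:.4f}, variance = {var_lo:.4f}")
print(f"degree 7: bias² = {bias_hi:.4f}, variance = {var_hi:.4f}")
```

The straight line cannot follow the sine wave, so its bias term dominates; the degree-7 polynomial tracks the curve but reacts more strongly to the noise in each training set, so its variance is larger.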
3.2 Diagnosing with learning curves
Plot training error and validation error as a function of training set size:
from sklearn.model_selection import learning_curve
import numpy as np
import matplotlib.pyplot as plt
def plot_learning_curve(model, X, y, cv=5):
    train_sizes, train_scores, val_scores = learning_curve(
        model, X, y,
        cv=cv,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring='neg_mean_squared_error',
        n_jobs=-1
    )
    # Scores are negated MSE, so flip the sign of the means to get errors
    train_mean = -train_scores.mean(axis=1)
    train_std = train_scores.std(axis=1)
    val_mean = -val_scores.mean(axis=1)
    val_std = val_scores.std(axis=1)
    plt.figure(figsize=(10, 6))
    plt.fill_between(train_sizes, train_mean-train_std, train_mean+train_std, alpha=0.1)
    plt.fill_between(train_sizes, val_mean-val_std, val_mean+val_std, alpha=0.1)
    plt.plot(train_sizes, train_mean, 'o-', label='Training error')
    plt.plot(train_sizes, val_mean, 'o-', label='Validation error')
    plt.xlabel('Training set size')
    plt.ylabel('Error (MSE)')
    plt.legend()
    plt.title('Learning curve')
    plt.show()
How to read the learning curve:
- Training and validation error are both high and stay high as data is added → underfitting (increase model complexity)
- Training error is low, validation error is high, and the gap is large → overfitting (add data or regularize)
- Both curves are low and converge toward each other → the ideal state
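4. Cross-validation
4.1 K-fold cross-validation
Split the data into K parts; each part takes a turn as the validation set while the rest form the training set:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Basic K-fold cross-validation (`model` is assumed to be an unfitted classifier)
cv_scores = cross_val_score(
    model, X, y,
    cv=5,  # 5 folds
    scoring='accuracy',
    n_jobs=-1
)
print(f"CV Accuracy: {cv_scores.mean():.3f} ± {cv_scores.std():.3f}")
# Stratified K-fold (recommended for classification): keeps class proportions the same in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use a Pipeline so preprocessing is refit inside each fold (prevents data leakage!)
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', model)
])
cv_scores = cross_val_score(pipeline, X, y, cv=skf)
Data leakage pitfall
During cross-validation, the feature scaler (StandardScaler) must be fit on the training folds only. Fitting it on the whole dataset and splitting afterwards lets validation-fold statistics leak into training. Wrapping the steps in a Pipeline avoids this.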
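What the Pipeline does inside each fold can be written out by hand. The following is a rough sketch with NumPy only, using a made-up feature matrix; it is meant to show the order of operations, not to replace `Pipeline` and `cross_val_score`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=10.0, scale=3.0, size=(20, 2))  # hypothetical feature matrix

k = 5
indices = rng.permutation(len(X))
folds = np.array_split(indices, k)  # k disjoint validation folds

seen_in_validation = []
for i, val_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # Fit the scaling statistics on the training folds only ...
    mean = X[train_idx].mean(axis=0)
    std = X[train_idx].std(axis=0)
    # ... then apply the same statistics to both parts.
    X_train_scaled = (X[train_idx] - mean) / std
    X_val_scaled = (X[val_idx] - mean) / std
    seen_in_validation.extend(val_idx.tolist())
    # A model would be fit on X_train_scaled and scored on X_val_scaled here.

print(f"validation rows covered: {len(set(seen_in_validation))} of {len(X)}")
```

Every row is used for validation exactly once, and the validation rows never influence the mean and standard deviation used to scale them.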
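4.2 Other cross-validation strategies
Leave-one-out (LOO-CV): each sample in turn serves as the validation set. The estimate has very low bias, but it needs one model fit per sample and can have high variance, so it is only practical for small datasets.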
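As a quick illustration of the cost, scikit-learn's `LeaveOneOut` splitter produces one split per sample. The tiny array below is made up for the example.

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut

X = np.arange(12).reshape(6, 2)  # hypothetical data: 6 samples, 2 features

loo = LeaveOneOut()
n_splits = loo.get_n_splits(X)
splits = list(loo.split(X))

for train_idx, val_idx in splits:
    print(f"train={train_idx.tolist()} validate={val_idx.tolist()}")
print(f"number of model fits required: {n_splits}")
```

With 6 samples there are 6 fits; with 100,000 samples there would be 100,000, which is why LOO is usually reserved for small datasets.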
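Time-series cross-validation: time-series data must not be split at random; every training set has to come before its validation set in time.
from sklearn.model_selection import TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    # Rows are assumed to be sorted by time, so earlier timestamps form the training set
    X_train, X_val = X[train_idx], X[val_idx]
5. Hyperparameter tuning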
5.1 Grid search
Grid search tries every combination in the parameter grid: exhaustive, but computationally expensive.
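To get a feel for the cost before launching a search, the number of candidates can be counted with scikit-learn's `ParameterGrid`. The grid below mirrors the random-forest grid used in this section, and the 5-fold setting is taken from the same example.

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt', 'log2'],
}

n_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 2 combinations
n_folds = 5
n_fits = n_candidates * n_folds                # one fit per candidate per fold

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} model fits")
```

That is 72 candidates and 360 fits, plus one final refit on the full training set when `refit=True` (the `GridSearchCV` default). Adding a fifth parameter with three values would triple the total.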
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [5, 10, 15, None],
    'min_samples_leaf': [1, 5, 10],
    'max_features': ['sqrt', 'log2']
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=StratifiedKFold(5),
    scoring='roc_auc',
    n_jobs=-1,
    verbose=2
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best AUC: {grid_search.best_score_:.4f}")
best_model = grid_search.best_estimator_
5.2 Random search
Random search samples candidates from parameter distributions. When the parameter space is large it is more efficient than grid search, and for the same compute budget it usually finds better configurations.
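The `scipy.stats` distributions used for random search have parameterizations that are easy to misread, so it can help to sample from them first. The sketch below checks the two distribution types used in this section; the sample size and seed are arbitrary.

```python
from scipy.stats import randint, uniform

# randint(low, high) draws integers from low up to high - 1.
n_estimators_dist = randint(50, 500)
# uniform(loc, scale) covers [loc, loc + scale], so this is [0.3, 1.0], not [0.3, 0.7].
max_features_dist = uniform(0.3, 0.7)

n_samples = n_estimators_dist.rvs(size=1000, random_state=0)
f_samples = max_features_dist.rvs(size=1000, random_state=0)

print(f"n_estimators range: {n_samples.min()} to {n_samples.max()}")
print(f"max_features range: {f_samples.min():.3f} to {f_samples.max():.3f}")
```

Reading `uniform(0.3, 0.7)` as "between 0.3 and 0.7" is a common slip; the second argument is a width, which for `max_features` conveniently keeps the sampled fraction within the valid range up to 1.0.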
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(3, 20),
    'min_samples_leaf': randint(1, 20),
    'max_features': uniform(0.3, 0.7)  # uniform(loc, scale): fraction of features in [0.3, 1.0]
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions,
    n_iter=100,  # try 100 randomly sampled combinations
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
5.3 Bayesian optimization (Optuna)
Build a probabilistic model of the mapping from parameters to performance and use it to choose the next parameters to try, which is often more sample-efficient than random search:
import optuna
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
def objective(trial):
    # Each trial proposes one hyperparameter configuration
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 20),
    }
    model = GradientBoostingClassifier(**params, random_state=42)
    # X_train / y_train are assumed to be defined earlier
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, n_jobs=-1)
print(f"Best parameters: {study.best_params}")
print(f"Best AUC: {study.best_value:.4f}")
6. Feature engineering
6.1 Feature selection
Filter methods: score features with statistics computed independently of any model:
from sklearn.feature_selection import (
    SelectKBest, f_classif, mutual_info_classif,
    chi2, RFE
)
from sklearn.feature_selection import VarianceThreshold
import numpy as np
import pandas as pd
# Variance filter: drop features whose variance is very low
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
# Correlation filter: for each pair with |r| > 0.95, mark one of the two features for removal
corr_matrix = pd.DataFrame(X).corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
# ANOVA F-test for classification targets (f_regression is the regression counterpart)
selector = SelectKBest(f_classif, k=20)
X_selected = selector.fit_transform(X_train, y_train)
# Mutual information (can capture nonlinear relationships)
mi_scores = mutual_info_classif(X_train, y_train)
Wrapper methods: let model performance guide feature selection:
# Recursive feature elimination: repeatedly drop the weakest feature until 20 remain
from sklearn.feature_selection import RFE
from sklearn.svm import SVC
rfe = RFE(estimator=SVC(kernel='linear'), n_features_to_select=20, step=1)
rfe.fit(X_train, y_train)
selected_features = np.where(rfe.support_)[0]
Embedded methods: feature selection happens as part of model training:
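# Tree-model feature importances (`feature_names` is assumed to list the column names of X_train)
rf = RandomForestClassifier(n_estimators=200)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=feature_names)
top_features = importances.nlargest(20).index
# L1 regularization (Lasso) yields sparse coefficients.
# LassoCV is a regressor: for a classification target, an L1-penalized logistic regression is the usual choice
from sklearn.linear_model import LassoCV
lasso = LassoCV(cv=5)
lasso.fit(X_train, y_train)
selected = np.where(lasso.coef_ != 0)[0]
6.2 Feature construction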
import pandas as pd
import numpy as np
from sklearn.model_selection import StratifiedKFold
# Polynomial features (interaction terms only)
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_poly = poly.fit_transform(X)
# Extract time features (`df['timestamp']` is assumed to have a datetime dtype)
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['month'] = df['timestamp'].dt.month
df['quarter'] = df['timestamp'].dt.quarter
# Aggregate features per ID.
# The is_fraud aggregates use the label, so compute them on training data only to avoid leakage
user_stats = df.groupby('user_id').agg({
    'amount': ['mean', 'std', 'max', 'sum', 'count'],
    'is_fraud': ['sum', 'mean']
}).reset_index()
user_stats.columns = ['user_id'] + [f'user_{c[0]}_{c[1]}' for c in user_stats.columns[1:]]
# Target encoding
def target_encode(df, col, target, n_splits=5):
    """Out-of-fold target encoding: each row is encoded with means from the other folds, which avoids leakage."""
    df[f'{col}_target_enc'] = np.nan
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=42)
    for train_idx, val_idx in skf.split(df, df[target]):
        mean_enc = df.iloc[train_idx].groupby(col)[target].mean()
        # Categories not seen in the training folds stay NaN
        df.iloc[val_idx, df.columns.get_loc(f'{col}_target_enc')] = \
            df.iloc[val_idx][col].map(mean_enc)
    return df
6.3 Handling missing values
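# Inspect missingness (percentage of missing values per column)
missing_pct = df.isnull().sum() / len(df) * 100
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
# Drop columns that are mostly missing (here: more than 50%)
df.drop(columns=missing_pct[missing_pct > 50].index, inplace=True)
# Simple imputation (assign back: inplace fillna on a column selection is deprecated in recent pandas)
df['age'] = df['age'].fillna(df['age'].median())  # numeric: median
df['city'] = df['city'].fillna('Unknown')         # categorical: the mode or an explicit 'Unknown'
# Iterative imputation (predicts each missing value from the other features)
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before importing IterativeImputer
from sklearn.impute import IterativeImputer
imputer = IterativeImputer(max_iter=10, random_state=42)
X_imputed = imputer.fit_transform(X)
6.4 Handling class imbalance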
# Oversampling (SMOTE synthesizes minority-class samples); resample the training set only
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Undersampling
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X_train, y_train)
# Adjust class weights (leaves the data unchanged; generally the preferred option)
model = RandomForestClassifier(class_weight='balanced')  # or pass explicit weights such as {0: 1, 1: 10}
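For reference, the weights implied by `class_weight='balanced'` can be reproduced by hand. The sketch below applies the heuristic given in the scikit-learn documentation, `n_samples / (n_classes * np.bincount(y))`, to a made-up label vector with a 90/10 split.

```python
import numpy as np

# Made-up labels: 90 samples of class 0 and 10 samples of class 1.
y = np.array([0] * 90 + [1] * 10)

counts = np.bincount(y)                     # samples per class
weights = len(y) / (len(counts) * counts)   # n_samples / (n_classes * count)
class_weight = {cls: float(w) for cls, w in enumerate(weights)}

print(class_weight)
```

The minority class ends up weighted 9 times more heavily than the majority class (5.0 against roughly 0.556), so each class contributes the same total weight to the loss.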