House price prediction is a competition on the Kaggle website and a common entry-level machine learning project. Kaggle link: link.
For the general workflow of a Kaggle competition, see this blog post: link.
1. About Kaggle
Kaggle is a platform where developers and data scientists can host machine learning competitions, publish datasets, and write and share code; it has attracted the attention of over 800,000 data scientists. It is a rare hands-on platform for learning data mining and data analysis. Many competitions offer substantial prize money, and winning competitors often share their code along with their experience analyzing and mining the data.
2. House Price Prediction
Competition link: link.
3. Data Analysis
Import the required libraries:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
(1) Download the dataset from the competition page.
(2) Load the data from the download directory with pandas, print all column names and detailed statistics of the target variable, and check whether it follows a normal distribution.
df_train = pd.read_csv('./train.csv')
print(df_train.columns)                  # print all column names
print(df_train['SalePrice'].describe())  # summary statistics of the target variable
sns.distplot(df_train['SalePrice'])      # check whether it looks normally distributed
plt.show()
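Note: sns.distplot is deprecated in seaborn 0.11 and later. If you are on a newer seaborn (an assumption about your environment, not part of the original code), the equivalent call is:
sns.histplot(df_train['SalePrice'], kde=True)  # histogram with a KDE curve
plt.show()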
All column names:
Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
       'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
       'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
       'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
       'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
       'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
       'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
       'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
       'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
       'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
       'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
       'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
       'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
       'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
       'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
       'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
       'SaleCondition', 'SalePrice'],
      dtype='object')
Summary statistics of SalePrice:
count 1460.000000
mean 180921.195890
std 79442.502883
min 34900.000000
25% 129975.000000
50% 163000.000000
75% 214000.000000
max 755000.000000
Name: SalePrice, dtype: float64
SalePrice distribution curve (figure)
The plot shows that SalePrice does not follow a standard normal distribution, so let's look at its kurtosis and skewness.
Kurtosis measures how sharp or flat the peak of a distribution is relative to the normal distribution:
kurtosis = 0: same peakedness as the normal distribution.
kurtosis > 0: sharper peak than the normal distribution.
kurtosis < 0: flatter peak than the normal distribution.
Skewness measures the asymmetry of a distribution:
skewness = 0: symmetric, like the normal distribution.
skewness > 0: positively (right) skewed, with a longer right tail.
skewness < 0: negatively (left) skewed, with a longer left tail.
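As a quick illustration (synthetic data, not from the competition): a log-normal sample is right-skewed with a heavy tail, so both statistics come out positive:
rng = np.random.default_rng(0)  # synthetic example only
sample = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=10000))
print(sample.skew() > 0, sample.kurt() > 0)  # True True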
print('Skewness:%f'%df_train['SalePrice'].skew())
print('Kurtosis:%f'%df_train['SalePrice'].kurt())
Skewness:1.882876
Kurtosis:6.536282
(3) Examine how individual features relate to the target.
Above-ground living area (GrLivArea):
var='GrLivArea'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
The scatter plot shows that GrLivArea contains outliers.
Total basement area (TotalBsmtSF):
var='TotalBsmtSF'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
data.plot.scatter(x=var,y='SalePrice',ylim=(0,800000))
plt.show()
Overall material and finish quality (a box plot shows outliers, the median, quartiles, and extremes):
var='OverallQual'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(8,6))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.show()
Original construction year (YearBuilt):
var='YearBuilt'
data=pd.concat([df_train['SalePrice'],df_train[var]],axis=1)
f,ax=plt.subplots(figsize=(16,8))
fig=sns.boxplot(x=var,y='SalePrice',data=data)
fig.axis(ymin=0,ymax=800000)
plt.xticks(rotation=90)
# plt.savefig('year_built.jpg')  # save the figure to the current directory
plt.show()
Feature correlation heatmap:
corrmat = df_train.corr()  # on pandas >= 2.0, pass numeric_only=True to skip object columns
f, ax = plt.subplots(figsize=(12, 9))
sns.heatmap(corrmat, square=True, cmap='YlGnBu')
plt.savefig('heatmap.jpg')
plt.show()
Next, select the 10 features most strongly correlated with SalePrice for a closer look.
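The post does not show this selection step itself; a minimal sketch, assuming the corrmat computed above:
k = 10
top10_cols = corrmat.nlargest(k, 'SalePrice')['SalePrice'].index  # 10 features most correlated with SalePrice
print(top10_cols)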
Draw pairwise plots of the selected features (six of them, plus SalePrice):
sns.set()
cols = ['SalePrice', 'OverallQual', 'GrLivArea', 'GarageCars', 'TotalBsmtSF', 'FullBath', 'YearBuilt']
sns.pairplot(df_train[cols], height=2.5)  # 'size' was renamed to 'height' in seaborn 0.9
plt.savefig('pairplot.jpg')
plt.show()
Check for missing values:
total=df_train.isnull().sum().sort_values(ascending=False)
percent=(df_train.isnull().sum()/df_train.isnull().count()).sort_values(ascending=False)
missing_data=pd.concat([total,percent],axis=1,keys=['Total','Percent'])
print(missing_data.head(20))
Total   Percent
PoolQC 1453 0.995205
MiscFeature 1406 0.963014
Alley 1369 0.937671
Fence 1179 0.807534
FireplaceQu 690 0.472603
LotFrontage 259 0.177397
GarageCond 81 0.055479
GarageType 81 0.055479
GarageYrBlt 81 0.055479
GarageFinish 81 0.055479
GarageQual 81 0.055479
BsmtExposure 38 0.026027
BsmtFinType2 38 0.026027
BsmtFinType1 37 0.025342
BsmtCond 37 0.025342
BsmtQual 37 0.025342
MasVnrArea 8 0.005479
MasVnrType 8 0.005479
Electrical 1 0.000685
Utilities 0 0.000000
4. Data Processing
In this stage we clean up the issues identified during data analysis. How you process the data is up to you, and the quality of this step directly affects model performance.
First import all the libraries we will need (including those for the modeling part later):
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from scipy.stats import norm,skew
from sklearn.preprocessing import LabelEncoder
from scipy import stats
import warnings
warnings.filterwarnings('ignore')
Load the datasets:
train = pd.read_csv('./train.csv')
test = pd.read_csv('./test.csv')
Check the dataset sizes:
print('The train data size before dropping Id feature is :{}'.format(train.shape))
print('The test data size before dropping Id feature is :{}'.format(test.shape))
The train data size before dropping Id feature is :(1460, 81)
The test data size before dropping Id feature is :(1459, 80)
Save the Id column to a variable, drop it, then check the sizes again:
train_ID = train['Id']
test_ID = test['Id']
train.drop('Id', axis=1, inplace=True)
test.drop('Id', axis=1, inplace=True)
print('\nThe train data size after dropping Id feature is :{}'.format(train.shape))
print('The test data size after dropping Id feature is :{}'.format(test.shape))
The train data size after dropping Id feature is :(1460, 80)
The test data size after dropping Id feature is :(1459, 79)
From the analysis stage we know GrLivArea contains outliers. Plot it again (to determine the cutoff for removal), drop the outliers, then plot the cleaned data:
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()

# drop the outliers
train = train.drop(train[(train['GrLivArea'] > 4000) & (train['SalePrice'] < 300000)].index)

# re-plot after removing the outliers
fig, ax = plt.subplots()
ax.scatter(x=train['GrLivArea'], y=train['SalePrice'])
plt.ylabel('SalePrice', fontsize=13)
plt.xlabel('GrLivArea', fontsize=13)
plt.show()
Transform the target variable toward a normal distribution:
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu = {:.2f} and sigma={:.2f}\n'.format(mu, sigma))
mu = 180932.92 and sigma=79467.79
Q-Q plot (shows how far the data deviates from a normal distribution):
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
The Q-Q plot shows a large deviation from normality, so apply a log transform:
train['SalePrice'] = np.log1p(train['SalePrice'])
sns.distplot(train['SalePrice'], fit=norm)
plt.show()
(mu, sigma) = norm.fit(train['SalePrice'])
print('\n mu={:.2f} and sigma={:.2f}\n'.format(mu, sigma))
# Q-Q plot
fig = plt.figure()
res = stats.probplot(train['SalePrice'], plot=plt)
plt.show()
mu=12.02 and sigma=0.40
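Because the model now learns log1p(SalePrice), predictions must be mapped back to the original dollar scale with np.expm1, which is exactly what the prediction code later does. A quick sanity check (values taken from the describe() output above):
x = np.array([34900.0, 163000.0, 755000.0])  # min / median / max SalePrice
assert np.allclose(np.expm1(np.log1p(x)), x)  # expm1 inverts log1p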
Handling missing values:
Before filling anything, combine the training and test sets so both are processed together:
ntrain = train.shape[0]
ntest = test.shape[0]
y_train = train.SalePrice.values
all_data = pd.concat((train, test)).reset_index(drop=True)
all_data.drop(['SalePrice'], axis=1, inplace=True)
print('all_data size is :{}'.format(all_data.shape))
all_data size is :(2917, 79)
Print the missing-data ratios:
all_data_na = (all_data.isnull().sum() / len(all_data)) * 100
all_data_na = all_data_na.drop(all_data_na[all_data_na == 0].index).sort_values(ascending=False)[:30]
missing_data = pd.DataFrame({'Missing Ratio': all_data_na})
print(missing_data)
Missing Ratio
PoolQC 99.691464
MiscFeature 96.400411
Alley 93.212204
Fence 80.425094
FireplaceQu 48.680151
LotFrontage 16.660953
GarageFinish 5.450806
GarageYrBlt 5.450806
GarageQual 5.450806
GarageCond 5.450806
GarageType 5.382242
BsmtExposure 2.811107
BsmtCond 2.811107
BsmtQual 2.776826
BsmtFinType2 2.742544
BsmtFinType1 2.708262
MasVnrType 0.822763
MasVnrArea 0.788481
MSZoning 0.137127
BsmtFullBath 0.068564
BsmtHalfBath 0.068564
Utilities 0.068564
Functional 0.068564
Exterior2nd 0.034282
Exterior1st 0.034282
SaleType 0.034282
BsmtFinSF1 0.034282
BsmtFinSF2 0.034282
BsmtUnfSF 0.034282
Electrical 0.034282
f, ax = plt.subplots(figsize=(15, 12))
plt.xticks(rotation=90)
sns.barplot(x=all_data_na.index, y=all_data_na)
plt.xlabel('Features', fontsize=15)
plt.ylabel('Percent of missing values', fontsize=15)
plt.title('Percent missing data by feature', fontsize=15)
plt.show()
Fill the missing values found above:
# PoolQC: NA means no pool
all_data['PoolQC'] = all_data['PoolQC'].fillna('None')
# MiscFeature: NA means no miscellaneous feature
all_data['MiscFeature'] = all_data['MiscFeature'].fillna('None')
# Alley: NA means no alley access
all_data['Alley'] = all_data['Alley'].fillna('None')
# Fence: NA means no fence
all_data['Fence'] = all_data['Fence'].fillna('None')
# FireplaceQu: NA means no fireplace
all_data['FireplaceQu'] = all_data['FireplaceQu'].fillna('None')
# LotFrontage: fill with the median LotFrontage of the house's neighborhood
all_data['LotFrontage'] = all_data.groupby('Neighborhood')['LotFrontage'].transform(
    lambda x: x.fillna(x.median()))
# garage-related features
for col in ('GarageFinish', 'GarageQual', 'GarageCond', 'GarageType'):
    all_data[col] = all_data[col].fillna('None')
for col in ('GarageYrBlt', 'GarageArea', 'GarageCars'):
    all_data[col] = all_data[col].fillna(0)
# basement-related features
for col in ('BsmtFullBath', 'BsmtUnfSF', 'TotalBsmtSF', 'BsmtFinSF1', 'BsmtFinSF2', 'BsmtHalfBath'):
    all_data[col] = all_data[col].fillna(0)
for col in ('BsmtExposure', 'BsmtCond', 'BsmtQual', 'BsmtFinType2', 'BsmtFinType1'):
    all_data[col] = all_data[col].fillna('None')
# masonry veneer
all_data['MasVnrType'] = all_data['MasVnrType'].fillna('None')
all_data['MasVnrArea'] = all_data['MasVnrArea'].fillna(0)
# MSZoning (general zoning classification): fill with the mode
all_data['MSZoning'] = all_data['MSZoning'].fillna(all_data['MSZoning'].mode()[0])
# Functional: the data description says NA means typical ('Typ')
all_data['Functional'] = all_data['Functional'].fillna('Typ')
# electrical system: fill with the mode
all_data['Electrical'] = all_data['Electrical'].fillna(all_data['Electrical'].mode()[0])
# kitchen quality: fill with the mode
all_data['KitchenQual'] = all_data['KitchenQual'].fillna(all_data['KitchenQual'].mode()[0])
# exterior covering: fill with the mode
all_data['Exterior1st'] = all_data['Exterior1st'].fillna(all_data['Exterior1st'].mode()[0])
all_data['Exterior2nd'] = all_data['Exterior2nd'].fillna(all_data['Exterior2nd'].mode()[0])
# sale type: fill with the mode
all_data['SaleType'] = all_data['SaleType'].fillna(all_data['SaleType'].mode()[0])
# building class: NA means 'None'
all_data['MSSubClass'] = all_data['MSSubClass'].fillna('None')
# Utilities is almost constant, so drop it
all_data = all_data.drop(['Utilities'], axis=1)
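Why drop Utilities? In this dataset nearly every record has the value 'AllPub', so the column carries almost no information. You can verify this yourself (run the check before the drop above):
print(all_data['Utilities'].value_counts())  # almost every row is 'AllPub'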
After filling, check the missing values again:
all_data_na=(all_data.isnull().sum()/len(all_data))*100
all_data_na=all_data_na.drop(all_data_na[all_data_na==0].index).sort_values(ascending=False)
missing_data=pd.DataFrame({'Missing Ratio':all_data_na})
print(missing_data)
Empty DataFrame
Columns: [Missing Ratio]
Index: []
All missing values have been filled!
Convert some features that are numeric but not truly continuous into categorical (string) values:
all_data['MSSubClass']=all_data['MSSubClass'].apply(str)
all_data['OverallCond']=all_data['OverallCond'].astype(str)
all_data['YrSold']=all_data['YrSold'].astype(str)
all_data['MoSold']=all_data['MoSold'].astype(str)
Use sklearn's LabelEncoder to encode the categorical (discrete) features as consecutive integers from 0 to n-1:
cols = ('FireplaceQu', 'BsmtQual', 'BsmtCond', 'GarageQual', 'GarageCond',
        'ExterQual', 'ExterCond', 'HeatingQC', 'PoolQC', 'KitchenQual', 'BsmtFinType1',
        'BsmtFinType2', 'Functional', 'Fence', 'BsmtExposure', 'GarageFinish', 'LandSlope',
        'LotShape', 'PavedDrive', 'Street', 'Alley', 'CentralAir', 'MSSubClass', 'OverallCond',
        'YrSold', 'MoSold')
for c in cols:
    lbl = LabelEncoder()
    lbl.fit(list(all_data[c].values))
    all_data[c] = lbl.transform(list(all_data[c].values))
# shape
print('Shape all_data: {}'.format(all_data.shape))
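For intuition, a toy example (illustrative values, not from the dataset) of what LabelEncoder does; classes are sorted alphabetically and mapped to integers:
lbl = LabelEncoder()
print(lbl.fit_transform(['Gd', 'TA', 'Ex', 'Gd']))  # [1 2 0 1]
print(list(lbl.classes_))                           # ['Ex', 'Gd', 'TA']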
House prices usually relate to total living area, so create an extra feature that combines the basement and floor areas:
all_data['TotalSF'] = all_data['TotalBsmtSF'] + all_data['1stFlrSF'] + all_data['2ndFlrSF']
Next, check the skewness of the numeric features. Note that skewness cannot be computed for object-type columns, so they must be filtered out first.
numeric_feats = all_data.dtypes[all_data.dtypes != 'object'].index
# check the skew of all numerical features
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
skewness = pd.DataFrame({'Skew': skewed_feats})
print(skewness.head(10))
Skew
MiscVal 21.939672
PoolArea 17.688664
LotArea 13.109495
LowQualFinSF 12.084539
3SsnPorch 11.372080
LandSlope 4.973254
KitchenAbvGr 4.300550
BsmtFinSF2 4.144503
EnclosedPorch 4.002344
ScreenPorch 3.945101
For features with excessive skewness, apply scipy's Box-Cox transformation to reduce the skew:
# Box-Cox transformation of highly skewed features
# (background on the Box-Cox transform is easy to find online)
skewness = skewness[abs(skewness['Skew']) > 0.75]  # keep only features with |skew| > 0.75
print('there are {} skewed numerical features to Box Cox transform'.format(skewness.shape[0]))
from scipy.special import boxcox1p
skewed_feats_index = skewness.index
lam = 0.15
for feat in skewed_feats_index:
    all_data[feat] = boxcox1p(all_data[feat], lam)
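For reference, boxcox1p(x, lam) applies the Box-Cox transform to 1 + x, i.e. ((1 + x)**lam - 1) / lam for lam != 0, reducing to log1p(x) as lam approaches 0. A quick check of that identity:
x = np.array([0.0, 1.0, 10.0, 100.0])
assert np.allclose(boxcox1p(x, lam), ((1 + x) ** lam - 1) / lam)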
Use pandas get_dummies to one-hot encode the remaining categorical features and build the final training and test sets:
# one-hot encode the categorical features
all_data = pd.get_dummies(all_data)
print(all_data.shape)
# getting the new train and test sets
train = all_data[:ntrain]
test = all_data[ntrain:]
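One reason to run get_dummies on the combined frame: encoding train and test separately could produce different dummy columns whenever a category appears in only one of them. Since both sets here are slices of the same encoded frame, their columns are guaranteed to line up:
assert list(train.columns) == list(test.columns)  # same one-hot columns in both sets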
5. Model Building
Import the required libraries:
We use sklearn's cross_val_score for cross-validation. Since this function has no shuffle option of its own, we additionally use KFold to shuffle and split the dataset.
from sklearn.linear_model import ElasticNet, Lasso, BayesianRidge, LassoLarsIC
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.model_selection import KFold, cross_val_score, train_test_split
from sklearn.metrics import mean_squared_error
import xgboost as xgb
import lightgbm as lgb

n_folds = 5

def rmsle_cv(model):
    # Note: get_n_splits returns the integer n_folds, so cross_val_score below falls
    # back to plain (unshuffled) K-fold; pass the KFold object itself as cv to shuffle.
    kf = KFold(n_folds, shuffle=True, random_state=42).get_n_splits(train.values)
    rmse = np.sqrt(-cross_val_score(model, train.values, y_train,
                                    scoring='neg_mean_squared_error', cv=kf))
    return rmse
Lasso model:
lasso=make_pipeline(RobustScaler(),Lasso(alpha=0.0005,random_state=1))
ElasticNet model:
ENet=make_pipeline(RobustScaler(),ElasticNet(alpha=0.0005,l1_ratio=.9,random_state=3))
Kernel ridge regression (KernelRidge):
KRR=KernelRidge(alpha=0.6, kernel='polynomial', degree=2, coef0=2.5)
GradientBoostingRegressor model:
GBoost = GradientBoostingRegressor(n_estimators=3000, learning_rate=0.05,
                                   max_depth=4, max_features='sqrt',
                                   min_samples_leaf=15, min_samples_split=10,
                                   loss='huber', random_state=5)
XGBoost model:
xgb_model = xgb.XGBRegressor(colsample_bytree=0.4603, gamma=0.0468,
                             learning_rate=0.05, max_depth=3,
                             min_child_weight=1.7817, n_estimators=2200,
                             reg_alpha=0.4640, reg_lambda=0.8571,
                             subsample=0.5213, silent=1,
                             random_state=7, nthread=-1)
LightGBM model:
lgb_model = lgb.LGBMRegressor(objective='regression', num_leaves=1000,
                              learning_rate=0.05, n_estimators=350, reg_alpha=0.9)
Print each model's cross-validation score:
score = rmsle_cv(lasso)
print("\nLasso score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(ENet)
print("ElasticNet score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(KRR)
print("Kernel Ridge score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(GBoost)
print("Gradient Boosting score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(xgb_model)
print("Xgboost score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
score = rmsle_cv(lgb_model)
print("lightgbm score: {:.4f} ({:.4f})\n".format(score.mean(), score.std()))
Lasso score: 0.1115 (0.0074)
ElasticNet score: 0.1116 (0.0074)
Kernel Ridge score: 0.1153 (0.0075)
Gradient Boosting score: 0.1167 (0.0083)
Xgboost score: 0.1164 (0.0070)
lightgbm score: 0.1288 (0.0058)
Averaging the models
class AveragingModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, models):
        self.models = models

    # we define clones of the original models to fit the data in
    def fit(self, X, y):
        self.models_ = [clone(x) for x in self.models]
        # train cloned base models
        for model in self.models_:
            model.fit(X, y)
        return self

    # we do the predictions for cloned models and average them
    def predict(self, X):
        predictions = np.column_stack([model.predict(X) for model in self.models_])
        return np.mean(predictions, axis=1)
Here we average four models: ENet, GBoost, KRR, and lasso:
averaged_models = AveragingModels(models=(ENet, GBoost, KRR, lasso))
score_all = rmsle_cv(averaged_models)
print('Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all.mean(), score_all.std()))
Averaged base models score: 0.1087 (0.0077)
Stacking model
Stacking adds a meta-model on top of the base models. Here we build on the averaging approach: the base models are trained, and their out-of-fold predictions are used to train a meta-model. The training steps are:
1. Split the training set into two parts: train_a and train_b.
2. Train the base models on train_a.
3. Use the trained base models to predict on the holdout part train_b.
4. Use the predictions from step 3 as input features to train the meta-model.
Reference: link. We use 5-fold stacking: the training set is split into 5 parts, and each iteration trains the base models on 4 parts and predicts on the remaining part. After five iterations we have out-of-fold predictions covering the whole training set, and these become the meta-model's input features (the target variable is unchanged). For prediction, we average the base models' predictions on the test data and feed that average into the meta-model.
class StackingAveragedModels(BaseEstimator, RegressorMixin, TransformerMixin):
    def __init__(self, base_models, meta_model, n_folds=5):
        self.base_models = base_models
        self.meta_model = meta_model
        self.n_folds = n_folds

    # We again fit the data on clones of the original models
    def fit(self, X, y):
        self.base_models_ = [list() for x in self.base_models]
        self.meta_model_ = clone(self.meta_model)
        kfold = KFold(n_splits=self.n_folds, shuffle=True, random_state=156)

        # Train cloned base models then create out-of-fold predictions
        # that are needed to train the cloned meta-model
        out_of_fold_predictions = np.zeros((X.shape[0], len(self.base_models)))
        for i, model in enumerate(self.base_models):
            for train_index, holdout_index in kfold.split(X, y):
                instance = clone(model)
                self.base_models_[i].append(instance)
                instance.fit(X[train_index], y[train_index])
                y_pred = instance.predict(X[holdout_index])
                out_of_fold_predictions[holdout_index, i] = y_pred

        # Now train the cloned meta-model using the out-of-fold predictions as new features
        self.meta_model_.fit(out_of_fold_predictions, y)
        return self

    # Do the predictions of all base models on the test data and use the averaged
    # predictions as meta-features for the final prediction by the meta-model
    def predict(self, X):
        meta_features = np.column_stack([
            np.column_stack([model.predict(X) for model in base_models]).mean(axis=1)
            for base_models in self.base_models_])
        return self.meta_model_.predict(meta_features)
Use the previously defined ENet, GBoost, and KRR as base models with lasso as the meta-model; train, predict, and compute the score:
stacked_averaged_models = StackingAveragedModels(base_models = (ENet, GBoost, KRR), meta_model = lasso)
score_all_stacked = rmsle_cv(stacked_averaged_models)
print('Stacking Averaged base models score: {:.4f} ({:.4f})\n'.format(score_all_stacked.mean(), score_all_stacked.std()))
Stacking Averaged base models score: 0.1081 (0.0073)
Ensembling
Combine the previous models into a more effective ensemble of the StackedRegressor, XGBoost, and LightGBM models.
Define the evaluation function:
def rmsle(y, y_pred):
    return np.sqrt(mean_squared_error(y, y_pred))
Train the StackedRegressor, XGBoost, and LightGBM models separately:
stacked_averaged_models.fit(train.values, y_train)
stacked_train_pred = stacked_averaged_models.predict(train.values)
stacked_test_pred = np.expm1(stacked_averaged_models.predict(test.values))
print(rmsle(y_train, stacked_train_pred))
xgb_model.fit(train, y_train)
xgb_train_pred = xgb_model.predict(train)
xgb_test_pred = np.expm1(xgb_model.predict(test))
print(rmsle(y_train, xgb_train_pred))
lgb_model.fit(train, y_train)
lgb_train_pred = lgb_model.predict(train)
lgb_test_pred = np.expm1(lgb_model.predict(test.values))
print(rmsle(y_train, lgb_train_pred))
0.07839506096666995
0.07876052198274874
0.05893922686966146
Combine the StackedRegressor, XGBoost, and LightGBM predictions with a weighted average:
print('RMSLE score on train data all models:')
print(rmsle(y_train, stacked_train_pred * 0.60 + xgb_train_pred * 0.20 + lgb_train_pred * 0.20))
# ensemble prediction on the test set
ensemble_result = stacked_test_pred * 0.60 + xgb_test_pred * 0.20 + lgb_test_pred * 0.20
Generate the submission file:
submission = pd.DataFrame()
submission['Id'] = test_ID
submission['SalePrice'] = ensemble_result
submission.to_csv(r'E:\fangjiayucejieguo\submission.csv', index=False)