Predicting House Prices with Decision Trees and Random Forests in scikit-learn

Dataset


Obtaining the Dataset

Load the dataset:

import pandas as pd

house_trainset_path = '/path/to/train.csv'
house_trainset = pd.read_csv(house_trainset_path)
print(house_trainset.columns)
print(house_trainset.describe())

Output:

Index(['Id', 'MSSubClass', 'MSZoning', 'LotFrontage', 'LotArea', 'Street',
'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType',
'HouseStyle', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd',
'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType',
'MasVnrArea', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinSF1',
'BsmtFinType2', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', 'Heating',
'HeatingQC', 'CentralAir', 'Electrical', '1stFlrSF', '2ndFlrSF',
'LowQualFinSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath', 'FullBath',
'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'KitchenQual',
'TotRmsAbvGrd', 'Functional', 'Fireplaces', 'FireplaceQu', 'GarageType',
'GarageYrBlt', 'GarageFinish', 'GarageCars', 'GarageArea', 'GarageQual',
'GarageCond', 'PavedDrive', 'WoodDeckSF', 'OpenPorchSF',
'EnclosedPorch', '3SsnPorch', 'ScreenPorch', 'PoolArea', 'PoolQC',
'Fence', 'MiscFeature', 'MiscVal', 'MoSold', 'YrSold', 'SaleType',
'SaleCondition', 'SalePrice'],
dtype='object')
                Id   MSSubClass  LotFrontage        LotArea  OverallQual  \
count  1460.000000  1460.000000  1201.000000    1460.000000  1460.000000
mean    730.500000    56.897260    70.049958   10516.828082     6.099315
std     421.610009    42.300571    24.284752    9981.264932     1.382997
min       1.000000    20.000000    21.000000    1300.000000     1.000000
25%     365.750000    20.000000    59.000000    7553.500000     5.000000
50%     730.500000    50.000000    69.000000    9478.500000     6.000000
75%    1095.250000    70.000000    80.000000   11601.500000     7.000000
max    1460.000000   190.000000   313.000000  215245.000000    10.000000

       OverallCond    YearBuilt  YearRemodAdd   MasVnrArea   BsmtFinSF1  \
count  1460.000000  1460.000000   1460.000000  1452.000000  1460.000000
mean      5.575342  1971.267808   1984.865753   103.685262   443.639726
std       1.112799    30.202904     20.645407   181.066207   456.098091
min       1.000000  1872.000000   1950.000000     0.000000     0.000000
25%       5.000000  1954.000000   1967.000000     0.000000     0.000000
50%       5.000000  1973.000000   1994.000000     0.000000   383.500000
75%       6.000000  2000.000000   2004.000000   166.000000   712.250000
max       9.000000  2010.000000   2010.000000  1600.000000  5644.000000

       ...   WoodDeckSF  OpenPorchSF  EnclosedPorch    3SsnPorch  \
count  ...  1460.000000  1460.000000    1460.000000  1460.000000
mean   ...    94.244521    46.660274      21.954110     3.409589
std    ...   125.338794    66.256028      61.119149    29.317331
min    ...     0.000000     0.000000       0.000000     0.000000
25%    ...     0.000000     0.000000       0.000000     0.000000
50%    ...     0.000000    25.000000       0.000000     0.000000
75%    ...   168.000000    68.000000       0.000000     0.000000
max    ...   857.000000   547.000000     552.000000   508.000000

       ScreenPorch     PoolArea       MiscVal       MoSold       YrSold  \
count  1460.000000  1460.000000   1460.000000  1460.000000  1460.000000
mean     15.060959     2.758904     43.489041     6.321918  2007.815753
std      55.757415    40.177307    496.123024     2.703626     1.328095
min       0.000000     0.000000      0.000000     1.000000  2006.000000
25%       0.000000     0.000000      0.000000     5.000000  2007.000000
50%       0.000000     0.000000      0.000000     6.000000  2008.000000
75%       0.000000     0.000000      0.000000     8.000000  2009.000000
max     480.000000   738.000000  15500.000000    12.000000  2010.000000

           SalePrice
count    1460.000000
mean   180921.195890
std     79442.502883
min     34900.000000
25%    129975.000000
50%    163000.000000
75%    214000.000000
max    755000.000000

Data Processing


Pick the feature columns and the target column:

columns_x = ['LotArea', 'LotFrontage', 'MSSubClass', 'LotShape', 'OverallQual', 'OverallCond', 'YearBuilt', 'YearRemodAdd', 'MasVnrArea',
             'BsmtFinSF1', 'BsmtFinSF2', 'BsmtUnfSF', 'TotalBsmtSF', '1stFlrSF', '2ndFlrSF', 'GrLivArea', 'BsmtFullBath', 'BsmtHalfBath',
             'FullBath', 'HalfBath', 'BedroomAbvGr', 'KitchenAbvGr', 'TotRmsAbvGrd', 'Fireplaces', 'GarageYrBlt', 'GarageCars',
             'GarageArea', 'WoodDeckSF', 'OpenPorchSF', 'EnclosedPorch', '3SsnPorch', 'ScreenPorch']
column_y = 'SalePrice'
columns_of_interest = list(columns_x)  # copy, so appending the target does not modify columns_x
columns_of_interest.append(column_y)

Drop rows containing NaN:

filtered_trainset = house_trainset[columns_of_interest].dropna(axis=0)
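
Dropping rows throws away some training data (LotFrontage and MasVnrArea have missing values, as the describe() output above shows). As a rough alternative sketch, not used in the rest of this article, the missing numeric values could be imputed instead:

from sklearn.impute import SimpleImputer

# Sketch only: fill missing numeric values with the column median instead of dropping rows.
numeric_cols = house_trainset[columns_of_interest].select_dtypes(include='number').columns
imputed_trainset = house_trainset[columns_of_interest].copy()
imputed_trainset[numeric_cols] = SimpleImputer(strategy='median').fit_transform(imputed_trainset[numeric_cols])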

Convert string columns to numbers, using LotShape as an example:

lot_shape_map = {"Reg": 1, "IR1": 2, "IR2": 3, "IR3": 4}
filtered_trainset['LotShape'] = filtered_trainset['LotShape'].map(lot_shape_map).fillna(0)
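
The same mapping idea works for other string columns. For categories with no natural ordering, one-hot encoding is another common option; the sketch below uses MSZoning purely as an example, and that column is not part of columns_x:

# Sketch only: one-hot encode an unordered categorical column with pandas.
mszoning_dummies = pd.get_dummies(house_trainset['MSZoning'], prefix='MSZoning')
print(mszoning_dummies.head())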

Split the dataset into training and validation sets (train_x is the feature matrix, train_y the target column):

from sklearn.model_selection import train_test_split

train_x = filtered_trainset[columns_x]
train_y = filtered_trainset[column_y]
split_train_X, split_val_X, split_train_y, split_val_y = train_test_split(
    train_x,
    train_y,
    random_state=0)

Model Selection


With a rough understanding of the dataset, the next step is to pick models, train them, make predictions, and choose the one with the smallest error.

Decision trees can handle both numerical and categorical data, are easy to understand and interpret, and are not black-box models: given reasonably well-prepared data, they learn from existing data and make predictions quickly.
However, they do not necessarily perform well on unseen data and are prone to overfitting.

Random forests are one effective way to address this problem; other approaches include pruning, Bagging, Boosting Trees, and Rotation Forests.
As the name suggests:

A random forest is a classifier made up of many decision trees, and the class it outputs is the mode of the classes output by the individual trees.
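
As a rough illustration of that idea (a minimal sketch, not how scikit-learn actually implements RandomForestRegressor), several trees can be trained on bootstrap samples of the rows and their outputs combined; for regression the combination is an average rather than a mode:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def toy_forest_predict(train_x1, train_y1, val_x, n_trees=10):
    # Train each tree on a bootstrap sample of the rows and average the predictions.
    # (A real random forest also subsamples the candidate features at each split.)
    rng = np.random.RandomState(0)
    predictions = []
    for _ in range(n_trees):
        idx = rng.randint(0, len(train_x1), len(train_x1))
        tree = DecisionTreeRegressor()
        tree.fit(train_x1.iloc[idx], train_y1.iloc[idx])
        predictions.append(tree.predict(val_x))
    return np.mean(predictions, axis=0)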

Still, we need to actually run the models and compare the results ourselves.

Decision Tree


First, evaluate decision trees with different maximum numbers of leaf nodes.

def decision_tree_train():
    for max_leaf_nodes in [5, 50, 500, 5000]:
        my_mae = get_mae(max_leaf_nodes, split_train_X, split_val_X, split_train_y, split_val_y)
        print("Max leaf nodes: %d \t\t Mean Absolute Error: %d" % (max_leaf_nodes, my_mae))

Here the mean absolute error (MAE) is used to measure how well the model performs.

from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_absolute_error

def get_mae(max_leaf_nodes, train_x1, val_x, train_y1, val_y):
    model = DecisionTreeRegressor(max_leaf_nodes=max_leaf_nodes, random_state=0)
    model.fit(train_x1, train_y1)
    predict_val = model.predict(val_x)
    mae = mean_absolute_error(val_y, predict_val)
    return mae
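
As a quick sanity check with made-up numbers (not taken from the dataset), MAE is just the mean of the absolute differences between the true and predicted values, which is exactly what mean_absolute_error returns:

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([200000, 150000, 300000])
y_pred = np.array([210000, 140000, 290000])
print(np.mean(np.abs(y_true - y_pred)))     # 10000.0
print(mean_absolute_error(y_true, y_pred))  # 10000.0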

Result:

Max leaf nodes: 5        Mean Absolute Error: 32983
Max leaf nodes: 50       Mean Absolute Error: 24687
Max leaf nodes: 500      Mean Absolute Error: 26875
Max leaf nodes: 5000     Mean Absolute Error: 26928
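
Here 50 leaf nodes gives the lowest validation MAE. As a small sketch that goes slightly beyond the original code, the best value can be picked programmatically and a final tree refit on the full training data:

from sklearn.tree import DecisionTreeRegressor

candidate_leaf_nodes = [5, 50, 500, 5000]
scores = {n: get_mae(n, split_train_X, split_val_X, split_train_y, split_val_y)
          for n in candidate_leaf_nodes}
best_leaf_nodes = min(scores, key=scores.get)  # 50 for the run above

# Refit on all of the training data with the chosen setting.
final_tree = DecisionTreeRegressor(max_leaf_nodes=best_leaf_nodes, random_state=0)
final_tree.fit(train_x, train_y)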

Random Forest


from sklearn.ensemble import RandomForestRegressor

def random_forest_train():
    forest_model = RandomForestRegressor()
    forest_model.fit(split_train_X, split_train_y)
    predict_y = forest_model.predict(split_val_X)
    print("random forest :")
    print(mean_absolute_error(split_val_y, predict_y))

Result:

random forest:
19569.5348754
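
RandomForestRegressor is used above with its default parameters. As a rough sketch (the values below are arbitrary examples, not tuned results), hyperparameters such as n_estimators can be compared on the same validation split:

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

for n_estimators in [10, 100, 300]:  # arbitrary example values
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=0)
    model.fit(split_train_X, split_train_y)
    mae = mean_absolute_error(split_val_y, model.predict(split_val_X))
    print("n_estimators: %d \t Mean Absolute Error: %d" % (n_estimators, mae))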

Conclusion


From these test results, the random forest performs better than a single decision tree.

References


Random Forest
https://zh.wikipedia.org/wiki/%E5%86%B3%E7%AD%96%E6%A0%91%E5%AD%A6%E4%B9%A0
https://clyyuanzi.gitbooks.io/julymlnotes/content/rf.html