First, I fit a baseline model with default parameters and compute its MAE on the test set (see the "rfr base" image).
# BASELINE MODEL
rfr_pipe.fit(train_x, train_y)
base_rfr_pred = rfr_pipe.predict(test_x)
base_rfr_mae = mean_absolute_error(test_y, base_rfr_pred)
MAE = 2.188
Then I run GridSearchCV to find the best parameters and the mean cross-validated MAE (see the "rfr grid" image).
# RFR GRIDSEARCHCV
rfr_param = {'rfr_model__n_estimators' : [10, 100, 500, 1000],
             'rfr_model__max_depth' : [None, 5, 10, 15, 20],
             'rfr_model__min_samples_leaf' : [10, 100, 500, 1000],
             'rfr_model__max_features' : ['auto', 'sqrt', 'log2']}
rfr_grid = GridSearchCV(estimator = rfr_pipe, param_grid = rfr_param, n_jobs = -1,
                        cv = 5, scoring = 'neg_mean_absolute_error')
rfr_grid.fit(train_x, train_y)
print('best parameters are:-', rfr_grid.best_params_)
print('best mae is:- ', -1 * rfr_grid.best_score_)
MAE = 2.697
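Note that `best_score_` is the mean MAE over the 5 validation folds taken from the training data, so it is not measured on the same rows as the test-set MAE of the baseline. A minimal sketch of a like-for-like comparison follows, assuming synthetic data from `make_regression` and a small stand-in pipeline; `demo_pipe`, `demo_grid`, the scaler step, and the grid values are hypothetical and not the ones used above.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real train/test split is not shown).
X, y = make_regression(n_samples=300, n_features=8, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical pipeline; the step name 'rfr_model' mirrors the grid keys above.
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("rfr_model", RandomForestRegressor(n_estimators=30, random_state=0)),
])

demo_grid = GridSearchCV(
    estimator=demo_pipe,
    param_grid={"rfr_model__min_samples_leaf": [1, 10]},
    cv=5,
    scoring="neg_mean_absolute_error",
)
demo_grid.fit(X_train, y_train)

# Mean MAE over the validation folds of the training data.
cv_mae = -demo_grid.best_score_

# MAE of the refitted best estimator on the held-out test set;
# this is the number comparable with the baseline's test MAE.
test_mae = mean_absolute_error(y_test, demo_grid.predict(X_test))

print("mean CV MAE:", cv_mae)
print("test MAE:   ", test_mae)
```

With the default `refit=True`, `demo_grid.predict` uses `best_estimator_`, which has already been refitted on the full training set with the best parameters.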
Then I fit a new model with the "best parameters" obtained to get an optimized MAE, but the result is always worse than the baseline MAE (see the "opt rfr" image).
# OPTIMIZED RFR MODEL
opt_rfr = RandomForestRegressor(random_state = 69, criterion = 'mae', max_depth = None,
                                max_features = 'auto', min_samples_leaf = 10, n_estimators = 100)
opt_rfr_pipe = Pipeline(steps = [('rfr_preproc', preproc), ('opt_rfr_model', opt_rfr)])
opt_rfr_pipe.fit(train_x, train_y)
opt_rfr_pred = opt_rfr_pipe.predict(test_x)
opt_rfr_mae = mean_absolute_error(test_y, opt_rfr_pred)
MAE = 2.496
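A model rebuilt by hand can differ from the one the search actually evaluated if any setting is retyped differently (for example, `criterion` is set explicitly here, while the definition of `rfr_pipe` is not shown). One way to rule that out is to clone the searched pipeline and apply `best_params_` to it. The sketch below assumes synthetic data and a hypothetical stand-in pipeline (`demo_pipe`, `demo_grid`, `rebuilt_pipe` are illustrative names).

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: the real data is not shown).
X, y = make_regression(n_samples=200, n_features=6, noise=5.0, random_state=0)

# Hypothetical pipeline standing in for the searched one.
demo_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("rfr_model", RandomForestRegressor(n_estimators=20, random_state=0)),
])

demo_grid = GridSearchCV(
    estimator=demo_pipe,
    param_grid={"rfr_model__max_depth": [None, 5]},
    cv=3,
    scoring="neg_mean_absolute_error",
)
demo_grid.fit(X, y)

# Copy the unfitted pipeline and change only the tuned parameters,
# so every other setting stays exactly as it was during the search.
rebuilt_pipe = clone(demo_pipe).set_params(**demo_grid.best_params_)
rebuilt_pipe.fit(X, y)

# Compare the rebuilt model with the estimator the search refitted itself.
matches = np.allclose(
    rebuilt_pipe.predict(X), demo_grid.best_estimator_.predict(X)
)
print("rebuilt model matches best_estimator_:", matches)
```

Because the parameter names in `best_params_` already carry the `rfr_model__` prefix, they can be passed to the pipeline's `set_params` unchanged.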
This happens not just once but every time, and with most of the models I try (linear regression, random forest regressor). I suspect something is fundamentally wrong with my code, otherwise the problem would not arise every time. Any idea what might be causing this?