I am building a regression model on a dataset which has categorical and numerical variables along with NaN values. I want to use Pipelines for imputation and encoding. I have a few conditions that must be satisfied in building the model, which are as follows:

1.) Use of Pipelines is a must for imputation and encoding (one-hot encoding).

2.) Imputation must be done AFTER the train/test split.

3.) For feature selection (also to be done AFTER the train/test split), use of mutual_info_regression and RFECV is a must.

This is what I tried so far:-

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV, mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# X AND Y FEATURES
y = np.log(data2['price'])
data2.drop(['price'], axis=1, inplace=True)

# CATEGORICAL VARIABLES AND NUMERICAL VARIABLES
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']

# IMPUTATION/ENCODING TO BE DONE
num_trans = SimpleImputer(strategy='mean')
cat_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                            ('onehotencode', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# PREPROCESSING USING COLUMN TRANSFORMER
preproc = ColumnTransformer(transformers=[('cat', cat_trans, cat_cols),
                                          ('num', num_trans, num_cols)])

# MODEL INSTANCE
lire_model = LinearRegression(n_jobs=-1)

# FINAL PIPELINE WHICH IMPUTES, ENCODES AND THEN FITS THE MODEL WHEN CALLED
lire_pipe = Pipeline(steps=[('preproc', preproc), ('model', lire_model)])

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size=0.2,
                                                    random_state=69)

# FEATURE SELECTION SECTION

# MUTUAL INFO FOR ALL VARIABLES
mi = mutual_info_regression(train_x, train_y)
mi = pd.Series(mi, index=train_x.columns)
mi.sort_values(ascending=False)

# RFE USING CV
rfecv = RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
              cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
rfecv.fit(train_x, train_y)
print('optimal no of features:', rfecv.n_features_)
train_x.columns[rfecv.get_support()]

# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x, train_y, cv=5,
                                        n_jobs=-1, scoring='neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()

Now the problem I am facing is that up until the train_test_split part the pipeline has never actually been called, so none of the NaN values have been imputed and no encoding has been done. Running anything after train_test_split (i.e. the feature selection part) therefore gives me an error, because mutual_info_regression and RFECV are being handed data that still contains NaN values and raw categorical variables.

The pipeline is not called until the baseline-model cross-validation; only at that point do the imputation and encoding actually happen. Not before that!
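From what I understand, the imputation/encoding only materialises when fit/fit_transform is actually invoked, so one idea I considered (I am not sure it counts as "using the Pipeline" for my requirements) was to fit just the preproc step on the training data right after the split and hand the transformed arrays to the feature selection code, roughly like this (column names are lost in the process unless your sklearn version supports get_feature_names_out on the ColumnTransformer):

# fit the preprocessor on the training set only, then transform both sets;
# this runs the imputation/encoding explicitly, after the split
train_x_proc = preproc.fit_transform(train_x)
test_x_proc = preproc.transform(test_x)

# mutual information now sees fully numeric, NaN-free data
mi = pd.Series(mutual_info_regression(train_x_proc, train_y))
mi.sort_values(ascending=False)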

I tried something like the below as a workaround (everything is the same up until train_test_split):

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size=0.2,
                                                    random_state=69)

# MANUALLY IMPUTING NAN VALUES
train_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace=True)
train_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace=True)
train_x['model'].fillna(train_x['model'].value_counts().index[0], inplace=True)
train_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace=True)
train_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0], inplace=True)

test_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace=True)
test_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace=True)
test_x['model'].fillna(train_x['model'].value_counts().index[0], inplace=True)
test_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace=True)
test_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0], inplace=True)
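(For what it's worth, I believe these ten fillna calls are just a hand-rolled version of SimpleImputer(strategy='most_frequent') fitted on the training set and applied to both sets; something like the sketch below, where mode_imputer is just an illustrative name:)

# the same five columns that have NaNs in my data
cols_with_na = ['vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage']
mode_imputer = SimpleImputer(strategy='most_frequent')
# learn the modes on train only, then reuse them on test
train_x[cols_with_na] = mode_imputer.fit_transform(train_x[cols_with_na])
test_x[cols_with_na] = mode_imputer.transform(test_x[cols_with_na])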

# MANUAL ENCODING
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['vehicleType', 'carname', 'fuelType']]))
train_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
train_x.drop(['vehicleType', 'carname', 'fuelType'], axis=1, inplace=True)
train_x = train_x.reset_index(drop=True)
train_x_encoded = train_x_encoded.reset_index(drop=True)
train_x1 = pd.concat([train_x, train_x_encoded], axis=1)

test_x_encoded = pd.DataFrame(ohe.transform(test_x[['vehicleType', 'carname', 'fuelType']]))
test_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
test_x.drop(['vehicleType', 'carname', 'fuelType'], axis=1, inplace=True)
test_x = test_x.reset_index(drop=True)
test_x_encoded = test_x_encoded.reset_index(drop=True)
test_x1 = pd.concat([test_x, test_x_encoded], axis=1)

# FEATURE SELECTION SECTION
mi = mutual_info_regression(train_x1, train_y)
mi = pd.Series(mi, index=train_x1.columns)
mi.sort_values(ascending=False)

# RFE USING CV
rfecv = RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
              cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
rfecv.fit(train_x1, train_y)
print('optimal no of features:', rfecv.n_features_)
train_x1.columns[rfecv.get_support()]

# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x1, train_y, cv=5,
                                        n_jobs=-1, scoring='neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()

But now there's no point in declaring a pipeline, as I am doing all the work manually, which completely defeats the purpose of a Pipeline! It is mandatory that I use a Pipeline while satisfying all the conditions defined above.
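For reference, the structure I think I am supposed to end up with is a single pipeline where the selectors sit between the preprocessing and the model, so that each cross-validation fold re-fits the imputers/encoders on its own training portion. A rough sketch of what I mean (wrapping mutual_info_regression in SelectKBest is my own guess at how to make mutual information a pipeline step, and k=20 is an arbitrary placeholder):

from sklearn.feature_selection import SelectKBest

full_pipe = Pipeline(steps=[
    ('preproc', preproc),  # impute + one-hot encode
    ('mi_select', SelectKBest(score_func=mutual_info_regression, k=20)),  # mutual info filter
    ('rfecv', RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
                    cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)),
    ('model', LinearRegression(n_jobs=-1))])

# imputation, encoding and selection now happen inside each CV fold, after the split
score = -1 * cross_val_score(full_pipe, train_x, train_y, cv=5,
                             n_jobs=-1, scoring='neg_mean_absolute_error').mean()

But I don't know whether this is the intended design, or whether nesting RFECV inside a pipeline like this is acceptable here.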

Any help would be appreciated, as I have spent the last 3 weeks, 4 days and a good part of my non-existent social life trying to find a solution!
