I am building a regression model on a dataset which has categorical and numerical variables along with NaN values. I want to use Pipelines for imputation and encoding. I have a few conditions that must be satisfied in building the model, which are as follows:

1.) Use of Pipelines is a must for imputation and encoding (one-hot encoding).

2.) Imputation must be done AFTER the train/test split.

3.) For feature selection (also to be done AFTER the train/test split), use of mutual_info_regression and RFECV is a must.

This is what I tried so far:-

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_selection import RFECV, mutual_info_regression
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# X AND Y FEATURES
y = np.log(data2['price'])
data2.drop(['price'], axis=1, inplace=True)

# CATEGORICAL VARIABLES AND NUMERICAL VARIABLES
num_cols = [cname for cname in data2.columns if data2[cname].dtype in ['int64', 'float64']]
cat_cols = [cname for cname in data2.columns if data2[cname].dtype == 'object']

# IMPUTATION/ENCODING TO BE DONE
num_trans = SimpleImputer(strategy='mean')
cat_trans = Pipeline(steps=[('impute', SimpleImputer(strategy='most_frequent')),
                            ('onehotencode', OneHotEncoder(handle_unknown='ignore', sparse=False))])

# PREPROCESSING USING COLUMN TRANSFORMER
preproc = ColumnTransformer(transformers=[('cat', cat_trans, cat_cols),
                                          ('num', num_trans, num_cols)])

# MODEL INSTANCE
lire_model = LinearRegression(n_jobs=-1)

# FINAL PIPELINE WHICH IMPUTES, ENCODES AND THEN FITS THE MODEL WHEN CALLED
lire_pipe = Pipeline(steps=[('preproc', preproc), ('model', lire_model)])

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size=0.2,
                                                    random_state=69)

# FEATURE SELECTION SECTION

# MUTUAL INFO FOR ALL VARIABLES
mi = mutual_info_regression(train_x, train_y)
mi = pd.Series(mi, index=train_x.columns)
mi.sort_values(ascending=False)

# RFE USING CV
rfecv = RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
              cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
rfecv.fit(train_x, train_y)
print('optimal no of features:', rfecv.n_features_)
train_x.columns[rfecv.get_support()]

# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x, train_y, cv=5,
                                        n_jobs=-1, scoring='neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()

Now the problem I am facing is that up until the train_test_split part the pipeline has never actually been called, so none of the NaN values have been imputed and no encoding has been done. Running anything after train_test_split (i.e. the feature selection part) therefore gives me an error, because mutual_info_regression and RFECV are being handed data that still contains NaN values and raw categorical variables.

The pipeline is not called until the baseline-model cross-validation; only at that point do the imputation and encoding actually happen. Not before that!
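From what I understand, the imputation/encoding only materialises when fit/fit_transform is actually invoked, so one idea I considered (I am not sure it counts as "using the Pipeline" for my requirements) was to fit just the preproc step on the training data right after the split and hand the transformed arrays to the feature selection code, roughly like this (column names are lost in the process unless your sklearn version supports get_feature_names_out on the ColumnTransformer):

# fit the preprocessor on the training set only, then transform both sets;
# this runs the imputation/encoding explicitly, after the split
train_x_proc = preproc.fit_transform(train_x)
test_x_proc = preproc.transform(test_x)

# mutual information now sees fully numeric, NaN-free data
mi = pd.Series(mutual_info_regression(train_x_proc, train_y))
mi.sort_values(ascending=False)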

I tried something like the below as a workaround (everything is the same up until train_test_split):

train_x, test_x, train_y, test_y = train_test_split(data2, y, test_size=0.2,
                                                    random_state=69)

# MANUALLY IMPUTING NAN VALUES
train_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace=True)
train_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace=True)
train_x['model'].fillna(train_x['model'].value_counts().index[0], inplace=True)
train_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace=True)
train_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0], inplace=True)

test_x['vehicleType'].fillna(train_x['vehicleType'].value_counts().index[0], inplace=True)
test_x['gearbox'].fillna(train_x['gearbox'].value_counts().index[0], inplace=True)
test_x['model'].fillna(train_x['model'].value_counts().index[0], inplace=True)
test_x['fuelType'].fillna(train_x['fuelType'].value_counts().index[0], inplace=True)
test_x['notRepairedDamage'].fillna(train_x['notRepairedDamage'].value_counts().index[0], inplace=True)
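(For what it's worth, I believe these ten fillna calls are just a hand-rolled version of SimpleImputer(strategy='most_frequent') fitted on the training set and applied to both sets; something like the sketch below, where mode_imputer is just an illustrative name:)

# the same five columns that have NaNs in my data
cols_with_na = ['vehicleType', 'gearbox', 'model', 'fuelType', 'notRepairedDamage']
mode_imputer = SimpleImputer(strategy='most_frequent')
# learn the modes on train only, then reuse them on test
train_x[cols_with_na] = mode_imputer.fit_transform(train_x[cols_with_na])
test_x[cols_with_na] = mode_imputer.transform(test_x[cols_with_na])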

# MANUAL ENCODING
ohe = OneHotEncoder(handle_unknown='ignore', sparse=False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['vehicleType', 'carname', 'fuelType']]))
train_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
train_x.drop(['vehicleType', 'carname', 'fuelType'], axis=1, inplace=True)
train_x = train_x.reset_index(drop=True)
train_x_encoded = train_x_encoded.reset_index(drop=True)
train_x1 = pd.concat([train_x, train_x_encoded], axis=1)

test_x_encoded = pd.DataFrame(ohe.transform(test_x[['vehicleType', 'carname', 'fuelType']]))
test_x_encoded.columns = ohe.get_feature_names(['vehicleType', 'carname', 'fuelType'])
test_x.drop(['vehicleType', 'carname', 'fuelType'], axis=1, inplace=True)
test_x = test_x.reset_index(drop=True)
test_x_encoded = test_x_encoded.reset_index(drop=True)
test_x1 = pd.concat([test_x, test_x_encoded], axis=1)

# FEATURE SELECTION SECTION
mi = mutual_info_regression(train_x1, train_y)
mi = pd.Series(mi, index=train_x1.columns)
mi.sort_values(ascending=False)

# RFE USING CV
rfecv = RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
              cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)
rfecv.fit(train_x1, train_y)
print('optimal no of features:', rfecv.n_features_)
train_x1.columns[rfecv.get_support()]

# BASELINE MODEL
cross_lire_score = -1 * cross_val_score(lire_pipe, train_x1, train_y, cv=5,
                                        n_jobs=-1, scoring='neg_mean_absolute_error')
base_lire_score = cross_lire_score.mean()

But now there's no point in declaring a pipeline, as I am doing all the work manually, which completely defeats the purpose of a Pipeline! It is mandatory that I use a Pipeline while satisfying all the conditions defined above.
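For reference, the structure I think I am supposed to end up with is a single pipeline where the selectors sit between the preprocessing and the model, so that each cross-validation fold re-fits the imputers/encoders on its own training portion. A rough sketch of what I mean (wrapping mutual_info_regression in SelectKBest is my own guess at how to make mutual information a pipeline step, and k=20 is an arbitrary placeholder):

from sklearn.feature_selection import SelectKBest

full_pipe = Pipeline(steps=[
    ('preproc', preproc),  # impute + one-hot encode
    ('mi_select', SelectKBest(score_func=mutual_info_regression, k=20)),  # mutual info filter
    ('rfecv', RFECV(estimator=LinearRegression(n_jobs=-1), step=1,
                    cv=5, scoring='neg_mean_absolute_error', n_jobs=-1)),
    ('model', LinearRegression(n_jobs=-1))])

# imputation, encoding and selection now happen inside each CV fold, after the split
score = -1 * cross_val_score(full_pipe, train_x, train_y, cv=5,
                             n_jobs=-1, scoring='neg_mean_absolute_error').mean()

But I don't know whether this is the intended design, or whether nesting RFECV inside a pipeline like this is acceptable here.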

Any help would be appreciated, as I have spent the last 3 weeks, 4 days and a good part of my non-existent social life trying to find a solution!
