I am trying to one-hot encode my train and test datasets. For the train dataset, I have two dataframes with different numbers of columns but the same number of rows: A (the encoded features) has shape (34164, 293) and B (numerical features only) has shape (34164, 7). I need a final dataframe C, containing both the encoded and the numerical features, with shape (34164, 300).

When I use pd.concat with axis = 1, I get a dataframe with shape (44845, 300) that also contains some NaN values. I don't understand why the row count increases when both initial dataframes have the same number of rows. And where do those NaN values come from? Below is my code.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)

train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['model', 'vehicleType', 'brand']]))

train_x_encoded.columns = ohe.get_feature_names(['model', 'vehicleType', 'brand'])

train_x.drop(['model', 'vehicleType', 'brand'], axis = 1, inplace = True)

train_x_final = pd.concat([train_x_encoded, train_x], axis = 1)
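One possible explanation, assuming train_x came out of a train/test split or a row filter and so no longer has a default 0..n-1 index: pd.concat with axis = 1 aligns rows by index label, not by position. The encoded dataframe is built from a NumPy array and gets a fresh 0..n-1 index, so any label present in only one of the two dataframes produces an extra row filled with NaN. The sketch below reproduces this with small made-up frames (the names numeric and encoded and their values are hypothetical stand-ins, not the original data):

```python
import pandas as pd

# Hypothetical stand-in for train_x: 3 rows whose index is NOT 0..2,
# as can happen after a train/test split or row filtering.
numeric = pd.DataFrame({"price": [10, 20, 30]}, index=[0, 5, 7])

# Hypothetical stand-in for the encoded frame: built without an explicit
# index, so pandas assigns a fresh 0..2 index.
encoded = pd.DataFrame({"brand_a": [1.0, 0.0, 1.0]})

# concat aligns on index labels: the union {0, 1, 2, 5, 7} has 5 rows,
# and labels present in only one frame are filled with NaN.
misaligned = pd.concat([encoded, numeric], axis=1)
print(misaligned.shape)

# Giving both frames the same index makes the rows line up one to one.
encoded.index = numeric.index
aligned = pd.concat([encoded, numeric], axis=1)
print(aligned.shape)
```

If that assumption holds for the code above, passing index = train_x.index when building train_x_encoded, or calling reset_index(drop = True) on train_x before the concat, should give the expected (34164, 300) shape.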
