Why is pd.concat increasing my row count and also returning nan values?

17 July 2021

I am trying to one-hot encode my train and test datasets. For the train dataset, I have two dataframes with different numbers of columns but the same number of rows: A (the encoded features) has shape (34164, 293) and B (the numerical features only) has shape (34164, 7). I need a final dataframe C, containing both the encoded and the numerical features, with shape (34164, 300).

When I use pd.concat with axis = 1, I get a dataframe with shape (44845, 300) that also contains some NaN values. Why does the row count increase when both input dataframes have the same number of rows? And where do the NaN values come from? Below is my code.

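import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the three categorical columns into a dense array
ohe = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
train_x_encoded = pd.DataFrame(ohe.fit_transform(train_x[['model', 'vehicleType', 'brand']]))
train_x_encoded.columns = ohe.get_feature_names(['model', 'vehicleType', 'brand'])

# Drop the original categorical columns, leaving only the numerical features
train_x.drop(['model', 'vehicleType', 'brand'], axis = 1, inplace = True)

# Join the encoded and numerical features column-wise
train_x_final = pd.concat([train_x_encoded, train_x], axis = 1)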
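One likely explanation, assuming train_x kept a non-default index (for example after a train/test split or after filtering rows): pd.concat with axis = 1 aligns rows by index label, not by position, and the dataframe built from the encoder output gets a fresh 0..n-1 index. Labels present in only one frame then produce extra rows filled with NaN. The sketch below reproduces this with small made-up frames (the names numeric and encoded and their values are hypothetical stand-ins, not the original data) and shows two possible fixes.

```python
import pandas as pd

# Hypothetical stand-in for train_x: it kept its original,
# non-contiguous index labels (0, 2, 5).
numeric = pd.DataFrame({"price": [10, 20, 30]}, index=[0, 2, 5])

# Hypothetical stand-in for train_x_encoded: built without an explicit
# index, so pandas assigns the default labels 0, 1, 2.
encoded = pd.DataFrame({"brand_a": [1.0, 0.0, 1.0]})

# axis=1 aligns on index labels (outer join), so the result has one row
# per label in the union {0, 1, 2, 5} and NaN where a label is missing.
misaligned = pd.concat([encoded, numeric], axis=1)
print(misaligned.shape)               # (4, 2)
print(misaligned.isna().sum().sum())  # 2

# Fix 1: give the encoded frame the same index as the original frame.
encoded_same_index = encoded.copy()
encoded_same_index.index = numeric.index
fixed = pd.concat([encoded_same_index, numeric], axis=1)

# Fix 2: reset both indexes so rows are matched purely by position.
fixed_reset = pd.concat(
    [encoded.reset_index(drop=True), numeric.reset_index(drop=True)],
    axis=1,
)
print(fixed.shape, fixed_reset.shape)  # (3, 2) (3, 2)
```

If this is the cause, comparing train_x.index with train_x_encoded.index before the concat should show the mismatch; passing index = train_x.index when constructing train_x_encoded would be the equivalent of Fix 1 in the original code.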
