Data structure and missing values in random forest model?

Jan Bińkowski

Dear L. Perumal,

1) RF is appropriate in this case. As the error says, you probably have strings in one of the columns of your DataFrame. Strings are not allowed as predictor values for the model, so try the pandas built-in method df.info() to find the columns that contain them.

2) In the case of NaN you can:

- fill the gaps with zero, which will undoubtedly add noise to the model!

- replace NaN with the median/mean, which may be the most accurate option

- delete records with NaN using the df.dropna() method (the data loss can have an adverse impact on model training)

- train a new model to fill in the NaN values, which is quite an interesting idea

There is no single best method; it depends on data quality and on how many NaNs are in your data set. In my opinion, you need to try different approaches to find the best one. I cannot help more without seeing the code. Do you have a repo on GitHub or GitLab?
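Points 1 and 2 above can be sketched roughly as follows. This is only an illustration on a made-up DataFrame: the column names and values are hypothetical, and it covers the first three NaN options (zero fill, median fill, dropping rows), not the model-based one.

```python
import numpy as np
import pandas as pd

# Hypothetical data: "soil" holds strings, the numeric columns have gaps
df = pd.DataFrame({
    "rainfall": [10.5, np.nan, 7.7, 4.1],
    "soil": ["clay", "sand", "clay", "loam"],
    "elevation": [120.0, 340.0, np.nan, 95.0],
})

# Point 1: df.info() prints the dtype and non-null count of every column
df.info()

# List the columns that are not numeric (these are the ones the model rejects)
string_cols = df.select_dtypes(exclude="number").columns.tolist()
print("non-numeric columns:", string_cols)

# Point 2: count the NaNs per numeric column before choosing a strategy
numeric = df.select_dtypes(include="number")
print(numeric.isna().sum())

# Three of the options discussed above, applied to the numeric columns
zero_filled = numeric.fillna(0)                   # treats gaps as real zeros
median_filled = numeric.fillna(numeric.median())  # per-column median
dropped = numeric.dropna()                        # removes incomplete rows
```

Comparing model scores on `zero_filled`, `median_filled`, and `dropped` is one way to "try different approaches" on your own data.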

Best regards!

David Emde

Hello L. Perumal,

1. Random Forest should do the trick. Generally, Random Forest models can accommodate both categorical and numerical variables, but strings and characters will need to be standardized and converted to factors before the model will run.
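A minimal sketch of that preparation step in Python, assuming pandas and scikit-learn are available. The data, the column names, and the choice of one-hot encoding via pd.get_dummies are illustrative assumptions, not the only way to "factor" a string column.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: "soil" is a string column with inconsistent spelling
df = pd.DataFrame({
    "soil": [" Clay", "sand", "clay ", "Sand", "loam", "LOAM"],
    "rainfall": [10.5, 3.2, 7.7, 4.1, 6.0, 5.5],
    "yield_t": [2.1, 0.9, 1.8, 1.0, 1.5, 1.4],
})

# Standardize: strip whitespace and unify case so " Clay" and "clay " match
df["soil"] = df["soil"].str.strip().str.lower()

# Factor: one-hot encode the string column into indicator columns
X = pd.get_dummies(df[["soil", "rainfall"]], columns=["soil"])
y = df["yield_t"]

# The forest now receives only numeric/boolean predictors
model = RandomForestRegressor(n_estimators=50, random_state=0)
model.fit(X, y)
print(X.columns.tolist())
```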

2. I definitely agree with Jan Bińkowski regarding missing values. It's best not to leave them as 0: those values will not be treated as missing but as a real value of 0 for that variable. Some algorithms will allow you to leave the cell blank and will run ignoring that particular cell, though there are better ways to approach this.

The mean/median value is a pretty safe approach for a data set of that size (assuming there aren't too many missing values), although there are also some pretty interesting random forest imputation methods that will fill in missing cells by using a minimal RF model based on the other available variables.
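One way to try random-forest-based imputation in Python is scikit-learn's IterativeImputer with a RandomForestRegressor as its estimator. This is a sketch on synthetic data and is not necessarily the specific method meant above; note that IterativeImputer is still marked experimental in scikit-learn, hence the extra enabling import.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: the third column depends on the first two
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = X[:, 0] + X[:, 1]

# Knock out 10 values in the third column to simulate missing cells
X_missing = X.copy()
X_missing[rng.choice(100, size=10, replace=False), 2] = np.nan

# Each column with gaps is predicted from the other columns by a small forest
imputer = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=20, random_state=0),
    max_iter=5,
    random_state=0,
)
X_filled = imputer.fit_transform(X_missing)
print(np.isnan(X_filled).sum())
```

Observed cells are left unchanged; only the missing ones are replaced by predictions.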

If you really want to dive in, Boosted Regression Trees are also worth a look. Functionally they are similar to RF, but they allow for much more parameter tuning and are able to deal with missing values.

Good luck on your ML journey!

Lavinia Perumal

Jan Bińkowski Thank you for the info! I don't have a repo, but when I do I can let you know, if you are still interested.

David Emde I think Boosted Regression Trees might be a good option. Thank you for assisting :)