How should we handle the missing values in test data?



Mohamed Elhoseiny

I think this thesis might help: http://www.cs.toronto.edu/~marlin/research/phd_thesis/marlin-phd-thesis.pdf

David F. Nettleton

Firstly, "missing values" and "outliers" are two very different issues and should be treated separately. In the case of outliers, I would say that it depends on your data modeling objective. Maybe the outliers are precisely the values of interest (for example, in fraud detection). In the case of missing values, if you have several input variables, it may be that only one of them suffers from missing values. Then it depends on whether that variable is highly relevant to the data model. If not, maybe you could remove the variable altogether. Another criterion is what percentage of missing values is acceptable. Substituting the missing value with the average of the non-missing values (or the mode, if the variable is categorical) is reasonable, up to a certain point. Overall, I would say that handling missing values is a data preprocessing step and not a modelling problem.
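A minimal Python sketch of the options described above: drop a variable when too many of its values are missing, otherwise fill with the mean (numeric) or the mode (categorical). The function names, the toy data, and the 50% threshold are assumptions made for illustration only; the post does not specify an acceptable percentage.

```python
from statistics import mean, mode


def missing_fraction(values):
    """Share of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)


def simple_impute(columns, max_missing=0.5):
    """Drop columns with too many missing values; fill the rest.

    max_missing is a hypothetical threshold, not a recommended value.
    """
    result = {}
    for name, values in columns.items():
        observed = [v for v in values if v is not None]
        if not observed or missing_fraction(values) > max_missing:
            continue  # too little information: remove the variable
        if all(isinstance(v, (int, float)) for v in observed):
            fill = mean(observed)   # numeric: average of non-missing values
        else:
            fill = mode(observed)   # categorical: most frequent value
        result[name] = [fill if v is None else v for v in values]
    return result


data = {
    "age": [30, None, 50, 40],
    "city": ["Paris", "Paris", None, "Lyon"],
    "income": [None, None, None, 1000],
}
cleaned = simple_impute(data)
print(cleaned)
```

Here "income" is dropped because three of its four values are missing, while "age" and "city" are filled in.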

Patrick S Malone

I agree with the overall distinction David draws, but I disagree about mean substitution. It distorts the variable's distribution, which undermines normality assumptions, and it weakens the variable's association with other variables.
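The effect of mean substitution can be checked on simulated data. The sketch below assumes NumPy is available and uses made-up numbers (a true correlation of about 0.8 and 40% of values missing completely at random); it only illustrates the direction of the effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Two correlated variables (correlation close to 0.8 by construction).
x = rng.normal(size=n)
y = 0.8 * x + 0.6 * rng.normal(size=n)

# Remove about 40% of x completely at random, then fill with the observed mean.
missing = rng.random(n) < 0.4
x_imputed = np.where(missing, x[~missing].mean(), x)

var_full = x.var()
var_imputed = x_imputed.var()
corr_full = np.corrcoef(x, y)[0, 1]
corr_imputed = np.corrcoef(x_imputed, y)[0, 1]

print(f"variance:    full={var_full:.3f}  mean-imputed={var_imputed:.3f}")
print(f"correlation: full={corr_full:.3f}  mean-imputed={corr_imputed:.3f}")
```

The imputed variable has a spike at the mean, so both its variance and its correlation with y come out lower than in the complete data.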

Fabrice Clerot

When you learn a model on a training set and deploy it on a test set, you implicitly assume that both sets are drawn from the same distribution P(X,Y), where X denotes the explanatory variables and Y the target variable.

If you learn a model on a pre-processed training set (with imputed missing values, for instance), you should deploy it on a test set pre-processed in the same way.

In particular, this means you should not use the target when imputing on the training set: since you do not know the targets on the test set, you cannot apply an imputation there that relies on them.
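As a rough sketch of this point, the imputation values can be learned from the training inputs only and then reused unchanged on the test inputs. The helper names below are hypothetical, NumPy is assumed, and missing entries are assumed to be encoded as NaN; the target Y is deliberately not involved.

```python
import numpy as np


def fit_mean_imputer(X_train):
    """Learn one fill value per column from the training inputs only."""
    return np.nanmean(X_train, axis=0)


def apply_imputer(X, fill_values):
    """Replace NaNs with the fill values learned on the training set."""
    return np.where(np.isnan(X), fill_values, X)


X_train = np.array([[1.0, 10.0],
                    [np.nan, 20.0],
                    [3.0, np.nan]])
X_test = np.array([[np.nan, 40.0],
                   [5.0, np.nan]])

fill_values = fit_mean_imputer(X_train)      # column means of the training set
X_train_imp = apply_imputer(X_train, fill_values)
X_test_imp = apply_imputer(X_test, fill_values)  # same preprocessing on test
print(X_test_imp)
```

The test set is filled with the training means rather than its own, so both sets go through exactly the same transformation.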

David F. Nettleton

For further reading, see Chapter 5 (Data Quality) of my latest book, which considers two views of data quality (as input to a data model): relevance and reliability.

"Commercial Data Mining: Processing, Analysis and Modeling for Predictive Analytics Projects", Morgan Kaufmann, 2014, 1st edition.

http://store.elsevier.com/Commercial-Data-Mining/David-Nettleton/isbn-9780124166028/

Mostafizur Rahman

You can use a machine learning method to predict the missing data!
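One simple form of this idea is regression imputation: fit a model on the rows where the variable is observed and predict it where it is missing. The sketch below uses ordinary least squares with NumPy as a stand-in for whatever learner is preferred; it is not the method of the linked article. The function name and the synthetic data are assumptions, and the other input variables are assumed to be fully observed.

```python
import numpy as np


def regression_impute(X, target_col):
    """Fill NaNs in one column using a linear model on the other columns.

    Assumes the other (predictor) columns contain no missing values.
    """
    X = X.astype(float).copy()
    missing = np.isnan(X[:, target_col])
    predictors = np.delete(X, target_col, axis=1)
    A = np.column_stack([np.ones(len(X)), predictors])  # add an intercept
    # Fit on complete rows only, then predict the missing entries.
    coef, *_ = np.linalg.lstsq(A[~missing], X[~missing, target_col], rcond=None)
    X[missing, target_col] = A[missing] @ coef
    return X


# Synthetic example: column b depends linearly on column a, plus small noise.
rng = np.random.default_rng(1)
a = rng.normal(size=200)
b = 2.0 * a + 1.0 + 0.1 * rng.normal(size=200)
X_true = np.column_stack([a, b])

X_obs = X_true.copy()
X_obs[:50, 1] = np.nan                      # hide the first 50 values of b
X_filled = regression_impute(X_obs, target_col=1)

max_error = np.abs(X_filled[:50, 1] - X_true[:50, 1]).max()
print(f"largest imputation error: {max_error:.3f}")
```

As with any imputation, the model should be fitted on the training inputs only and then applied as-is to the test inputs.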

Article Machine Learning Based Missing Value Imputation Method for C...