Must an unsupervised pre-processing method be used only in tandem with unsupervised (i.e., not supervised) clustering/classification learning algorithms?

Alex Zwanenburg Popular answer

Pekka already answered your question to me.

Naturally, both your development data and validation data require pre-processing. What I mean is that you should under no circumstances pool both data sets and pre-process them as a whole for classification. Take, for instance, normalisation as a pre-processing step. Development data is used to determine scale and shift parameters. Both the development and validation data are then normalised using these parameters. If we were to use the entire combined data set for normalisation, we would bias our classification models. The other alternative (a separate normalisation based on the validation set) is not as problematic, but may decrease classification performance if the validation set is small. This applies not only to normalisation, but to other pre-processing steps (imputation, PCA, clustering, etc.) as well.
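To make the normalisation example concrete, here is a minimal sketch in plain Python. The data values and the helper names (`fit_standardiser`, `apply_standardiser`) are made up for illustration, and standardisation by mean and sample standard deviation is assumed as the normalisation method.

```python
from statistics import mean, stdev

def fit_standardiser(values):
    """Estimate the shift (mean) and scale (sample standard deviation)."""
    return mean(values), stdev(values)

def apply_standardiser(values, shift, scale):
    """Normalise values with parameters that were estimated elsewhere."""
    return [(v - shift) / scale for v in values]

# Hypothetical one-feature data sets.
development = [4.0, 6.0, 8.0, 10.0, 12.0]
validation = [5.0, 14.0]

# Parameters come from the development data only...
shift, scale = fit_standardiser(development)

# ...and are then applied unchanged to both data sets.
dev_scaled = apply_standardiser(development, shift, scale)
val_scaled = apply_standardiser(validation, shift, scale)

# For contrast only: pooling both sets lets the validation data
# influence the parameters, which is the bias described above.
leaky_shift, leaky_scale = fit_standardiser(development + validation)
```

The pooled parameters differ from the development-only ones, which is exactly how information about the validation data would leak into the model.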

Thus, you may perform one analysis on the combined data set, but the information generated should generally not be used for building models. In your case, you may perform a segmentation study on the whole data set, and also perform classification if, and only if, segmentation is not population-based (i.e. no information about other data is used when segmenting one data set), or if the segmentation results are not used for classification.

Sanket Suthar

You can use an unsupervised machine learning method for quality assessment of tandem mass spectra without any training data set.

Pekka Jounela

Hi, in unsupervised discretization you do not use class labels, and hence 1) with k-means clustering (no labels) you either use unsupervised discretization or you do not discretize the attributes at all. Regarding the trickier question 2), I would not apply supervised discretization afterwards, because it changes the k-means-based attribute patterns.
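As an illustration of what "unsupervised" means here, the sketch below uses equal-width binning, one common unsupervised discretization scheme. It is an assumed example, not necessarily the scheme used in the study; the attribute values and function names are hypothetical. Note that no class labels are passed in anywhere.

```python
from bisect import bisect_right

def fit_equal_width_bins(values, n_bins):
    """Return the interior cut points of n_bins equal-width intervals.

    Only the attribute values are used; class labels play no role.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]

def discretise(values, cut_points):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(cut_points, v) for v in values]

# Hypothetical continuous attribute (e.g. patient age).
ages = [21, 25, 30, 38, 44, 52, 60, 61]

cuts = fit_equal_width_bins(ages, n_bins=4)
bins = discretise(ages, cuts)
```

A supervised scheme would instead choose the cut points by looking at the class labels, which is why mixing it into a k-means analysis changes the attribute patterns the clustering sees.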

Alex Zwanenburg

In general, I would be very careful when pre-processing data with the aim of generating a model. When modelling, it is important to keep development data, which is used to train a model (whether supervised or not), separate from validation data, which is used to evaluate model performance. Otherwise, biases may be incurred as information concerning the validation data is incorporated into the model.

Therefore, one should perform pre-processing on the development data only, and subsequently transfer and apply the pre-processing results (scale and shift parameters for standardisation, clusters formed, imputation models, etc.) to the validation data set.
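A rough sketch of what "transfer and apply" can look like for two of the items listed above, an imputation model and clusters. Everything here is assumed for illustration: the data, the choice of median imputation, and a bare-bones one-dimensional k-means with fixed starting centroids.

```python
from statistics import mean, median

def fit_median_imputer(values):
    """Learn the fill value from the observed (non-missing) entries."""
    return median(v for v in values if v is not None)

def impute(values, fill):
    """Replace missing entries (None) with a previously learned fill value."""
    return [fill if v is None else v for v in values]

def nearest(value, centroids):
    """Index of the centroid closest to value."""
    return min(range(len(centroids)), key=lambda i: abs(value - centroids[i]))

def fit_kmeans_1d(values, centroids, n_iter=10):
    """Plain Lloyd iterations in one dimension from the given start centroids."""
    centroids = list(centroids)
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for v in values:
            groups[nearest(v, centroids)].append(v)
        # Keep the old centroid if a cluster happens to be empty.
        centroids = [mean(g) if g else c for g, c in zip(groups, centroids)]
    return centroids

# Hypothetical one-feature data sets with missing values.
development = [1.0, 2.0, None, 3.0, 2.0, 10.0, 11.0]
validation = [None, 9.0, 4.0]

# Step 1: learn everything on the development data only.
fill = fit_median_imputer(development)
dev_filled = impute(development, fill)
centroids = fit_kmeans_1d(dev_filled, [min(dev_filled), max(dev_filled)])

# Step 2: transfer the learned results to the validation data.
val_filled = impute(validation, fill)
val_clusters = [nearest(v, centroids) for v in val_filled]
```

The validation set never contributes to `fill` or `centroids`; it is only passed through the objects learned on the development set.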

Other than that, I don't think it is necessary to specifically use either supervised or unsupervised methods for pre-processing. Using supervised methods may, however, increase the risk of overfitting, as the pre-processing will be somewhat adapted to the labels of the development set. With unsupervised methods this risk should be somewhat lower, though naturally you will still learn some of the patterns specific to the development data.

Kevin Yong

@Sanket Suthar, thank you for your reply regarding the application of unsupervised machine learning to examining the quality of the training data set.

@Pekka Jounela, thank you for your feedback. I totally agree with you regarding the need to remove the dependent variable before implementing the k-means clustering analysis. In this case, I am performing two different analyses; I am not building a cluster-based classification model. The first part of the study is to perform a segmentation study on the biomedical data set, while the second part is to build a classification or predictive model based on the same biomedical data set. Sorry for not making this clear in the question. In this case, can I apply different pre-processing methods, that is, unsupervised and supervised discretization, to the same data set for the two different purposes?

@Alex Zwanenburg, thank you for your advice. I do not understand your phrase "... transfer and apply the pre-processing results ...to the validation data set". Do you mean that we do not need to apply the same pre-processing method to the validation data set that will be used as an input to the developed classification model (during the validation phase), and not to the cluster model? It sounds like I am asking two different questions in the above sentence, but could you please clarify and advise me accordingly?

Kevin, thanks for the clarification of the second task. In supervised classification, just remember to apply the same pre-processing model (discretization, normalization, etc.) separately to the testing/cross-validation examples. What I mean is explained in this blog (see Figure 2): https://rapidminer.com/learn-right-way-validate-models-part-4-accidental-contamination/

...and a good old paper to read:

Dougherty J, Kohavi R, Sahami M. Supervised and unsupervised discretization of continuous features. In: Proceedings of the 12th international conference on machine learning. San Francisco: Morgan Kaufmann; 1995. p. 194–202.
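For the cross-validation advice above, a minimal sketch of how the pre-processing can be refitted inside every fold, so that the held-out examples never influence it. The data, the fold scheme, and the use of min-max normalisation are assumptions made for illustration; the classifier itself is left out.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx); fold f holds out every k-th sample from f."""
    for fold in range(k):
        test_idx = [i for i in range(n_samples) if i % k == fold]
        train_idx = [i for i in range(n_samples) if i % k != fold]
        yield train_idx, test_idx

def fit_min_max(values):
    """Learn the normalisation range from the given values."""
    return min(values), max(values)

def apply_min_max(values, lo, hi):
    """Rescale values with a previously learned range."""
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical single feature.
feature = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

fold_params = []
for train_idx, test_idx in k_fold_indices(len(feature), k=3):
    train = [feature[i] for i in train_idx]
    test = [feature[i] for i in test_idx]

    # Fit the pre-processing on the training part of this fold only.
    lo, hi = fit_min_max(train)
    train_scaled = apply_min_max(train, lo, hi)

    # Apply it unchanged to the held-out part; values may fall outside [0, 1].
    test_scaled = apply_min_max(test, lo, hi)

    fold_params.append((lo, hi))
    # A classifier would be trained on train_scaled and scored on test_scaled here.
```

The learned range changes from fold to fold, which is expected: each fold has its own pre-processing model, just as it has its own classifier.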