It is difficult to answer such a broad question: of course it depends... on your application, on the experimental design behind your data, on your modeling strategy, and so on.
That said, a few things hold in general:
- Make sure the test set is able to answer your question (e.g. do you need predictive performance for unknown samples, or for unknown future samples? etc.)
See e.g. Esbensen, K. H. & Geladi, P. Principles of Proper Validation: use and abuse of re-sampling for validation, J. Chemometrics, 2010, 24, 168-187 (http://dx.doi.org/10.1002/cem.1310)
- As Lionel said: repeat it over and over, with new random splits (if doing cross-validation or out-of-bootstrap), and with different algorithms for subset selection if you go for one of those; a minimal sketch of such repeated cross-validation follows this list.
The idea is that this lets you check whether the results of the different splits agree.
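For illustration only, here is a minimal base-R sketch of repeated (iterated) k-fold cross-validation; the data frame dat, the response y and the lm() placeholder model are my own assumptions, not anything from the packages discussed here:

## Repeated (iterated) k-fold cross-validation, minimal sketch.
## Assumes a data frame `dat` with response `y`; swap lm() for your own model.
repeated_cv <- function(dat, k = 10, repeats = 20) {
  rmse <- matrix(NA_real_, nrow = repeats, ncol = k)
  for (r in seq_len(repeats)) {
    folds <- sample(rep(seq_len(k), length.out = nrow(dat)))  # new random split each repeat
    for (i in seq_len(k)) {
      train <- dat[folds != i, ]
      test  <- dat[folds == i, ]
      fit   <- lm(y ~ ., data = train)            # placeholder model
      pred  <- predict(fit, newdata = test)
      rmse[r, i] <- sqrt(mean((test$y - pred)^2))
    }
  }
  rmse  # the spread across repeats shows how much the result depends on the split
}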
Also, a whole lot of information about model validation can be found at CrossValidated (http://stats.stackexchange.com/), including specific discussions for different modeling strategies.
Regarding the R function soil.spec::ken.sto vs. the Matlab kenstone.m: after a very quick look at both, I'd personally go for the Matlab version. ken.sto enforces a PCA first, and there are several known problems (but maybe also a solution): http://r.789695.n4.nabble.com/problems-with-method-ken-sto-in-package-soil-spec-subscript-out-of-bounds-td4288193.html
(I've never used either of them myself and instead go for iterated (repeated) k-fold cross-validation or out-of-bootstrap.)
My collaborators tend to use cross-validation ==> split the dataset into 10 parts. Take 9 parts for training and the last part for testing. Repeat 10 times, each time holding out a different part for testing. The advantage of this method is that you end up with 10 precision/recall values, which can be used to calculate a standard error.
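As a rough illustration of that procedure, here is a sketch of 10-fold cross-validation collecting precision and recall per fold; the data frame dat, the binary factor y, the glm() classifier and the 0.5 cutoff are all placeholder assumptions:

## 10-fold cross-validation with per-fold precision/recall (sketch).
k <- 10
folds <- sample(rep(seq_len(k), length.out = nrow(dat)))
precision <- recall <- numeric(k)
for (i in seq_len(k)) {
  train <- dat[folds != i, ]
  test  <- dat[folds == i, ]
  fit   <- glm(y ~ ., data = train, family = binomial)   # placeholder classifier
  prob  <- predict(fit, newdata = test, type = "response")
  pred  <- prob > 0.5
  truth <- test$y == levels(test$y)[2]                    # treat 2nd factor level as "positive"
  precision[i] <- sum(pred & truth) / sum(pred)
  recall[i]    <- sum(pred & truth) / sum(truth)
}
## 10 values each -> mean and standard error of the mean
c(precision = mean(precision), se = sd(precision) / sqrt(k))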
OK... I think I will not use separate training and testing sets, so no Kennard-Stone. I will probably go for ten-fold cross-validation or bootstrap cross-validation. Thanks to all!
Do it by random sampling with different subset sizes; whichever split gives the minimum prediction error for the objects under study, flag that size as the optimum.
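A very rough sketch of that idea, again assuming a data frame dat with response y and a placeholder linear model (and keeping in mind that each size should really be repeated many times before trusting the comparison):

## Compare hold-out error for several random training-set sizes (sketch).
sizes <- c(0.5, 0.6, 0.7, 0.8)                  # candidate training fractions
err <- sapply(sizes, function(p) {
  idx  <- sample(nrow(dat), size = round(p * nrow(dat)))
  fit  <- lm(y ~ ., data = dat[idx, ])          # placeholder model
  pred <- predict(fit, newdata = dat[-idx, ])
  sqrt(mean((dat$y[-idx] - pred)^2))            # hold-out RMSE
})
sizes[which.min(err)]                            # training fraction with lowest error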