This question was confusing to me, in that I was not aware that "test set" and "training set" might take on a different meaning, depending on the classification technology. If this is the case, I'd certainly like to generate some discussion.
As I think of it:
"test set": A set of classification patterns including inputs and ideal outputs used *only* to measure the performance of a classification system; individual input patterns are never used to change or update the behavior of the classifier.
"training set": A set of classification patterns including all inputs and perhaps the associated ideal outputs, used individually to update the behavior of a classification system, and used collectively to assess the performance of the system.
There are several complications and quibbles buried here: the differences between observations and measurements, static classifiers versus those with memory (e.g. Markov or adaptive), supervised versus unsupervised, time domain versus spatial, and so on. Even so, comparing training set performance with test set performance remains one means of verifying that the classifier is not overly specific to the training set.
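As a rough illustration of that comparison, here is a minimal sketch in Python. Everything in it is an assumption made for the example and not taken from any particular toolbox: the toy one-dimensional data with 20% label noise, the helper names, and the choice of a 1-nearest-neighbour classifier, which simply memorizes its training patterns.

```python
import random

def nearest_neighbour_predict(train, x):
    """Return the label of the training pattern whose input is closest to x (1-NN)."""
    closest = min(train, key=lambda pattern: abs(pattern[0] - x))
    return closest[1]

def accuracy(train, patterns):
    """Fraction of patterns whose predicted label equals the ideal output."""
    hits = sum(nearest_neighbour_predict(train, x) == y for x, y in patterns)
    return hits / len(patterns)

def make_patterns(rng, n):
    """Hypothetical toy data: label is 1 when x > 0, with 20% of labels flipped."""
    patterns = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        y = int(x > 0)
        if rng.random() < 0.2:
            y = 1 - y  # label noise that a memorizing classifier will overfit
        patterns.append((x, y))
    return patterns

rng = random.Random(0)
training_set = make_patterns(rng, 200)
test_set = make_patterns(rng, 200)  # never used to update the classifier

train_acc = accuracy(training_set, training_set)
test_acc = accuracy(training_set, test_set)
print(f"training accuracy: {train_acc:.2f}, test accuracy: {test_acc:.2f}")
```

Because each training pattern is its own nearest neighbour, the training accuracy is perfect, while the test accuracy is lower. That gap is the kind of signal the training vs. test comparison is meant to expose.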
The conversation ought to continue on to how to confirm that appropriate sampling has occurred in constructing each set, and that the two sets are sufficiently dissimilar for this to be an appropriate check against overtraining. While one can argue - and regulatory groups like the FDA do argue - that this is a necessary test, I would argue that it is not sufficient. If one is classifying ECG, for instance, from a set of measurements, it is difficult to imagine a test set large enough to cover the combinatorial range produced by heart rate variations, QRS widths, QRST morphologies and arrhythmias. I think other testing is possible and maybe necessary to assure appropriate generalization.
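Two very basic sampling checks could be sketched as follows, again in Python. The function names and the tiny data set are hypothetical, and these checks cover only exact duplicates and class balance. They say nothing about whether the test set spans the combinatorial range described above.

```python
from collections import Counter

def class_proportions(patterns):
    """Relative frequency of each ideal output (class label) in a set."""
    counts = Counter(label for _, label in patterns)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def shared_inputs(training_set, test_set):
    """Test inputs that also occur verbatim in the training set."""
    train_inputs = {inputs for inputs, _ in training_set}
    return [inputs for inputs, _ in test_set if inputs in train_inputs]

# Hypothetical patterns: (feature tuple, class label)
training_set = [
    ((0.1, 0.2), "normal"),
    ((0.9, 0.8), "arrhythmia"),
    ((0.2, 0.1), "normal"),
]
test_set = [
    ((0.1, 0.2), "normal"),      # duplicate of a training input
    ((0.7, 0.9), "arrhythmia"),
]

print("training proportions:", class_proportions(training_set))
print("test proportions:", class_proportions(test_set))
print("inputs shared by both sets:", shared_inputs(training_set, test_set))
```

A non-empty list of shared inputs, or very different class proportions between the two sets, would suggest that the test set is a weak check against overtraining.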
To the best of my knowledge, I have never heard or read of a different meaning of "training set" and "test set" for different classification approaches. That's why I assumed the question to be of a practical nature, so I pointed out some resources on how to get a quick start in Matlab.
Everything you wrote is of course correct, just for the sake of a bit more generalization:
In general, we can regard a classifier as a function γ:X→C that tries to determine the appropriate class c∈C for a given sample x∈X. A sample is represented by an n-dimensional feature vector x=(x_1,x_2,x_3,…,x_n)∈X. Depending on the concrete problem, we have a set of m classes C={c_1,c_2,…,c_m}.
The formalization of a training sample would be 〈t,c〉∈X×C, which means each sample of the training set is annotated with its appropriate class. A training set in general is a set of annotated samples T_Training={〈t_1,c〉,〈t_2,c〉,…,〈t_n,c〉 } | 〈t,c〉∈X×C. In contrast, the test set contains only samples without annotations: T_Test={t_1,t_2,…,t_n } | t∈X.
During the training stage, the system is trained using the annotated samples from the training set, which results in a set of rules, also called a model. Depending on the quality of the model, the system should now be able to determine the class of a non-annotated, unknown sample.
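To make the idea of "training yields a model, and the model maps a sample to a class" concrete, here is a small Python sketch. The nearest-centroid rule is only one possible choice of γ, picked here for brevity. The function names `train` and `classify` and the sample values are assumptions made for this example.

```python
def train(training_set):
    """Build a model (one centroid per class) from annotated samples <t, c>."""
    sums = {}
    counts = {}
    for features, label in training_set:
        if label not in sums:
            sums[label] = [0.0] * len(features)
            counts[label] = 0
        sums[label] = [s + f for s, f in zip(sums[label], features)]
        counts[label] += 1
    return {
        label: tuple(s / counts[label] for s in total)
        for label, total in sums.items()
    }

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def classify(model, features):
    """gamma: X -> C, here the class whose centroid is closest to the sample."""
    return min(model, key=lambda label: squared_distance(model[label], features))

# Hypothetical annotated samples with 2-dimensional feature vectors
training_set = [
    ((1.0, 1.2), "a"), ((0.8, 1.0), "a"),
    ((4.0, 3.9), "b"), ((4.2, 4.1), "b"),
]
model = train(training_set)

print(classify(model, (0.9, 1.1)))  # sample lying near the "a" samples
print(classify(model, (4.1, 4.0)))  # sample lying near the "b" samples
```

Here the "set of rules" is just the dictionary of centroids. Other classification technologies would store something different, but the training and classification steps keep the same shape.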
--- Correction 29.07.2014 ---
After reading the answers of Francesco Bianconi and M. Ramakrishna Murty I realized I made a mistake in my previous post. Thanks for that ...
Of course not only the 'training set' but also the 'test set' is annotated / labeled. It is better to distinguish between 'labeled' and 'non-labeled' data.
The labeled data is split into a 'training set', a 'validation set' and a 'test set' for use during the training phase: T_Labeled = { 〈t_1,c〉,〈t_2,c〉,…,〈t_n,c〉 } | 〈t,c〉∈X×C
The non-labeled data represents all unknown samples during the working phase: T_NonLabeled = { t_1,t_2,…,t_n } | t∈X
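A minimal Python sketch of such a split might look like this. The 60/20/20 proportions, the fixed seed, and the function name `split_labeled` are arbitrary assumptions for illustration. The non-labeled samples are kept apart because they only appear in the working phase.

```python
import random

def split_labeled(labeled, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle labeled samples and cut them into training, validation and test sets."""
    shuffled = list(labeled)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    training_set = shuffled[:n_train]
    validation_set = shuffled[n_train:n_train + n_val]
    test_set = shuffled[n_train + n_val:]  # whatever remains
    return training_set, validation_set, test_set

# Hypothetical labeled data <t, c> and non-labeled data t
labeled = [((float(i),), i % 2) for i in range(10)]
non_labeled = [(10.5,), (11.5,)]

training_set, validation_set, test_set = split_labeled(labeled)
print(len(training_set), len(validation_set), len(test_set))
```

Each labeled sample ends up in exactly one of the three sets, so the test set is never seen while the model is being fitted or tuned.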