I am working to improve the classification results of my algorithm, which I have used in several different applications. For one of them, it achieves 100 percent accuracy, which seems very strange. Could you give any recommendations about this?
Accuracy assessment is a partial enumeration process. An accuracy of 1 would mean the classification is an exact replica of the ground truth, which is not practically possible. Increase the number of sample points and recalculate.
There is no rule of thumb for calculating accuracy: some researchers take 100 uniformly distributed points, others 254. What I would suggest is taking stratified sample points based on the classified area of each class, as in the sketch below.
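A minimal sketch of that idea in Python, assuming the classified map is available as a NumPy array of class labels (the function name, point budget, and per-class floor here are just placeholders): allocate validation points to each class in proportion to its mapped area, with a small floor so rare classes are not missed.

```python
import numpy as np

def stratified_sample_points(classified, n_points=254, min_per_class=10, seed=0):
    """Pick validation pixel locations proportional to each class's mapped area.

    `classified` is assumed to be a 2-D array of class labels (the classified map).
    """
    rng = np.random.default_rng(seed)
    flat = classified.ravel()
    classes, counts = np.unique(flat, return_counts=True)
    # Allocate points proportionally to class area, with a floor per class.
    alloc = np.maximum(min_per_class,
                       np.round(n_points * counts / counts.sum()).astype(int))
    samples = []
    for cls, n in zip(classes, alloc):
        idx = np.flatnonzero(flat == cls)
        samples.append(rng.choice(idx, size=min(n, idx.size), replace=False))
    rows, cols = np.unravel_index(np.concatenate(samples), classified.shape)
    return rows, cols  # compare these locations against the ground truth
```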
100% accuracy is very uncommon and seldom occurs in standard classification tasks.
Either your recognition problem is rather easy, your test and training data are too much alike compared to practical scenarios, or you are actually re-classifying your training data in the test step. In the latter case, 100% classification accuracy can easily result for classifiers with many "parameters" (i.e. high capacity), such as the nearest-neighbor classifier.
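To illustrate that last point, here is a small scikit-learn sketch (on a toy dataset, not your data) showing how a 1-nearest-neighbor classifier scores 100% when re-classifying its own training data, while a held-out split gives a more honest estimate.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=1).fit(X_tr, y_tr)
print("accuracy on the training data:", knn.score(X_tr, y_tr))  # 1.0: each point is its own neighbor
print("accuracy on held-out data:   ", knn.score(X_te, y_te))   # a more realistic estimate
```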
It is possible that your classifier is overfitting the training set. To avoid that, you should evaluate your classification process with 10-fold cross-validation.
However, relying only on classification accuracy when evaluating a learning method is not enough; you should also consider additional evaluation metrics such as the confusion matrix, ROC curves, etc.
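A rough sketch of both suggestions with scikit-learn, using a toy dataset and a stand-in classifier (neither is your actual setup): 10-fold cross-validated accuracy, plus a confusion matrix and ROC AUC computed from the out-of-fold predictions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import cross_val_score, cross_val_predict

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(random_state=0)  # stand-in for your own classifier

# 10-fold cross-validated accuracy instead of a single, possibly optimistic, number.
scores = cross_val_score(clf, X, y, cv=10)
print("10-fold accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))

# Extra metrics from out-of-fold predictions: confusion matrix and ROC AUC.
proba = cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
print(confusion_matrix(y, (proba > 0.5).astype(int)))
print("ROC AUC: %.3f" % roc_auc_score(y, proba))
```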
It may be that your training and test data sets are very much alike. Generally, 70% of the data is used for training and the remaining 30% for testing your classification algorithm. I think you should recompute the results over several such splits and take the average of your set of observations. 10-fold cross-validation may also be a good choice for testing.
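As a sketch of the averaging idea (the dataset and classifier below are only placeholders for your own), you can repeat the random 70/30 split several times and report the mean accuracy rather than a single lucky split.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))  # placeholder

# Ten independent 70/30 splits; report the average instead of one split's score.
splits = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
scores = cross_val_score(clf, X, y, cv=splits)
print("mean accuracy over 10 random 70/30 splits: %.3f" % scores.mean())
```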
I recommend that you use stratified random sampling to split your data into training and testing sets, and if you have a large enough sample you could do n-fold cross-validation.
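For example, something along these lines in scikit-learn (the dataset and classifier are stand-ins): a class-stratified 70/30 split, followed by stratified n-fold cross-validation on the training part.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Stratified random split: class proportions are preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=0)

# With enough samples, stratified n-fold cross-validation on the training part.
clf = SVC()  # stand-in classifier
scores = cross_val_score(clf, X_tr, y_tr, cv=StratifiedKFold(n_splits=5))
print("stratified 5-fold accuracy: %.3f" % scores.mean())
print("held-out test accuracy:     %.3f" % clf.fit(X_tr, y_tr).score(X_te, y_te))
```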
You didn't provide any details about the algorithm or the data-set in question, hence this might not be applicable...
As Michael Kemmler pointed out, this is rather uncommon but can potentially happen in a number of scenarios. What I found often helpful is to try to visualize the data (both training and testing) as well as the decision boundary somehow. This should give you a good indication on how complicated the problem is, if the classifier is overfitting, if the testing data is too similar to the training data etc.
If the data is high-dimensional, something like t-SNE might help.
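For instance, a rough t-SNE sketch with scikit-learn and matplotlib (toy data shown here): embed training and test samples together and plot them with different markers to see whether the test points simply sit on top of the training points.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

# Embed train and test together so both land in the same 2-D space.
emb = TSNE(n_components=2, random_state=0).fit_transform(np.vstack([X_tr, X_te]))
n_tr = len(X_tr)
plt.scatter(emb[:n_tr, 0], emb[:n_tr, 1], c=y_tr, marker="o", alpha=0.5, label="train")
plt.scatter(emb[n_tr:, 0], emb[n_tr:, 1], c=y_te, marker="x", label="test")
plt.legend()
plt.title("t-SNE of training vs. test data")
plt.show()
```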
If you have sufficient data, try splitting it into train / validate / test sets (say 60% train, 30% validate, 10% test). After training on the training set, evaluate on the validation set. If the validation set still shows improvement, continue training with the training set. When there is no more improvement, test ONCE ONLY with the test set. This avoids problems of adapting to the test set through repeated testing. Others have pointed out that if you have trivial data you could get 100%. Try running your data through Weka's ZeroR and OneR algorithms: if they come out at 100% or very close, you're not dealing with an AI problem.
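The baseline check can be approximated outside Weka as well; here is a sketch using scikit-learn, where a majority-class DummyClassifier stands in for ZeroR and a depth-1 decision stump is a rough OneR analogue (the dataset is only a placeholder).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# ZeroR analogue: always predict the most frequent class.
zero_r = DummyClassifier(strategy="most_frequent")
# Rough OneR analogue: a decision stump on the single best attribute.
one_r = DecisionTreeClassifier(max_depth=1)

for name, clf in [("ZeroR-like", zero_r), ("OneR-like", one_r)]:
    acc = cross_val_score(clf, X, y, cv=10).mean()
    print("%s baseline accuracy: %.3f" % (name, acc))
# If these trivial baselines already score near 100%, the problem itself is trivial.
```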
Finally, on whether to use cross-validation: I use a rule of thumb which I think originates from Quinlan. If you have a moderately complex problem you probably need more than 1,000 cases in your data set; if you haven't got that, then definitely go for cross-validation. How many folds? 10 is the default answer, but you really need to split your data so you have enough in the training set for the classifier to learn the domain (something you have to work out for yourself and which depends on the type of classifier). For example, in a problem where I had 1,200 records I used 60-fold cross-validation.
Have you tried to visualize your data using PCA (assuming you have a dataset in a higher dimension than 2)? It might help you identify if there are clear separations between the data points and could help validate your models. Otherwise plotting a learning curve and checking for the bias-variance tradeoff might also help you understand your dataset better.
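A possible sketch of both ideas with scikit-learn and matplotlib (the toy dataset and classifier are chosen only for illustration): a 2-D PCA scatter plot followed by a learning curve comparing training and cross-validation scores.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

# PCA projection: are the classes cleanly separated already?
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))
plt.figure()
plt.scatter(Z[:, 0], Z[:, 1], c=y)
plt.title("PCA projection")

# Learning curve: training vs. cross-validated score as the training set grows.
clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
sizes, train_scores, val_scores = learning_curve(clf, X, y, cv=5)
plt.figure()
plt.plot(sizes, train_scores.mean(axis=1), label="training score")
plt.plot(sizes, val_scores.mean(axis=1), label="cross-validation score")
plt.legend()
plt.title("Learning curve")
plt.show()
```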
Classification accuracy depends on the application as well as the database. It may vary across applications for the same classifier, and it also depends on the features of the data.
Hello Salih Tutun. I recently submitted a manuscript to a journal in which I discovered some issues during validation and proposed an approach termed spatial cross-validation, as opposed to ordinary cross-validation. Here are the details of the two:
1) Conventional cross-validation involves random selection of training and validation points. This means that the randomly selected points can occur anywhere in the image, even neighbouring the points selected for validation. Correlation will therefore be high between points in the training and validation sets, and higher accuracy will be reported; this approach has been described as giving overly "optimistic" accuracy. In other words, the algorithm already has prior knowledge of the patterns in the vicinity of the validation points due to high correlation (spatial dependency).
2) In contrast, spatial validation involves subdividing the image to be classified into grids/spatial blocks and then sampling the data from different blocks for training and testing the algorithm. This is motivated by the goal of image classification, which is to predict labels/classes/land cover beyond where there is training data (knowledge of classes). Consequently, we should use a validation approach that considers and evaluates that goal. I hope this helps.
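A rough sketch of the spatial blocking idea, assuming you have point coordinates alongside the features (everything below, including the grid size and the randomly generated data, is illustrative only): assign each sample to a grid block and use GroupKFold so that no block contributes to both training and validation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

# Assumed inputs: features X, labels y, and point coordinates (x_coord, y_coord).
rng = np.random.default_rng(0)
n = 1000
x_coord, y_coord = rng.uniform(0, 100, n), rng.uniform(0, 100, n)
X = rng.normal(size=(n, 5))
y = rng.integers(0, 3, n)

# Assign each point to a spatial block (here a 20 x 20 unit grid cell).
block_size = 20
blocks = (x_coord // block_size).astype(int) * 1000 + (y_coord // block_size).astype(int)

# GroupKFold keeps whole blocks out of the training folds, so neighbouring
# (spatially correlated) points cannot appear on both sides of the split.
scores = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                         cv=GroupKFold(n_splits=5), groups=blocks)
print("spatial block cross-validation accuracy: %.3f" % scores.mean())
```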