Sounds a bit optimistic. Voting performance is typically less than that of the best performing inner learner. I assume here that your voting operator uses a majority vote (for classification) or the average (for regression) on top of the predictions of the other inner learners.
Thank you so much for your consideration. I think you are right. I should use another combination method. Maybe an adaptive combination would be more beneficial.
If accuracy is p, >= 2 of 3 voting gives you an accuracy of 1 - (1-p)^3 - 3*p*(1-p)^2, for instance, assuming the samples are independent. 70%, 80%, and 90% accuracy goes to about 78%, 90%, and 97% accuracy. Works fine, at the cost of making three separate sets of measurements.
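If it helps to check the arithmetic, here is a minimal Python sketch of that formula; the three accuracies are just the example values above, nothing measured:

def majority_vote_accuracy(p):
    # P(at least 2 of 3 independent classifiers are correct)
    return 1 - (1 - p)**3 - 3 * p * (1 - p)**2

for p in (0.7, 0.8, 0.9):
    print(f"{p:.0%} -> {majority_vote_accuracy(p):.1%}")
# prints roughly: 70% -> 78.4%, 80% -> 89.6%, 90% -> 97.2%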
The independence assumption will rarely hold exactly, which will reduce the improvement you can expect, but this is pretty hard to quantify in real situations and depends entirely on how your measures are derived. You might, as an alternative, look at "bagging" and "forest of trees" approaches to improve your classification.
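If you want to try bagging and random forests quickly, a rough scikit-learn sketch (assuming Python, with a synthetic data set standing in for your own measurements) could look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data standing in for your own feature matrix X and labels y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging (bootstrap aggregation of decision trees by default) and a random forest.
for name, clf in [("bagging", BaggingClassifier(n_estimators=50, random_state=0)),
                  ("random forest", RandomForestClassifier(n_estimators=100, random_state=0))]:
    scores = cross_val_score(clf, X, y, cv=10)
    print(name, round(scores.mean(), 3))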
Another ensemble tool might be stacking, which combines sub-models and may yield better performance than any single one of the trained sub-models; it can be done at least in Weka, RapidMiner and R. See also http://en.wikipedia.org/wiki/Ensemble_learning
Specifically, in stacking one uses two or more base learners to improve the performance of a stacking model learner. Using all data in this training (learning) phase might result in overfitting or an overly optimistic performance estimate. Hence, it may be a bit more appropriate to use cross-validation here: n-fold CV splits the data into n train-test folds and iterates the stacked generalization process so that the test examples never overlap.

In practice, an easy way to get started is to download RapidMiner (RM) from http://rapid-i.com/content/view/181/190/lang,en/ . After downloading RM you may open the built-in example Stacking process by clicking: File, Open, Samples, Processes, 19_Stacking, Ok, Run. Another way to get started is to copy-paste the attached .txt file into the XML tab of RM and then click the Run button. After running the attached example process, you will see in the results tab the difference between stacking performance with and without 10-fold CV. After that you may just plug your own data into the attached RM stacking 10-fold-CV process and monitor how it works with your own data. Cheers P
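For readers who prefer code to the RM GUI, a rough Python/scikit-learn sketch of the same idea (stacking with internal CV for the base-learner predictions, plus an outer 10-fold CV for the overall estimate; the data set below is synthetic and the attached RM process is not reproduced here) would be:

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Base (child) learners whose predictions feed the stacking (parent) model.
base_learners = [("tree", DecisionTreeClassifier(random_state=0)),
                 ("nb", GaussianNB())]

# cv=10 makes the meta-learner train on out-of-fold predictions,
# so it never sees predictions made on a base learner's own training data.
stack = StackingClassifier(estimators=base_learners,
                           final_estimator=LogisticRegression(),
                           cv=10)

# Outer 10-fold CV gives the honest performance estimate for the whole stack.
print(cross_val_score(stack, X, y, cv=10).mean())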
One other minor comment about voting schemes. These are reasonable and practical ways to address accuracy and noise issues; for instance, medical devices (AEDs, ICDs) often use voting where one measurement sample may randomly be corrupted by noise or measurement error. In these cases independence assumptions are very good, but adequate representative samples are difficult to capture and characterize for classifier development, so a voting scheme is a practical mitigation which doesn't require omniscience.
One other word: accuracy alone is rarely adequate to estimate the improvement in your base performance. Rates of error are rarely symmetrical between classes, nor is prevalence equal. So if you are 90% sensitive and 80% specific, voting may improve these rates to 97.2% and 89.6%, but overall accuracy will be the average of these two rates, weighted by the prevalence.
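A tiny Python sketch of that calculation (the 10% prevalence is only an assumed example value, swap in your own):

def vote(p):
    # 2-of-3 majority vote under independence
    return 1 - (1 - p)**3 - 3 * p * (1 - p)**2

sens, spec = 0.90, 0.80
sens_v, spec_v = vote(sens), vote(spec)            # 0.972, 0.896
prevalence = 0.10                                  # assumed fraction of positives
accuracy = prevalence * sens_v + (1 - prevalence) * spec_v
print(sens_v, spec_v, accuracy)                    # 0.972 0.896 0.9036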
Sorry for the late answer (it's the new year holiday here in Iran :) ).
Thank you all, friends, for your helpful comments.
Thanks Pekka, I downloaded RM and installed it; I'll try it.
You are right, James, and I've calculated the sensitivity and false alarm rate. By using voting I got a lower false alarm rate (I like it) but lower sensitivity too, and the overall accuracy was the same as before.
I'm trying to read about the stacking method (frankly, it's a little unclear to me right now; I'd appreciate it if somebody could explain it and comment on my last remark) and about an adaptive combination method, because I think they can be helpful.
If you are really interested in the ensemble field, you should read «Ensemble based systems in decision making» (Polikar, 2006). It has a clear and extensive explanation of different combination methods and of different methods to create many classifiers while introducing diversity among them. The diversity among the base classifiers is the key to understanding whether your ensemble results are reasonable or not. This means that if your base classifiers fail on different instances, then your ensemble should enhance the result, as you commented.
Furthermore, it is not usual to get that kind of performance, but it could be possible depending on the diversity.
Thank you, Yeray. You are right, because in a combination method each classifier is supposed to work on the part of the problem where it performs better. I mean, we want to reduce the generalization error of the classifiers.
To get to 99% accuracy is certainly *possible*, but not *likely*. It all depends on how the errors of the individual classifiers are distributed.

Suppose that the errors are made on disjoint sets, that is, what classifier 1 gets wrong is classified correctly by classifiers 2 and 3, and analogously for the other two possibilities. (Note that disjoint error sets are possible, because the sum of the errors is 15% + 10% + 30% = 55% < 100%.) Then a simple majority vote would give even 100% accuracy. However, this is not likely to happen.

If the classifiers are (stochastically) independent w.r.t. the mistakes they make, the expected accuracy of majority voting is (1-0.85)*0.9*0.7 + 0.85*(1-0.9)*0.7 + 0.85*0.9*(1-0.7) + 0.85*0.9*0.7 = 0.919. This formula exploits the assumption that the classifiers are independent to compute the probabilities of joint events as simple products. The formula sums the probabilities of those cases in which at least two classifiers produce a correct result (and thus the majority vote yields a correct result).

On the other hand, the lowest accuracy that majority voting with such three classifiers could give is 75%. This is achieved, for example, if the error sets of the 90% and the 85% classifier are disjoint, but their union (25% of the data) is covered by the error set of the 70% classifier. This gives the largest possible percentage on which two classifiers produce a wrong classification (and thus the majority voting produces a wrong result).
To summarize: any accuracy between 75% and 100% is possible (meaning: on the same data on which the individual accuracies were determined - unseen data are a different issue). 91.9% is to be expected under independence. 99% requires almost disjoint error sets.
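For completeness, the independence figure above can be reproduced with a few lines of Python (assuming three independent classifiers with accuracies 0.85, 0.90 and 0.70 and a simple majority vote):

from itertools import product

acc = (0.85, 0.90, 0.70)

expected = 0.0
for outcome in product((0, 1), repeat=3):   # 1 = classifier correct, 0 = wrong
    prob = 1.0
    for a, o in zip(acc, outcome):
        prob *= a if o else (1 - a)
    if sum(outcome) >= 2:                   # majority vote is correct
        expected += prob

print(expected)   # 0.919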
Your tips are pretty clear and reasonable. I think we can conclude that the voting accuracy is close to the maximum accuracy, and that with these accuracies getting such a high accuracy is "unlikely", not "impossible".
Thanks a lot Pekka Jounela! I used your code, and that was a good one!
I've got a question here! The Outer Performance node gives the performance of the model applied to all the data, but the inner one gives the performance of the model applied to the data the learner was trained with; that's why the outer performance gives lower accuracy, right?
And the second question: is it OK to wrap the base learners used in the stacked model in an X-Validation operator to see their performance individually?
Hi Farideh, specifically in the previous process the inner performance was the model performance averaged over 10 non-overlapping test sets. Find attached a slightly tuned RapidMiner process where both the base (child) model and stacked (parent) model performances are logged at each round. So, check the logs in the results. Cheers P.