I am developing a machine-learning model for a Network Intrusion Detection System (IDS) and have experimented with several ensemble classifiers, including Random Forest, Bagging, Stacking, and Boosting. In my experiments, the Random Forest classifier consistently outperformed the others. I would like to run a statistical analysis to check whether this performance difference is significant and to understand the reasons behind it.
Could anyone suggest appropriate statistical tests or analytical approaches for comparing the effectiveness of these ensemble methods? Additionally, what factors should I consider when interpreting the results of such tests?
Thank you for your insights.