I'm working with Weka on the KDD Cup 1999 dataset. I've got a few questions I couldn't figure out from the manuals:
How do we know what parameters to set for Ranker? I mean, threshold and numToSelect. Is there any explanation of these?
When I select attributes via the Explorer and save the modified dataset, it always has N+1 attributes (the N selected attributes + the class/label). Why? Isn't the label/class also an attribute?
Why, when I use PCA+Ranker with default settings for attribute selection, do I get more attributes than I had?
1. How do we know what parameters for Ranker to set (threshold and numToSelect)?
Answer
In WEKA, Ranker is a search method that must be paired with a single-attribute evaluator (e.g., InfoGainAttributeEval or GainRatioAttributeEval). The evaluator assigns each attribute a score with respect to the class, Ranker sorts the attributes by that score, and you then cut the ranked list using either a threshold (cutoff score) or a fixed count (numToSelect).
Threshold: the minimum acceptable score; attributes scoring below it are discarded. The default is a very large negative number, which means nothing is discarded. For evaluators whose scores are non-negative, such as information gain, 0 is a natural starting point: it drops attributes that contribute nothing, and you can raise it gradually from there.
numToSelect: the exact number of top-ranked attributes to retain; the default of -1 keeps them all (subject to the threshold). On a large dataset like KDD Cup 1999, keeping only the most relevant attributes reduces noise and training time. A reasonable starting point is to retain 10-20% of the attributes and adjust based on the model's performance.
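As a concrete illustration, here is a minimal sketch of the same ranking done through Weka's Java API, using InfoGainAttributeEval as the evaluator; the filename kddcup99.arff is a placeholder for wherever your data lives:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankerDemo {
    public static void main(String[] args) throws Exception {
        // Load the data; "kddcup99.arff" is a placeholder filename.
        Instances data = new DataSource("kddcup99.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // The evaluator scores each attribute; Ranker sorts by that score.
        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.0);   // drop attributes with no information gain
        ranker.setNumToSelect(-1);  // -1 = keep everything that passes the threshold

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(eval);
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Note: the returned index array includes the class index as its
        // last element, which ties into question 2 below.
        for (int idx : selector.selectedAttributes()) {
            System.out.println(data.attribute(idx).name());
        }
    }
}
```

With threshold 0.0 and numToSelect -1, Ranker keeps every attribute with positive information gain; setting numToSelect to a positive value instead gives you an exact count regardless of scores.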
2. When I select attributes via explorer and save the modified dataset, it’s always N+1 attribute (N selected attributes + class/label). Why?
Answer
Yes, the class/label is an attribute like any other, but attribute selection in WEKA treats it specially: the evaluator scores every other attribute with respect to the class, so the class itself is never ranked and never dropped. When you save the reduced dataset from the Explorer, the class attribute is carried over automatically so the data remains usable for supervised learning, which is why you always end up with your N selected attributes plus one.
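A quick way to see this behavior outside the Explorer is the supervised AttributeSelection filter; a minimal sketch (again with a placeholder filename):

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.supervised.attribute.AttributeSelection;

public class ClassKeptDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("kddcup99.arff").getDataSet(); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10); // ask for exactly 10 attributes

        AttributeSelection filter = new AttributeSelection();
        filter.setEvaluator(new InfoGainAttributeEval());
        filter.setSearch(ranker);
        filter.setInputFormat(data);

        Instances reduced = Filter.useFilter(data, filter);
        // Prints 11: the 10 ranked attributes plus the class attribute,
        // which the filter always carries over.
        System.out.println(reduced.numAttributes());
    }
}
```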
3. Why, when I use PCA+Ranker with default settings for attribute selection, do I get more attributes than I had?
Answer
Two things are going on. First, WEKA's PrincipalComponents evaluator pre-processes the data by converting every nominal attribute into a set of binary indicator attributes (and replacing missing values), so a dataset like KDD Cup 1999, whose nominal attributes such as service have dozens of distinct values, is expanded to far more dimensions than the original attribute count before PCA even runs. Second, with default settings PCA keeps as many components as are needed to cover 95% of the variance (the varianceCovered parameter), and in that expanded space this can easily require more components than you had original attributes. To get fewer, lower varianceCovered or set Ranker's numToSelect as a hard cap on the number of components.
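Here is a sketch of reining this in programmatically via the Java API, assuming the same placeholder filename kddcup99.arff:

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.PrincipalComponents;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PcaDemo {
    public static void main(String[] args) throws Exception {
        Instances data = new DataSource("kddcup99.arff").getDataSet(); // placeholder filename
        data.setClassIndex(data.numAttributes() - 1);

        PrincipalComponents pca = new PrincipalComponents();
        pca.setVarianceCovered(0.95); // the default; lower it for fewer components

        Ranker ranker = new Ranker();
        ranker.setNumToSelect(10); // hard cap: keep at most 10 components

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(pca);
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Transform the data into the (capped) principal-component space.
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Attributes after PCA: " + reduced.numAttributes());
    }
}
```

Lowering varianceCovered trades information for compactness, while numToSelect is a hard cap applied to the ranked components regardless of how much variance they cover.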