Well, the problem is knowing what the top x dimensions are, and that depends heavily on the problem. You can try applying a singular value decomposition (SVD) to identify the dimensions that mainly influence your data; in text mining this is a common approach to reducing the number of dimensions. However, by applying SVD you will certainly lose information. For example, an SVM sometimes performs better on the original feature set than on the reduced one.
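Just to make that concrete, here is a minimal sketch with scikit-learn; the dataset, the 100 components, and the SVM comparison are my own illustrative choices, not part of the original point:

```python
# Sketch: reduce a sparse text feature matrix with truncated SVD (LSA-style),
# then compare an SVM on the original vs. the reduced features.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=20000).fit_transform(docs.data)
y = docs.target

# Keep only the 100 strongest singular directions (an arbitrary choice here).
X_reduced = TruncatedSVD(n_components=100, random_state=0).fit_transform(X)

print("SVM on original features:", cross_val_score(LinearSVC(), X, y, cv=3).mean())
print("SVM on SVD features:     ", cross_val_score(LinearSVC(), X_reduced, y, cv=3).mean())
```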
Genetic Algorithms. There are many papers and plenty of code out there; just google "Genetic Algorithm feature selection". Much better than univariate filtering (which misses interactions between variables), much better than ranking "loadings" from projection methods (which can be misleading), and much more elegant than random projection.
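For concreteness, a minimal hand-rolled sketch of the idea; the dataset, the k-NN fitness function, population size, and mutation rate are all arbitrary assumptions:

```python
# Sketch of a simple genetic algorithm for feature selection.
# Each individual is a boolean mask over the features; fitness is the
# cross-validated accuracy of a classifier trained on the selected columns.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_breast_cancer(return_X_y=True)
n_features = X.shape[1]

def fitness(mask):
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

pop = rng.random((30, n_features)) < 0.5              # random initial population
for generation in range(20):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[::-1][:10]]      # keep the 10 best (elitism)
    children = []
    while len(children) < len(pop) - len(parents):
        a, b = parents[rng.integers(10, size=2)]
        cut = rng.integers(1, n_features)             # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(n_features) < 0.05          # mutation
        children.append(np.where(flip, ~child, child))
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("selected features:", np.flatnonzero(best))
```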
Instead of genetic algorithms, I recommend swarm algorithms. They are very efficient at searching huge feature spaces, in particular the Cuckoo Search algorithm and Particle Swarm Optimization.
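As a rough sketch of the PSO variant of this (binary PSO with a sigmoid transfer function; the inertia and acceleration constants, dataset, and k-NN fitness are assumptions on my part):

```python
# Sketch of binary Particle Swarm Optimization for feature selection.
# Each particle's position is a 0/1 feature mask; velocities are squashed
# with a sigmoid to give per-feature selection probabilities.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X, y = load_wine(return_X_y=True)
n_particles, n_features = 20, X.shape[1]

def fitness(mask):
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0
    return cross_val_score(KNeighborsClassifier(), X[:, mask], y, cv=3).mean()

pos = (rng.random((n_particles, n_features)) < 0.5).astype(float)  # 0/1 masks
vel = rng.normal(scale=0.1, size=(n_particles, n_features))
pbest = pos.copy()
pbest_score = np.array([fitness(p) for p in pos])
gbest = pbest[pbest_score.argmax()].copy()

w, c1, c2 = 0.7, 1.5, 1.5        # inertia and acceleration constants (assumed)
for _ in range(30):
    r1, r2 = rng.random(vel.shape), rng.random(vel.shape)
    vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
    prob = 1.0 / (1.0 + np.exp(-vel))                 # sigmoid transfer function
    pos = (rng.random(vel.shape) < prob).astype(float)
    scores = np.array([fitness(p) for p in pos])
    better = scores > pbest_score
    pbest[better] = pos[better]
    pbest_score[better] = scores[better]
    gbest = pbest[pbest_score.argmax()].copy()

print("selected features:", np.flatnonzero(gbest))
```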
I would suggest using a weight function of some sort to distinguish the most valuable data and/or the most valuable features. There is probably a function that can also incrementally calibrate that measure for each feature...
I would not recommend doing feature selection with Weka; it is better to do it yourself. There are many algorithms you can code to select the best feature subset, grouped into supervised and unsupervised methods. Among supervised methods you can use LDA and SVD, which not only select the best features but also classify the data. Another option is Genetic Algorithms, but they have two problems: time complexity and single-objectiveness. You can use LDA to reduce the time complexity and a Multi-Objective Genetic Algorithm to address the single-objective limitation; in my experiments the best Multi-Objective Genetic Algorithm was NSGA-II. For unsupervised feature selection, PCA works very well and is fast.
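A quick sketch of the two "do it yourself" baselines mentioned here, using scikit-learn's LDA and PCA; the dataset and component counts are illustrative assumptions:

```python
# Sketch: supervised reduction with LDA (which also classifies) vs. unsupervised PCA.
from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# LDA projects onto at most (n_classes - 1) discriminant axes and can classify directly.
lda = LinearDiscriminantAnalysis(n_components=2)
print("LDA accuracy:", cross_val_score(lda, X, y, cv=5).mean())
print("LDA projection shape:", lda.fit_transform(X, y).shape)

# PCA is unsupervised: it keeps the directions of highest variance, ignoring the labels.
X_pca = PCA(n_components=2).fit_transform(X)
print("PCA projection shape:", X_pca.shape)
```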
If your feature space is nonlinear, it is recommended to use kernel versions of PCA or LDA.
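For the nonlinear case, a minimal kernel PCA sketch; the RBF kernel, the gamma value, and the toy dataset are assumptions:

```python
# Sketch: kernel PCA on data that is not linearly separable.
from sklearn.datasets import make_circles
from sklearn.decomposition import KernelPCA, PCA

X, _ = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                        # linear PCA
X_rbf = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In the RBF-kernel space the two circles become (nearly) linearly separable,
# which plain PCA cannot achieve with a linear projection.
print(X_lin.shape, X_rbf.shape)
```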
You can use any dimensionality-reduction (DR) technique for doing that. The simplest would be a variance analysis: keep only the features with the highest variance.
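With scikit-learn this filter is essentially a one-liner; the 0.1 threshold and the dataset are arbitrary assumptions:

```python
# Sketch: drop low-variance features.
# Note: variance is scale-dependent, so features should be on comparable scales.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import VarianceThreshold

X, _ = load_breast_cancer(return_X_y=True)
X_high_var = VarianceThreshold(threshold=0.1).fit_transform(X)
print(X.shape, "->", X_high_var.shape)
```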
If run-time is your concern, in my experience Genetic Algorithms work well for selecting descriptive features. For one of my data mining projects I tried a brute-force approach and several other variations for comparison, and the genetic algorithm was the one that came closest to the brute-force result (trying every combination). Admittedly, this suggestion is based on a single learning task.
I agree with the other answers pointing to Genetic Algorithms; in my experience too, they are very good at selecting a small set of features or prototypes out of a very large initial set.
Is your goal to select features to improve classification results, or to decrease the number of features used at test time (or both)?
I'm not sure what the feature selection in Weka is doing, but evaluating the features individually (rather than sets of features) should not be too computationally intensive.
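For reference, that kind of per-feature scoring is cheap; here is one illustrative way to do it with a univariate ANOVA F-test (this is not necessarily what Weka does, and the dataset and k=10 are assumptions):

```python
# Sketch: score each feature individually with a univariate F-test.
# This is linear in the number of features, so it stays cheap even for large sets.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_breast_cancer(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=10).fit(X, y)
print("top-10 feature indices:", selector.get_support(indices=True))
```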
Unsupervised techniques like PCA/SVD/ICA/LSA/LDA tend to rely on relatively expensive eigenvector analysis that doesn't scale well (often cubic), and FDA/LDA (Fisher/Linear Discriminant Analysis) are supervised versions with the same dependencies, plus normality assumptions. NDA (Nonparametric Discriminant Analysis) removes the distributional assumptions, but may be more expensive.
Techniques that analyse features individually or in pairs do not work in general. I can take an easy dataset and a few sets of random numbers, and split each attribute by mixing in different weighted combinations of the random numbers, producing a harder, scrambled dataset. As long as some combination of these split features adds back up to the original attribute, perfect recovery is still possible. But the importance of those features won't be discovered by such micro-analyses, while PCA and LDA and their variants can rediscover them easily, if not cheaply (see the sketch below).
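A small sketch of that scrambling argument; the dataset and mixing weights are arbitrary. One informative column is split into two features that each look like noise on their own, yet their difference recovers the original attribute exactly:

```python
# Sketch: hide an informative attribute inside two "scrambled" features.
# Individually each feature is dominated by noise, but f1 - f2 == x0, so LDA
# classifies well while univariate F-scores suggest the features are weak.
import numpy as np
from sklearn.feature_selection import f_classif
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 1000
y = rng.integers(0, 2, size=n)
x0 = y + 0.3 * rng.normal(size=n)        # easy, informative attribute
r = rng.normal(size=n)                   # pure noise

f1 = x0 + 10 * r                         # scrambled split of x0
f2 = 10 * r                              # the noise on its own
X = np.column_stack([f1, f2])

print("univariate F-scores:", f_classif(X, y)[0])
print("LDA accuracy on scrambled features:",
      cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean())
```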
If your data is sparse (e.g. frequencies of words in webpages), then random combinations are likely to work well, as any randomly chosen pair of attributes (words) is likely to be nearly orthogonal (correlation, or dot product, near zero). If your data can be made sparse you can use similar techniques (e.g. taking EEG or audio data into the frequency domain, or exploiting spatial correlation, through preprocessing). As long as you don't reduce the rank of the matrix, these random combinations can make PCA etc. cheaper without impacting the result much on average. So if you have 10 classes and stick with 40 reduced vectors, things are likely to be okay whether you go directly to 40 with SVD or first reduce to, say, 100 with random indexing techniques and then to 40 with SVD.
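A sketch of that two-stage reduction on sparse text features; the 100 and 40 simply mirror the numbers above, and the dataset is an illustrative assumption:

```python
# Sketch: sparse random projection to ~100 dims, then SVD down to 40,
# versus taking the SVD straight to 40 on the original sparse matrix.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.random_projection import SparseRandomProjection
from sklearn.decomposition import TruncatedSVD

docs = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))
X = TfidfVectorizer(max_features=50000).fit_transform(docs.data)

# Direct route: SVD straight to 40 components.
X_direct = TruncatedSVD(n_components=40, random_state=0).fit_transform(X)

# Two-stage route: cheap random projection to 100, then SVD to 40.
X_rp = SparseRandomProjection(n_components=100, random_state=0).fit_transform(X)
X_two_stage = TruncatedSVD(n_components=40, random_state=0).fit_transform(X_rp)

print(X_direct.shape, X_two_stage.shape)
```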
Where your data falls between these sparse and scrambled extremes, random initialisation followed by genetic techniques is likely to improve on exhaustive search of feature combinations, while ending up cheaper than a straight SVD.
By the way, an FFT followed by SVD makes no difference to the latent space discovered, as the linear mapping of the FFT (or of random indexing) is simply absorbed by the SVD. The point of the preprocessing is to do sparsity-licensed data reduction before the SVD, rather than (just) after.
Also, cepstral processing (log of the spectrum followed by a further FFT or IFFT) tends to increase sparsity, and similar techniques work for text: e.g. take the log of the counts (or TFIDFs) and then the arithmetic mean, which corresponds to the geometric mean in the original space. Now when you demean the log data, your zero counts and your average counts are both represented by 0, increasing sparsity. Words that occur slightly less often than average are not interesting, since the topics you are actually talking about occur more than average and everything else has to be slightly depressed.
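A tiny sketch of that log-then-demean trick on made-up term counts; note that clipping the uninteresting below-average values at zero is my reading of "not interesting", not something stated above:

```python
# Sketch: log-transform term counts, then subtract the mean log count per term.
# After demeaning, average counts sit at 0; clipping negatives at 0 (an assumed
# step) puts the zero counts at 0 as well, keeping the matrix sparse.
import numpy as np

counts = np.array([[0, 2, 8, 0],
                   [1, 2, 0, 16],
                   [0, 2, 4, 0]], dtype=float)

log_counts = np.log1p(counts)                    # log(1 + count), so 0 stays 0
demeaned = log_counts - log_counts.mean(axis=0)  # arithmetic mean of the logs
                                                 # = log of the geometric mean
sparse = np.maximum(demeaned, 0.0)               # drop the "not interesting" part
print(np.round(sparse, 2))
```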