I want to get information about the size of each tree (number of nodes) in a random forest after training. I usually use WEKA, but it seems to be unusable for this.
Please note that if you are going to use the R randomForest package and you have computational performance issues, you should try the R sprint package, which provides a parallel implementation of randomForest. For typical use cases it can obtain a speedup of around 40x over the serial code. The sprint random forest interface exactly mimics the existing serial implementation, so modifying existing serial R scripts to take advantage of this functionality is trivial. You can find out more about its performance in the Concurrency and Computation journal article at
I really think scikit-learn is better than R unless you are already fluent in R. Python is much easier to read and learn, and scikit-learn is optimized at the C level, meaning it will already be fast and suited for bigger datasets. That is not to say R's randomForest package isn't more extensive than scikit-learn's, but scikit-learn does have random forest implementations.
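If you do go the scikit-learn route, reading off the size of each tree after training is straightforward. Here is a minimal sketch (the iris data is just a placeholder for your own dataset):

```python
# Minimal sketch: train a RandomForestClassifier and report the size
# (number of nodes) of each fitted tree. Iris is only a placeholder dataset.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X, y)

# Each member of rf.estimators_ is a DecisionTreeClassifier; its underlying
# tree_ object exposes node_count (total nodes), and get_n_leaves() gives
# the number of terminal nodes.
sizes = [est.tree_.node_count for est in rf.estimators_]
print("nodes per tree:", sizes)
print("average tree size:", sum(sizes) / len(sizes))
```

If you end up staying in R instead, the randomForest package has a treesize() function that reports the same per-tree size information.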
You may also want to look into ilastik if visualization and interactivity are important for your application.
Understand the tools and models you are using before applying them. You can dump any randomforest package if it can produce the final tree to you. You need to test the training time, testing time, accuracy and uncertainty with all available packages related to randomforest. I found that RF in R needs extensive time to train the model, especially with categorical variables of over 20 levels. Salford RF gives me a fast training model but with poor accuracy and large uncertainty. cforest needs huge computer memory and has very poor accuracy and large uncertainty; in addition, it needs a very long run to complete the testing part. R rpart gives me a very fast training model but with very poor accuracy and large uncertainty. R RF gives me a training model with good accuracy and the lowest uncertainty among all the packages, but it took several days to complete the training. Bagging does not give me a satisfactory answer compared with RF. Hope this can help.
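If it helps, here is one way that kind of comparison can be wired up. This is only an illustrative sketch in Python/scikit-learn (the dataset and the two candidate learners are placeholders, not the specific packages discussed above), recording training time, testing time and accuracy for each candidate:

```python
# Illustrative sketch: time the fit/predict steps and score accuracy for a
# few candidate learners on a held-out split. Swap in whatever packages or
# models you are actually evaluating.
import time

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "random forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "bagging": BaggingClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    t0 = time.time()
    model.fit(X_train, y_train)      # training time
    t1 = time.time()
    pred = model.predict(X_test)     # testing time
    t2 = time.time()
    print(f"{name}: train {t1 - t0:.2f}s, "
          f"test {t2 - t1:.2f}s, accuracy {accuracy_score(y_test, pred):.3f}")
```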
What exactly do you mean by "You can dump any randomforest package if it can produce the final tree to you. You need to test the training time, testing time, accuracy and uncertainty with all available packages related to randomforest."?
Are you referring to actual classification capabilities in real-life situations, or to the mathematical properties of a particular algorithm?
It is not likely that you can produce the final tree. There are a lot of packages, including commercial ones (RandomForest), and they cannot produce a single final tree either. For rpart or CART you can do it, but not for random forest.
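To put the same point in scikit-learn terms (again, just a sketch): a single CART-style tree can be printed as one final tree, while a random forest only lets you export its individual member trees, never one combined final tree:

```python
# Sketch: a single decision tree has one "final tree" you can print, while a
# random forest is an ensemble; you can only export each member tree separately.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

cart = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(cart))  # the one final tree, CART/rpart-style

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(export_text(rf.estimators_[0], max_depth=2))  # just one member tree out of ten
```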
Yes, but what is the relevance of this? Will the classification then be false or inferior in some respect? I have been under the impression that banks that avoid loan risks, and prostate cancer patients who get the right prognosis from their marker studies, are pretty happy with the results from these applications.
In my statistical consulting and reviewing experience, covering over US$40 billion across different fields, only 20% of the proposed models are useful, and only about 5% of them may be modelled properly with the right predictor variables. For the other proposed models, even when all the predictor variables are significant, the information is not the right information. In one situation I could only get 2-5% prediction accuracy, even though all the predictor variables were highly significant and uncorrelated. I told my client that the right information is not in the model: either you need to provide me with more predictor variables, or you may need to wait until the right information exists and collect it. Each prediction costs $5,000 to a quarter of a million, and I need to use it over 20,000 times. So I told my clients that using the best available science is not the same as using the best available proposed model. For this type of prediction, I need over 70% prediction accuracy in my model validation and forward validation (not 2-5%). Hope this can help.
I had a doubt regarding this: I recently loaded a dataset into my WEKA tool, but I am not able to apply Random Forest or even Linear Regression to that data. The problem I am facing is that when I select these algorithms to run on the data, the Start option is not enabled. Is it something to do with the dataset I am using, since it has both nominal and numerical values?
If anyone knows the reason behind this, kindly let me know.