I know that one of the most important disadvantages of Naive Bayes is that it makes strong feature independence assumptions. What are the other disadvantages?
A subtle issue ("disadvantage", if you like) with Naive Bayes is that if you have no occurrences of a class label together with a certain attribute value (e.g. class="nice", shape="sphere"), then the frequency-based probability estimate will be zero. Given Naive Bayes' conditional independence assumption, when all the probabilities are multiplied you will get zero, and this will wipe out the posterior probability estimate.
This problem arises when the drawn samples are not fully representative of the population. The Laplace correction (add-one smoothing) and other smoothing schemes have been proposed to avoid this undesirable situation.
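To see this concretely, here is a minimal sketch of frequency-based estimation with and without add-one (Laplace) smoothing; the toy data, the attribute values, and the `alpha` parameter are made up purely for illustration:

```python
# Toy training pairs of (class label, shape attribute) -- purely illustrative.
data = [("nice", "cube"), ("nice", "cube"), ("nice", "pyramid"),
        ("ugly", "sphere"), ("ugly", "cube")]

shapes = {"cube", "pyramid", "sphere"}

def conditional_prob(shape, label, alpha=0.0):
    """Estimate P(shape | label); alpha=1 gives the Laplace (add-one) correction."""
    in_class = [s for (c, s) in data if c == label]
    count = sum(1 for s in in_class if s == shape)
    return (count + alpha) / (len(in_class) + alpha * len(shapes))

# No ("nice", "sphere") pair was ever observed, so the raw estimate is 0 and
# any product of likelihoods that includes it collapses to 0.
print(conditional_prob("sphere", "nice"))             # 0.0
# With add-one smoothing the estimate is small but non-zero.
print(conditional_prob("sphere", "nice", alpha=1.0))  # 1/6 ~= 0.167
```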
In classification tasks you need a large data set in order to make reliable estimates of the probability of each class. You can use the Naive Bayes classification algorithm with a small data set, but precision and recall will remain very low.
If you're using Naive Bayes only for classification, then have a look at generative vs. discriminative models (e.g. logistic regression). One of the primary problems with using a generative model for classification is that you're often only interested in the separating hyperplane between your classes, whereas with a generative model you're also spending effort modelling data points far away from that plane.
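A minimal sketch of that contrast, assuming scikit-learn and an arbitrary synthetic dataset (every parameter choice below is just an assumption for the example), fitting a generative Gaussian Naive Bayes model and a discriminative logistic regression on the same points:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data; the sizes and seed are arbitrary.
X, y = make_classification(n_samples=500, n_features=10, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generative: models P(x | y) and P(y), then applies Bayes' rule.
gnb = GaussianNB().fit(X_train, y_train)
# Discriminative: models P(y | x) directly, i.e. the decision boundary itself.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("Gaussian Naive Bayes accuracy:", gnb.score(X_test, y_test))
print("Logistic regression accuracy: ", logreg.score(X_test, y_test))
```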
The independence assumption, oddly, isn't necessarily a disadvantage. You do lose the ability to exploit interactions between features; however, for classification tasks this often isn't a problem. The same does not seem to hold for regression, where the assumption becomes more of an issue.
You may want to refer to this article:
Eibe Frank, Leonard E. Trigg, Geoffrey Holmes, and Ian H. Witten. Naive Bayes for regression (technical note). Machine Learning, 41(1):5-25, 2000.
One problem that is often overlooked is how to calculate probabilities for Naive Bayes when working with real-valued features. People often attempt either to discretize the feature (which raises the question of how many discrete values to use) or to fit a normal curve. A potential solution is to estimate a non-normal distribution instead (a small sketch follows the reference below). See this reference:
John, G. H., & Langley, P. (1995). Estimating continuous distributions in Bayesian classifiers. Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence (pp. 338-345). Montreal, Quebec: Morgan Kaufmann.
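As a rough sketch of the idea in John & Langley (an illustration with invented toy data and an assumed bandwidth, not their exact procedure): estimate the class-conditional density of a continuous feature with a kernel density estimate instead of forcing a single normal curve.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)

# A one-dimensional feature whose class-conditional distribution is clearly
# non-normal: a mixture of two well-separated bumps (values are invented).
x_class_a = np.concatenate([rng.normal(-2, 0.5, 100), rng.normal(3, 0.5, 100)])

# Option 1: fit a single normal curve (what a Gaussian Naive Bayes does).
mu, sigma = x_class_a.mean(), x_class_a.std()

# Option 2: a kernel density estimate of p(x | class=a).
kde = KernelDensity(kernel="gaussian", bandwidth=0.5)
kde.fit(x_class_a.reshape(-1, 1))

x0 = np.array([[0.0]])  # a point between the two bumps, where data are sparse
normal_density = np.exp(-0.5 * ((x0 - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
kde_density = np.exp(kde.score_samples(x0))

# The single normal curve reports a sizeable density at 0 even though almost
# no training points fall there; the kernel estimate is close to zero.
print(normal_density.ravel()[0], kde_density[0])
```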
In the same vein as Ryan's and Rafael's answers above, see:
Boullé, M. (2005). A Bayes Optimal Approach for Partitioning the Values of Categorical Attributes. Journal of Machine Learning Research, 6, p. 1431-1452.
Boullé, M. (2006). MODL: A Bayes Optimal Discretization Method for Continuous Attributes. Machine Learning, 65(1), p. 131-165.
Boullé, M. (2007). Compression-Based Averaging of Selective Naive Bayes Classifiers. Journal of Machine Learning Research, 8, p. 1659-1685.
(available from the author's home page http://perso.rd.francetelecom.fr/boulle/ )
Regarding NB itself, there is an interesting paper:
D. Hand and K. Yu, "Idiot's Bayes, not so stupid after all?", International Statistical Review, Vol. 69, No. 3 (2001), pp. 385-398.
Abstract:
Folklore has it that a very simple supervised classification rule, based on the typically false assumption that the predictor variables are independent, can be highly effective, and often more effective than sophisticated rules. We examine the evidence for this, both empirical, as observed in real data applications, and theoretical, summarising explanations for why this simple rule might be effective.