What is the philosophy behind using the Bernoulli, binomial, negative binomial, gamma, beta, exponential, chi-square, normal, and multinomial distributions?
They all describe (in mathematical form) what we know or what we can conjecture about a value when we do not know its actual, precise value. The distributions put relative weights on all values of the outcome that we assume might be possible. These weights are called probabilities; they need to have some mathematical properties to be consistent and coherent, and they express the "relative expectation" of the possible outcomes.
Hence, the (probability) distribution should acknowledge which outcomes are considered possible (this is called the domain). One consequence of the requirement that probabilities be coherent is that they must not be negative and that the probabilities of all mutually exclusive outcomes covering the entire domain have to sum up to 1. Now take the simplest possible example: the outcome can take one of two possible values (if it could have only one possible outcome, then that would be the value and there would be no need for probabilities). This outcome could be, say, A or B (and nothing else). Then you know that P(A)+P(B)=1. P(A) can be any value between 0 and 1, indicating how much you expect the outcome to be "A" relative to being "B". If you set P(A) = a, then P(B) must be b = 1-P(A) = 1-a. The ratio of these probabilities, a/b = a/(1-a), is called the odds of A (and b/a = b/(1-b) is the odds of B); it expresses this relative expectation between two mutually exclusive events directly.
From this simplest possible example you get to the Bernoulli distribution when you map the two possible outcomes to a numeric variable X that can take the values X=0 or X=1. If you say that event A is encoded by the value X=0 and event B by the value X=1, then the probability distribution can be formulated as a numeric mathematical function of the value x (which can be either 0 or 1):
P(X=x) = a^(1-x) * b^x
Note that if x=1, then (1-x)=0; and if x=0, then (1-x)=1.
If you further consider that a = 1-b, then you get
P(X=x) = (1-b)^(1-x) * b^x
This is the Bernoulli distribution with parameter b. The variable X is called a random variable with a Bernoulli(b) distribution, or simply a Bernoulli variable. The parameter b encodes the probability of event B happening, and by mapping B to X=1, we may say that this distribution gives the probability of event B happening once. We can use this to infer counts among a series of similar (and stochastically independent) Bernoulli experiments, i.e. experiments that all have the same two possible outcomes and that we can model by the same Bernoulli distribution with the same value of the parameter b. So we may derive the probability distribution of the sum of n Bernoulli variables (X1, X2, ..., Xn) as
Y = X1 + X2 + ... + Xn
The domain of the new random variable Y is now 0, 1, 2, ..., n. The probabilities for these n+1 possible outcomes can be derived from the Bernoulli distribution, and the result happens to be the binomial distribution:
P(Y=k) = nCk * (1-b)^(n-k) * b^k
where nCk is a constant (the binomial coefficient) that ensures P(Y=0)+P(Y=1)+...+P(Y=n) = 1.
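To make this concrete, here is a minimal simulation sketch in Python (with arbitrary illustrative values n=10 and b=0.3, not taken from anything above) showing that the sum of independent Bernoulli draws reproduces the binomial probabilities:

    import numpy as np
    from scipy.stats import binom

    rng = np.random.default_rng(0)
    n, b = 10, 0.3            # illustrative values, chosen only for the demo
    reps = 100_000

    # Each row is one series of n Bernoulli(b) trials; Y is the row sum.
    x = rng.random((reps, n)) < b
    y = x.sum(axis=1)

    # Empirical frequencies of Y should be close to the binomial pmf.
    empirical = np.bincount(y, minlength=n + 1) / reps
    theoretical = binom.pmf(np.arange(n + 1), n, b)
    print(np.round(empirical, 3))
    print(np.round(theoretical, 3))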
Instead of counting the number of events (k) in a series of given length (n), we might want to count the number of trials until the first event happens. We can again start from the Bernoulli distribution and derive this new distribution, the geometric distribution:
P(X=k) = b * (1-b)^k
The domain of this variable is the non-negative integers (0, 1, 2, ...); here k counts the non-events before the first event.
You might now want to infer the probabilities for the number of trials until the first m events have happened, which leads you to the negative binomial distribution.
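Again just as an illustrative sketch (parameter values are arbitrary): counting the non-events before the first event in simulated Bernoulli trials reproduces the geometric probabilities, and counting them before the m-th event reproduces the negative binomial:

    import numpy as np
    from scipy.stats import geom, nbinom

    rng = np.random.default_rng(0)
    b, m, reps = 0.3, 4, 100_000    # illustrative values, chosen only for the demo

    # Non-events before the first event (geometric with support 0, 1, 2, ...).
    # numpy/scipy count the trial on which the first event occurs, hence the shift.
    failures_first = rng.geometric(b, size=reps) - 1
    print(np.bincount(failures_first, minlength=6)[:6] / reps)
    print(geom.pmf(np.arange(1, 7), b))          # equals b*(1-b)^k for k = 0..5

    # Non-events before the m-th event (negative binomial).
    failures_mth = rng.negative_binomial(m, b, size=reps)
    print(np.mean(failures_mth), nbinom.mean(m, b))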
With this, you have stepped forward from the simplest possible distribution (Bernoulli) to some other useful discrete distributions. But it is possible to go from there even to continuous distributions. For instance, you may want a probability distribution for the number of events happening within some spatial or temporal interval. You can think of the interval as being divided into many tiny pieces, so tiny that at most a single event may happen in each of them. So you can understand each sub-interval as a Bernoulli trial (with some relatively small probability of an event happening in any of these sub-intervals). If you look at the whole set of sub-intervals, you may understand this as a binomial trial with large n and small "event probability". Now you continue to make these sub-intervals smaller, adjusting the interval-wise event probability so that the product of the event probability and the number of sub-intervals remains constant. The limit for infinitely small sub-intervals turns out to be the Poisson distribution.
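A small numerical sketch of this limit (with an arbitrary expected count of 4 events per interval): as the number of sub-intervals n grows while n times the per-sub-interval probability stays fixed, the binomial probabilities approach the Poisson probabilities:

    import numpy as np
    from scipy.stats import binom, poisson

    lam = 4.0                         # illustrative expected number of events per interval
    k = np.arange(15)
    for n in (10, 100, 10_000):       # ever more, ever smaller sub-intervals
        p = lam / n                   # keep n*p = lam constant
        max_diff = np.max(np.abs(binom.pmf(k, n, p) - poisson.pmf(k, lam)))
        print(n, max_diff)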
You may then ask for the times (or distances) between adjacent events in such a Poisson process, and doing the math gives the exponential distribution, which is a continuous version of the geometric distribution. And just as you got from the geometric to the negative binomial distribution, it is possible to go from the exponential to the gamma distribution (the waiting time until the m-th event of the Poisson process). The gamma distribution is thus a model for "concentrations" (in time or space). And if you consider the concentration of Y being part of a mixture with Z, then you can derive the probability distribution of the proportion X = Y/(Y+Z) as being a beta distribution, which turns out to be the continuous analogue of the binomial distribution.
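These relationships can also be checked by simulation; the sketch below (with arbitrary parameter values) treats the waiting time until the m-th event of a Poisson process as a sum of exponential gaps (a gamma variable), and the proportion Y/(Y+Z) of two independent gamma variables as a beta variable:

    import numpy as np
    from scipy.stats import gamma, beta

    rng = np.random.default_rng(0)
    rate, m, reps = 2.0, 3, 100_000   # illustrative values, chosen only for the demo

    # Waiting time until the m-th event = sum of m exponential gaps -> gamma with shape m.
    gaps = rng.exponential(1 / rate, size=(reps, m))
    waits = gaps.sum(axis=1)
    print(np.mean(waits), gamma(a=m, scale=1 / rate).mean())

    # Proportion Y/(Y+Z) of two independent gamma variables -> beta distribution.
    y = rng.gamma(2.0, size=reps)
    z = rng.gamma(5.0, size=reps)
    print(np.mean(y / (y + z)), beta(2.0, 5.0).mean())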
The derivation of the normal distribution is a bit different. There exist several different derivations, all coming to the same result. The most famous derivation is given by de Moivre and Gauss, of course. It is derived from only three simple assumptions:
(1) small errors are more likely than large errors
(2) for any real number ϵ, the likelihoods of errors −ϵ and +ϵ are equal
(3) in the presence of several measurements of the same quantity, the most likely value of the quantity being measured is their average
Another enlightening derivation was given independently around 1850 by the astronomer John Herschel. He was concerned with the different coordinates of stars obtained in replicate measurements. If the measurements are comparable in their quality, our uncertainty about the star's position should be reflected in the very same way for all the measurements, no matter what coordinate system we choose. Herschel could show that this can be achieved only with one kind of probability density, which is the normal density.
And eventually, most of us know that the normal distribution emerges as the limiting distribution in many cases, particularly for sums of random variables, but also for distributions whose parameter values are pushed to their limits.
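A quick way to see the first of these limits (sums of random variables) is a small simulation: standardized sums of many clearly non-normal (here uniform) variables are practically indistinguishable from a standard normal variable. The sizes below are arbitrary demo choices:

    import numpy as np
    from scipy.stats import kstest

    rng = np.random.default_rng(0)
    n, reps = 1_000, 20_000           # illustrative sizes, chosen only for the demo

    # Standardized sums of n uniform(0,1) variables (mean n/2, variance n/12).
    sums = rng.random((reps, n)).sum(axis=1)
    z = (sums - n * 0.5) / np.sqrt(n / 12)
    print(kstest(z, 'norm'))          # should show no detectable deviation from normality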
Many frequently used statistical models assume an (approximately) normal distribution of the response variable. From the observed data, test statistics are calculated that capture all the information from the data relevant to a hypothesis to be tested. One can, of course, derive the distribution of these test statistics (based on the assumption about the distribution of the response variable) to judge whether values of that statistic at least as extreme as the one calculated from the observed data are unexpected under the tested hypothesis. These derivations lead us to the chi², t, and F distributions (for instance, if X is normally distributed with mean m and variance v, then Y = (X-m)/sqrt(v) is standard normally distributed, and Y² = (X-m)²/v is chi²(1)-distributed).
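As a minimal check of the chi²(1) example in the parentheses (with arbitrary values for m and v):

    import numpy as np
    from scipy.stats import chi2

    rng = np.random.default_rng(0)
    m, v = 5.0, 4.0                   # illustrative mean and variance, chosen only for the demo
    x = rng.normal(m, np.sqrt(v), size=100_000)
    y2 = (x - m) ** 2 / v             # squared standardized values

    # Empirical quantiles of Y² should match the chi²(1) quantiles.
    qs = [0.5, 0.9, 0.99]
    print(np.quantile(y2, qs))
    print(chi2.ppf(qs, df=1))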
More on the derivation of the normal distribution:
There are other approaches to answering this vaguely posed question.
A second approach is to start by noting that all of these commonly used distributions are in fact members of the same wider families of distributions ... at least one such family for each of the discrete and continuous cases. The common distributions are used in preference to the wider families because the latter would involve estimating more parameters. Practical datasets seldom provide enough information to estimate more than two or three unknown parameters, and so it is expedient to impose as much structure as can be justified as reasonable in the circumstances (such as described in the first reply). These wider families also provide scope for testing whether an assumed distribution is adequate for representing a given dataset.
A third approach is to start from "for what are the distributions being used?" If an intended data analysis is simply to provide a detailed summary of the distribution of data from a given population, then attention needs to be put on providing a good fit either to the whole of the distribution or to some selected part of it (such as the tails), and an analysis would seek to assess whether a selected and fitted distribution adequately represented the observed data in that context. However, since "statistics" is about making comparisons, an intended data analysis may well involve making comparisons between populations, either between distinct populations or in a regression-like context. Here, the adequacy of a chosen distribution may not be of direct consequence for the intended purpose. The analysis may give a result like "population A behaves like this distribution with these parameters, while population B behaves like the same distribution with these other parameters" and, stated with reasonable caveats, this may be as far as a given analysis need be taken (for the particular context). In other cases a good approach may be to do the analysis based on a chosen family of distributions, but to do something to assess the sensitivity of the conclusions to that selection.
As a summary one may say that “the philosophy behind using a selected distribution” is not to make a selection and to regard that selection as final. Rather, regard must always be taken of the particular contexts, both of the population and dataset and of the reason for doing the analysis.
Because each probability distribution has its own characteristics, different from those of the others, you have to select the probability distribution that fits your variable by the following procedure: look at the variable in question; review the descriptions of the probability distributions; select the distribution that characterizes this variable; and, if historical data are available, use distribution fitting to select the distribution that best describes your data.
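For the distribution-fitting step, a minimal sketch (using scipy's maximum-likelihood fit, with simulated stand-in data rather than real historical data) could look like this:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    data = rng.gamma(shape=2.0, scale=3.0, size=500)   # stand-in for "historical data"

    # Fit a few candidate distributions by maximum likelihood and compare log-likelihoods.
    for dist in (stats.gamma, stats.lognorm, stats.expon):
        params = dist.fit(data)
        loglik = np.sum(dist.logpdf(data, *params))
        print(dist.name, np.round(params, 3), round(loglik, 1))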