From a frequentist's perspective, "probability" is the limiting relative frequency (of an event). In this philosophy, only results of repeatable processes can have probabilities. Observed data are fixed and have nothing like a "probability". But the data-generating process is what gives the possible outcomes their probabilities. This is where sampling enters the game:
The probability (=limiting relative frequency = f) of an item (or individual or event) being in a sample depends on the way the sampling is performed. This f will be identical to the frequency of the item in the population - but only if f is identical for each item in the population. This means the "probability" of being sampled must be identical for each item in the population. AFAIK this is called "random sampling".
But how is this ascertained? How can I say that a process has "equal sampling probabilities" when these probabilities only establish themselves in infinite repetitions of the sampling process?
No doubt that the probability (=f) of an event in repeated sampling tends to the frequency of the item in the population - when the sampling is random, i.e. when the sampling probability (=f) of each item in the population is equal. This follows from the law of large numbers. I do not understand how the "random sampling" can create the same sampling probabilities of the items.
Prototype-example:
The population contains two items (A and B). The frequency of "A" = frequency of "B" = 0.5. A sample of size n=1 will either contain an A or a B. If a larger and larger random sample (with replacement!) is taken, then the theory says that the frequency of "A" will approach 0.5. But this will only be the case when the sampling procedure itself guarantees that the limiting relative frequencies (=probabilities) of "selecting A" and "selecting B" are equal.
Just to make it clear: The population may contain 3 items: A, A, B. When the probabilities of sampling each item are equal (=1/3), then and only then will the limiting relative frequency of "A" in the sample be 2/3 (and 1/3 for "B") and thus match the relative frequencies in the population.
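(A minimal simulation sketch of this prototype example - my own illustration, not part of the original question - using Python's pseudo-random generator as a stand-in for "equal sampling probabilities":)

import random

population = ["A", "A", "B"]   # two items with value A, one with value B
n = 100000

# illustration only: random.choice gives each of the three items
# selection probability 1/3 on every draw
sample = [random.choice(population) for _ in range(n)]

print(sample.count("A") / n)   # tends towards 2/3 as n grows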
Does this question confuse repeated sampling with sample size? (I hope and assume it isn't a troll).
For a population with 3 sampling units, with values A, A and B, we want the units to have equal selection probabilities. Number the units 1, 2 and 3 (3 having the value B). In that case, with n = 1 the values A and B will have selection probabilities 2/3 and 1/3 as noted in the question. The possible samples have values A, A and B. If we repeat the sampling process a large number, k, of times we would expect about a proportion 2/3 of the samples to have value A and 1/3 to have value B. This is consistent with the frequentist interpretation. Also, all of the individual units occur with probability 1/3 and this is consistent with the repeated sampling interpretation.
If we take samples of size n = 2 without replacement (and remembering that the two As are different sampling units and the order is immaterial), there are 3 possible samples with values: (A, A); (A, B); (A, B). Estimates of the proportion of As will be 1, 1/2 and 1/2, and in repeated sampling a large number, k, of times these will occur about k/3 times each, so the average estimate of the proportion of A in the population will be (1/3)(1) + (1/3)(1/2) + (1/3)(1/2) = 2/3 and for B it will be (1/3)(0) + (1/3)(1/2) + (1/3)(1/2) = 1/3.
Similar results apply to sampling with replacement (with n = 2 there are now 9 possible ordered samples: (1, 1); (1, 2); (1, 3); (2, 1); (2, 2); (2, 3); (3, 1); (3, 2); (3, 3), with values: (A, A); (A, A); (A, B); (A, A); (A, A); (A, B); (B, A); (B, A); (B, B) each with probability 1/9).
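(To make this enumeration concrete, a small sketch - my own illustration, with hypothetical names - that lists the possible samples of size n = 2 from the units {A, A, B} and averages the estimated proportion of A over them:)

from itertools import combinations, product

units = {1: "A", 2: "A", 3: "B"}   # three sampling units; unit 3 has value B

# without replacement: the 3 equally likely unordered samples of size 2
wo = list(combinations(units, 2))                     # (1,2), (1,3), (2,3)
est_wo = [sum(units[u] == "A" for u in s) / 2 for s in wo]
print(sum(est_wo) / len(est_wo))                      # 2/3: unbiased for the proportion of A

# with replacement: the 9 equally likely ordered samples of size 2
wr = list(product(units, repeat=2))
est_wr = [sum(units[u] == "A" for u in s) / 2 for s in wr]
print(sum(est_wr) / len(est_wr))                      # 2/3 again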
When n > 3 there are many more possible samples. However, the frequentist interpretation is to imagine k repetitions of the whole process as k -> oo. It is k and not n that tends to infinity.
I wonder if I misinterpreted the question. Simple random sampling (with or without replacement) does guarantee the correct long run frequencies (i.e. unbiased) as demonstrated in my first answer. Complex sampling designs do not, but weighting the estimates appropriately does guarantee unbiased estimates (assuming ideal conditions: 100% response rate, no undercoverage, etc.)
Stratified sampling with simple random sampling in each stratum is unbiased within each stratum. Single stage cluster sampling is unbiased so long as clusters are regarded as sampling units and clusters are selected with probability proportional to size or estimates are appropriately weighted.
However, there are practical difficulties with choosing a random sample unless we have an up-to-date list of the population.
Thank you for your comments and answers!
"Simple random sampling (with or without replacement) does guarantee the correct long run frequencies (i.e.unbiased) as demonstrated in my first answer."
Exactly this was/is my question: why do we "know" this? Terry, I do not see that you demonstrated this. You just said that this is like this (what is neither an explanation nor a demonstration).
Labeling the items with numbers 1...m (m is the size of the population) and creating a sample by taking the items 1,2,...n will in fact ensure equal sampling frequencies for each item in a single sample when n -> Inf (start over at 1 when n>m). Actually the exact uniform distribution is obtained when n=m. I think this is trivial.
My question is about repeated sampling. As I understood this, the items have to be labelled again, and again the items 1,2,...n have to be selected. Now this only makes sense when the order of labelling is not identical. But then the order of the labelling has to guarantee that, in the long run, each item will be labelled equally frequently with "1", with "2", and so on.
If you say that the ordering of the labels can remain constant, then for each sample a different "sample of labels" has to be chosen, which turns the question back to the original problem: what ensures that - in the long run - the frequencies of selecting a number between 1...m are equal for all the numbers?
@ Alvinn: "by definition, random sampling means that the probability of selecting an item is not dependent on anything and therefore must be assumed to be equal."
This is what I do not understand. For me this sounds like a circular argument: when sampling is random then the long-run frequencies are all the same, and because all frequencies (--> i.e. the probabilities!) are all the same, the sampling is random.
As I wrote above I think the only *sure* method of creating a sample with equal probabilities of all items being sampled is to number all items and then take them one by one in the order of the labelling. The selection is surely not random, but the way (or the order) the items are labelled may now be called "random". In a repeated sampling setting, the problem thus shifts to the labelling process (again, as already explained above).
It might be that your statement is correct and it is simply an assumption. But when this is the case, the whole frequentistic interpretation of experiments seems to be based on a (wild) assumption that cannot be shown to hold in any particular case. I don't like to believe that this is the case, therefore my question. I thought that the law of large numbers would give a rationale for this assumption, but I found that this is also not the case (this law itself relies on this very same assumption).
Jochen. If you say that "The population may contain 3 items: A, A, B." then the population may contain only A and B, according to set theory. So the relative frequencies have to be 1/2 if you want to obtain an expected value of A=B:
U_exp = A/2 + B/2, and if A = B, then U_exp = A = B.
The sampling procedure does not guarantee that the limiting relative frequencies (=probabilities) of "selecting A" and "selecting B" are equal. It looks like a biased die.
Thanks, emilio
I think I might have been misinterpreted. I don't claim to have proved anything--I gave some illustrations. If the question means 'how do you take a truly random sample?' then I would say that that is not possible in practice, but random sampling can be approximated.
Jochen, you have given one approach for sampling without replacement from a finite population: label the units in random order and take the first n. Alternatively, assign random real numbers to each unit (with probability 1 no two will be equal, but with finite precision arithmetic the probability will be slightly less than 1), then sort them and take the first n. This method is often used to select several non-overlapping samples, but with pseudo-random numbers.
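(A sketch of that selection method - the function name is mine, and pseudo-random numbers stand in for the random numbers mentioned above:)

import random

def sample_by_sorting(units, n, seed=None):
    # assign each unit a pseudo-random number, sort on it, take the first n
    rng = random.Random(seed)
    keyed = sorted((rng.random(), u) for u in units)   # ties have probability ~0
    return [u for _, u in keyed[:n]]

print(sample_by_sorting(["A1", "A2", "B"], n=2))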
With a finite population and enough repetitions, all possible samples will be obtained with approximately equal relative frequencies (but not necessarily within the lifetime of the universe--we need to take a longer-term view). If we could arrange to take each possible sample exactly once we would achieve exactly the correct relative frequencies.
On further thought, your question could be a more sophisticated version of the argument against Bernoulli's theorem (i.e. the law of large numbers for a proportion) as justifying empirical probability as a definition of probability. The justification is, indeed, circular. I can see one way out: leave probability undefined. Then Bernoulli's theorem demonstrates that probabilities can be estimated empirically with errors whose probability distribution can be estimated. (Of course this leads to infinite regress if you try to estimate the errors of the latter estimates, the errors in those, and so on).
The same problem applies throughout mathematics (and science and everything else, for that matter): some things must be left undefined and some statements must be assumed without proof. In this case, it is probability that must be left undefined. It is left unproved that probabilities exist that satisfy axioms such as those of Kolmogorov.
I think the question is not clear, or maybe the question is clear after all, but every person reads something different into it.
I'll try a different approach by commenting on what you say.
-"But how is this ascertained?"
It is not. What you can do is limited to: really trying to make each selection independent from the former ones, doing the best to avoid selection biases, and making a number of statistical tests to see if they fail at detecting both deviations from the uniform distribution and from independence.
Of course, the tests can only do so much, in that if the sample is small, then deviations will go undetected, and if the sample is large enough, then the real world never passes a statistical test.
-" I do not understand how the "random sampling" can create the same sampling probabilities of the items."
The "random sampling" is *by definition* one with the same sampling probabilities. This is not an operational definition, whence statistical theorems only have heuristic value in practice as their mathematical assumptions are never met exactly (and often not even inexactly!)
The equal limit relative frequencies are created by the fact that, under the assumptions of simple random sampling, all possible samples of a given size n have the same probability of being the sample you select, and as n increases, those with 'the right frequencies' quickly overwhelm in number those with 'the wrong frequencies'.
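(A quick way to see this "overwhelming" numerically - my sketch, for the simplest case of a two-valued population sampled with replacement with probability 1/2 each: the fraction of the 2^n equally likely sequences whose relative frequency of the first value is within 0.05 of 1/2 grows towards 1.)

from math import comb

for n in (10, 100, 1000):
    # count sequences of n draws whose relative frequency is within 0.05 of 1/2
    close = sum(comb(n, k) for k in range(n + 1) if abs(k / n - 0.5) <= 0.05)
    print(n, close / 2**n)   # roughly 0.25, 0.73, 0.999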
-"From a frequentist's prespective, "probability" is the limiting relative frequency"
Yes, but frequentist statisticians are typically not frequentist in this philosophical sense.
Thank you Pedro, this is what I meant with my question, and your answer largely reflects what I thought about this topic: that "random sampling" means *by definition* that each item has the same probability of being selected, and that there is at best some heuristic reasoning that this might be (approximately) obtainable by a sampling procedure where the selection of any individual item is, to the best of our knowledge, independent of the selection of any other item.*
You said: "This [random sampling] is not an operational definition". I understand this, but is there any suggested operation that is shown to produce "random samples" (one which is not based on a circular argument like "a sample is random when each item is selected at random (or by chance or whatever synonym one may use)")?
Your last sentence still strikes me:
-"From a frequentist's prespective, "probability" is the limiting relative frequency"
Yes, but frequentist statisticians are typically not frequentist in this philosophical sense.
How else then? The p-value of a hypothesis test *is* such a frequency (conditional on H0), and confidence intervals are solely defined based on such frequency-properties. Both tests and confidence intervals are thinkable only within the frequentistic philosophical frame, and they are sensible only if the sampling probability *is* a limiting frequency. Or am I wrong here?
*Just to mention: I know well that small samples - even large samples! - can be severely wrong, that means: having characteristics far off from the corresponding population characteristic (e.g. a sample mean very much different from the population mean). It is understood that there is no guarantee that a particular sample faithfully reflects a population. We are talking about *statistical* properties: the expected frequency of wrong** samples decreases with the sample size; and the bias of (infinitely) repeated samples is zero (or irrelevantly small).
** "wrong" to a particular degree, e.g. |µ-µ0|>e for a given e.
Hi Jochen,
Interesting philosophical question... As answered already, there are some circular issues in your question, so some assumptions must be made.
First, I think saying "probability IS limit of frequencies" is a much too strict interpretation of a frequentist point of view, or maybe a too practical one. A probability is simply a quantity between 0 and 1 that satisfies a (very few) additional properties to convey the idea that "the higher, the more probable" and quantifies the lucks of an event to occur. The idea of the limit of frequencies when _independent_ _identical_ realisations are done several times, with "several" going to infinity, gives a way to obtain these probabilities, or an intuitive support to understand them, but not more.
This convergence is ensured by the "law of large numbers" (the strong one) provided that two conditions are satisfied: 1) the experiment is always done in the same conditions and 2) the experiments are done independently (that is, the result of the first one does not give any information about the second one).
Consequently, any randomization scheme that satisfies these two criteria will lead to the right probability, in the limit. Note that this does not even imply equiprobability, which is a very special case... A very simple randomization procedure would then be to flip a coin or roll a die with a convenient number of faces, not necessarily fair (but convergence will be slower), as long as it is not completely unfair (it does not always give the same result) --- unless you can predict the result of the coin or die with a precise enough physical model. I don't agree with the assumption that random sampling means equiprobable values for each event, but it's most often the case in practice.
Another problem with the frequency-first approach is that it does not work in infinite-size populations, and is difficult to define for events like "catching a disease": it can occur or not, but even in a population of one patient, you cannot be sure he will catch it, so where is the frequency?
This also solves your difficulties with p-values and confidence intervals (also evoked in another post). CIs for instance give the right answer with a high probability, so we can trust them [or not, it's a choice...], hence are suited to estimating values. Other kinds of intervals must also be trusted in their own way; they just concern different events... Same idea for the p-values. Of course, in both cases, there is a strong tendency to over-interpret them, but that's another debate.
So, at least in my mind, "frequentist" means "we trust the law of large numbers to _estimate_, _approximate_ probabilities" and hence "we can trust random sampling (if one has a practical way to make one)", but not more. Defining probabilities as limit of frequencies and all similar things are instead helpful methods to understand or teach the tools, but not at all the basis of the methods...
Just my opinion of course ;)
Like Emmanuel, I'd rather identify 'statistical frequentism', so to speak, as a commitment to long-term sampling properties as the basic standard to evaluate statistical procedures.
That seems to be quite independent from 'philosophical frequentism'. For instance, under de Finetti's definition of subjective probability (your fair price for a ticket returning 1€ if the event happens), it still follows that you should not bet repeatedly against 99% confidence intervals unless you can get tickets for under 0'011€ each. You don't need to define probability as the limit frequency for limit frequencies to provide meaningful guarantees.
(If you adopt de Finetti's point of departure, the natural emphasis will be in "How to use incoming data to improve your subsequent betting?" instead of "How to choose a procedure now which will be reliable under a wide variety of possible future data?" so you are more likely to regard those frequential guarantees as secondary or even irrelevant when looked at from that dynamic learning perspective.)
I think that, from a statistical perspective, frequentism is more about giving those frequential guarantees a central role. Most people, like Emmanuel and myself, have just been taught from the agnostic position that 'A probability is simply a quantity between 0 and 1 that satisfies a (very few) additional properties.'
Hi Emmanuel,
First, I think saying "probability IS limit of frequencies" is a much too strict interpretation of a frequentist point of view, or maybe a too practical one.
What else then? I do not want to start a discussion about the interpretation of probability. It is eventually about tests, p-values, CIs. These are frequentistic measures; error-rates are long-run (limiting) frequencies. This “limiting frequency” is the only objective interpretation of probability. I do not want to turn to any flavour of subjective interpretation. It is NOT about the question whether I subscribe to any of these interpretations (frequentistic or subjective). The entire theory about hypothesis tests is based on the frequentistic interpretation and the results of such tests are to be interpreted in this context. If this was not the case, why then all this mess with testing, p-values and CIs? Why not stay with LI’s then? The only answer I can see is: “because LI’s do not control error-rates” – and here we are again: error-rates are frequencies.
A probability is simply a quantity between 0 and 1 that satisfies a (very few) additional properties to convey the idea that "the higher, the more probable" and quantifies the lucks of an event to occur.
What is “lucks”? If it is not something objective, then the results of tests and the control of error-rates are not objective either. If “lucks” is something objective, then please explain better what “lucks” precisely is. It seems that you use it just as a synonym for “probability” or for “limiting frequency”. If this is the case, then your argument is circular.
This convergence is ensured by the "law of large numbers"
This I do not see. Maybe I am too blind or too stupid (probably both), so I am really grateful for getting an explanation that I understand and that would resolve the “circularity” that I see: The law is based on a so-called repeatable “random experiment”, which is an experiment that can give different possible results each time it is performed. The weak law *requires* that the possible results have defined probabilities. Under this assumption the law says that the actual frequencies of observed results tend (in probability! haha) to the probabilities of the possible results. This is a circular argument: when the limiting frequency is defined, then the limiting frequency in an infinite series will be the limiting frequency. Aha.
The strong law is based on expected values of random variables. The existence of an expected value again requires that the possible values of the variable have defined probabilities (or probability densities), and the convergence of averages is eventually nothing else but the convergence of relative frequencies.
Consequently, any randomization scheme that satisfies these two criteria will lead to the right probability, in the limit. Note that this does not even imply equiprobability, which is a very special case...
I agree that equiprobability is not required. It’s only that when the probabilities for the possible values (or items) are not all equal, this has to be considered in the calculations (so the probability distribution must be known). Otherwise the results will be biased.
Another problem with the frequency-first approach is that it does not work in infinite-size populations,
Why not? Sampling WITH replacement is a valid repeatable random experiment.
and is difficult to define for events like "catching a disease": it can occur or not, but even in a population of one patient, you cannot be sure he will catch it, so where is the frequency?
This is not a repeatable experiment. It would be repeatable when different patients are allowed (so that some patients may catch the disease and others won't) or when the patient can be sampled according to the requirements (independence!), for instance: it may be recorded whether the patient had a cold within a year (some time long enough to cure a cold and possibly catch another), which can be repeated for several years.
CIs for instance give the right answer with a high probability,
So what does this mean exactly? As I understand it, we cannot trust a (particular) CI – but we can trust the method by which the CIs are constructed. A given CI is a unique, non-repeatable case, so there is no “probability” that it covers the correct value. We don’t know whether or not it covers the correct value, and (within the frequentist’s context) there is no possibility to rate our belief(?) that it contains the correct value. All we can say is that the limiting frequency of coverage of such intervals is known (and reasonably high). Here is the frequency again. It is eventually only all about this frequency. A single CI makes no probability statement. Therefore it was called “confidence interval” and not “probability interval”. AFAIK this was a deliberate choice of the name.
hence are suited to estimating values.
I am not sure about this. They allow one to see which values of the parameter may not be rejected while keeping the desired rate of false rejections (the long-run rate, the limiting frequency).
"we can trust random sampling (if one have a practical way to make one)”
So back to my initial question: how can we know that there exists (even only theoretically) a sampling procedure that results in a known probability distribution (equiprobable or whatever)?
Defining probabilities as limit of frequencies and all similar things are instead helpful methods to understand or teach the tools, but not at all the basis of the methods...
I don't agree here. The frequency definition has been "invented" to have an objective foundation of probabilities, and the testing procedures are based on this. Actually, the whole frequency-thing is ONLY required to get CIs and p-values (or does anyone know another reason?) as the basis of an "inductive behaviour" within a decision-theoretic framework. If one is unable to show precisely how the basis of all this can be obtained (-> random sampling), then everything constructed on it is meaningless. All likelihood calculations (MLEs, LIs, information matrix etc.) do not require a frequentistic/objective interpretation of probability. And likelihoods can be turned back to probabilities that also do not need to have any frequentistic/objective meaning.
Thank you for sharing your opinion :)
Hi Jochen,
Probability is a mathematical concept. A probability is a function that assigns a number between 0 and 1 to any "event" (which also has a precise mathematical definition, but let's not go into this now), with the three following properties:
- if event A is included in event B, then p(event A) <= p(event B);
- the probability of the certain event (the whole space) is 1;
- the probability of the union of mutually exclusive events is the sum of their probabilities.
'I am really grateful for getting an explanation that I understand and that would resolve the “circularity” that I see'
If I get it right, what you present is the following dichotomy. If you take probability=limit frequency as a definition, then the law of large numbers (LLN) is void of content. If you don't, then frequentist methods stop making sense because limit frequencies will not play along to your defining probabilities as you wish. If you dissociate probabilities from limit frequencies, you could have a test with alpha=P(type I error)=0'05 that under H0 gets it wrong 87% of the time.
The first part is, I think, correct. But the second part is not, precisely because of the LLN. The LLN ensures that, whatever you do, you cannot come up with an event having probability 0'05 and limit frequency 0'87 (except if another event with probability 0 occurs).
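(A small numerical illustration of that guarantee - my sketch, trusting the pseudo-random generator to approximate i.i.d. uniform draws: an event built to have probability 0.05 under the model settles near a relative frequency of 0.05, not 0.87.)

import random

random.seed(1)
trials = 200000
# the event "a Uniform(0,1) draw falls below 0.05" has model probability 0.05
hits = sum(random.random() < 0.05 for _ in range(trials))
print(hits / trials)   # close to 0.05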
Here you say: Aha, but the LLN is circular!
To see that it is not, you need to stop regarding it as a statement about the real world, and regard it as a piece of abstract mathematics.
For instance, you say that it 'is based on a so-called repeatable “random experiment”, which is an experiment that can give different possible results each time it is performed'. It is not. Application to the real world within the frequentist framework is.
You can change the narrative and the theorem is unaffected. An example is Borel's normal number theorem (i.e., in almost every real number between 0 and 1, each digit appears with the same relative frequency). It follows from the LLN but it does not involve random experiments or sample selections. Speaking of digits of a real number or infinite sampling with replacement from a set of ten objects is just a matter of language, as far as the abstract content of the theorem is concerned.
The frequentist framework, or even the probabilistic language of events and random variables, is not a prerequisite for the theorem. As long as your way of measuring probability satisfies Kolmogorov's axioms, limit frequencies (as mathematical objects, not real-world 'frequencies in trials') necessarily exist and coincide with the probabilities, except for a set with probability 0.
'Here is the frequency again. It is eventually only all about this frequency. A single CI makes no probability statement.'
Exactly, that is the same thing we are saying: that frequentism, as statisticians practice it, is only about that frequency, not about whether you call that frequency 'probability' or not.
'Defining probabilities as limit of frequencies and all similar things are instead helpful methods to understand or teach the tools, but not at all the basis of the methods...
I don't agree here.'
This is just a matter of background. Maybe you couldn't help but agree if you had been trained in the following fashion: "Definition. A test t is a mapping from R^n to {0,1}. The parametric space is a set of probability distributions. Two non-empty disjoint subsets of the parametric space are called the null hypothesis and the alternative hypothesis. The size of the test is sup P(t=1) with P ranging over the null hypothesis..."
But, as a matter of fact, a part of the theory of tests, e.g. the Karlin-Rubin theorem on uniformly most powerful tests, was not developed from a frequentist point of view.
Finally we seem to have reached the heart of the question: how should probability be defined?
It is a circular argument to define probability as limiting relative frequency and then to prove the limit exists by proving the weak law of large numbers.
E.g. P[|x/n - p| > epsilon] < eta for any epsilon, eta > 0, provided n is large enough (Bernoulli's theorem).
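(For completeness - a standard Chebyshev bound, added here as a worked step and not part of the original post: for x ~ Binomial(n, p), Var(x/n) = p(1-p)/n, so P[|x/n - p| > epsilon] <= p(1-p)/(n epsilon^2) <= 1/(4 n epsilon^2), which falls below any chosen eta > 0 as soon as n > 1/(4 eta epsilon^2).)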
What is the basis of the P in that law? It can't be a limiting relative frequency, can it? If so we need a meta theorem about repeating the whole repeatable sequence, then a meta meta theorem, and so on.
One alternative is to assume as an axiom that the limiting relative frequency exists. Another is to leave probability as an undefined measure that obeys certain axioms.
I have heard it argued that the limiting relative frequency definition is invalid because one should not define a physical property on the basis of an impossible experiment. If you can show me how to do an infinite sequence of trials, I will accept the definition, otherwise I won't.
Hi Emmanuel,
no need to discuss the mathematical definition of probability. We agree on this (for sure) and that all the laws and properties derived from it hold true. This is all proven. I apologize if I was imprecise in what I meant with the “definition” of probability. It was the wrong word. You are correct that it is defined axiomatically as a measure based on a sigma-algebra. Agreed. This is the objective definition. But there must be some link telling us how to interpret such measures in the real world, practically. There are a number of different interpretations out there, which I think can be categorized as either subjective or frequency interpretations. The interpretation that probabilities are frequencies is the basis of the entire test- and sampling theory (sure, the situation today is more complicated because subjectivists tried to establish subjective counterparts of the test theory, but this is off-topic here).
If probability were seen/interpreted *only* as a measure-theoretic construct on a sigma-algebra without any relation to the real world, any statement like p(A) > p(B) would not tell me anything useful. I can neither claim that A happens more frequently than B, nor can I say that I believe more that A will happen (or is the case) than I believe that B will happen (or is the case) (which would be a subjective interpretation -> off topic here). If I accept that it has to do with frequencies, but rather loosely, I still cannot say that A is more frequent than B. The statement is interpretable (related to frequencies) only when p is in fact a defined frequency. Since the interpretation as an observed frequency is obviously problematic, it must be interpreted as a limiting frequency.
Imagine again that p has no frequency-meaning. It is only an abstract mathematical notation for a measure in a sigma-algebra. A paper reports a “significant” effect of something, with a small p-value. If p were only such an abstract measure, what would this report tell me?
Hi Jochen,
I think this is precisely where we differ here: for me, p (for instance) is indeed just an abstract measure of "how lucky was the event", defined on objective grounds, hence comparable between researchers/... It does not answer the true question, but this is obviously the same for any method: any method will answer on a mathematical model of the data/underlying experiment/phenomenon, but finally, what you have is "yes, my model agrees with the data" or "no, it does not", never "my effect exists" or "it does not" --- I'm not sure what alternative method you may be thinking of, but it's the same... And with all methods, you end up with a probability of something, which can be interpreted in a frequentist view (a « 95 % credibility interval » is an interval such that p( Theta in [A, B] / data & prior ) = 0.95, right? The difference with a confidence interval is « just » in the choice of a conditional probability [and the fact that theta is random also], as far as I understand it, so the interpretation would be the same: if you do an infinite set of experiments, then on average 95 % of the credibility intervals will contain theta and so on)
Consequently, the introduction of frequencies to interpret the world is not the basis of all the tools used (see for instance the definition of tests by Pedro), but in fact it's just the other way round: it's because you use probabilities (as mathematically defined) that the LLN holds and hence that you can interpret any probability introduced in your model (like a p-value) as a limit over independent repetitions of the exact same experiment, and that random sampling works. As this is more concrete than the abstract view, it's easier to use this last form, but it is a consequence of the use of the mathematical framework, not the foundation of it.
The fact that it is possible to build standard tools starting from this idea does not mean that these tools need this idea. It's just an educational convenience (I'm not sure my biology students would like stats if I presented a test to them as a mapping defined on a partitioned probability space...)
And because of this link between p and the LLN, if you use a probability, whatever it is, you can always interpret it as « the event occurs more frequently »... as long as you imagine i.i.d. repetitions of the experiment (because if not, the event occurred or not, full stop, but the probability of the event still exists and is unchanged (and useless, since you know what occurred)). When you play the lotto, for instance, you know that p(Win) = some value before playing. You know that it's low, so you have very little luck to win. But if you play only once, either you win or you don't. You don't need a limiting frequency to interpret this, right? But you can use it if you place it in the context of "several players tried" (for instance)... and it answers something else. So probability is much richer in interpretation than just frequentist. For me, it's the same thing for a p-value or a confidence interval: you make only _one_ experiment, so the probability you get only measures "more or less lucky" (on an objective background), nothing more.
So in short: probability is indeed an abstract measure of luck. The probability of an event can always be interpreted as the limit of a frequency of this event, because of the LLN, hence the practical presentation of methods in this way: once you define something as a probability _of an event_ (maybe there is confusion between probability, a function, and the probability of an event, a number), everyone must understand it the same way, which includes the limit frequency interpretation (there can be discussion on the exact nature of the event, however). But this frequency is for a very precise event, which is an abstract mathematical event (like « µ is in [A, B] » or « T < t_obs »), never something real (like « the treatment has an effect » or even « I observed an effect »). The link between the mathematical event and the one of interest depends on the experimental design, the current state of knowledge and many other things --- but not on statistics, to be a little bit provocative.
Eg: when I test « H0: treatment has no effect » vs « H1: treatment is efficient », what I really test is « H0 : \pi_1 = \pi_2 » vs « H1 : \pi_1 \neq \pi_2 » (where the \pi are probabilities on an abstract mathematical space on which a Bernoulli random variable on {Yes, No} is defined). And whether "H1 is true" really means « the treatment is efficient » has nothing to do with stats or the mathematical models, but with questions like "did I consider possible biases in my experimental design" and things like that.
PS: note also that many test statistics are precisely signal-to-noise ratios, so p-values are somehow a generalization of a normalized signal-to-noise measure...
PS bis: since science is based on the repetition of experiments to confirm an effect, and this is the gold standard to « prove » something in science, do you really think that the question « how likely will I get p-values lower than this one when there is no effect? » is really not interesting? Isn't it precisely the Fisher approach to the p-value you mention: is it worth redoing the experiment to confirm what I saw?
Dear Pedro,
If I get it right, what you present is the following dichotomy. […]
Exactly.
To see that it is not, you need to stop regarding it [LLN] as a statement about the real world, and regard it as a piece of abstract mathematics.
I try. But I get lost. Can we take, for the sake of simplicity, the Bernoulli theorem? Whatever I can find to read about it, it states:
“in a sequence of independent trials, in each of which the probability of occurrence of a certain event A has the same value p, 0 < p < 1, the probability P(|µn/n - p| > epsilon) tends to 0 as n -> Inf, for any epsilon > 0. Here, µn is the number of occurrences of the event in the first n trials and µn/n is the frequency of the occurrence.”
I can see it “abstract” in the sense that any sigma-measure can be used, and that “probability” is just a name for any not further specified measure. But I cannot abstract away the relation to the limiting frequency.
For instance we may substitute p by “a part of a meter” (a fractional length with 0 < p < 1) and, instead of the “occurrence of event A”, think of placing a mark somewhere on the meter. What does “independent” mean? Here is my original question popping up again. As far as I understand, this “independence” must mean that, in the long run, the mark will be placed at each position with the same limiting frequency.
Dear Terry,
What is the basis of the P in that law? It can't be a limiting relative frequency, can it? If so we need a meta theorem about repeating the whole repeatable sequence, then a meta meta theorem, and so on.
Yes. That’s what I meant.
One alternative is to assume as an axiom that the limiting relative frequency exists. Another is to leave probability as an undefined measure that obeys certain axioms.
I think leaving it as an “undefined measure” is not acceptable, since it is used in practice, in research, and people’s lives depend on its meaning (e.g. via drug testing and approval).
It remains debatable if there is a possibility to link it to something objective, be it even in an abstract way, or to state precisely that it relates to a degree of belief. But this was not the question. The question arises from the meaning of a confidence interval or the p-value of a hypothesis test and their interpretation. This interpretation is frequentistic (to the best of my knowledge), and their calculation is strictly connected to "random sampling". This random sampling (let it be stratified or whatever) is the important prerequisite so that the p-values and CIs have a valid frequency interpretation - and this interpretation is only correct when the limiting frequencies of the sampled items are all equal. Therefore the question: how is random sampling producing equal limiting frequencies? The answer was: because of the LLN, but there I stumble over a circular definition of limiting frequency = probability = limiting frequency = probability = ...
Dear Jochen,
Just a very quick answer for the moment on one point:
What does “independent” mean? Here is my original question popping up again. As far as I understand, this “independence” must mean that, in the long run, the mark will be placed at each position with the same limiting frequency.
No, independence does not mean that; what you define here is equiprobability (or a uniform law on [0, 1]), which is a completely different thing (or I misunderstood your sentence, and maybe you should explain more formally what you mean).
Independence means that when you place the mark for the second time, the fact that you know its position from the first time does not help you predict its new position. It would be (approximately) true if you set it with an arrow (a dart?) you throw at the segment; it would be wrong if it is a cursor you move by a certain amount...
You can have dependence & equiprobability, or independence & non-equiprobability (imagine the dart thrower is a good one and targets the first third of the segment...)
Confusions between randomness, equiprobability and independence are frequent, but it is important to really distinguish these notions...
The question arises from the meaning of a confidence interval or the p-value of a hypothesis test and their interpretation. This interpretation is frequentistic (to the best of my knowledge), and their calculation is strictly connected to "random sampling". This random sampling (let it be stratified or whatever) is the important prerequisite so that the p-values and CIs have a valid frequency interpretation - and this interpretation is only correct when the limiting frequencies of the sampled items are all equal.
No. One possible interpretation is as limiting frequencies, but it is neither the basis of their construction, nor mandatory for interpretation.
Random sampling is not necessary either for tests or for p-values; the only thing needed is randomness somewhere in your data. If you measure something ten times and get 10 different values, there is randomness, hence random variables, hence a confidence interval (if you want to build one), but no random sampling in the sense that you do not randomly assign any subject to anything yourself; the randomness is purely "external"... The only thing you need in order to define a confidence interval is a set of n (independent, in the simplest case) identically distributed random variables X1...Xn with a law depending on an unknown parameter. But this assumption also holds for any other statistical method, or you can't use probabilities... Random sampling is a way to get that, but not the only one.
Dear Emmanuel,
your explanations sound like voodoo in my ears. However, I think I see what you mean by "equiprobability is not required", and this seems to be more important than I thought. I now start seeing the LLN from a different angle:
When a "random" sample is drawn, all there is is the frequency of items in the sample. Even when n->Inf, the frequency is still the frequency in the sample, and not in the population.
When a "random" sample is drawn and the frequency of an A (in the sample!) stabilizes for n->Inf, then this limiting value is defined as p(A).
Therefore, the actual frequency of A in a sample is an estimate of p(A), and this tells us only something about the "sampling property" of A but not about its frequency in the population.
Is this correct?
And then: the p-value of an experiment is 0.01. This means that such (or more extreme) test statistics are sampled only rarely under H0, but we have no information about their frequency in the "population of similar experiments where H0 is true". It may thus be very common that such experiments do result in such (or more extreme) test statistics, but - for strange and unknown reasons - we observe them rarely (in the necessarily finite series of experiments we actually perform).
Is this correct? And if so: where is the point of all this?
I think I am a tiny step further now. Still far from resolving my problem, but a step further.
Hi Jochen,
I'm sorry if I'm not clear, and would be happy to give more details on anything unclear... I'm not quite sure of what you mean by voodoo...
When a "random" sample is drawn, all there is is the frequency of items in the sample. Even when n->Inf, the frequency is still the frequency in the sample, and not in the population.
I agree with you on this point
When a "random" sample is drawn and the frequency of an A (in the sample!) stabilizes for n->Inf, then this limiting value is defined as p(A)
From a mathematician's point of view, this is the old (before 1930) definition of a probability, but now it's not the case anymore: p(A) should have been defined before (along with several other things), and then you remove the word « defined » from your sentence. Note that you should also define what a « limiting value » is, since several definitions exist (limit in probability, in law [not applicable here], in mean square, almost sure...).
From what is called a frequentist point of view, yes it still is (as far as I understand it from this thread and the documents you sent), but a careful definition of several other things must have been done to give meaning to "random sampling" and to what the limiting value is. As discussed in at least one of the references you cite, there are difficulties with it (hence its abandonment in maths...), including the one you are facing, if I understand well.
I would say that, reading these papers, there may have been a slight misunderstanding (in my mind) about "what is a probability". I think (personal opinion again) that the question you raise is instead "assuming a probability, in the mathematical definition, exists, how can we characterize it" --- an analogy would be « f is an increasing function from R to R » [probability exists --- it is a function], OK, so now, how to compute/know f(3)? If you characterize f as f(x) = 2x + 3, for instance, then you can compute f(3). My opinion is that it is exactly what occurs here: probability is a function (not a number), and you want to know p(A) (a number which is a value taken by the function p for the special event A) for any A.
You can define p(A) = measure(A)/measure(all events) [analogy to length, area, volume...], you can define p(A) = lim n->+Inf (n[A]/n[events]) [frequentist approach, with the problem of defining what n[A]/n[events] is and so on...], you can define p(A) as « is the same as p(X
I don't see the problem. Sampling from an infinite population does have philosophical problems--what is the population when we take measurements of the speed of light, or of the amount of a food crop per hectare? But that's not the problem here. We are sampling from a finite population.
If we take a sample of size n without replacement from a population with N members then there are NCn possible samples. With simple random sampling these are all equally likely by definition. So long as we have a listing of the sampling frame and a source of random numbers we can take a random sample (practical difficulties aside). We can say nothing about this particular sample except that it is a random member of the NCn possible samples. But it is possible, in theory, to repeat the process a large number of times; and by the weak law of large numbers, characteristics such as whether some statistic is in a tail of the distribution will occur with a known relative frequency in the limit.
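(A standard counting check, added for completeness: under simple random sampling without replacement, a given unit appears in exactly (N-1)C(n-1) of the NCn equally likely samples, so its inclusion probability is (N-1)C(n-1) / NCn = n/N - the same for every unit, which is exactly the "equal sampling probability" asked about in the question.)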
If we are testing a hypothesis, and if the distribution does not satisfy that hypothesis, then we might expect more observations in the tail. So our single value in the tail is more evidence against the hypothesis than for it.
Therefore, as far as I can see, the question resolves to 'how can we believe the weak law of large numbers?' This is either an empirical rule, or deduced from the axioms of probability. The latter approach relies on a belief that events do have probabilities, whether or not we know their values. This seems like a propensity theory. We can verify limits empirically, but we can't verify limit theorems empirically without infinite regress.
@ Terry: I agree with your description, but I do not see why limiting ourselves to finite-size populations is necessary, or what added value it has.
I think even the notion of « population » is too restrictive, although it has a clear (or at least thought-to-be-clear) meaning in several experimental contexts, including clinical trials. And it is not needed in probability theory; my belief (I would be glad to discuss it) is that a population is instead a way to allow generalization of experimental results to a wider set of individuals, hence is meaningful in clinical trials and the like (though what exactly the population is remains a difficult question), but it is a notion orthogonal to probabilities and event spaces: each can exist « alone » in some experimental contexts.
For instance, I can select patients non-randomly from a population and look at something on them (possibly exhaustively: elections...); there is no need for probabilities. Conversely, in the example you give (light-speed measurement), there is no need for a population notion, but a probability approach is a way to assess the measurement precision...
That's why I think limiting ourselves to (finite-size) populations may not help the understanding of probabilities so much, since they should be applicable in other cases also...
Hope I am clear, it's difficult thoughts and having to use a foreign language makes things even harder to explain clearly...
Emmanuel: I agree, but I thought the original question was about sampling from a finite population. This avoids many of the philosophical problems.
I'll leave questions about fictitious populations to philosophers, although it is interesting to wonder what the population is in repeated measures of the speed of light or patients in a clinical trial. The latter could be the finite population of all persons with a particular disease at a specific time, but that is not what we want to apply our results to. We want to apply results to future patients. But in the future the disease organisms could become resistant to the treatment, or better treatments could be found. So perhaps we must not look too far into the future. So what is the population? And was our sample even a random one from the present population? Perhaps we only sampled from the US, and not as randomly as we would have liked?
Hi Emmanuel,
From a mathematician's point of view, this is the old (before 1930) definition of a probability, but now it's not the case anymore: p(A) should have been defined before (along with several other things), and then you remove the word « defined » from your sentence. Note that you should also define what a « limiting value » is, since several definitions exist (limit in probability, in law [not applicable here], in mean square, almost sure...).
So I must consider that p(A) is defined. What is this definition of p(A)? And does “p” here stand for “probability” (if so, the question is: how is the probability of A defined?).
You also say that the kind of limit must be specified. I think here it is a “limit in probability”. There is again the term “probability”, for the whole expression. I mean: the weak LLN is
lim[n->Inf] Pr(|Xbar[n]-E(X)|>e) = 0
so the expression inside Pr(.) can be simplified as event C, so that it reads
lim[n->Inf] Pr(C) = 0
So the statement is that the probability of C tends to 0 when n goes to infinity. Do I understand this correctly?
If so, then my further question is about the event C. This event is based on values X1, X2, … which, citing Wikipedia, “... is an infinite sequence of i.i.d. Lebesgue integrable random variables with expected value E(X1) = E(X2) = ... = µ”.
There are three things that have to be understood: “random variable”, “iid”, and “expected value”. I think we can ignore the Lebesgue integrability here. I will go on using the explanations from Wikipedia (please correct me when these explanations are wrong):
(1) Random variable: “A random variable can take on a set of possible different values (similarly to other mathematical variables), each with an associated probability, in contrast to other mathematical variables.”. It is understood that –whatever the assigned probabilities may be– these probabilities obey Kolmogorow’s axioms.
An often employed example for such a random variable is the result of tossing a coin or a die. Let’s take the coin as a prototype Bernoulli experiment and name the two possible events A and B (shorter for “head” and “tail”). Now how will I be able to know if these events have probabilities at all? Or can I just assume/postulate that they do?
I further note that a “random variable” is defined operationally, that means that the procedure of obtaining the (possible) values is part of the definition of the variable. Therefore the often-cited property of “fairness” does not refer solely to the coin but to the combination of coin, tossing, and evaluating the result (the latter seems strange for a coin [there is no doubt which side is up], but if we’d talk about diagnosing a disease it may be that the diagnostic tool is unsuited to detect a particular disease; or the interpretation of the physical measurements [histological sections, X-ray images, whatever] is an evaluation step that may be subject to different rules or standards).
(2) iid: “In probability theory and statistics, a sequence or other collection of random variables is independent and identically distributed (i.i.d.) if each random variable has the same probability distribution as the others and all are mutually independent“
I understand this as: for each toss of the coin, p(A) must be the same (same for p(B)). Again I think that this is something that I can only assume or postulate for a given real-world example.
I do not really understand why both terms “independent” and “identically” are used. As I understand this, “independence” follows from identically distributed values (but not the other way around). But this may be a relevant misconception I have, so I would be happy if you could give me an example of non-independent but identically distributed values.
(3) Expected value: E(X) = X1 * p(X1) + X2 * p(X2) + …
The expected value is defined through the probabilities (or densities). These are the same probabilities/densities as for the “random variable” in (1)?
I understand that E is a weighted average of the values X1, X2, … I further see that the LLN states that the “limiting average” of a sample equals this weighted average, but this requires that the weight factors are assumed to be just such that the limiting frequency of each value getting into the sample equals this weight factor.
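(A worked instance, added for concreteness: code head as 1 and tail as 0, so the Xi are 0/1 values; then E(X) = 1*p(A) + 0*p(B) = p(A), and the sample average Xbar[n] is exactly the relative frequency of heads in the first n tosses - so the weak LLN statement above says precisely that this relative frequency converges in probability to p(A), whatever value the model assigns to p(A).)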
I still don’t get it. To my understanding this is trivial and still a circular reference.
My opinion is that it is exactly what occurs here: probability is a function (not a number), and you want to know p(A) (a number which is a value taken by the function p for the special event A) for any A.
But the problem remains: how to choose the function?
Therefore, the actual frequency of A in a sample is an estimate of p(A), - Yes
But that would mean that the function / the probabilities have to be chosen in a way that this estimate is sensible. I do not see any other argument for its sensibility except the fact that the limiting frequency tends to p(A). And this is a circular argument.
I do not understand what you mean by « sampling property of A »...
p(A) is a measure of how likely A is selected (given the operational definition of the random variable). This is a self-referring definition of p again (“likely” and “probable” are just synonyms, like “luck” btw.).
but not about its frequency in the population. - Well, about that, I'm uncomfortable because I don't really see what is a « frequency in the population », not even the need for a population.
Ok, accepted. I just thought things would be simpler if I first got an understanding of what all this means for a finite population, and then I could go on to see how this extends to “hypothetical” populations or to an understanding that is based on sampling alone without referring to any kind of population. (As you may infer: I was taught statistics based on samples and populations, and all statistics I was taught was about using samples to estimate parameters of a population. So I am mentally biased, but I try to overcome this.)
It is appropriate in some cases, but what is the population when you throw a die or play the lotto? And what is the frequency in the population if it has infinite size, especially uncountably infinite size, except 0 for any elementary event?
It is not about the frequency in the population, it is about the LIMITING frequency in the population. And there is another misconception I think:
When we consider a real country with N citizens (at some specified time), and the incidence of a disease should be estimated from a “random sample”, then the population does NOT refer to these N citizens but to the hypothetically infinite population of samples (!!) that can be imagined to be drawn from these N citizens.
The same applies for the speed-of-light example. The population is again the hypothetically infinite series of imaginable measurements.
I think this conceptual difference is relevant. Do you agree to this concept?
I'm not sure I quite understand your idea here. If the p-value of an experiment is 0.01, then it means that in the probability space you use to model your experiment, the event « observing these data [or more extreme] » (assuming H0 true) has probability 0.01, stop. The real question (?) would be « OK, but what is this probability space », which is a difficult question.
Not quite sure but I think that this difficult question is essentially what I intended to ask, originally. At least it is pretty much related.
@ Terry: I agree, and by rereading the question indeed it focused on populations; even if a finite size was not mentioned, the two examples were finite and the contrary was not mentioned either, so only Jochen can clarify.
The examples you give reinforce my belief that the population notion, which may or may not be important, is not at the level of the definition of probabilities/statistics, but at the level of generalizing experimental results; hence it is needed in some cases and out of scope in others.
Hi Jochen
A very quick answer, on a very very important point; I'll try to make a more complete one later on the other points. RG is really missing a citation tool :(
I understand this as: for each toss of the coin, p(A) must be the same (same for p(B)). Again I think that this is something that I can only assume or postulate for a given real-world example.
This is only defining the "identical", not the "independent" part.
I do not really understand why both terms “independent” and “identically” are used. As I understand this, “independence” follows from identically distributed values (but not the other way around). But this may be a relevant misconception I have, so I would be happy if you could give me an example of non-independent but identically distributed values.
No, independence and identical distribution are really two orthogonal things. Identical is what you described above; independence is the fact that the result of one experiment does not give any insight into the result of the next experiment.
Two examples of identical distributions which are not independent at all:
1) You toss a fair coin. Then for the second "toss", you simply put your hand on the coin and remove it. What is the probability of "heads" for the first toss? For the second toss (if you did not look at the first one)? For the second one, if you know that the first one is heads? So, as you see, if you know the result of the first one, you have information about the result of the second one (you can't expect more information, in fact). Nevertheless, the absolute probability of getting heads is 1/2 for the first and the second "toss", so identity.
2) You toss a fair coin ten times and count the number of heads (H) and the number of tails (T). H and T have the same distribution, by symmetry, but they are not independent, because when you know one, the other is also known.
There are many, many other such examples (imagine X is a N(0,1) variable; -X is also a N(0,1) variable which is not independent of X).
Conversely, you can have independence without identical distribution: you toss a fair coin for the first toss, and an unfair one for the second toss. Independence, but no identity...
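To make these examples concrete, here is a minimal simulation sketch of my own (Python, illustrative only): it checks the identical-but-dependent pair X and -X with X ~ N(0,1), and the independent-but-not-identical pair of a fair and a biased coin.

import random

random.seed(1)
n = 100_000

# Identically distributed but NOT independent:
# X ~ N(0,1) and Y = -X have the same distribution,
# yet knowing X determines Y completely.
xs = [random.gauss(0, 1) for _ in range(n)]
ys = [-x for x in xs]
print(sum(x > 0 for x in xs) / n)   # ~0.5
print(sum(y > 0 for y in ys) / n)   # ~0.5, same marginal distribution
print(sum(x > 0 and y > 0 for x, y in zip(xs, ys)) / n)  # 0.0, not 0.25 -> dependent

# Independent but NOT identically distributed:
# first toss with a fair coin, second with a biased one.
first = [random.random() < 0.5 for _ in range(n)]
second = [random.random() < 0.8 for _ in range(n)]
print(sum(first) / n, sum(second) / n)  # ~0.5 vs ~0.8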
Ok, I see. Thanks.
I thought that all the values X1, X2, ... must be from the same "variable", where "variable" means a complete operationalization. If you allow the operationalization to be different for different X's, then independence no longer follows from identity (in distribution), and in fact it is required as an additional assumption.
(an operationalization is, for instance: exactly what kind(s) of coin(s) is (are) tossed, what amount of variability is allowed in this tossing, and that the side showing up will be recorded as the outcome in a particular way)
@Jochen: I'm not sure the difference between independence and same distribution disappears when a complete operationalization is imposed. Imagine you have an urn of balls (a set of patients) with 2 red and 2 blue (2 smokers, 2 non-smokers). You randomly choose two balls [patients], one after the other, and you don't put the ball back in the urn [you don't want the patient twice in the trial]. Then X1 and X2 are identically distributed, but not independent, despite an identical « operationalization », if I understand your idea correctly (a small simulation is sketched after this reply).
However intuitive it may seem, independence and identical distribution do not imply each other in any way, I think... They are different assumptions, because they model different things: one is « the experiment does not change », the other is « the experiment result cannot be predicted from the results of previous experiments ».
Rather, « some operationalizations impose independence whereas others impose dependence » seems to me a better description. And the same thing for « identically distributed ».
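As a small illustration of the urn example above (my own sketch; the labels and counts are just the ones from the example): drawing two balls without replacement from an urn with 2 red and 2 blue gives identical marginal distributions for the two draws, but the draws are clearly dependent.

import random
from collections import Counter

random.seed(2)
n = 100_000
urn = ["red", "red", "blue", "blue"]

first_draws, second_draws, joint = Counter(), Counter(), Counter()
for _ in range(n):
    x1, x2 = random.sample(urn, 2)   # draw two balls without replacement
    first_draws[x1] += 1
    second_draws[x2] += 1
    joint[(x1, x2)] += 1

print({k: v / n for k, v in first_draws.items()})   # ~{red: 0.5, blue: 0.5}
print({k: v / n for k, v in second_draws.items()})  # ~{red: 0.5, blue: 0.5}, identical marginals
print(joint[("red", "red")] / n)  # ~1/6, not 1/4 -> the two draws are not independent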
@ Jochen, long answer...
> So I must consider that p(A) is defined. What is this definition of
> p(A)? And does “p” here stand for “probability” (if so, the question
> is: how is the probability of A defined?).
The answer strongly depends on the context I think. In some
circumstances, you will choose a model for p --- choice of your prior
in Bayesian analyses ; normality assumption in several methods... In
other circumstances, you just assume that p exists, and p(A) is
defined [in the mathematical sense, that is A is in the set on which
the function p is defined], and it is precisely the aim of the
experiment to discover what will be the value of p(A).
Note that mathematically, « p is defined » (or « p(A) is defined »)
does not mean we know its value; it just means it exists and we can
handle it using its properties, assumed or deduced from theorems. And
to say « p is a probability », at least in usual mathematical context,
it must be defined on a special kind of set and have the 3 properties
mentioned in a previous answer.
> You also say that the kind of limit must be specified. I think here
> it is a “limit in probability”. There is again the term
> “probability”, for the whole expression. I mean: the weak LLN is
> lim[n->Inf] Pr(|Xbar[n]-E(X)|>e) = 0
That's a possible expression, yes.
> so the expression inside Pr(.) can be simplified as event C, so that it reads
> lim[n->Inf] Pr(C) = 0
> So the statement is that the probability of C tends to 0 when n goes
> to infinity. Do I understand this correctly?
Yes, I think so.
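Purely as an illustration (my own sketch, not part of the argument): one can check numerically how Pr(|Xbar[n] - E(X)| > e) shrinks with n for a fair coin with E(X) = 0.5.

import random

random.seed(3)
e = 0.05          # the tolerance in the WLLN statement
reps = 1000       # number of simulated samples per n

for n in (10, 100, 1000, 10000):
    exceed = 0
    for _ in range(reps):
        xbar = sum(random.random() < 0.5 for _ in range(n)) / n
        if abs(xbar - 0.5) > e:
            exceed += 1
    print(n, exceed / reps)   # estimated Pr(|Xbar_n - 0.5| > e), decreasing in n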
> If so, then my further question is about the event C. This event is
> based on values X1, X2, … what, citing Wikipedia, “... is an
> infinite sequence of i.i.d. Lebesgue integrable random variables
> with expected value E(X1) = E(X2) = ...= µ.
> There are three things that have to be understood: “random
> variable”, “iid”, and “expected value”. I think we can ignore the
> Lebesgue integrability here. I will go on using the explanations from
> Wikipedia (please correct me when these explanations are wrong):
I think you should really read a mathematical course of probability,
it would certainly help much.
> (1) Random variable: “A random variable can take on a set of
> possible different values (similarly to other mathematical
> variables), each with an associated probability, in contrast to
> other mathematical variables.”. It is understood that –whatever the
> assigned probabilities may be– these probabilities obey Kolmogorow’s
> axioms.
That's not a very precise definition. I prefer « a random variable is
a function defined on a probabilized space [that is a space/set with
associated probability defined] », which associates each event of the
space to something in another space, usually a number (« real random
variable ») but not always, « and such that it is compatible with the
definition of probability » [this second part is not clean at all
mathematically, but I don't know how to formulate it without technical
terms like « measurable », especially in English; see below for what it
means on an example].
> An often employed example for such a random variable is the result
> of tossing a coin or a die. Let’s take the coin as a prototype
> Bernoulli experiment and name the two possible events A and B
> (shorter for “head” and “tail”). Now how will I be able to know if
> these events have probabilities at all? Or can I just
> assume/postulate that they do?
In that case, the « universe » will be \Omega = {A, B}, the event
space will be the set of all possible subsets of \Omega, that is
P(\Omega) = {\emptyset, {A}, {B}, {A, B} = \Omega}. There are four events,
including two elementary events, {A} and {B}.
From a practical point of view, you can't know if there is a
probability defined.
From a mathematical point of view, the question is instead « can I
define a probability on this set ? », that is can I define a function
from P(\Omega) to [0, 1] that satisfies the three axioms?
The answer is here obviously yes ==> on the practical side, you can
safely assume they do have probabilities associated.
Example: p(\emptyset) = 0 [mandatory for p to be a probability],
p( {A} ) = a with 0 <= a <= 1, p( {B} ) = 1 - a, and p( {A, B} ) = p(\Omega) = 1.
> I further note that a “random variable” is defined operationally,
> that means that the procedure of obtaining the (possible) values is
> part of the definition of the variable.
No, I don't think so, at least in the form I gave you above. For
instance, if I define X as the random variable that associates « red »
to A and « green » to B, it is a random variable, but nowhere did I
have to define how I toss the coin, how I record the results...
I think you try to « stick » to concrete interpretations in terms of
experiments and so on, but mathematics doesn't care about that. It's up
to you afterwards to choose the right model/maths parts to suit your
practical example.
But it may also mean we do not give the same meaning to « define »: I
may have a more mathematical (or looser?) definition, « something
is defined if it exists and has some properties », and you may have a
more precise definition, needing to detail everything... In that case,
with your definition, I agree with your note, but not for « a » random
variable (in general), but for « the » random variable that will model
your precise experiment... You then « lose » all the generality of
mathematical theorems that apply to all kinds of random variables, and
not only the special one you need...
> Therefore the often-cited
> property of “fairness” does not refer solely to the coin but to the
> combination of coin and tossing and evaluating the result (the
> latter seems strange for a coin [there is no doubt which side is
> up], but if we’d talk about diagnosing a disease it may be that the
> diagnostic tool is unsuited to detect a particular disease; or the
> interpretation of the physical measurements [histological sections,
> X-ray images, whatever] is an evaluation step that may be subject
> to different rules or standards).
Agreed, for the definition of an experimental protocol and the choice of
a distribution for the random variable you need. It is too easy to make a
protocol that will conclude « biased coin » with an unbiased one...
> (2) [I skip this, since we already discussed it]
> (3) Expected value: E(X) = X1 * p(X1) + X2 * p(X2) + …
Well, it's not clear here whether X1 is a random variable or the first
value that X can take, and so on for X2... What you write is true
only in the second case [as for probability, confusion between the
function and its values will lead to at least difficulties of
understanding...], so I assume this is what you meant.
> The expected value is defined through the probabilities (or
> densities). These are the same probabilities/densities as for the
> “random variable” in (1)?
Yes. If X has values in a discrete subset of R, expectation can be
defined as you wrote. But p is technically not exactly the same.
Let's continue the coin example, and let X be the random variable
defined as the indicator of the result A: it takes the two values 0
and 1, with X(A) = 1 and X(B) = 0 --- note it is defined on \Omega,
not P(\Omega), unlike the probability.
You can then define the results {X = 1} and {X = 0}, and define a
probability p' on the events defined from these results, exactly as
before (just replace {A} and {B} by these two events).
If X is a random variable, then by definition p'({X = 1}) = p({A}) and
the same for p'({X = 0}) = p({B}) and all other events. But p' and p
are not the same functions, since they are not defined on the same
space... However, since p'({X = ...}) = p( X^-1({...}) ) --- the X function
creates a probability on the arrival space that mimics the one on the
starting space --- for convenience we also use p.
It is in fact p' which appears in the formula of the expectation
value,
E(X) = 0 * p'({X = 0}) + 1 * p'({X = 1})
= 0 * p({B}) + 1 * p({A})
= p({A}) = a
(but nobody will keep the distinction between p and p')
Note that this variable follows the classic Bernoulli law...
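If it helps, here is the whole little construction written out as a sketch (my own code; the value a = 0.3 is an arbitrary choice): the universe \Omega = {A, B}, a probability p on P(\Omega), the indicator variable X, the induced p', and the expectation E(X) = p({A}) = a.

from itertools import combinations

omega = {"A", "B"}                                                   # the universe
events = [set(c) for r in range(3) for c in combinations(omega, r)]  # P(omega): the 4 events

a = 0.3                                    # an arbitrary choice with 0 <= a <= 1

def p(event):                              # a probability on P(omega)
    return a * ("A" in event) + (1 - a) * ("B" in event)

X = {"A": 1, "B": 0}                       # random variable: the indicator of A, defined on omega

def p_prime(x):                            # induced probability: p'({X = x}) = p(X^-1({x}))
    return p({w for w in omega if X[w] == x})

E_X = sum(x * p_prime(x) for x in (0, 1))  # E(X) = 0 * p'({X=0}) + 1 * p'({X=1})

print(len(events))                              # 4 events in P(omega)
print(p(set()), p({"A"}), p({"B"}), p(omega))   # 0, a, 1-a, 1
print(E_X)                                      # equals p({A}) = a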
> I understand that E is a weighted average of the values X1, X2, … I
> further see that LLN states that the “limiting average” of a sample
> equals this weighted average, but this requires that the weight
> factors are assumed just be so that the limiting frequency of each
> value getting into the sample equals this weight factor.
There is no assumption on the weight factors; they are the ones defined
by the probability on the original space, whatever it is, as long as it
is a probability (and X is a random variable, and it has an
expectation [this is what "Lebesgue integrable" means]).
> I still don’t get it. To my understanding this is trivial and still
> a circular reference.
I hope that with this you see the absence of a circular reference: you assume
there exists a probability, and then mathematics alone shows you, by the
WLLN (or the strong one, even better), that iid experiments will give
the values taken by the probability function (« the probabilities of
the events ») or more generally the expectation value.
> > My opinion is that it is exactly what occurs here: probability is
> > a function (not a number), and you want to know p(A) (a number
> > which is a value taken by the function p for the special event
> > A) for any A.
> But the problem remains: how to choose the function?
Exactly. Well, that's not the mathematician's problem ;) More
seriously, as suggested before, either _you_ do that as an assumption
of your model (priors, Gaussian, Poisson, whatever you want) --- but in
general, this choice only picks a subset of all the possible
functions --- or, most often, the aim of your experiment is precisely to
discover which function is at work in your setting.
> Therefore, the actual frequency of A in a sample is an estimate of
> p(A), - Yes
> But that would mean that the function / Probabilities have to be
> chosen in a way that this estimate is sensible. I do not see any
> other argument for its sensibility except the fact that the limiting
> frequency tends to p(A). And this is a circular argument.
I think what you say applies to the interpretation of the result of
the experiment: LLN and all above ensure that indeed, what you get is
the probability of event A, no other choice. That this probability is
really the one you need is another problem, related to your
experimental design, confounding factors, biases and whatever you can
imagine. Or in other words, it's p(A), OK, but is A really what you
want?
For example, to evaluate a treatment effect, A should be « cured »,
but when you do the experiment, does the p(« cured ») you get really
apply to all other patients, is it really because of the treatment, and
so on?
> I do not understand what you mean by « sampling property of A »...
> p(A) is a measure of how likely A is selected (given the operational
> definition of the random variable). This is a self-referring
> definition of p again (“likely” and “probable” are just synonyms,
> like “luck” btw.).
Defined like this, yes, removing the word « selected », which is
useless given all the maths above. That is exactly what we want, no?
That p(A) is a measure of how likely A is. But note we don't have to
impose precise values on it; we just have to assume that _p_ has a few
properties that are somehow « common sense » for what should be a
measure.
Note that the same difficulties arise for length: what is length?
length of an object may be easy to determine, but length in general?
It's exactly the same thing in fact, and mathematically modeled just
the same way (except that normalization to 1 is not required).
> but not about its frequency in the population.
> > Well, about that, I'm uncomfortable because I don't really see
> > what is a « frequency in the population », not even the need for a
> > population.
> Ok, accepted. I just thought things would be simpler when I first
> get the understanding of what all this means for a finite
> population, and then I could go on to see how this extends to
> “hypothetical” populations or to an understanding that is based on
> sampling alone without referring to any kind of population. (as you
> may infer: I was taught statistics based on samples and
> populations, and all statistics I was taught was about using samples
> to estimate parameters of a population. So I am mentally biased but
> I try to overcome this).
I think population instead adds confusion, but may be because I was
never taught statistics, « only » probabilities, and learned
statistics by myself from that.
> > It is appropriate in some cases, but what is the population when
> > you throw a die or play the lotto? And what is the frequency in the
> > population if it has infinite size, especially uncountably infinite
> > size, except 0 for any elementary event?
> It is not about the frequency in the population, it is about the
> LIMITING frequency in the population.
May be, but still what is the population in such contexts?
> And there is another misconception I think:
> When we consider a real country with N citizens (at some specified
> time), and the incidence of a disease should be estimated from a
> “random sample”, then the population does NOT refer to these N
> citizens but to the hypothetically infinite population of samples
> (!!) that can be imagined to be drawn from these N citizens.
Somehow yes, the population is not the physical one, but I don't think
the population is the set of samples. I think in that case, it is at best
an abstract one, of all possible citizens including future
ones. However, I think it's not even necessary to have one: it's the
instantaneous probability of getting ill.
> The same applies for the speed-of-light example. The population is
> again the hypothetically infinite series of imaginable measurements.
I don't agree, and still don't see the need for a population in that
case either.
For me, in that case, there is a space of possible random events that
will add error to the observed/measured value, events that have a physical
origin (including the inherently random events of quantum mechanics if
needed) [slight changes in currents, ...] and that we cannot model,
control or handle... To each of these possible situations you can assign
a different result value; the original probability is on the set of
all these situations, and translates through a relevant random
variable into probabilities associated with the experiment results.
For a coin, the original space would be all possible combinations of
strength, angle, wind... that translate into heads or tails.
But maybe that original space is what you call the population? In
that case, I agree, but I dislike the word (maybe that is just my problem
anyway...)
> I think this conceptual difference is relevant. Do you agree to this
> concept?
Not really, but maybe just because I use another word, as in the
examples above.
> > I'm not sure I quite understand your idea here. If the p-value of
> > an experiment is 0.01, then it means that in the probability space
> > you use to model your experiment, the event « observing these
> > data [or more extreme] » (assuming H0 true) has probability 0.01,
> > stop. The real question (?) would be « OK, but what is this
> > probability space », which is a difficult question.
> Not quite sure but I think that this difficult question is
> essentially what I intended to ask, originally. At least it is
> pretty much related.
Not quite sure: we're not interested in exactly what could be the
cause of randomness in the result/test; we just have a result that
says that, whatever occurred when making the experiment, we should not
have obtained the result we got if our H0 model were true --- all of
that smoothed with a probability, as always.
In this sampling technique, the researcher must guarantee that every individual has an equal opportunity for selection, and this can be achieved if the researcher utilizes randomization. You may write down the name of each of your friends on a separate small piece of paper and fold all the pieces of paper so no one knows which name is on any paper. Then you ask someone to pick 5 names and you give chocolates to the first 5 names. This removes the bias without hurting any of your friends' feelings. The way you did this is what we call randomization. This URL may be helpful:
www.nbt.nhs.uk/.../Randomisation_in_clinical_trials.pdf
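For what it's worth, the paper-slip procedure is essentially one call to a sampling routine; here is a minimal sketch (my own, with made-up names) using Python's random module:

import random

friends = ["Ana", "Ben", "Chen", "Dara", "Eli", "Fatima", "Gus", "Hana"]
winners = random.sample(friends, 5)   # each friend has the same chance of being picked
print(winners)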
Randomization is a necessary condition for many of the tests, because it makes the probability distribution of the random error reliable and hence the distribution of the test statistic calculable. In this sense, the validity of the test depends on randomization.
Dear Jochen
Not all random samples guarantee you equal probabilities. In finite-population sampling, random sampling means probability sampling, wherein all units in the population have some probability of being selected into the sample. If units have equal probabilities, you get equal probability sampling, and if units have unequal probabilities, you get unequal probability sampling. Hence random sampling need not guarantee equiprobability.
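As a small illustration of this distinction (my own sketch, with made-up unit labels): equal probability sampling can be mimicked with simple random selection, unequal probability sampling with selection weights, e.g. proportional to a size measure.

import random

units = ["u1", "u2", "u3", "u4", "u5"]

# Equal probability sampling: every unit has the same chance of selection.
equal_sample = random.sample(units, 2)

# Unequal probability sampling: selection probabilities proportional to a size measure.
sizes = [10, 1, 1, 1, 1]                                     # u1 is ten times more likely per draw
unequal_sample = random.choices(units, weights=sizes, k=2)   # note: with replacement

print(equal_sample, unequal_sample)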
Sharada, you are right, but the problem is that random samplings "need not guarantee equiprobability", as you say, nor do they guarantee a particular multi-probability model - or a priori distribution model.
If a univariate random sample of N=100 observations has 3 close observations once ordered, like 10.2, 10.0 and 9.8 for a range that goes from 5 to 20 units, it is simpler to assume that these 3 points together have a frequency of 3%, with an average value of 10 units for the 3 points, than to assume that they belong to a particular model - like the normal one or any other.
I mean that if the random sample brings some set of values, it is because they are real in nature, so they are probable, even if not perfectly representative or even if the next sample looks different. And given that we must interpret them as information, we need to assign frequencies to each value and also to average repeated close values, to get something close to the mean value of that 3% portion. Randomness and repetition may damage any frequency assumption, but it is worse to refuse it if we want to infer some general conclusion from the observing effort.
That is my vision, accepting that it is prone to error in not a few instances and that time produces unexpected changes - OK, thanks, emilio
There have been many sensible statements in this discussion (e.g. 'random sampling' does not always mean 'equal probability sampling') but they have departed a little from the original question, which I paraphrase as 'How can we trust sampling distributions?'
We know that if probability is defined as long run relative frequency then the law of large numbers is a circular argument because the probability of any deviation is also defined as a relative frequency. So we can't use this to prove that relative frequency converges in probability (or almost surely).
I don't think we can give a satisfactory proof. We either assume convergence or use a different definition of probability. Then the laws of large numbers say that relative frequencies are estimators of probabilities.
A pragmatic solution is to treat statistics as science rather than logic or mathematics. Science relies on empirical observation as well as a certain amount of wishful thinking. (We wish that Occam's razor applies so that everything is as simple as possible and elegant. Before they were tested, Einstein's theories, for example, were based more on elegance than on observation.)
Likewise in statistics, it might be better to treat the convergence of relative frequencies as empirical. But at least the law of large numbers shows some sort of consistency.
However, one of Jochen's references mentions the red-blue sequence. In a sequence of tosses of a fair coin, return blue if there are more heads than tails and red if there are fewer; when the numbers are equal, return the previous colour. Start with a random red or blue. An observer who doesn't know how the sequence is formed will not easily determine that red and blue are, in fact, equally likely. Feller considers a model for two queues: when one queue is ahead, it takes an infinite expected time for the other to catch up. If the red-blue sequence is similar, the expected run length is infinite. The difference between the numbers of heads and tails can be arbitrarily large, and if it is large it will certainly precede a long run of red or blue. In such cases relative frequencies could be quite misleading (a small simulation is sketched after this post).
Is that your concern, Jochen?
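Here is a small sketch of the red-blue sequence as described above (my own code, illustrative only); it shows that although the two colours are equally likely by symmetry, a single run of the process can stay on one colour for a very long time, so observed relative frequencies can be far from 1/2.

import random

random.seed(4)
n = 100_000
heads = tails = 0
colour = random.choice(["red", "blue"])   # start with a random colour
run_length, longest_run = 0, 0
counts = {"red": 0, "blue": 0}

for _ in range(n):
    if random.random() < 0.5:
        heads += 1
    else:
        tails += 1
    if heads > tails:
        new_colour = "blue"
    elif heads < tails:
        new_colour = "red"
    else:
        new_colour = colour               # tie: keep the previous colour
    run_length = run_length + 1 if new_colour == colour else 1
    longest_run = max(longest_run, run_length)
    colour = new_colour
    counts[colour] += 1

print(counts)        # by symmetry the colours are equally likely, but observed frequencies can be lopsided
print(longest_run)   # single runs can be extremely long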
Yes, Terry, this pretty exactly hits my concern.
There is one point in your answer that I am still not comfortable with: "Then the laws of large numbers say that relative frequencies are estimators of probabilities." - my interpretation is different, and I think this difference is relevant:
The law of large numbers says that "taking probabilities as limiting frequencies (of realizations, not [necessarily] of values in a population)" is consistent: rare realizations are those of outcomes with low probability, and the probability of observing such rare outcomes is low, which in turn means that such observations will be rare. There is no logical contradiction within this frame of arguments, but it does not demonstrate that taking probabilities as such limiting frequencies is sensible in the first place. This sensibility is given only empirically. We take (observed) frequency distributions and interpret them as probability distributions, so we can calculate p-values or confidence intervals based on probability theory that, eventually, are again interpreted to have a "limiting frequency interpretation".
I tend to take a pragmatic point of view. Intuitively, long run relative frequencies converge. Many experiments have been carried out to verify this. Rate of convergence can sometimes be a problem, but we design our experiments so that we expect a reasonable sample size to be accurate enough.
The red-blue problem is not an example of independent trials--the successive colours are strongly related, so I wouldn't expect the law of large numbers to apply. Therefore I doubt if such pathological examples would arise in real experiments.
I'm sorry I can't give a convincing proof that statistics works as advertised, but I take comfort from the fact that no one else has done so.
Terry, I have mentioned in another RG question that if the first estimated parameter is the mean, obtained as the arithmetic average, this also implies that each sample value is assumed to be an average and that each frequency is equal to 1/N for the N measurements. If we add another a priori distribution model that brings different frequencies and more parameters, then we start to work with a double standard that contradicts and messes up the rationality of the whole analysis. Therefore I prefer to assume that the sample is representative of itself, given the assumed equal frequencies, even if it is not so representative of a bigger sample. That methodological doubt must be resolved with more replications, and researchers work until they consider there is enough evidence and enough controls to guarantee their conclusions. Thanks, emilio
Sometimes purposive selection may also give a more representative sample than random selection.