Adilson, this expression sounds familiar but strange to me. Representativeness is always a concern in sampling. If I may be "philosophical": is there also such a thing as non-representative sampling? Please bear with me. Ed
Non-representative sampling is very common. Over-sampling some categories can improve efficiency, but may also introduce bias. Non-representative samples can prove very useful, but they have to be handled with caution. If you do not master the topic, it is simpler to use representative samples, i.e. samples in which the distribution of your variables of interest is the same as in your population.
Yes, simpler, but useless if the variable you record is categorical and some categories you are interested in occur in very small proportions: in this case, you may very well find yourself with no individuals from these categories.
Stratification (leading to such a "non-representative" sample) can help. But you are right, specific statistical tools then have to be used to analyse the data, and a drawback is that your sample does not allow you to estimate the proportions of the different categories you are interested in.
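To illustrate the point about rare categories, here is a small sketch with assumed numbers of my own (population size, category share, sample sizes are not from this discussion): a proportional sample can easily contain no units at all from a 0.2% category, while a stratified design that deliberately over-samples it guarantees coverage; the resulting sample then no longer reflects the population proportions without further adjustment.

```python
# Hypothetical illustration: a rare category can vanish from a proportional
# ("representative") sample, while a stratified design that deliberately
# over-samples it guarantees coverage.
import numpy as np

rng = np.random.default_rng(42)

N = 100_000                             # population size (assumed)
rare_share = 0.002                      # 0.2% of units belong to the rare category
category = rng.random(N) < rare_share   # True = rare-category member

n = 500                                 # total sample size

# Proportional (self-weighting) simple random sample
srs_idx = rng.choice(N, size=n, replace=False)
print("rare units in SRS:", category[srs_idx].sum())   # often 0 or 1

# Stratified design: force 100 of the 500 draws to come from the rare stratum
rare_idx = np.flatnonzero(category)
common_idx = np.flatnonzero(~category)
n_rare = min(100, rare_idx.size)
strat_idx = np.concatenate([
    rng.choice(rare_idx, size=n_rare, replace=False),
    rng.choice(common_idx, size=n - n_rare, replace=False),
])
print("rare units in stratified sample:", category[strat_idx].sum())  # 100
```

The unweighted stratified sample, of course, no longer mirrors the population proportions, which is exactly the drawback mentioned above.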
A sample is a subset of a defined population. We always try to draw a sample from the population, by probability or non-probability methods, with small sampling error and minimal bias. Probability sampling with small sampling error and minimal bias reflects the distribution of the defined population, so a probability sample may be representative of the population. Non-probability sampling represents only restricted areas.
"Representativeness" in a sample, as indicated by answers above, may mean different things to different people.
Consider this, regarding sampling from finite populations: At one place I worked for many years, I had a supervisor who threw that term (representative sample) around continuously. It made it into documents declaring that our samples were "statistically representative." What he likely really meant to convey was a sense that what we did was valid. I think most people do take that to mean that probability (random-based) methodology is used, which made it somewhat amusing when I noticed he was still using the same wording long after I switched to model-based estimation and quasi-cutoff sampling. :-)
In the 1940s there was much controversy between proponents of randomized sampling and 'purposive' sampling. Randomized sampling won. As I think Ken Brewer once put it, people conceive of a random sample as "fair." And I suppose that purposive sampling back then might generally have meant someone's expert opinion of a 'representative' sample. But even if it were better, how do you do inference from such a purposive sample? What would the variance be?
I read some old idea about what 'representative' might mean. I wish I could remember where I saw it. I think it was even different than I imagined.
Today there are a number of different forms of purposive sampling. Mike Brick (great name for a 1940s private detective, like Sam Spade or Mike Hammer, eh?) was the Washington Statistical Society's (ASA chapter) President's Invited Seminar speaker (last March, I think), and he talked about how varied the types of purposive sampling are. My feeling is that by far the best occurs when you have good regressor data to do model-based estimation, and you know how to stratify so that each model application is only applied to data which should be modeled together.
My experience is mostly in establishment surveys, with continuous data, and there are basically three methods in use. Ray Chambers (an Australian, as is Ken Brewer), had a seminar a couple of years ago, where he used this same breakdown, as would many other survey/mathematical statisticians:
(1) design-based (randomized) sampling and estimation methodology,
(2) model-based (regression) methodology, and
(3) model-assisted design-based methodology.
Note that using a model-assisted design-based method means that you sample using randomized methods, but your estimation takes into account 'auxiliary data' (which are really 'regressor' data under model-based estimation). The advantage is that the auxiliary data (on the entire population) adjusts your results to compensate for random sampling that may not be very 'representative.' That is, if you randomly picked only the smallest or only the largest members of a population in your random sample, the 'model-assisted' part would compensate to a degree, during the estimation phase.
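To make that "model-assisted" idea concrete, here is a minimal sketch with made-up numbers and variable names of my own (the lognormal population, the sample size, and so on are assumptions, not anything from this post): the simple ratio estimator uses the known population total of the auxiliary variable to correct an expansion estimate from an "unlucky" random draw.

```python
# Sketch of the model-assisted idea: a ratio estimator that uses
# population-wide auxiliary data x to correct a simple expansion estimate.
import numpy as np

rng = np.random.default_rng(1)

N = 2_000
x = rng.lognormal(mean=3.0, sigma=1.0, size=N)   # auxiliary data, known for all N units
y = 2.5 * x * rng.normal(1.0, 0.15, size=N)      # survey variable (only sampled values observed in practice)

true_total = y.sum()

n = 40
s = rng.choice(N, size=n, replace=False)         # simple random sample

# Pure design-based expansion estimator (no auxiliary information)
expansion = N * y[s].mean()

# Model-assisted ratio estimator: rescale by how well the sample covered x
ratio = x.sum() * (y[s].sum() / x[s].sum())

print(f"true total            : {true_total:,.0f}")
print(f"expansion (N * ybar)  : {expansion:,.0f}")
print(f"ratio (model-assisted): {ratio:,.0f}")
```

When the draw happens to under- or over-cover x, the ratio term scales the estimate accordingly, which is exactly the compensation described above.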
In random sampling (simple or stratified) you have survey weights. (Even probability proportionate to size - PPS - sampling has a sort of built in weight for each respondent.) For random sampling, we generally use a "w" for this survey design weight. For model-based estimation, we also generally use a "w" for the regression weight. However, one type of model-assisted design-based methodology uses "calibration weights," again "w," which combine both survey design weights (magically renamed "d"), and regression weights (now named "c" or another letter I forget at the moment). Calibration weights are survey weights that are modified to account for the better representativeness that a model can impose. Models can also be used to help decide what design-based method is used, as I have seen in work by Ken Brewer, and furthered later by Anders Holmberg (Statistics Sweden). Also, Ken Brewer wrote a book published in 2002 on better combining design-based and model-based methods.
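As a minimal sketch of the calibration idea, assuming a single auxiliary variable and a simple random sample (all names and numbers below are my own, not from the post): the design weights d are multiplied by one adjustment factor so that the weighted auxiliary total reproduces the known population total.

```python
# Minimal calibration-weighting sketch with one auxiliary total: design
# weights d are adjusted so that the weighted x-total matches X_pop.
import numpy as np

rng = np.random.default_rng(7)

N, n = 5_000, 100
x = rng.gamma(shape=2.0, scale=50.0, size=N)     # auxiliary variable, known for every unit
X_pop = x.sum()                                  # known population total of x

s = rng.choice(N, size=n, replace=False)         # simple random sample
d = np.full(n, N / n)                            # design weights d_i = 1 / pi_i

# Ratio calibration: one multiplicative adjustment g so that sum(w_i * x_i) = X_pop
g = X_pop / np.sum(d * x[s])
w = d * g                                        # calibration weights

print("weighted x total with d:", np.sum(d * x[s]).round())
print("weighted x total with w:", np.sum(w * x[s]).round(), "vs X_pop:", X_pop.round())
```

With several auxiliary totals, the adjustment is found by raking or a least-squares (GREG-type) fit rather than a single ratio, but the principle is the same.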
So what does representativeness mean? I think that many may mean that a "representative sample" should help you estimate well for a finite population. We may think random is best, but that is not necessarily true. As humans, we may first tend to think of randomization. But as noted more than once in the TV series "Numbers," we tend to confuse 'random' with 'uniform.' And here, as Sergio noted, we may mean that the sample has the same distribution as the population, finite or not.
I think that you can't beat having good "auxiliary"/regressor data, and good stratification. But to avoid the appearance of manipulation, having rules for selection is helpful. Some like "balanced" sampling, where you select a sample that has the same mean for its associated regressor (or linear combination of regressors) as the corresponding mean for the population. However, I have found that, looking at this from a "total survey error" point-of-view, it is much more accurate (representative?) for highly skewed establishment surveys not to do this. A technique that at least one other person and I have dubbed "quasi-cutoff sampling" appears to generally be much better for establishment surveys. But at any rate, the key is good regressor data and logical stratification. The goal here is to estimate aggregate-level information, however, so the sample is representative in the sense that it will yield good estimates for those aggregate values for the population - not necessarily in that it has the same distribution as the population.
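Here is a rough sketch of the quasi-cutoff idea as I understand it from the description above (my own simplification with assumed numbers, not necessarily the author's exact procedure): sample the largest units by the regressor x, then predict the contribution of the unsampled small units from the fitted ratio.

```python
# Cutoff-style sampling with ratio prediction for a skewed establishment
# population: observe the largest units by x, predict the remainder.
import numpy as np

rng = np.random.default_rng(3)

N = 1_000
x = rng.lognormal(mean=4.0, sigma=1.5, size=N)   # prior-period value (regressor), known for all units
y = 1.1 * x * rng.normal(1.0, 0.1, size=N)       # current value, observed only for sampled units

order = np.argsort(x)[::-1]                      # largest units first
take = order[:100]                               # cutoff sample: top 100 by x
rest = order[100:]

b = y[take].sum() / x[take].sum()                # ratio model fitted on the sampled units
estimate = y[take].sum() + b * x[rest].sum()     # observed part + predicted remainder

print(f"true total     : {y.sum():,.0f}")
print(f"cutoff estimate: {estimate:,.0f}")
print(f"x-total covered by sample: {x[take].sum() / x.sum():.1%}")
```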
Another view of "representativeness" in previous responses to the question above had to do with 'oversampling,' such as that proposed at times by the US Bureau of the Census, and others worldwide, I'm sure, to try to obtain good results for smaller segments of the population. That might be considered unrepresentative in some sense overall, or perhaps more representative, in order to not fail at estimating well for each and every segment of the population. I prefer to think of it as the latter. :-) At any rate, I think that this is yet another legitimate way to think of the word 'representative.'
For those who might be interested and do not read French, a short synopsis of the paper I mentioned above:
Section 2 discusses various "definitions" (or attempts at defining the term) in the literature and their often circular nature.
Section 3 gives a formal definition of a representative sample, as follows:
Definition 1
A characteristic of a population of size N is a vector of size N which records, for this population, the value taken by each population unit at a given time (e.g. the age of each person)
Definition 2
The set of characteristics of a population of size N is an N x K matrix which records, for this population, the values taken by each population unit for each of the K characteristics (e.g. age, height, socio-professional category (CSP), ...)
Definition 3: representative sample for a characteristic
A sample E composed of n units {u_i} (i in a set S) is representative of the characteristic C_k of a population P of size N if there is a probabilistic method for drawing a unit u_i from E such that the probability law of C_(i,k) (the value of this characteristic for a unit u_i taken at random in the sample) is equal to the empirical distribution F_N(C_k) of this characteristic in the population P
Definition 4: representative sample of a finite population
A sample E composed of n units {u_i} (i in a set S) is representative of a finite population P if there is a probabilistic method for drawing a unit u_i from E such that the joint probability law of (C_(i,1), ..., C_(i,K)) for a unit u_i taken at random in the sample is equal to the empirical joint distribution of the characteristics in the population P, that is F_{E_1}(C_1, ..., C_K) = F_N(C_1, ..., C_K)
Property 1
The population P is a representative sample of the population P
Property 2
Simple random sampling produces a representative sample of the population P
Property 3
Assume E is a sample of n individuals from a population P of size N, and that E has been obtained by a probabilistic sampling method; if there is a probabilistic way of drawing u_i in E with P(u_i is in E_1) = 1/N for all i = 1, ..., N, then E is a representative sample of P
Hence, "in words", a sample is representative if its construction is "equivalent" to the construction of a simple random sample
Property 4
If E is a sample of n individuals from a population P of size N, if E has been obtained by a probabilistic sampling method with known inclusion probabilities, and if these probabilities are all greater than or equal to 1/N, then E is a representative sample of P
Section 4 shows that the quota method builds a representative sample, in the sense of the definition of Section 3, if and only if each population unit has the same probability of being selected
Section 5 deals with a posteriori reweighting and reminds the reader that reweighting a non-representative sample does not achieve much in terms of representativeness
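As a small Monte Carlo check of Definition 3 and Property 2 (my own illustration, with assumed population sizes and an invented "age" characteristic): under simple random sampling, the law of the characteristic of a unit picked at random from the sample should match the empirical distribution F_N of that characteristic in the population.

```python
# Simulation check: draw a simple random sample E, then pick one unit u_i
# at random within E; over many repetitions the law of its characteristic
# should match the population's empirical distribution F_N.
import numpy as np

rng = np.random.default_rng(0)

N, n, reps = 1_000, 50, 20_000
ages = rng.integers(18, 90, size=N)                  # characteristic C_k: age of each unit

draws = np.empty(reps)
for r in range(reps):
    sample = rng.choice(N, size=n, replace=False)    # simple random sample E
    unit = rng.choice(sample)                        # unit u_i drawn at random within E
    draws[r] = ages[unit]

# Compare a few quantiles of the simulated law with F_N
qs = [0.25, 0.5, 0.75]
print("population quantiles:", np.quantile(ages, qs))
print("simulated quantiles :", np.quantile(draws, qs))
```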
I largely agree with your comments. However, let's examine this part of what you included: "In general, random samples provide a good approximation of the population and offer better assurance against sampling bias; thus are more representative than non-probability samples." Often that is true, but there are some drawbacks, even with regard to accuracy. First, for any kind of sampling, there are potential problems if you do not stratify when necessary. Perhaps most relevantly here, if the sample size (even by stratum) is small, then the chances of drawing a 'representative' sample at random can be very unsatisfactory. Depending upon the data distribution, the estimates of variance and bias can also be very inaccurate. But if you have auxiliary data on the population which can be used for modeling, then this will generally solve this problem. Then one might use a sample not necessarily taken at random. (See "balanced sampling," and for highly skewed establishment survey populations, cutoff or quasi-cutoff sampling with prediction and stratification.)
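As a hedged illustration of the small-sample point (the lognormal population and sample size below are my own assumptions): a single small random draw from a highly skewed population frequently gives a very poor estimate, even though the procedure is unbiased on average across repeated draws.

```python
# Repeated small simple random samples from a highly skewed population:
# unbiased on average, but any single draw can be badly off.
import numpy as np

rng = np.random.default_rng(11)

N, n, reps = 10_000, 20, 5_000
y = rng.lognormal(mean=0.0, sigma=2.0, size=N)   # highly skewed population values
true_mean = y.mean()

est = np.array([y[rng.choice(N, size=n, replace=False)].mean() for _ in range(reps)])

rel_err = np.abs(est - true_mean) / true_mean
print(f"average estimate / true mean : {est.mean() / true_mean:.3f}")   # close to 1 (unbiased)
print(f"share of draws off by > 25%  : {(rel_err > 0.25).mean():.1%}")  # often a large share
```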
...
The paper at the following link was written by KRW Brewer (Ken Brewer) as a result of his selection about three years ago as the Waksberg Award winner, for survey statistics, and explains the history of this thinking, and concludes with his recommendation as to when to use which method:
Brewer, K.R.W. (2013), "Three controversies in the history of survey sampling," Survey Methodology, Vol. 39, No. 2 (December 2013), pp. 249-262. Statistics Canada, Catalogue No. 12-001-X.
The key is to have good regressor data available, which often is true, especially for periodically collected official data. (Note that in the example model, he uses the model-based classical ratio estimator (CRE).)
...........
In addition, there is the following by Mike Brick, who was selected about three years ago as the Washington Statistical Society's (a chapter of the ASA) President's Invited Speaker, on the topic of the variety of types of nonprobability sampling:
J. Michael Brick on Inference from Nonprobability Sampling:
Adilson, you stated that "Probability sampling I know, but it [representativeness] seems be different." Yes, they are different. First, representativeness can have varying definitions, but regardless, one can say that probability-based sample selection is just an attempt at obtaining a representative sample. This would happen, on "average," if you were to repeat your sample selection infinitely many times. But you only select once, and not even bootstrapping can tell you what you did not select. Still, hoping for 'representativeness' this way is very often a reasonable thing to do, and often even the best available. Further, estimates of variance and bias can be obtained, though they can be quite inaccurate themselves, and most bias and much of the variance often come from nonsampling error, such as measurement error, which can basically just be modeled anyway. (Probability sampling with a small sample from a population with multiple modes, skewness, or other unusual distributional features can be particularly prone to substantial failure.)
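A small sketch of that "on average" point, with assumed numbers of my own: for a population whose smaller mode contains 15% of the units, the expected composition of a small random sample is exactly right, yet a single draw can contain no units at all from that mode.

```python
# Probability sampling is 'representative' only in expectation: track the
# sample share of a small second mode across repeated small random draws.
import numpy as np

rng = np.random.default_rng(5)

N, n, reps = 10_000, 15, 10_000
in_mode2 = rng.random(N) < 0.15      # 15% of units form the population's second mode

share = np.array([
    in_mode2[rng.choice(N, size=n, replace=False)].mean() for _ in range(reps)
])

print(f"average sample share of the mode : {share.mean():.3f}")         # ~0.15: right on average
print(f"draws containing NO unit from it : {(share == 0).mean():.1%}")  # close to 9% of draws
```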
Chalamalla noted the following -
Bevins, Duke, & Bevins: Representativeness - means that the characteristics of the population and the sample are congruent
This is a good definition, but note that, in many examples, a single draw of a probability sample may not come even close to achieving this.
Thus, Adilson, you are correct to recognize that representativeness and probability sampling "seem different." They are. The latter is only an attempt to obtain an approximation of the former characteristic for a sample. Very often it is a good idea. But not always. (See Brewer paper linked previously. He generally liked combining probability of selection methodology with regression modeling.)
First of all, "representative" should always be specified with respect to particular characteristic(s) (gender, age, education, ...).
Second, a representative sample is not the same as representative sampling.
* The former ensures representativeness EX POST, possibly using weights. You first draw a sample and then compute weights for each observation in the sample to make it representative, i.e. to equalize the (weighted) distribution of the characteristic(s) of interest in the sample and in the population (or to minimize the difference between the distributions).
* The latter ensures representativeness EX ANTE, but only statistically. You draw your sample with the same distribution as the population, but you may end up, by pure chance on the random draws, with different distributions.
Note that it may be efficient to over-sample small categories in the sampling strategy, and then to use weights (small weights on the over-sampled categories) to compensate and ensure ex post that your sample is representative. If you use the weights in all your statistics and regressions, your results will be unbiased, and more efficient (lower variance) than with a representative sampling strategy.
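A minimal sketch of this over-sample-then-reweight strategy, with an assumed population and inclusion probabilities of my own choosing: inverse-probability weights restore approximately unbiased population-level estimates from the deliberately non-representative sample.

```python
# Over-sample a small category, then reweight: inverse-probability weights
# recover the population mean from the non-representative sample.
import numpy as np

rng = np.random.default_rng(9)

N = 50_000
small = rng.random(N) < 0.05                     # 5% of units in the small category
y = np.where(small, rng.normal(30, 4, N), rng.normal(10, 4, N))

# Over-sample: inclusion probability 10x higher for the small category
pi = np.where(small, 0.02, 0.002)
selected = rng.random(N) < pi
w = 1.0 / pi[selected]                           # design (inverse-probability) weights

unweighted = y[selected].mean()                  # biased toward the over-sampled category
weighted = np.average(y[selected], weights=w)    # approximately unbiased

print(f"population mean : {y.mean():.2f}")
print(f"unweighted mean : {unweighted:.2f}")
print(f"weighted mean   : {weighted:.2f}")
```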