If the area covered by your survey or research is too large (a district, province or country, for example), or the population is dispersed, you can use the cluster sampling method. It is the method most frequently used in the field for large population-based studies. In cluster sampling, the basic sampling units are selected within groups called clusters, such as villages, administrative areas or camps. The objective is to choose a limited number of smaller geographic areas in which simple or systematic random sampling can be conducted. It is therefore a multi-stage sampling method. There are two stages, as follows:
In the first stage, clusters are selected at random. The entire population of interest is divided into small, distinct geographic areas, such as villages or camps. You then need to find the approximate population size of each “village”. At this stage, the primary sampling unit is the village.
In the second stage, households are selected at random within each cluster, using simple or systematic random sampling. The sampling interval used to select the clusters is calculated as the total population of all the geographic units divided by the number of clusters needed.
If the total population is 4200 and you need 30 clusters, then:
Sampling interval = 4200 / 30 = 140.
Sampling begins at a randomly selected starting point, so we choose a random number between 1 and the sampling interval (140) as our start. The geographic unit in which this number falls will be the first cluster. Let’s assume that the random start number is 100. This falls in the first village.
The second cluster will be at 100 + 140 = 240, which also falls within the range defined for the first village in the list of villages (1 to 500), so this will be cluster 2. The third cluster is at 240 + 140 = 380, which is again in village 1. Adding the sampling interval again gives 380 + 140 = 520, which falls in the second village. Continue in this way until you have 30 clusters. These are your study villages, where you carry out the research according to your protocol. You can use this method if your study is community-based.
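The selection walked through above can be sketched in a few lines of Python. This is only an illustration: the village sizes are hypothetical (the example only states that village 1 covers 1 to 500 and that the total is 4200), and the function name `select_clusters` is made up for this sketch.

```python
from bisect import bisect_left
from itertools import accumulate

# Hypothetical village populations; only the first value (500) and the
# total (4200) come from the worked example.
village_sizes = [500, 300, 700, 400, 600, 350, 450, 900]

def select_clusters(sizes, n_clusters, random_start):
    """Systematic selection of clusters along the cumulative population list."""
    interval = sum(sizes) // n_clusters       # sampling interval
    cumulative = list(accumulate(sizes))      # upper bound of each village's range
    selections = []
    for k in range(n_clusters):
        position = random_start + k * interval
        # Village whose cumulative range contains this position (1-based).
        village = bisect_left(cumulative, position) + 1
        selections.append((position, village))
    return interval, selections

interval, selections = select_clusters(village_sizes, n_clusters=30, random_start=100)
print(interval)         # 140
print(selections[:4])   # [(100, 1), (240, 1), (380, 1), (520, 2)]
```

In a real survey the start would be drawn at random, for example with `random.randint(1, interval)`. A large village can receive several clusters, as village 1 does here.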
I don't know the specifics of your work, but in general I would recommend k-means partitioning, using the centroids from a hierarchical clustering as the starting centers. The number of clusters could be determined, for example, by visual inspection of the hierarchical clustering dendrogram and with the Calinski-Harabasz criterion. After that you should check each cluster's internal consistency and stability (there are several measures, please check "Brock G, Pihur V, Datta S, Datta S. (2008) clValid: An R Package for Cluster Validation. J Stat Softw 25: 1-22").
In any clustering technique, we first randomly elect some cluster heads (CHs), or any members that initiate the clustering process. The next step is to compare the characteristics of all the remaining members with those of the initiating members. Each member joins the CH with which it has the most similarities. Here we cannot calculate the number of clusters to be selected (to my knowledge).
The number of clusters, I would think, would have to be a compromise between the difficulty in traveling to or otherwise reaching the clusters in the first stage, and the number of smallest units you can handle for a sample size. Cluster sampling, unlike stratification, actually increases the overall sample size needed, but may lower your cost. Also, in general, the larger your sample size, the greater might be your nonsampling error, like measurement error, but the convenience of cluster sampling (which is a randomization/design-based method) may ameliorate this to a degree.
So the number of clusters depends upon the convenience (NOT "convenience sampling") of this design, the sample size you can handle, and the accuracy you can attain considering both sampling error and nonsampling error. So this is rather customized.
You could look into this in a book such as Cochran, W.G.(1977), Sampling Techniques, 3rd ed., John Wiley & Sons.
If we are talking about survey sampling for inference to a finite population, as discussed in a book such as Cochran's, then I'd say there is no such rule-of-thumb, other than to say that only if the difference between clusters is small could you expect to only need a few of them. (But even then, by drawing too few clusters you may miss the fact that some are quite different.)
For stratified random sampling, we can be more efficient than simple random sampling if the difference between strata is great, and the strata themselves have little variance within each. But for cluster sampling, you are drawing clusters at the unit level, as a randomized sample, and then either performing a census on each selected such unit (cluster), or having a second stage of sampling within each. Typically this is less efficient than simple random sampling. If you look in Cochran, or a similar book, you will see that to even determine the sample size needs of a simple random sample requires information about the standard deviation of the population. "One size does not fit all."
There are sample size "calculators" on the internet that may give you the impression that you do not need to know the spread of the population data to estimate sample size needs, but that is incorrect. Such calculators are generally not properly documented. They are only for yes/no data (not the continuous data with which I generally worked), usually assume the worst case of p=q=0.5, rather than estimate standard deviation, and usually do not include a finite population correction (fpc) factor, so they could even suggest a sample size that is larger than the population.
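To illustrate the last point with made-up numbers (a population of N = 200, a 95% confidence level with z = 1.96, and a margin of error e = 0.03, none of which come from this thread), the worst-case formula for a proportion and the correction in the form given by Cochran work out roughly as follows:

```latex
n_0 = \frac{z^2\, p\, q}{e^2}
    = \frac{1.96^2 \times 0.5 \times 0.5}{0.03^2} \approx 1067
\qquad
n = \frac{n_0}{1 + \dfrac{n_0 - 1}{N}}
  = \frac{1067}{1 + \dfrac{1066}{200}} \approx 168.6
```

Without the fpc the suggested sample (about 1067) is more than five times the population of 200. With it, the requirement rounds up to 169.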
To estimate, or guess, reasonably for standard deviations, Cochran suggested a pilot study, or related data could be consulted. Where I worked in official statistics, there were periodic surveys, so that was not such a problem.
But comparing simple random sampling to cluster sampling, you are going to need at least as many clusters out of the total number of clusters as you would in simple random sampling if n were the number of clusters in the sample, and N were the total number of clusters in the population. That is considering clusters of equal size. It gets more complicated with clusters of unequal size. And even more complicated if you sample at the second stage, which would mean you would need more clusters, but perhaps a smaller overall sample size.
So, the basic answer is that it always depends upon the variability of the data, not the percent of the clusters nor of the population as a whole, without considering variability. You may only be able to roughly guess variability to start, and may have to experiment with a pilot study to determine what size clusters, and how many.
No matter how you sample and do inference, even with the model-based methodology I used, variability is still the key concept.
Unfortunately I do not have a reference, but you can find this on the net; the site is "Cross Validated". You can search the net for the heading "rule of thumb for number of clusters".
I am studying a rural population of 21 million and would like to employ geographical sampling. What formula should I use to determine the number of clusters, and also the number of sampling units within each cluster?
I have taken a sample size of 5000 subjects based on the Cochran formula, which I have to select from 43 predefined clusters of unknown size. The population size is also not defined, and the total number of households in all clusters is 20,000. On what basis can I select the number of clusters from which to pick samples? Kindly suggest with an example.
For an area sample, you should first allocate your total sample size n, in terms of dwellings or households, across strata. Then you should calculate the optimum cluster size, which depends on cost and roh (the intraclass correlation); see Kish (1965). Finally, divide each stratum's share by the optimum cluster size to get the number of clusters.
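As a rough sketch of that calculation, one commonly quoted form of the optimum cluster size under a simple linear cost model is sqrt((cost per cluster / cost per unit) × (1 − roh) / roh). The cost figures, the roh value and the stratum share below are invented for illustration, and the function names are hypothetical; check Kish (1965) for the exact cost model that fits your survey.

```python
import math

def optimum_cluster_size(cost_per_cluster, cost_per_unit, roh):
    """Optimum number of units per cluster under a simple linear cost model."""
    return math.sqrt((cost_per_cluster / cost_per_unit) * (1 - roh) / roh)

def clusters_for_stratum(stratum_sample_size, cluster_size):
    """Number of clusters needed to cover a stratum's share of the sample."""
    return math.ceil(stratum_sample_size / cluster_size)

# Assumed inputs: reaching a cluster costs 100, each interview costs 4,
# roh = 0.05, and the stratum was allocated 660 households.
b_opt = optimum_cluster_size(cost_per_cluster=100.0, cost_per_unit=4.0, roh=0.05)
b = round(b_opt)                         # about 21.8, rounded to 22
n_clusters = clusters_for_stratum(660, b)
print(b, n_clusters)                     # 22 30
```

A higher roh or cheaper travel between clusters pushes the optimum toward smaller clusters and therefore more of them.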
I suggest calculating the needed sample size using the design effect, and then dividing the sample size by the average number of people in each cluster to get the number of clusters to target. The design effect can be obtained from the literature. If it is not available there, you can calculate it from this formula: DE = 1 + (m - 1)p, where m is the number of people in each cluster and p is the intra-class correlation coefficient.
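A minimal sketch of this approach, assuming you already have a sample size computed as if for simple random sampling. The inputs below (384 subjects, 20 people per cluster, intra-class correlation 0.05) are illustrative values only, and the function names are hypothetical.

```python
import math

def design_effect(m, icc):
    """DE = 1 + (m - 1) * icc, with m people per cluster."""
    return 1 + (m - 1) * icc

def clusters_needed(n_srs, m, icc):
    """Inflate a simple-random-sample size by DE, then convert to clusters."""
    n_adjusted = math.ceil(n_srs * design_effect(m, icc))
    return n_adjusted, math.ceil(n_adjusted / m)

de = design_effect(m=20, icc=0.05)                       # about 1.95
n_adjusted, n_clusters = clusters_needed(n_srs=384, m=20, icc=0.05)
print(n_adjusted, n_clusters)                            # 749 38
```

Because DE itself grows with m, taking more people per cluster raises the total sample size needed even as it reduces the number of clusters.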
The formula for calculating the sample size is:
$n = \dfrac{2\,(z_{\alpha/2} + z_{\beta})^{2}\, P\,Q \times DE}{\Delta^{2}}$
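The formula can be evaluated with the Python standard library. In this sketch I am assuming the usual reading of the symbols, which the post does not spell out: z values for a two-sided significance level and for the power, Q = 1 − P, Δ as the difference to detect, and n as the size per group. The inputs are illustrative and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p, delta, design_effect, alpha=0.05, power=0.80):
    """n = 2 (z_{alpha/2} + z_beta)^2 * P * Q * DE / delta^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # about 0.84 for 80% power
    q = 1 - p
    n = 2 * (z_alpha + z_beta) ** 2 * p * q * design_effect / delta ** 2
    return math.ceil(n)

n = sample_size_per_group(p=0.5, delta=0.10, design_effect=1.5)
print(n)   # 589
```

With a design effect of 1 (no clustering) the same inputs give 393, so the clustering assumed here costs roughly half as many subjects again.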
In any case, we need to know the average number of people in each cluster, and either the DE or the intra-class correlation, to be able to work out the number of clusters to include.
You should define your clusters first, then calculate your sample size for a finite population: 50% squared (that is, p × q with p = q = 0.5), multiplied by the z value squared, divided by the margin of error squared.
Once you have your sample size, divide it by the number of clusters to arrive at the number allocated to each cluster, and make sure the number in each cluster is not less than 30.
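A small sketch of this recipe, assuming a 95% confidence level (z = 1.96) and a 5% margin of error, neither of which is stated in the post. No finite population correction is applied here, and the function names are hypothetical.

```python
import math

def sample_size_proportion(z=1.96, p=0.5, margin=0.05):
    """n = z^2 * p * (1 - p) / margin^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

def per_cluster(n, n_clusters, minimum=30):
    """Equal allocation per cluster, plus a check against the suggested minimum."""
    share = math.ceil(n / n_clusters)
    return share, share >= minimum

n = sample_size_proportion()                              # 385
share, ok = per_cluster(n, n_clusters=10)                 # 39 per cluster, OK
share_many, ok_many = per_cluster(n, n_clusters=20)       # 20 per cluster, too few
print(n, share, ok, share_many, ok_many)
```

If the check fails, as with 20 clusters here, either the total sample size has to go up or fewer clusters have to be used.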
One should know the origin and/or something about the derivation of 'formulas' that you use. (Formulas with p and q are for yes/no (proportions) data only. Some of us, as in my case, have worked more with continuous data.) Clusters are selected, n of them, and then the m elements are selected from each cluster. If m is all M of the elements in each cluster, then this simplifies things. If not, then if cluster sizes, M, are equal, that is not as involved as if the Mi are each different. Both variance between clusters and variances within each cluster are important. How you select the clusters (say, perhaps simple random sampling) is important, and how you select within the clusters is important. The number of clusters, n, is important, and the number of elements within each cluster to be selected is important. There may be an important cost consideration. Variance should be minimized with a minimum total number of elements selected, unless it is easier and less expensive to allow, say, more variance between clusters and collect more elements within clusters for a larger total number of elements, but easier to handle selection, with the same overall variance as other options. That is why the book by W.G. Cochran, Sampling Techniques, 1977, 3rd ed, Wiley, has multiple chapters in this area. There are textbooks and information on the internet, notably lessons from Penn State (search on Pennsylvania State University and your topic), which can be helpful. But you need to narrow your requirements and see what is required to reach a desired level of variance, assuming good data quality and no biasing conditions.
Back to the original question: the number of clusters needed depends partly on the number of elements chosen from each cluster. The variance between clusters must not be estimated too inaccurately, and the selected clusters need to represent most of the population well, so there should not be too few clusters. But you can see that there are various considerations.