I am assuming that you are referring to the sample size. The main issue is whether the sample is representative of the population under study, and I will confine my remarks to a simple survey sample rather than more elaborate sampling schemes.
The situation you pose is reminiscent of the story of the person who lost his wallet in the forest. When approached by a passerby, he was asked where he had lost it, and he replied, “Over there in the forest.” “Then why are you looking under the lamp post?” asked the passerby. “Well, the light is better here,” was the response. The statistician John Tukey was once asked at a conference what he would do if data came to him for analysis without a pedigree. His response was “Send the mongrels back.”
Statistical estimation and statistical inference are predicated on having data that accurately represent the population under study. Notwithstanding, there are a few things you might do. You could construct upper-bound and lower-bound estimates by lumping all of the data not collected into the most reasonable extreme positions (all high and then all low). These bounds are often far too wide, but it is a possibility. You could also take this modified data and Winsorize it (symmetrically move extreme observations toward the center) or trim it (symmetrically eliminate extreme observations) and calculate descriptive statistics (with corrections based on whatever assumptions you are willing to make). Estimates of relationships are a bit more complicated, though some robust tools are available, and imputing missing data is not necessarily out of the question.
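To make the Winsorizing and trimming concrete, here is a minimal sketch using SciPy's `winsorize` and `trim_mean`; the data values and the 10% limits are invented purely for illustration.

```python
# A minimal sketch of Winsorizing and trimming with SciPy;
# the data and the 10% tail limits are hypothetical.
import numpy as np
from scipy.stats import trim_mean
from scipy.stats.mstats import winsorize

data = np.array([2.1, 2.3, 2.4, 2.5, 2.7, 2.8, 2.9, 3.0, 3.1, 9.9])  # 9.9 is extreme

# Winsorize: pull the most extreme 10% in each tail in to the nearest retained value
wins = winsorize(data, limits=(0.10, 0.10))
print("Raw mean:       ", data.mean())
print("Winsorized mean:", wins.mean())

# Trim: drop the most extreme 10% in each tail before averaging
print("Trimmed mean:   ", trim_mean(data, proportiontocut=0.10))
```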
You could collect another sample of data and compare it to your original dataset. If you have enough power and still cannot detect differences, then combine the samples. Alternatively, if the new dataset contains only some of the critical variables and no differences can be detected, then just use the original data.
If you find differences between the second sample and your original sample, then a more complete investigation is warranted, and you may learn things you were not expecting to find. In any event, some replication and confirmation of your findings is warranted, particularly with dirty data.
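As a concrete illustration of such a comparison, here is a hedged sketch using Welch's t-test and the two-sample Kolmogorov-Smirnov test from SciPy; the `original` and `new_sample` arrays are simulated stand-ins, not data from any actual study.

```python
# A sketch of comparing a fresh sample against the original one;
# both arrays are simulated placeholders for illustration only.
import numpy as np
from scipy.stats import ks_2samp, ttest_ind

rng = np.random.default_rng(0)
original = rng.normal(loc=50, scale=10, size=120)   # stand-in for the first dataset
new_sample = rng.normal(loc=50, scale=10, size=60)  # stand-in for the follow-up sample

# Location difference (Welch's t-test, no equal-variance assumption)
t_stat, t_p = ttest_ind(original, new_sample, equal_var=False)

# Difference anywhere in the distribution (two-sample Kolmogorov-Smirnov)
ks_stat, ks_p = ks_2samp(original, new_sample)

print(f"Welch t-test p = {t_p:.3f}; KS test p = {ks_p:.3f}")
# Only combine the samples if, with adequate power, neither test flags a difference.
```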
If you are really interested in this problem, you may want to consult the literature on robust statistics, survey sampling, and data imputation.
Good research usually starts with good methodological design. I know of no satisfactory statistical procedures that can save such a study within the Neyman-Pearson or Fisher hypothesis-testing traditions. You can, of course, describe the sample you have and its properties. Generalizing to some other situation or population is, however, unwarranted.
Perhaps others will share their opinions and ideas.
Dear David, thank you for the constructive feedback. Actually, we conducted a national survey in which our sample size estimate was 150 clinics from 25 districts (out of 75 districts in total), and we had no preexisting sampling frame. We had to go into the field and develop the frame ourselves. In many districts there were not enough clinics, so we decided to replace them with clinics from other sampled districts, but even with this strategy there were not enough clinics across the 25 districts. Although we conducted a census in these 25 districts, we reached only 78 clinics. Will this still be nationally representative?
The best thing is to examine your data and get robust estimates of your current level of uncertainty. That will tell you whether the data can support any decisions or whether you need to rethink the collection strategy. I would not advise a complete statistical assessment at this stage, since you want to limit the number of times you run tests to avoid the risk of false positives.
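As one concrete (and entirely hypothetical) way to gauge that uncertainty, a simple nonparametric bootstrap of the mean could look like this; the `clinic_values` array is simulated, and 78 is used only because it matches the number of clinics reached.

```python
# A minimal sketch of gauging current uncertainty with a bootstrap
# confidence interval; the clinic-level values are invented.
import numpy as np

rng = np.random.default_rng(42)
clinic_values = rng.normal(loc=0.6, scale=0.15, size=78)  # e.g., 78 observed clinics

# Resample the observed clinics with replacement many times
boot_means = [
    rng.choice(clinic_values, size=clinic_values.size, replace=True).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"Mean = {clinic_values.mean():.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
# If this interval is too wide to act on, rethink the collection strategy.
```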
Undersampling will increase the uncertainty in the data, which will reduce the power of the study. Is the variation you are observing acceptable, or does it indicate that you need more complete coverage? I assume from your description that the sampling pattern was designed around some assumption about heterogeneity and variance in the population. How does your current data's variance compare to the assumptions you made? (A rough way to check this is sketched after the next two points.)
If it is worse, then you need to think of strategies to expand data collection.
If it is comparable or better, then consider whether you wish to complete your statistical analysis (but be prepared to accept whatever the outcome is).
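Here is a rough sketch of that variance check, assuming the original sample size was planned from an assumed standard deviation and a target margin of error via n = (z * sigma / E)^2; every number in it is hypothetical.

```python
# A sketch of checking observed variability against the design assumption;
# sigma_assumed, the margin of error, and the data are all hypothetical.
import numpy as np

sigma_assumed = 0.10      # SD assumed when the sample size was planned
margin_of_error = 0.05    # precision target used at the design stage
z = 1.96                  # two-sided 95% confidence

observed = np.array([0.52, 0.61, 0.58, 0.70, 0.44, 0.66, 0.59, 0.63])
sigma_observed = observed.std(ddof=1)

# Required n for the target margin of error: n = (z * sigma / E)^2
n_planned = (z * sigma_assumed / margin_of_error) ** 2
n_implied = (z * sigma_observed / margin_of_error) ** 2

print(f"Observed SD = {sigma_observed:.3f} vs assumed {sigma_assumed:.3f}")
print(f"Planned n ~ {np.ceil(n_planned):.0f}; n implied by the data ~ {np.ceil(n_implied):.0f}")
```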
Dear Deepak, the question I would have for you now is, "How did you come to pick these 25 districts?" Could you consider these districts to be clusters drawn from your national divisions, and were these clusters selected in a representative manner (i.e., at random or by some other strategy that guarantees representativeness)? Clusters, not individuals, then become the unit of sampling under this assumption. While this is an ex post facto sampling plan, it is the best that I can think of right now.
The next issue that comes to mind is how your data was obtained. Ideally, you would have a random sample within each cluster. So let's pretend that this requirement is met.
Now we get to what I thought was your original question, and it turns out that it may be the least problematic issue of the sampling design. Most statistical analytic techniques assume that the underlying sampling distribution of the statistic under consideration is based on sampling with replacement, particularly for finite populations. If there is some compelling reason to have particular cell sizes, and some of the cells contain a complete census, then why not create a resampling scheme to fill in the vacant data? I can imagine a couple of reasons for using such a procedure: first, the statistical procedure requires equal cell sizes for computation, or second, proportional representation in each cell is required. So what I am suggesting is a hybrid model where, if a cell requires 10 observations and you have only 8, you randomly (with replacement) pick another two observations to complete that part of your data.
Alternatively, you might consider that each cell contains an unbiased data set. Then you could create a sample of whatever size you desire by sampling with replacement from each corresponding data set.
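A minimal sketch of both resampling ideas, assuming a cell that requires 10 observations but contains only 8; the cell values are invented for illustration.

```python
# A sketch of the two resampling ideas above; cell contents are hypothetical.
import numpy as np

rng = np.random.default_rng(7)
cell = np.array([4.2, 5.1, 4.8, 5.5, 4.9, 5.0, 4.6, 5.3])  # 8 observed, 10 required

# Hybrid fill: keep the observed 8 and draw the missing 2 from them with replacement
filler = rng.choice(cell, size=10 - cell.size, replace=True)
filled_cell = np.concatenate([cell, filler])

# Alternative: treat the cell as unbiased and resample an entire cell of any size
resampled_cell = rng.choice(cell, size=10, replace=True)

print("Filled cell:   ", filled_cell)
print("Resampled cell:", resampled_cell)
```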
In summary, I am proposing for your consideration that you conceptualize your design as a cluster (or some other nested) sampling plan and a resampling plan for data within each cluster.
I am reminded of some advice I got from one of my (tor)mentors many years ago. "It is always better to figure out what you want to do before you assess how well you've done it." Life would be much simpler if I heeded this advice in my own work more often.