Histograms are awful for comparing groups. Try altering the bar width or the start point of the first bar and you will see that you can make many different-looking histograms of the same data.
Boxplots can alert you to differences in location and distribution shape, but they do not show the fine structure of the data. Overlaying boxplots on dot plots (strip plots) is a more powerful method. I've attached a quick plot of mpg from the cars dataset as an example.
For visual inspection, you can use the function histogram(~data | groups) from the lattice package (which generates conditional histograms), or in a ggplot2 plot (geom_histogram) you can use the "fill" aesthetic to specify the groups within the same histogram.
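To make the two approaches concrete, here is a sketch using the built-in InsectSprays data (substitute your own variable and grouping factor; it assumes the lattice and ggplot2 packages are installed):

```r
# Conditional histograms with lattice: one panel per group
library(lattice)
histogram(~ count | spray, data = InsectSprays)

# Overlaid histograms with ggplot2: groups mapped to fill
library(ggplot2)
ggplot(InsectSprays, aes(x = count, fill = spray)) +
  geom_histogram(position = "identity", alpha = 0.5, bins = 10)
```

With position = "identity" the group histograms are drawn on top of one another (rather than stacked), so the alpha transparency is what keeps them all visible.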
I think that an overlay of dots and boxes is not really necessary. One can show the data when the number of groups and dots is relatively small (otherwise the plot becomes too busy) and use boxplots otherwise.
To avoid overplotting of points in the dot plots (1D scatterplots), the dots can be arranged so that all of them are completely visible. Such plots give a good impression of the (empirical) densities, which is only hinted at when jittering is used (as in your picture).
I attached an example from R showing the InsectSprays dataset using the function beeswarm. The code is:
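The original code attachment isn't reproduced here; a minimal sketch that would produce such a plot, assuming the beeswarm package is installed, might look like:

```r
# Beeswarm plot: points arranged so none overlap,
# giving a visual impression of the empirical density per group
library(beeswarm)
data(InsectSprays)
beeswarm(count ~ spray, data = InsectSprays,
         pch = 16, xlab = "Spray", ylab = "Insect count")
```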
While 1D scatterplots give a good impression of densities, they don't give information about quantiles, which is why I like the addition of boxplots. Here's an alternative, with the points stacked, thus making a sort of distribution plot that doesn't have to be read as deviation from an imaginary centre point.
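One way to get both the quantile information and the raw points is to draw the boxplots first and add the swarm on top. A sketch, again assuming the beeswarm package (the side argument pushes the stacked points to one side of each group's axis position):

```r
library(beeswarm)
data(InsectSprays)

# Boxplots give the quantiles; outline = FALSE suppresses outlier
# symbols, since every point is shown by the swarm anyway
boxplot(count ~ spray, data = InsectSprays, outline = FALSE)

# Overlay the individual observations, stacked to one side
beeswarm(count ~ spray, data = InsectSprays,
         add = TRUE, side = 1, pch = 16)
```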
Thanks a lot. Will you share the R code for this? One thing is a presentation showing the mean and/or the distribution. Another thing: how do we compare whether the distributions from different groups are statistically different or not? I am not familiar with this. Moreover, are density plots or kernel density plots better than just counts? I would really appreciate it if you would share your knowledge and experience!
The chi-squared test or Fisher's exact test can be used to test differences between frequency distributions.
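Both tests are in base R. A sketch with hypothetical counts (the numbers in the matrix are placeholders, not real data):

```r
# 2x2 table of hypothetical counts: rows are groups, columns are outcomes
tab <- matrix(c(12, 30, 8, 25), nrow = 2,
              dimnames = list(group   = c("A", "B"),
                              outcome = c("low", "high")))

chisq.test(tab)   # chi-squared test of independence
fisher.test(tab)  # exact test, preferable when expected counts are small
```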
Actually, density plots are not better than just counts, because they hide and smooth information that would otherwise be visible, and, more importantly, they assume a continuous variate, whereas counts are discrete. They may be OK when you have a lot of data; if the amount of data is small, density plots can be extremely misleading.
With 70,000 data points, tiny differences will be statistically significant without necessarily meaning anything important. I'm in a similar position with a study that ran about 150 trials that yielded 80,000 data points (ah, technology!). I'm trying to characterise each trial by an area under the curve (though the noisiness of the data makes this a tricky problem), and so get a meaningful measure that I can use to estimate effect size.
Your initial problem may well be defining a meaningful variable that you can synthesise from your primitive data (I don't mean primitive in the sense of rough-hewn, but in the sense of data as building blocks for meaningful variables, rather than as meaningful in themselves).
I have already compared the mean values from different groups! What I would also like to do is show the distribution for each group and compare these distributions, as this is important, or at least interesting, to a biologist. I have also searched a bit on Google, but would like a precise response from the experts here if possible. I am experienced with regular ggplot2 or hist graphs in R, but are there any better packages or functions in R for this?
Are you interested in comparing two distributions visually? In that case you've had some good suggestions: lattice::histogram, violin plots, box plots.
Or are you looking for a statistical test? In which case, a two-sample Kolmogorov-Smirnov test would be a candidate.
Or summary statistics? Reporting several percentiles (min, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, max) is often meaningful.
Or tests of specific statistics? It may be meaningful to compare the medians, or say the 75th percentiles, by permutation test.
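The visual-test-summary options above can be sketched in a few lines of base R (the rnorm calls simulate two groups and are placeholders for your own data):

```r
set.seed(1)
g1 <- rnorm(200)              # placeholder for group 1
g2 <- rnorm(200, mean = 0.3)  # placeholder for group 2

# Statistical test: two-sample Kolmogorov-Smirnov
ks.test(g1, g2)

# Summary statistics: a set of percentiles per group
quantile(g1, probs = c(0, .05, .10, .25, .50, .75, .90, .95, 1))
quantile(g2, probs = c(0, .05, .10, .25, .50, .75, .90, .95, 1))
```

Note that the KS test is sensitive to any difference between the distributions (location, spread, or shape), so a significant result tells you they differ, not how.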
1) Try a computer-intensive approach. Re-plot the data many thousands of times, and in each re-plot leave a few individuals out. This will result in a distribution whose edge is fuzzy; the more individuals that are removed, the fuzzier the edge becomes. If you remove 5% of the data from graph 1 and 5% from graph 2, and the two do not overlap, then (maybe) they are different.
2) Find a statistic, like the 25th percentile (25% of values are less than "this" value), and use a randomization test to decide if there is a statistically significant difference. This approach will work for any statistic you choose, but it can rapidly become a fishing expedition, so please don't: the 25th percentile didn't work, so I'll try the 26th percentile; maybe the geometric mean, skew, kurtosis, and all percentiles from 1 to 99; I can then do all of this on transformed data... and eventually I will find something. That is a fishing expedition, and not good science.
3) What I am missing from the figures is some idea of how much the figures would change if I gathered more data. You could gather several more data sets using the same methods and then use that to produce a graph that is a mean and 95% confidence interval at each x-value.
3a) An alternative is to break the data set into pieces. With 70,000 data points you could make five sets of 14,000.
3b) Break the data set in half by randomly assigning individual observations. Do this over and over again to examine variability in the outcome. At each point where there is real data, you can quantify this variability.
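The randomization test in point 2 can be sketched as follows, here for a difference in 25th percentiles (the rexp calls simulate two groups and stand in for your real data):

```r
set.seed(1)
g1 <- rexp(500)              # placeholder for group 1
g2 <- rexp(500, rate = 0.9)  # placeholder for group 2

# Observed difference in the chosen statistic
obs <- quantile(g1, 0.25) - quantile(g2, 0.25)

# Under the null, group labels are exchangeable: pool the data,
# reshuffle the labels many times, and recompute the statistic
pooled <- c(g1, g2)
n1 <- length(g1)
perm <- replicate(2000, {
  idx <- sample(length(pooled), n1)
  quantile(pooled[idx], 0.25) - quantile(pooled[-idx], 0.25)
})

# Two-sided permutation p-value
mean(abs(perm) >= abs(obs))
```

The same skeleton works for any statistic: only the quantile(..., 0.25) lines change, which is exactly why it should be chosen once, in advance, rather than fished for.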