I see a lot of problems when using Excel, mainly the lack of reproducibility of results, which is mandatory in statistics. Then there are a lot of issues with missing values (treated inconsistently in Excel), rounding (no adherence to the IEEE standard), etc., and of course the lack of newer advanced statistical methods that may only be available in R.
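To make the reproducibility and missing-value points concrete, here is a minimal, hypothetical R sketch (the vector x is invented for illustration). In a script, every decision is recorded and can be re-run, including how missing values are treated:

```r
# A missing value must be dealt with explicitly in R:
x <- c(2.1, 3.4, NA, 5.0)

mean(x)                # returns NA - the missing value cannot be silently ignored
mean(x, na.rm = TRUE)  # dropping NAs is now a visible, documented decision

# Because the whole analysis is a script, it can be re-run and audited
# line by line - which is what reproducibility requires.
```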
In my experience, advising R forced researchers to think a little more about their data and their analysis, which I consider a good thing. Those who are completely scared off after looking into R will have a lower inhibition threshold to consult a statistician in the future, which is also a positive aspect in my opinion.
Excel (or similar spreadsheet software) usually does not require any "advice" - it is the software most often used every day to organize data, and to analyze it, too. Advising someone to use Excel is like carrying coals to Newcastle. And some very simple analyses can be done in Excel quite conveniently; that is not generally bad. However, producing diagrams of a minimum required quality in Excel is a major pain. As soon as a researcher is concerned about the quality of the diagrams, he/she will automatically seek alternatives anyway. The known/standard diagram types can be produced in far better quality, and relatively simply, with many programs (e.g. Prism, Origin, ...), and I do not see why such programs should not be used for this purpose. However, the main problem in my eyes is that researchers do not think about good alternatives to the kinds of diagrams they use, and that it is not clear whether they want to show data, model predictions, or effects (many do not even recognize the difference!).
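To illustrate that last distinction, here is a small hypothetical R sketch (the dose-response data are invented) showing the three different things one might put on a chart:

```r
set.seed(1)
dose     <- rep(c(0, 5, 10, 20), each = 6)
response <- 10 + 0.4 * dose + rnorm(length(dose), sd = 2)
fit      <- lm(response ~ dose)

# 1) Showing the data:
plot(dose, response)

# 2) Showing a model prediction:
newdose <- data.frame(dose = seq(0, 20, length.out = 50))
lines(newdose$dose, predict(fit, newdata = newdose))

# 3) Showing the effect (here: the slope, with its confidence interval):
coef(fit)["dose"]
confint(fit, "dose")
```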
From a statistician's or "scientist's" point of view, surely any advice should be to first understand the concepts and principles. Which tool is then used to solve a well-understood problem does not really matter much. But the reality is not that nice. The most important aim of most researchers is to publish. Since funding, jobs, and careers are tied to successful publishing, there is no (and cannot be a) high level of self-critique and scepticism about one's own "results", and all behaviour is tuned to reach this aim. The main hurdle is to convince editors and reviewers, who often enough have very "old-fashioned" views of statistics and are often not particularly competent in this subject either. Including professional statisticians in reviewing is a step in the right direction, but it still often suffers from the statistician's lack of understanding of the data, the measurement techniques, and the problem, and the advice of the statistical and the other reviewers concerning the analysis is often contradictory (what should authors do then?...). So in the end, authors want to do what the reviewers want to see. And as long as reviewers just want to see "the usual stuff" (be it non-sensical, irrelevant, or even wrong - this does not matter), authors will seek an easy way to deliver it. And, yes, having some "I-do-it-all-for-you" software and knowing which menu item to click to get the chart and the numbers that have to be presented to satisfy the reviewers is exactly the answer to the authors' problems. It seems not necessary that anyone (authors, editors, reviewers, readers) understands what the charts show or what the numbers (e.g. p-values) actually mean.
In conclusion, I think it makes perfect sense *in our environment* (publishing & funding) to advise stats packages that quickly and simply do the job the researcher wants done. Blocking this does not solve the problem that education in statistics (also in "philosophy of knowledge" and in "ars conjectandi"!) has to become much, much better. But this must be achieved in parallel to the current "bad practice" of essentially doing "mindless statistics" (G. Gigerenzer, The Journal of Socio-Economics 33 (2004) 587-606).
Education in mathematics and statistics has to become much, much better - I fully agree with you. Furthermore, it needs to communicate the need for quantitative personnel in present and future business & industry - with data volumes doubling every couple of years and smart devices collecting more and more information. Google's chief economist Hal Varian said that statistician is going to be the sexy job of the next ten years - well, he said that in 2009, but there are still 5 years to go - and I am convinced it will not stop there. Especially in countries where statistical awareness is not as developed as, for instance, in the US.
Should we use Excel? Yes, but not for statistical analysis and visualisation. Should we only use R? No, because the very steep learning curve is too scary for statistically uninitiated decision makers. Does that mean we have to spend a lot of money on SPSS, Statistica, or SAS? Not necessarily, because these software packages are not small programs meant for performing this one analysis; they are software solutions for process controlling, hierarchically distributed analysis, and so on - so make sure you know what you want the big ones for. So, what shall we do? Be sure what your final product shall be, get training, and stick to scientific conduct when performing statistics - choose whatever tool suits the job best (the link from Matthias is a nice one on that subject; while Excel is too small a saw for a mammoth tree, it might be a handy tool for a small birch - here is Matthias' link again: http://bit.ly/1rNdT1x).
The latter part of your post also aims a bit at the ongoing discussion of «statistician vs. data scientist» - where all the buzzwords like big data, correlogram, hadoop, and so on get thrown around. See for instance: http://bit.ly/1efPVXP for ten reasons why small data is more fruitful, http://bit.ly/1d2tDp0 for a data scientist's view, or http://bit.ly/1iGGbbx for a statistician's view.
I also totally agree with that statement. As my mentor Gerd Gigerenzer put it: "We need statistical thinking, not statistical rituals." Software packages invite people to do cookbook statistics.
See these two articles that discuss this issue in more detail:
Gigerenzer, G. (2004). Mindless statistics. Journal of Socio-Economics, 33, 587-606.
Marewski, J. N., & Olsson, H. (2009). Beyond the null ritual: Formal modeling of psychological processes. Zeitschrift für Psychologie/Journal of Psychology, 217, 49-60.
Excel is not at all adequate for statistics. I have encountered many people miscalculating R-squared values (as simple as that) because in some cases Excel does not give the correct value (sometimes even negative values are encountered).
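One way such a negative value can arise (shown here as a hypothetical R sketch with invented data, not as a claim about Excel's exact internals) is computing R² as 1 − SS_res/SS_tot for a regression forced through the origin:

```r
set.seed(42)
x <- 1:20
y <- 50 - 0.1 * x + rnorm(20)    # data with a large intercept

fit0   <- lm(y ~ x + 0)          # intercept suppressed (forced through origin)
ss_res <- sum(residuals(fit0)^2)
ss_tot <- sum((y - mean(y))^2)

1 - ss_res / ss_tot              # negative: this "model" fits worse than mean(y)
```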
With respect to R, I think it is one of the best options for training people in statistics at an intermediate and upper-intermediate level. It is true that the syntax is complicated, but this has the following advantage: by looking up the meaning of all the options of a certain command in the help file, you find lots of explanations about the rationale behind what the command is doing, which constantly brings you back to the basics of whatever method you are using. There is a good book entitled "Introduction to Statistics Using R" which I think is really helpful for teaching and training.
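As a small sketch of what is meant (t.test is just one arbitrary command), even the argument list on a help page forces decisions that a menu would hide:

```r
?t.test   # opens the help page for Student's / Welch's t-test

# Its arguments include, among others:
#   alternative = c("two.sided", "less", "greater")
#   paired      = FALSE
#   var.equal   = FALSE
# So you must decide on the direction of the test, whether the samples are
# paired, and whether to assume equal variances (Welch's test by default) -
# and the help text explains the rationale behind each choice.
```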
We need to be careful about advising anyone to do any type of analysis when they do not understand basic statistics. There are many problems with performing an analysis without knowing what you are doing. For example, every model used rests on a number of assumptions - would this researcher know them, or how to test them?
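As a minimal sketch of what such checks look like (in R, using its built-in iris data purely for illustration), even an ordinary linear model comes with assumptions about its residuals that can and should be inspected:

```r
fit <- lm(Sepal.Length ~ Sepal.Width, data = iris)

# The standard diagnostic plots: residuals vs. fitted values, normal Q-Q,
# scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(fit)

# A formal (if limited) check of residual normality
shapiro.test(residuals(fit))
```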
Perhaps they could get basic results such as number summaries of their characteristics, but aside from that, someone with no knowledge of basic statistics should consult a statistician.
I think the answer is that it definitely must not be advised; below are some considerations.
Some researchers, if they are proficient in the use of EXCEL, MATLAB, or any other application, can use them for their analysis, as those programs have the mathematical tools and statistical functions. However, there is always the possibility of making a mistake in the application of a formula. I would add that the best idea for data analysis is to use the packages available at the universities: SAS, SPSS, or MINITAB, for example.
It is difficult today to find researchers with limited statistical training in highly developed countries. The reason is that in the best universities professors have a very high level, and generally your supervisory committee will require at least one course in statistical methods and one in design of experiments, making it difficult to find researchers with a low level in statistics. Of course there may be exceptions, but certainly they should be few, because the intellectual level of a researcher depends on what you get from your professors and on the commitment that the students put into acquiring knowledge.
Of course, if a researcher does not have a good statistical level, he does not have the ability to work with a statistical package; that could be very dangerous for his prestige, for the quality of the research he is producing, and for society. A crude comparison would be a person who does not know how to drive a car trying to drive one.
Well, Raphael, I invite you to ask researchers in "highly developed" countries for their understanding of statistical concepts. I am sure that, at least in the bio/medical field, you will find the vast majority unable to correctly say what a p-value is (see, for instance, H. Haller and S. Krauss, Methods of Psychological Research Online 2002, Vol. 7, No. 1). The results will be very similar when you ask about "confidence intervals", and it will surely be worse (if that is even possible!) when you ask about "likelihood (intervals)", "posterior densities", "effect sizes", "linear models", ... They have not even heard of the existence of two entirely incompatible concepts of "hypothesis testing" and "significance testing" (the old battle between Fisher and Neyman). There is already very little understanding of what probability means (rules for calculating with probabilities are excessively covered in intro books, but it is rarely explained what probabilities are actually about - except for some frequentist properties that are desirable but do not hit the point). There is no concept of information and entropy (except that - sometimes - the measurement scales [nominal, ordinal, interval, ratio] are associated with "information content", but even then the practical impact is not clear to most).
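The confidence-interval point can be made concrete with a short simulation (a hypothetical R sketch with invented parameters): the "95%" is a statement about the procedure under repeated sampling, not about the single interval in hand.

```r
set.seed(1)
true_mu <- 10

# Draw many samples, compute a 95% CI from each, and record whether
# the interval covers the true mean:
covered <- replicate(10000, {
  x  <- rnorm(30, mean = true_mu, sd = 3)
  ci <- t.test(x)$conf.int
  ci[1] <= true_mu && true_mu <= ci[2]
})

mean(covered)  # approximately 0.95 - a property of the procedure
```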
It is a prominent part of the problem that there are so many really bad introductory statistics books (and courses) out there. They are no more than "cookbooks", avoiding introducing people to "statistical thinking" and instead providing quick recipes for frequent problems, re-iterating pre-computer concepts and methods. The philosophical background of probability is never discussed in intro books (I don't know a single one); at best (?!) circular definitions are given that finally only confuse people. The same goes for definitions and concepts of knowledge, information, and learning. Young researchers are instructed in using a hammer without being told what a nail is, what a nail is for, and what it is not for.
Sad but true, it is common practice that researchers do a t-test because others do it and reviewers request it (whether appropriate or not). And they present bar charts for the same reasons. Occasionally, some put a linear regression on a chart or calculate a correlation coefficient (again, whether appropriate or not; and again for the same reasons). There is not much more. I can show you tons of publications where two-factorial experiments are analyzed with a one-way ANOVA and some post-hoc tests, without considering the main research question about the interaction at all (the term "interaction" is widely unknown). And all this in "highly developed" countries.
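To make that contrast concrete, here is a hypothetical R sketch (factors A and B, with invented data that contain a built-in interaction):

```r
set.seed(7)
d   <- expand.grid(A = factor(1:2), B = factor(1:3), rep = 1:5)
d$y <- rnorm(nrow(d)) + as.numeric(d$A) * as.numeric(d$B)

# What is often published: a one-way ANOVA over the six cell combinations -
# the interaction cannot even be asked about this way
summary(aov(y ~ interaction(A, B), data = d))

# What the two-factorial design calls for: main effects plus the A:B
# interaction, which here is the actual research question
summary(aov(y ~ A * B, data = d))
```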
That is also true, Dr Wilhem. Unfortunately there are also those nasty cases. Sometimes we find papers that should not have been published, but I think that this is not the rule.
Probably yes, but how can we avoid the use of those packages? Anybody with a knowledge of spreadsheet-like programs can "play" with those packages. The only serious barrier comes from command-line programs, like R, where you cannot "play" without a minimum of knowledge.
But this is where the actual problem starts! Many of these foundation courses - even at well-established institutions - lack the time or the didactics to motivate students to think statistically. They throw lots of terms at you and in the end - sadly - do not get past a cookbook recipe for one or two hypothesis tests and maybe a glimpse at linear regression. By then the 13-week semester or the two-week workshop is usually over.
So these basic principles AND where you get them are crucial.
The core concepts (mean, median, mode, variance, standard deviation), and then the more complex concepts of hypothesis testing using linear regression, ANOVAs, and even chi-square tests and t-tests, are essential. If students don't learn these concepts, they will not be able to read and interpret the output sheets from SAS, SPSS, or any other statistical software package. I would argue that the first statistics course might even preclude students from using software packages until they demonstrate a mastery of these concepts.
Here, again, it is stressed that statistics courses should cover the "concepts" of mean, median, variance, and so on. This is what a typical course tries to cover. And in my opinion, this is not the solution but rather part of the problem. I think we agree that "statistical thinking" is the main thing to be taught, but the mentioned "concepts" are simply not the basis of "statistical thinking". What is much, much more important for achieving "statistical thinking" is to get an idea about things like information, learning, and knowledge, all of which are at first much more philosophical than technical terms. The philosophy must be made clear; then one can go on and see how information can be obtained, that data can transport information, and that this is relative to the existing knowledge and dependent on the kind of data. After this it will become clearer that, and how, the relevant pieces of information can be "extracted" from the data - and this is where eventually things like mode, median, variance, etc. enter the game.
Teaching statistics is very much influenced, directly or indirectly, by mathematicians (I mean "professional" statisticians, who really know the maths). This is generally good and important, though it has a major drawback: whereas for mathematicians the "technical" aspects (I just call them "technical", hoping you understand what I mean) are the most important or even the only subject, practitioners foremost require the underlying philosophy - the technical details are solved by computer programs. A reasonable statistics course - in my opinion - should start with one or even two semesters of philosophy (of science, knowledge, information, learning, empiricism, data), then lead on to models (from the philosophical to the statistical). In this course, the properties of different kinds of data should be examined under different conditions, thereby developing an understanding of what information can be extracted, how it can be extracted and represented, and how it influences the body of knowledge.
We should recall that scientists in the natural sciences usually take a Ph.D., and the "Ph" stands for philosophy (*). We do not educate students to be "engineers in statistics" or "science engineers" (these are different educational aims, not less valuable, but with a different focus!). If we want to have PhDs, we need to teach philosophy, and this is the very heart and foundation of all "statistical thinking".
My 2pc.
(*) From Wikipedia: "Philosophy is the study of general and fundamental problems, such as those connected with reality, existence, knowledge, values, reason, mind, and language."
I own a book by Charles Landesman (1985) titled "Philosophy: The Central Issues." He argues that "philosophy" in Greek literally means the "love of wisdom," and that any person seriously seeking wisdom is by default a "philosopher."
I cannot argue against your claim that students need to know the philosophy of science. However, in the social sciences statistics is a tool used to test hypotheses based on theories. A field of study can advance only through testing theories. In fact, all knowledge is tentative (theories are expected to be altered by future findings) for this reason.
The software packages used (SAS, SPSS, MINITAB) are computer-operated systems that require a background in statistics to interpret the output pages. If students don't know the core concepts of statistics, they will not be able to advance science, because they will not be able to accurately interpret the results of the hypotheses they are testing based on whatever theories they used to develop those hypotheses.
Science is a method, and statistics is a tool used in social science research to test theories, to advance a field. Eddie's question was specific to this point, largely predicated on using software to aid researchers in quantitative hypothesis testing.