Currently there are common short tests in psychology, with 4 items, 3 items, or even a single item. What is your opinion of these practices? Do you consider that they have more advantages than disadvantages?
That really depends on what you are planning to do. If you want to screen large populations with little effort, some short tests may be sufficient. If you are working with small populations, e.g. in a clinical setting, I would go for the longer scales. Also, short scales differ in quality, so I would always consult the corresponding literature and see how this specific test is doing. For example, I worked with a short form of the Big Five personality scales because it was well-tested and commonly used.
One disadvantage of having fewer than 4 items: if you need to delete items to reach a Cronbach's alpha of at least 0.70, doing so is much more difficult when there are only a few items to begin with.
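To make that concrete, here is a minimal sketch of how Cronbach's alpha is usually computed from a respondents-by-items score matrix, and what "alpha if item deleted" looks like for a 3-item scale. The data and the helper name cronbach_alpha are invented for illustration.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a (respondents x items) score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical responses: 6 respondents x 3 items on a 1-5 scale.
x = np.array([[4, 5, 2],
              [3, 4, 1],
              [5, 5, 3],
              [2, 2, 2],
              [4, 3, 1],
              [1, 2, 4]])

print("alpha (all items):", round(cronbach_alpha(x), 2))
for j in range(x.shape[1]):
    reduced = np.delete(x, j, axis=1)
    print(f"alpha if item {j + 1} deleted:", round(cronbach_alpha(reduced), 2))
```

With only three items, the "alpha if item deleted" step leaves a two-item scale, so there is very little room to prune your way up to 0.70.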
Jose, this depends on what you are trying to do with the test. Let's think for a moment about measurement and not logistics or other testing considerations. Start by assuming that your trait goes from a little to a lot of that trait. Imagine it as a bipolar scale. The job of measurement is to locate a person somewhere on that scale. The issue isn't how many items you have; it is the precision of measurement.

Let's consider each item to be an indicator of whether the person is above or below the threshold indicated by that item. If the person is located at the top of the scale (has a lot of the trait), then items that distinguish well at the bottom of the scale add almost no information (the person is above the threshold of all of the items), and measurement will lack precision for someone at that end of the scale. On the other hand, if the person is within the cluster of items, each item contributes information as to whether the person is above or below its threshold. Ideally, the items will capture the space around the person, and that person's location on the scale can be well determined with low error (high precision, high reliability). We might call this short scale at the bottom of the range a high-precision peaked information test that provides high-quality information only in that area of the scale.
If the test is only 4 items long, it may work very well as a peaked information test, because each item delivers fairly high-quality information around that location (the ability or strength of the trait at that level). The farther the person is actually located from the peaked cluster of items, the lower the precision of measurement (more error, less reliability in the conceptual sense, not Cronbach's alpha). If the same number of items is spread across the entire range of the trait, the test can be considered a broad information test. In this case, each item contributes less precise information about the person's location and is more error prone, though the test may capture more of the scale space. So it is not the number of items, per se, that determines the quality of measurement; it is the spread of the items around the location of the person.
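The peaked-versus-broad contrast can be shown in code using the item information function of a 2PL IRT model, I(theta) = a^2 * P(theta) * (1 - P(theta)). The discrimination value and the item locations below are made up for illustration, not taken from any particular test.

```python
import numpy as np

def item_information(theta, a, b):
    """Fisher information of a 2PL item with discrimination a and difficulty b."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a**2 * p * (1 - p)

theta = np.linspace(-3, 3, 7)      # trait locations to inspect
a = 1.5                            # common discrimination (assumed)

peaked_b = np.zeros(4)             # 4 items all located near theta = 0
broad_b = np.linspace(-2, 2, 4)    # 4 items spread across the trait range

peaked_info = sum(item_information(theta, a, b) for b in peaked_b)
broad_info = sum(item_information(theta, a, b) for b in broad_b)

for t, pi, bi in zip(theta, peaked_info, broad_info):
    print(f"theta = {t:+.1f}   peaked = {pi:.2f}   broad = {bi:.2f}")
```

The peaked set prints high information near theta = 0 and very little at the extremes, while the spread-out set is flatter but lower everywhere, which is the trade-off described above.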
Increasing the number of items has often been associated with increasing reliability, and in a correlational sense that may be correct. From a scaling viewpoint, broadly spreading items across the whole trait domain creates the equivalent of many mini peaked information tests. But it depends on what part of the scale is captured by each item.
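The classical-test-theory side of that claim is the Spearman-Brown prophecy formula, rho_k = k * rho / (1 + (k - 1) * rho), where k is the factor by which the test is lengthened. A quick sketch, with invented reliability values:

```python
def spearman_brown(rho, k):
    """Projected reliability when a test's length is multiplied by factor k,
    given current reliability rho (classical prophecy formula)."""
    return k * rho / (1 + (k - 1) * rho)

# e.g. a 4-item scale with reliability 0.60, lengthened to 12 items (k = 3):
print(round(spearman_brown(0.60, 3), 2))   # roughly 0.82
```

The formula assumes the added items are parallel to the existing ones, which is exactly the scaling caveat raised above: it matters what part of the scale each added item captures.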
If you need a test that validates a diagnosis and whether the person meets some specific criterion, then a short peaked information test would be appropriate. If you need a broad screening measure that may not be nearly as accurate, then a broad information test might fit your needs. Be prepared that when you are talking with people trained only in classical test theory they may figuratively look at you as if you have seven heads.
For a more complete treatment of this modern view of measurement, see the following:
Engelhard, G. (2012). Invariant measurement: Using Rasch models in the social, behavioral, and health sciences. New York, NY: Psychology Press.
Wright, B. D., & Stone, M.H. (1969). Best Test Design, Chicago: Mesa Press.
You might also want to look at the field of adaptive testing, where the response to an item determines the next item to be administered. Very efficient measurement can be achieved this way, combining the benefits of broad-based and peaked information testing.
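As a toy illustration of that idea, here is a minimal sketch of an adaptive administration under a 2PL model: estimate the trait level from the responses so far, then administer the unadministered item that is most informative at that estimate. The item bank, the grid-search estimator, and the simulated respondent are all invented for illustration; real CAT systems use calibrated banks and better estimators, but the selection logic is the same.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a keyed ('correct') response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_info(theta, a, b):
    """Fisher information of a 2PL item at trait level theta."""
    p = p_correct(theta, a, b)
    return a**2 * p * (1 - p)

# Hypothetical item bank: (discrimination, difficulty) pairs spread over the trait range.
bank = [(1.2, -2.0), (1.5, -1.0), (1.8, 0.0), (1.5, 1.0), (1.2, 2.0)]
responses = {}            # item index -> 0/1 response
theta_hat = 0.0           # start in the middle of the scale

def next_item():
    """Pick the unadministered item with maximum information at the current estimate."""
    remaining = [i for i in range(len(bank)) if i not in responses]
    return max(remaining, key=lambda i: item_info(theta_hat, *bank[i]))

def update_theta(grid=np.linspace(-4, 4, 161)):
    """Crude maximum-likelihood update of theta over a grid of candidate values."""
    loglik = np.zeros_like(grid)
    for i, x in responses.items():
        a, b = bank[i]
        p = p_correct(grid, a, b)
        loglik += x * np.log(p) + (1 - x) * np.log(1 - p)
    return grid[np.argmax(loglik)]

# Simulate administering 3 items to a respondent whose true trait level is 0.5.
rng = np.random.default_rng(1)
true_theta = 0.5
for _ in range(3):
    i = next_item()
    responses[i] = int(rng.random() < p_correct(true_theta, *bank[i]))
    theta_hat = update_theta()
    print(f"item {i}, response {responses[i]}, theta_hat = {theta_hat:+.2f}")
```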
Your response has been very helpful. You are quite right that one should think about the measurement rather than the quantity of items. Thank you very much; your answer has helped me clarify several doubts.
Disadvantage 1: In brief tests or questionnaires, missing values do much more harm to the reliability of the score than in sufficiently long tests/questionnaires (a rough numerical illustration follows after this answer).
Disadvantage 2: Brief questionnaires may not cover the trait or disorder one is interested in to a sufficient degree.
Disadvantage 3: If the questionnaire is meant to rate the degree to which a person may suffer from a so-called mental disorder, then, in particular, several items are needed, because such disorders consist of several partially independent, partially interrelated phenomena. The items referring to these phenomena cannot be considered to be so-called indicators of one or two so-called latent factors that would fully account for the correlations between them. In other words, the questionnaire does not obey the common factor model. This argument has repeatedly been put forward by the proponents of a network approach to mental disorders and traits (Dennis Borsboom, Angelique Cramer, Eiko Fried and several others).
Conclusion: If you use a very summary questionnaire for a complicated investigation, the results may be meaningless or highly untrustworthy. To speak of my own area of interest: I am afraid the much-used OCI-R, devised by Foa et al. (rating the degree of six OCD subtypes by means of only three items per subtype), may be such a risky instrument. If many subjects had to be asked to cooperate in such research, you may have wasted their time and energy. If you get the study published, you may waste the time of the readers of your article. And you may have wasted your own time.
However, this is not to deny the OCI-R's value for a quick first screening on OCD in potential sufferers.
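To put Disadvantage 1 above in rough numbers (as promised), the Spearman-Brown prophecy formula can be run in reverse to see how much one skipped item costs a short scale versus a long one. The starting reliabilities here are invented.

```python
def spearman_brown(rho, k):
    """Projected reliability when test length is multiplied by factor k."""
    return k * rho / (1 + (k - 1) * rho)

# One skipped item shortens a 3-item scale by a factor of 2/3,
# but a 30-item scale only by 29/30.
print(round(spearman_brown(0.70, 2 / 3), 2))    # ~0.61  (3-item scale, reliability 0.70)
print(round(spearman_brown(0.90, 29 / 30), 2))  # ~0.90  (30-item scale, reliability 0.90)
```

The three-item scale loses roughly a tenth of its reliability to a single skipped item; the thirty-item scale barely moves.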
What you are describing are ultra-short measures. Typically, someone interested in a construct develops an inventory with 30-50 items to measure it. Someone else then develops a "short form" with about 10-20 items, usually by selecting those items that correlate most strongly with the remainder (or load most strongly on the factor - same thing, really). If done well, this often results in very little loss of reliability, and the "short form" gets used in much subsequent research. But there is a real risk of losing some of what was originally intended - that is, the short form may not sample the domain of the construct as well as did the original. (Remember: The construct itself may be broad & not totally "tight," therefore any good measure will have limited reliability.)
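Here is a minimal sketch of the selection step described above, i.e. keeping the items that correlate most strongly with the remainder of the scale. The function name short_form and the corrected item-total criterion are just one way to operationalize it, not a specific published procedure.

```python
import numpy as np

def short_form(items, n_keep):
    """Select the n_keep items with the highest corrected item-total correlation
    (correlation of each item with the sum of the remaining items)."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    r = np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])
    keep = np.argsort(r)[::-1][:n_keep]
    return np.sort(keep), r

# e.g. item_scores: 200 respondents x 30 items; keep the best 10:
# keep, r = short_form(item_scores, 10)
```

Whether the retained items still sample the construct's domain adequately is exactly the question that this purely statistical criterion does not answer.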
The new trend toward "ultra-short" measures (sometimes using as few as 2 items per scale) is, I think, driven by the possibility of recruiting very large samples on the Web. Basically, the psychologists who are engaging in this research are doing what their cousins in sociology departments have been doing all along. Sociologists always relied more on crude measures administered to huge samples, while psychologists tended to rely more on refined measures administered to smaller (possibly even inadequate?) samples. The ultra-short scales often do suffer from poor reliability as well as poor domain sampling.
What you're seeing a lot of lately are elaborate statistical analyses applied to very shoddy data, but relying upon the brute strength of sample size to get the job done. Honestly, a lot of the studies aren't all that interesting. It isn't just the measures that are at fault - it's as if the psychologists themselves aren't really thinking about the underlying constructs, just playing in the statistical sandbox.
But ultra-short measures do have a place. If you're conducting a big, complicated study with lots and lots of participants, you're better off measuring a variable (even if poorly) than leaving it out altogether.
When building a scientific model, one must often create abstract concepts that are, by definition, fictitious and unobservable, but that are useful in predicting and/or controlling observable phenomena, i.e., have practical value. This is the case of the notions of "force", "mass" and "energy", which refer to things that cannot be measured except indirectly, through their assumed interaction with phenomena that can be directly observed. Their value is not in their inherent existence, for they have none, but rather in the fact that one can use them to deal effectively with the experiential world. In human and social sciences, such abstract concepts are referred to as "constructs", examples of which include "intelligence" and "personality".
Constructs refer to an abstract dimension considered to underlie the behavior of a set of observable variables. Their validity is determined by how closely such variables covary, meaning how much of their behavior displays a common pattern. This is taken as a sign of the strength with which the construct can be seen as governing them, as well as a way of estimating its quantitative value in a given situation. Their usefulness is assessed by how strongly they are associated with the observable variables one seeks to predict and/or control.
When measuring a construct in the human and social sciences with a large test with many items, one has the advantage that random errors are more likely to cancel each other out, leading to greater precision. Also, when using more items, one can make more conceptual mistakes, such as choosing some wrong or misleading items, and still have a robust enough measurement overall. The price, however, is the decreased practicality of longer tests.
With short tests, one has to make sure the few items used are strongly associated to the construct and, even so, there will be greater error in the estimation. However, the precision may very well be "enough" for a given purpose. The classic case is when one is interested in populations and associations between variables rather than individual clinical assessments.
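The error-cancelling argument in the two paragraphs above can be seen in a small simulation: if each observed item score is the true score plus independent noise, the noise in the scale mean shrinks roughly as one over the square root of the number of items. The error model here is an assumption made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_score = 0.0
n_persons = 10_000

for n_items in (3, 10, 30):
    # each observed item score = true score + independent random error (SD = 1)
    errors = rng.normal(0.0, 1.0, size=(n_persons, n_items))
    observed = true_score + errors.mean(axis=1)   # scale score = mean of the items
    print(f"{n_items:2d} items: SD of observed score = {observed.std():.2f}")
```

Whether the remaining error is small enough depends on the purpose, which is exactly the point made above about population-level versus individual clinical use.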
I make ample use of short tests for IQ, Big Five dimensions, and Emotional Regulation, among other things. The results are usually very good, even when the sample size is not huge. For illustration purposes: when comparing the IQs of adult men and women, over a dozen studies in the literature mention a difference of 2-5 points in favor of men (between the ages of roughly 14-16, girls develop faster than boys both physically and mentally, so studies involving these groups usually show little or no difference). Using a 5-item IQ test standardized on a sample of only 1,291 individuals, I was still able to find a difference of 3.5 points between the sexes in a group of 121 men and 116 women. Such a difference is exactly in the middle of the results in the literature. This brief measure of IQ also correlates well with scholastic performance, general knowledge, socioeconomic status, and openness to experience, as expected of a measure of IQ.
It depends on the type of study (objectives, scope, exploratory?) and the population you want to work with, on how well you want the items to measure the construct (sampling of the construct's domain), and on whether you are interested in the psychometric properties of the instrument (reliability and validity).