Using Short Tests and Questionnaires for Making Decisions about Individuals:
When is Short too Short?
Peter Mathieu Kruyen
Cover design by Roos Verhooren (info@roosverhooren.nl)
Printed by Ridderprint BV, Ridderkerk, the Netherlands
© Peter Mathieu Kruyen, 2012
No part of this publication may be reproduced or transmitted in any form or by any means, electronically or mechanically, including photocopying, recording or using any information storage and retrieval system, without the written permission of the author, or, when appropriate, of the publisher of the publication.
ISBN/EAN: 978-90-5335-614-2
This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO), grant number 400-05-179.
Using Short Tests and Questionnaires for Making Decisions about Individuals:
When is Short too Short?
Dissertation submitted to obtain the degree of doctor
at Tilburg University
on the authority of the rector magnificus,
prof. dr. Ph. Eijlander,
to be defended in public before a committee
appointed by the doctorate board
in the auditorium of the University
on Friday, December 14, 2012, at 14:15
by
Peter Mathieu Kruyen,
born on July 26, 1983, in Dordrecht
Supervisor: Prof. dr. K. Sijtsma
Co-supervisor: Dr. W. H. M. Emons
Other members of the Doctoral Committee:
Prof. dr. M. Ph. Born
Prof. dr. R. R. Meijer
Prof. dr. M. J. P. M. van Veldhoven
Dr. L. A. van der Ark
Dr. A. V. A. M. Evers
Contents
1. Introduction …………………………………………………………………………………………………………….. 1
1.1 Test length and individual decision-making ………………………………………………… 3
1.2 Preliminaries: Test length and measurement precision ………………………………… 6
1.3 Overview of the thesis ……………………………………………………………………………… 12
2. On the shortcomings of shortened tests: A literature review ……………………………………. 15
2.1 Introduction ……………………………………………………………………………………………… 17
2.2 Research questions …………………………………………………………………………………… 18
2.3 Technical terms ………………………………………………………………………………………… 19
2.4 Method ……………………………………………………………………………………………………… 26
2.5 Results ……………………………………………………………………………………………………… 29
2.6 Discussion ………………………………………………………………………………………………… 40
Appendix: Coding scheme ……………………………………………………………………………… 44
3. Test length and decision quality: When is short too short? ………………………………………. 47
3.1 Introduction ……………………………………………………………………………………………… 49
3.2 Background ……………………………………………………………………………………………… 50
3.3 Method ……………………………………………………………………………………………………… 53
3.4 Results ……………………………………………………………………………………………………… 63
3.5 Discussion ………………………………………………………………………………………………… 71
4. Assessing individual change using short tests and questionnaires …………………………….. 75
4.1 Introduction ……………………………………………………………………………………………… 77
4.2 Theory ……………………………………………………………………………………………………… 79
4.3 Method ……………………………………………………………………………………………………… 82
4.4 Results ……………………………………………………………………………………………………… 87
4.5 Discussion ………………………………………………………………………………………………… 94
5. Shortening the S-STAI: Consequences for research and individual decision-making … 99
5.1 Introduction …………………………………………………………………………………………… 101
5.2 Background: Pitfalls of shortening the S-STAI ………………………………………… 103
5.3 Method …………………………………………………………………………………………………… 108
5.4 Results …………………………………………………………………………………………………… 112
5.5 Discussion ……………………………………………………………………………………………… 117
Appendix: Explanation of strategies used to shorten the S-STAI ……………………… 119
6. Conclusion and discussion …………………………………………………………………………………….. 121
6.1 Conclusion ……………………………………………………………………………………………… 123
6.2 Discussion ……………………………………………………………………………………………… 128
References ……………………………………………………………………………………………………………….. 131
Summary …………………………………………………………………………………………………………………. 143
Samenvatting (Summary in Dutch) …………………………………………………………………………… 149
Woord van dank (Acknowledgments in Dutch) …………………………………………………………. 157
Chapter 1: Introduction
1.1 Test Length and Individual Decision-Making
Psychological tests and questionnaires play an important role in individual
decision-making in areas such as personnel selection, clinical assessment, and educational
testing. To make informed decisions about individuals, psychologists are interested in
constructs like motivation, anxiety, and reading level, which have been shown to be valid
predictors of criteria such as job success, suitability for therapy, and mastery of reading
skills. These unobservable constructs are measured by a collection of items that together constitute a test. Adding the scores on the items yields a respondent’s total score or test score,
which reflects the respondent’s level on the construct of interest. Total scores are used to
decide, for example, which applicant to hire for a job, whether a patient benefited from a
treatment, or whether a particular student needs additional reading help.
Before using a test for individual decision-making, test users need to be reasonably
certain that decisions for individual respondents do not depend on one
particular test administration (e.g., Emons, Sijtsma, & Meijer, 2007, p. 133; Hambleton &
Slater, 1997). When total scores vary considerably across different (hypothetical) test
administrations due to random influences like mood and disturbing noises during the test
administration, the risk of incorrect individual decisions may be substantial. As a result,
test users may reject a suitable applicant, continue an unsuccessful treatment, or deny
additional help to a student with a low reading level. Incorrect decisions may have
important negative consequences such as a decline of the well-being of individual
respondents and the waste of organizational resources.
In this PhD thesis, the focus is on the influence of random measurement error or
total-score unreliability on test performance in relation to individual decision-making.
Special attention is given to test length in relation to reliability, which is a group
characteristic, and measurement precision, which pertains to measurement of individuals.
Throughout, we concentrate on reliability issues in decision-making about individuals, and
for the sake of simplicity assume that tests are valid. Validity is a highly important topic
that cannot be addressed in passing and justifies a PhD study on its own.
Generally, tests consisting of many items, say, at least 40 items, are more reliable
than tests consisting of only a few items, say, 15 or fewer items. Specifically, psychometric
theory—the theory of psychological measurement—shows that the more items respondents
answer, the smaller the relative influence of random errors on total scores (e.g., Allen &
Yen, 1979, pp. 85-88; Nunnally & Bernstein, 1994, pp. 230-233). However, extending test
length with the purpose of minimizing the relative influence of random errors encounters
numerous practical objections. For example, filling out long tests may result in high
administration costs. In other applications, test users do not want to trouble respondents
with many questions, for example, when critically-ill patients need to be assessed. To
summarize, test users and test constructors often do not appreciate long tests and
questionnaires and rather prefer tests that are as short as possible, often within the limits of
particular psychometric constraints.
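The general relation between test length and reliability can be illustrated with the Spearman-Brown prophecy formula. The following sketch is not part of the analyses in this thesis; it assumes parallel items, which real shortened tests satisfy only approximately, and the reliability and test-length values are made up for illustration.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when test length is multiplied by factor k,
    assuming all items are parallel."""
    return k * reliability / (1 + (k - 1) * reliability)

# Shortening a 40-item test with reliability .90 to 10 items (k = 10/40):
print(round(spearman_brown(0.90, 10 / 40), 2))  # about .69
```

The example shows the asymmetry at work: quartering the test length does not quarter the reliability, but the loss is large enough to matter for individual decision-making.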
Hence, short tests—including shortened versions of previously developed longer
tests—abound in practice. In personnel psychology, for example, researchers
developed a 12-item form of the Raven Advanced Progressive Matrices test (originally 36
items, Arthur & Day, 1994) and a scale measuring focus of attention by means of 10 items
(Gardner, Dunham, Cummings, & Pierce, 1989). In clinical psychology, examples include
a 13-item version of the Beck Depression Inventory (originally 21 items, Beck & Beck,
1972) and a 3-item version of the Brief Pain Inventory (originally 11 items, Krebs et al.,
2009). In personality psychology, we can find, for example, a 10-item and a 5-item
questionnaire allegedly measuring the complete Big Five construct (Gosling, Rentfrow, &
Swann, 2003). For the purpose of comparison, the NEO PI-R contains 240 items to
measure the Big Five (Costa & McCrae, 1992).
One way or another, psychologists need to deal with the consequences that using
short tests has for the risk of making incorrect decisions about individuals. The goal of this
thesis is to assess whether, based on psychometric considerations, short tests may be used
for making decisions about individuals. Total scores can be reliable, but if a test designed
to measure reading level also measures another construct such as anxiety, scores will be
interpreted incorrectly. In this thesis, we focus on the relationship between test length and
reliability, because reliability is a necessary (although not a sufficient) condition for tests to
be valid (Nunnally & Bernstein, 1994, p. 214); that is, poor test performance that is mostly
due to random measurement error does not reflect the influence of the construct of interest,
and a reliable test may or may not measure the intended construct. Validity should be
studied on its own but for practical reasons is simply assumed here.
We answer the following research questions:
1. To what extent do psychologists pay attention to the consequences of using short
tests for making decisions about individuals?
2. How should one assess the risk of making incorrect individual decisions?
3. To what extent does test shortening increase the risk of making incorrect individual
decisions?
4. What are minimal test-length requirements for making decisions about individuals
with sufficient certainty?
1.2 Preliminaries: Test Length and Measurement Precision
Often reliability is assessed by coefficient alpha or the test-retest correlation
(Nunnally & Bernstein, 1994, pp. 251-255). A test is deemed suited for individual
decision-making if the total-score reliability exceeds a minimum value that is recognized
as a rule of thumb. However, for psychologists interested in individual decision-making,
measurement precision is more important than reliability (Harvill, 1991; Mellenbergh,
1996; Nunnally & Bernstein, 1994, p. 260; Sijtsma & Emons, 2011).
In this section, we show that the reliability coefficient conveys insufficient
information to assess whether the total score is precise enough to be useful for individual
decision-making. Specifically, we show that total-score reliability as estimated by
coefficient alpha can be acceptable for short tests while, at the same time, the measurement
precision of the test is much lower and may even be unacceptably low.
1.2.1 Theory
We studied the relationship between test length and measurement precision from
the perspective of classical test theory (CTT). CTT assumes that a total score, which is the
sum of the scores on the items in the test, and which is denoted $X_+$, equals the sum of true score $T$ and random measurement error $E$: $X_+ = T + E$. The statistical model of CTT assumes that the same test is administered an infinite number of times to a particular
respondent and that these administrations are independent so that different administrations
can be considered to be replications. Due to random processes reflected by the error
component, replications produce a distribution of total scores, also known as the propensity
distribution (Lord & Novick, 1968, pp. 29-30). The mean of the propensity distribution is
defined as the respondent’s true score, and its dispersion is the respondent’s measurement-error variance.

Figure 1.1: Example of propensity distributions for two respondents with different true scores and error variances.

Figure 1.1 shows the hypothetical propensity distributions of two respondents, which differ with respect to true score and error variance. Thus, CTT assumes that different respondents are measured with different precision.
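The notion of a propensity distribution is easy to mimic in a few lines of code. The sketch below is not from the thesis: the true scores, error standard deviations, and the normality of the errors are illustrative assumptions. It simulates many hypothetical replications for two respondents and recovers each respondent's true score and error dispersion.

```python
import random
import statistics

random.seed(1)

def propensity_sample(true_score: float, error_sd: float, n_reps: int = 50_000):
    """Hypothetical replications X+ = T + E for one respondent,
    with normally distributed random error (an illustrative assumption)."""
    return [true_score + random.gauss(0.0, error_sd) for _ in range(n_reps)]

# Two hypothetical respondents with different true scores and error variances
scores_1 = propensity_sample(true_score=5.0, error_sd=1.0)
scores_2 = propensity_sample(true_score=9.0, error_sd=1.5)

# The mean of each propensity distribution approximates the true score;
# the standard deviation approximates the respondent's error SD.
print(round(statistics.mean(scores_1), 1), round(statistics.stdev(scores_1), 1))
print(round(statistics.mean(scores_2), 1), round(statistics.stdev(scores_2), 1))
```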
In the real world, instead of a propensity distribution only one total score is
available for each respondent. Hence, in practice one uses the sample of total scores from
all individuals to estimate one common error variance, which may be considered the mean
of all the error variances of the unobserved propensity distributions (Lord & Novick, 1968,
p. 35). The square root of this common error variance, known as the standard error of measurement
(SEM), is used to quantify measurement precision for each individual. Let $S_X^2$ denote
the total-score variance in the sample, $S_T^2$ the unobservable true-score variance, and $S_E^2$ the measurement-error variance. Given the definition of random measurement error, it can be shown that $S_X^2 = S_T^2 + S_E^2$. Using this result, in the sample total-score reliability is defined as $r_{XX'} = S_T^2 / S_X^2 = 1 - S_E^2 / S_X^2$. The SEM can be derived to be equal to $S_E = S_X \sqrt{1 - r_{XX'}}$ (Allen & Yen, 1979, p. 89), and $r_{XX'}$ may be substituted by coefficient alpha when the SEM is estimated from the data.
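To make the SEM formula concrete, here is a minimal sketch; the total-score SD and reliability value are illustrative assumptions, not data from this thesis.

```python
import math

def standard_error_of_measurement(total_score_sd: float, reliability: float) -> float:
    """SEM = S_X * sqrt(1 - r_XX'); coefficient alpha may stand in for r_XX'."""
    return total_score_sd * math.sqrt(1.0 - reliability)

# Illustrative values: total-score SD of 10.6 and coefficient alpha of .96
print(round(standard_error_of_measurement(10.6, 0.96), 2))  # 2.12
```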
The SEM is used to estimate confidence intervals (CIs) for the true score $T$ (Allen & Yen, 1979, p. 89). The narrower the CI, the more precise the estimate of $T$. CIs are computed as follows. The observed total score $X_+$ is taken as an estimate of the true score, so that $\hat{T} = X_+$, and the SEM is assumed to be the standard deviation of a normal distribution with mean $T$. When a 95% CI is used, the respondent’s true score $T$ lies in the interval $X_+ \pm 1.96 \cdot \mathrm{SEM}$ in 95% of the hypothetical test replications. However, in certain practical settings such as personnel selection, “few organizations can wait to be 95% sure of success” (Smith & Smith, 2005, p. 126). Apart from the incorrect interpretation of CIs expressed in this quotation, the result is that in these settings a lower confidence level is often chosen (e.g., a 68% CI, meaning $X_+ \pm \mathrm{SEM}$), implying that organizations are willing to take a higher risk of making an incorrect decision for individual respondents.
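The CI computation described above can be sketched as follows; the observed score and SEM below are made-up values used only for illustration.

```python
def true_score_ci(observed: float, sem: float, z: float = 1.96):
    """CI for the true score: observed total score +/- z * SEM,
    under CTT's normal-error assumption."""
    return (observed - z * sem, observed + z * sem)

lo95, hi95 = true_score_ci(25.0, 2.12)           # 95% CI (z = 1.96)
lo68, hi68 = true_score_ci(25.0, 2.12, z=1.0)    # 68% CI: X+ +/- SEM
print(round(lo95, 2), round(hi95, 2))
print(round(lo68, 2), round(hi68, 2))
```

Choosing the 68% level narrows the interval, which is exactly the trade-off the text describes: a tighter-looking interval bought at the price of a higher error risk.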
1.2.2 Method
We did a computational study to illustrate the relation between test length,
reliability, and measurement precision. Let $J$ be the number of items in the test, and let the score on item $j$ be denoted by $X_j$. Items may be dichotomously scored (i.e., 0 for an incorrect answer and 1 for a correct answer) or polytomously scored (e.g., $0, 1, \ldots, m$ for rating-scale items). We define the range of the scale as the difference between the maximum possible total score and the minimum possible total score. For $J$ dichotomous
items, the scale range equals $J$, and for $J$ rating-scale items the scale range equals $J \times m$. For dichotomous items, we studied how the ratio of the CI to the scale range, henceforth denoted the relative CI, relates to test length.
For 1,000 respondents, item scores were simulated using the item response model
known as the Rasch model (Embretson & Reise, 2000, p. 67; Rasch, 1980). The Rasch
model is defined as follows. Instead of a true score, the model uses a latent variable,
denoted $\theta$, as the person variable of interest. Without much loss of generality, we assumed that $\theta$ has a standard normal distribution. Items are characterized by their difficulty, here denoted $\delta_j$, which is expressed on the same scale as the latent person variable $\theta$. The Rasch model expresses the probability of a 1 score on dichotomous item $j$ as a function of the latent variable $\theta$ and the difficulty $\delta_j$ of the item as

$$P(X_j = 1 \mid \theta, \delta_j) = \frac{\exp[a(\theta - \delta_j)]}{1 + \exp[a(\theta - \delta_j)]}. \quad (1.1)$$

Constant $a$ expresses the common discrimination power of the $J$ items in the test. The higher $a$, the higher the probability that a respondent with a low $\theta$ value relative to the item location $\delta_j$ scores 0 and a respondent with a high $\theta$ value relative to $\delta_j$ scores 1. Because an increase of $a$ in all items causes an increase of total-score reliability, in a simulation study $a$ can be used to manipulate the reliability of the total score.

For $J = 40$, we chose item-difficulty values between –1.5 and 1.5 such that distances between adjacent values were equal throughout. Item 1 is the easiest item, implying that out of all 40 items it has the highest probability of a 1 score for each $\theta$ value, and item 40 is the most difficult item, implying the lowest probability for each $\theta$ value. By choosing $a = 2.9$, we found that for $J = 40$ the reliability estimated by coefficient alpha equaled .96. This is high but not unrealistic for a 40-item test. Next, to obtain tests
consisting of 20, 15, 10, and 5 items, we removed items from the 40-item test, such that in
each test the item difficulties of the remaining items were spread at approximately equal
distances between –1.5 and 1.5.
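The simulation design can be sketched in code as follows. This is a reconstruction under the assumptions stated in the text (standard-normal $\theta$, equally spaced difficulties between –1.5 and 1.5, $a = 2.9$, 1,000 respondents), not the authors' actual code, and the random seed is arbitrary.

```python
import math
import random

random.seed(42)

def rasch_prob(theta: float, delta: float, a: float = 2.9) -> float:
    """Probability of a 1 score under Equation 1.1 (logistic form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def simulate_item_scores(n_persons: int, deltas: list) -> list:
    """Dichotomous item scores for persons with theta ~ N(0, 1)."""
    data = []
    for _ in range(n_persons):
        theta = random.gauss(0.0, 1.0)
        data.append([int(random.random() < rasch_prob(theta, d)) for d in deltas])
    return data

def coefficient_alpha(data: list) -> float:
    """Cronbach's alpha from a persons-by-items score matrix."""
    n, J = len(data), len(data[0])
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    item_var_sum = 0.0
    for j in range(J):
        col = [row[j] for row in data]
        m = sum(col) / n
        item_var_sum += sum((x - m) ** 2 for x in col) / (n - 1)
    return J / (J - 1) * (1.0 - item_var_sum / var_total)

J = 40
deltas = [-1.5 + 3.0 * j / (J - 1) for j in range(J)]  # equally spaced difficulties
scores = simulate_item_scores(1000, deltas)
print(round(coefficient_alpha(scores), 2))  # close to .96, as reported in the text
```

Shorter tests can then be obtained, as in the text, by keeping subsets of `deltas` that remain approximately equally spaced and recomputing alpha.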
1.2.3 Results
Table 1.1 shows coefficient alpha as an estimate of total-score reliability, the SEM,
the width of the 68% CI and the 95% CI, and the relative CIs. The table shows that
removing items from the test caused coefficient alpha to decrease from .96 for $J = 40$ to .70 for $J = 5$. The latter alpha value is still acceptable for some practical applications (Kline, 2000, p. 524). The SEM and the width of the CI also decreased as the test grew shorter. For example, for $J = 40$ the SEM was 2.12 and the 95% CI covered a range of 8.27 scale points, whereas for $J = 5$ the SEM was 0.74 and the 95% CI covered only 2.92 scale points.
Table 1.1: Test length and measurement precision for five test lengths.
J    Alpha   SEM    68% CI   Relative CI   95% CI   Relative CI
40   .96     2.12   4.22     .11           8.27     .21
20   .92     1.49   2.98     .15           5.85     .29
15   .90     1.31   2.62     .17           5.13     .34
10   .85     1.06   2.12     .21           4.16     .42
5    .70     0.74   1.49     .30           2.92     .58
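The relative CIs in Table 1.1 can be verified with a few lines of code. This is a sketch: the SEM values fed in are the rounded table entries, so the results match the table only up to rounding.

```python
def relative_ci(sem: float, scale_range: float, z: float = 1.96) -> float:
    """CI width (2 * z * SEM) divided by the scale range,
    which equals J for J dichotomous items."""
    return 2.0 * z * sem / scale_range

print(round(relative_ci(2.12, 40), 2))        # J = 40, 95% level: .21
print(round(relative_ci(0.74, 5), 2))         # J = 5,  95% level: .58
print(round(relative_ci(0.74, 5, z=1.0), 2))  # J = 5,  68% level: .30
```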
Smaller SEMs and narrower CIs suggest greater measurement precision, but this conclusion would be wrong, as shown by the relative CIs, which increased substantially as the scale range decreased. Figure 1.2 shows the relative CI at the midpoint of the scale. As $J$ decreases, a larger part of the scale becomes unreliable. This means that only if respondents differ to a large degree will their differences on the scale be significant. However, the vast