Using Short Tests and Questionnaires for Making Decisions about Individuals:
When is Short too Short?
Peter Mathieu Kruyen
Cover design by Roos Verhooren (info@roosverhooren.nl)
Printed by Ridderprint BV, Ridderkerk, the Netherlands
© Peter Mathieu Kruyen, 2012
No part of this publication may be reproduced or transmitted in any form or by any means, electronically or mechanically, including photocopying, recording or using any information storage and retrieval system, without the written permission of the author, or, when appropriate, of the publisher of the publication.
ISBN/EAN: 978-90-5335-614-2
This research was supported by a grant from the Netherlands Organisation for Scientific Research (NWO), grant number 400-05-179.
Using Short Tests and Questionnaires for Making Decisions about Individuals:
When is Short too Short?
Dissertation submitted to obtain the degree of doctor
at Tilburg University
on the authority of the rector magnificus,
prof. dr. Ph. Eijlander,
to be defended in public before a committee
appointed by the doctorate board
in the auditorium of the University
on Friday, December 14, 2012, at 14:15
by
Peter Mathieu Kruyen,
born on July 26, 1983, in Dordrecht
Supervisor: Prof. dr. K. Sijtsma
Co-supervisor: Dr. W. H. M. Emons
Other members of the Doctoral Committee:
Prof. dr. M. Ph. Born
Prof. dr. R. R. Meijer
Prof. dr. M. J. P. M. van Veldhoven
Dr. L. A. van der Ark
Dr. A. V. A. M. Evers
Contents
1. Introduction …………………………………………………………………………………………………………….. 1
1.1 Test length and individual decision-making ………………………………………………… 3
1.2 Preliminaries: Test length and measurement precision ………………………………… 6
1.3 Overview of the thesis ……………………………………………………………………………… 12
2. On the shortcomings of shortened tests: A literature review ……………………………………. 15
2.1 Introduction ……………………………………………………………………………………………… 17
2.2 Research questions …………………………………………………………………………………… 18
2.3 Technical terms ………………………………………………………………………………………… 19
2.4 Method ……………………………………………………………………………………………………… 26
2.5 Results ……………………………………………………………………………………………………… 29
2.6 Discussion ………………………………………………………………………………………………… 40
Appendix: Coding scheme ……………………………………………………………………………… 44
3. Test length and decision quality: When is short too short? ………………………………………. 47
3.1 Introduction ……………………………………………………………………………………………… 49
3.2 Background ……………………………………………………………………………………………… 50
3.3 Method ……………………………………………………………………………………………………… 53
3.4 Results ……………………………………………………………………………………………………… 63
3.5 Discussion ………………………………………………………………………………………………… 71
4. Assessing individual change using short tests and questionnaires …………………………….. 75
4.1 Introduction ……………………………………………………………………………………………… 77
4.2 Theory ……………………………………………………………………………………………………… 79
4.3 Method ……………………………………………………………………………………………………… 82
4.4 Results ……………………………………………………………………………………………………… 87
4.5 Discussion ………………………………………………………………………………………………… 94
5. Shortening the S-STAI: Consequences for research and individual decision-making … 99
5.1 Introduction …………………………………………………………………………………………… 101
5.2 Background: Pitfalls of shortening the S-STAI ………………………………………… 103
5.3 Method …………………………………………………………………………………………………… 108
5.4 Results …………………………………………………………………………………………………… 112
5.5 Discussion ……………………………………………………………………………………………… 117
Appendix: Explanation of strategies used to shorten the S-STAI ……………………… 119
6. Conclusion and discussion …………………………………………………………………………………….. 121
6.1 Conclusion ……………………………………………………………………………………………… 123
6.2 Discussion ……………………………………………………………………………………………… 128
References ……………………………………………………………………………………………………………….. 131
Summary …………………………………………………………………………………………………………………. 143
Samenvatting (Summary in Dutch) …………………………………………………………………………… 149
Woord van dank (Acknowledgments in Dutch) …………………………………………………………. 157
Chapter 1: Introduction
1.1 Test Length and Individual Decision-Making
Psychological tests and questionnaires play an important role in individual
decision-making in areas such as personnel selection, clinical assessment, and educational
testing. To make informed decisions about individuals, psychologists are interested in
constructs like motivation, anxiety, and reading level, which have been shown to be valid
predictors of criteria such as job success, suitability for therapy, and mastery of reading
skills. These unobservable constructs are measured by a collection of items that together constitute a test. Adding the scores on the items yields a respondent’s total score or test score,
which reflects the respondent’s level on the construct of interest. Total scores are used to
decide, for example, which applicant to hire for a job, whether a patient benefited from a
treatment, or whether a particular student needs additional reading help.
Before using a test for individual decision-making, test users need to be reasonably
certain that decisions for individual respondents do not depend on one
particular test administration (e.g., Emons, Sijtsma, & Meijer, 2007, p. 133; Hambleton &
Slater, 1997). When total scores vary considerably across different (hypothetical) test
administrations due to random influences like mood and disturbing noises during the test
administration, the risk of incorrect individual decisions may be substantial. As a result,
test users may reject a suitable applicant, continue an unsuccessful treatment, or deny
additional help to a student with a low reading level. Incorrect decisions may have
important negative consequences such as a decline of the well-being of individual
respondents and the waste of organizational resources.
In this PhD thesis, the focus is on the influence of random measurement error or
total-score unreliability on test performance in relation to individual decision-making.
Special attention is given to test length in relation to reliability, which is a group
characteristic, and measurement precision, which pertains to measurement of individuals.
Throughout, we concentrate on reliability issues in decision-making about individuals, and
for the sake of simplicity assume that tests are valid. Validity is a highly important topic
that cannot be addressed in passing and justifies a PhD study on its own.
Generally, tests consisting of many items, say, at least 40 items, are more reliable
than tests consisting of only a few items, say, 15 or fewer items. Specifically, psychometric
theory—the theory of psychological measurement—shows that the more items respondents
answer, the smaller the relative influence of random errors on total scores (e.g., Allen &
Yen, 1979, pp. 85-88; Nunnally & Bernstein, 1994, pp. 230-233). However, extending test
length with the purpose of minimizing the relative influence of random errors encounters
numerous practical objections. For example, filling out long tests may result in high
administration costs. In other applications, test users do not want to trouble respondents
with many questions, for example, when critically-ill patients need to be assessed. To
summarize, test users and test constructors often do not appreciate long tests and
questionnaires and rather prefer tests that are as short as possible, often within the limits of
particular psychometric constraints.
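The general relation between test length and reliability can be illustrated with the Spearman-Brown prophecy formula. The following sketch is not part of the analyses in this thesis; it assumes parallel items, which real shortened tests satisfy only approximately, and the reliability and test-length values are made up for illustration.

```python
def spearman_brown(reliability: float, k: float) -> float:
    """Predicted reliability when test length is multiplied by factor k,
    assuming all items are parallel."""
    return k * reliability / (1 + (k - 1) * reliability)

# Shortening a 40-item test with reliability .90 to 10 items (k = 10/40):
print(round(spearman_brown(0.90, 10 / 40), 2))  # about .69
```

The example shows the asymmetry at work: quartering the test length does not quarter the reliability, but the loss is large enough to matter for individual decision-making.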
Hence, short tests—including shortened versions of previously developed longer
tests—abound in practice. In personnel psychology, for example, researchers
developed a 12-item form of the Raven Advanced Progressive Matrices test (originally 36
items, Arthur & Day, 1994) and a scale measuring focus of attention by means of 10 items
(Gardner, Dunham, Cummings, & Pierce, 1989). In clinical psychology, examples include
a 13-item version of the Beck Depression Inventory (originally 21 items, Beck & Beck,
1972) and a 3-item version of the Brief Pain Inventory (originally 11 items, Krebs et al.,
2009). In personality psychology, we can find, for example, a 10-item and a 5-item
questionnaire allegedly measuring the complete Big Five construct (Gosling, Rentfrow, &
Swann, 2003). For the purpose of comparison, the NEO PI-R contains 240 items to
measure the Big Five (Costa & McCrae, 1992).
One way or another, psychologists need to deal with the consequences that using
short tests has for the risk of making incorrect decisions about individuals. The goal of this
thesis is to assess whether, based on psychometric considerations, short tests may be used
for making decisions about individuals. Total scores can be reliable, but if a test designed
to measure reading level also measures another construct such as anxiety, scores will be
interpreted incorrectly. In this thesis, we focus on the relationship between test length and
reliability, because reliability is a necessary (although not a sufficient) condition for tests to
be valid (Nunnally & Bernstein, 1994, p. 214); that is, poor test performance that is mostly
due to random measurement error does not reflect the influence of the construct of interest,
and a reliable test may or may not measure the intended construct. Validity should be
studied on its own but for practical reasons is simply assumed here.
We answer the following research questions:
1. To what extent do psychologists pay attention to the consequences of using short
tests for making decisions about individuals?
2. How should one assess the risk of making incorrect individual decisions?
3. To what extent does test shortening increase the risk of making incorrect individual
decisions?
4. What are minimal test-length requirements for making decisions about individuals
with sufficient certainty?
1.2 Preliminaries: Test Length and Measurement Precision
Often reliability is assessed by coefficient alpha or the test-retest correlation
(Nunnally & Bernstein, 1994, pp. 251-255). A test is deemed suited for individual
decision-making if the total-score reliability exceeds a minimum value that is recognized
as a rule of thumb. However, for psychologists interested in individual decision-making,
measurement precision is more important than reliability (Harvill, 1991; Mellenbergh,
1996; Nunnally & Bernstein, 1994, p. 260; Sijtsma & Emons, 2011).
In this section, we show that the reliability coefficient conveys insufficient
information to assess whether the total score is precise enough to be useful for individual
decision-making. Specifically, we show that total-score reliability as estimated by
coefficient alpha can be acceptable for short tests while, at the same time, the measurement
precision of the test is much lower and may even be unacceptably low.
1.2.1 Theory
We studied the relationship between test length and measurement precision from
the perspective of classical test theory (CTT). CTT assumes that a total score, which is the
sum of the scores on the items in the test, and which is denoted $X_+$, equals the sum of true score $T$ and random measurement error $E$: $X_+ = T + E$. The statistical model of CTT assumes that the same test is administered an infinite number of times to a particular
respondent and that these administrations are independent so that different administrations
can be considered to be replications. Due to random processes reflected by the error
component, replications produce a distribution of total scores, also known as the propensity
distribution (Lord & Novick, 1968, pp. 29-30). The mean of the propensity distribution is
defined as the respondent’s true score, and its dispersion is the respondent’s measurement-error variance.

Figure 1.1: Example of propensity distributions for two respondents with different true scores and error variances.

Figure 1.1 shows the hypothetical propensity distributions of two respondents, which differ with respect to true score and error variance. Thus, CTT assumes that different respondents are measured with different precision.
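The notion of a propensity distribution is easy to mimic in a few lines of code. The sketch below is not from the thesis: the true scores, error standard deviations, and the normality of the errors are illustrative assumptions. It simulates many hypothetical replications for two respondents and recovers each respondent's true score and error dispersion.

```python
import random
import statistics

random.seed(1)

def propensity_sample(true_score: float, error_sd: float, n_reps: int = 50_000):
    """Hypothetical replications X+ = T + E for one respondent,
    with normally distributed random error (an illustrative assumption)."""
    return [true_score + random.gauss(0.0, error_sd) for _ in range(n_reps)]

# Two hypothetical respondents with different true scores and error variances
scores_1 = propensity_sample(true_score=5.0, error_sd=1.0)
scores_2 = propensity_sample(true_score=9.0, error_sd=1.5)

# The mean of each propensity distribution approximates the true score;
# the standard deviation approximates the respondent's error SD.
print(round(statistics.mean(scores_1), 1), round(statistics.stdev(scores_1), 1))
print(round(statistics.mean(scores_2), 1), round(statistics.stdev(scores_2), 1))
```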
In the real world, instead of a propensity distribution only one total score is
available for each respondent. Hence, in practice one uses the sample of total scores from
all individuals to estimate one common error variance, which may be considered the mean
of all the error variances of the unobserved propensity distributions (Lord & Novick, 1968,
p. 35). The square root of this common error variance, known as the standard error of measurement
(SEM), is used to quantify measurement precision for each individual. Let $S_X^2$ denote
the total-score variance in the sample, $S_T^2$ the unobservable true-score variance, and $S_E^2$ the measurement-error variance. Given the definition of random measurement error, it can be shown that $S_X^2 = S_T^2 + S_E^2$. Using this result, in the sample total-score reliability is defined as $r_{XX'} = S_T^2 / S_X^2 = 1 - S_E^2 / S_X^2$. The SEM can be derived to be equal to $S_E = S_X \sqrt{1 - r_{XX'}}$ (Allen & Yen, 1979, p. 89), and $r_{XX'}$ may be substituted by coefficient alpha when the SEM is estimated from the data.
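To make the SEM formula concrete, here is a minimal sketch; the total-score SD and reliability value are illustrative assumptions, not data from this thesis.

```python
import math

def standard_error_of_measurement(total_score_sd: float, reliability: float) -> float:
    """SEM = S_X * sqrt(1 - r_XX'); coefficient alpha may stand in for r_XX'."""
    return total_score_sd * math.sqrt(1.0 - reliability)

# Illustrative values: total-score SD of 10.6 and coefficient alpha of .96
print(round(standard_error_of_measurement(10.6, 0.96), 2))  # 2.12
```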
The SEM is used to estimate confidence intervals (CIs) for the true score $T$ (Allen & Yen, 1979, p. 89). The narrower the CI, the more precise the estimate of $T$. CIs are computed as follows. The observed total score $X_+$ is taken as an estimate of the true score, so that $\hat{T} = X_+$, and the SEM is assumed to be the standard deviation of a normal distribution with mean $T$. When a 95% CI is used, the respondent’s true score $T$ lies in the interval $X_+ \pm 1.96 \cdot \mathrm{SEM}$ in 95% of the hypothetical test replications. However, in certain practical settings such as personnel selection, “few organizations can wait to be 95% sure of success” (Smith & Smith, 2005, p. 126). Apart from the incorrect interpretation of CIs expressed in this quotation, the result is that in these settings a lower confidence level is often chosen (e.g., a 68% CI, meaning $X_+ \pm \mathrm{SEM}$), implying that organizations are willing to take a higher risk of making an incorrect decision for individual respondents.
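The CI computation described above can be sketched as follows; the observed score and SEM below are made-up values used only for illustration.

```python
def true_score_ci(observed: float, sem: float, z: float = 1.96):
    """CI for the true score: observed total score +/- z * SEM,
    under CTT's normal-error assumption."""
    return (observed - z * sem, observed + z * sem)

lo95, hi95 = true_score_ci(25.0, 2.12)           # 95% CI (z = 1.96)
lo68, hi68 = true_score_ci(25.0, 2.12, z=1.0)    # 68% CI: X+ +/- SEM
print(round(lo95, 2), round(hi95, 2))
print(round(lo68, 2), round(hi68, 2))
```

Choosing the 68% level narrows the interval, which is exactly the trade-off the text describes: a tighter-looking interval bought at the price of a higher error risk.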
1.2.2 Method
We did a computational study to illustrate the relation between test length,
reliability, and measurement precision. Let $J$ be the number of items in the test, and let the score on item $j$ be denoted by $X_j$. Items may be dichotomously scored (i.e., 0 for an incorrect answer and 1 for a correct answer) or polytomously scored (e.g., $0, 1, \ldots, m$ for rating-scale items). We define the range of the scale as the difference between the maximum possible total score and the minimum possible total score. For $J$ dichotomous
items, the scale range equals $J$, and for $J$ rating-scale items the scale range equals $J \times m$. For dichotomous items, we studied how the ratio of the CI to the scale range, henceforth denoted the relative CI, relates to test length.
For 1,000 respondents, item scores were simulated using the item response model
known as the Rasch model (Embretson & Reise, 2000, p. 67; Rasch, 1980). The Rasch
model is defined as follows. Instead of a true score, the model uses a latent variable,
denoted $\theta$, as the person variable of interest. Without much loss of generality, we assumed that $\theta$ has a standard normal distribution. Items are characterized by their difficulty, here denoted $\delta_j$, which is expressed on the same scale as the latent person variable $\theta$. The Rasch model expresses the probability of a 1 score on dichotomous item $j$ as a function of the latent variable $\theta$ and the difficulty $\delta_j$ of the item as

$$P(X_j = 1 \mid \theta, \delta_j) = \frac{\exp[a(\theta - \delta_j)]}{1 + \exp[a(\theta - \delta_j)]}. \quad (1.1)$$

Constant $a$ expresses the common discrimination power of the $J$ items in the test. The higher $a$, the higher the probability that a respondent with a low $\theta$ value relative to the item location $\delta_j$ scores 0 and a respondent with a high $\theta$ value relative to $\delta_j$ scores 1. Because an increase of $a$ in all items causes an increase of total-score reliability, in a simulation study $a$ can be used to manipulate the reliability of the total score.

For $J = 40$, we chose item-difficulty values between –1.5 and 1.5 such that distances between adjacent values were equal throughout. Item 1 is the easiest item, implying that out of all 40 items it has the highest probability of a 1 score for each $\theta$ value, and item 40 is the most difficult item, implying the lowest probability for each $\theta$ value. By choosing $a = 2.9$, we found that for $J = 40$ the reliability estimated by coefficient alpha equaled .96. This is high but not unrealistic for a 40-item test. Next, to obtain tests
consisting of 20, 15, 10, and 5 items, we removed items from the 40-item test, such that in
each test the item difficulties of the remaining items were spread at approximately equal
distances between –1.5 and 1.5.
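The simulation design can be sketched in code as follows. This is a reconstruction under the assumptions stated in the text (standard-normal $\theta$, equally spaced difficulties between –1.5 and 1.5, $a = 2.9$, 1,000 respondents), not the authors' actual code, and the random seed is arbitrary.

```python
import math
import random

random.seed(42)

def rasch_prob(theta: float, delta: float, a: float = 2.9) -> float:
    """Probability of a 1 score under Equation 1.1 (logistic form)."""
    return 1.0 / (1.0 + math.exp(-a * (theta - delta)))

def simulate_item_scores(n_persons: int, deltas: list) -> list:
    """Dichotomous item scores for persons with theta ~ N(0, 1)."""
    data = []
    for _ in range(n_persons):
        theta = random.gauss(0.0, 1.0)
        data.append([int(random.random() < rasch_prob(theta, d)) for d in deltas])
    return data

def coefficient_alpha(data: list) -> float:
    """Cronbach's alpha from a persons-by-items score matrix."""
    n, J = len(data), len(data[0])
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)
    item_var_sum = 0.0
    for j in range(J):
        col = [row[j] for row in data]
        m = sum(col) / n
        item_var_sum += sum((x - m) ** 2 for x in col) / (n - 1)
    return J / (J - 1) * (1.0 - item_var_sum / var_total)

J = 40
deltas = [-1.5 + 3.0 * j / (J - 1) for j in range(J)]  # equally spaced difficulties
scores = simulate_item_scores(1000, deltas)
print(round(coefficient_alpha(scores), 2))  # close to .96, as reported in the text
```

Shorter tests can then be obtained, as in the text, by keeping subsets of `deltas` that remain approximately equally spaced and recomputing alpha.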
1.2.3 Results
Table 1.1 shows coefficient alpha as an estimate of total-score reliability, the SEM,
the width of the 68% CI and the 95% CI, and the relative CIs. The table shows that
removing items from the test caused coefficient alpha to decrease from .96 for $J = 40$ to .70 for $J = 5$. The latter alpha value is still acceptable for some practical applications (Kline, 2000, p. 524). The SEM and the width of the CI also decreased as the test grew shorter. For example, for $J = 40$ the SEM was 2.12 and the 95% CI covered a range of 8.27 scale points, whereas for $J = 5$ the SEM was 0.74 and the 95% CI covered only 2.92 scale points.
Table 1.1: Test length and measurement precision for five test lengths.
J    Alpha   SEM    68% CI   Relative CI   95% CI   Relative CI
40   .96     2.12   4.22     .11           8.27     .21
20   .92     1.49   2.98     .15           5.85     .29
15   .90     1.31   2.62     .17           5.13     .34
10   .85     1.06   2.12     .21           4.16     .42
5    .70     0.74   1.49     .30           2.92     .58
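The relative CIs in Table 1.1 can be verified with a few lines of code. This is a sketch: the SEM values fed in are the rounded table entries, so the results match the table only up to rounding.

```python
def relative_ci(sem: float, scale_range: float, z: float = 1.96) -> float:
    """CI width (2 * z * SEM) divided by the scale range,
    which equals J for J dichotomous items."""
    return 2.0 * z * sem / scale_range

print(round(relative_ci(2.12, 40), 2))        # J = 40, 95% level: .21
print(round(relative_ci(0.74, 5), 2))         # J = 5,  95% level: .58
print(round(relative_ci(0.74, 5, z=1.0), 2))  # J = 5,  68% level: .30
```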
Smaller SEMs and narrower CIs suggest greater measurement precision, but this conclusion would be wrong, as shown by the relative CIs, which increased substantially as the scale range decreased. Figure 1.2 shows the relative CI at the midpoint of the scale. As $J$ decreases, a larger part of the scale becomes unreliable. This means that only if respondents differ to a large degree will their differences on the scale be significant. However, the vast