Abstract
Skilled adult readers vary in many skills related to visual word form processing such as phonological processing, vocabulary size, comprehension skill, and spelling skill (Kuperman & Van Dyke, 2011). Spelling skill in particular has received much attention because low- and high-skill spellers show different patterns of lexical processing as measured through eye movement behavior, reaction times, and word learning (Eskenazi et al., 2018; Veldre & Andrews, 2014). Researchers commonly use a spelling dictation task to measure lexical expertise; however, there is limited evidence for its psychometric properties and room for improvement in item selection (Andrews et al., 2020). The purpose of this study was to assess the precision of 110 words as measures of lexical expertise, to compare various subsets of words in a spelling dictation task, and to provide a set of words that more precisely measure lexical expertise. In Study 1, a spelling dictation task with 110 words was administered to 682 participants. In Study 2, that same task and measures of vocabulary and comprehension were administered to 786 participants. Results indicated that the set of 110 words contains many words that are imprecise measures of spelling skill. Through an iterative process of removing words with high error variance, a set of 20 words was selected that minimizes measurement error and demonstrates discriminant validity from vocabulary and comprehension ability. We recommend this set of words as a more precise measure of spelling skill, which will provide more power to detect moderating effects of lexical expertise on reading processes.
Reading involves the complex process of decoding visual symbols to access a lexical representation of a word. Decades of research in visual word form processing have elucidated the detailed process through which a lexical representation is accessed. Much of this research has appropriately taken a “uniformity assumption” to understand how this process works for the average skilled reader (Andrews, 2012, 2015). However, a recent shift in this field of research has been towards understanding how individual differences in lexical abilities influence that process. Although all literate adult readers have developed skills in comprehension, spelling, vocabulary, phonological awareness, and other lexical abilities, there is considerable difference in the degree to which literate adult readers perform in each of these domains (Kuperman & Van Dyke, 2011). Spelling skill in particular has demonstrated largely different patterns of eye-movement behavior and reaction time measures for low-skill and high-skill literate adults.
Differences in participants’ spelling skill, as a marker of orthographic ability and lexical expertise, may explain null effects or contradictory findings in previous research. For example, previous research has shown that low neighborhood words provide facilitatory priming (eble-ABLE) in a masked priming lexical decision task, but high neighborhood words (tand-SAND) do not show this effect (Forster et al., 1987). However, this null effect for high neighborhood primes depends on spelling skill. Andrews and Hersch (2010) found inhibitory priming from high neighborhood primes for high-skill spellers, but facilitatory priming for low-skill spellers. Without measuring spelling skill, the overall averaged sample would have shown no priming effects from high neighborhood primes. Other differential effects of spelling skill indicate that low-skill spellers rely more on context and use top-down processing to identify words, whereas high-skill spellers identify words using bottom-up processing directly from the visual form of the word (Andrews & Bond, 2009; Hersch & Andrews, 2012). Thus, the inclusion of spelling skill as a moderating variable enhances understanding of visual word form processing. It is important to note that these studies often measure comprehension ability or vocabulary ability, which make their own contributions to reading processes.
Eye movement behavior during reading is another area where spelling skill has demonstrated differential effects. Eye movement behavior is often used as an index of moment-to-moment cognitive processing during reading (Rayner, 1998). Earlier research has shown that lexical factors such as vocabulary size and comprehension ability influence eye movement behavior, such that high-skill readers, based on measures of vocabulary and comprehension, show more efficient eye movements (Ashby et al., 2005). When taking spelling skill into account, research has shown that high-skill spellers are not only more efficient at processing the currently fixated word, but are also able to extract more information from upcoming words in the parafovea (Veldre & Andrews, 2015). Further, the perceptual span, or amount of information that can be extracted on each fixation, of high-skill spellers is larger than that of low-skill spellers (Veldre & Andrews, 2014). Spelling skill also influences the process of learning new words, such that high-skill spellers are better at learning the meaning of new words when reading them in context (Eskenazi et al., 2018).
Taken together, these effects clearly indicate that spelling skill is associated with readers’ eye movement behavior, lexical processing, and lexical acquisition. These effects have been taken as evidence to support the lexical quality hypothesis (LQH; Perfetti, 2007). The LQH explains individual differences in reading skill and comprehension through variability in the quality of lexical representations. A high-quality lexical representation includes fully specified and interconnected knowledge about a word’s orthography, phonology, and semantics. These higher-quality representations result in more efficient activation of word forms, which aids in higher-order processing such as comprehension and text integration. Orthographic precision is of particular importance because these lexical representations are accessed through the written forms of words. In other words, readers with greater knowledge about the written forms of words (orthographic precision) will have greater access to lexical representation of word forms. Thus, individual differences in spelling ability serve as a good indicator of the ability to access lexical representations.
Spelling ability is often measured through a spelling dictation task and spelling recognition task. However, there is one important shortcoming of research using these tasks: it remains unclear how well individual words in these tasks serve as indicators of spelling ability. In the spelling recognition task, readers identify misspelled words from a set of words that contain some common spelling errors. In the spelling dictation task, participants recall spellings of words after hearing them spoken. The list of words in the spelling dictation task was originally created by selecting 20 words out of 110 possible words provided by Burt and Tate (2002) with a broad range of difficulties (Andrews et al., 2020). These two spelling tasks are commonly used in many publications from various researchers investigating individual differences in lexical processing (Andrews & Bond, 2009; Andrews & Veldre, 2021; Beyersmann et al., 2015; Drieghe et al., 2019; Eskenazi et al., 2018; Parker & Slattery, 2021; Rahmanian & Kuperman, 2019; Slattery & Yates, 2018; Tan & Yap, 2016). In these studies, researchers attempt to successfully differentiate low-skill and high-skill spellers, and the average or total number of correctly spelled items is generally used as an estimate of an individual’s spelling ability. By extension, the variance in spelling scores is used as an estimate of the magnitude of individual differences in spelling ability observed in a sample. Use of this variance estimate for statistical hypothesis testing, however, can be problematic when the assumption of error-free measurement has been violated, and the magnitude of error can have profound consequences on the outcomes of experimental trials.
Given the practice of using a spelling dictation task to measure individual differences in spelling ability, it is important to determine how well these words perform individually and collectively as measures of spelling ability and the degree to which distinct subsets of words may vary in measurement precision. Some evidence already exists to address this question. Recently, Andrews et al. (2020) investigated the set of 20 words that are regularly used in a spelling dictation task and reported good internal consistency and unidimensionality; however, the authors note that this set of words can be refined to improve measurement precision.
Thus, the purpose of the first study was to identify a more precise set of words for a spelling dictation task by starting with all 110 words from Burt and Tate’s word bank. There were four specific goals of the first study: (1) assess precision observed in a spelling dictation task with all 110 words, (2) identify potential differences among individual words that comprise the task in terms of measurement precision and error, (3) compare precision and error estimates of scores from spelling tests comprised of the best-performing words (i.e., with maximal precision) to scores from tests used in prior research and scores from tests with words randomly selected from the word bank, and (4) assess the potential impact of item choice on categorizations of spelling ability (i.e., low- and high-skill spellers). The purpose of the second study was to (1) validate this set of items in a new sample, (2) provide external validity in a proxy sample for the general population, and (3) assess discriminant validity with other measures of lexical ability including vocabulary and comprehension. If measurement is sufficiently improved through intentional selection of the best-performing items, the final result will be a precise set of words that improve statistical power to detect the moderating effects of orthographic processing skill on reading processes.
Classical test theory (CTT) provides a coherent framework to quantify precision and to determine whether the magnitude of measurement error is of practical consequence. A foundational premise of CTT is that variance in scores observed on tests are a composite of two distinct forms of variance: true score variance (in this case, variance reflecting real or genuine differences among participants in their spelling ability) and error variance (in this case, variance attributable to individual words as imperfect measures of spelling ability). This is observed in the following fundamental equation.

$${\sigma}_{o}^{2}={\sigma}_t^2+{\sigma}_e^2$$
where \({\sigma}_{o}^{2}\) represents observed test score variance, \({\sigma}_t^2\) represents true score variance, and \({\sigma}_e^2\) represents error variance. The proportion of variance in observed scores that is attributable to true score variance represents an estimate of measurement precision (i.e., reliability), as follows.

$$\rho =\frac{{\sigma}_t^2}{{\sigma}_{o}^{2}}$$
Likewise, the proportion of observed score variance that is attributable to error variance represents an estimate of measurement error, and because observed score variance is a composite of true score and error variance, measurement error can also be expressed as 1-ρ. While there are a variety of tests and metrics of measurement precision, we employed factor analytic models because factor analysis facilitates more exacting tests of item reliability (i.e., precision) and of scores derived from differing subsets of words that comprise the spelling task. As such, this approach is well aligned with the specific goals of this study.
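Although the analyses in this article were conducted in R, the CTT decomposition above can be illustrated in a few lines of Python; the variance values below are hypothetical, chosen only to show how precision and error are computed.

```python
# Illustrative CTT variance decomposition; the variance values are
# hypothetical and chosen only to demonstrate the computation.
true_var = 4.0    # sigma_t^2: genuine differences in spelling ability
error_var = 2.0   # sigma_e^2: noise from imperfect items
observed_var = true_var + error_var  # sigma_o^2

reliability = true_var / observed_var   # rho: precision
measurement_error = 1.0 - reliability   # equivalently error_var / observed_var

print(round(reliability, 3))        # 0.667
print(round(measurement_error, 3))  # 0.333
```

Note that precision and error necessarily sum to one, which is why reducing item-level error variance directly increases reliability.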
Study 1
Method
Participants
A total of 682 participants were recruited from Kent State University (n = 478), a large public state university, and Stetson University (n = 204), a small private liberal arts university. Prior investigations indicate this sample size far exceeds the minimum required to achieve sufficient statistical power in confirmatory factor analytic models (Wolf et al., 2013). All participants reported English as their native language and had no reported reading disabilities. Nine participants were excluded from the sample for having more than 10% missing data. The sample contained 164 male participants (24%) and 509 female participants (76%). The average age of the sample was 20.16 (SD = 2.47).
Materials
Previous research on the organization of the orthographic lexicon provided 110 words with a wide range of difficulty from 13% to 100% average accuracy (Burt & Tate, 2002). Thus, these items provided a broad basis from which to select a subset of items to measure lexical expertise, as the set contains items of low, moderate, and high difficulty. Further, this is the same set of words from which 20 words were selected in the previously mentioned research. The 110 items have an average word length of 8.9 letters (SD = .56). The average word frequency is 1.77 counts per million (SD = 1.89), as determined using the CELEX corpus (Baayen et al., 1995). The full list of items with their average accuracies is included in the supplemental materials.
Procedure
All procedures were first approved by the Kent State University and Stetson University Institutional Review Boards. All materials were presented on a computer using the survey website Qualtrics. The survey JavaScript was edited to prevent participants from using a web browser-enabled spellchecker. Recordings were created for each of the 110 words such that each word was spoken clearly three times. Two words were determined to have multiple possible pronunciations, and thus were spoken four times – twice with each pronunciation. The word affluent was spoken with the emphasis on the first syllable or the second syllable, and the word omniscient was spoken with the /s/ sound or with the /š/ sound. Participants were instructed to listen to each word spoken three (or four) times and to spell that word by typing it into the space provided. The 110 words were presented in random order. After completing the spelling portion of the study, participants answered several demographic questions. All participants received course credit for their participation.
Results
Analytic approach
All analyses were conducted using R statistical software version 4.1.0 (R Core Team, 2021) and the psych (Revelle, 2021) and lavaan (Rosseel, 2012) packages. We employed an iterative process of model refinement by fitting a series of unidimensional confirmatory factor analytic (CFA) models starting with Burt and Tate’s 109 item word bank. The word occident was excluded from analyses, as it is a heterographic homophone with the word oxidant. The ten items with the most measurement error were then removed at each iteration, and a new CFA model was fit until the final ten items with maximal precision and minimal error were identified. Because the data were binary (i.e., each word spelled correctly or incorrectly), we employed tetrachoric correlation matrices (a special case of polychoric correlations for binary variables) and a diagonally weighted least squares estimator (WLSMV), which have been shown to be more effective when modeling binary (or polytomous) item response structures (Holgado-Tello et al., 2010). We calculated robust χ2, CFI, TLI, and SRMR as measures of absolute and relative fit, but we included alternate performance indicators (outlined below), given that CFI and TLI have been shown to be overly optimistic measures of fit when modeling ordinal data (i.e., tetrachoric/polychoric correlation matrices) with diagonal least squares estimators (WLSMV, ULSMV; Xia & Yang, 2019). We also reported SRMR, given its demonstrated superiority over RMSEA in terms of power and type I error rates (Shi et al., 2020), especially when the number of indicators is large (Maydeu-Olivares et al., 2018). We then address each of the four study aims through a comparison of psychometric performance indicators from each test of spelling ability, as follows.
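The iterative pruning logic can be sketched language-agnostically (the study itself used lavaan in R). In the following Python sketch, the factor loadings are hypothetical placeholders rather than fitted CFA estimates; under a unidimensional model with standardized loadings, an item's error variance is 1 − loading², so the items with the smallest absolute loadings carry the most error.

```python
# Simplified sketch of iterative item removal; loadings are hypothetical.
def prune_items(loadings, drop_per_step=10, stop_at=10):
    """Iteratively drop the items with the most error variance."""
    items = dict(loadings)
    while len(items) > stop_at:
        n_drop = min(drop_per_step, len(items) - stop_at)
        # Sort ascending by |loading|: lowest loadings = highest error.
        worst = sorted(items, key=lambda w: abs(items[w]))[:n_drop]
        for w in worst:
            del items[w]
        # In the real analysis, a new CFA would be refit here before the
        # next pruning step; this sketch reuses the fixed loadings.
    return items

# 109 hypothetical items with loadings increasing from .25 upward.
loadings = {f"word{i}": 0.25 + 0.005 * i for i in range(109)}
best = prune_items(loadings)
print(len(best))  # 10 items remain
```

The key difference from this sketch is that the actual procedure refit the model after every removal step, so the loadings (and therefore the error ranking) were re-estimated at each iteration.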
Item selection
The first aim of the study was to compare estimates of precision observed in spelling dictation tasks beginning with the full set of 109 items and to progressively remove items with the most measurement error until the items with the most precision remain. As outlined in Table 1, all indicators of model fit (CFI and TLI) and misfit (robust χ2 and SRMR) improved as poorly performing items were removed from each successive unidimensional CFA model. We also estimated R2 as a conservative measure of precision by dividing the sum of squared factor loadings by the total number of items in a test (i.e., proportion of total variability among all items explained by “true score” variability in the latent factor of spelling ability). Precision improved consistently as poorly performing items were dropped from the 109 item model (R2 = .346) through to the ten-item model (R2 = .570). We also noted that the 20 item model was the first to achieve more true score variability than error (i.e., R2 > .500). Lastly, McDonald’s Ωhierarchical was calculated as an upper bound estimate of scale precision (i.e., internal consistency reliability), which has been shown to outperform more popular but less robust reliability estimators, such as Cronbach’s α (Trizano-Hermosilla & Alvarado, 2016; Zinbarg et al., 2005). When moving from the 109 item model to the ten-item model, Ωhierarchical likewise improved from .75 to .85. In sum, precision in the measurement of orthographic processing improved in each successive test as poorly performing items were dropped from the model, and this trend was observed across every metric evaluated.
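The R2 metric described above (sum of squared standardized loadings divided by the number of items) is straightforward to reproduce. The Python sketch below uses a hypothetical uniform loading of .755, mirroring the ten-item model's average loading, rather than the actual fitted values from Table 2.

```python
# R^2 precision metric as described above: the sum of squared
# standardized factor loadings divided by the number of items, i.e.,
# the proportion of total item variance explained by the latent
# spelling factor. The loadings below are hypothetical placeholders.
def precision_r2(loadings):
    return sum(l ** 2 for l in loadings) / len(loadings)

ten_item = [0.755] * 10   # mirrors the ten-item model's average loading
print(round(precision_r2(ten_item), 3))  # 0.57
```

With equal loadings this reduces to the squared loading itself (.755² ≈ .570), which matches the reported R2 for the ten-item model.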
Item precision
The second aim of the study was to identify potential differences in the precision of individual words that comprise the spelling task (i.e., test the tau-equivalence assumption in the classical test theory framework). The factor loadings presented in Table 2 represent the correlation between the factor score (i.e., true score of spelling ability) and the outcome of the spelling trial for each word. Words with higher factor loadings are more strongly correlated with the latent construct (in this case, spelling ability), and accordingly serve as better indicators of spelling ability. The average factor loadings for each model increased from .580 in the 109 item model to .755 in the ten-item model, indicating that the words in the ten-item model function as more precise measures of spelling ability. Further, the absolute range of factor loadings observed in the 109 item model was much wider (.251–.774) than that observed in the ten-item model (.722–.794), as was the standard deviation (.093 in the 109 item model vs .026 in the ten-item model). The increase in factor loadings coupled with the threefold reduction in the variability of factor loadings in the ten-item model indicates that precision and consistency in precision are both improved when poorly performing items are removed from the model.
Comparison of various item sets
The third aim of the study was to compare precision and error estimates of scores from tests comprised of the 20 best performing words (i.e., with maximal precision) to four other sets of words. We chose lists with 20 words because the Best 20 model was the first to demonstrate more true score variability than error. The first set of 20 words were randomly selected from Burt and Tate’s word bank (Random 20). The second set of words were selected to be the 20 items from the middle range of difficulty (Median 20). The third set of words were those used in previous research (Eskenazi et al., 2018; Eskenazi & Folk, 2015) and were semi-randomly selected to include a wide range of difficulty (Prior 20). Finally, the fourth list of words were those recently subjected to psychometric testing by Andrews et al. (2020). This list is similar to Prior 20 in that it was sampled from Burt and Tate and was designed to include words with a wide range of difficulties. One word on this list (persuade) was not sampled from Burt and Tate, and thus we can only report on 19 words from this list (Andrews 19). Each of these four comparison lists represents a possible approach that other researchers might take when designing their spelling dictation tasks. The model with the 20 best performing items evidenced greater precision by both reliability metrics: R2 for Best 20 (.515), Random 20 (.342), Median 20 (.325), Prior 20 (.343), Andrews 19 (.330); Ωhierarchical for Best 20 (.85), Random 20 (.70), Median 20 (.72), Prior 20 (.73), and Andrews 19 (.75). This finding suggests that accuracy in tests of spelling ability can be improved if investigators use the best 20 performing words rather than 20 randomly selected words, 20 words selected from the median difficulty range, or, importantly, 20 words used in previous research. The full list of the Best 20 words with their factor loadings is presented in Table 2.
Impact of precision and error on classification
The fourth aim of the study was to assess the potential impact of item choice on categorizations of spelling ability (i.e., low- and high-skill spellers). Spelling dictation tasks are often used as a method to categorize participants into high-skill or low-skill spelling groups. Thus, we assessed the degree to which participants would change groups depending on the item set used (Best 20, Random 20, Prior 20). We first categorized participants into groups with a median split using the Best 20 items and then determined the number of participants that would change groups if using items from the other two sets. When using the Random 20 set, 14% of participants would have changed groups with 7% moving from high-skill to low-skill and 7% moving from low-skill to high-skill. When using the Prior 20 set, 20% of participants would have changed groups with 11% moving from high-skill to low-skill and 9% moving from low-skill to high-skill. Thus, categorization of participants also varies widely depending on the item set selected.
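The reclassification check can be sketched as follows. The scores here are simulated (seeded for reproducibility) and are not the study data, so the printed percentage will not match the reported 14% and 20% figures; the sketch only shows the logic of comparing median splits across item sets.

```python
# Sketch of the classification comparison: median-split participants on
# two item sets and count how many change groups. Scores are simulated.
import random

def median_split(scores):
    """Label each participant high/low relative to the sample median."""
    ordered = sorted(scores.values())
    median = ordered[len(ordered) // 2]
    return {p: ("high" if v >= median else "low") for p, v in scores.items()}

random.seed(1)
participants = range(100)
best20 = {p: random.randint(0, 20) for p in participants}
# A noisier second test, correlated with the first.
random20 = {p: max(0, min(20, best20[p] + random.randint(-4, 4)))
            for p in participants}

groups_a = median_split(best20)
groups_b = median_split(random20)
changed = sum(groups_a[p] != groups_b[p] for p in participants)
print(f"{changed}% of participants changed groups")
```

Because 100 participants are simulated, the raw count of group changes equals the percentage directly.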
Discussion
Results of the first study identified a list of 20 words with greater precision and less error variance than other potential sets of words that could be used to measure spelling ability. Before this set of words can be recommended for use in practice, they must be validated in another sample and evidence must be provided for discriminant validity. Thus, the purpose of the second study was to test the fit of these 20 words in another sample of college students, to provide evidence of generalizability in an MTurk sample that approximates characteristics of the general population, and to test discriminant validity with measures of vocabulary ability and comprehension ability.
Study 2
Method
Participants
The second study included 786 participants, which included a sample of college students from Stetson University (n = 372) and a sample from Amazon Mechanical Turk (MTurk; n = 414). MTurk is a crowdsourcing tool for collecting data from a large representative sample. The college sample served as a validation check from the first study and the MTurk sample served as a generalizability check from the first study. The average age of the college student sample was 20.85 (SD = 4.75) and was mostly female (73%). The average age of the MTurk sample was 41.16 (SD = 13.85) and was also mostly female (65%). All participants reported English as their native language and had no reported reading disabilities.
Materials
Spelling measures
The same set of words from the first study was used to validate the spelling measure. A spelling recognition task was also included to determine the degree to which this spelling measure is associated with other commonly used measures of spelling ability. In the spelling recognition task, participants saw 50 words and selected incorrectly spelled words. Half of the words contained phonologically plausible errors created by either removing or adding one letter (e.g., rasberry, reccommend).
Vocabulary measure
Vocabulary skill was measured using the vocabulary subtest of the Wechsler Adult Intelligence Scale IV (WAIS-IV; Wechsler, 2008). This measure is appropriate for adults aged 16 through 90 and provides age-normed percentile rank scores. This measure included 30 words, and participants were instructed to provide a brief definition for each word. An example word and definition were provided for participants before they began. The average age-normed score was 78th percentile (SD = 19) with a range from 1st percentile to 100th percentile.
Comprehension measure
Comprehension was measured using the sentence comprehension subtest of the Wide Range Achievement Test 4 (WRAT 4; Wilkinson & Robertson, 2006). This measure is appropriate for children and adults from ages 5 through 94 and provides age-normed percentile ranks of comprehension ability. In this task, participants read 50 sentences and were instructed to enter the most appropriate word or short two-word phrase that best completed the sentence. Before beginning, participants were provided with a sample sentence with several possible correct answers as examples. The average age-normed score was 60th percentile (SD = 25) with a range from 1st percentile to 98th percentile.
Procedure
All study procedures were approved by the Stetson University Institutional Review Board. After providing informed consent, participants completed each of the measures described above. The only difference in procedure was that the college students were compensated with course credit, and the MTurk sample was compensated with $1.00 after completing the study. To ensure data integrity, several attention checks and non-human response checks were used because MTurk samples regularly include non-human or non-attentive responses (Chmielewski & Kucker, 2020). First, participants listened to an audio file of the word “apple” and were instructed to type the word. Any participant who failed to enter this word correctly was blocked from accessing the rest of the study. Second, a CAPTCHA was included as a non-human response filter. Third, three attention check questions were included randomly throughout the study, which instructed participants to type a specific phrase to ensure that they were paying attention. Finally, participants were instructed to write three complete English sentences in response to a prompt. Any participant who failed the initial audio check or the CAPTCHA, failed at least one of the three attention checks, or provided gibberish or broken-English responses to the final question was not included in data analysis.
Results
External validation
To test the external validity of findings from Study 1, we assessed the psychometric performance of the Best 20 and Best 10 items in the new college student sample and the MTurk sample. As presented in Table 3, similar indicators of model fit (CFI and TLI) and misfit (χ2 and SRMR) were observed. Moreover, R2 and Ωh were also in line with findings from Study 1, as were the mean factor loadings in both samples. The standard deviations of the factor loadings, however, were larger in both samples, with the largest difference observed with the MTurk sample. These findings provide additional support for the external validity of the results from Study 1.
Discriminant validity
Correlation analyses were conducted to determine whether the Best 20 and Best 10 measures were sufficiently unrelated to measures of other lexical abilities including comprehension skill and vocabulary skill. Previous research has shown that spelling skill and comprehension skill are moderately or strongly related (Landi, 2010: r = .23; Burt & Fury, 2000: r = .26; Andrews & Veldre, 2021: r = .31; Yates & Slattery, 2019: r = .54). In the current study, the Best 20 (r = .190) and Best 10 (r = .153) spelling measures were weakly related to comprehension skill, which provides strong evidence of discriminant validity (see Table 4). Similarly, previous research has shown moderate to strong correlations between spelling skill and vocabulary skill (Landi, 2010: r = .30; Burt & Fury, 2000: r = .47; Andrews & Veldre, 2021: r = .43; Yates & Slattery, 2019: r = .58). The current study showed moderate correlations between Best 20 (r = .355) and Best 10 (r = .318) with vocabulary skill, which again demonstrates discriminant validity.
We also tested correlations with a secondary spelling measure: spelling recognition skill. Previous research has shown that spelling dictation and spelling recognition are strongly related (Yates & Slattery, 2019: r = .73, Eskenazi et al., 2021: r = .65; Andrews & Veldre, 2021: r = .75). In the current study, the Best 20 (r = .601) and Best 10 (r = .539) lists were also strongly related to spelling recognition skill. In sum, the Best 20 and Best 10 lists are in line with expected correlations with other measures of lexical ability.
General discussion
Previous research has demonstrated the importance of measuring spelling skill in studies of lexical processing, and our aim was to improve precision when measuring spelling skill. To that end, we assessed the precision of individual words, compared precision of different sets of words, and compared classifications of high- and low-skill spellers using different sets of words, with the following findings.
First, we observed substantial measurement error in the full list of 109 words, and estimates of error varied considerably among words. Second, we were able to identify a set of 20 words with the least amount of error in the measurement of spelling skill that performed better than multiple other sets of words including those used in previous research. Third, there was substantial variability in the categorization of participants into high-skill and low-skill spelling groups depending on the set of words used to measure spelling skill. Fourth, the precision of this set of words replicates in another sample of college students and performs similarly in an Internet sample serving as a proxy for the general population. Finally, scores derived from this set of words are discriminant from vocabulary and comprehension abilities. Taken together, these findings indicate that researchers should carefully consider the choice of words used to assess spelling ability, namely item-level reliability (i.e., precision), as the power of statistical tests is directly affected by the amount of error variance contaminating the total variance of observed scores.
Moreover, the final set of 20 words with maximized reliability was the only set to result in more true-score variability than error. Indeed, one of the most important findings of this study is that randomly selecting a set of words or using a set of words with different ranges of difficulty does not ensure precise measurement. Although the set of words proposed by Andrews et al. (2020) has provisional reliability evidence supporting its use (α = .81 reported in their data, Ωhierarchical = .75 reported here), their list evidences a markedly lower level of precision, similar to that of a randomly selected list of words. As such, the intentional selection of the Best 20 words through iterative model refinement is a demonstrably better process for creating a more precise measure of spelling ability.
An additional benefit of this set of words is that it can be used to measure spelling ability in multiple populations. As with many areas of psychology, most research on visual word form processing, eye movement behavior, and reading processes is conducted using college student samples. Therefore, the original intent of this research was to create a list of words that works well within college students. However, the results of Study 2 demonstrate that this set of words works similarly in a broader sample with a wide range of educational attainment and ages. Thus, as researchers work with samples from the general public, the list of words can provide similar levels of precision as within college student samples.
The fact that the final set of 20 words in this study provided greater measurement precision than other possible sets of words has implications for previous research using other sets of words. Some studies employing tasks that measure spelling ability have evidently had sufficient statistical power to detect the moderating effect of spelling ability on various reading processes. However, because of publication bias, it is not clear how many studies failed to detect these effects, and Type II error rates have likely been inflated, given the inverse relationship between measurement error and statistical power. Moreover, given that observed correlations are attenuated in proportion to the magnitude of measurement error (Osborne, 2002), it is likely that previous studies employing less precise words have underestimated the true effects of spelling ability.
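The attenuation relationship invoked here (Osborne, 2002) can be stated directly: under classical test theory, the expected observed correlation equals the true correlation scaled by the square root of the product of the two measures' reliabilities. A small sketch with hypothetical values (the reliabilities below are illustrative, not estimates from this study):

```python
import math

def attenuated_r(true_r, reliability_x, reliability_y):
    # Classical test theory: r_observed = r_true * sqrt(rel_x * rel_y)
    return true_r * math.sqrt(reliability_x * reliability_y)

# Hypothetical: a true correlation of .50 between spelling skill and a
# reading measure of reliability .90, assessed with a spelling test of
# reliability .75 versus .90
low_precision = attenuated_r(0.50, 0.75, 0.90)   # ~ .41
high_precision = attenuated_r(0.50, 0.90, 0.90)  # .45
```

The gap between the two observed values is pure measurement artifact, which is why a less precise word list systematically understates the effect of spelling ability.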
One might wonder whether optimally precise measurement can be achieved with fewer than the 20 (or even the 10) best words we identified. While it is tempting to reduce the number of items as far as possible, doing so has foreseeable consequences. Given that a primary goal of measurement is to assess individual differences, it is important to note that test score variance decreases as the number of items decreases, per the variance sum law. As an extreme example, a test consisting of only the single most discriminating word would yield just two possible scores (correct or incorrect), and a test with only two words would allow just three possible scores. Therefore, researchers opting for too few words in a spelling task may inadvertently limit the range of detectable differences in their sample. We report and recommend the Best 20 word list because the Best 20 yields larger variance in spelling scores, and because it was the largest item set we tested in which more than half of the total item variability was explained by latent spelling ability (i.e., R2 > .50).
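The shrinking score range described above is easy to verify with a sketch. For simplicity this assumes independent binary items with a common pass rate (real spelling items covary through the latent ability, which adds covariance terms to the variance sum, but the counting argument is unaffected):

```python
def score_properties(k, p=0.5):
    """Distinct possible scores and total-score variance for a k-item
    test of binary (correct/incorrect) items, each with pass rate p.

    Under the variance sum law with independent items,
    Var(total) = k * p * (1 - p).
    """
    n_possible_scores = k + 1          # scores 0, 1, ..., k
    variance = k * p * (1 - p)
    return n_possible_scores, variance

# Fewer items -> fewer distinct scores and less score variance
for k in (20, 10, 2, 1):
    print(k, score_properties(k))
```

A 1-item test yields 2 possible scores with variance .25, whereas a 20-item test yields 21 possible scores with variance 5.0, illustrating why over-shortening the task compresses the range of detectable individual differences.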
The conclusions of this work come with several important limitations. First, although the Best 20 list provides a clear improvement over other potential lists, there is still room to improve measurement precision. The initial list of 110 words represents only a sample from a population of thousands of potential words that could be used in a spelling dictation task, and a larger initial list may have contained words that measure spelling skill more precisely. Nevertheless, our method provides researchers with an intentional, quantitative process for selecting items that is preferable to subjective selection. Second, our analysis improved only one of at least two potential means of measuring spelling ability. A separate spelling recognition task with 88 items has been developed with good reported internal consistency; however, there is room for improvement in its item selection as well (Andrews et al., 2020). Whether one, two, or more measures should be used to assess spelling ability is an open question. Andrews et al. note that there is “little evidence that dictation and recognition tests tap independent input and output representations” (p. 2268); however, other researchers suggest that dictation tasks may be more closely associated with a spelling lexicon whereas recognition tasks may be more closely associated with a reading lexicon (Yates & Slattery, 2019). Whether the two tests make independent contributions warrants further research. Finally, the scoring method used in this task was a binary correct-or-incorrect categorization. Incorrect spellings may represent a wide range of possible errors, such as phonologically plausible, phonologically implausible, phoneme addition, phoneme omission, transposition, or other errors (Holmes & Carruthers, 1998).
It is possible that categorizing each spelling error could provide more information about a participant’s true spelling ability; however, it is still an open question whether a more complex scoring method would provide any additional measurement precision. In this study, we chose to maintain the binary scoring method to maximize generalizability to the extant literature on spelling ability. Future research should investigate whether alternative scoring methods provide additional measurement precision beyond what can be achieved through item refinement.
The past decade has seen increased interest in the construct of lexical expertise as a moderating variable in many reading, eye movement, and visual word form processing effects. Given the current replicability crisis in psychology, precision in measurement is paramount to ensure that observed effects are real and replicable (Elson et al., 2014). The current study directly illustrates the impact of word choice on the error variance in test scores and provides a precise subset of 20 words with optimized precision. It is our hope that these findings will inform future investigations of lexical expertise and that through improved measurement, researchers will have more power to detect the moderating effects of spelling skill on reading processes.
Notes
Three words on this list were morphologically different from what we measured here as sampled from Burt and Tate (aggravate to aggravation, acquaint to acquaintance, conciliate to conciliatory). The latter was used by Andrews et al. (2020). The overall pattern does not change when these words are excluded from analyses.
There was a small degree of overlap between these list sets. Overlap ranged from two to five words, with an average of 3.33 overlapping words (SD = 1.21).
References
Andrews, S. (2012). Individual differences in skilled visual word recognition and reading: The role of lexical quality. In James Adelman (Eds.), Visual Word Recognition Volume 2: Meaning and Context, Individuals and Development, (pp. 151–172). Sussex, UK: Psychology Press.
Andrews, S. (2015). Individual differences among skilled readers: The role of lexical quality. In Pollatsek, A., & Treiman, R. (Eds.), The Oxford Handbook of Reading, (pp. 129–148). Oxford: Oxford University Press.
Andrews, S., & Bond, R. (2009). Lexical expertise and reading skill: Bottom-up and top-down processing of lexical ambiguity. Reading and Writing, 22(6), 687–711.
Andrews, S., & Hersch, J. (2010). Lexical precision in skilled readers: Individual differences in masked neighbor priming. Journal of Experimental Psychology: General, 139(2), 299–318.
Andrews, S., & Veldre, A. (2021). Wrapping up sentence comprehension: The role of task demands and individual differences. Scientific Studies of Reading, 25(2), 123–140.
Andrews, S., Veldre, A., & Clarke, I. E. (2020). Measuring lexical quality: The role of spelling ability. Behavior Research Methods, 52(6), 2257–2282.
Ashby, J., Rayner, K., & Clifton, C. (2005). Eye movements of highly skilled and average readers: Differential effects of frequency and predictability. The Quarterly Journal of Experimental Psychology Section A, 58(6), 1065–1086.
Baayen, R. H., Piepenbrock, R., & van Rijn, H. (1995). The CELEX lexical database (Release 2) [CD-ROM]. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.
Beyersmann, E., Casalis, S., Ziegler, J. C., & Grainger, J. (2015). Language proficiency and morpho-orthographic segmentation. Psychonomic Bulletin & Review, 22(4), 1054–1061.
Burt, J. S., & Fury, M. B. (2000). Spelling in adults: The role of reading skills and experience. Reading and Writing, 13(1), 1–30.
Burt, J. S., & Tate, H. (2002). Does a reading lexicon provide orthographic representations for spelling? Journal of Memory and Language, 46(3), 518–543.
Chmielewski, M., & Kucker, S. C. (2020). An MTurk crisis? Shifts in data quality and the impact on study results. Social Psychological and Personality Science, 11(4), 464–473.
Drieghe, D., Veldre, A., Fitzsimmons, G., Ashby, J., & Andrews, S. (2019). The influence of number of syllables on word skipping during reading revisited. Psychonomic Bulletin & Review, 26(2), 616–621.
Elson, M., Mohseni, M. R., Breuer, J., Scharkow, M., & Quandt, T. (2014). Press CRTT to measure aggressive behavior: The unstandardized use of the competitive reaction time task in aggression research. Psychological Assessment, 26(2), 419–432.
Eskenazi, M. A., & Folk, J. R. (2015). Reading skill and word skipping: Implications for visual and linguistic accounts of word skipping. Journal of Experimental Psychology: Learning, Memory, and Cognition, 41(6), 1923–1928.
Eskenazi, M. A., Swischuk, N. K., Folk, J. R., & Abraham, A. N. (2018). Uninformative contexts support word learning for high-skill spellers. Journal of Experimental Psychology: Learning, Memory, and Cognition, 44(12), 2019–2025.
Eskenazi, M. A., Kemp, P., & Folk, J. R. (2021). Word skipping during the lexical acquisition process. Quarterly Journal of Experimental Psychology, 74(3), 548–558.
Forster, K. I., Davis, C., Schoknecht, C., & Carter, R. (1987). Masked priming with graphemically related forms: Repetition or partial activation? The Quarterly Journal of Experimental Psychology Section A, 39(2), 211–251.
Hersch, J., & Andrews, S. (2012). Lexical quality and reading skill: Bottom-up and top-down contributions to sentence processing. Scientific Studies of Reading, 16(3), 240–262.
Holgado-Tello, F. P., Chacón-Moscoso, S., Barbero-García, I., & Vila-Abad, E. (2010). Polychoric versus Pearson correlations in exploratory and confirmatory factor analysis of ordinal variables. Quality & Quantity, 44(1), 153–166.
Holmes, V. M., & Carruthers, J. (1998). The relation between reading and spelling in skilled adult readers. Journal of Memory and Language, 39, 264–289.
Kuperman, V., & Van Dyke, J. A. (2011). Effects of individual differences in verbal skills on eye-movement patterns during sentence reading. Journal of Memory and Language, 65(1), 42–73.
Landi, N. (2010). An examination of the relationship between reading comprehension, higher-level and lower-level reading sub-skills in adults. Reading and Writing, 23(6), 701–717.
Maydeu-Olivares, A., Shi, D., & Rosseel, Y. (2018). Assessing fit in structural equation models: A Monte-Carlo evaluation of RMSEA versus SRMR confidence intervals and tests of close fit. Structural Equation Modeling: A Multidisciplinary Journal, 25(3), 389–402.
Osborne, J. W. (2002). Effect sizes and the disattenuation of correlation and regression coefficients: Lessons from educational psychology. Practical Assessment, Research, and Evaluation, 8, 11.
Parker, A. J., & Slattery, T. J. (2021). Spelling ability influences early letter encoding during reading: Evidence from return-sweep eye movements. Quarterly Journal of Experimental Psychology, 74(1), 135–149.
Perfetti, C. (2007). Reading ability: Lexical quality to comprehension. Scientific Studies of Reading, 11(4), 357–383.
R Core Team (2021). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. https://www.R-project.org/
Rahmanian, S., & Kuperman, V. (2019). Spelling errors impede recognition of correctly spelled word forms. Scientific Studies of Reading, 23(1), 24–36.
Rayner, K. (1998). Eye movements in reading and information processing: 20 years of research. Psychological Bulletin, 124, 372–422.
Revelle, W. (2021). psych: Procedures for psychological, psychometric, and personality research. Northwestern University, Evanston, Illinois. R package version 2.1.6. https://cran.r-project.org/package=psych
Rosseel, Y. (2012). lavaan: An R package for structural equation modeling. Journal of Statistical Software, 48(2), 1–36.
Shi, D., Maydeu-Olivares, A., & Rosseel, Y. (2020). Assessing fit in ordinal factor analysis models: SRMR vs. RMSEA. Structural Equation Modeling: A Multidisciplinary Journal, 27(1), 1–15.
Slattery, T. J., & Yates, M. (2018). Word skipping: Effects of word length, predictability, spelling and reading skill. Quarterly Journal of Experimental Psychology, 71(1), 250–259.
Tan, L. C., & Yap, M. J. (2016). Are individual differences in masked repetition and semantic priming reliable? Visual Cognition, 24(2), 182–200.
Trizano-Hermosilla, I., & Alvarado, J. M. (2016). Best alternatives to Cronbach's alpha reliability in realistic conditions: Congeneric and asymmetrical measurements. Frontiers in Psychology, 7, 769.
Veldre, A., & Andrews, S. (2014). Lexical quality and eye movements: Individual differences in the perceptual span of skilled adult readers. The Quarterly Journal of Experimental Psychology, 67(4), 703–727.
Veldre, A., & Andrews, S. (2015). Parafoveal preview benefit is modulated by the precision of skilled readers' lexical representations. Journal of Experimental Psychology: Human Perception and Performance, 41(1), 219–232.
Wechsler, D. (2008). Wechsler Adult Intelligence Scale—Fourth Edition. San Antonio, TX: Pearson Assessment.
Wilkinson, G. S., & Robertson, G. J. (2006). Wide Range Achievement Test 4. Lutz, FL: Psychological Assessment Resources.
Wolf, E. J., Harrington, K. M., Clark, S. L., & Miller, M. W. (2013). Sample size requirements for structural equation models: An evaluation of power, bias, and solution propriety. Educational and Psychological Measurement, 73(6), 913–934.
Xia, Y., & Yang, Y. (2019). RMSEA, CFI, and TLI in structural equation modeling with ordered categorical data: The story they tell depends on the estimation methods. Behavior Research Methods, 51(1), 409–428.
Yates, M., & Slattery, T. J. (2019). Individual differences in spelling ability influence phonological processing during visual word recognition. Cognition, 187, 139–149.
Zinbarg, R. E., Revelle, W., Yovel, I., & Li, W. (2005). Cronbach’s α, Revelle’s β, and McDonald’s ωH: Their relations with each other and two alternative conceptualizations of reliability. Psychometrika, 70(1), 123–133.
Availability of data, materials, and code
Materials used in this study have been made available previously by Burt and Tate (2002) and are accessible in appendices A, B, and C in their original publication.
These studies were not preregistered.
Complete output from each of the models (the 109-word set through the 10-word set, Median 20, Random 20, Prior 20, and Andrews 19), along with the complete data and R code used for analyses, is available at the link below:
https://osf.io/7tuwr/?view_only=c62d60752bbd48fa8dd7cc3ee092ea51
Ethics declarations
Conflict of interest
The authors have no conflicts of interest to disclose.
Ethics approval
This research was approved by the Kent State University and Stetson University Institutional Review Boards, and all research practices were conducted in accordance with the Belmont Report and the National Research Act.
Consent to participate and for publication
All participants completed an informed consent prior to participating in the study and gave their consent to have their data published.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Eskenazi, M.A., Askew, R.L. & Folk, J.R. Precision in the measurement of lexical expertise: the selection of optimal items for a spelling assessment. Behav Res 55, 623–632 (2023). https://doi.org/10.3758/s13428-022-01834-3