Reading involves the complex process of decoding visual symbols to access a lexical representation of a word. Decades of research in visual word form processing have elucidated the detailed process through which a lexical representation is accessed. Much of this research has appropriately taken a “uniformity assumption” to understand how this process works for the average skilled reader (Andrews, 2012, 2015). However, a recent shift in this field of research has been towards understanding how individual differences in lexical abilities influence that process. Although all literate adult readers have developed skills in comprehension, spelling, vocabulary, phonological awareness, and other lexical abilities, there is considerable variability in how well literate adult readers perform in each of these domains (Kuperman & Van Dyke, 2011). Spelling skill in particular has revealed markedly different patterns of eye-movement behavior and reaction times between low-skill and high-skill literate adults.

Differences in participants’ spelling skill, as a marker of orthographic ability and lexical expertise, may explain null effects or contradictory findings in previous research. For example, previous research has shown that low neighborhood words provide facilitatory priming (eble-ABLE) in a masked priming lexical decision task, but high neighborhood words (tand-SAND) do not show this effect (Forster et al., 1987). However, this null effect for high neighborhood primes depends on spelling skill. Andrews and Hersch (2010) found inhibitory priming from high neighborhood primes for high-skill spellers, but facilitatory priming for low-skill spellers. Without measuring spelling skill, the overall averaged sample would have shown no priming effect from high neighborhood primes. Other differential effects of spelling skill indicate that low-skill spellers rely more on context and use top-down processing to identify words, whereas high-skill spellers identify words using bottom-up processing directly from the visual form of the word (Andrews & Bond, 2009; Hersch & Andrews, 2012). Thus, the inclusion of spelling skill as a moderating variable enhances understanding of visual word form processing. It is important to note that these studies often also measure comprehension and vocabulary ability, which make their own contributions to reading processes.

Eye movement behavior during reading is another area where spelling skill has demonstrated differential effects. Eye movement behavior is often used as an index of moment-to-moment cognitive processing during reading (Rayner, 1998). Earlier research has shown that lexical factors such as vocabulary size and comprehension ability influence eye movement behavior, such that high-skill readers, based on measures of vocabulary and comprehension, show more efficient eye movements (Ashby et al., 2005). When taking spelling skill into account, research has shown that high-skill spellers are not only more efficient at processing the currently fixated word, but are also able to extract more information from upcoming words in the parafovea (Veldre & Andrews, 2015). Further, the perceptual span (the amount of information that can be extracted on each fixation) is larger for high-skill spellers than for low-skill spellers (Veldre & Andrews, 2014). Spelling skill also influences the process of learning new words, such that high-skill spellers are better at learning the meaning of new words when reading them in context (Eskenazi et al., 2018).

Taken together, these effects clearly indicate that spelling skill is associated with readers’ eye movement behavior, lexical processing, and lexical acquisition. These effects have been taken as evidence to support the lexical quality hypothesis (LQH; Perfetti, 2007). The LQH explains individual differences in reading skill and comprehension through variability in the quality of lexical representations. A high-quality lexical representation includes fully specified and interconnected knowledge about a word’s orthography, phonology, and semantics. These higher-quality representations result in more efficient activation of word forms, which aids in higher-order processing such as comprehension and text integration. Orthographic precision is of particular importance because these lexical representations are accessed through the written forms of words. In other words, readers with greater knowledge about the written forms of words (orthographic precision) will have greater access to the lexical representations of those words. Thus, spelling ability serves as a good indicator of individual differences in the ability to access lexical representations.

Spelling ability is often measured through a spelling dictation task and a spelling recognition task. However, there is one important shortcoming of research using these tasks: it remains unclear how well individual words in these tasks serve as indicators of spelling ability. In the spelling recognition task, readers identify misspelled words from a set of words that contain some common spelling errors. In the spelling dictation task, participants recall spellings of words after hearing them spoken. The list of words in the spelling dictation task was originally created by selecting, from the 110 words provided by Burt and Tate (2002), 20 words with a broad range of difficulties (Andrews et al., 2020). These two spelling tasks are commonly used in many publications from various researchers investigating individual differences in lexical processing (Andrews & Bond, 2009; Andrews & Veldre, 2021; Beyersmann et al., 2015; Drieghe et al., 2019; Eskenazi et al., 2018; Parker & Slattery, 2021; Rahmanian & Kuperman, 2019; Slattery & Yates, 2018; Tan & Yap, 2016). In these studies, researchers attempt to differentiate low-skill from high-skill spellers, and the average or total number of correctly spelled items is generally used as an estimate of an individual’s spelling ability. By extension, the variance in spelling scores is used as an estimate of the magnitude of individual differences in spelling ability observed in a sample. Use of this variance estimate for statistical hypothesis testing, however, can be problematic when the assumption of error-free measurement has been violated, and the magnitude of error can have profound consequences for the outcomes of those tests.

Given the practice of using a spelling dictation task to measure individual differences in spelling ability, it is important to determine how well these words perform individually and collectively as measures of spelling ability and the degree to which distinct subsets of words may vary in measurement precision. Some evidence already exists to address this question. Recently, Andrews et al. (2020) investigated the set of 20 words that are regularly used in a spelling dictation task and reported good internal consistency and unidimensionality; however, the authors note that this set of words can be refined to improve measurement precision.

Thus, the purpose of the first study was to identify a more precise set of words for a spelling dictation task by starting with all 110 words from Burt and Tate’s word bank. There were four specific goals of the first study: (1) assess precision observed in a spelling dictation task with all 110 words, (2) identify potential differences among individual words that comprise the task in terms of measurement precision and error, (3) compare precision and error estimates of scores from spelling tests composed of the best-performing words (i.e., with maximal precision) to scores from tests used in prior research and scores from tests with words randomly selected from the word bank, and (4) assess the potential impact of item choice on categorizations of spelling ability (i.e., low- and high-skill spellers). The purpose of the second study was to (1) validate this set of items in a new sample, (2) provide evidence of external validity in a proxy sample for the general population, and (3) assess discriminant validity with other measures of lexical ability including vocabulary and comprehension. If measurement is sufficiently improved through intentional selection of the best-performing items, the final result will be a precise set of words that improves statistical power to detect the moderating effects of orthographic processing skill on reading processes.

Classical test theory (CTT) provides a coherent framework to quantify precision and to determine whether the magnitude of measurement error is of practical consequence. A foundational premise of CTT is that variance in scores observed on tests is a composite of two distinct forms of variance: true score variance (in this case, variance reflecting real or genuine differences among participants in their spelling ability) and error variance (in this case, variance attributable to individual words as imperfect measures of spelling ability). This is expressed in the following fundamental equation.

$${\sigma}_o^2={\sigma}_t^2+{\sigma}_e^2$$

where \({\sigma}_{o}^{2}\) represents observed test score variance, \({\sigma}_t^2\) represents true score variance, and \({\sigma}_e^2\) represents error variance. The proportion of variance in observed scores that is attributable to true score variance represents an estimate of measurement precision (i.e., reliability), as follows.

$$\rho =\frac{{\sigma}_t^2}{{\sigma}_t^2+{\sigma}_e^2}$$

Likewise, the proportion of observed score variance that is attributable to error variance represents an estimate of measurement error, and because observed score variance is a composite of true score and error variance, measurement error can also be expressed as 1 − ρ. While there are a variety of tests and metrics of measurement precision, we employed factor analytic models because factor analysis facilitates more exacting tests of item-level reliability (i.e., precision) and of the reliability of scores derived from differing subsets of words that comprise the spelling task. As such, this approach is well aligned with the specific goals of this study.
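To make the arithmetic concrete, the following minimal R sketch computes reliability and its complement from hypothetical variance components (the values are purely illustrative and are not drawn from our data).

```r
# Minimal numeric illustration of the reliability equation above,
# using hypothetical variance components.
true_var  <- 7    # sigma_t^2 (true score variance)
error_var <- 3    # sigma_e^2 (error variance)
rho <- true_var / (true_var + error_var)   # reliability = .70
1 - rho                                    # proportion of error variance = .30
```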

Study 1

Method

Participants

A total of 682 participants were recruited from Kent State University (n = 478), a large public state university, and Stetson University (n = 204), a small private liberal arts university. Prior investigations indicate this sample size far exceeds the minimum required to achieve sufficient statistical power in confirmatory factor analytic models (Wolf et al., 2013). All participants reported English as their native language and had no reported reading disabilities. Nine participants were excluded from the sample for having more than 10% missing data. The sample contained 164 male participants (24%) and 509 female participants (76%). The average age of the sample was 20.16 years (SD = 2.47).

Materials

Previous research on the organization of the orthographic lexicon provided 110 words spanning a wide range of difficulty, from 13% to 100% average accuracy (Burt & Tate, 2002). Thus, these items provided a broad basis from which to select a subset to measure lexical expertise, as the set contains items of low, moderate, and high difficulty. Further, this is the same set of words from which 20 words were selected in the previously mentioned research. The 110 items have an average word length of 8.9 letters (SD = .56) and an average word frequency of 1.77 counts per million (SD = 1.89), determined using the CELEX corpus (Baayen et al., 1995). The full list of items with their average accuracies is included in the supplemental materials.

Procedure

All procedures were first approved by the Kent State University and Stetson University Institutional Review Boards. All materials were presented on a computer using the survey website Qualtrics. The survey JavaScript was edited to prevent participants from using the browser’s built-in spellchecker. Recordings were created for each of the 110 words such that each word was spoken clearly three times. Two words were determined to have multiple possible pronunciations and thus were spoken four times – twice with each pronunciation. The word affluent was spoken with the emphasis on the first syllable or the second syllable, and the word omniscient was spoken with the /s/ sound or with the /š/ sound. Participants were instructed to listen to each word spoken three (or four) times and to spell that word by typing it into the space provided. The 110 words were presented in random order. After completing the spelling portion of the study, participants answered several demographic questions. All participants received course credit for their participation.

Results

Analytic approach

All analyses were conducted using R statistical software version 4.1.0 (R Core Team, 2021) and the psych (Revelle, 2021) and lavaan (Rosseel, 2012) packages. We employed an iterative process of model refinement by fitting a series of unidimensional confirmatory factor analytic (CFA) models, starting with 109 items from Burt and Tate’s word bank. The word occident was excluded from analyses, as it is a heterographic homophone of the word oxidant. The ten items with the most measurement error were removed at each iteration, and a new CFA model was fit, until the final ten items with maximal precision and minimal error were identified. Because the data were binary (i.e., each word spelled correctly or incorrectly), we employed tetrachoric correlation matrices (a special case of polychoric correlations for binary variables) and a diagonally weighted least squares estimator (WLSMV), which have been shown to be more effective when modeling binary (or polytomous) item response structures (Holgado-Tello et al., 2010). We calculated robust χ², CFI, TLI, and SRMR as measures of absolute and relative fit, but we included alternate performance indicators (outlined below), given that CFI and TLI have been shown to be overly optimistic measures of fit when modeling ordinal data (i.e., tetrachoric/polychoric correlation matrices) with diagonally weighted least squares estimators (WLSMV, ULSMV; Xia & Yang, 2019). We also reported SRMR, given its demonstrated superiority over RMSEA in terms of power and Type I error rates (Shi et al., 2020), especially when the number of indicators is large (Maydeu-Olivares et al., 2018). We then addressed each of the four study aims through a comparison of psychometric performance indicators from each test of spelling ability, as follows.
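For illustration, a minimal R sketch of this iterative refinement procedure is shown below. The data frame spelling (binary accuracy scores, one column per word) and the loop structure are simplifying assumptions for exposition rather than a reproduction of our exact analysis script.

```r
library(lavaan)

# 'spelling' is assumed to be a data frame of binary (0/1) accuracy scores,
# one column per word; the loop illustrates the iterative item removal.
items <- colnames(spelling)
repeat {
  model <- paste("spell =~", paste(items, collapse = " + "))
  fit <- cfa(model, data = spelling, ordered = items,
             estimator = "WLSMV", std.lv = TRUE)
  print(fitMeasures(fit, c("chisq.scaled", "cfi.scaled", "tli.scaled", "srmr")))
  if (length(items) <= 10) break
  # Retain the items with the strongest standardized loadings (i.e., drop
  # roughly ten items contributing the most error variance) before refitting.
  loads <- standardizedSolution(fit)
  loads <- loads[loads$op == "=~", ]
  keep  <- loads$rhs[order(loads$est.std, decreasing = TRUE)]
  items <- head(keep, length(items) - min(10, length(items) - 10))
}
```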

Item selection

The first aim of the study was to compare estimates of precision observed in spelling dictation tasks, beginning with the full set of 109 items and progressively removing the items with the most measurement error until only the most precise items remained. As outlined in Table 1, all indicators of model fit (CFI and TLI) and misfit (robust χ² and SRMR) improved as poorly performing items were removed from each successive unidimensional CFA model. We also estimated R² as a conservative measure of precision by dividing the sum of squared factor loadings by the total number of items in a test (i.e., the proportion of total variability among all items explained by “true score” variability in the latent factor of spelling ability). Precision improved consistently as poorly performing items were dropped from the 109-item model (R² = .346) through to the ten-item model (R² = .570). We also noted that the 20-item model was the first to achieve more true score variability than error (i.e., R² > .500). Lastly, McDonald’s Ωh (omega hierarchical) was calculated as an upper-bound estimate of scale precision (i.e., internal consistency reliability), which has been shown to outperform more popular but less robust reliability estimators, such as Cronbach’s α (Trizano-Hermosilla & Alvarado, 2016; Zinbarg et al., 2005). When moving from the 109-item model to the ten-item model, Ωh likewise improved from .75 to .85. In sum, precision in the measurement of orthographic processing improved in each successive test as poorly performing items were dropped from the model, and this trend was observed across every metric evaluated.

Table 1 Robust fit indices, reliability estimates, and scores for each model
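For readers wishing to compute comparable summaries, the following R sketch shows how the R² and Ωh estimates described above could be obtained for a fitted model. The objects fit and spelling_sub are hypothetical placeholders for a fitted lavaan model and the corresponding raw item data.

```r
library(lavaan)
library(psych)

# Standardized loadings from a fitted unidimensional CFA ('fit' is a placeholder)
loads <- subset(standardizedSolution(fit), op == "=~")$est.std

# R-squared precision estimate: proportion of total item variance explained by
# the latent spelling factor (sum of squared loadings / number of items)
r2 <- sum(loads^2) / length(loads)

# Omega hierarchical from the psych package, using polychoric (tetrachoric)
# correlations for the binary items in 'spelling_sub'
om <- omega(spelling_sub, nfactors = 3, poly = TRUE)
om$omega_h
```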

Item precision

The second aim of the study was to identify potential differences in the precision of individual words that comprise the spelling task (i.e., to test the tau-equivalence assumption of the classical test theory framework). The factor loadings presented in Table 2 represent the correlation between the factor score (i.e., the true score of spelling ability) and the outcome of the spelling trial for each word. Words with higher factor loadings are more strongly correlated with the latent construct (in this case, spelling ability) and accordingly serve as better indicators of spelling ability. The average factor loading increased from .580 in the 109-item model to .755 in the ten-item model, indicating that the words in the ten-item model function as more precise measures of spelling ability. Further, the absolute range of factor loadings observed in the 109-item model was much wider (.251–.774) than that observed in the ten-item model (.722–.794), as was the standard deviation (.093 in the 109-item model vs. .026 in the ten-item model). The increase in factor loadings, coupled with the threefold reduction in the variability of factor loadings in the ten-item model, indicates that both precision and consistency of precision improve when poorly performing items are removed from the model.

Table 2 Words and factor loadings for the Best 20 and Best 10 models

Comparison of various item sets

The third aim of the study was to compare precision and error estimates of scores from tests composed of the 20 best-performing words (i.e., with maximal precision) to four other sets of words. We chose lists of 20 words because the Best 20 model was the first to demonstrate more true score variability than error. The first set of 20 words was randomly selected from Burt and Tate’s word bank (Random 20). The second set was composed of the 20 items from the middle range of difficulty (Median 20). The third set comprised the words used in previous research (Eskenazi et al., 2018; Eskenazi & Folk, 2015), which had been semi-randomly selected to include a wide range of difficulty (Prior 20). Finally, the fourth list consisted of the words recently subjected to psychometric testing by Andrews et al. (2020). This list is similar to Prior 20 in that it was sampled from Burt and Tate and was designed to include words with a wide range of difficulties. One word on this list (persuade) was not sampled from Burt and Tate, and thus we can only report on 19 words from this list (Andrews 19). Each of these four comparison lists represents a possible approach that other researchers might take when designing their spelling dictation tasks. The model with the 20 best-performing items evidenced greater precision on both reliability metrics: R² for Best 20 (.515), Random 20 (.342), Median 20 (.325), Prior 20 (.343), and Andrews 19 (.330); Ωh for Best 20 (.85), Random 20 (.70), Median 20 (.72), Prior 20 (.73), and Andrews 19 (.75). This finding suggests that the accuracy of spelling ability tests can be improved if investigators use the 20 best-performing words rather than 20 randomly selected words, 20 words in the median difficulty range, or, importantly, 20 words used in previous research. The Best 20 words and their factor loadings are listed in Table 2.
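A comparison across candidate 20-word lists could be implemented along the following lines; the objects best20, median20, and prior20 are hypothetical character vectors of word (column) names, not the published lists themselves.

```r
library(lavaan)

# Each element is a character vector of 20 column names in 'spelling';
# these objects are illustrative placeholders.
word_sets <- list(best20   = best20,
                  random20 = sample(colnames(spelling), 20),
                  median20 = median20,
                  prior20  = prior20)

r2_by_set <- sapply(word_sets, function(w) {
  m <- paste("spell =~", paste(w, collapse = " + "))
  f <- cfa(m, data = spelling[, w], ordered = w,
           estimator = "WLSMV", std.lv = TRUE)
  l <- subset(standardizedSolution(f), op == "=~")$est.std
  sum(l^2) / length(l)   # R-squared precision estimate for this word set
})
r2_by_set
```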

Impact of precision and error on classification

The fourth aim of the study was to assess the potential impact of item choice on categorizations of spelling ability (i.e., low- and high-skill spellers). Spelling dictation tasks are often used as a method to categorize participants into high-skill or low-skill spelling groups. Thus, we assessed the degree to which participants would change groups depending on the item set used (Best 20, Random 20, Prior 20). We first categorized participants into groups with a median split using the Best 20 items and then determined the number of participants that would change groups if using items from the other two sets. When using the Random 20 set, 14% of participants would have changed groups, with 7% moving from high-skill to low-skill and 7% moving from low-skill to high-skill. When using the Prior 20 set, 20% of participants would have changed groups, with 11% moving from high-skill to low-skill and 9% moving from low-skill to high-skill. Thus, the categorization of participants also varies considerably depending on the item set selected.
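For illustration, this classification comparison can be implemented with a median split on total scores, as in the following sketch (spelling, best20, and random20 are again hypothetical objects).

```r
# Median-split classification on total scores from two word sets
# ('spelling', 'best20', and 'random20' are illustrative objects).
score_best   <- rowSums(spelling[, best20])
score_random <- rowSums(spelling[, random20])

group_best   <- ifelse(score_best   >= median(score_best),   "high", "low")
group_random <- ifelse(score_random >= median(score_random), "high", "low")

# Proportion of participants whose skill-group label changes across item sets
mean(group_best != group_random)
table(best = group_best, random = group_random)
```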

Discussion

Results of the first study identified a list of 20 words with greater precision and less error variance than other potential sets of words that could be used to measure spelling ability. Before this set of words can be recommended for use in practice, they must be validated in another sample and evidence must be provided for discriminant validity. Thus, the purpose of the second study was to test the fit of these 20 words in another sample of college students, to provide evidence of generalizability in an MTurk sample that approximates characteristics of the general population, and to test discriminant validity with measures of vocabulary ability and comprehension ability.

Study 2

Method

Participants

The second study included 786 participants, comprising a sample of college students from Stetson University (n = 372) and a sample from Amazon Mechanical Turk (MTurk; n = 414). MTurk is a crowdsourcing tool for collecting data from a large, more representative sample. The college sample served as a validation check of the findings from the first study, and the MTurk sample served as a generalizability check. The college student sample had an average age of 20.85 years (SD = 4.75) and was mostly female (73%). The MTurk sample had an average age of 41.16 years (SD = 13.85) and was also mostly female (65%). All participants reported English as their native language and had no reported reading disabilities.

Materials

Spelling measures

The same set of words from the first study was used to validate the spelling measure. A spelling recognition task was also included to determine the degree to which this spelling measure is associated with other commonly used measures of spelling ability. In the spelling recognition task, participants saw 50 words and identified the incorrectly spelled words. Half of the words contained phonologically plausible errors created by either removing or adding one letter (e.g., rasberry, reccommend).

Vocabulary measure

Vocabulary skill was measured using the vocabulary subtest of the Wechsler Adult Intelligence Scale IV (WAIS-IV; Wechsler, 2008). This measure is appropriate for adults aged 16 through 90 and provides age-normed percentile rank scores. This measure included 30 words, and participants were instructed to provide a brief definition for each word. An example word and definition were provided for participants before they began. The average age-normed score was 78th percentile (SD = 19) with a range from 1st percentile to 100th percentile.

Comprehension measure

Comprehension was measured using the sentence comprehension subtest of the Wide Range Achievement Test 4 (WRAT 4; Wilkinson & Robertson, 2006). This measure is appropriate for children and adults from ages 5 through 94 and provides age-normed percentile ranks of comprehension ability. In this task, participants read 50 sentences and were instructed to enter the most appropriate word or short two-word phrase that best completed the sentence. Before beginning, participants were provided with a sample sentence with several possible correct answers as examples. The average age-normed score was 60th percentile (SD = 25) with a range from 1st percentile to 98th percentile.

Procedure

All study procedures were approved by the Stetson University Institutional Review Board. After providing informed consent, participants completed each of the measures described above. The only difference in procedure was that the college students were compensated with course credit, whereas the MTurk sample was compensated with $1.00 after completing the study. To ensure data integrity, several attention checks and non-human response checks were used because MTurk samples regularly include non-human or non-attentive responses (Chmielewski & Kucker, 2020). First, participants listened to an audio file of the word “apple” and were instructed to type the word. Any participant who failed to enter this word correctly was blocked from accessing the rest of the study. Second, a CAPTCHA was included as a non-human response filter. Third, three attention check questions were included randomly throughout the study, which instructed participants to type a specific phrase to ensure that they were paying attention. Finally, participants were instructed to write three complete English sentences in response to a prompt. Any participant who failed the initial audio check or the CAPTCHA, failed at least one of the three attention checks, or provided gibberish or broken-English responses to the final question was not included in data analysis.

Results

External validation

To test the external validity of findings from Study 1, we assessed the psychometric performance of the Best 20 and Best 10 items in the new college student sample and the MTurk sample. As presented in Table 3, similar indicators of model fit (CFI and TLI) and misfit (χ² and SRMR) were observed. Moreover, R² and Ωh were also in line with findings from Study 1, as were the mean factor loadings in both samples. The standard deviations of the factor loadings, however, were larger in both samples, with the largest difference observed in the MTurk sample. These findings provide additional support for the external validity of the results from Study 1.

Table 3 Robust fit indices, reliability estimates, and scores for each model

Discriminant validity

Correlation analyses were conducted to determine whether the Best 20 and Best 10 measures were sufficiently unrelated to measures of other lexical abilities, including comprehension skill and vocabulary skill. Previous research has shown that spelling skill and comprehension skill are moderately or strongly related (Landi, 2010: r = .23; Burt & Fury, 2000: r = .26; Andrews & Veldre, 2021: r = .31; Yates & Slattery, 2019: r = .54). In the current study, the Best 20 (r = .190) and Best 10 (r = .153) spelling measures were weakly related to comprehension skill, which provides strong evidence of discriminant validity (see Table 4). Similarly, previous research has shown moderate to strong correlations between spelling skill and vocabulary skill (Landi, 2010: r = .30; Burt & Fury, 2000: r = .47; Andrews & Veldre, 2021: r = .43; Yates & Slattery, 2019: r = .58). The current study showed moderate correlations of the Best 20 (r = .355) and Best 10 (r = .318) measures with vocabulary skill, which again demonstrates discriminant validity.

Table 4 Correlations between spelling accuracy in each model and comprehension, vocabulary, and spelling recognition

We also tested correlations with a secondary spelling measure: spelling recognition skill. Previous research has shown that spelling dictation and spelling recognition are strongly related (Yates & Slattery, 2019: r = .73; Eskenazi et al., 2021: r = .65; Andrews & Veldre, 2021: r = .75). In the current study, the Best 20 (r = .601) and Best 10 (r = .539) lists were also strongly related to spelling recognition skill. In sum, the Best 20 and Best 10 lists show the expected pattern of correlations with other measures of lexical ability.
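For illustration, these validity correlations could be obtained with a single call along the following lines, assuming a data frame dat with one row per participant and hypothetical column names for each total score.

```r
# 'dat' is assumed to hold one row per participant with total scores on each
# measure; the column names are illustrative placeholders.
validity_vars <- c("best20", "best10", "comprehension",
                   "vocabulary", "spelling_recognition")
round(cor(dat[, validity_vars], use = "pairwise.complete.obs"), 2)
```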

General discussion

Previous research has demonstrated the importance of measuring spelling skill in studies of lexical processing, and our aim was to improve precision when measuring spelling skill. To that end, we assessed the precision of individual words, compared precision of different sets of words, and compared classifications of high- and low-skill spellers using different sets of words, with the following findings.

First, we observed substantial measurement error in the full list of 109 words, and estimates of error varied considerably among words. Second, we identified a set of 20 words with the least amount of error in the measurement of spelling skill, which performed better than multiple other sets of words, including those used in previous research. Third, there was substantial variability in the categorization of participants into high-skill and low-skill spelling groups depending on the set of words used to measure spelling skill. Fourth, the precision of this set of words replicated in another sample of college students and was similar in an Internet sample serving as a proxy for the general population. Finally, scores derived from this set of words showed discriminant validity with respect to vocabulary and comprehension abilities. Taken together, these findings indicate that researchers should carefully consider the choice of words used to assess spelling ability, and in particular item-level reliability (i.e., precision), as the power of statistical tests is directly affected by the amount of error variance contaminating the total variance of observed scores.

Moreover, the final set of 20 words with maximized reliability was the only set to result in more true-score variability than error. Indeed, one of the most important findings of this study is that randomly selecting a set of words or using a set of words with different ranges of difficulty does not ensure precise measurement. Although the set of words proposed by Andrews et al. (2020) has provisional reliability evidence supporting its use (an alpha of .81 reported in their data, an omega hierarchical of .75 reported here), their list of words evidences markedly lower precision, similar to that of a randomly selected list of words. As such, the intentional selection of the Best 20 words through iterative model refinement is a demonstrably better process for creating a more precise measure of spelling ability.

An additional benefit of this set of words is that it can be used to measure spelling ability in multiple populations. As with many areas of psychology, most research on visual word form processing, eye movement behavior, and reading processes is conducted using college student samples. Therefore, the original intent of this research was to create a list of words that works well with college students. However, the results of Study 2 demonstrate that this set of words works similarly in a broader sample with a wide range of educational attainment and ages. Thus, as researchers work with samples from the general public, the list of words can provide levels of precision similar to those observed in college student samples.

The fact that the final set of 20 words in this study provided greater measurement precision than other possible sets of words has implications for previous research using other sets of words. Some studies employing tasks that measure spelling ability have evidently had sufficient statistical power to detect the moderating effect of spelling ability on various reading processes. However, because of publication bias, it is not clear how many studies failed to detect these effects, and Type II error rates have likely been inflated, given the inverse relationship between measurement error and statistical power. Moreover, given that correlations decrease in strength relative to the magnitude of measurement error (Osborne, 2002), it is likely that previous studies employing less precise words have underestimated the true effects of spelling ability.

One might wonder whether optimally precise measurement can be achieved with fewer than the 20 (or the 10) best words we identified. While it is tempting to reduce the number of items to as few as possible, there are foreseeable consequences of doing so. Given that a primary goal of measurement is to assess individual differences, it is important to note that test score variance decreases as the number of items that comprise the test decreases, as per the variance sum law. As an extreme example, a test with only the single most discriminating word would yield only two possible scores (correct or incorrect), and a test with only two words would allow only three possible scores. Therefore, researchers opting for too few words in a spelling task may inadvertently limit the range of detectable differences in their sample. We report and recommend the Best 20 word list because it yields larger variance in spelling scores and because it was the largest item set we tested in which more than half of the total item variability was explained by latent spelling ability (i.e., R² > .50).
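The following small simulation sketch illustrates both points under the simplifying assumptions of independent items and an arbitrary item difficulty; it is intended only to make the variance sum law concrete, not to reproduce our data.

```r
# Simulated binary items: shorter tests allow fewer distinct total scores and,
# all else being equal, yield smaller total-score variance (independent items
# and a single difficulty value are simplifying assumptions for illustration).
set.seed(1)
p <- 0.6   # hypothetical probability of spelling an item correctly
summary_by_length <- t(sapply(c(1, 2, 10, 20), function(k) {
  scores <- rowSums(matrix(rbinom(1000 * k, 1, p), ncol = k))
  c(items = k, possible_scores = k + 1, score_variance = var(scores))
}))
summary_by_length
```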

The conclusions of this work come with several important limitations. First, although the Best 20 list provides a clear improvement over other potential lists, there is still room for improvement in measurement precision. The initial list of 110 words represents only a sample from a population of thousands of potential words that could be used in a spelling dictation task. A larger initial list may have contained words that would more precisely measure spelling skill. Our method nonetheless provides researchers with an intentional, quantitative process for selecting items that is preferable to subjective selection. Second, our analysis improved only one of at least two potential means of measuring spelling ability. A separate 88-item spelling recognition task with good reported internal consistency has been developed; however, there is room for improvement in its item selection as well (Andrews et al., 2020). Whether one, two, or more measures should be used to assess spelling ability is an open question. Andrews et al. note that there is “little evidence that dictation and recognition tests tap independent input and output representations” (p. 2268); however, other researchers suggest that dictation tasks may be more associated with a spelling lexicon whereas recognition tasks may be more associated with a reading lexicon (Yates & Slattery, 2019). Whether the two tests make independent contributions warrants further research. Finally, the scoring method used in this task was a binary correct or incorrect categorization. Incorrect spellings may reflect a wide range of error types, such as phonologically plausible errors, phonologically implausible errors, phoneme additions, phoneme omissions, transpositions, or other errors (Holmes & Carruthers, 1998). It is possible that categorizing each spelling error could provide more information about a participant’s true spelling ability; however, it remains an open question whether a more complex scoring method would provide additional measurement precision. In this study, we chose to retain the binary scoring method to maximize generalizability to the extant literature on spelling ability. Future research should investigate whether alternative scoring methods provide additional measurement precision beyond what can be achieved through item refinement.

The past decade has seen increased interest in the construct of lexical expertise as a moderating variable in many reading, eye movement, and visual word form processing effects. Given the current replicability crisis in psychology, precision in measurement is paramount to ensure that observed effects are real and replicable (Elson et al., 2014). The current study directly illustrates the impact of word choice on the error variance in test scores and provides a subset of 20 words with optimized measurement precision. It is our hope that these findings will inform future investigations of lexical expertise and that, through improved measurement, researchers will have more power to detect the moderating effects of spelling skill on reading processes.