5 Modeling

Chapter 5 of the Dynamic Learning Maps® (DLM®) Alternate Assessment System 2021–2022 Technical Manual Update—Science (Dynamic Learning Maps Consortium, 2022) provides a complete description of the psychometric model used to calibrate and score DLM assessments, including the psychometric background, structure of the assessment system, suitability for diagnostic modeling, and a detailed summary of the procedures used to calibrate and score DLM assessments. This chapter provides a high-level summary of the model used to calibrate and score assessments, along with a summary of updated modeling evidence from the 2023–2024 administration year.

5.1 Psychometric Background

Learning maps, which are the networks of sequenced learning targets, are at the core of the DLM assessments in English language arts and mathematics. While development of a science learning map is planned for future development work, scoring at the linkage level is similar across all subjects, so the general background below is useful for understanding the current science scoring model even though science does not yet have an underlying map.

Because of the underlying map structure and the goal of providing more fine-grained information beyond a single raw or scale score value, student results are reported as a profile of skill mastery. This profile is created using diagnostic classification modeling (e.g., Bradshaw, 2016), which draws on research in cognition and learning theory to provide information about student mastery of multiple skills measured by the assessment. Results are reported for each Essential Element (EE) at the three levels of complexity (linkage levels) for which assessments are available: Initial, Precursor, and Target.

Data from the previous three administrations (2020–2021, 2021–2022, and 2022–2023) were retained to calibrate the models at each linkage level. The previous three years were chosen to use data most consistent with the current administration model, minimize confounds due to the outbreak of COVID-19, and increase model estimation efficiency. We retained data from additional administrations in cases where the sample size for the EE and linkage level was less than 250. The threshold of 250 was chosen based on a review of previous operational calibrations using all of the available data, which indicated a sample size of 250 is sufficient to obtain adequate psychometric properties. Retaining data from additional administrations was unnecessary for most linkage levels. The combined sample size for the three previous administrations was at least 250 for 102 linkage levels (100%). In cases where the combined sample size from the previous three administrations was less than 250, we prioritized data from more recent administrations when retaining additional data.

Each linkage level is calibrated separately for each EE using separate log-linear cognitive diagnosis models (LCDMs; Henson et al., 2009). Each linkage level within an EE is estimated separately because of the administration design, in which it is uncommon for students to take testlets at multiple levels for an EE. Also, because items are developed to meet a precise cognitive specification (see Chapter 3 of the 2021–2022 Technical Manual Update—Science; Dynamic Learning Maps Consortium, 2022), the item parameters defining the probability of masters and nonmasters providing a correct response are assumed to be equal for all items measuring a linkage level. That is, all items are assumed to be fungible, or exchangeable, within a linkage level.

The DLM scoring model for the 2023–2024 administration was implemented as follows. Each linkage level within each EE was considered the latent variable to be measured (the attribute). Using diagnostic classification models (DCMs), a probability of mastery on a scale of 0 to 1 was calculated for each linkage level within each EE. Students were then classified into one of two classes for each linkage level of each EE: master or nonmaster. As described in Chapter 6 of the 2021–2022 Technical Manual Update—Science (Dynamic Learning Maps Consortium, 2022), a posterior probability of at least .8 was required for mastery classification. As per the assumption of item fungibility, a single set of probabilities was estimated for all items within a linkage level. Finally, only a single structural parameter was needed (\(\nu\)), which is the probability that a randomly selected student who is assessed on the linkage level is a master. In total, three parameters per linkage level are specified in the DLM scoring model: a fungible intercept, a fungible main effect, and the proportion of masters.
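To make the classification step concrete, the following sketch computes the posterior probability of mastery for a single linkage level under the fungible-item assumption. The function and parameter values are illustrative, not the operational implementation; the two conditional probabilities and the base rate \(\nu\) correspond to the three parameters described above.

```python
import math

def mastery_posterior(responses, p_master, p_nonmaster, base_rate):
    """Posterior probability of mastery for one linkage level.

    responses: list of 0/1 item scores on the linkage level.
    p_master / p_nonmaster: fungible conditional probabilities of a
        correct response for masters and nonmasters.
    base_rate: structural parameter (nu), the prior proportion of masters.
    """
    def likelihood(p):
        # Product of Bernoulli probabilities across the fungible items
        return math.prod(p if x else (1 - p) for x in responses)

    numerator = base_rate * likelihood(p_master)
    denominator = numerator + (1 - base_rate) * likelihood(p_nonmaster)
    return numerator / denominator

# A student answering 4 of 5 items correctly on a discriminating level
post = mastery_posterior([1, 1, 1, 0, 1],
                         p_master=0.8, p_nonmaster=0.3, base_rate=0.5)
is_master = post >= 0.8  # mastery requires a posterior of at least .8
```

With these hypothetical values the posterior is about .94, so the student would be classified as a master of the linkage level.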

Once the LCDM parameters have been calibrated, student mastery probabilities are obtained for each assessed linkage level, and these probabilities are used to determine the highest linkage level mastered for each EE. Although connections between linkage levels are not modeled empirically, they are used in the scoring procedures (see Section 7.3.1). In particular, if the LCDM determines a student has mastered a given linkage level within an EE, then the student is assumed to have mastered all lower levels within that EE.
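Assuming a simple representation of the classifications, the downward propagation of mastery can be sketched as follows. The level ordering matches the science linkage levels; the helper functions themselves are hypothetical.

```python
# Linkage levels for science EEs, ordered from lowest to highest complexity
LEVELS = ["Initial", "Precursor", "Target"]

def propagate_mastery(classified):
    """Expand LCDM classifications downward: mastery of a linkage level
    implies mastery of all lower levels within the same EE.

    classified: dict mapping level name -> bool mastery classification.
    """
    mastered = set()
    for i, level in enumerate(LEVELS):
        if classified.get(level, False):
            mastered.update(LEVELS[: i + 1])
    return mastered

def highest_mastered(classified):
    """Highest linkage level mastered for an EE, or None if none."""
    mastered = propagate_mastery(classified)
    for level in reversed(LEVELS):
        if level in mastered:
            return level
    return None

# A master at Target is credited with Precursor and Initial as well
result = propagate_mastery({"Target": True})
```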

5.2 Model Evaluation

Model fit and classification accuracy are critical to making valid inferences about student mastery. If the model used to calibrate and score the assessment does not fit the data well, results from the assessment may not accurately reflect what students know and can do. Also called absolute model fit (e.g., Chen et al., 2013), model fit involves an evaluation of the alignment between the three parameters estimated for each linkage level and the observed item responses. Classification accuracy refers to how well the classifications represent the true underlying latent class. The accuracy of the assessment results (i.e., the classifications) is a prerequisite for the validity of inferences made from the results. Thus, the accuracy of the classifications is perhaps the most crucial aspect of model evaluation from a practical and operational standpoint. Model fit and classification accuracy results from the 2023–2024 administration year are provided in the following sections.

For a complete description of the methods and process used to evaluate model fit and classification accuracy, see Chapter 5 of the 2021–2022 Technical Manual Update—Science (Dynamic Learning Maps Consortium, 2022).

5.2.1 Model Fit

Linkage levels were flagged for misfit if the adjusted posterior predictive p-value (ppp) was less than .05. Table 5.1 shows the number and percentage of models with acceptable model fit (i.e., ppp > .05) by linkage level. Across all linkage levels, 41 (40%) of the estimated models showed acceptable model fit. Misfit was not evenly distributed across the linkage levels: the lower linkage levels were flagged at a higher rate than the higher linkage levels. This is likely due to the greater diversity in the student population at the lower linkage levels (e.g., required supports, expressive communication behaviors), which may affect item response behavior. To address the areas where misfit was detected, we are prioritizing test development for linkage levels flagged for misfit so that testlets contributing to misfit can be retired. For a description of item development practices, see Chapter 3 of the 2021–2022 Technical Manual Update—Science (Dynamic Learning Maps Consortium, 2022). We also plan to incorporate additional item quality statistics into the review of field test data to ensure that only items and testlets that conform to the model expectations are promoted to the operational assessment. Overall, however, the fungible LCDM models appear to largely reflect the observed data. Finally, it should be noted that a linkage level flagged for model misfit may still have high classification accuracy, indicating that student mastery classifications are accurate even in the presence of misfit. Data demonstrating that linkage levels may have high classification accuracy despite being flagged for model misfit are presented in the next section.

Table 5.1: Number and Percentage of Models With Acceptable Model Fit for the 2023–2024 Administration Year (ppp > .05)
Linkage level    n (%)
Initial           7 (20.6)
Precursor        16 (47.1)
Target           18 (52.9)
Note. ppp = posterior predictive p-value. ppp > .05 indicates acceptable model fit.
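The general logic of a Monte Carlo posterior predictive check can be sketched as below. This is an illustrative simplification that uses the mean total score as the discrepancy measure; the operational adjusted ppp is computed differently (see Chapter 5 of the 2021–2022 Technical Manual Update—Science), and the function and parameter values here are hypothetical.

```python
import random

def ppp_mean_score(observed_scores, p_master, p_nonmaster, base_rate,
                   n_items, n_reps=500, seed=2024):
    """Monte Carlo posterior predictive p-value for the mean total score.

    Simulates replicated data sets from the fitted fungible-item model and
    reports the proportion of replications whose mean score is at least as
    large as the observed mean. Values near 0 or 1 indicate misfit.
    """
    rng = random.Random(seed)
    obs_mean = sum(observed_scores) / len(observed_scores)

    def simulate_score():
        # Draw mastery status from the base rate, then item responses
        p = p_master if rng.random() < base_rate else p_nonmaster
        return sum(rng.random() < p for _ in range(n_items))

    extreme = 0
    for _ in range(n_reps):
        rep = [simulate_score() for _ in observed_scores]
        if sum(rep) / len(rep) >= obs_mean:
            extreme += 1
    return extreme / n_reps

# Observed scores far below the model's expectation yield an extreme ppp
ppp = ppp_mean_score([0] * 50, p_master=0.8, p_nonmaster=0.3,
                     base_rate=0.5, n_items=5, n_reps=100)  # -> 1.0
```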

5.2.2 Classification Accuracy

Table 5.2 shows the number and percentage of models within each linkage level that demonstrated each category of classification accuracy. Across all estimated models, 62 linkage levels (61%) demonstrated at least fair classification accuracy. The proportion of models demonstrating at least good classification accuracy was markedly higher at the Initial linkage level than at the Precursor and Target levels. The classification accuracy results have been consistent across years since classification accuracy for the DLM assessments was introduced in 2021–2022. As was the case for model misfit, linkage levels flagged for low classification accuracy are prioritized for test development.

Table 5.2: Estimated Classification Accuracy by Linkage Level for the 2023–2024 Administration Year
Linkage level   Weak (%)    Poor (%)     Fair (%)     Good (%)     Very good (%)   Excellent (%)
                .00–.54     .55–.82      .83–.88      .89–.94      .95–.98         .99–1.00
Initial         0 (0.0)      2 (5.9)      5 (14.7)    22 (64.7)    5 (14.7)        0 (0.0)
Precursor       0 (0.0)     17 (50.0)    11 (32.4)     6 (17.6)    0 (0.0)         0 (0.0)
Target          0 (0.0)     21 (61.8)     9 (26.5)     1 (2.9)     3 (8.8)         0 (0.0)
Note. Cell values are n (%); the range under each column label indicates the classification accuracy values for that category.

When looking at absolute model fit and classification accuracy in combination, linkage levels flagged for absolute model misfit often have high classification accuracy. Of the 61 linkage levels that were flagged for absolute model misfit, 43 (70%) showed fair or better classification accuracy. Thus, even when misfit is present, we can be confident in the accuracy of the mastery classifications. In total, 82% of linkage levels (n = 84) had acceptable absolute model fit and/or acceptable classification accuracy.

5.3 Calibrated Parameters

As stated previously in this chapter, the item parameters for diagnostic assessments are the conditional probability of masters and nonmasters providing a correct response. Because of the assumption of fungibility, parameters are calculated for each of the 102 linkage levels in science. Parameters include a conditional probability of nonmasters providing a correct response and a conditional probability of masters providing a correct response. Across all linkage levels, the conditional probability that masters provide a correct response is generally expected to be high, while it is expected to be low for nonmasters. In addition to the item parameters, the psychometric model also includes a structural parameter, which defines the base rate of class membership for each linkage level. A summary of the operational parameters used to score the 2023–2024 assessment is provided in the following sections.
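Under the LCDM, the fungible intercept and main effect map to the two conditional probabilities through an inverse-logit link. A minimal sketch with hypothetical parameter values:

```python
import math

def inv_logit(x):
    """Inverse logit (logistic) function, mapping log-odds to probability."""
    return 1 / (1 + math.exp(-x))

# Hypothetical fungible LCDM parameters for a single linkage level
intercept = -1.0    # log-odds of a correct response for nonmasters
main_effect = 2.5   # increase in log-odds of a correct response for masters

p_nonmaster = inv_logit(intercept)              # ~ .27
p_master = inv_logit(intercept + main_effect)   # ~ .82
```

Because the main effect is constrained to be positive in the LCDM, masters always have a higher conditional probability of a correct response than nonmasters.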

5.3.1 Probability of Masters Providing a Correct Response

When items measuring each linkage level function as expected, students who have mastered the linkage level have a high probability of providing a correct response. Instances where masters have a low probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure or that students who have mastered the content select a response other than the key. These instances may result in students who have mastered the content providing incorrect responses and being incorrectly classified as nonmasters. This outcome has implications for the validity of inferences that can be made from results, including educators using results to inform instructional planning in the subsequent year.

Using the 2023–2024 operational calibration, Figure 5.1 depicts the conditional probability of masters providing a correct response to items measuring each of the 102 linkage levels. Because the point of maximum uncertainty is .50 (i.e., equal likelihood of mastery or nonmastery), masters should have a greater than 50% chance of providing a correct response. The results in Figure 5.1 demonstrate that the vast majority of the linkage levels (n = 101, 99%) performed as expected. Additionally, 93% of linkage levels (n = 95) had a conditional probability of masters providing a correct response over .60. None of the linkage levels had a conditional probability of masters providing a correct response less than .40.

Figure 5.1: Probability of Masters Providing a Correct Response to Items Measuring Each Linkage Level for the 2023–2024 Administration Year

5.3.2 Probability of Nonmasters Providing a Correct Response

When items measuring each linkage level function as expected, nonmasters of the linkage level have a low probability of providing a correct response. Instances where nonmasters have a high probability of providing correct responses may indicate that the linkage level does not measure what it is intended to measure or that the correct answers to items measuring the level are easily guessed. These instances may result in students who have not mastered the content providing correct responses and being incorrectly classified as masters.

Figure 5.2 summarizes the probability of nonmasters providing correct responses to items measuring each of the 102 linkage levels. There is greater variation in these probabilities than was observed for masters. While the majority of the linkage levels (n = 89, 87%) performed as expected, nonmasters sometimes had a greater than .50 chance of providing a correct response to items measuring the linkage level. However, none of the linkage levels had a conditional probability of nonmasters providing a correct response greater than .60.

Figure 5.2: Probability of Nonmasters Providing a Correct Response to Items Measuring Each Linkage Level for the 2023–2024 Administration Year

5.3.3 Item Discrimination

The discrimination of a linkage level represents how well the items are able to differentiate masters and nonmasters. For diagnostic models, this is assessed by comparing the conditional probabilities of masters and nonmasters providing a correct response. Linkage levels that are highly discriminating will have a large difference between the conditional probabilities, with a maximum value of 1.00 (i.e., masters have a 100% chance of providing a correct response and nonmasters a 0% chance). Figure 5.3 shows the distribution of linkage level discrimination values. Overall, 63% of linkage levels (n = 64) have a discrimination greater than .40, indicating a large difference between the conditional probabilities (e.g., .75 to .35, .90 to .50). However, there were 3 linkage levels (3%) with a discrimination of less than .10, indicating that masters and nonmasters tend to perform similarly on items measuring these linkage levels. Table 5.3 presents, by linkage level, the three linkage levels with a discrimination of less than .10; the Precursor linkage level was the most prevalent.
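A minimal sketch of this flagging logic, using hypothetical conditional probabilities and illustrative function and level names:

```python
def discrimination(p_master, p_nonmaster):
    """Difference between masters' and nonmasters' probabilities of a
    correct response for a linkage level."""
    return p_master - p_nonmaster

def flag_low_discrimination(levels, threshold=0.10):
    """Return names of linkage levels where masters and nonmasters perform
    similarly (discrimination below the threshold).

    levels: dict mapping level name -> (p_master, p_nonmaster).
    """
    return [name for name, (pm, pn) in levels.items()
            if discrimination(pm, pn) < threshold]

# Hypothetical conditional probabilities for two linkage levels
flags = flag_low_discrimination({
    "EE1.Precursor": (0.62, 0.58),  # discrimination ~.04 -> flagged
    "EE2.Initial": (0.85, 0.30),    # discrimination ~.55 -> not flagged
})
```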

Figure 5.3: Difference Between Masters’ and Nonmasters’ Probability of Providing a Correct Response to Items Measuring Each Linkage Level for the 2023–2024 Administration Year

Table 5.3: The Number and Percentage of Linkage Levels With Low Discrimination for the 2023–2024 Administration Year
Linkage level   n (%)
Initial         0 (0.0)
Precursor       2 (66.7)
Target          1 (33.3)

5.3.4 Base Rate Probability of Class Membership

The base rate of class membership is the DCM structural parameter and represents the estimated proportion of students in each class for each EE and linkage level. A base rate close to .50 indicates that students assessed on a given linkage level are, a priori, equally likely to be a master or nonmaster. Conversely, a high or low base rate would indicate that students testing on a linkage level are, a priori, more or less likely to be masters, respectively. Figure 5.4 depicts the distribution of the base rate probabilities. Overall, the distribution is roughly normal, with 82% of linkage levels (n = 84) exhibiting a base rate of mastery between .25 and .75. This indicates that students are most likely to be assessed on linkage levels where they have an approximately equal likelihood of mastery. On the edges of the distribution, 11 linkage levels (11%) had a base rate of mastery less than .25, and 7 linkage levels (7%) had a base rate of mastery higher than .75. For the linkage levels that do not have an approximately equal likelihood of mastery, this suggests students are more likely to be assessed on linkage levels they have not mastered than those they have mastered.

Figure 5.4: Base Rate of Linkage Level Mastery for the 2023–2024 Administration Year

5.4 Conclusion

In summary, the DLM modeling approach uses well-established research in Bayesian inference networks and diagnostic classification modeling to determine student mastery of skills measured by the assessment. A DCM is estimated for each linkage level of each EE to determine the probability of student mastery. Items within the linkage level are assumed to be fungible and are estimated with equivalent item probability parameters for masters and nonmasters, owing to the conceptual approach used to construct DLM testlets. An analysis of the estimated models indicates that they generally demonstrate acceptable levels of absolute model fit and/or classification accuracy. Additionally, the estimated parameters from each DCM are generally within the optimal ranges. We use the results from model fit analyses, classification accuracy, and estimated parameters to continually improve on the DLM assessments (e.g., improve item writing practices, retire testlets contributing to misfit).