¹ Holon Institute of Technology · ² Afeka College of Engineering
Generative AI makes it trivial to obtain an answer and difficult to obtain understanding, and uncritical use is increasingly linked to cognitive offloading and weakened critical thinking. This exposes a mismatch between what education assesses and what students now do: our examinations still measure unaided performance, while the task graduates actually face, in classrooms and workplaces saturated with capable AI, is to produce good work with it, by framing an ill-defined task, judging the output, and steering the model toward something better. This ability to work with AI is neither taught nor assessed as a competency in its own right; where it is measured at all, it is folded into a single "prompting" score that cannot diagnose why a given learner's AI use succeeds or fails (whether they specified the task poorly, missed the flaws in the output, or failed to correct them). We treat it instead as a teachable, assessable competency that can be named, decomposed, and measured. We propose CoReasoning, a competency model that factors productive work with generative AI into three temporally and cognitively distinct, independently-assessable skills: Framing (transforming an ill-defined problem into a well-specified task before invoking AI), Judging (critically evaluating AI output for errors, gaps, unstated assumptions, and risk), and Steering (iteratively redirecting the AI toward a better solution across cycles). The central structural claim that distinguishes CoReasoning from existing AI- and prompt-literacy frameworks is the separation of the pre-generation skill (Framing) from the post-generation corrective skill (Steering), with Judging as the epistemic gate between them. We ground each skill in established theory (metacognitive monitoring and control; self-regulated learning; epistemic vigilance and critical thinking; productive struggle), and state five testable propositions about how the skills relate. 
We instantiate the model in CoReasoning Lab, an open learning platform that presents deliberately flawed AI output for students to judge and steer, and scores the three skills independently. In a feasibility demonstration over simulated learners (generated and graded by different models to avoid self-grading), the three skills dissociate: each skill's grade tracks its own manipulated competence and stays flat in the others, the cleanest pair (Framing and Judging) is uncorrelated, and the result replicates across grader backends. We make no learning-outcome claims: because the learners are simulated and the grades are not yet validated against human raters, this establishes that the constructs are separable and automatically measurable, not that students learn. We release the instrument, the controlled-generation harness, the data, and a prepared human-validation protocol.
Most AI-in-education tools optimize for the wrong variable. They shorten the path from question to answer, when the answer was never the point of education. The cognitive work of specifying a problem, evaluating a candidate solution, and improving it is where learning happens; when that work is delegated wholesale to a machine, the learner walks away with a correct artifact and an unchanged mind. The educational opportunity of generative AI is therefore not to deliver answers faster, but to make the reasoning around them visible, practiced, and assessable.
Recent evidence makes the stakes concrete. Frequent, uncritical AI use correlates with lower critical-thinking performance, an effect mediated by cognitive offloading and most pronounced in younger users (Gerlich, 2025). Controlled studies of AI-assisted writing report "metacognitive laziness," in which learners bypass the self-regulatory processes of diagnosing, evaluating, and revising (Fan et al., 2025), and neurophysiological work describes an accumulation of "cognitive debt" when an assistant carries the reasoning load (Kosmyna et al., 2025). At the same time, field studies of professional AI use show that the benefit of AI is sharply conditional on the user's skill in directing it: assistance helps inside a model's competence frontier and harms outside it (Dell'Acqua et al., 2023), and the most effective users adopt an iterative, critical "push-back-and-validate" mode rather than wholesale delegation (Randazzo et al., 2025). Meta-analytic evidence underscores the stakes: human-AI teams frequently underperform the better of the human or the AI alone (Vaccaro et al., 2024), which makes the human's skill in directing the system, not mere access to it, the decisive variable.
These findings share a structure. The difference between productive and counterproductive AI use is not access to AI; it is a competency that some learners exercise and others do not. This echoes the long-standing distinction between the effects obtained with a technology during use and the cognitive residue it leaves behind, which depends on the learner's mindful engagement in the partnership (Salomon et al., 1991). If that competency can be named and decomposed, it can be taught and assessed. Current AI-literacy assessment can often tell us that a learner's AI use is unproductive but not why: whether they specified the task badly, failed to detect the flaws in the output, or saw the flaws but could not correct them. These are different failures with different remedies, and a single "prompting" score conflates them.
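We propose CoReasoning, a competency model that decomposes productive work with generative AI into three distinct skills, each independently assessable: Framing (transforming an ill-defined problem into a well-specified task before invoking AI), Judging (critically evaluating AI output for errors, gaps, unstated assumptions, and risk), and Steering (iteratively redirecting the AI toward a better solution across cycles).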
Contributions. This paper makes the following contributions:
We make no learning-outcome claims; the empirical material is a feasibility demonstration of construct separability and measurability, not an efficacy study.
Existing instruments for assessing how students work with AI are not wrong so much as under-resolved. Knowledge-oriented frameworks such as Long and Magerko (2020) and the UNESCO (2024) student competency framework define AI literacy chiefly as understanding AI systems: what AI is, what it can and cannot do, how it works, and how it should be governed. These are necessary, but they say little about the performance of working with a model on a task. Prompt-literacy and generative-AI-literacy models (Lo, 2023; Annapureddy et al., 2024) move closer to performance, but they bundle task specification and iterative refinement into a single "prompting" competency and treat evaluation of the output as a downstream check rather than a co-equal skill.
The cost of this coarse resolution is diagnostic. When a learner's AI use produces a poor result, a single prompting score cannot tell an instructor which cognitive operation failed: did the learner specify the task badly, so that the model solved the wrong problem; did they fail to detect the flaws in an otherwise plausible output; or did they see the flaws but issue corrections too vague to fix them? These are three different failures with three different remedies (teaching problem specification, teaching critical evaluation, and teaching corrective communication), and an instrument that cannot separate them cannot guide instruction. Integrative reviews of AI literacy after generative AI confirm that the field still lacks a scheme isolating distinct, independently-assessable reasoning competencies (Gu & Ericson, 2025), even as syntheses document generative AI's mixed effects on critical and creative thinking (Li et al., 2026). A diagnostic competency model must therefore decompose the human-AI loop into the distinct operations that can each break, and must show that those operations are in fact separable in learners. That is the gap CoReasoning addresses.
The three skills are not an arbitrary list; they instantiate a well-understood cognitive architecture. Nelson and Narens (1990) describe metacognition as a two-level system: an object level (cognition itself) and a meta level (a dynamic model of the object level), linked by monitoring (information flowing from object to meta) and control (commands flowing from meta to object). In CoReasoning, the learner's meta level supervises an object level that is external, the AI's generative process, rather than the learner's own cognition. This is a deliberate extension of the Nelson-Narens architecture from intrapersonal monitoring to what we term exo-directed monitoring and control, and it is the framework's central theoretical move: the same metacognitive machinery is turned outward onto a fallible cognitive artifact. This extension is non-trivial. Exo-directed monitoring adds an epistemic-vigilance burden that self-monitoring does not have: the learner must judge the reliability of a source whose competence differs from, and is hidden from, their own. This is exactly why Judging is bounded by domain knowledge (Proposition P4). With that point made, the mapping is direct:
The Judge→Steer cycle is therefore a monitor→control loop seeded by a task definition. This is the structural spine of the framework and the reason the three skills cohere rather than merely coexist (Figure 1).
Figure 1. The CoReasoning loop. An upstream task definition (Framing, a self-regulated-learning forethought activity) sets the standards against which a metacognitive monitor-control cycle (Judging then Steering) supervises a fallible AI at the object level. Judging is the monitor's read-out; Steering is the controller's write.
Existing frameworks treat "use the AI well" as one skill or, at most, pair "prompt" with "evaluate." CoReasoning's distinctive move is to separate two operations that prior models fuse: the pre-generation skill of structuring the task (Framing) and the post-generation skill of correcting the output (Steering). These are different cognitive acts at different points in time, with different error modes and different instructional remedies. A learner can frame impeccably and steer poorly (detect a flaw but issue a vague correction), or steer fluently atop a malformed task (drive the AI energetically toward the wrong target). Collapsing both into "prompting" hides exactly the distinctions an educator needs.
Because Judging and Steering both occur after generation and in a tight loop, their boundary needs stating precisely. Judging is assessment; Steering is action. Judging produces a representation of what is wrong with the current output, an internal or articulated list of detected flaws, gaps, and risks, and a calibrated sense of how much to trust the output. It is evaluative and its output is a diagnosis. Steering consumes that diagnosis and produces a corrective instruction aimed at changing the next output: it is generative and directive, and its quality depends on prioritization (addressing the most critical flaw first), specificity (an actionable command rather than "improve this"), and effectiveness (whether the output actually converges). In the monitor-control terms of Section 3.1, Judging is the monitor's read-out and Steering is the controller's write. The two dissociate because the competencies differ: a learner may diagnose accurately yet communicate the fix poorly (good Judging, weak Steering), or issue fluent, confident commands that target the wrong thing because the diagnosis was wrong (weak Judging propagating into misdirected Steering, the failure mode Proposition P2 predicts). This last point also reconciles an apparent tension: P2 (Judging bounds Steering) implies the two grades will be positively correlated in the aggregate, while P3 claims they dissociate. Both hold. P2 is a ceiling relation (Steering cannot exceed the quality its Judging permits), which induces correlation without identity; P3 is the claim that the off-diagonal is well below the reliability ceiling, so the skills are not interchangeable. We therefore test dissociation pairwise and report whether each skill's grade responds to its own manipulated competence while remaining comparatively flat in the others, rather than relying on a single global factor model.
Each skill is anchored in established theory, and the anchors are mutually consistent because they share the metacognitive backbone of Section 3.1.
Framing is the task-definition phase of self-regulated learning. In the COPES model (Winne & Hadwin, 1998), self-regulated work begins by constructing a definition of the task and the standards a product must satisfy; Zimmerman's (2000) forethought phase similarly precedes performance with goal setting and strategic planning. Framing applies this phase to a human-AI loop: the learner converts an ill-defined situation into a specified task with explicit constraints and success criteria. In Bloom's revised taxonomy (Anderson & Krathwohl, 2001), specifying an original task is a Create-level activity, and the competence to know what makes a task tractable for a given tool is metacognitive task knowledge (Flavell, 1979).
Judging is metacognitive monitoring (Nelson & Narens, 1990) directed at an external generative source. Its content is the evaluative core of critical thinking, the Delphi-consensus skills of analysis and evaluation (Facione, 1990) and the Paul and Elder (2006) intellectual standards, which supply a ready vocabulary for assessing reasoning. Because the object being judged is a communicated knowledge claim from a fluent but fallible source, the most precise anchor is epistemic vigilance (Sperber et al., 2010), which pairs source monitoring (is this source trustworthy?) with content evaluation (is this internally and externally coherent?). Barzilai and Chinn's (2018) account of apt epistemic performance adds the criteria-for-good-knowledge dimension that a learner must hold to judge well, and the human-automation literature supplies the calibration target: reliance matched to actual reliability (Lee & See, 2004), the failure of which is the over-reliance documented in AI-assisted decision-making (Bansal et al., 2021; Buçinca et al., 2021) and the broader automation-complacency it extends (Parasuraman & Manzey, 2010). That Judging is a trainable skill rather than an automatic byproduct of competence is underscored by recent evidence that metacognitive monitoring can decouple from performance in human-AI reasoning (Fernandes et al., 2024), and by recent work on whether learners can evaluate AI output quality as experts do (Nazaretsky et al., 2025).
Steering is metacognitive control (Nelson & Narens, 1990): acting on the monitoring signal to change the object-level process. Pedagogically it inverts cognitive-apprenticeship coaching and scaffolding (Collins, Brown & Newman, 1989): the learner, not the master, supplies the corrective guidance. Its quality depends on the learner monitoring the work against held standards (Sadler, 1989), and is well described by feed-forward, the "where to next" component of effective feedback (Hattie & Timperley, 2007).
The loop and its positioning. The Judge-Steer cycle is designed to push learners into the Interactive mode of the ICAP framework (Chi & Wylie, 2014), in which knowledge is co-constructed through dialogue rather than passively received, the mode ICAP associates with the greatest learning. The framework casts the AI as a mediating cultural tool that extends the learner's zone of proximal development (Vygotsky, 1978): the learner accomplishes with the model what they could not alone, while internalizing the Framing, Judging, and Steering moves for eventual independent use.
The pedagogical stance. CoReasoning's rejection of speed-to-answer rests on the literature of productive struggle and desirable difficulties (Bjork & Bjork, 2011; Kapur, 2008): conditions that slow performance but deepen learning. The friction of specifying, evaluating, and correcting is not an obstacle to be engineered away but the very locus of learning, which is why the instructional design deliberately presents imperfect output for the learner to improve rather than a polished answer to accept.
These propositions differ in how far the present work tests them. The feasibility demonstration (Section 8) directly supports P3 (the three skills dissociate) and bears partly on P2 (the steering own-effect is bounded in a way consistent with judging gating steering). P1, P4, and P5 are stated here as falsifiable hypotheses for the validation and classroom agenda (Section 10), not as claims the demo establishes. Each names a concrete test: P1 fails if a strong steering intervention recovers grades after a deliberately malformed framing; P4 fails if Judging transfers across domains as readily as Framing; P5 fails if exercising the loop does not reduce the offloading signatures (for example, reduced post-task recall) documented in the cognitive-debt literature.
The constructs that compose CoReasoning are not individually new; what is new is their separation into three parallel, independently assessable competencies anchored in a monitor-control architecture. We make the boundary explicit by confronting the nearest priors directly.
AI-fluency frameworks. The closest practitioner framework is the 4D model of AI fluency (Dakan & Feller, 2025): Delegation, Description, Discernment, Diligence. Its Discernment maps cleanly onto Judging, but its Description bundles two operations we deliberately separate: crafting the initial specification and conducting the iterative back-and-forth. CoReasoning's contention is that these are distinct skills at distinct times, the pre-generation act of Framing and the post-generation act of Steering, with different error modes (a malformed task versus a mis-targeted correction) and different instructional remedies. CoReasoning also derives its skills from learning theory and an assessment rationale rather than from a fluency heuristic, and so yields rubrics and dissociation predictions that a checklist does not.
Metacognitive analyses of generative AI. The strongest construct-level neighbour is Tankelevitch et al. (2024), who analyse generative-AI use through the lens of metacognitive demands, naming prompting, output evaluation, and workflow iteration as sites of metacognitive monitoring and control. We share the metacognitive foundation but differ in goal: their contribution is a demands analysis (a cognitive-load lens that explains why GenAI is hard to use well), whereas ours is an assessable competency model with explicit rubrics, propositions, and feasibility evidence that the components dissociate. The two are complementary: their analysis motivates why each of our skills is cognitively demanding; our framework makes each one measurable.
Prompt-literacy frameworks. Prompt-literacy models such as CLEAR (Lo, 2023) and the broader prompt-literacy literature define competence as constructing a precise prompt and iteratively refining it. The most developed recent decomposition, an operationalization of prompt literacy into formulate, interpret, and refine sub-practices (Tour & Zadorozhnyy, 2025), is the closest competitor to our triad; but it bundles task specification and iterative refinement under "prompting" and does not treat the sub-practices as separately graded, dissociable competencies. This fusion of Framing and Steering into a single "prompting" skill is exactly the conflation CoReasoning rejects. Treating them separately is not a cosmetic relabeling: it predicts, and our feasibility demonstration supports, that a simulated learner's model-assigned Framing and Steering grades can diverge.
Metric frameworks for human-AI cognition. A distinct 2026 line proposes named multi-metric schemes for working with AI, for instance a cognitive-amplification-versus-delegation framework with dependency and drift metrics (Di Santi, 2026). These measure the sustainability of reliance rather than teachable framing, judging, and steering competencies, and so are complementary to, not competitive with, an assessable-skill decomposition.
Knowledge-oriented AI-literacy frameworks. Field-defining competency sets such as Long & Magerko (2020) and the UNESCO (2024) student framework define AI literacy primarily as understanding AI systems, their capabilities, limits, and ethics, rather than as executing tasks within a human-AI loop. They contain no Framing or Steering construct and only a diffuse notion of critical evaluation that partly overlaps Judging. CoReasoning is orthogonal: it specifies the task-execution competencies these frameworks leave implicit.
Empirical accounts of AI-use modes. Field studies describe how skilled users actually work with AI: the "Cyborg" mode of continuous push-back-and-validate (Randazzo et al., 2025) and the sharp skill-dependence of AI's value at the competence frontier (Dell'Acqua et al., 2023). These describe the behaviour; CoReasoning supplies the assessable skill decomposition that underlies it.
Table 1 makes the boundary explicit by mapping each nearest prior framework's constructs onto Framing, Judging, and Steering. The recurring pattern is that prior frameworks either (i) omit a construct, or (ii) fuse Framing and Steering into one "prompting/iteration" skill.
Table 1. Where prior frameworks place the three CoReasoning skills.
| Prior framework | Framing (pre-generation) | Judging | Steering (post-generation) |
|---|---|---|---|
| 4D AI Fluency (Dakan & Feller, 2025) | Delegation + part of Description | Discernment | fused into Description |
| Metacognitive demands (Tankelevitch et al., 2024) | "prompting" (as a demand, not a skill) | "evaluating outputs" | "workflow iteration" |
| Prompt literacy / CLEAR (Lo, 2023) | fused into "prompting" | weakly present | fused into "iterative refinement" |
| AI literacy (Long & Magerko, 2020; UNESCO, 2024) | absent | diffuse "critical evaluation" | absent |
| Cyborg/Centaur modes (Randazzo et al., 2025) | "directed" mode (described, not assessed) | "push back / validate" | "continuous dialogue" |
No prior framework in Table 1 cleanly separates the pre-generation and post-generation skills and treats all three as independently scored competencies. That conjunction is the contribution.
Validated GenAI-competency instruments. A parallel line of work builds psychometric instruments for AI and generative-AI competence, including validated scales such as GenAIComp (Lee et al., 2025; with factors derived from digital-competence frameworks) and assessment tests such as GLAT (Jin et al., 2024). These establish that GenAI competence can be measured, but their factor structures are literacy-oriented (information literacy, ethics, content creation) and do not isolate a Framing, Judging, or Steering construct. The work closest to ours pairs AI-collaboration literacy with metacognition (Sidra & Mason, 2025) and includes an AI-evaluation sub-construct that overlaps Judging; we differ in separating the pre-generation and post-generation control skills and in demonstrating their dissociation rather than positing correlated factors. CoReasoning is complementary to this measurement program: it supplies the specific, theory-derived decomposition that a future validated instrument could operationalize.
Problem formulation as the AI-era skill. A prominent strand argues that, as models absorb execution, the durable human skill shifts from prompt crafting to problem formulation: identifying, analyzing, and delineating the problem worth solving (Acar, 2023). This is precisely our Framing construct, and classroom work has begun to assess it directly, for example through "prompt problems" that require students to specify and evaluate rather than merely prompt (Denny et al., 2024), and through instruments that treat question formulation (Kim et al., 2025) and problem decomposition (Srinath et al., 2025) as independently measurable, trainable skills in generative-AI settings. That Framing is a distinct competency is further supported by the older problem-finding literature, which established problem finding as empirically separable from problem solving (Runco & Chand, 1995). We build on this strand by placing Framing in a measured loop with Judging and Steering.
One-line novelty. CoReasoning is, to our knowledge, the first theoretically grounded decomposition of productive generative-AI use into three independently assessable competencies that separates pre-generation Framing from post-generation Steering, with feasibility evidence that the three skills dissociate. The defensible claim is not that any single skill is new, but that the separation is both theoretically motivated (monitor-control plus an upstream task definition) and empirically consequential (the skills can be measured apart).
The framework was instantiated in a prototype web platform, CoReasoning Lab, which we describe here so that the abstract skills map onto a concrete learner experience. The platform is role-based (student, instructor, administrator) and bilingual (English and Hebrew), and supports both practice and assessment modes and both multiple-choice and open-ended response formats per phase. The reusable artifact we release and evaluate is the platform's scoring engine: a library of sixteen prompts plus a controlled-generation harness (Section 7.1). The interface figures in this paper (Figures A1 and A2) are representative mockups of the prototype, not screenshots of a running deployment. All quantitative results come from the released prompt engine, not from platform usage logs.
Authoring flow (instructor). An instructor defines a challenge by choosing a course and subject path; the system then generates the ill-defined problem, the three per-skill rubrics, the gold-standard framing, and the seeded-flaw solution that the learner will critique (Section 7.1). Challenges are organized into courses and can be assigned to cohorts.
Learner flow (student). From a dashboard of assigned challenges (Figure A2), a student enters a challenge run that walks through the framework's two phases (Figure A1):
This separation in the interface (distinct phases, distinct feedback channels, and distinct grade columns) is the framework's central claim made operational: a learner sees, and is scored on, three different things they did, not one undifferentiated "AI use." The remainder of this section describes the instrument that produces those scores.
To show that the three constructs are not only conceptually distinct but practically measurable, we describe a working instrument that scores each skill from a learner's transcript. The instrument is a pipeline of large-language-model prompts; we use it here as an existence proof that automated, rubric-driven scoring of Framing, Judging, and Steering is feasible, not as a validated assessment.
Challenge construction. Each challenge begins with a deliberately ill-defined problem generated to contain two or three unstated gaps, recorded internally but never shown. A per-challenge set of three rubrics, one each for Framing, Judging, and Steering, is generated for the subject area, each with three to five measurable criteria and explicit excellent and poor indicators. A gold-standard "best framing" is generated as an internal reference. The design instantiates an inverted cognitive apprenticeship: rather than observing an expert, the learner is given a fallible artifact to repair. This connects the instrument to the instructional literature on learning from errors and erroneous examples, in which studying and correcting flawed solutions improves error detection and conceptual understanding relative to studying only correct ones (Große & Renkl, 2007; Durkin & Rittle-Johnson, 2012); CoReasoning generalizes that paradigm from static worked examples to an interactive, learner-driven repair loop over AI output.
The deliberately imperfect output. After the learner frames the task, the model produces a plausible, professional-looking solution that is required to embed two to four non-trivial issues (wrong-but-reasonable assumptions, missing edge cases, or subtle logical errors), each recorded internally with a severity label and none flagged to the learner. Across steering cycles, updates address the learner's commands but may introduce new minor issues, so that difficulty adapts to the quality of steering rather than collapsing to a perfect answer.
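The challenge artifacts described above can be pictured as a single record with a few count constraints. The following is a minimal sketch only: the class and field names (`Challenge`, `Criterion`, `SeededIssue`, `validate`) are hypothetical and are not taken from the released engine, and the example values are invented for illustration. Only the counts (two or three unstated gaps, three rubrics of three to five criteria, two to four seeded issues) come from the text.

```python
from dataclasses import dataclass, field

SKILLS = ("framing", "judging", "steering")


@dataclass(frozen=True)
class Criterion:
    name: str
    excellent: str  # indicator of an excellent response
    poor: str       # indicator of a poor response


@dataclass(frozen=True)
class SeededIssue:
    description: str
    severity: str   # assumption: free-form label; the text only says "a severity label"


@dataclass
class Challenge:
    problem: str                          # the deliberately ill-defined problem
    hidden_gaps: list[str]                # recorded internally, never shown
    rubrics: dict[str, list[Criterion]]   # one rubric per skill
    gold_framing: str                     # internal reference framing
    seeded_issues: list[SeededIssue] = field(default_factory=list)

    def validate(self) -> list[str]:
        """Return the count constraints from the text that this record violates."""
        errors = []
        if not 2 <= len(self.hidden_gaps) <= 3:
            errors.append("expected two or three unstated gaps")
        if set(self.rubrics) != set(SKILLS):
            errors.append("expected one rubric per skill")
        for skill, criteria in self.rubrics.items():
            if not 3 <= len(criteria) <= 5:
                errors.append(f"{skill}: expected three to five criteria")
        # Seeded issues exist only once the flawed solution has been generated.
        if self.seeded_issues and not 2 <= len(self.seeded_issues) <= 4:
            errors.append("expected two to four seeded issues")
        return errors


def _rubric() -> list[Criterion]:
    return [Criterion(f"criterion {i}", "meets it fully", "ignores it") for i in range(3)]


# Invented example values, for illustration only.
example = Challenge(
    problem="Speed up the nightly report.",
    hidden_gaps=["no latency target", "data volume unknown"],
    rubrics={skill: _rubric() for skill in SKILLS},
    gold_framing="Reduce the runtime below a stated budget on a stated data volume.",
    seeded_issues=[
        SeededIssue("assumes the table is indexed", "major"),
        SeededIssue("ignores empty input", "minor"),
    ],
)
print(example.validate())  # an empty list means every count constraint holds
```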
Scoring. Each skill is scored in two stages. A skill-specific evaluator assesses the learner's response against the (internal) rubric and produces per-criterion ratings on a three-point scale; a generic grading stage then aggregates those ratings into a final grade, weighting critical criteria more heavily rather than averaging. Crucially, the three evaluators differ in what they compare against: Framing is evaluated against the gold framing and the rubric; Judging is evaluated against the seeded ground-truth issues, yielding a recall/precision signal (issues correctly identified, missed, and falsely flagged); and Steering is evaluated against the trajectory of the output across cycles, rewarding corrections that demonstrably move the solution toward correctness. This is why the skills are measured apart: each evaluator interrogates a different referent.
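Two pieces of this scoring logic can be made concrete. The sketch below is a deterministic stand-in written for illustration, not the instrument's code: in the instrument both stages are language-model prompts. It assumes that flagged and seeded issues have already been matched to shared identifiers, and the function names and the `critical_weight` of 2.0 are hypothetical choices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgingSignal:
    hits: int          # seeded issues the learner correctly identified
    misses: int        # seeded issues the learner did not flag
    false_alarms: int  # flagged issues that are not in the seeded set

    @property
    def recall(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def precision(self) -> float:
        flagged = self.hits + self.false_alarms
        return self.hits / flagged if flagged else 0.0


def judging_signal(seeded: set[str], flagged: set[str]) -> JudgingSignal:
    """Compare the learner's flagged issue ids with the seeded ground truth."""
    return JudgingSignal(
        hits=len(seeded & flagged),
        misses=len(seeded - flagged),
        false_alarms=len(flagged - seeded),
    )


def aggregate(ratings: dict[str, int], critical: set[str],
              critical_weight: float = 2.0) -> float:
    """Weighted mean of per-criterion ratings (1..3); critical criteria count more.

    Assumption: a fixed multiplicative weight. The text says only that critical
    criteria are weighted more heavily than a plain average would weight them.
    """
    if not ratings:
        raise ValueError("at least one criterion rating is required")
    weights = {c: (critical_weight if c in critical else 1.0) for c in ratings}
    return sum(ratings[c] * weights[c] for c in ratings) / sum(weights.values())


signal = judging_signal(seeded={"a", "b", "c"}, flagged={"a", "b", "x"})
print(signal.recall, signal.precision)

# A weak rating on a critical criterion pulls the score below the plain mean of 2.0.
print(aggregate({"clarity": 3, "correctness": 1}, critical={"correctness"}))
```

Keeping recall and precision separate, rather than collapsing them into one number, preserves the distinction between missed issues and falsely flagged ones that the text describes.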
The instrument used in this paper is the engine of a deployed prototype (CoReasoning Lab): a library of sixteen prompts spanning challenge construction, AI generation, and evaluation. Scoring a single learner exercises the relevant subset (the three skill evaluators and the generic grader) over controlled inputs, so that the measurements are reproducible and the ground truth is known. A methodological caveat applies throughout: the grader is itself a fallible language model, so the feasibility results below speak to the internal behavior of the instrument (does it separate controlled competence levels and dissociate the skills?), not to agreement with human experts, which is the separate validity question addressed by the prepared study in Section 10.
We exercise the instrument over controlled inputs to test three feasibility claims: that it discriminates competence, that the three skills dissociate (Proposition P3), and that the grader is self-consistent enough to report. We make no learning-outcome claims. The "learners" are simulated personas of controlled per-skill competence, generated by one model (gpt-4o-mini) and graded by a different model (gpt-4o), so that no model grades its own output. The validation against human expert graders is prepared but not yet run (Section 10).
Design. We use a crossed factorial: each of the three skills is independently set to a strong or weak competence level, giving $2^3 = 8$ profiles, crossed with ten challenges across ten distinct subjects (algorithms, microeconomics, machine learning, databases, statistics, operating systems, calculus, organic chemistry, linguistics, and corporate finance), for 80 simulated learners. Each learner is scored by seven grader calls (three skill evaluators, a grading pass per skill, and one steering-update call); grading inputs that are identical across profiles are cached and reused, so the number of calls actually issued is below the nominal 560 (80 learners × 7 calls). Framing and Steering responses are generated by the competence-conditioned learner model; Judging is operationalized by a competence-conditioned selection over the challenge's ground-truth seeded issues (a strong judge flags all real issues and no false ones; a weak judge flags few real issues and some false ones). These seeded issues and the distractors are themselves machine-generated and not yet human-verified, so Judging's by-construction own-effect is explicitly contingent on that ground truth being correct. Because the manipulation is per skill, the design can separate whether each grade tracks its own skill's competence (discrimination) from whether it is insensitive to the other skills' competence (dissociation).
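The design's arithmetic can be checked by enumerating the cells. This is a small sketch, assuming one challenge per subject as the text implies; the variable names are illustrative and are not those of the released harness.

```python
from itertools import product

SKILLS = ("framing", "judging", "steering")
LEVELS = ("weak", "strong")
SUBJECTS = (
    "algorithms", "microeconomics", "machine learning", "databases",
    "statistics", "operating systems", "calculus", "organic chemistry",
    "linguistics", "corporate finance",
)
CALLS_PER_LEARNER = 7  # 3 skill evaluators + 3 grading passes + 1 steering update

# Every combination of weak/strong across the three skills: 2^3 profiles.
profiles = [dict(zip(SKILLS, levels)) for levels in product(LEVELS, repeat=len(SKILLS))]

# Cross the profiles with the challenges (assumed: one challenge per subject).
learners = [{"subject": subject, **profile} for subject in SUBJECTS for profile in profiles]

# Upper bound on grader calls before caching of identical inputs.
nominal_calls = len(learners) * CALLS_PER_LEARNER

print(len(profiles), len(learners), nominal_calls)
```

Because every skill is strong in exactly half of the cells, each strong-versus-weak contrast compares two equally sized groups of learners.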
Discrimination. Grades move monotonically with competence: the all-weak profile averages C on every skill, the all-strong profile averages between B and A, and each skill's grade rises when that skill is set to strong. The judging signal is mechanistically transparent: a strong judge flags all of the seeded issues with no false alarms and is graded A; a weak judge flags none and raises false issues and is graded C.
Dissociation (the central result). Table 2 reports, for each graded skill, the change in its mean grade (on a 3-point scale, A = 3, B = 2, C = 1) when each skill in turn is moved from weak to strong. The diagonal (the effect of a skill's own competence on its own grade) averages +1.02; the off-diagonal (the effect of the other skills' competence) averages +0.01. Judging's own-effect (+2.00) is fixed by construction, since its competence is operationalized by a controlled selection over ground-truth issues; the decisive evidence is therefore the two blind-graded skills, Framing and Steering, whose free-text responses the grader scores without knowing the intended competence. Each shows a clear positive own-effect (+0.62 and +0.43), and five of the six cross-effects lie within ±0.12 of zero; the one exception, the effect of Judging competence on the Steering grade (+0.27), is in the direction of the gating relation that Proposition P2 predicts. Each grade therefore responds to its own skill and is essentially flat in the others (Figure 2).
Table 2. Effect on each skill's grade of manipulating each skill's competence (grade Δ, strong − weak; N=80).
| grade of ↓ \ manipulated → | Framing | Judging | Steering |
|---|---|---|---|
| Framing | +0.62 | −0.02 | −0.07 |
| Judging | +0.00 | +2.00 | +0.00 |
| Steering | −0.12 | +0.27 | +0.43 |
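
The diagonal and off-diagonal averages quoted for Table 2 follow from its nine cells. The short check below copies the values from the table; the function name `summarize` is illustrative only.

```python
# Rows: graded skill; columns: manipulated skill (values copied from Table 2).
SKILLS = ("Framing", "Judging", "Steering")
EFFECTS = [
    [0.62, -0.02, -0.07],
    [0.00, 2.00, 0.00],
    [-0.12, 0.27, 0.43],
]


def summarize(matrix):
    """Return (mean own-skill effect, mean cross-skill effect)."""
    n = len(matrix)
    diagonal = [matrix[i][i] for i in range(n)]
    off_diagonal = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    return sum(diagonal) / len(diagonal), sum(off_diagonal) / len(off_diagonal)


own, cross = summarize(EFFECTS)
print(f"{own:+.2f} {cross:+.2f}")  # +1.02 +0.01
```

The off-diagonal mean is small partly because negative and positive cells cancel; the largest single cross-effect is the +0.27 in the Steering row under manipulated Judging.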
The diagonal (own-skill) effects are statistically reliable: bootstrap 95% confidence intervals (2,000 resamples) exclude zero for Framing (+0.62, CI [+0.46, +0.77]) and Steering (+0.43, CI [+0.20, +0.65]); Judging is deterministic by construction (+2.00). The same separation appears in the inter-skill grade correlations across the 80 learners: Framing-Judging $\rho = -0.03$ ($p = 0.82$) and Framing-Steering $\rho = -0.12$ ($p = 0.29$) are both non-significant, while Judging-Steering $\rho = +0.25$ ($p = 0.02$) is positive and significant. The Framing-Judging pair is the decisive demonstration: these two skills are scored by entirely separate mechanisms (free-text framing evaluation versus issue-selection judging) yet their grades are uncorrelated, which a single general-ability account cannot produce. The Judging-Steering correlation behaves exactly as Proposition P2 predicts, the expected ceiling relation in which judging gates steering, and so is consistent with, rather than evidence against, separability. The first principal component accounts for 43% of the variance, but with only three indicators a formal factor model is under-identified, so we report this descriptively rather than as a confirmatory dimensionality test; the point is simply that the grades do not collapse onto a single dimension.
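A percentile bootstrap of the strong-minus-weak difference in mean grade can be sketched as follows. This is an illustration under assumptions, not the analysis script: it resamples learners independently within each condition (the resampling unit is not specified above), the function name `bootstrap_diff_ci` is hypothetical, and the grades in the usage example are made-up toy values, not study data.

```python
import random
from statistics import mean


def bootstrap_diff_ci(strong, weak, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(strong) - mean(weak).

    Assumption: learners are resampled with replacement, independently
    within the strong and weak conditions.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        s = rng.choices(strong, k=len(strong))
        w = rng.choices(weak, k=len(weak))
        diffs.append(mean(s) - mean(w))
    diffs.sort()
    lower = diffs[round((alpha / 2) * n_resamples)]
    upper = diffs[round((1 - alpha / 2) * n_resamples) - 1]
    return mean(strong) - mean(weak), lower, upper


# Toy grades on the A=3, B=2, C=1 scale (hypothetical, 40 learners per condition).
toy_strong = [3, 2, 3, 2, 2, 3, 2, 2, 3, 2] * 4
toy_weak = [2, 1, 2, 2, 1, 2, 2, 1, 2, 2] * 4

estimate, lower, upper = bootstrap_diff_ci(toy_strong, toy_weak)
print(f"effect {estimate:+.2f}, 95% CI [{lower:+.2f}, {upper:+.2f}]")
```

An interval that excludes zero, as reported for Framing and Steering, indicates that the own-skill effect is unlikely to be an artifact of which learners happened to be sampled.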
The own-skill effects are directionally consistent across subjects: broken down by the ten subject areas (eight learners each), Framing and Steering each show a positive own-competence effect in nine of ten subjects, with the two exceptions (Framing in statistics, Steering in microeconomics) reflecting the small per-subject sample rather than a sign reversal; Judging is fixed by construction in every subject. The designed-contrast personas make the separation concrete (Figure 3, which plots five profiles). Three are especially telling: a weak-framer / strong-judge learner scores Framing C but Judging A; a strong-framer / weak-judge learner inverts this to Framing B, Judging C; and a weak / weak / strong-steerer elevates only Steering. A single underlying "AI-use ability" cannot produce these crossed profiles. Converging evidence comes from outside our synthetic setting. In an intervention study, students' behavioral regulation of LLM use (reformulating queries, checking correctness) predicts effective use, whereas self-rated AI expertise does not (Clerc et al., 2026). The skill of working with AI is thus distinct from a general, self-assessed competence.

Figure 2. Effect of manipulating each skill's competence (columns) on each skill's grade (rows). The diagonal (own-skill effect) dominates; off-diagonal (cross-skill) effects are near zero.

Figure 3. Mean per-skill grade for five competence profiles. Judging reaches A only when judging is strong, regardless of framing or steering; each skill responds to its own competence.
Reliability. To check that the grades are not noise, we hold four learner transcripts fixed and re-run the evaluation-and-grading prompts five times each (bypassing the cache so each repeat is an independent sample at the production temperatures). The grader is 92% self-consistent overall: the mean grade-flip rate across repeats is 0.08, with Judging deterministic (0.00, by construction), Framing 0.15, and Steering 0.10, and a mean within-cell grade standard deviation near 0.2 on the three-point scale, indicating that disagreements are occasional one-level boundary jitter rather than unstable scoring. This addresses the documented self-inconsistency of LLM judges, though it measures precision (repeatability), not accuracy against humans (Section 10). Because Judging contributes a flip rate of zero by construction, the load-bearing figure is the blind-graded skills, whose flip rate is near 0.12.
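The aggregate figures follow from the per-skill flip rates, and the rate itself can be computed from repeated grades. The sketch below assumes one plausible definition, the share of repeats that differ from the modal grade for the same transcript; the exact definition is not stated above, and `flip_rate` and the example grades are hypothetical.

```python
from collections import Counter


def flip_rate(repeats_per_transcript):
    """Share of repeated grades that differ from each transcript's modal grade.

    Assumption: a "flip" is any repeat that departs from the most common
    grade assigned to the same transcript.
    """
    flips = total = 0
    for grades in repeats_per_transcript:
        modal_count = Counter(grades).most_common(1)[0][1]
        flips += len(grades) - modal_count
        total += len(grades)
    return flips / total


# Hypothetical example: four transcripts, five repeats each, three flips in twenty.
example = [
    ["B", "B", "B", "B", "A"],
    ["C", "C", "C", "C", "C"],
    ["B", "B", "A", "B", "B"],
    ["A", "A", "A", "A", "B"],
]
print(flip_rate(example))

# Aggregates recomputed from the per-skill rates reported in the text.
REPORTED = {"Framing": 0.15, "Judging": 0.00, "Steering": 0.10}
overall = sum(REPORTED.values()) / len(REPORTED)
blind = (REPORTED["Framing"] + REPORTED["Steering"]) / 2
print(f"{overall:.2f} {1 - overall:.0%} {blind:.3f}")  # 0.08 92% 0.125
```

With four transcripts and five repeats there are twenty grades per skill, so under this definition rates of 0.15 and 0.10 would correspond to three and two deviating grades respectively.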
Grader-backend robustness. All grades above come from a single grader model (gpt-4o), while the deployed prototype used a different engine (llama-3.3-70b). To probe whether the dissociation is an artifact of one grader, we re-graded an identical set of 40 transcripts (challenges and learner responses held fixed) with a different model, gpt-4o-mini. The dissociation replicates: own-skill effects again dominate cross-skill effects (diagonal +0.47 versus off-diagonal +0.07). Notably, under this weaker grader Judging's own-effect falls from +2.00 to +0.65, matching Framing's own-effect under the same grader (also +0.65). This indicates that the large +2.00 was specific to gpt-4o's strict adherence to the seeded-issue selection rather than a property of the construct; the separation holds, and in fact becomes more balanced across the three skills, under a second backend. (Steering's own-effect is smallest, +0.10 here, consistent with the Proposition-P2 ceiling relation discussed below rather than absence of the construct.)
Dependence on ground-truth scaffolding. To probe how much the instrument's discrimination relies on the seeded ground truth (the gold framing supplied to the framing evaluator and the known seeded issues supplied to the judging evaluator) versus the model's own judgment, we re-grade a 40-learner subset (five subjects) with that scaffolding removed, comparing against that subset's own baseline (Framing +0.60, Judging +2.00, Steering +0.50). The effect is sharply skill-specific. Framing discrimination is essentially unaffected (own-effect +0.75 without the gold reference, versus +0.60 with it): the rubric and the model's own judgment carry it. Judging discrimination, by contrast, falls sharply (+2.00 to +0.45): without the known issues the grader has no recall-precision anchor and no longer separates strong from weak judging. Steering falls in between (+0.50 to +0.25). Separability itself survives (the skills still dissociate, ratio 29), but the result clarifies what each grade measures: Framing and Steering are scored largely by rubric-guided model judgment, whereas Judging as instrumented here is close to a measurement of agreement with a known answer key. This both explains Judging's by-construction +2.00 and bounds its transfer to open-ended settings where no such key exists, an important caveat for any deployment that scores judging without seeded ground truth.
Scope and caveats. These results establish feasibility, not validity. (i) Judging's dissociation is partly built in: its competence is operationalized by a controlled selection over ground-truth issues, so its clean diagonal (+2.00, with zero cross-effects) is a controlled check that the grader correctly rewards recall and precision. Independent evidence of separation comes from the free-text, blind-graded skills, Framing and Steering, whose positive own-effects (+0.62, +0.43) with near-zero cross-effects carry the result. (ii) Both graders tested (gpt-4o and gpt-4o-mini) belong to a single model family; whether the dissociation replicates across grader backends from other providers, and whether the automated grades agree with human experts, are the validity questions deferred to Section 10. (iii) As noted above, with three indicators a factor model is under-identified, so the separability evidence rests on the manipulation-based effect matrix and the inter-skill correlations, not on a confirmatory dimensionality test. (iv) Steering shows the smallest own-effect (+0.43 at N=80). This is not a grader-leniency artifact: under a deliberately strict steering rubric (which mandates a C for vague or non-prioritized commands), the steering own-effect on the 40-learner subset does not increase (+0.40, versus +0.50 under the standard rubric, statistically indistinguishable), while the dissociation persists (ratio 27.5; the designed contrasts still decouple). The modest steering effect instead reflects the ceiling relation that Proposition P2 predicts (steering quality is bounded by the judging it follows, so a competence manipulation on steering alone has limited headroom) together with homogeneity in the simulated steering responses; both are best resolved with human steering data, which the planned validation study is designed to supply.
A mature account must confront four tensions rather than paper over them.
Offloading versus productive struggle. Evidence that AI use can depress critical thinking through cognitive offloading (Gerlich, 2025; Kosmyna et al., 2025) appears to threaten any framework that puts learners in close partnership with AI. The resolution is in the zone-of-proximal-development stance: CoReasoning treats the AI as a mediating tool whose purpose is the learner's eventual independence, and it offloads execution while deliberately retaining the cognitive work of specifying, evaluating, and correcting. The framework is, in this sense, the designed inverse of metacognitive laziness (Proposition P5).
The calibration trap. Judging is metacognitive monitoring, and monitoring is only useful when calibrated. A learner who lacks the domain knowledge to recognize an error cannot detect that error in AI output, however vigilant. Judging is therefore bounded by domain competence, and the framework should be read as describing a skill that develops with domain knowledge, not as a substitute for it (Proposition P4).
Interactive is not automatically productive. ICAP predicts that interactive engagement yields the most learning, but rapid AI dialogue can be voluminous and shallow, a sequence of re-rolls rather than reasoning. Steering counts as genuine metacognitive control only when corrections are knowledge-generating, which is why the instrument rewards targeted, convergent corrections rather than mere repetition.
Standards can regress. Judging and Steering presuppose that the learner holds standards adequate to evaluate the output. When the AI is more competent in the domain than the learner, the standards the learner applies may be inferior to the artifact under review, a reversal that classical formative-assessment theory does not anticipate and that bounds the framework's applicability at the expert frontier.
The feasibility demonstration shows that the constructs are separable and measurable; it does not establish that the automated grades match expert human judgment, nor that exercising the skills improves learning. We therefore specify the validation program the framework invites, organized as an argument-based validity case in the sense of Messick (1995) and Kane (2013): the present evidence supports the scoring and generalization inferences (the instrument scores consistently and the three constructs separate), while the extrapolation inference (that the scores reflect a human competency that transfers) and the implication inference (that the scores support instructional decisions) remain to be established by the studies below.
First, an instrument-validity study: a stratified sample of transcripts is re-graded by multiple blind human experts using the per-skill rubrics, and agreement with the automated grader is reported per skill as Cohen's $\kappa$, Fleiss' $\kappa$, and ordinal Krippendorff's $\alpha$, against the field-typical range of $\kappa \approx 0.3$ to $0.8$. This step is indispensable rather than a formality: recent work shows that LLM-as-judge agreement with human experts is moderate and task-dependent, sometimes falling to Fleiss' $\kappa$ near $0.1$ to $0.3$ on hard rubric judgments (Feng et al., 2025), so automated grades must be validated against humans rather than assumed reliable. This exposes a recursive calibration threat (who grades the grader?) that the framework must confront on two levels: the per-skill grades require human-rater agreement (above), and the Judging construct additionally rests on machine-seeded ground-truth issues, so a stratified subset of those seeded issues will be verified by human domain experts to confirm they are genuine, non-spurious flaws before the recall/precision signal can be trusted. We have prepared the agreement study (codebook, blinded rater task files, and scoring scripts) so that it can be run directly. Second, a construct-validity study at scale will test Propositions P1 through P4 with real learners, examining whether Framing, Judging, and Steering dissociate across a learner population and whether the proposed gating relations hold. Third, a grader-robustness study across multiple model backends will separate the framework's signal from any single model's idiosyncrasies. Only an efficacy study with real learners can test Proposition P5 and any learning claim; that is explicitly outside the present scope.
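For two raters, such as the automated grader and one human expert, Cohen's $\kappa$ corrects the observed agreement $p_o$ for the agreement $p_e$ expected from the raters' marginal grade frequencies: $\kappa = (p_o - p_e)/(1 - p_e)$. The following is a minimal sketch of that computation, not the prepared scoring scripts; the function name and the example grades are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same grade by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal grade frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[g] * count_b[g] for g in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical grade throughout
    return (observed - expected) / (1 - expected)


# Hypothetical grades for eight transcripts on one skill.
automated = ["A", "B", "C", "B", "A", "C", "B", "B"]
human = ["A", "B", "C", "C", "A", "C", "B", "A"]
print(round(cohens_kappa(automated, human), 3))  # 0.636
```

Unweighted $\kappa$ treats an A-versus-C disagreement the same as A-versus-B; because the grades are ordered, an ordinal coefficient such as the ordinal Krippendorff's $\alpha$ named above also takes the size of each disagreement into account.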
This is a conceptual contribution accompanied by a proof-of-concept instrument, not an efficacy study. The feasibility demonstration uses simulated learners of controlled competence, which establish that the grader has signal and that the skills can be measured apart, but which are not human learners. To avoid a model grading its own output, learners are generated by one model (gpt-4o-mini) and graded by a different one (gpt-4o); the dissociation replicates under a second grader backend, but both are from one provider, and the deployed prototype's engine (llama-3.3-70b) differs again, so cross-vendor robustness is only partly established. The challenges are English-language, and the automated grades have not yet been validated against human experts (Section 10). Two construct-specific limits deserve emphasis. First, Judging as instrumented here measures agreement with a seeded answer key: that key drives its discrimination (Section 8), so the judging score applies to settings with known ground truth and would need re-validation for open-ended use without it. Second, Steering yields the smallest and most P2-bounded signal (it is gated by the judging that precedes it), and is the construct whose measurement would benefit most from human data. We make no learning-outcome claim. What the paper establishes is conceptual: a theoretically-grounded decomposition with stated propositions, a precise novelty boundary, and evidence that the three constructs are separable and automatically measurable.
The task of education in the age of generative AI is not to produce faster answer-getters but to cultivate critical collaborators: learners who can specify a problem worth solving, judge what a machine returns, and steer it toward something better. That capability is teachable only if it can be named and assessed. CoReasoning offers a decomposition of it into three theoretically-grounded, independently-assessable skills, Framing, Judging, and Steering, separates the pre-generation skill from the post-generation one in a way prior frameworks do not, and shows that the three can be measured apart. We offer it as a foundation for the assessment and instruction the moment demands.
Figure A1 shows a single challenge run in CoReasoning Lab, illustrating how the three skills appear as distinct, separately-scored stages of one continuous task. In Phase 1 the learner refines an ill-defined problem and receives a Framing grade. The system then produces a deliberately flawed solution. In Phase 2 the learner judges that output (flagging real issues while avoiding distractors) and steers the AI with a targeted correction; the output converges across cycles. The final report returns three independent grades with per-skill diagnostic feedback, the interface-level expression of the framework's central claim that productive AI use is not one skill but three.
Figure A1. The learner experience: Framing (Phase 1), the Judge-Steer cycle (Phase 2), and the per-skill report. Each skill is scored and given feedback independently.
The platform exposes role-specific interfaces beyond the challenge run (Figure A2). Students see a dashboard of pending challenges and courses and a personal results view that trends Framing, Judging, and Steering separately over time. Instructors author challenges (the system auto-generates the ill-defined problem and the three per-skill rubrics) and read course analytics that break grade distributions down by rubric and by student. In every view the three skills remain distinct columns, which is the design commitment the framework makes visible.
Figure A2. Role interfaces. Top: the student dashboard and the instructor challenge-authoring screen. Bottom: the student trend report (per-skill grades over time) and the instructor course analytics (grade distribution by rubric and per-student results). All views report Framing, Judging, and Steering independently.
Acar, O. A. (2023). AI Prompt Engineering Isn't the Future. Harvard Business Review, June 6.
Anderson, L. W., & Krathwohl, D. R. (2001). A Taxonomy for Learning, Teaching, and Assessing: A Revision of Bloom's Taxonomy of Educational Objectives. Longman.
Annapureddy, R., Fornaroli, A., & Gatica-Perez, D. (2024). Generative AI Literacy: Twelve Defining Competencies. Digital Government: Research and Practice.
Bansal, G., Wu, T., Zhou, J., Fok, R., Nushi, B., Kamar, E., Ribeiro, M. T., & Weld, D. S. (2021). Does the Whole Exceed Its Parts? The Effect of AI Explanations on Complementary Team Performance. Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems.
Barzilai, S., & Chinn, C. A. (2018). On the Goals of Epistemic Education: Promoting Apt Epistemic Performance. Journal of the Learning Sciences, 27(3), 353–389.
Bjork, E. L., & Bjork, R. A. (2011). Making Things Hard on Yourself, but in a Good Way: Creating Desirable Difficulties to Enhance Learning. Psychology and the Real World, 56–64.
Buçinca, Z., Malaya, M. B., & Gajos, K. Z. (2021). To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-Assisted Decision-Making. Proceedings of the ACM on Human-Computer Interaction, 5(CSCW1), 1–21.
Chi, M. T. H., & Wylie, R. (2014). The ICAP Framework: Linking Cognitive Engagement to Active Learning Outcomes. Educational Psychologist, 49(4), 219–243.
Clerc, O., Abdelghani, R., Desvaux, C., Poisson, E., Oudeyer, P., & Sauzéon, H. (2026). Teaching Students to Question the Machine: An AI Literacy Intervention Improves Students' Regulation of LLM Use in a Science Task. arXiv preprint arXiv:2604.01955.
Collins, A., Brown, J. S., & Newman, S. E. (1989). Cognitive Apprenticeship: Teaching the Crafts of Reading, Writing, and Mathematics. Knowing, Learning, and Instruction: Essays in Honor of Robert Glaser, 453–494.
Dakan, R., & Feller, J. (2025). Framework for AI Fluency. Anthropic.
Dell'Acqua, F., McFowland III, E., Mollick, E., Lifshitz-Assaf, H., Kellogg, K., Rajendran, S., Krayer, L., Candelon, F., & Lakhani, K. R. (2023). Navigating the Jagged Technological Frontier. Harvard Business School Working Paper 24-013.
Denny, P., Leinonen, J., Prather, J., Luxton-Reilly, A., Amarouche, T., Becker, B. A., & Reeves, B. N. (2024). Prompt Problems: A New Programming Exercise for the Generative AI Era. Proceedings of the 55th ACM Technical Symposium on Computer Science Education (SIGCSE).
Di Santi, E. (2026). Cognitive Amplification vs Cognitive Delegation in Human-AI Systems: A Metric Framework. arXiv preprint arXiv:2603.18677.
Durkin, K., & Rittle-Johnson, B. (2012). The Effectiveness of Using Incorrect Examples to Support Learning about Decimal Magnitude. Learning and Instruction, 22(3), 206–214. https://doi.org/10.1016/j.learninstruc.2011.11.001
Facione, P. A. (1990). Critical Thinking: A Statement of Expert Consensus for Purposes of Educational Assessment and Instruction (The Delphi Report). American Philosophical Association.
Fan, Y., Tang, L., Le, H., Shen, K., Tan, S., Zhao, Y., Shen, Y., Li, X., & Gašević, D. (2025). Beware of Metacognitive Laziness: Effects of Generative Artificial Intelligence on Learning Motivation, Processes, and Performance. British Journal of Educational Technology, 56(2), 489–530. https://doi.org/10.1111/bjet.13544
Feng, Y., Wang, S., Cheng, Z., Wan, Y., & Chen, D. (2025). Are We on the Right Way to Assessing LLM-as-a-Judge? arXiv preprint arXiv:2512.16041.
Fernandes, D., Villa, S., Nicholls, S., Haavisto, O., Buschek, D., Schmidt, A., Kosch, T., Shen, C., & Welsch, R. (2024). Performance and Metacognition Disconnect when Reasoning in Human-AI Interaction. arXiv preprint arXiv:2409.16708.
Flavell, J. H. (1979). Metacognition and Cognitive Monitoring: A New Area of Cognitive-Developmental Inquiry. American Psychologist, 34(10), 906–911.
Gerlich, M. (2025). AI Tools in Society: Impacts on Cognitive Offloading and the Future of Critical Thinking. Societies, 15(1), 6.
Gilson, L. L., & Goldberg, C. B. (2015). Editors' Comment: So, What Is a Conceptual Paper? Group & Organization Management, 40(2), 127–130. https://doi.org/10.1177/1059601115576425
Große, C. S., & Renkl, A. (2007). Finding and Fixing Errors in Worked Examples: Can This Foster Learning Outcomes? Learning and Instruction, 17(6), 612–634. https://doi.org/10.1016/j.learninstruc.2007.09.008
Gu, X., & Ericson, B. J. (2025). AI Literacy in K-12 and Higher Education in the Wake of Generative AI: An Integrative Review. Proceedings of the 2025 ACM Conference on International Computing Education Research (ICER), 125–140. https://doi.org/10.1145/3702652.3744217
Hattie, J., & Timperley, H. (2007). The Power of Feedback. Review of Educational Research, 77(1), 81–112.
Jaakkola, E. (2020). Designing Conceptual Articles: Four Approaches. AMS Review, 10, 18–26.
Jin, Y., Martinez-Maldonado, R., Gašević, D., & Yan, L. (2024). GLAT: The Generative AI Literacy Assessment Test. arXiv preprint arXiv:2411.00283.
Kane, M. T. (2013). Validating the Interpretations and Uses of Test Scores. Journal of Educational Measurement, 50(1), 1–73. https://doi.org/10.1111/jedm.12000
Kapur, M. (2008). Productive Failure. Cognition and Instruction, 26(3), 379–424.
Kim, P., Wang, W., & Bonk, C. J. (2025). Generative AI as a Coach to Help Students Enhance Proficiency in Question Formulation. Journal of Educational Computing Research, 63(3), 565–586. https://doi.org/10.1177/07356331251314222
Kosmyna, N., Hauptmann, E., Yuan, Y. T., Situ, J., Liao, X., Beresnitzky, A. V., Braunstein, I., & Maes, P. (2025). Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task. arXiv preprint arXiv:2506.08872.
Lee, J. D., & See, K. A. (2004). Trust in Automation: Designing for Appropriate Reliance. Human Factors, 46(1), 50–80.
Lee, S. C., Baby, T., Vongvit, R., Lee, J., Kim, Y., Min, C., & Yoon, S. H. (2025). Development and Validation of a Generative AI Competence Scale. Technology in Society. https://doi.org/10.1016/j.techsoc.2025.103059
Li, C., Cui, H., & Hagedorn, L. S. (2026). The Cognitive Impact of ChatGPT in Higher Education: A Systematic Review of Critical and Creative Thinking Outcomes. Computers and Education: Artificial Intelligence, 10, 100571. https://doi.org/10.1016/j.caeai.2026.100571
Lo, L. S. (2023). The CLEAR Path: A Framework for Enhancing Information Literacy through Prompt Engineering. The Journal of Academic Librarianship, 49(4).
Long, D., & Magerko, B. (2020). What Is AI Literacy? Competencies and Design Considerations. Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–16.
Messick, S. (1995). Validity of Psychological Assessment: Validation of Inferences from Persons' Responses and Performances as Scientific Inquiry into Score Meaning. American Psychologist, 50(9), 741–749. https://doi.org/10.1037/0003-066X.50.9.741
Nazaretsky, T., Gabbay, H., & Käser, T. (2025). Can Students Judge Like Experts? A Large-Scale Study on AI and Human Personalized Formative Feedback. Computers and Education: Artificial Intelligence. https://doi.org/10.1016/j.caeai.2025.100533
Nelson, T. O., & Narens, L. (1990). Metamemory: A Theoretical Framework and New Findings. The Psychology of Learning and Motivation, 26, 125–173.
Parasuraman, R., & Manzey, D. H. (2010). Complacency and Bias in Human Use of Automation: An Attentional Integration. Human Factors, 52(3), 381–410.
Paul, R., & Elder, L. (2006). The Miniature Guide to Critical Thinking: Concepts and Tools. Foundation for Critical Thinking.
Randazzo, B., Lifshitz-Assaf, H., Kellogg, K., Dell'Acqua, F., Mollick, E., Candelon, F., & Lakhani, K. R. (2025). Cyborgs, Centaurs and Self-Automators: Modes of Human-AI Collaboration in Knowledge Work. Harvard Business School Working Paper 26-036.
Runco, M. A., & Chand, I. (1995). Cognition and Creativity. Educational Psychology Review, 7(3), 243–267.
Sadler, D. R. (1989). Formative Assessment and the Design of Instructional Systems. Instructional Science, 18(2), 119–144.
Salomon, G., Perkins, D. N., & Globerson, T. (1991). Partners in Cognition: Extending Human Intelligence with Intelligent Technologies. Educational Researcher, 20(3), 2–9.
Sidra, S., & Mason, C. (2025). Generative AI in Human-AI Collaboration: Validation of the Collaborative AI Literacy and Collaborative AI Metacognition Scales. International Journal of Human-Computer Interaction. https://doi.org/10.1080/10447318.2025.2543997
Sperber, D., Clément, F., Heintz, C., Mascaro, O., Mercier, H., Origgi, G., & Wilson, D. (2010). Epistemic Vigilance. Mind & Language, 25(4), 359–393.
Srinath, S., Vadaparty, A., Smith IV, D. H., Porter, L., & Zingaro, D. (2025). Assessing Problem Decomposition in CS1 for the GenAI Era. arXiv preprint arXiv:2511.05764.
Tankelevitch, L., Kewenig, V., Simkute, A., Scott, A. E., Sarkar, A., Sellen, A., & Rintel, S. (2024). The Metacognitive Demands and Opportunities of Generative AI. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems.
Tour, E., & Zadorozhnyy, A. (2025). Conceptualizing and Operationalizing Prompt Literacy for English Language Learners. Journal of Adolescent & Adult Literacy. https://doi.org/10.1002/jaal.70020
UNESCO (2024). AI Competency Framework for Students. UNESCO.
Vaccaro, M., Almaatouq, A., & Malone, T. (2024). When Combinations of Humans and AI Are Useful: A Systematic Review and Meta-Analysis. Nature Human Behaviour, 8, 2293–2303.
Vygotsky, L. S. (1978). Mind in Society: The Development of Higher Psychological Processes. Harvard University Press.
Winne, P. H., & Hadwin, A. F. (1998). Studying as Self-Regulated Learning. Metacognition in Educational Theory and Practice, 277–304.
Zimmerman, B. J. (2000). Attaining Self-Regulation: A Social Cognitive Perspective. Handbook of Self-Regulation, 13–39.