https://doi.org/10.4438/1988-592X-RE-2025-411-730
Tolga Demir
Republic of Türkiye Ministry of National Education
https://orcid.org/0000-0002-1812-3397
Sait Çüm
Dokuz Eylül University
https://orcid.org/0000-0002-0428-5088
This study investigates whether narrative texts can be accurately and stably scored over time, and whether effective formative feedback can subsequently be provided for these texts, through human-AI collaboration. To this end, two models were employed: the default version of ChatGPT and the Text Assessment Tool (TAT), a GPT model specifically trained for this research purpose through a six-step process. A total of 114 narrative texts were scored three times against the criteria in a rubric by both the specially trained and the default models. The agreement of the scores given by TAT and by default ChatGPT with the actual scores, as well as the stability of these scores over time, was examined. The results indicated that, in contrast to the performance of default ChatGPT, TAT's scores demonstrated high levels of agreement with the actual scores and maintained stability over time across all rubric categories, consistently surpassing the reliability threshold and frequently indicating high reliability. Additionally, the feedback provided by TAT met the criteria for effective feedback at a success rate exceeding 83% in every category. The statistical evidence presented in this study underscores that large language models, when specifically trained, can perform very well in scoring texts with a rubric and providing feedback. This is particularly promising for achieving fairer education, especially in large classes and in situations where evaluators are overburdened.
educational assessment, human-AI collaboration, GPT training
Este estudio analiza si los textos narrativos pueden ser evaluados con precisión, si es posible mantener calificaciones estables a lo largo del tiempo, y si puede proporcionarse retroalimentación formativa efectiva para estos textos gracias a la colaboración humano-IA. Para ello, se utilizaron dos modelos: la versión estándar de ChatGPT y la Herramienta de Evaluación de Textos (TAT), un modelo GPT específicamente entrenado mediante un proceso de seis pasos diseñado para esta investigación. Se evaluaron 114 textos narrativos en tres ocasiones según los criterios establecidos en una rúbrica, utilizando tanto el modelo estándar como el modelo especialmente entrenado. Se analizaron los niveles de concordancia entre las calificaciones otorgadas por TAT y por ChatGPT estándar con respecto a las calificaciones reales, así como la estabilidad de estas calificaciones a lo largo del tiempo. Los resultados mostraron que, en comparación con el desempeño del ChatGPT estándar, las calificaciones de TAT presentaron altos niveles de concordancia con las calificaciones reales y mantuvieron su estabilidad a lo largo del tiempo en todas las categorías de la rúbrica, superando de forma constante el umbral mínimo e indicando con frecuencia una alta fiabilidad. Además, se observó que la mayor parte de la retroalimentación generada por TAT cumplía con los criterios de retroalimentación efectiva. La evidencia estadística presentada en este estudio demuestra que los modelos de lenguaje a gran escala, cuando son entrenados específicamente, pueden desempeñarse de manera excelente tanto en la evaluación de textos mediante una rúbrica como en la provisión de retroalimentación formativa. Esto es particularmente alentador para lograr una educación más equitativa, en particular en aulas numerosas y en contextos donde los evaluadores se encuentran sobrecargados.
"To shorten our path, we needed a horse. We found a wild one, untamed and strong. We had to tame it, for a wild horse would not serve us. This paper is the story of that taming."
Sait Çüm & Tolga Demir
In educational assessment, question types such as multiple-choice, sentence completion, matching, and true-false are frequently employed in both classroom settings and large-scale examinations, particularly for summative assessment purposes, due to their capacity for objective scoring. For formative assessments, however, which aim to identify and address students' learning gaps, monitor their development, and enhance instructional processes, it is essential to use open-ended questions, as well as oral and product-based or process-oriented performances that provide more detailed data to the educator. Such assessment approaches enable a clearer identification of students' learning deficiencies and misconceptions, while also facilitating the measurement of higher-order cognitive skills from a taxonomic perspective. Despite these advantages, the time-consuming process of reading, scoring, and providing feedback on these assessments, especially in large classrooms, often leads to their underutilization by teachers.
Recent revolutionary advancements in artificial intelligence (AI) technology have spurred discussions about the future role and significance of AI in our lives. It is now evident that humans are no longer the sole intelligent actors on our planet, making human-AI collaboration inevitable in contemporary organizations (Kolbjørnsrud, 2024). It is not difficult to predict that human-AI collaboration will continue to reduce costs related to time and labor in various fields.
Artificial intelligence (AI) encompasses computerized systems that perform tasks and respond in ways typically associated with human intelligence, such as learning, problem-solving, and goal achievement under uncertain and varying conditions. AI has achieved remarkable progress from early problem-solving in the 1950s to the simulation of human reasoning in the 1960s, from initial mapping projects in the 1970s to the advent of intelligent assistants in the 2000s (Dalton, 2024; Fell Kurban & Şahin, 2024). Within this vast domain, generative AI stands out as a specialized subset focused on creating new content that mimics existing data. Up to the present day, the fields of machine learning and artificial neural networks have significantly advanced, enabling the development of sophisticated generative architectures and deep learning algorithms. Notable examples include generative adversarial networks (GANs), variational autoencoders (VAEs), and transformer-based models (Alto, 2023; Chan & Colloton, 2024; Johannesson, 2024), which are integral to the progress and applications of generative AI.
The release of ChatGPT, a generative AI model, by OpenAI in late 2022, made a significant global impact, garnering widespread attention. This development acted as a driving force, encouraging numerous major technology companies to enter the competitive field of generative AI models (Holmes & Miao, 2023). While strong competitors such as Gemini, DeepSeek and Llama have emerged, ChatGPT continues to maintain a slight lead in terms of popularity and widespread user adoption.
ChatGPT, a pre-trained large language model (LLM), utilizes a transformer-based language architecture, a type of deep neural network highly effective for natural language processing (NLP) tasks. It can understand and generate human-like text based on the input it receives. Trained on a vast amount of data, ChatGPT has learned the patterns, styles, and complexities of human language, making it an exceptional tool for communication. Its capabilities have transformed education by offering dynamic human-like conversations, providing instant information, personalized recommendations, and continuous academic support (Chan & Colloton, 2024; Fell Kurban & Şahin, 2024).
The popularity of LLMs such as ChatGPT among both teachers and students necessitates research into the alignment of their capabilities and outputs with expectations or defined objectives. We find it crucial to explore the potential of LLMs in facilitating and supporting tasks that are exhausting and time-consuming for teachers, rather than merely assisting with superficial or straightforward tasks. This is particularly important for ensuring the continuity of critical educational processes, such as formative assessment, in large classrooms or periods of high teacher workload, thus sustaining the quality of education.
When the literature on the integration of LLMs into educational practices, particularly in writing skills and assessment, was reviewed, two studies were identified that highlighted AI's effectiveness in generating reading materials (Fitria, 2023; Xiao et al., 2023). Additionally, three studies aligned with our objectives examined AI's ability to provide feedback or score student essays (Steiss et al., 2024; Wang, 2022; Yavuz et al., 2024). However, a rubric-based training process specifically designed for assessing narrative writing skills was not employed in any of these studies.
On the other hand, the use of Large Language Models (LLMs) in education may produce hallucinatory information, leading to accuracy and reliability issues that can negatively affect student learning and critical thinking skills (Elsayed, 2024). Some studies have shown that LLMs, particularly in feedback processes, may fail to fully comprehend student work and at times provide feedback that is either off-topic or superficial (Venter et al., 2024; Jia et al., 2024). This situation may affect the trust that students and teachers place in AI-generated feedback and could lead to more cautious or even skeptical attitudes toward such outputs (Ziqi, 2024). For precisely these reasons, training a specialized, rubric-based model for a specific purpose within the scope of this study is considered important, as it has the potential to overcome some of the challenges that may arise.
This study aims to determine whether narrative texts can be accurately scored and whether effective formative feedback can be provided through human-AI collaboration. Additionally, the study compares the scoring accuracy and stability over time of a GPT model, the Text Assessment Tool (TAT), which was trained using a many-shot iterative prompting approach, with those of the default ChatGPT.
A significant challenge in this study is the inherent subjectivity in scoring narrative texts, even with a rubric. The study suggests that human-AI collaboration can improve objectivity and stability in scoring. For example, determining what constitutes an "engaging title" involves personal judgment, which AI also struggles with. Instead of simplifying the rubric to minimize subjectivity, this research aims to show how human-AI collaboration can develop reliable solutions in contexts requiring subjective evaluation. If successful, this approach could lead to fairer outcomes, even in large-scale assessments or recruitment processes.
The study also investigates the potential of AI collaboration in providing feedback within formative assessment processes to support student development. It posits that such collaboration can reduce teachers' workload in providing feedback on students' work. The effective feedback criteria used to measure the effectiveness of the feedback (Brookhart, 2008; Burke & Pieterick, 2010; Irons, 2008; Juwah et al., 2004) are detailed in Appendix I. Although the process of effective feedback can involve dialogue and face-to-face interaction, this study focuses exclusively on written feedback due to the nature of the materials used.
The following hypotheses guide the research process and analyses:
H1: The trained large language model will produce more accurate scores for narrative texts compared to the default ChatGPT.
H2: The trained large language model will demonstrate greater stability and reproducibility in scoring narrative texts over time compared to the default ChatGPT.
H3: The trained large language model will provide more effective written feedback that enhances students' narrative writing skills compared to the default ChatGPT.
The methodology of this study comprises two main phases. The first phase involves training GPT specifically for the research objective, while the second phase evaluates the effectiveness of the trained model by comparing it to the default ChatGPT.
For both phases, a rubric from the Turkish Ministry of National Education's 2024 curriculum was used, encompassing eight categories: "page structure," "title," "text structure," "character," "setting," "plot," "language and style," and "spelling and punctuation." Each category is assessed at three levels (details in Appendix II). TAT was trained using OpenAI's GPT Builder application to score texts and provide feedback based on this rubric.
In this study, human evaluators were not used as benchmarks for AI scoring accuracy due to the risk of their evaluation errors introducing additional bias. Instead, the narrative texts in the dataset were created by researchers according to the rubric, with intentional omissions or errors. To ensure the accuracy of the dataset's intended design, a reliability study was conducted with other experts on a subset of the texts, with the results detailed in this section. Following the reliability study, the pre-determined scores of the texts, referred to as "actual scores," served as the gold standard for comparisons.
TAT was subjected to a thorough six-step training process, summarized below, using GPT Builder; examples of the prompts used in each step are provided in Appendix III.
The GPT model is assigned a specific role. The necessary files for this role (preferably in PDF format) are uploaded to the system, and their comprehensibility is verified. In the context of this study, these files contain the criteria specified in the rubric as well as the criteria for effective feedback.
A question-and-answer session is conducted with GPT about the evaluation criteria. The session aims to determine how well GPT comprehends the criteria and to identify potential issues it might encounter during the evaluation. The prompts given in response to the answers help to clarify how the evaluation criteria are to be applied.
Examples (training data) are provided to GPT for many-shot iterative prompting. For scoring tasks, example sentences are presented for texts that could receive 1, 2, or 3 points. For feedback prompts, examples of sentences illustrating effective feedback are provided.
Unlike earlier steps that focus on parts, this step aims to see the whole. To elaborate, while previous steps focus on specific criteria within a rubric or a particular aspect of effective feedback, this step observes how GPT scores an entire text and provides comprehensive feedback.
During the training process, the phases of Example Analysis, Upload Sample Files, and Structured Practice can be iteratively repeated to ensure more accurate responses. In cases where the desired outcomes are not achieved, the process is repeated with new examples and the structured texts are re-evaluated for problematic areas.
In this step, the researcher verifies the topics agreed upon with GPT up to this point. Final adjustments are made to the instructions if necessary, and the uploaded source files are confirmed.
FIGURE I. GPT Training process for rubric-based assessment using GPT Builder

Figure I illustrates the GPT training process, which is applicable to similar tasks.
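For readers who want to reproduce a comparable setup programmatically rather than through GPT Builder, the sketch below shows how the core idea of the first training step, assigning the model an evaluator role grounded in the rubric and submitting texts one at a time, could be expressed with the OpenAI Python client. This is an illustrative assumption, not the authors' procedure: TAT itself was configured with the no-code GPT Builder, and the model name, abridged rubric text, and function name here are placeholders.

```python
# Illustrative sketch only; the study used OpenAI's no-code GPT Builder, not the API.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_ROLE = (
    "You are a language teacher. Evaluate the narrative text you receive using the "
    "rubric below, assign 1-3 points per category, and provide effective, constructive, "
    "task-specific feedback for each category.\n\n"
    "RUBRIC (excerpt): Title - 1: no title; 2: title present but cliched or not "
    "reflecting the content; 3: title relevant, content-reflecting, and engaging."
)

def score_text(narrative_text: str) -> str:
    """Submit one narrative text for rubric-based scoring and feedback."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder model name
        messages=[
            {"role": "system", "content": SYSTEM_ROLE},
            # Scored example texts could be added here as additional messages,
            # mirroring the many-shot iterative prompting step.
            {"role": "user", "content": narrative_text},
        ],
    )
    return response.choices[0].message.content

# Texts are submitted individually; batch submission degraded scoring in the study.
print(score_text("Bir varmış, bir yokmuş..."))
```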
In this section, we summarize the notable challenges encountered during the AI training process and the solutions devised, aiming to assist other researchers and practitioners in navigating similar issues.
One key challenge was the variety of file types used in the training process. Considering practical applications, evaluators might store texts in different formats, such as photographs of students' written work. We experimented with different file types during the training process to observe any variations in performance. Using .png files resulted in more errors, likely because the Optical Character Recognition (OCR) process employed by the LLM altered the content before analysis. With .doc files, the model sometimes changed the texts, for example by adding and then scoring titles that were not originally there. Using .pdf files, however, minimized these issues, significantly reducing the frequency of such interventions.
Another issue involved the mode of text submission—whether collectively or individually. Batch submissions led to significant errors during training and testing, with the model exhibiting unwanted automation in scoring and producing uniform feedback after a few texts. Sequential submission and individual scoring effectively mitigated these issues.
Problems also emerged due to the length of the training data. Long texts or prompts in the training set could cause confusion in the trained model. These issues were resolved by organizing training data into shorter, clearer, and more concise segments.
Occasionally, the trained model struggled to establish the correct connections with previously provided training data, resulting in undesired creativity. This could be due to the model's difficulty in connecting with prior training data, in addition to the challenge of making incorrect generalizations as it is exposed to more varied data. We observed that these connectivity issues resolved themselves over time without new interventions and were not consistently related to specific training data (indicative of randomness). This problem underscores the importance of human-AI collaboration, suggesting that some processes should not be left entirely to AI. Human oversight can effectively eliminate these issues.
A total of 114 narrative texts, all written in Turkish, were created for this study. These texts are diverse in terms of the evaluation criteria. For instance, some texts consist of a single paragraph but are flawless in terms of grammar and punctuation. Other texts, while ideal in their three-paragraph structure and page layout, lack titles. Some intentionally omit elements of setting. Each story has a unique title, features different characters, and utilizes different elements of setting, resulting in distinct plotlines. In essence, each text is original and unrelated to the others. This approach aims to minimize the role of chance when TAT scores the texts or provides effective feedback.
The narrative texts were uploaded to the Text Assessment Tool (TAT) and scored three times, resulting in scores at three different points in time. The same scoring procedure was applied using the default ChatGPT, yielding three datasets: the actual scores, the scores assigned by TAT across three sessions, and the scores assigned by ChatGPT across three sessions. Additionally, 20 randomly selected narrative texts were re-uploaded to TAT to collect effective feedback based on the previous automated scoring, and the outputs were recorded.
The agreement levels among the actual scores of the 114 texts, the scores provided by TAT at three different times, and the scores given by default ChatGPT at three different times were calculated using Krippendorff's α. Krippendorff's α is a reliability coefficient commonly used in fields such as the social sciences and content analysis to measure the consistency of categorical or continuous ratings made by multiple raters or coders (Krippendorff, 2004). The following criteria were used to evaluate the obtained Krippendorff's α values: a coefficient below 0.667 indicates low agreement and reliability, a value between 0.667 and 0.80 indicates acceptable, moderate agreement and reliability, and a value above 0.80 indicates high agreement and reliability.
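To make the computation concrete, the sketch below shows how Krippendorff's α could be obtained for a single rubric category with the open-source Python package krippendorff. The toy scores, the choice of package, and the ordinal level of measurement are illustrative assumptions rather than details reported in the study.

```python
# Illustrative sketch (not the authors' code): Krippendorff's alpha between the
# actual scores and one TAT scoring session for a single rubric category.
# Assumes the third-party `krippendorff` package (pip install krippendorff).
import numpy as np
import krippendorff

actual = [3, 2, 1, 3, 2, 1, 3, 2]   # toy data; the study used 114 texts
tat_t1 = [3, 2, 1, 3, 3, 1, 3, 2]   # scores from TAT's first session

# Rows = raters, columns = texts; np.nan would mark any missing rating.
reliability_data = np.array([actual, tat_t1], dtype=float)

alpha = krippendorff.alpha(reliability_data=reliability_data,
                           level_of_measurement="ordinal")  # 1-3 rubric levels
print(f"Krippendorff's alpha: {alpha:.3f}")

# Interpretation used in the study: < 0.667 low, 0.667-0.80 moderate, > 0.80 high.
```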
The effectiveness of the feedback provided by TAT, based on the effective feedback criteria, was analyzed using descriptive statistics. In this phase, the researchers individually examined a total of 160 feedback instances for 20 randomly selected texts, covering the eight categories in the rubric. Each feedback instance was coded by the researchers as successful or unsuccessful according to criteria such as category appropriateness, performance orientation, clarity and comprehensibility, developmental quality, constructiveness, and task specificity. Subsequently, the percentage of successful feedback was calculated relative to the total amount of feedback.
The preliminary scores for the 114 narrative texts were determined by the researchers using the rubric. To assess the reliability of these scores, a random selection of texts from the 114 was sent to three experts, who provided their opinions on the appropriateness of the scores. The similarity between the expert opinions and the researchers' scores was calculated using the formula A = C ÷ (C + a) × 100, based on the Miles and Huberman (1994) model, where A represents the reliability coefficient, C denotes the number of items with agreement, and a denotes the number of items without agreement. According to this model, a similarity ratio of at least 80% is required to achieve consistency. Our reliability study showed agreement rates of 95%, 96.25%, and 91.25% between the preliminary scores assigned by the researchers and the expert evaluations. This consistency justifies treating the researchers' preliminary scores as the gold standard (actual scores).
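As a quick numerical check of this formula, the snippet below reproduces the reported agreement rates. The number of rubric decisions compared per expert is not reported in the study; the counts used here are assumptions chosen purely to illustrate the arithmetic.

```python
# Worked example of the Miles and Huberman formula A = C / (C + a) * 100.
def agreement(agreed: int, disagreed: int) -> float:
    """Percentage agreement between expert opinions and researcher-assigned scores."""
    return agreed / (agreed + disagreed) * 100

# Hypothetical item counts that reproduce the rates reported in the study.
print(agreement(76, 4))   # 95.0
print(agreement(77, 3))   # 96.25
print(agreement(73, 7))   # 91.25 -> all above the 80% consistency requirement
```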
All texts in the dataset were scored by TAT at three separate times (t-1, t-2, t-3), and Krippendorff's α was used to determine the level of agreement between each set of scores and the actual scores. The results are presented in Table I.
TABLE I. Agreement levels for TAT scores with actual scores
| Category | α (t-1) | α (t-2) | α (t-3) | α (mean) | Interpretation |
|---|---|---|---|---|---|
| Page structure | 0.899 | 0.897 | 0.857 | 0.884 | High reliability |
| Title | 0.870 | 0.802 | 0.818 | 0.830 | High reliability |
| Text structure | 0.921 | 0.950 | 0.950 | 0.940 | High reliability |
| Character | 0.822 | 0.763 | 0.807 | 0.797 | Medium reliability |
| Setting | 0.728 | 0.787 | 0.759 | 0.758 | Medium reliability |
| Plot | 0.883 | 0.894 | 0.901 | 0.889 | High reliability |
| Language and style | 0.874 | 0.894 | 0.893 | 0.887 | High reliability |
| Spelling and punctuation | 0.780 | 0.786 | 0.814 | 0.793 | Medium reliability |
Examining Table I, it is observed that the category with the lowest agreement between TAT's scores and the actual scores is the "setting" category, which involves examining the presence of time and place elements in the stories and their impact. On the other hand, the category with the highest agreement, at 0.940, is the "text structure" category, which examines the presence and quality of the introduction, climax, and resolution sections of the stories. Upon reviewing the agreement levels for each of the three comparisons between TAT scores and actual scores, all the obtained alpha values, as well as their means, exceeded the threshold considered reliable (0.667).
Along with analyzing the agreement between TAT's scores and the actual scores, the stability of TAT's scores across the three different times was also examined. The results and their interpretations are presented in Table II.
TABLE II. Stability of TAT Scores
| Category | α 1-2-3 | Interpretation |
|---|---|---|
| Page structure | 0.905 | High reliability |
| Title | 0.896 | High reliability |
| Text structure | 0.957 | High reliability |
| Character | 0.797 | Medium reliability |
| Setting | 0.846 | High reliability |
| Plot | 0.908 | High reliability |
| Language and style | 0.954 | High reliability |
| Spelling and punctuation | 0.828 | High reliability |
Upon examining Table II, it is evident that the scores assigned by TAT exhibit consistency over time across all categories, underscoring the reproducibility of the scoring outcomes. It is observed that the category with the highest agreement between the three different scores assigned by TAT is again the "text structure" category. The "character" category, which assesses the personal and psychological traits of the story characters, shows the lowest agreement. It can be interpreted that the character category, with the lowest alpha value, demonstrates medium reliability, whereas the agreements in the other categories demonstrate high reliability.
Krippendorff’s α values, indicating the agreement between default ChatGPT scores and actual scores, are shown in Table III.
TABLE III. Agreement levels for default ChatGPT scores with actual scores
| Category | α (t-1) | α (t-2) | α (t-3) | α (mean) | Interpretation |
|---|---|---|---|---|---|
| Page structure | -0.032 | 0.553 | 0.370 | 0.297 | Low reliability |
| Title | 0.159 | 0.422 | 0.473 | 0.351 | Low reliability |
| Text structure | 0.261 | 0.521 | 0.502 | 0.428 | Low reliability |
| Character | 0.266 | 0.604 | 0.477 | 0.449 | Low reliability |
| Setting | 0.384 | 0.469 | 0.516 | 0.456 | Low reliability |
| Plot | 0.214 | 0.381 | 0.477 | 0.357 | Low reliability |
| Language and style | 0.412 | 0.562 | 0.550 | 0.508 | Low reliability |
| Spelling and punctuation | 0.233 | 0.415 | -0.171 | 0.159 | Low reliability |
Table III shows that "Spelling and Punctuation" has the lowest agreement between ChatGPT and actual scores, while "Language & Style" has the highest. Overall, the default ChatGPT scores show low agreement with actual scores across all categories.
The findings related to the stability of the scores given by ChatGPT, a general language processing model not specifically trained for this research, at three different times are presented in Table IV.
TABLE IV. Stability of default ChatGPT Scores
| Category | α 1-2-3 | Interpretation |
|---|---|---|
| Page structure | 0.403 | Low reliability |
| Title | 0.402 | Low reliability |
| Text structure | 0.566 | Low reliability |
| Character | 0.475 | Low reliability |
| Setting | 0.633 | Low reliability |
| Plot | 0.491 | Low reliability |
| Language and style | 0.436 | Low reliability |
| Spelling and punctuation | 0.627 | Low reliability |
Examining Table IV, it is observed that the category with the highest agreement among the scores given at three different times is the "setting" category, while the category with the lowest agreement is the "title" category. Interpreting the values in Table IV, the agreement for every category indicates low reliability.
FIGURE II. A comparison between TAT and default ChatGPT concerning agreement with actual scores (left) and the stability of scores (right)

Figure II illustrates the mean alpha values from Tables I and III, as well as the alpha values across the three distinct time points (intra-rater reliability) detailed in Tables II and IV. The figure highlights substantial discrepancies between the scores given by ChatGPT and those assigned by TAT, in terms of both their agreement with the actual scores (left) and their stability (right).
Based on the evidence regarding the agreement of TAT's rubric-based scores with the actual scores of the texts and the stability of its scores across different time points, it was concluded that research hypotheses H1 and H2 were supported.
To investigate hypothesis H3, the effectiveness of TAT's feedback, given in line with the rubric used in the study, was analyzed against the criteria for effective feedback. The feedback provided by TAT was evaluated by the researchers using criteria that included being performance-oriented, clear and comprehensible, constructive, developmental, and task-specific, as outlined in the introduction of the study. During these evaluations, it was found that TAT occasionally provided feedback for a category different from the one it should have addressed. To quantify these instances, an additional criterion named "category appropriateness" was defined alongside the effective feedback criteria. The results, including the total number of feedback instances analyzed and their success rates, are presented in Table V.
TABLE V. Feedback performance of TAT
| Category | Total Feedback | Successful Feedback (%) |
|---|---|---|
| Category appropriateness | 160 | 91.88 |
| Performance-oriented | 160 | 100 |
| Clear & comprehensible | 160 | 86.25 |
| Developmental | 160 | 83.75 |
| Constructive | 160 | 100 |
| Task-specific | 160 | 89.38 |
An examination of Table V reveals that all feedback provided by TAT is performance-oriented and constructive. The criterion with the lowest adherence rate is the developmental criterion, met at 83.75%. Despite the inherent difficulty of crafting feedback that supports student development, TAT's performance in this area is commendably high, and high performance is likewise observed across all other criteria.
When feedback is sequentially examined in terms of the criteria that effective feedback should possess, it is observed that 91.88% of the feedback meets the category appropriateness criterion, while 8.12% does not. An example of feedback considered unsuccessful according to this criterion is provided below.
Example Feedback 1 is the feedback provided by TAT for the page structure category of Story 80. However, Example Feedback 1 focuses on suggestions regarding the flow, clarity and elaboration of the story, not the page structure. Although these suggestions are valuable for the development of the text, they do not comply with the rubric criteria for the page structure category. In contrast, Example Feedback 2 is considered suitable for the page structure category.
When examining whether the feedback provided by TAT is performance-oriented, it is observed that all 160 pieces of feedback meet this criterion. Example Feedback 3 is an example of feedback provided by TAT that meets this criterion:
When examining whether the feedback provided by TAT is clear and comprehensible, it is observed that 86.25% of the 160 pieces of feedback are successful according to this criterion. Example Feedback 4 is presented below:
Example Feedback 4 is feedback provided by TAT for the plot category for Story 53. This feedback exemplifies clarity and comprehensibility, which are essential principles of effective feedback. It avoids using suggestions that might confuse the student or words that the student might not understand.
An analysis conducted to evaluate the quality of TAT's feedback found that 83.75% of the feedback provided supports the student's development. This is exemplified in Example Feedback 5.
Example Feedback 5 is feedback provided by TAT for the setting category for Story 15. When this feedback is examined, it is observed that the provided suggestions and examples consist of expressions that support the student's development.
One of the principles of effective feedback is that it should be constructive. According to this criterion, feedback provided to students should encourage them and offer various options instead of rigid commands or instructions. From this perspective, it is observed that all the feedback provided by TAT is delivered in a constructive manner, encouraging the students. This can be seen in the following feedback provided for the setting category for Story 22.
One of the qualities that effective feedback should have is that it is task-specific rather than general. According to this criterion, effective feedback should not use the same expressions for everyone but should be tailored specifically to the student's text. When examining the feedback provided by TAT, it is observed that TAT is quite successful in this regard, with 89.38% of the feedback meeting this criterion.
Example Feedback 7 is feedback provided by TAT for the plot category for Story 97. When this feedback is examined, it is observed that it is specific feedback directly related to the text, not general.
This study aims to determine if narrative texts can be accurately and stably scored through human-AI collaboration and if effective formative feedback can be provided. Additionally, the performance of the GPT trained for this purpose was compared to ChatGPT, which was not specifically trained for this research, to highlight performance differences.
Scoring accuracy and reliability
Agreement with actual scores: TAT scored 114 narrative texts using a rubric, and the agreement level between these scores and the actual scores was examined for each rubric category. The Krippendorff's α values indicated strong agreement with the actual scores across all criteria, with reliability exceeding the threshold (α ≥ 0.667). The highest agreement was observed in the "Text Structure" category (α = 0.940), while the lowest agreement was in the "Setting" category (α = 0.758).
Stability over time: When examining the agreement levels between TAT's scores at three different times, it was found that Krippendorff's α values were above the threshold (α ≥ 0.667) across all criteria. The scores for the "Text Structure" category demonstrated the highest stability (α = 0.957), while the scores for the "Character" category showed the lowest stability (α = 0.797).
Both in terms of agreement with the actual scores and stability, relatively low alpha values were identified in the character, setting, and spelling and punctuation categories. For the character category, the rubric draws a subtle distinction between awarding two points and three points: two points should be given when the physical and psychological traits of the characters are described, while three points are warranted when these traits, along with the emotions and perspectives that affect the narrative flow, are identified. Determining which emotion or perspective influences the narrative, or distinguishing between them, can be challenging. This difficulty would challenge a human evaluator just as it challenges TAT. In the setting category, the challenge is thought to stem from inconsistencies in the combined portrayal of the "place" and "time" elements in the narratives. For example, a story may provide detailed information about the place and its impact on the narrative but neglect the aspect of time, making it difficult to score according to the rubric, which requires their joint assessment. Further disaggregation of these criteria in the rubric into smaller and clearer components could enhance AI scoring performance. Regarding the spelling and punctuation category, we had to use numerous datasets explaining Turkish spelling and punctuation rules to improve TAT's performance. This necessity is paradoxical, because using a large number of datasets can confuse the AI during training. If the stories had been in English, fewer datasets would likely have been needed, resulting in better performance. Overall, all performances were above the threshold and satisfactory. The relatively lower performances could be addressed through interventions such as revising the rubric, and they are not viewed as significant issues for text evaluation in collaboration with AI.
Default ChatGPT's performance: Tests with default ChatGPT revealed low reliability in both agreement with the actual scores and internal stability when scoring narrative texts. This was evident even in simple tasks such as evaluating the title of a text. The default model, untrained for text evaluation and unrestricted by specific tasks, often undertakes unwanted tasks such as corrections. For instance, it might add a title to a text that lacks one and then proceed to score the title it added. When examining the scores across different categories, some categories showed very poor performance. For example, the mean alpha value for the spelling and punctuation category was 0.159. The default model was particularly weak in examining spelling and punctuation in Turkish texts. This underscores the substantial improvements achieved in initially lower-performing categories following specialized training.
Criteria compliance: The feedback provided by TAT was evaluated according to the criteria established for effective feedback. The tool demonstrated performance success rates exceeding 83% across all criteria, particularly excelling in delivering performance-oriented, constructive, and task-specific feedback.
Category Appropriateness: Only about 8.12% of the feedback samples were deemed inappropriate for their respective categories, demonstrating TAT’s high performance in delivering feedback within the context of each rubric category and effectively reminding students of the relevant criteria. Furthermore, the feedback considered inappropriate was not due to fabricated issues but rather to the confusion between some subtle distinctions among different rubric categories.
In the study by Yavuz et al. (2024), the large language models ChatGPT and Bard were compared for essay evaluation. ChatGPT was used both in its default mode and in a fine-tuned mode with the temperature reduced to 0.2. The scores given by the AI were compared with those given by human evaluators. The results indicated that default ChatGPT, fine-tuned ChatGPT, and Bard all provided reliable scores. Notably, the fine-tuned ChatGPT showed very high agreement with human evaluators. In that study, the language models were not specifically trained for the task; fine-tuning consisted simply of adjusting the temperature setting, which limits the variability of the model's responses. In our study, by contrast, no temperature adjustment was made, and default ChatGPT was used for comparisons. The results of the two studies diverge concerning the performance of default ChatGPT. We considered that the language of the essays being evaluated might be a significant factor: one study used English texts evaluated with an English rubric, while the other used Turkish texts evaluated with a Turkish rubric. To substantiate this claim, more research comparing performance across different languages is required. Another factor contributing to the differing results could be the number of texts evaluated. In the study by Yavuz et al. (2024), only three texts were evaluated, whereas in our study 114 texts were evaluated. We observed that as the number of texts to be scored by ChatGPT increased, it produced undesirable automatic responses and applied similar scoring patterns to qualitatively different texts. Thus, the other study may have achieved better performance by evaluating a small number of texts with appropriate prompts and human-AI collaboration. However, we argue that a model specifically trained for a purpose performs much better when there is heavy lifting to be done.
Awidi (2024) compared human evaluators and default ChatGPT in the evaluation of 108 texts. The intraclass correlation coefficient (ICC) for single measures was 0.349, indicating low agreement, which is consistent with our study's results. Awidi (2024) noted that the agreement increased when looking at average measures and advocated for AI collaboration in text evaluation to achieve more consistent results and significantly reduce human workload.
Regarding the quality of feedback provided to texts, Steiss et al. (2024) compared the feedback quality from humans and ChatGPT on student writings. The study compared 200 pieces of feedback from humans and 200 from AI. The results showed that human raters were more successful in providing high-quality feedback in all categories except for criteria-based feedback. Based on this, the authors argued that ChatGPT can be beneficial in the absence of a well-trained educator. In our study, we achieved quite good results regarding the quality of AI-provided feedback. The difference in results between the two studies is largely due to whether the language model was specifically trained for the purpose. We used a model trained for text evaluation and feedback provision, whereas the other study used a default model. Our study showed that a trained language model excels in delivering effective feedback, which is believed to support student development. Regarding this topic, Escalante et al. (2023) conducted a study to determine how AI feedback and human feedback affect students' writing performance and which type of evaluator the students preferred. The study found no significant difference in performance between the groups receiving AI feedback and those receiving human feedback, and students' preferences for evaluators were evenly split.
The results of this study underscore the potential of human-AI collaboration in reliably and objectively scoring narrative texts, even in contexts that require subjective evaluations. The high levels of agreement and stability achieved by TAT, a GPT developed for this study, demonstrate that AI tools, when sufficiently trained, can match human performance in scoring texts and providing effective feedback. The strong potential of AI to support formative assessment processes is particularly significant in densely populated regions and large classrooms, as it can contribute to more consistent and scalable evaluation practices for students while also reducing teachers’ workload in monitoring and supporting individual student development. This, in turn, may contribute to a higher quality educational process.
The study emphasizes the need for specialized training to improve AI models' proficiency in specific tasks. While ChatGPT excels in general language processing, targeted training is crucial for tasks like evaluating narrative texts. Without task-specific constraints, ChatGPT can produce inconsistent results, which is problematic for both scientific research and practical applications. Thus, the authors advise against using default ChatGPT for critical tasks and recommend employing a trained model with demonstrated reliability.
The statistical strength of the results produced by the AI in this study provides significant evidence for its use. However, during the process of both training and utilizing the AI, we discovered that it could make unexpected errors in unforeseen areas.
Beyond the difficulties inherent to the task and the influence of subjective decision-making in narrative text assessment, certain deviations in the agreement rates and temporal performance of both the default ChatGPT and TAT can be explained by the phenomenon of hallucination. Therefore, we argue that a completely AI-driven assessment process, devoid of human oversight, would be highly inappropriate. Beyond preventing errors, human-AI collaboration is essential for developing a system that can continually improve and effectively address varying tasks. Periodically feeding the model with appropriate data can greatly enhance its performance and make it more capable of handling diverse situations.
In the present study, narrative texts were purposefully constructed by the researchers, strictly adhering to a predefined rubric, with intentional incorporation of specific omissions, inaccuracies, and predetermined scoring criteria. This methodological approach enabled a controlled evaluation of the model's proficiency in interpreting and applying evaluation standards. Nonetheless, this design choice introduces inherent limitations. Primarily, the absence of human evaluators and reliance on artificially generated texts may constrain the authenticity and variability that typically characterize genuine student compositions. Consequently, the results obtained from this method may not fully represent the model's potential performance in authentic, real-world educational contexts.
Relatedly, the dataset comprised 114 standardized texts, which, although promoting controlled conditions, might inadequately reflect the diverse range of student profiles and varying writing competencies encountered within large-scale educational environments. To address these constraints, subsequent research could benefit from integrating authentic texts produced by actual students and involving human evaluators to comparatively analyze scoring alignment and temporal consistency of customized GPT models, such as TAT. Furthermore, expanding both the sample size and the dataset diversity might enhance the assessment of the model’s generalizability and practical applicability.
Additionally, variations observed between this study and others underscore the importance of investigating how AI language model performance differs across languages. Thus, initiating further practical and experimental research in this area would be beneficial.
APPENDIX I. Criteria for effective feedback

| Category | Effective Feedback Characteristics | Ineffective Feedback Characteristics |
|---|---|---|
| Performance-oriented | 1. Feedback is directed towards the performance itself, not the performer. 2. Feedback focuses on specific aspects of the performance rather than general comments. | 1. Contains biases towards the student and includes statements targeting their personality. 2. Uses general comments that are not specific to the performance. |
| Clarity & Comprehensibility | 1. Feedback is expressed using words and sentence structures appropriate for the student's age group or developmental level. 2. It clearly specifies what is expected and what constitutes a good performance. 3. Feedback should be detailed and explanatory enough to avoid causing confusion for students. | 1. Contains technical and complex expressions that make it difficult for students to understand. 2. Uses vague statements like "you can do better" instead of specifying what is expected. 3. Feedback is superficial and random, making it unclear what is expected from the student. |
| Developmental | 1. Feedback should include suggestions to help students address deficiencies and achieve the expected performance. 2. Similar tasks or strategies that can be used by the student to facilitate self-learning may be recommended. 3. Emphasize what the student should do first to improve subsequent performances. | 1. Emphasizes deficiencies and inadequacies without suggesting ways to address them. |
| Constructive | 1. Feedback should highlight strengths as well as weaknesses in the performance. Good performances should receive feedback as well as poor ones. 2. Use language that encourages the student and supports their self-esteem. 3. Provide options for the student on what they can do, rather than strict commands or instructions. | 1. Uses patronizing language and statements that render the student passive. 2. Includes judgmental or threatening expressions that discourage the student. |
| Task-specific | 1. Feedback should not contain general statements; instead, it should highlight specific points in the student's work and be given specifically in relation to its content. | 1. The feedback contains generic statements that could be used for all similar tasks, making the feedback for different tasks appear repetitive and formulaic. |
APPENDIX II. Rubric for assessing narrative texts

| Category | 1 Point | 2 Points | 3 Points |
|---|---|---|---|
| Page structure | The text is not written in paragraphs and is visually disorganized on the page. | The text is written in paragraphs, but the indentations and/or line endings are not properly aligned. | The text is written in paragraphs with proper indentations and line endings, creating a visually organized page. |
| Title | The text does not have a title. | The text has a title, but it either does not reflect the content or is a common cliché. | The title is relevant to the topic, reflects the content, and is engaging. |
| Text structure | The text is missing one or more key sections: introduction, climax, and resolution. | The text has an introduction, climax, and resolution, but the transitions between sections are disjointed. | The text has an introduction, climax, and resolution, with logical relationships and smooth transitions between the sections. |
| Character | Characters are only mentioned by name without any additional information. | Characters are named, and their physical and/or psychological traits are described. | Characters are named, their physical and psychological traits are described, and their emotions, perspectives, and attitudes, which influence the story's flow, are explained or suggested. |
| Setting | Either the place or the time element is missing or unclear. | The place and time are mentioned, but no detailed information is provided. | The place is well described with auditory and visual details, and the time is detailed, indicating its impact on other story elements. |
| Plot | There is no clear plot. | There is a clear plot, but the transitions between events are disjointed. | There is a clear plot with strong transitions between events. |
| Language and style | Most sentences are unclear, lacking semantic and grammatical connections, and the story uses very limited vocabulary. | Sentences are clear and understandable, with some semantic and grammatical connections, but the story uses limited vocabulary. | Sentences are clear and understandable, with well-made semantic and grammatical connections, and the story uses rich vocabulary. |
| Spelling and punctuation | There are 11 or more spelling and punctuation errors in the text. | There are 6-10 spelling and punctuation errors in the text. | There are no more than 5 spelling and punctuation errors in the text. |
APPENDIX III. Example prompts used in the GPT training steps

| Step | Example Prompt |
|---|---|
| Goal Setting and Initial Assessment | As a language teacher, you will evaluate your students' narrative texts and provide them with effective feedback to help them improve. To do this, you will use the rubric and effective feedback principles documents that I will upload for you. |
| Criteria Introduction | We will discuss the category of setting. When you examine the criteria, do you see any item on which you might have difficulty deciding? Where do you think you might encounter problems while scoring? |
| Example Analysis | If the event in the story takes place in the summer, the time is clear; however, if there is no information about the specific details of the time, you should assign a score of 2. The same principle applies to the place element. If the event occurs at an inn and this is mentioned, but there are no detailed descriptions about the inn, you should also assign a score of 2. |
| Upload Sample Files | Paragraph indentation is when the first line of a paragraph starts further in than the other lines. Now, I will upload a single-paragraph story example without indentation for you. You can use this file as a basis for your evaluation. |
| Structured Practice | I am going to upload a text for you. Based on our discussions, I would like you to evaluate all sections of the rubric for this text. |
| Final Evaluation and Confirmation | Now, describe the files I have uploaded to you, summarize the decisions we have made, and specify the rules you will pay attention to during the evaluation. |
Alto, V. (2023).

Awidi, I. T. (2024). Comparing expert tutor evaluation of reflective essays with marking by generative artificial intelligence (AI) tool.

Brookhart, S. M. (2008).

Burke, D., & Pieterick, J. (2010).

Chan, C. K. Y., & Colloton, T. (2024).

Dalton, G. (2024).

Elsayed, H. (2024). The impact of hallucinated information in large language models on student learning outcomes: A critical examination of misinformation risks in AI-assisted education.

Escalante, J., Pack, A., & Barrett, A. (2023). AI-generated feedback on writing: Insights into efficacy and ENL student preference.

Fell Kurban, C., & Şahin, M. (2024).

Fitria, T. N. (2023). Artificial intelligence (AI) technology in OpenAI ChatGPT application: A review of ChatGPT in writing English essay.

Holmes, W., & Miao, F. (2023).

Irons, A. (2008).

Jia, Q., Cui, J., Xi, R., Liu, C., Rashid, P., Li, R., & Gehringer, E. (2024). On assessing the faithfulness of LLM-generated feedback on student assignments. In

Johannesson, P. (2024).

Juwah, C., Macfarlane-Dick, D., Matthew, B., Nicol, D., Ross, D., & Smith, B. (2004).

Kolbjørnsrud, V. (2024). Designing the intelligent organization: Six principles for human-AI collaboration.

Krippendorff, K. H. (2004).

Miles, M. B., & Huberman, A. M. (1994).

Ministry of National Education of Türkiye. (2024).

Steiss, J., Tate, T., Graham, S., Cruz, J., Hebert, M., Wang, J., Moon, Y., Tseng, W., Warschauer, M., & Olson, C. B. (2024). Comparing the quality of human and ChatGPT feedback of students' writing.

Venter, J., Coetzee, S. A., & Schmulian, A. (2024). Exploring the use of artificial intelligence (AI) in the delivery of effective feedback.

Wang, Z. (2022). Computer-assisted EFL writing and evaluations based on artificial intelligence: A case from a college reading and writing course.

Xiao, C., Xu, S. X., Zhang, K., Wang, Y., & Xia, L. (2023). Evaluating reading comprehension exercises generated by LLMs: A showcase of ChatGPT in education applications. In

Yavuz, F., Çelik, Ö., & Yavaş Çelik, G. (2024). Utilizing large language models for EFL essay grading: An examination of reliability and validity in rubric-based assessments.

Ziqi, C., Xinhua, Z., Qi, L., & Wei, W. (2024). L2 students' barriers in engaging with form and content-focused AI-generated feedback in revising their compositions.