The latest research by Associate Professor Liu Yingying, a member of the Second Language Acquisition Research Team, has been published in Computer Assisted Language Learning under the title “Enhancing GPT-based automated essay scoring: the impact of fine-tuning and linguistic complexity measures”.
The study explored the application of large language models to automated essay scoring, with a focus on their reliability, accuracy, and fairness. Using a pretrained generative AI language model, the research evaluated English essays written by second language learners of English whose first languages are Chinese, French, German, and Spanish. By comparing the performance of three approaches (GPT ratings alone, linguistic complexity indices alone, and a combination of the two), the study examined the roles of fine-tuning and linguistic complexity indices in automated essay scoring.
The findings indicated that the fine-tuned GPT model showed strong agreement with human raters, particularly when evaluating intermediate- and advanced-level essays. However, the model was less accurate on low-level essays, primarily because of class imbalance in the training data, where low-level samples were underrepresented. In addition, the model tended to assign higher scores than human raters overall, and scoring accuracy varied across learners with different first language backgrounds. Among the three approaches, the composite model combining fine-tuning and linguistic complexity measures performed best, though its advantage over the fine-tuned model alone was marginal. The study demonstrates the potential of generative AI in second language writing assessment and offers valuable insights for future research.
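For readers curious about how such a three-way comparison can be operationalized, the sketch below compares a GPT-only, a complexity-only, and a composite scoring approach against human ratings using quadratic weighted kappa, a common agreement metric in automated essay scoring. All data, features, and models here are placeholder assumptions for illustration and do not reproduce the study's actual setup.

```python
# Minimal sketch of comparing three scoring approaches against human ratings.
# All data below are hypothetical placeholders, not the study's materials.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Hypothetical human ratings on a 1-5 scale and GPT-predicted scores
# (simulated with a slight upward bias, for illustration only).
human = rng.integers(1, 6, size=200)
gpt_scores = np.clip(human + rng.normal(0.3, 0.8, size=200), 1, 5)

# Hypothetical linguistic complexity indices (e.g., lexical diversity, clause length).
complexity = rng.normal(size=(200, 5))

# Approach 1: GPT ratings alone, rounded to the rubric scale.
pred_gpt = np.rint(gpt_scores).astype(int)

# Approach 2: regression on linguistic complexity indices alone.
reg = LinearRegression().fit(complexity, human)
pred_ling = np.rint(np.clip(reg.predict(complexity), 1, 5)).astype(int)

# Approach 3: composite model combining GPT scores with complexity indices.
composite_features = np.column_stack([gpt_scores, complexity])
comp = LinearRegression().fit(composite_features, human)
pred_comp = np.rint(np.clip(comp.predict(composite_features), 1, 5)).astype(int)

# Agreement with human raters via quadratic weighted kappa.
for name, pred in [("GPT only", pred_gpt),
                   ("Complexity only", pred_ling),
                   ("Composite", pred_comp)]:
    qwk = cohen_kappa_score(human, pred, weights="quadratic")
    print(f"{name}: QWK = {qwk:.3f}")
```

In a real evaluation the models would, of course, be fit on training essays and assessed on held-out data; the sketch only shows how the three approaches and the agreement metric fit together.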
