CarryCKW/AES-NPCR

Automated Essay Scoring via Pairwise Contrastive Regression

Created by Jiayi Xie*, Kaiwei Cai*, Li Kong, Junsheng Zhou, and Weiguang Qu. This repository contains the ASAP dataset and the PyTorch implementation for automated essay scoring (COLING 2022, Oral).

We use 5-fold cross-validation and split the ASAP dataset into 5 folds, as shown under "./dataset/asap".

Code for AES-NPCR

Requirements

PyTorch 1.7.1, Python 3.7.9

Pretrain Model

BERT, RoBERTa, or XLNet can be used; the default is BERT.
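
The repository does not show its loading code here; as a rough illustration, the sketch below selects one of these encoders through the Hugging Face transformers library. This is an assumption: the repository may load checkpoints differently, and the checkpoint names are illustrative.

```python
# Minimal sketch of selecting the pretrained encoder (assumes the HuggingFace
# `transformers` package; the repository may load its weights differently).
from transformers import AutoModel, AutoTokenizer

ENCODERS = {
    "bert": "bert-base-uncased",      # default
    "roberta": "roberta-base",
    "xlnet": "xlnet-base-cased",
}

def load_encoder(name: str = "bert"):
    """Return (tokenizer, model) for the chosen pretrained encoder."""
    checkpoint = ENCODERS[name]
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)
    return tokenizer, model

tokenizer, encoder = load_encoder("bert")
```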

The code will be refactored.

Automated Essay Scoring by Capturing Relative Writing Quality

Hongbo Chen, Jungang Xu, Ben He, Automated Essay Scoring by Capturing Relative Writing Quality, The Computer Journal, Volume 57, Issue 9, September 2014, Pages 1318–1330, https://doi.org/10.1093/comjnl/bxt117

Automated essay-scoring (AES) systems utilize computer techniques and algorithms to automatically rate essays written in an educational setting, by which the workload of human raters is greatly reduced. AES is usually addressed as a classification or regression problem, where classical machine learning algorithms such as K-nearest neighbor and support vector machines are applied. In this paper, we argue that essay rating is based on the comparison of writing quality between essays and treat AES rather as a ranking problem by capturing the difference in writing quality between essays. We propose a rank-based approach that trains an essay-rating model by learning to rank algorithms, which have been widely used in many information retrieval and social Web mining tasks. Various linguistic and statistical features are utilized to facilitate the learning algorithms. Extensive experiments on two public English essay datasets, Automated Student Assessment Prize and Chinese Learners English Corpus, show that our proposed approach based on pairwise learning outperforms previous classification or regression-based methods on all 15 topics. Finally, analysis on the importance of the features extracted reveals that content, organization and structure are the main factors that affect the ratings of essays written by native English speakers, while non-native speakers are prone to losing ratings on improper term usage, syntactic complexity and grammar errors.
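
As an illustration of the pairwise idea described in the abstract, the sketch below trains a linear scorer with a hinge loss on pairs of essays ordered by their human ratings. The paper's actual learning-to-rank algorithm and linguistic feature set are not reproduced here; the toy features and margin are assumptions.

```python
# Illustrative sketch only: a pairwise hinge loss on essay pairs, in the spirit of
# the rank-based formulation described above.
import torch
import torch.nn as nn

class LinearScorer(nn.Module):
    def __init__(self, num_features: int):
        super().__init__()
        self.w = nn.Linear(num_features, 1, bias=False)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.w(feats).squeeze(-1)   # one scalar score per essay

def pairwise_hinge_loss(score_hi, score_lo, margin: float = 1.0):
    # The essay with the higher human rating should be scored above the lower-rated one.
    return torch.clamp(margin - (score_hi - score_lo), min=0).mean()

# Toy usage: 8 essay pairs with 20 hand-crafted features each.
scorer = LinearScorer(20)
hi, lo = torch.randn(8, 20), torch.randn(8, 20)
loss = pairwise_hinge_loss(scorer(hi), scorer(lo))
loss.backward()
```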

FABRIC: Automated Scoring and Feedback Generation for Essays

Automated essay scoring (AES) provides a useful tool for students and instructors in writing classes by generating essay scores in real time. However, previous AES models do not provide more specific rubric-based scores nor feedback on how to improve the essays, which can be even more important than the overall scores for learning. We present FABRIC, a pipeline to help students and instructors in English writing classes by automatically generating 1) overall scores, 2) specific rubric-based scores, and 3) detailed feedback on how to improve the essays. Under the guidance of English education experts, we chose content, organization, and language as the rubrics for the specific scores. The first component of the FABRIC pipeline is DREsS, a real-world Dataset for Rubric-based Essay Scoring. The second component is CASE, a Corruption-based Augmentation Strategy for Essays, with which we can improve the accuracy of the baseline model by 45.44%. The third component is EssayCoT, the Essay Chain-of-Thought prompting strategy, which uses scores predicted from the AES model to generate better feedback. We evaluate the effectiveness of the new dataset DREsS and the augmentation strategy CASE quantitatively and show significant improvements over models trained with existing datasets. We evaluate the feedback generated by EssayCoT with English education experts and show significant improvements in the helpfulness of the feedback across all rubrics. Lastly, we evaluate the FABRIC pipeline with students in a college English writing class, who rated the generated scores and feedback with an average of 6 on a Likert scale from 1 to 7.

1 Introduction

In writing education, automated essay scoring (AES) offers benefits to both students and instructors by providing scores for students’ essays in real time. Many students fear exposing their errors to instructors; therefore, immediate assessment of their essays with AES can reduce their anxiety and help them improve their writing (Sun and Fan, 2022). For instructors, an AES model can ease the burdensome process of evaluation and offer a means to validate their own evaluation, ensuring accuracy and consistency in assessment.

Existing AES models provide valuable overall scores, but they are insufficient for both learners and instructors desiring more details. Several studies have underscored English learners’ preference for specific and direct feedback (Sidman-Taveau and Karathanos-Aguilar, 2015; Karathanos and Mena, 2009). As students rarely seek clarifications on unclear feedback and may even disregard it, scoring and feedback must be clear and specific for easy comprehension (Sidman-Taveau and Karathanos-Aguilar, 2015). However, existing AES models cannot be trained to provide detailed rubric-based scores because the datasets either do not have any rubric-specific scores, or when they do, the rubrics and criteria for scoring vary significantly among different datasets.

We introduce FABRIC, Feedback generation guided with AES By Rubric-based dataset Incorporating ChatGPT, a combination of an AES model and an LLM. FABRIC comprises three major contributions: DREsS, a real-world Dataset for Rubric-based Essay Scoring; CASE, a Corruption-based Augmentation Strategy for Essays; and EssayCoT, the Essay Chain-of-Thought prompting strategy for feedback generation.

DREsS includes 1,782 essays collected from EFL learners, each scored by instructors according to three rubrics: content, organization, and language. Furthermore, we rescale existing rubric-based datasets to align with our three primary rubrics. We propose this combination of a newly collected real-classroom dataset and existing datasets rescaled to the same rubrics and standards as a standard rubric-based AES dataset. CASE is a novel data augmentation method that enhances the performance of the AES model. CASE employs three rubric-specific strategies to augment the essay dataset with corruption, and training with CASE results in a model that outperforms the quadratic weighted kappa score of the baseline model by 26.37%. EssayCoT is a prompting strategy to guide essay feedback generation, a new task on top of AES. EssayCoT leverages essay scores automatically predicted by the AES model when generating feedback, instead of manually composed few-shot exemplars. Feedback with EssayCoT prompting is significantly more preferred and helpful compared to standard prompting, according to an assessment by 13 English education experts. Lastly, we deploy FABRIC on an essay editing platform for 33 English as a Foreign Language (EFL) students.

In summary, the main contributions of this work are as follows:

We propose a standard rubric-based dataset with the combination of our newly collected real-classroom DREsS dataset (1.7K) and unified samples of the existing datasets (2.9K).

We introduce corruption-based augmentation strategies for essays (CASE). We build 3.9K content, 15.7K organization, and 0.9K language synthetic samples for AES model training.

We introduce EssayCoT prompting for essay feedback generation, which significantly improves the helpfulness of feedback.

We propose FABRIC, a pipeline that generates both scores and feedback leveraging DREsS, CASE, and EssayCoT. We deploy FABRIC with the aim of exploring its practical application in English writing education.

2 Related Work

2.1 Automated Essay Scoring

Automated essay scoring (AES) systems are used to evaluate and score student essays written in response to a given prompt. However, only a limited number of rubric-based datasets are available for AES, and their utility is limited because the rubrics are not consistent. Furthermore, an AES dataset has to be annotated by experts in English education, since the scoring task requires not only proficiency in English but also pedagogical knowledge of English writing. To the best of our knowledge, a real-world AES dataset has not yet been established, as existing AES datasets use scores annotated by non-experts in English education.

2.1.1 AES Datasets

The ASAP dataset (https://www.kaggle.com/c/asap-aes) is widely used in AES tasks and includes eight different prompts. Six out of eight prompt sets (P1-6) have a single overall score, and only two prompts (P7-8) are rubric-based. These two rubric-based prompt sets consist of 1,569 and 723 essays, respectively. The two prompt sets also have distinct rubrics and score ranges, which poses a challenge in leveraging both datasets for training rubric-based models. The essays were written by Grade 7-10 students in the US but are graded by non-expert annotators.

Mathias and Bhattacharyya (2018) manually annotated different attributes of essays in ASAP Prompts 1 to 6, which only have a single overall score. ASAP++ P1-2 are argumentative essays, while P3-6 are source-dependent essays. However, most samples in ASAP++ were annotated by a single non-expert annotator, including non-native speakers of English. Moreover, each prompt set of ASAP++ has attributes different from the others, which would need to be made more generalizable to fully leverage the dataset for an AES model.

ICNALE Edited Essays

ICNALE Edited Essays (EE) v3.0 (Ishikawa, 2018) presents rubric-based essay evaluation scores and fully edited versions of essays written by EFL learners from 10 countries in Asia. The essays were evaluated according to five rubrics: content, organization, vocabulary, language use, and mechanics, following the ESL Composition Profile (Jacobs et al., 1981). Even though the essays were written by EFL learners, they are rated and edited by only five native English speakers who are non-experts in the domain of English writing education. In addition, the dataset is not openly accessible and consists of only 639 samples.

The TOEFL11 corpus from ETS (Blanchard et al., 2013) introduced 12K TOEFL iBT essays, which are no longer publicly accessible. TOEFL11 only provides a general score for essays at three levels (low/mid/high), which is insufficient for building a well-performing AES system.

2.1.2 AES Models

Recent AES models can be categorized into two distinct types: holistic scoring models and rubric-based scoring models.

Holistic AES

The majority of previous studies used the ASAP dataset for training and evaluation, aiming to predict only the overall score of the essay (Tay et al., 2018; Cozma et al., 2018; Wang et al., 2018; Yang et al., 2020). Enhanced AI Scoring Engine (EASE, https://github.com/edx/ease) is a commonly used, open-sourced AES system based on feature extraction and statistical methods. In addition, Taghipour and Ng (2016) and Xie et al. (2022) released models based on recurrent neural networks and the neural pairwise contrastive regression (NPCR) model, respectively. However, only a limited number of studies publicly released their models and code, highlighting the need for additional publicly available data and further validation of existing models.

Rubric-based AES

The scarcity of publicly available rubric-based AES datasets poses significant obstacles to the advancement of AES research. There are industry-driven services such as IntelliMetric® (Rudner et al., 2006) and E-rater® (Attali and Burstein, 2006) and datasets from ETS (Blanchard et al., 2013), but none of them are accessible to the public. In order to facilitate AES research in the academic community, it is crucial to release a publicly available rubric-based AES dataset and baseline model.

2.2 Essay Feedback Generation

Feedback Generation

Though recent studies assume that LLMs can facilitate educational innovation by providing real-time and individualized feedback (Yan et al., 2023; Kasneci et al., 2023), to the best of our knowledge no study has addressed detailed approaches for feedback generation in education using LLMs. Peng et al. (2023) demonstrate that LLM performance on task-oriented dialogue and open-domain question answering improves dramatically with access to golden knowledge, suggesting the benefit of incorporating more specific and targeted knowledge into LLMs. This suggests that appropriate golden knowledge, such as rubric explanations and accurate scores for essays, can nudge LLMs to generate better feedback on essay writing.

Feedback Quality Evaluation

Zheng et al. (2023) evaluate the quality of responses of LLM-based assistants to open-ended questions using a holistic approach, considering four criteria: helpfulness, relevance, accuracy, and level of detail. Wang et al. (2023) evaluate responses generated by current LLMs in terms of helpfulness and acceptability, indicating which response is better with win, tie, or lose judgments. Jia et al. (2021) categorize each feature in peer-review comments into three types: suggestion, problem, and positive tone.

3 FABRIC Pipeline

We have examined the specific needs of the stakeholders in EFL education for both scores and feedback on essays through a group interview with six students and a written interview with three instructors. The interview details are in Appendix A.1. Along with AES for essay scores, we propose an essay feedback generation task to meet the needs of EFL learners for immediate and specific feedback on their essays. Specifically, the feedback generation task involves understanding a student’s essay and generating feedback under three rubrics: content, organization, and language (Cumming, 1990; Ozfidan and Mitchell, 2022). The objective is to provide feedback that is helpful, relevant, accurate, and specific (Zheng et al., 2023) for both students and instructors. In this section, we present FABRIC, a serial combination of rubric-based AES models (§3.1) and rubric-based feedback generation using EssayCoT (§3.2).

3.1 Rubric-based AES Models

We fine-tune BERT for each rubric using 1.7K essays from DREsS (§3.1.1), 2.9K essays from standardized data (§3.1.2), and 1.3K essays augmented by CASE (§3.1.3). BERT-based architectures remain the state of the art in AES (Devlin et al., 2019), and there are no significant improvements in AES from using other pre-trained language models (PLMs) (Xie et al., 2022). Experimental results of rubric-based AES with different PLMs are provided in Appendix B.2.
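
A minimal sketch of such a per-rubric scorer is given below, assuming a [CLS]-pooled BERT encoder with a linear regression head trained with MSE. The pooling and head details are illustrative assumptions, not the paper's exact architecture.

```python
# Sketch of a per-rubric scorer: a BERT encoder with a regression head, fine-tuned
# separately for content, organization, and language (architecture details assumed).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class RubricScorer(nn.Module):
    def __init__(self, checkpoint: str = "bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(checkpoint)
        self.head = nn.Linear(self.encoder.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]      # [CLS] representation
        return self.head(cls).squeeze(-1)      # predicted rubric score

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
batch = tokenizer(["An example student essay ..."], truncation=True,
                  padding=True, max_length=512, return_tensors="pt")
model = RubricScorer()
pred = model(batch["input_ids"], batch["attention_mask"])
loss = nn.MSELoss()(pred, torch.tensor([3.5]))  # gold rubric score on the 1-5 scale
```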

3.1.1 Dataset Collection

Dataset Details

DREsS includes 1,782 essays on 22 prompts, with 313.36 words and 21.19 sentences on average. Each sample in DREsS includes the student’s written essay, the essay prompt, rubric-based scores (content, organization, language), a total score, a class division (intermediate, advanced), and a test type (pre-test, post-test). The essays are scored on a range of 1 to 5, in increments of 0.5, based on the three rubrics: content, organization, and language. We chose these three rubrics as standard criteria for scoring EFL essays, following previous studies in language education (Cumming, 1990; Ozfidan and Mitchell, 2022). Detailed explanations of the rubrics are shown in Table 1. The essays were written by undergraduate students enrolled in EFL writing courses at a college in South Korea from 2020 to 2023. Most students are Korean; their ages span from 18 to 22, with an average of 19.7. In this college, there are two divisions of the EFL writing class: intermediate and advanced. The division is based on students’ TOEFL writing scores (15-18 for intermediate and 19-21 for advanced). During the course, students are asked to write an in-class timed essay for 40 minutes both at the start (pre-test) and at the end of the semester (post-test) to measure their improvement.

Table 1: Descriptions of the three scoring rubrics.
Content: Paragraphs are well-developed and relevant to the argument, supported with strong reasons and examples.
Organization: The argument is very effectively structured and developed, making it easy for the reader to follow the ideas and understand how the writer is building the argument. Paragraphs use coherence devices effectively while focusing on a single main idea.
Language: The writing displays sophisticated control of a wide range of vocabulary and collocations. The essay follows grammar and usage rules throughout the paper. Spelling and punctuation are correct throughout the paper.

Annotator Details

We collected scoring data from 11 instructors, who were the teachers of the students who wrote the essays. All annotators are experts in English education or linguistics and are qualified to teach EFL writing courses at a college in South Korea. To ensure consistent and reliable scoring across all instructors, they all participated in training sessions with a scoring guide and norming sessions where they developed a consensus on scores using two sample essays. Additionally, there was no significant difference among the score distributions of the instructors across the whole dataset, tested by one-way ANOVA and Tukey's HSD at a p-value of 0.05.
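
A sketch of this consistency check on toy data is shown below, using scipy's one-way ANOVA and statsmodels' Tukey HSD. The instructor IDs and score values are placeholders, not the actual DREsS annotations.

```python
# Sketch of the annotator-consistency check: one-way ANOVA across instructors'
# score distributions, followed by Tukey's HSD (illustrative data only).
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# scores_by_instructor: {instructor_id: total scores they assigned}
scores_by_instructor = {
    "A": [10.5, 11.0, 9.5, 12.0],
    "B": [10.0, 11.5, 10.5, 11.0],
    "C": [9.5, 10.5, 11.0, 10.0],
}

f_stat, p_value = f_oneway(*scores_by_instructor.values())
print(f"one-way ANOVA: F={f_stat:.3f}, p={p_value:.3f}")  # p > 0.05 -> no significant difference

scores = np.concatenate(list(scores_by_instructor.values()))
groups = np.concatenate([[k] * len(v) for k, v in scores_by_instructor.items()])
print(pairwise_tukeyhsd(scores, groups, alpha=0.05))
```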

3.1.2 Standardizing the Existing Data

We standardize three existing rubric-based datasets to align with the three rubrics in DREsS: content, organization, and language. We unify ASAP sets 7 and 8, which are the only rubric-based prompt sets in ASAP. ASAP prompt set 7 includes four rubrics – ideas, organization, style, and convention – while prompt set 8 contains six rubrics – ideas and content, organization, voice, word choice, sentence fluency, and convention. Both sets provide scores ranging from 0 to 3. For the language rubric, we first create synthetic labels based on a weighted average, assigning a weight of 0.66 to style and 0.33 to convention in set 7, and equal weights to voice, word choice, sentence fluency, and convention in set 8. For the content and organization rubrics, we use the existing rubrics in the dataset (ideas maps to content; organization stays the same). We then rescale the scores of all rubrics into a range of 1 to 5. We repeated the same process with ASAP++ sets 1 and 2, which have the same attributes as ASAP sets 7 and 8. Similarly, for the ICNALE EE dataset, we unify vocabulary, language use, and mechanics into the language rubric with weights of 0.4, 0.5, and 0.1, respectively. In consolidating the writing assessment criteria, we sought professional consultation from EFL education experts and strategically grouped together components that evaluate similar aspects. A minimal sketch of this unification step is shown below.
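
The sketch below illustrates the unification for ASAP prompt set 7 using the weights and score ranges stated above; the helper names are ours, and the other sets follow the same pattern with their own weights.

```python
# Sketch of the unification step for ASAP prompt set 7 (weights and ranges as described
# above); the function and argument names are illustrative.
def rescale(score: float, old_min: float, old_max: float,
            new_min: float = 1.0, new_max: float = 5.0) -> float:
    """Linearly map a rubric score onto the 1-5 DREsS scale."""
    return new_min + (score - old_min) * (new_max - new_min) / (old_max - old_min)

def unify_asap_set7(ideas, organization, style, convention):
    """Map ASAP set 7 rubrics (0-3 each) onto content / organization / language."""
    content = rescale(ideas, 0, 3)
    org = rescale(organization, 0, 3)
    language = rescale(0.66 * style + 0.33 * convention, 0, 3)  # synthetic language label
    return {"content": content, "organization": org, "language": language}

print(unify_asap_set7(ideas=2, organization=3, style=2, convention=1))
```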

3.1.3 Synthetic Data Construction

To overcome the scarcity of data, we construct synthetic data for rubric-based AES. We introduce a corruption-based augmentation strategy for essays (CASE), which starts with a well-written essay and injects a certain proportion of sentence-level errors to create a synthetic essay. In the subsequent experiments, we define a well-written essay as one that scored 4.5 or 5.0 out of 5.0 on each criterion.

Equation (1) relates these quantities: n(S_c), the number of corrupted sentences in the synthetic essay; n(S_E), the number of sentences in the well-written essay that serves as the basis for the synthetic essay; and x_i, the score of the synthetic essay.

Content

We substitute randomly sampled sentences from well-written essays with out-of-domain sentences from different prompts. This is based on the assumption that sentences in well-written essays support the given prompt’s content, meaning that sentences from essays on different prompts convey different content. Therefore, a larger number of substitutions implies a higher level of corruption in the content of the essay.

Organization

We swap two randomly sampled sentences in well-written essays and repeat this process based on the synthetic score, supposing that sentences in well-written essays are systematically structured in order. A larger number of swaps implies a higher level of corruption in the organization of the essay.

Language

We substitute randomly sampled sentences with ungrammatical sentences and repeat this process based on the synthetic score. We extract 605 ungrammatical sentences from the BEA-2019 shared task data for grammatical error correction (GEC) (Bryant et al., 2019). We define ungrammatical sentences as those with more than 10 edits, which is the 98th percentile. The more substitutions, the more corruption is introduced into the grammar of the essay. We set a high threshold for ungrammatical sentences because current GEC datasets may include inherent noise, such as erroneous or incomplete corrections (Rothe et al., 2021). A sketch of one of these corruption strategies is shown below.
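
As referenced above, here is a minimal sketch of the organization-rubric variant of CASE: sentence pairs in a well-written essay are swapped, with the number of swaps assumed to scale linearly with the drop from the maximum score. That linear mapping is our assumption; the exact relation is given by Equation (1) in the paper.

```python
# Sketch of CASE for the organization rubric: swap randomly chosen sentence pairs in a
# well-written essay. The score-to-swap mapping below is an assumed linear scaling.
import random

def corrupt_organization(sentences: list[str], target_score: float,
                         max_score: float = 5.0, min_score: float = 1.0) -> list[str]:
    corrupted = sentences.copy()
    # assumed: the fraction of corrupted sentences grows linearly as the score drops
    frac = (max_score - target_score) / (max_score - min_score)
    n_swaps = round(frac * len(sentences))
    for _ in range(n_swaps):
        i, j = random.sample(range(len(corrupted)), 2)
        corrupted[i], corrupted[j] = corrupted[j], corrupted[i]
    return corrupted

essay = ["First, ...", "Second, ...", "Third, ...", "Finally, ...", "In conclusion, ..."]
synthetic = corrupt_organization(essay, target_score=3.0)
```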

3.1.4 Data Statistics

Table 2: Number of samples per rubric.
Dataset Content Organization Language
DREsS 1,782 1,782 1,782
ASAP P7 1,569 1,569 1,569
ASAP P8 723 723 723
ASAP++ P1 1,785 1,785 1,785
ASAP++ P2 1,800 1,800 1,800
ICNALE EE 639 639 639
CASE 3,924 15,696 981
Total 12,222 23,994 14,845

Table 2 shows the number of samples per rubric. We use these data for training and validating our AES model. They consist of our newly released DREsS dataset, unified samples of existing datasets (ASAP Prompts 7-8, ASAP++ Prompts 1-2, and ICNALE EE), and synthetic data augmented using CASE. In particular, the amount of synthetic data generated with CASE is determined by an ablation study that explores the optimal number of samples.

3.2 EssayCoT

We introduce EssayCoT (Figure 2), a simple but efficient prompting method, to enhance the performance of essay feedback generation. Chain-of-Thought (CoT) prompting (Wei et al., 2022) is a few-shot technique that enhances problem-solving by incorporating intermediate reasoning steps, guiding LLMs toward the final answer. However, it requires significant time and effort to provide human-written few-shot examples. In particular, CoT may not be an efficient approach for essay feedback generation, considering the substantial length of the essay and the feedback. Instead, EssayCoT performs CoT in a zero-shot setting without any additional human effort, since it leverages essay scores automatically predicted by the AES model. It uses the three rubric-based scores on content, organization, and language as a rationale for essay feedback generation.
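
A minimal sketch of how the predicted scores can be folded into the feedback prompt is shown below. The prompt wording is illustrative, not the exact template used in FABRIC.

```python
# Sketch of EssayCoT: the AES model's predicted rubric scores are inserted into the
# feedback-generation prompt as the rationale (prompt wording is illustrative).
def build_essaycot_prompt(essay: str, scores: dict[str, float]) -> str:
    return (
        "You are an EFL writing instructor.\n"
        f"Essay:\n{essay}\n\n"
        "Predicted rubric scores (out of 5): "
        f"content={scores['content']}, organization={scores['organization']}, "
        f"language={scores['language']}.\n"
        "Using these scores as your reasoning, give rubric-based feedback on "
        "content, organization, and language, with concrete suggestions for improvement."
    )

prompt = build_essaycot_prompt("Nowdays, many students ...",
                               {"content": 3.5, "organization": 3.0, "language": 2.5})
```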

4 Experimental Results

In this section, we present the performance of the AES model with CASE (§4.1) and essay feedback generation with EssayCoT (§4.2).

4.1 Automated Essay Scoring

Table 3: AES performance (QWK) with CASE augmentation.
Model Data Content Organization Language Total
gpt-3.5-turbo N/A 0.239 0.371 0.246 0.307
EASE (SVR) DREsS - - - 0.360
NPCR (Xie et al., 2022) DREsS - - - 0.507
BERT (Devlin et al., 2019) DREsS 0.414 0.311 0.487 0.471
  + unified ASAP, ASAP++, ICNALE EE 0.599 0.593 0.587 0.551
    + synthetic data from CASE 0.642 0.750 0.607 0.685

The performance of AES models is mainly evaluated by the consistency between the predicted scores and the gold-standard scores, conventionally measured with the quadratic weighted kappa (QWK) score. Table 3 shows the experimental results with CASE augmentation on the combination of the DREsS dataset and the unified datasets (ASAP, ASAP++, and ICNALE EE). Detailed experimental settings are described in Appendix B.1. Fine-tuned BERT exhibits scalable results as the training data expand. The model trained with the combination of our approaches outperforms the other baseline models by 45.44%, demonstrating its effectiveness. A sketch of the QWK computation is given below.
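
As referenced above, the sketch maps the half-point DREsS scores onto discrete classes for scikit-learn's kappa implementation; the gold and predicted scores are toy values, not experimental results.

```python
# Sketch of the QWK evaluation: half-point scores (1.0-5.0) are mapped to the 9
# discrete classes expected by scikit-learn's kappa implementation.
from sklearn.metrics import cohen_kappa_score

def to_class(score: float) -> int:
    # 1.0 -> 0, 1.5 -> 1, ..., 5.0 -> 8
    return int(round((score - 1.0) / 0.5))

gold = [3.5, 4.0, 2.5, 5.0, 3.0]
pred = [3.5, 3.5, 3.0, 4.5, 3.0]
qwk = cohen_kappa_score([to_class(s) for s in gold],
                        [to_class(s) for s in pred],
                        weights="quadratic")
print(f"QWK = {qwk:.3f}")
```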

The results of existing holistic AES models underscore the need to examine existing AES models on new datasets. The QWK scores of EASE and NPCR drop from 0.699 to 0.360 and from 0.817 to 0.507, respectively, compared to the QWK scores of the models trained on ASAP. This implies that (1) our dataset may be more complex, considering that ASAP has 4-6 score classes while DREsS contains 9 classes for each rubric, with scores ranging from 1 to 5 in increments of 0.5, and 25 classes for the total score, which ranges from 3 to 15, and (2) the existing models might be overfitted to ASAP. Another limitation of these models is their inability to compute rubric-based scores.

Asking gpt-3.5-turbo to score an essay achieved the worst performance of all, showing high variance among essays with the same ground-truth score. Detailed results for ChatGPT in different prompt settings are provided in Table 7 in Appendix B.3.

We perform an ablation study to explore the effects of CASE and to find the optimal number of CASE operations for each rubric. In Figure 3, we investigate how n_aug, the number of synthetic samples generated per class for each original sample, affects the performance across all rubrics for n_aug = {0.125, 0.25, 0.5, 1, 2, 4, 8}. CASE on the content, organization, and language rubrics shows its best performance at n_aug of 0.5, 2, and 0.125, generating 4.5, 18, and 1.125 synthetic essay-score pairs per original sample, respectively. We suppose that the detailed augmentation strategy for each rubric and the small size of the original data affect the optimal number of CASE operations. Organization, where corruption was made within the essay and is irrelevant to the size of the original data, showed the highest n_aug. Content, where the corrupted sentences were sampled from 874 well-written essays with 21.2 sentences on average, reported a higher n_aug than language, where the corrupted sentences were sampled from 605 ungrammatical sentences.

4.2 Essay Feedback Generation

We adapt the evaluation criteria for the quality of LLM responses from Zheng et al. (2023) and re-define them to fit our domain of feedback generation. To overcome the limitation of previous research with holistic evaluation, we assess feedback quality by each criterion:

  • Level of detail: the feedback is specific and supported with details.
  • Accuracy: the feedback provides accurate information according to the essay.
  • Relevance: the feedback is provided according to an understanding of the essay criteria.
  • Helpfulness: the feedback is helpful for students to improve the quality of their writing.

Table 4: Language feedback for the same essay generated with standard prompting vs. EssayCoT prompting.
Standard Prompting: The language used in the essay is sophisticated and displays a wide range of vocabulary and collocations. The writer follows grammar and usage rules throughout the paper, and spelling and punctuation are correct. The essay is well-written and easy to read. Overall, the language used in the essay is strong and effectively communicates the writer’s argument.
EssayCoT Prompting: The essay displays some control of vocabulary and grammar, but there are also several errors and awkward phrasings throughout. For example, the writer uses ‘nowdays’ instead of ‘nowadays’, and there are several instances of awkward phrasing and word choice. Additionally, there are several errors in punctuation and capitalization. The essay would benefit from more careful proofreading and editing, as well as more attention to precise and effective language use.

Table 4 shows two different language feedback examples for the same essay, which has a language score of 2.5 out of 5.0. These examples are generated using different prompts: a standard prompt without the score and an EssayCoT prompt with the score. The green text indicates detailed support and examples drawn from the essay (level of detail), and the blue text describes the overall evaluation of the essay regarding the language criterion. Comparing the blue text, EssayCoT suggests improvements (helpfulness) such as ‘errors and awkward phrasing’ and ‘punctuation and capitalization’, while standard prompting only praises the language use, e.g., ‘vocabulary and collocations’. Considering that the language score of the essay is 2.5 out of 5.0, the feedback generated by EssayCoT appears to be more accurate. The orange text in the feedback generated by the standard prompt is irrelevant to the language criterion (relevance) and resembles the organization explanation in Table 1.

To evaluate the quality of the feedback generated through these two prompting techniques (standard vs. EssayCoT), we recruited 13 English education experts holding a Secondary School Teacher’s Certificate (Grade II) for English Language, licensed by the Ministry of Education, Korea. These annotators were asked to evaluate both types of rubric-based feedback for the same essay on a 7-point Likert scale for each rubric. They were then asked to indicate their general preference between the two feedback types with three options: A is better, B is better, or no difference. We randomly sampled 20 essays, balancing the total scores of the essays, and allocated 7 annotators to each essay.

Results show that 52.86% of the annotators prefer feedback from EssayCoT prompting, compared to only 28.57% who prefer feedback from standard prompting. The remaining 18.57% reported no difference between the two types of feedback. The difference is statistically significant at a p-value of < 0.05 using the Chi-squared test. Figure 4 presents further evaluation results on the two types of feedback. EssayCoT prompting performs better in terms of accuracy, relevance, and especially helpfulness, which achieves statistical significance across all rubrics. Feedback from standard prompting without essay scores tends to offer general compliments rather than criticisms and suggestions, whereas EFL learners prefer constructive corrective feedback over positive feedback, according to the qualitative interview described in Appendix A.1.
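
A sketch of this significance test is shown below. The vote counts are back-computed from the reported percentages over the 140 judgments (20 essays × 7 annotators) and are therefore approximations, not the paper's raw data.

```python
# Sketch of the preference significance test: a chi-squared goodness-of-fit test
# against a uniform null; counts are approximations derived from the percentages.
from scipy.stats import chisquare

votes = {"EssayCoT better": 74, "Standard better": 40, "No difference": 26}  # ~52.86/28.57/18.57%
stat, p = chisquare(list(votes.values()))   # expected: equal preference for each option
print(f"chi2 = {stat:.2f}, p = {p:.4f}")    # p < 0.05 -> preferences differ significantly
```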

The only area where standard prompting performed better was the level of detail in the content feedback. This suggests that standard prompting, by prompting without scores, allows a higher degree of freedom, which enables it to generate more detailed feedback. Nevertheless, as it scored worse on all other criteria, we suppose that this freedom was not particularly helpful for essay feedback generation. The comparison of content feedback in Appendix B.4 shows that standard prompting only provided a specific summary of the essay instead of suggestions or criticisms. Furthermore, it even provided inaccurate information in the language feedback: as shown in Table 4, the feedback generated with standard prompting incorrectly indicated that the spelling is correct.

5 Prototype Deployment and Evaluation

We deployed our pipeline in the college’s EFL writing courses using the RECIPE platform (Han et al., 2023) to investigate both its usage and its perception by students and instructors. The participants in our study were 33 students from EFL writing courses (intermediate: 11, advanced: 22). The student cohort comprises 32 Korean students and 1 Italian student, of whom 12 are female and 21 are male.

Students were asked to self-assess their essays, given a short description of each rubric as a guide. Subsequently, they received the scores and feedback generated by our system. They then evaluated the helpfulness of the scores and feedback and engaged further by having a conversation with ChatGPT to better understand the feedback and improve their essays. The detailed set of questions posed to students is described in Appendix C.1.1.

Figure 5 presents the responses of the EFL writing course students regarding the perceived performance of, and their learning experiences with, the outputs of our pipeline. On average, students rated the performance of the AES model as well as the style and quality of the generated feedback at 6 out of 7 (Figure 5(a)). They reported that their confidence in their essay quality and their understanding of each writing rubric improved significantly due to their engagement with our platform embedded with FABRIC (Figure 5(b)).

6 Discussion

In this work, we propose a fully automated pipeline capable of scoring and generating feedback on students’ essays. As we investigated in §A.1 and simulated in §5, this pipeline can assist both EFL learners and instructors by generating rubric-based scores and feedback promptly. In this section, we discuss plausible usage scenarios to advance FABRIC and integrate the pipeline into general education contexts.

Human-in-the-Loop Pipeline

Though our main finding shows the possibility of fully automating essay scoring and feedback generation, we also suggest a direct extension of our work: integrating a human-in-the-loop component into the pipeline to enhance the teaching and learning experience. As instructors can modify the style or content of the feedback, FABRIC can be enhanced by implementing a personalized feedback generation model that aligns seamlessly with instructors’ pedagogical objectives and teaching styles. Students can thereby receive feedback that is more trustworthy and reliable, which empowers them to engage actively in the learning process. In addition, feedback generation can be developed to provide personalized feedback aligned with students’ differing needs and learning styles.

Check for Students’ Comprehension

Instructors can incorporate our pipeline into their class materials to identify recurring issues in students’ essays, potentially saving significant time compared to manual reviews. Our pipeline can be effectively used to detect similar feedback provided to a diverse set of students, which often indicates common areas of difficulty. By identifying these common issues, instructors can create targeted, customized, individualized educational content that addresses the specific needs of their students, thereby enhancing the overall learning experience.

7 Conclusion

In conclusion, this paper contributes to English writing education by releasing new data, introducing novel augmentation strategies for automated essay scoring, and proposing EssayCoT prompting for essay feedback generation. Recognizing the limitations of previous holistic AES studies, we present DREsS, a dataset specifically designed for rubric-based essay scoring. Additionally, we suggest CASE, corruption-based augmentation strategies for essays, which utilizes DREsS to generate pairs of synthetic essays and corresponding scores by injecting feasible sentence-level errors. Through in-depth focus group interviews with EFL learners, we identify a strong demand for both scores and feedback in EFL writing education, leading us to define a novel task, essay feedback generation. To address this task, we propose FABRIC, a comprehensive pipeline for score and feedback generation on student essays, employing essay scores for feedback generation via Essay Chain-of-Thought (EssayCoT) prompting. Our results show that the data augmented with CASE significantly improve the performance of AES, achieving QWK scores of about 0.6, and that feedback generated by EssayCoT prompting with essay scores is significantly preferred over standard prompting by English education experts. We finally deployed our FABRIC pipeline in real-world EFL writing education, exploring students’ practical use of AI-generated scores and feedback. We envision several scenarios for the implementation of our proposed pipeline in real-world classrooms, taking human-computer interaction into consideration. This work aims to inspire researchers and practitioners to delve deeper into NLP-driven innovation in English writing education, with the ultimate goal of advancing the field.

Limitations

Our augmentation strategy primarily starts from well-written essays and generates erroneous essays with corresponding scores, so it is challenging to synthesize additional well-written essays with our method. We believe that well-written essays can be reliably produced by LLMs, which have demonstrated strong writing capabilities, especially in English.

We utilize ChatGPT, a black-box language model, for feedback generation. As a result, our pipeline lacks transparency and does not provide explicit justifications or rationales for the feedback generated. We acknowledge the need for further research to develop models that produce more explainable feedback, leaving room for future exploration.

Ethics Statement

We expect that this paper can considerably contribute to the development of NLP for good within the field of NLP-driven assistance in EFL writing education. All studies in this research project were performed under our institutional review board (IRB) approval. We have thoroughly addressed ethical considerations throughout our study, focusing on (1) collecting essays from students, (2) validating our pipeline in EFL courses, and (3) releasing the data.

To prevent any potential effect on their scores or grades, we asked the students to share the essays they had written during the course only after the EFL courses had ended.

There was no discrimination in recruiting and selecting EFL students and instructors with regard to any demographics, including gender and age. We set the wage per session to be above the minimum wage in the Republic of Korea in 2023 (KRW 9,260 ≈ USD 7.25; https://www.minimumwage.go.kr/). Participants were free to participate in or drop out of the experiment, and their decision did not affect the scores or grades they received.

We deeply considered the potential risks associated with releasing a dataset containing human-written essays in terms of privacy and personal information. We will filter out all sensitive information related to privacy and personal information through (1) rule-based code and (2) human inspection. To address this concern, we will also require a checklist, and only researchers or practitioners who submit the checklist will be able to access our data.

  • Attali and Burstein (2006) Yigal Attali and Jill Burstein. 2006. Automated essay scoring with e-rater® v.2 . The Journal of Technology, Learning and Assessment , 4(3).
  • Beltagy et al. (2020) Iz Beltagy, Matthew E. Peters, and Arman Cohan. 2020. Longformer: The long-document transformer .
  • Black et al. (2022) Sidney Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, Usvsn Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. 2022. GPT-NeoX-20B: An open-source autoregressive language model . In Proceedings of BigScience Episode #5 – Workshop on Challenges & Perspectives in Creating Large Language Models , pages 95–136, virtual+Dublin. Association for Computational Linguistics.
  • Blanchard et al. (2013) Daniel Blanchard, Joel Tetreault, Derrick Higgins, Aoife Cahill, and Martin Chodorow. 2013. TOEFL11: A corpus of non-native English. ETS Research Report Series, 2013(2):i–15.
  • Bryant et al. (2019) Christopher Bryant, Mariano Felice, Øistein E. Andersen, and Ted Briscoe. 2019. The BEA-2019 shared task on grammatical error correction . In Proceedings of the Fourteenth Workshop on Innovative Use of NLP for Building Educational Applications , pages 52–75, Florence, Italy. Association for Computational Linguistics.
  • Cozma et al. (2018) Mădălina Cozma, Andrei Butnaru, and Radu Tudor Ionescu. 2018. Automated essay scoring with string kernels and word embeddings . In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) , pages 503–509, Melbourne, Australia. Association for Computational Linguistics.
  • Cumming (1990) Alister Cumming. 1990. Expertise in evaluating second language compositions . Language Testing , 7(1):31–51.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding . In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers) , pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  • Han et al. (2023) Jieun Han, Haneul Yoo, Yoonsu Kim, Junho Myung, Minsun Kim, Hyunseung Lim, Juho Kim, Tak Yeon Lee, Hwajung Hong, So-Yeon Ahn, and Alice Oh. 2023. RECIPE: How to integrate ChatGPT into EFL writing education .
  • Ishikawa (2018) Shinichiro Ishikawa. 2018. The ICNALE edited essays: A dataset for analysis of L2 English learner essays based on a new integrative viewpoint. English Corpus Studies, 25:117–130.
  • Jacobs et al. (1981) Holly Jacobs, Stephen Zinkgraf, Deanna Wormuth, V. Hearfiel, and Jane Hughey. 1981. Testing ESL Composition: a Practical Approach . ERIC.
  • Jia et al. (2021) Qinjin Jia, Jialin Cui, Yunkai Xiao, Chengyuan Liu, Parvez Rashid, and Edward F. Gehringer. 2021. All-in-one: Multi-task learning BERT models for evaluating peer assessments.
  • Karathanos and Mena (2009) K Karathanos and DD Mena. 2009. Enhancing the academic writing skills of ell future educators: A faculty action research project. English learners in higher education: Strategies for supporting students across academic disciplines , pages 1–13.
  • Kasneci et al. (2023) Enkelejda Kasneci, Kathrin Sessler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, Stepha Krusche, Gitta Kutyniok, Tilman Michaeli, Claudia Nerdel, Jürgen Pfeffer, Oleksandra Poquet, Michael Sailer, Albrecht Schmidt, Tina Seidel, Matthias Stadler, Jochen Weller, Jochen Kuhn, and Gjergji Kasneci. 2023. Chatgpt for good? on opportunities and challenges of large language models for education . Learning and Individual Differences , 103:102274.
  • Mathias and Bhattacharyya (2018) Sandeep Mathias and Pushpak Bhattacharyya. 2018. ASAP++: Enriching the ASAP automated essay grading dataset with essay attribute scores . In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018) , Miyazaki, Japan. European Language Resources Association (ELRA).
  • Ozfidan and Mitchell (2022) Burhan Ozfidan and Connie Mitchell. 2022. Assessment of students’ argumentative writing: A rubric development . Journal of Ethnic and Cultural Studies , 9(2):pp. 121–133.
  • Peng et al. (2023) Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, and Jianfeng Gao. 2023. Check your facts and try again: Improving large language models with external knowledge and automated feedback .
  • Rothe et al. (2021) Sascha Rothe, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. 2021. A simple recipe for multilingual grammatical error correction . In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers) , pages 702–707, Online. Association for Computational Linguistics.
  • Rudner et al. (2006) Lawrence M. Rudner, Veronica Garcia, and Catherine Welch. 2006. An evaluation of intellimetric™ essay scoring system . The Journal of Technology, Learning and Assessment , 4(4).
  • Sidman-Taveau and Karathanos-Aguilar (2015) Rebekah Sidman-Taveau and Katya Karathanos-Aguilar. 2015. Academic writing for graduate-level english as a second language students: Experiences in education. The CATESOL Journal , 27(1):27–52.
  • Sun and Fan (2022) Bo Sun and Tingting Fan. 2022. The effects of an awe-aided assessment approach on business english writing performance and writing anxiety: A contextual consideration. Studies in Educational Evaluation , 72:101123.
  • Taghipour and Ng (2016) Kaveh Taghipour and Hwee Tou Ng. 2016. A neural approach to automated essay scoring . In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing , pages 1882–1891, Austin, Texas. Association for Computational Linguistics.
  • Tay et al. (2018) Yi Tay, Minh Phan, Luu Anh Tuan, and Siu Cheung Hui. 2018. Skipflow: Incorporating neural coherence features for end-to-end automatic text scoring . Proceedings of the AAAI Conference on Artificial Intelligence , 32(1).
  • Wang et al. (2023) Hongru Wang, Rui Wang, Fei Mi, Zezhong Wang, Ruifeng Xu, and Kam-Fai Wong. 2023. Chain-of-thought prompting for responding to in-depth dialogue questions with llm .
  • Wang et al. (2018) Yucheng Wang, Zhongyu Wei, Yaqian Zhou, and Xuanjing Huang. 2018. Automatic essay scoring incorporating rating schema via reinforcement learning . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing , pages 791–797, Brussels, Belgium. Association for Computational Linguistics.
  • Wei et al. (2022) Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, brian ichter, Fei Xia, Ed Chi, Quoc V Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models . In Advances in Neural Information Processing Systems , volume 35, pages 24824–24837. Curran Associates, Inc.
  • Xie et al. (2022) Jiayi Xie, Kaiwei Cai, Li Kong, Junsheng Zhou, and Weiguang Qu. 2022. Automated essay scoring via pairwise contrastive regression . In Proceedings of the 29th International Conference on Computational Linguistics , pages 2724–2733, Gyeongju, Republic of Korea. International Committee on Computational Linguistics.
  • Yan et al. (2023) Lixiang Yan, Lele Sha, Linxuan Zhao, Yuheng Li, Roberto Martinez-Maldonado, Guanliang Chen, Xinyu Li, Yueqiao Jin, and Dragan Gašević. 2023. Practical and ethical challenges of large language models in education: A systematic literature review .
  • Yang et al. (2020) Ruosong Yang, Jiannong Cao, Zhiyuan Wen, Youzheng Wu, and Xiaodong He. 2020. Enhancing automated essay scoring performance via fine-tuning pre-trained language models with combination of regression and ranking . In Findings of the Association for Computational Linguistics: EMNLP 2020 , pages 1560–1569, Online. Association for Computational Linguistics.
  • Zaheer et al. (2020) Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, and Amr Ahmed. 2020. Big bird: Transformers for longer sequences . In Advances in Neural Information Processing Systems , volume 33, pages 17283–17297. Curran Associates, Inc.
  • Zheng et al. (2023) Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric. P Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena .

Appendix A Qualitative Interview

A.1 Content of the Interview

While most existing NLP-driven EFL writing assistance tools focus on automated scoring, we were interested in how much EFL learners and instructors also want feedback (i.e., rationales for an essay score). We conducted a group interview with six EFL learners, each of whom had taken at least one undergraduate EFL writing course. While only 2 of them had received feedback from their instructors before, all of them expressed a strong need for both rubric-based scores and feedback. 4 out of 6 students preferred specific and constructive feedback for improving their essays, as opposed to positive remarks without critical analysis. In addition, they were particularly interested in getting scores and feedback instantaneously so that they can learn the weaknesses of their essays in the same context as their writing and refine them through an iterative process.

To examine the need for feedback generation from the perspective of EFL instructors, we conducted a written interview with three instructors who are currently teaching at a college EFL center. We provided a sample essay accompanied by four different AI-generated scores and feedback, which varied based on whether a score was provided, whether feedback was given, and the type of feedback. The instructors expressed diverse preferences about giving scores and providing auto-generated feedback. Two of them expressed concerns that the feedback may be irrelevant to the class topic or fail to consider the open-ended nature of essay feedback. Another concern was that giving too much feedback would lower the educational effect by overwhelming the students. Despite these concerns, all the instructors expressed a strong need for objective scoring. P1 and P3 mentioned that they would like to use the AI-generated scores and feedback as a comparison to double-check whether their own scores need to be adjusted.

A.2 Interview Questionnaire

For the midterm and final essays, most of the instructors provide only scores without any feedback. Please tell us the pros and cons of this. Please share your various needs for grade provision.

Which of the following six types of feedback do you prefer and why? (Direct feedback, Indirect feedback, Metalinguistic CF, The focus of the feedback, Electronic feedback, Reformulation)

What kind of feedback would you like to receive among the feedback listed below? (Positive feedback, Constructive feedback, Questioning feedback, Reflective feedback, Specific feedback, Comparative feedback, Collaborative feedback)

What was the hardest thing about evaluating students’ essays? What efforts other than rubric, norming sessions were needed to minimize the subjectivity of scores? Please feel free to tell us about ways to improve the shortcomings you mentioned.

The example is the student essay and AI scoring results. Among the four scoring and feedback examples above, please list them in order of help in improving students’ learning effect.

If the AI model provides both a score and feedback for a student’s essay, what advantages do you think it would have over providing only a score? If the AI model provides both a score and feedback for a student’s essay, what advantages do you think it will have over providing feedback alone?

What kinds of feedback can be good for enhancing the learning effect? Are there any concerns about providing score and feedback through AI?

What part of the grade processing process would you like AI to help with?

After looking at the score and feedback generated by the AI model, do you think your own score could change? How would you use them in your scoring?

Appendix B Supplemental Results

B.1 Experimental Settings

We split our data into train/dev/test sets with a 6:2:2 ratio using a seed of 22. The AES experiments were conducted on GeForce RTX 2080 Ti GPUs (4 GPUs), 128 GiB of system memory, and an Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz (20 CPU cores), with the hyperparameters denoted in Table 5.

Table 5. Hyperparameters for the AES experiments.

Hyperparameter | Value
Batch size | 32
Number of epochs | 10
Early stopping patience | 5
Learning rate | 2e-5
Learning rate scheduler | Linear
Optimizer | AdamW
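As a reading aid, here is a minimal sketch of a fine-tuning loop matching the settings in Table 5. It is not the released code: it assumes a transformers-style regression model that returns a loss when labels are provided, pre-built `train_loader` and `dev_loader` DataLoaders (batch size 32), and an `evaluate` helper, all of which are illustrative assumptions.

```python
from torch.optim import AdamW
from transformers import AutoModelForSequenceClassification, get_linear_schedule_with_warmup

# Regression head (num_labels=1); the checkpoint name is illustrative.
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=1, problem_type="regression")

EPOCHS, PATIENCE, LR = 10, 5, 2e-5

optimizer = AdamW(model.parameters(), lr=LR)
total_steps = EPOCHS * len(train_loader)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=total_steps)

best_dev, patience_left = float("inf"), PATIENCE
for epoch in range(EPOCHS):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        loss = model(**batch).loss          # regression loss on the essay scores
        loss.backward()
        optimizer.step()
        scheduler.step()

    dev_loss = evaluate(model, dev_loader)  # assumed helper returning mean dev loss
    if dev_loss < best_dev:                 # early stopping with patience 5
        best_dev, patience_left = dev_loss, PATIENCE
    else:
        patience_left -= 1
        if patience_left == 0:
            break
```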

B.2 Rubric-based AES with Different LMs

Table 6. Rubric-based AES results with different LMs.

Model | Content | Organization | Language | Total
BERT | 0.414 | 0.311 | 0.487 | 0.471
Longformer | 0.409 | 0.312 | 0.475 | 0.463
BigBird | 0.412 | 0.317 | 0.473 | 0.469
GPT-NeoX | 0.410 | 0.313 | 0.446 | 0.475

Experimental results of rubric-based AES with different LMs are provided in Table 6, showing no significant difference among the LMs. Xie et al. (2022) also observed that the choice of LM has no significant effect on AES performance, and most state-of-the-art AES methods have leveraged BERT (Devlin et al., 2019).

B.3 Rubric-based AES with ChatGPT

Table 7 shows AES results of ChatGPT with different prompts described in Table 8 . Considering the substantial length of the essay and feedback, we were able to provide a maximum of 2 shots for the prompt to gpt-3.5-turbo . To examine 2-shot prompting performance, we divided the samples into two distinct groups and computed the average total score for each group. Subsequently, we randomly sampled a single essay in each group, ensuring that its total score corresponded to the calculated mean value.

Table 7. AES results of ChatGPT with the prompts in Table 8.

Prompt | Content | Organization | Language | Total
(A) | 0.320 | 0.248 | 0.359 | 0.336
(B) | 0.330 | 0.328 | 0.306 | 0.346
(C) | 0.357 | 0.278 | 0.342 | 0.364
(D) | 0.336 | 0.361 | 0.272 | 0.385

Table 8. Prompt templates used with ChatGPT.

(A)
Q. Please score the essay with three rubrics: content, organization, and language.
### Answer format: {content: score[x], organization: score[y], language: score[z]}
score = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Please answer only in the above dictionary format, without feedback.
### prompt: <essay prompt>
### essay: <student’s essay>
A:

(B)
Q. Please score the essay with three rubrics: content, organization, and language.
### Answer format: {content: score[x], organization: score[y], language: score[z]}
score = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Please answer only in the above dictionary format, without feedback.
### 1-shot example:
### 2-shot example:
### prompt: <essay prompt>
### essay: <student’s essay>
A:

(C)
Q. Please score the essay with three rubrics: content, organization, and language.
<three rubrics explanation>
### Answer format: {content: score[x], organization: score[y], language: score[z]}
score = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Please answer only in the above dictionary format, without feedback.
### prompt: <essay prompt>
### essay: <student’s essay>
A:

(D)
Q. Please score the essay with three rubrics: content, organization, and language.
### Answer format: {content: score[x], organization: score[y], language: score[z], content_fbk: chatgpt_con_fbk, org_fbk: chatgpt_org_fbk, lang_fbk: chatgpt_lang_fbk}
score = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Please answer only in the above dictionary format, with feedback.
### prompt: <essay prompt>
### essay: <student’s essay>
A:
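For concreteness, the following is a hedged sketch of how prompt (A) above could be sent to gpt-3.5-turbo and the returned dictionary parsed. The client call uses the standard OpenAI Python SDK; the helper name and the regex are illustrative assumptions rather than the authors' implementation.

```python
import re
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_A = """Q. Please score the essay with three rubrics: content, organization, and language.
### Answer format: {content: score[x], organization: score[y], language: score[z]}
score = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
Please answer only in the above dictionary format, without feedback.
### prompt: <essay prompt>
### essay: <student's essay>
A:"""

def score_essay(essay_prompt: str, essay: str) -> dict:
    """Fill the template, query gpt-3.5-turbo, and pull out the three rubric scores."""
    filled = PROMPT_A.replace("<essay prompt>", essay_prompt).replace("<student's essay>", essay)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": filled}],
        temperature=0,
    )
    text = resp.choices[0].message.content
    # The reply is expected to look like {content: 3.5, organization: 3.0, language: 4.0}.
    return {k: float(v) for k, v in
            re.findall(r"(content|organization|language)\s*:\s*([\d.]+)", text)}
```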

B.4 Sample-level Analysis on Essay Feedback Generation

The green text in Table 9 represents detailed support and examples, while the pink text indicates improvements for the essay; these align with the level-of-detail and helpfulness criteria, respectively. Feedback generated with standard prompting describes specific details of the content of the essay. However, it only summarizes the content of the essay and provides praise instead of suggestions or criticism. In contrast, feedback with EssayCoT prompting points out that ‘the essay lacks depth and analysis’ and that the writer ‘could benefit from expanding on their ideas and providing more specific examples’.

Table 9. Feedback generated with standard prompting and EssayCoT prompting.

Standard Prompting: The content of the essay is well-developed and relevant to the argument. The writer has identified three major problems in their country and has provided specific reasons and details to support their opinion. The writer has effectively highlighted the gender problem, inequality problem, and offensive replying culture in some communities. The writer has also provided examples of how these problems are threatening their society and splitting people up. Overall, the content of the essay is strong and persuasive.

EssayCoT Prompting: The essay has some relevant points, but the content is not well-developed. The writer mentions three problems in Korea, but only briefly touches on each one without providing strong reasons or examples to support their argument. The essay lacks depth and analysis, and the writer could benefit from expanding on their ideas and providing more specific details.

Appendix C Study Design for Prototype Deployment and Evaluation

C.1 Students

C.1.1 Survey Design

Before participating in the experiments, students were asked to write an argumentative academic essay based on a given prompt and submit it. In each session of the experiment, students (1) respond to the pre-survey asking about their experience in receiving scores and feedback in their EFL writing class, (2) self-assess their essay based on the description of each rubric, (3) evaluate AI-generated rubric-based scores and feedback on their essay, (4) interact with ChatGPT based on the scores and feedback from (3) to improve their essay via the RECIPE platform, and finally (5) respond to the post-survey asking about their thoughts on the session. Students can repeat this process and participate in multiple sessions.

The following paragraphs describe survey questions asked to students in each step of the session.

These are questions asking about your experience in receiving scores and feedback on your assignments in the writing class.

I think that the number of times I received scores in this course was sufficient.

After submission, I did not have to wait long until I received scores for my essay.

Did you receive any feedback (excluding scores) on your essay from the instructor?

I think that the number of times I received feedback in this course was sufficient.

After submission, I did not have to wait long until I received feedback for my essay.

I was satisfied with the quality of the feedback I received from the instructor.

Scores and Feedback

Please rate your essay for each rubric on a scale of 1 to 5: ${content, organization, language rubric explanation} .

Now, you will look at the score and feedback given by the AI model based on your previous essay. Please answer the questions. Also, please be aware that the score is not your final score given by the instructor. It is a score generated by an AI model, and hence can be different from the score given by your instructor. ${rubric}

I agree with this score provided by the AI model

I agree with this feedback provided by the AI model

Pipeline Evaluation

Please answer these questions AFTER finishing the scoring exercise.

I think that the style or tone of the AI-based feedback was appropriate.

Please rate the overall performance of AI-based scoring.

Please rate the overall quality of AI-based feedback.

What made AI-based feedback satisfactory to you? (choose all that apply)

Please freely share your thoughts on AI-based scoring/feedback.

Post-survey

Please answer these questions AFTER finishing the main exercise.

My confidence on the quality of the essay increased after the exercise.

My understanding on the content criteria increased after the exercise.

My understanding on the organization criteria increased after the exercise.

My understanding on the language criteria increased after the exercise.

Please freely share your thoughts regarding the exercise.


Automated Essay Scoring via Pairwise Contrastive Regression

Jiayi Xie , Kaiwei Cai , Li Kong , Junsheng Zhou , Weiguang Qu . Automated Essay Scoring via Pairwise Contrastive Regression . In Nicoletta Calzolari , Chu-Ren Huang , Hansaem Kim , James Pustejovsky , Leo Wanner , Key-Sun Choi , Pum-Mo Ryu , Hsin-Hsi Chen , Lucia Donatelli , Heng Ji , Sadao Kurohashi , Patrizia Paggio , Nianwen Xue , Seokhwan Kim , YoungGyun Hahm , Zhong He , Tony Kyungil Lee , Enrico Santus , Francis Bond , Seung-Hoon Na , editors, Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022 . pages 2724-2733 , International Committee on Computational Linguistics, 2022. [doi]


Automated essay scoring (AES) involves the prediction of a score relating to the writing quality of an essay. Most existing works in AES utilize regression objectives or ranking objectives respectively. However, the two types of methods are highly complementary. To this end, in this paper we take inspiration from contrastive learning and propose a novel unified Neural Pairwise Contrastive Regression (NPCR) model in which both objectives are optimized simultaneously as a single loss. Specifically, we first design a neural pairwise ranking model to guarantee the global ranking order in a large list of essays, and then we further extend this pairwise ranking model to predict the relative scores between an input essay and several reference essays. Additionally, a multi-sample voting strategy is employed for inference. We use Quadratic Weighted Kappa to evaluate our model on the public Automated Student Assessment Prize (ASAP) dataset, and the experimental results demonstrate that NPCR outperforms previous methods by a large margin, achieving the state-of-the-art average performance for the AES task.
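To make the pairwise idea in the abstract concrete, here is a hedged sketch (not the authors' released code) of the inference step: a relative-score model compares the input essay with several reference essays of known score, and the implied absolute scores are averaged as a simple multi-sample vote. The `rel_model` callable and the variable names are assumptions.

```python
import torch

def npcr_style_inference(rel_model, input_vec, ref_vecs, ref_scores):
    """Estimate an absolute score by voting over several scored reference essays.

    rel_model(input_vec, ref_vec) is assumed to return the predicted relative
    score (input minus reference) as a scalar tensor.
    """
    estimates = []
    for ref_vec, ref_score in zip(ref_vecs, ref_scores):
        delta = rel_model(input_vec, ref_vec)   # predicted relative score
        estimates.append(ref_score + delta)     # implied absolute score of the input
    return torch.stack(estimates).mean()        # multi-sample voting by averaging
```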


Enhanced cross-prompt trait scoring via syntactic feature fusion and contrastive learning

  • Published: 27 September 2023
  • Volume 80 , pages 5390–5407, ( 2024 )


  • Jingbo Sun 1 ,
  • Weiming Peng 2 , 3 ,
  • Tianbao Song 4 ,
  • Haitao Liu 1 ,
  • Shuqin Zhu 5 &
  • Jihua Song 1  


Automated essay scoring aims to evaluate the quality of an essay automatically. It is one of the main educational applications in the field of natural language processing. Recently, the research scope has been extended from prompt-specific scoring to cross-prompt scoring, further concentrating on scoring different traits. However, cross-prompt trait scoring requires identifying inner relations, domain knowledge, and trait representations, as well as dealing with insufficient training data for specific traits. To address these problems, we propose an RDCTS model that employs contrastive learning and utilizes Kullback–Leibler divergence to measure the similarity of positive and negative samples, and we design a feature fusion algorithm that combines POS and syntactic features instead of using single text-attribute features as input for the neural AES system. We incorporate implicit data augmentation by adding dropout layers to the word level and sentence level of the hierarchical model to mitigate the effects of limited data. Experimental results show that our RDCTS achieves state-of-the-art performance and greater consistency.
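As an illustration only (not the RDCTS implementation), the snippet below shows two generic ingredients named in the abstract: a symmetric Kullback–Leibler divergence between two predicted distributions, and R-drop-style implicit augmentation obtained from two stochastic forward passes with dropout active. The `model` callable and all names are assumptions.

```python
import torch
import torch.nn.functional as F

def symmetric_kl(p_logits: torch.Tensor, q_logits: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between two predicted distributions."""
    p = F.log_softmax(p_logits, dim=-1)
    q = F.log_softmax(q_logits, dim=-1)
    return 0.5 * (F.kl_div(p, q, reduction="batchmean", log_target=True)
                  + F.kl_div(q, p, reduction="batchmean", log_target=True))

def rdrop_term(model, batch):
    """Two forward passes over the same essay; dropout makes each pass stochastic."""
    logits_1 = model(**batch)   # assumes model.train() so dropout is active
    logits_2 = model(**batch)
    return symmetric_kl(logits_1, logits_2)
```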





Acknowledgements

The authors would like to thank the anonymous reviewers for providing helpful comments to improve the quality of the article.

This work was supported by the Beijing Natural Science Foundation (Grant No. 4234081), the National Natural Science Foundation of China (Grants No. 61877004 and No. 62007004), the Major Program of National Social Science Foundation of China (Grant No. 18ZDA295) and the 2021 International Chinese education research project of Center for Language Education and Cooperation of the Ministry of Education of China (Grant No. 21YH53C).

Author information

Authors and Affiliations

School of Artificial Intelligence, Beijing Normal University, No. 19, Xinjiekouwai St, Haidian District, Beijing, 100875, Beijing, China

Jingbo Sun, Haitao Liu & Jihua Song

Chinese Character Research and Application Laboratory, Beijing Normal University, Beijing, 100875, Beijing, China

Weiming Peng

Linguistic Data Consortium, University of Pennsylvania, Philadelphia, PA, 19104, USA

School of Computer Science and Engineering, Beijing Technology and Business University, No. 11, No. 33, Fucheng Road, Haidian District, Beijing, 100048, Beijing, China

Tianbao Song

Teachers’ College, Beijing Union University, No. 5, Waiguanxie St, Chaoyang District, Beijing, 100011, Beijing, China


Contributions

JS (Jingbo Sun) contributed to conceptualization and methodology; JS (Jingbo Sun) and TS performed writing (original draft preparation); HL, SZ, JS (Jihua Song) and WP performed writing (review and editing). All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Jingbo Sun or Jihua Song .

Ethics declarations

Ethical approval.

Not applicable.

Conflict of interest

The authors declare that they have no conflicts of interest.

Availability of data and material

The ASAP dataset that supports the findings of this study is available in https://www.kaggle.com/c/asap-aes/data .


About this article

Sun, J., Peng, W., Song, T. et al. Enhanced cross-prompt trait scoring via syntactic feature fusion and contrastive learning. J Supercomput 80 , 5390–5407 (2024). https://doi.org/10.1007/s11227-023-05640-2

Download citation

Accepted : 30 August 2023

Published : 27 September 2023

Issue Date : March 2024

DOI : https://doi.org/10.1007/s11227-023-05640-2


  • Automated essay scoring
  • Natural language processing
  • Contrastive learning
  • Data augmentation
  • Information fusion


Improving Automated Essay Scoring by Prompt Prediction and Matching

1 School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China

Tianbao Song

2 School of Computer Science and Engineering, Beijing Technology and Business University, Beijing 100048, China

Weiming Peng

Associated Data

Publicly available datasets were used in this study. These data can be found here: http://hsk.blcu.edu.cn/ (accessed on 6 March 2022).

Automated essay scoring aims to evaluate the quality of an essay automatically. It is one of the main educational applications in the field of natural language processing. Recently, pre-training techniques have been used to improve performance on downstream tasks, and many studies have attempted to use the pre-training-then-fine-tuning mechanism in essay scoring systems. However, obtaining better features, such as prompt features, from the pre-trained encoder is critical but not fully studied. In this paper, we create a prompt feature fusion method that is better suited for fine-tuning. In addition, we use multi-task learning by designing two auxiliary tasks, prompt prediction and prompt matching, to obtain better features. The experimental results show that both auxiliary tasks can improve model performance, and the combination of the two auxiliary tasks with the NEZHA pre-trained encoder produces the best results, with Quadratic Weighted Kappa improving by 2.5% and Pearson’s Correlation Coefficient by 2% on average across all results on the HSK dataset.

1. Introduction

Automated essay scoring (AES), which aims to automatically evaluate and score essays, is one typical application of natural language processing (NLP) techniques in the field of education [1]. Earlier studies used a combination of handcrafted features and statistical machine learning [2, 3], and with the development of deep learning, neural network-based approaches gradually became mainstream [4, 5, 6, 7, 8]. Recently, pre-trained language models have become a foundation module of NLP, and the paradigm of pre-training, then fine-tuning, is widely adopted. Pre-training is the most common method for transfer learning, in which a model is trained on a surrogate task and then adapted to the desired downstream task by fine-tuning [9]. Some research has attempted to use pre-training modules in AES tasks [10, 11, 12]. Howard et al. [10] utilize the pre-trained encoder as a feature extraction module to obtain a representation of the input text and update the pre-trained model parameters for the downstream text classification task by adding a linear layer. Rodriguez et al. [11] employ a pre-trained encoder as the essay representation extraction module for the AES task, with inputs at various granularities (sentence, paragraph, whole essay, etc.), and then use regression as the training target of the downstream task to further optimize the representation. In this paper, we fine-tune the pre-trained encoder as a feature extraction module and convert the essay scoring task into regression, as in previous studies [4, 5, 6, 7].

Existing neural methods obtain a generic representation of the text through a hierarchical model, using convolutional neural networks (CNN) for word-level representation and long short-term memory (LSTM) for sentence-level representation [4], which is not specific to different features. To enhance the representation of the essay, some studies have attempted to incorporate features such as the prompt [3, 13], organization [14], coherence [2], and discourse structure [15, 16, 17] into the neural model. These features are critical for the AES task because they help the model understand the essay while also making essay scoring more interpretable. In actual scenarios, prompt adherence is an important feature in essay scoring tasks [3]. The hierarchical model is insensitive to changes in the corresponding prompt and always assigns the same score to the same essay, regardless of the essay prompt. Persing and Ng [3] propose a feature-rich approach that integrates the prompt adherence dimension. Ref. [18] improves document modeling with topic words. Li et al. [7] utilize a hierarchical structure with an attention mechanism to construct prompt information. However, the above feature fusion methods are unsuitable for fine-tuning.

There are two challenges in effectively incorporating pre-trained models into AES feature representation: the data dimension and the methodological dimension. For the data dimension, using fine-tuning to transfer a pre-trained encoder to a downstream task frequently requires sufficient data. Most research uses training and testing data from the same target prompt [4, 5], but the data size is relatively small, varying between a few hundred and a few thousand essays, so pre-trained encoders cannot be fine-tuned well. To address this challenge, we use the whole training set, which includes various prompts. In terms of methodology, we employ the pre-training and multi-task learning (MTL) paradigms, which can learn features that cannot be learned in a single task through joint learning, learning to learn, and learning with auxiliary tasks [19]. MTL methods have been applied to several NLP tasks, such as text classification [20, 21] and semantic analysis [22]. Our method creates two auxiliary tasks that are learned alongside the main task. The main task and auxiliary tasks can increase each other’s performance by sharing information and complementing each other.

In this paper, we propose an essay scoring model based on fine-tuning that utilizes multi-task learning to fuse prompt features by designing two auxiliary tasks, prompt prediction, and prompt matching, which is more suitable for fine-tuning. Our approach can effectively incorporate the prompt feature in essays and improve the representation and understanding of the essay. The paper is organized as follows. In Section 2 , we first review related studies. We describe our method and experiment in Section 3 and Section 4 . Section 5 presents the findings and discussions. Finally, in Section 6 , we provide a conclusion, future work, and the limitations of the paper.

2. Related Work

Pre-trained language models, such as BERT [23], BERT-WWM [24], RoBERTa [25], and NEZHA [26], have gradually become a fundamental technique for NLP, with great success on both English and Chinese tasks [27]. In our approach, we use BERT and NEZHA as feature extraction layers. BERT (Bidirectional Encoder Representations from Transformers) is based on transformer blocks built with the attention mechanism [28] to extract semantic information. It is trained on large-scale datasets with two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). NEZHA is a Chinese pre-trained model that, unlike BERT, employs functional relative positional encoding and whole word masking (WWM). The pre-training-then-fine-tuning mechanism is widely used in downstream NLP tasks, including AES [11, 12, 15]. Mim et al. [15] propose a pre-training approach for evaluating the organization and argument strength of essays based on modeling coherence. Song et al. [12] present a multi-stage pre-training method for automated Chinese essay scoring that consists of three components: weakly supervised pre-training, supervised cross-prompt fine-tuning, and supervised target-prompt fine-tuning. Rodriguez et al. [11] use BERT and XLNet [29] for representation and fine-tuning on an English corpus.

The essay prompt introduces the topic, offers concepts, and restricts both content and perspective. Some studies have attempted to enhance AES systems by incorporating prompt features in various ways, such as integrating prompt information to determine whether an essay is off-topic [13, 18] or considering prompt adherence as a crucial indicator [3]. Louis and Higgins [13] improve model performance by expanding prompt information with a list of related words and reducing spelling errors. Persing and Ng [3] propose a feature-rich method for incorporating the prompt adherence dimension via manual annotation. Klebanov et al. [18] also improve essay modeling with topic words to quantify the overall relevance of the essay to the prompt, and they discuss the relationship between prompt adherence scores and overall essay quality. The methods described above mostly employ statistical machine learning; prompt information is enriched through annotation, the construction of datasets and word lists, and topic word mining. While all of them make good progress, their approaches are difficult to transfer directly to fine-tuning. Li et al. [7] propose a shared model and an enhanced model (EModel), and utilize a neural hierarchical structure with an attention mechanism to construct features of the essay such as discourse, coherence, relevancy, and prompt. For the representation, that work employs GloVe [30] rather than a pre-trained model. In the experiment section, we compare our method to the sub-module of EModel (Pro.) that incorporates the prompt feature.

3.1. Motivation

Although previous studies on automated essay scoring models for specific prompts have shown promising results, most research focuses on generic features of essays. Only a few studies have focused on prompt feature extraction, and none has attempted to use a multi-task approach to make the model capture prompt features and become sensitive to prompts automatically. Our approach is motivated by capturing prompt features to make the model aware of the prompt and by using the pre-training-then-fine-tuning mechanism for AES. Based on this motivation, we use a multi-task learning approach to obtain features that are more applicable to essay scoring (ES) by adding the essay prompt to the model input and proposing two auxiliary tasks: Prompt Prediction (PP) and Prompt Matching (PM). The overall architecture of our model is illustrated in Figure 1.

Figure 1. The proposed framework. “一封求职信” is the prompt of the essay; the English translation is “A cover letter”. “主管您好” means “Hello Manager”. The prompt and essay are separated by [SEP].

3.2. Input and Feature Extraction Layer

The input representation for a given essay is built by adding the corresponding token embeddings E_token, segment embeddings E_segment, and position embeddings E_position. To fully exploit the prompt information, we concatenate the prompt in front of the essay. The first token of each input is the special classification token [CLS], and the prompt and essay are separated by [SEP]. The token embedding of the j-th essay under the i-th prompt can be expressed as in Equation (1); E_segment and E_position are obtained from the tokenizer of the pre-trained encoder.

We utilize BERT and NEZHA as feature extraction layers. The final hidden state corresponding to the [CLS] token is taken as the essay representation r_e for essay scoring and the subtasks.
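A minimal sketch of this input construction with a Chinese BERT checkpoint is shown below; the checkpoint name and example strings are illustrative, and NEZHA would be used analogously. The tokenizer inserts [CLS] and the [SEP] between prompt and essay and produces the token, segment, and position inputs.

```python
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
encoder = BertModel.from_pretrained("bert-base-chinese")

prompt, essay = "一封求职信", "主管您好 ..."   # "A cover letter", "Hello Manager ..."
inputs = tokenizer(prompt, essay, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    outputs = encoder(**inputs)
r_e = outputs.last_hidden_state[:, 0]   # final [CLS] hidden state = essay representation
```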

3.3. Essay Scoring Layer

We view essay scoring as a regression task. Following existing studies, the real scores are scaled to the range [0, 1] for training and rescaled during evaluation:

s_ij = (score_ij − min_score_i) / (max_score_i − min_score_i),

where s_ij is the scaled score of the j-th essay under the i-th prompt, score_ij is its actual score, and max_score_i and min_score_i are the maximum and minimum real scores for the i-th prompt. The essay representation r_e from the pre-trained encoder is fed into a linear layer with a sigmoid activation function:

ŝ = σ(W_es · r_e + b_es),

where ŝ is the score predicted by the AES system, σ is the sigmoid function, W_es is a trainable weight matrix, and b_es is a bias. The essay scoring (ES) training objective is the mean squared error between the predicted scores and the scaled gold scores:

L_es = (1/N) Σ_j (ŝ_j − s_j)².
3.4. Subtask 1: Prompt Prediction

Prompt prediction is defined as follows: given an essay, determine which prompt it belongs to. We view prompt prediction as a classification task. The essay representation r_e is fed into a linear layer with a softmax function, as given by Equation (5):

û = softmax(W_pp · r_e + b_pp),

where û is the probability distribution over prompts, W_pp is a parameter matrix, and b_pp is a bias. The loss function is the cross-entropy:

L_pp = − Σ_k Σ_c u_k^c · log p_pp^(k,c),

where u_k is the real prompt label of the k-th sample (u_k^c is 1 if the k-th sample belongs to the c-th category and 0 otherwise), p_pp^(k,c) is the predicted probability that the k-th sample belongs to the c-th category, and C denotes the number of prompts, which in this study is ten.

3.5. Subtask 2: Prompt Matching

Prompt matching is defined as follows: given a pair consisting of a prompt and an essay, decide whether they are compatible. We also consider prompt matching a classification task:

v̂ = softmax(W_pm · r_e + b_pm),

where v̂ is the probability distribution over matching results, W_pm is a parameter matrix, and b_pm is a bias. The objective function is the cross-entropy shown in Equation (9):

L_pm = − Σ_k Σ_m v_k^m · log p_pm^(k,m),

where v_k indicates whether the input prompt and essay match, and p_pm^(k,m) is the predicted likelihood that the matching degree of the k-th sample falls into category m; m denotes the matching degree, 0 for a match and 1 for a mismatch. The distinction between prompt prediction and prompt matching is that, as the number of prompts increases, the difference in classification targets leads to increasingly obvious differences in task difficulty, sample distribution and diversity, and scalability.

3.6. Multi-Task Loss Function

The final loss for each input is a weighted sum of the loss functions of essay scoring and the two subtasks, prompt prediction and prompt matching:

L = α · L_es + β · L_pp + γ · L_pm,

where α, β, and γ are non-negative weights assigned in advance to balance the importance of the three tasks. Because the objective of this research is to improve the AES system, the main task is given more weight than the two auxiliary tasks. The optimal parameters in this paper are α:β = α:γ = 100:1, and in Section 5.3 we design experiments to figure out the optimal value interval for α, β, and γ.
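A hedged sketch of the three heads sharing r_e and of the weighted joint loss is shown below, using the reported ratio α:β = α:γ = 100:1; the class and variable names are illustrative.

```python
import torch
import torch.nn as nn

class MultiTaskHeads(nn.Module):
    """Essay scoring, prompt prediction, and prompt matching heads over a shared r_e."""
    def __init__(self, hidden_size: int = 768, num_prompts: int = 10):
        super().__init__()
        self.es_head = nn.Linear(hidden_size, 1)             # regression (scaled score)
        self.pp_head = nn.Linear(hidden_size, num_prompts)   # 10-way prompt prediction
        self.pm_head = nn.Linear(hidden_size, 2)              # binary prompt matching

    def forward(self, r_e):
        s_hat = torch.sigmoid(self.es_head(r_e)).squeeze(-1)
        return s_hat, self.pp_head(r_e), self.pm_head(r_e)

alpha, beta, gamma = 100.0, 1.0, 1.0
mse, ce = nn.MSELoss(), nn.CrossEntropyLoss()

def multitask_loss(s_hat, pp_logits, pm_logits, s_gold, prompt_ids, match_labels):
    """Weighted sum of the essay scoring loss and the two auxiliary losses."""
    return (alpha * mse(s_hat, s_gold)
            + beta * ce(pp_logits, prompt_ids)
            + gamma * ce(pm_logits, match_labels))
```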

4. Experiment

4.1. Dataset

We use the HSK Dynamic Composition Corpus ( http://hsk.blcu.edu.cn/ (accessed on 6 March 2022)) as our dataset, as in existing studies [31]. HSK is the acronym of Hanyu Shuiping Kaoshi, the Chinese Pinyin for the Chinese Proficiency Test; it is also called the “TOEFL in Chinese”, a national standardized test designed to assess the Chinese proficiency of non-native speakers. The HSK corpus includes 11,569 essays composed by foreigners from more than thirty different nations or regions in response to more than fifty distinct prompts. We eliminate any prompts with fewer than 500 student essays from the HSK dataset to constitute the experimental data. The statistics of the final filtered dataset are provided in Table 1, which comprises 8878 essays across 10 prompts taken from the actual HSK test. Each essay score ranges from 40 to 95 points. We divide the entire dataset at random into a training set, validation set, and test set in the ratio of 6:2:2. To alleviate the problem of insufficient data under a single prompt, we use the entire training set, which consists of different prompts, for fine-tuning. During the testing phase, we test every prompt individually as well as the entire test set, and we utilize the same 5-fold cross-validation procedure as [4, 5]. Finally, we report the average performance.

Table 1. HSK dataset statistics.

Set | #Essay | Avg #Len | Chinese Prompt (English Translation)
1 | 522 | 336 | 一封求职信 (A cover letter)
2 | 703 | 395 | 记对我影响最大的一个人 (Remember the person who influenced me the most)
3 | 707 | 340 | 如何看待“安乐死” (How to view “euthanasia”)
4 | 957 | 338 | 由“三个和尚没水喝”想到的 (Thought on “Three monks without water”)
5 | 829 | 356 | 如何解决“代沟”问题 (How to solve the “generation gap”)
6 | 694 | 387 | 一封写给父母的信 (A letter to parents)
7 | 1529 | 350 | 绿色食品与饥饿 (Green food and hunger)
8 | 1333 | 330 | 吸烟对个人健康和公众利益的影响 (Effects of smoking on personal health and public interest)
9 | 865 | 347 | 父母是孩子的第一任老师 (Parents are children’s first teachers)
10 | 739 | 337 | 我看流行歌曲 (My opinion on popular songs)
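A small sketch of the random 6:2:2 split described above is given below (the per-prompt testing and 5-fold cross-validation are handled separately); `essays` is an assumed list of records, and the seed value is illustrative since the paper does not report one.

```python
from sklearn.model_selection import train_test_split

# `essays` is an assumed list of (text, prompt_id, score) records.
train, rest = train_test_split(essays, test_size=0.4, random_state=0)  # 60% train
dev, test = train_test_split(rest, test_size=0.5, random_state=0)      # 20% dev / 20% test
```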

4.2. Evaluation Metrics

For the main task, we use the Quadratic Weighted Kappa (QWK), which is widely used in AES [32], to analyze the agreement between predicted scores and the ground truth. QWK is calculated with Equations (11) and (12). First, a weight matrix W is constructed:

W_(i,j) = (i − j)² / (N − 1)²,   (11)

where i and j are the golden score given by the human rater and the score given by the AES system, and each essay has N possible ratings. Second, the QWK score is calculated using Equation (12):

κ = 1 − ( Σ_(i,j) W_(i,j) · O_(i,j) ) / ( Σ_(i,j) W_(i,j) · Z_(i,j) ),   (12)

where O_(i,j) denotes the number of essays that receive a rating i from the human rater and a rating j from the AES system. The expected rating matrix Z is the outer product of the histogram vectors of the golden ratings and the AES system ratings, normalized so that the sum of its elements equals the sum of the elements of O. We also utilize Pearson’s Correlation Coefficient (PCC) to measure association, as in previous studies [3, 32, 33]; it quantifies the degree of linear dependency between two variables and describes the level of covariation. In contrast to the QWK metric, which evaluates the agreement between the model output and the gold standard, we use PCC to assess whether the AES system ranks essays similarly to the gold standard, indicating the capacity of the AES system to rank essays appropriately, i.e., high scores ahead of low scores. For the auxiliary tasks, we consider prompt prediction and prompt matching as classification problems and use the macro-F1 score (F1) and accuracy (Acc.) as evaluation metrics.
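Both metrics are available off the shelf; the sketch below uses scikit-learn's quadratically weighted kappa and SciPy's Pearson correlation, treating the scores as discrete ratings as in Equations (11) and (12).

```python
from sklearn.metrics import cohen_kappa_score
from scipy.stats import pearsonr

def qwk(human_scores, system_scores):
    """Quadratic Weighted Kappa over discrete ratings."""
    return cohen_kappa_score(human_scores, system_scores, weights="quadratic")

def pcc(human_scores, system_scores):
    """Pearson's Correlation Coefficient."""
    return pearsonr(human_scores, system_scores)[0]
```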

4.3. Comparisons

Our model is compared to the baseline models listed below. The first three are existing neural AES methods, and we experiment with both character and word input when training them for comparison. The fourth method fine-tunes the pre-trained model, and the rest are variations of our proposed method.

CNN-LSTM [ 4 ]: This method builds a document using CNN for word-level representation and LSTM for sentence-level representation, as well as the addition of a pooling layer to obtain the text representation. Finally, the score is obtained by applying the linear layer of the sigmoid function.

CNN-LSTM-att [ 5 ]: This method incorporates an attention mechanism into both the word-level and sentence-level representations of CNN-LSTM.

EModel (Pro.): This method concatenates the prompt information in the input layer of CNN-LSTM-att, which is a sub-module of [ 7 ].

BERT/NEZHA-FT: This method is used to fine-tune the pre-trained model. To obtain the essay representation, we directly feed an essay into the pre-trained encoder as the input. We choose the [CLS] embedding as essay representations and feed them into a linear layer of the sigmoid function for scoring.

BERT/NEZHA-concat: The difference between this method and fine-tune is that the input representation concatenates the prompt to the front of the essay in token embedding, as in Figure 1 .

BERT/NEZHA-PP: This model incorporates prompt prediction as an auxiliary task, with the same input as the concat model and the output using [CLS] as the essay representation. A linear layer with the sigmoid function is used for essay scoring, and a linear layer with the softmax function is used for prompt prediction.

BERT/NEZHA-PM: This model includes prompt matching as an auxiliary task. When constructing the training data, there is a 50% probability that the prompt and the essay are mismatched (see the sketch after this list of models). The [CLS] embedding is used to represent the essay. A linear layer with the sigmoid function is used for essay scoring, and a linear layer with the softmax function is used for prompt matching.

BERT/NEZHA-PP&PM: This model utilizes two auxiliary tasks, prompt prediction, and prompt matching, with the same inputs and outputs as the PM model. The output layer of the auxiliary tasks is the same as above.
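The prompt-matching training pairs mentioned in the PM variant above could be constructed roughly as follows; the record layout and function name are assumptions, and the labels follow the convention of Section 3.5 (0 for a match, 1 for a mismatch).

```python
import random

def build_pm_example(essay, prompts):
    """`essay` is an assumed record with 'text' and 'prompt_id'; `prompts` maps id -> prompt text."""
    if random.random() < 0.5:                                        # matched pair
        return prompts[essay["prompt_id"]], essay["text"], 0
    other_ids = [pid for pid in prompts if pid != essay["prompt_id"]]
    return prompts[random.choice(other_ids)], essay["text"], 1       # mismatched pair
```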

4.4. Parameter Settings

We use BERT ( https://github.com/google-research/bert (accessed on 11 March 2022)) and NEZHA ( https://github.com/huawei-noah/Pretrained-Language-Model/tree/master/NEZHA-TensorFlow (accessed on 11 March 2022)) as pre-trained encoder. To obtain tokens and token embeddings, we employ the tokenizer and vocabulary of the pre-trained encoder. The parameters of the pre-trained encoder are learnable during both the fine-tuning and training phases. The maximum length of the input is set to 512 and Table 2 includes additional parameters. The baseline models, CNN-LSTM and CNN-LSTM-att, are trained from scratch, and their parameters are shown in Table 2 . Our experiments are carried out on NVIDIA TESLA V100 32 G GPUs.

Table 2. Parameter settings.

Parameter | Baseline Settings | Our Method Settings
Embedding size | 100 | 768
Vocab size | 500 | 21,128
Epochs | 50 | 10
Batch size | 64 | 16
Optimizer | RMSprop | Adam
Learning rate | 1 × 10 | 5 × 10
LSTM hidden state | 100 | -
CNN filters (kernel size) | 100 (5) | -
Word embedding | Tencent (small) (accessed on 17 March 2022) | -

5. Results and Discussions

5.1. Main Results and Analysis

We report our experimental results in Table 3 and Table A1 (due to space limitations, the latter is included in Appendix A). Table A1 illustrates the average QWK and PCC for each prompt. Table 3 shows QWK and PCC across the entire test set and the average results over each prompt's test set. As shown in Table 3, the proposed auxiliary tasks (PP, PM, and PP&PM; rows 8–10 and 13–15) outperform the other contrast models on both QWK and PCC, and the PP&PM models with the pre-trained encoders BERT and NEZHA outperform PP and PM on QWK. In terms of the PCC metric, the PM models exceed the other two models except for the average result with the NEZHA encoder. These findings indicate that both of our proposed auxiliary tasks are effective.

Table 3. QWK and PCC on the total test set and average QWK and PCC over the per-prompt test sets; † denotes character input; ‡ denotes word input. The best results are in bold.

Models | Total QWK | Total PCC | Average QWK | Average PCC
CNN-LSTM † | 0.632 | 0.672 | 0.612 | 0.642
CNN-LSTM-att † | 0.642 | 0.672 | 0.615 | 0.648
CNN-LSTM ‡ | 0.617 | 0.653 | 0.596 | 0.633
CNN-LSTM-att ‡ | 0.623 | 0.658 | 0.603 | 0.629
EModel (Pro.) ‡ | 0.642 | 0.669 | 0.620 | 0.649
BERT-FT | 0.683 | 0.722 | 0.667 | 0.713
BERT-concat | 0.685 | 0.719 | 0.671 | 0.712
BERT-PP | 0.688 | 0.714 | 0.668 | 0.709
BERT-PM | 0.700 | | 0.684 |
BERT-PP&PM | | 0.711 | | 0.715
NEZHA-FT | 0.676 | 0.714 | 0.662 | 0.708
NEZHA-concat | 0.681 | 0.717 | 0.667 | 0.714
NEZHA-PP | 0.695 | 0.727 | 0.680 |
NEZHA-PM | 0.698 | | 0.682 | 0.724
NEZHA-PP&PM | | 0.714 | | 0.722

On the Total test set, our best results (a pre-trained encoder with PP and PM) are higher than those of the fine-tuning method and EModel (Pro.), exceed the strong concat baseline by 1.8% with BERT and 2.3% with NEZHA on QWK, and show a generally consistent correlation. Table 3 shows that our proposed models yield similar gains on the Average test set: on QWK, the PP&PM models improve over the concat model by 1.6% with BERT and 2% with NEZHA, and over the fine-tuning model by 2% with BERT and 2.5% with NEZHA, with competitive results on the PCC metric. Comparing the multi-task learning approach with plain fine-tuning, our proposed approach outperforms the baseline system on both QWK and PCC, indicating that a better essay representation can be obtained through multi-task learning. Furthermore, when compared with the concat model that fuses the prompt representation, our proposed approach achieves higher QWK scores, although the Total-track PCC values in rows 10 and 15 of Table 3 are lower than the baseline by less than 1%. This demonstrates that our proposed auxiliary tasks are effective in representing the essay prompt.

We train the hierarchical models (rows 1–4) using characters and words as input, respectively, and the results show that training with characters is generally better; even so, the best results on Total and Average are more than 4% lower than those of the pre-training methods. The results indicate that using the pre-trained encoders BERT and NEZHA for feature extraction works well on the HSK dataset. The comparison of the pre-trained models reveals that BERT and NEZHA are competitive, with NEZHA delivering the best results.

Results for each prompt with BERT and NEZHA are displayed in Figure 2. Our proposed models (PP, PM, and PP&PM) make positive progress on several prompts. Among them, the results of PP&PM, except on prompt 1 and prompt 5, exceed the two baselines of fine-tuning and concat. The results indicate that our proposed auxiliary tasks for incorporating the prompt are generic and can be employed with a range of genres and prompts. The primary cause of the suboptimal results on individual prompts is that the hyperparameters of the loss function, α, β, and γ, are not adjusted specifically for each prompt; we analyze the reasons further in Section 5.3.

Figure 2. (a) Results of each prompt with the BERT pre-trained encoder on QWK; (b) results of each prompt with the NEZHA pre-trained encoder on QWK.

5.2. Result and Effect of Auxiliary Tasks

Table 4 reports the results of the auxiliary tasks (PP and PM) on the validation set. The accuracy and F1 are both greater than 85% for BERT and 90% for NEZHA, so the model is well trained on the auxiliary tasks. Comparing the two pre-trained models, NEZHA produces better auxiliary-task results and therefore performs better as a feature extraction module.

Table 4. Accuracy and F1 for PP and PM on the validation set.

Models | PP Acc. (%) | PP F1 (%) | PM Acc. (%) | PM F1 (%)
BERT-PP&PM | 86.6 | 85.6 | 85.5 | 85.6
NEZHA-PP&PM | 91.7 | 98.1 | 90.7 | 91.4

Comparing the contributions of PP and PM, as shown in Table A1, Table 3, and Figure 3, the contribution of PM is higher and more effective. Figure 3a,b show radar graphs of the PP and PM variants with different pre-trained encoders across the 10 prompts using the QWK metric. Figure 3a shows that the QWK value of PM is higher than that of PP on all prompts except prompt 9 with the BERT encoder, and Figure 3b shows that PM outperforms PP on 60% of the prompts, implying that PM is also superior to PP for specific prompts. The PM and PP comparison results on the Total and Average sets are provided in Figure 3c,d. Except for the PM model with the NEZHA pre-trained encoder, which has a slightly lower QWK than the PP model, all models that use PM as the single auxiliary task perform better, further demonstrating the superiority of prompt matching for representing and incorporating the prompt.

Figure 3. (a) Radar graph of BERT-PP and BERT-PM; (b) radar graph of NEZHA-PP and NEZHA-PM; (c) results of PP and PM on QWK; (d) results of PP and PM on PCC.

5.3. Effect of Loss Weight

We examine how the ratio of the loss weight parameters β and γ affects the model. Figure 4a shows that the model works best when the ratio is 1:1 on both the QWK and PCC metrics. Figure A1 depicts the QWK results for various β and γ ratios and reveals that the model produces the best results at around 1:1 for the different prompts, except for prompts 1, 5, and 6; the same holds for the average results. Concerning the issue of our model being suboptimal on individual prompts, Figure A1 illustrates that the best results for prompts 1, 5, and 6 are not achieved at 1:1, suggesting that this setting is inappropriate for these prompts. Because we shuffle the entire training set and fix the β and γ ratio before testing each prompt independently, the parameters for different prompts cannot be dynamically adjusted within a single training procedure. We do this to address the lack of data and to focus on the average performance of the model, which also prevents the model from overfitting to specific prompts. Compared to the results in Table A1, NEZHA-PP and NEZHA-PM both outperform the baselines and the PP&PM model on prompt 1, indicating that both PP and PM can enhance the results when employed separately. For prompt 5, NEZHA-PP performs better than NEZHA-PM, showing that PP plays a greater role. For prompt 6, the PP&PM model already gives the best result even though the 1:1 ratio is not optimal in Figure A1, demonstrating that there is still room for improvement. These observations reveal that different prompts have varying degrees of difficulty for joint training and parameter optimization of the main and auxiliary tasks, along with different conditions of applicability for the two auxiliary tasks we presented.

Figure 4. (a) The effect of PP&PM with different β/γ ratios on QWK and PCC on the Total dataset (α is fixed in this experiment); (b) the smoothed training losses for all tasks; (c) the results of different α:β (PP), α:γ (PM), and α:β:γ (PP&PM) ratios on QWK.

We also measure the effect of α on the model, with the β/γ ratio fixed at 1:1. Figure 4c demonstrates that the PP, PM, and PP&PM models are all optimal at α:β = α:γ = 100:1, with the best QWK values for PP&PM, indicating that our method of combining the two auxiliary tasks for joint training is effective. Observing the interval [1, 100] shows that when the ratio is small, the main task cannot be trained well: the two auxiliary tasks together have a negative impact on the main task, while a single auxiliary task has less impact, indicating that multiple auxiliary tasks are more difficult to train concurrently than a single one. Future research should therefore consider how to dynamically optimize the parameters of multiple tasks.

The training losses for ES, PP, and PM are shown in Figure 4b. The loss of the main task decreases rapidly in the early stage, and the model converges at around 6000 steps. PM converges faster than PP because PM is a binary classification task whereas PP is a ten-class classification task; in addition, among the ten prompts, prompt 6 ("A letter to parent") and prompt 9 ("Parents are children's first teachers") are quite similar, which makes PP more difficult. As a result, further research is required into how to select appropriate weight ratios and how to design better-matched auxiliary tasks.
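To make the task setup concrete: PP is a ten-way classification over the prompt set, while PM is a binary decision over an (essay, prompt) pair. A minimal sketch of how such heads could sit on top of a shared pooled encoder representation is shown below; the pooling choice, layer names, and hidden size are assumptions for illustration, not the paper's exact architecture.

```python
# Illustrative sketch only: a scoring head plus two auxiliary heads
# (PP: 10-way prompt classification, PM: binary essay-prompt matching)
# on top of a shared pooled encoder output such as the [CLS] vector.
import torch.nn as nn

class ScoringWithAuxHeads(nn.Module):
    def __init__(self, hidden_size=768, num_prompts=10):
        super().__init__()
        self.score_head = nn.Linear(hidden_size, 1)         # ES: essay score
        self.pp_head = nn.Linear(hidden_size, num_prompts)  # PP: which prompt
        self.pm_head = nn.Linear(hidden_size, 1)            # PM: matched or not

    def forward(self, pooled):  # pooled: (batch, hidden_size)
        return (self.score_head(pooled).squeeze(-1),
                self.pp_head(pooled),
                self.pm_head(pooled).squeeze(-1))
```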

6. Conclusions and Future Work

This paper presents a pre-train-then-fine-tune model for automated essay scoring. The model incorporates the essay prompt into the model input and obtains features better suited to essay scoring through multi-task learning with two auxiliary tasks, prompt prediction and prompt matching. Experiments demonstrate that the model outperforms the baselines on QWK and PCC averaged over all results on the HSK dataset, indicating that our model is substantially better in terms of agreement and correlation. The experimental results also show that both auxiliary tasks effectively improve model performance, and that combining the two auxiliary tasks with the NEZHA pre-trained encoder yields the best results, improving QWK by 2.5% and PCC by 2% over the strong concatenation baseline, averaged over all results on the HSK dataset. Compared with existing neural essay scoring methods, QWK improves by 7.2% and PCC by 8% on average across all results.

Although our work enhances the effectiveness of the AES system, limitations remain. On the data side, this research primarily investigates fusing prompt features in Chinese; other languages are not examined extensively. Nevertheless, our method is easier to transfer than manual annotation approaches and can be applied directly to other languages, and similar auxiliary tasks can be constructed to fuse other features in different languages. Moreover, as the number of prompts grows, prompt prediction becomes harder to train; we will consider combining prompts with genre and other information to design auxiliary tasks that scale to more prompts, and will seek a balance between the number of essays and the number of prompts to make prompt prediction more efficient. At the methodological level, the loss weights are currently set empirically, which does not generalize well to additional auxiliary tasks; in future work we will optimize the parameter selection scheme and build dynamic parameter optimization techniques that accommodate a variable number of auxiliary tasks. In terms of application, our approach focuses on fusing the textual information of prompts and does not cover all prompt forms; chart and picture prompts currently require additional modules. In future research, we will experiment with multimodal prompt data to broaden the application scenarios of the AES system.

Abbreviations

The following abbreviations are used in this manuscript:

AES   Automated Essay Scoring
NLP   Natural Language Processing
QWK   Quadratic Weighted Kappa
PCC   Pearson's Correlation Coefficient

Table A1. QWK and PCC for each prompt on the HSK dataset; † denotes character-level input, ‡ denotes word-level input. The best results are in bold in the original table (values lost from the source are marked "…"; rows containing "…" or fewer than ten values are incomplete and may not align exactly with the prompt order).

Prompts 1–5 (each row: QWK PCC for prompt 1, then prompt 2, …, prompt 5)

CNN-LSTM †         0.721 0.742 0.634 0.644 0.646 0.669 0.644 0.661 0.666 0.702
CNN-LSTM-att †     0.759 0.767 0.639 0.650 0.662 0.683 0.649 0.671 0.654 0.695
CNN-LSTM ‡         0.730 0.749 0.638 0.657 0.613 0.663 0.673 0.696 0.671 0.709
CNN-LSTM-att ‡     … 0.773 0.622 0.634 0.679 0.701 0.680 0.694 0.668 0.705
EModel (Pro.) ‡    0.752 0.769 0.664 0.681 0.672 0.687 0.693 0.710 0.676 0.704
BERT-FT            0.725 0.765 0.701 0.748 0.678 0.720 0.726 0.763 0.667 0.699
BERT-concat        0.746 0.772 0.718 0.756 0.681 0.726 0.713 0.751 0.686 …
BERT-PP            0.735 0.773 0.718 0.758 0.680 0.724 0.715 0.743 0.658 0.681
BERT-PM            … 0.774 … 0.729 … 0.704
BERT-PP&PM         0.716 … 0.728 0.766 … 0.734 … 0.707
NEZHA-FT           0.719 0.769 0.706 0.763 0.671 0.715 0.706 0.744 0.661 0.689
NEZHA-concat       0.703 0.751 0.696 0.761 0.665 0.715 0.715 0.754 …
NEZHA-PP           0.750 … 0.700 0.764 0.692 … 0.731 … 0.692 0.728
NEZHA-PM           … 0.787 0.735 … 0.697 0.741 0.714 0.760 0.684 0.717
NEZHA-PP&PM        0.687 0.781 … 0.765 … 0.745 … 0.761 0.697 0.710

Prompts 6–10 (each row: QWK PCC for prompt 6, then prompt 7, …, prompt 10)

CNN-LSTM †         0.539 0.564 0.553 0.580 0.456 0.496 0.612 0.669 0.646 0.688
CNN-LSTM-att †     0.552 0.581 0.552 0.604 0.454 0.507 0.598 0.660 0.630 0.661
CNN-LSTM ‡         0.479 0.519 0.542 0.565 0.396 0.446 0.596 0.652 0.627 0.674
CNN-LSTM-att ‡     0.486 0.516 0.553 0.590 0.356 0.399 0.575 0.616 0.649 0.665
EModel (Pro.) ‡    0.503 0.528 0.560 0.602 0.413 0.457 0.597 0.661 0.667 0.693
BERT-FT            0.582 0.625 0.673 0.705 0.558 0.625 0.683 0.746 0.677 0.733
BERT-concat        0.580 … 0.651 0.698 … 0.571 … 0.672 0.720 0.690 0.738
BERT-PP            0.562 0.615 0.664 0.700 0.553 0.611 0.694 … 0.696 0.740
BERT-PM            0.579 0.620 0.682 … 0.688 0.736
BERT-PP&PM         … 0.627 … 0.705 0.568 0.601 … 0.695 0.741
NEZHA-FT           0.594 0.631 0.674 0.707 0.553 0.599 0.655 0.722 0.677 0.738
NEZHA-concat       0.595 0.642 0.689 0.718 0.554 0.610 0.658 0.716 0.684 0.738
NEZHA-PP           0.588 0.639 0.688 … 0.579 … 0.672 … 0.706 0.751
NEZHA-PM           0.576 0.630 0.672 0.719 0.583 0.624 0.692 0.740 …
NEZHA-PP&PM        … 0.715 … 0.618 … 0.729 0.684 0.750

Figure A1. The effect of PP&PM under different β/γ ratios on QWK for each prompt (the value of α is fixed in this experiment).

Funding Statement

This research was funded by the National Natural Science Foundation of China (Grant No. 62007004), the Major Program of the National Social Science Foundation of China (Grant No. 18ZDA295), and the Doctoral Interdisciplinary Foundation Project of Beijing Normal University (Grant No. BNUXKJC2020).

Author Contributions

Conceptualization and methodology, J.S. (Jingbo Sun); writing—original draft preparation, J.S. (Jingbo Sun) and T.S.; writing—review and editing, T.S., J.S. (Jihua Song) and W.P. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data Availability Statement

Conflicts of Interest

The authors declare no conflict of interest.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.





