Grading Student Essays using LLMs
It's the dream of every teacher to grade students' homework automatically using AI. Imagine just clicking a button and, voila, you get the results, similar to how computerized multiple-choice exams are scored.
Today, tools like ChatGPT, Claude, Gemini, and some open-source LLMs let us upload files and ask questions about them. But uploading each student's assignment individually and pasting in the rubric for every grade is impractical. Instead, a simple Python script that calls the APIs can process every assignment in a folder automatically and output a table of scores in a single file.
I've uploaded a test script to GitHub at https://github.com/franktfye/GradeBuddy; you can download it and run it with Python.
The examples in the data folder were generated by GPT-4: I asked it to produce five sample student essays about multiculturalism. Example 4 uses Chicago style instead of APA 7, while example 5 contains many grammatical errors.
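For orientation, the core batch workflow described above might look roughly like the following sketch. It is not the repo's actual code: the folder name, the scores.csv output, and the grade_essay placeholder are illustrative assumptions, and the real grading call is sketched separately below.
```
# Minimal sketch of the batch loop: walk the data folder, grade each file,
# and write one CSV of scores. Names here are illustrative, not the repo's.
import csv
from pathlib import Path

DATA_DIR = Path("data")          # folder holding the students' .docx files
OUTPUT_CSV = Path("scores.csv")  # one table of scores for the whole class

def grade_essay(path: Path) -> str:
    """Placeholder for the real LLM grading call (see the API sketch below)."""
    return "TODO"

rows = []
for essay_path in sorted(DATA_DIR.glob("*.docx")):
    # The filename carries the student's name or ID, so it becomes the row key.
    rows.append({"student": essay_path.stem, "score": grade_essay(essay_path)})

with OUTPUT_CSV.open("w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["student", "score"])
    writer.writeheader()
    writer.writerows(rows)
```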
In my experiments with GPT-4 models, a zero-shot approach with prompts that include the assignment rubric yielded excellent results, with no need for fine-tuning. Modern LLMs excel at this kind of qualitative assessment; the exception is complex coding tasks, for which the text classification models in Azure Language Studio might be a better fit. LLMs are clearly adept not only at judging content relevance but also at analyzing grammar. This represents a significant shift in the educational landscape, and it's curious that the approach hasn't yet gained widespread adoption on campuses. I'm confident, though, that some forward-thinking universities are already building products around this technology.
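As a concrete illustration of that zero-shot setup, a single grading call could look like the sketch below. It assumes the OpenAI Python SDK and an OPENAI_API_KEY environment variable; using the Azure service only changes the client setup, and the file paths here are illustrative rather than taken from the repo.
```
# Zero-shot grading: the rubric rides along in the prompt, so no fine-tuning
# is needed. Assumes the OpenAI Python SDK (pip install openai).
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rubric = Path("rubric.txt").read_text(encoding="utf-8")        # rubric kept in a .txt file
essay = Path("data/example1.txt").read_text(encoding="utf-8")  # illustrative path

prompt = (
    "You are grading a student essay against the rubric below.\n\n"
    f"RUBRIC:\n{rubric}\n\n"
    f"ESSAY:\n{essay}\n\n"
    "Score each rubric criterion and give one sentence of feedback per criterion."
)

response = client.chat.completions.create(
    model="gpt-4",  # or an Azure deployment name via the AzureOpenAI client
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep grading as repeatable as the model allows
)
print(response.choices[0].message.content)
```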
The main limitation is that the model cannot detect errors in APA 7 formatting. That's understandable, since it evaluates text extracted from the .docx files, which carries few formatting details such as page margins and line spacing. The model also seems to struggle to tell citation styles apart. To work around this, we could give the model some basic context about APA citation style so it can catch fundamental formatting errors. Even so, it's probably best to check formatting manually rather than relying on an LLM.
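The sketch below shows why that happens, assuming the extraction step uses the python-docx package (the repo script may do it differently): only the running text survives, so layout rules simply aren't visible to the model.
```
# Pulling text out of a .docx keeps the words but drops page margins,
# line spacing, fonts, and indentation. Assumes python-docx
# (pip install python-docx); the filename is illustrative.
from docx import Document

doc = Document("data/example4.docx")
text = "\n".join(p.text for p in doc.paragraphs)

# `text` now holds the essay body and in-text citations, but nothing about
# margins, spacing, or hanging indents, so APA 7 layout can't be checked.
print(text[:500])
```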
Incorporating vision models could make it possible to grade formatting as well, but that raises a cost-effectiveness question: is the extra effort worth it for grading dozens of assignments? Probably not, unless a user-friendly interface is built that reduces the time investment for teachers managing multiple courses. For a system designed for long-term standardized testing at the university level, however, it could be worth exploring.
To simplify a teacher's life:
1. First, obtain an API key, for example, from Azure's GPT-4 service. For local processing, platforms like Ollama or GPT4All can be used.
The sample script is written to pick up the API key from your system's environment variables, so set those before running it; a minimal sketch of how that works appears after these steps. Alternatively, if you don't plan to share your script and its keys with others, you can embed the API key directly in the script itself.
2. Develop a grading scheme based on the assignment rubric and save it in a .txt file so it's easy to modify later. The one I uploaded can serve as an example.
3. Install Python on your computer. After installation, place all students' assignments in the data folder, ensuring their names or IDs are included in the filenames for easy identification in the output table. Then, open a terminal app and run the script with:
```
python evaluate.oai.py
```
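For step 1, here is a minimal sketch of how the key can be picked up from an environment variable (the variable name is an assumption; check the repo script for the one it actually reads):
```
import os

# Set the key once in your shell before running the script, e.g.
#   export OPENAI_API_KEY="..."   (macOS/Linux)
#   setx OPENAI_API_KEY "..."     (Windows; takes effect in new terminals)
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise SystemExit("OPENAI_API_KEY is not set; see step 1 above.")
```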
Depending on your network and the chosen LLM's response time, the grading results should be ready within minutes.