# ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Paper • Project Page • Dataset
Writing is a cognitively demanding task that involves continuous decision-making, heavy use of working memory, and frequent switching between activities. Scholarly writing is particularly complex, as it requires authors to coordinate many pieces of multiform knowledge while meeting high academic standards. To understand writers' cognitive processes, one must fully decode end-to-end writing data (from scratch to final manuscript) and understand the complex cognitive mechanisms underlying scientific writing. We introduce ScholaWrite, the first dataset of its kind: keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. The dataset comprises LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations collected over four months of paper writing. It demonstrates promising usability and applications for the future development of AI writing assistants for research environments, which call for methods beyond LLM prompting that support the cognitive processes of scientists.
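Each keystroke-level record corresponds to a text change between consecutive document states. As a rough illustration only (not the dataset's exact schema), such a change can be expressed as a character-level diff with the `diff_match_patch` library installed in the environment setup below; the two snippets are invented examples, not dataset records:

```python
# Sketch: represent a text change between two editor snapshots as a diff.
from diff_match_patch import diff_match_patch

before = "Writing is a cognitive task."
after = "Scholarly writing is a cognitively demanding task."

dmp = diff_match_patch()
diffs = dmp.diff_main(before, after)
dmp.diff_cleanupSemantic(diffs)  # merge trivial edits into readable chunks
print(diffs)  # list of (op, text): -1 = deletion, 0 = equal, 1 = insertion
```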
## Repository Structure

| Directory | Description |
|---|---|
| `scholawrite_system/` | Data collection backend, admin page, annotation page, and Chrome extension |
| `scholawrite_finetune/` | Fine-tuning scripts for BERT, RoBERTa, and Llama-8B-Instruct |
| `gpt4o/` | GPT-4o inference for iterative writing and intention prediction |
| `meta_inference/` | Llama-8B-Instruct baseline inference |
| `eval_tool/` | Web interface for human evaluation of model outputs |
| `analysis/` | Evaluation metrics, cosine similarity, lexical diversity, and figure generation |
| `augmented/` | AI-generated content detection benchmarks (ScholaWrite-Augmented) |
| `seeds/` | Seed documents for iterative writing experiments |
| `outputs/` | Pre-computed inference outputs from all models |
## Requirements

- Docker & Docker Compose
- GPU with >= 16 GB VRAM (for fine-tuning/inference)
- Python 3.8+
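To verify the VRAM requirement before launching training, here is a quick check you can run inside the PyTorch container set up in the next section (a sketch, not part of the repo scripts):

```python
# Check that GPU 0 meets the >= 16 GB VRAM requirement.
import torch

assert torch.cuda.is_available(), "No CUDA GPU visible"
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0: {vram_gb:.1f} GB VRAM")
assert vram_gb >= 16, "Fine-tuning/inference expects >= 16 GB VRAM"
```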
## Environment Setup

Create a `.env` file in the project root:

```
HUGGINGFACE_TOKEN="<Your Hugging Face access token>"
OPEN_AI_API="<Your OpenAI API key>"
```
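The fine-tuning and inference scripts can read these keys via `python-dotenv` (installed in the Docker step below); a minimal sketch, assuming the script runs from the project root:

```python
# Load the API keys from .env with python-dotenv.
# Variable names match the .env file above.
import os
from dotenv import load_dotenv

load_dotenv()  # looks for .env in the current working directory
hf_token = os.getenv("HUGGINGFACE_TOKEN")
openai_key = os.getenv("OPEN_AI_API")
assert hf_token and openai_key, "set both keys in .env"
```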
Create a Docker container for fine-tuning and inference:

```bash
docker run --name scholawrite --gpus all -dt -v ./:/workspace --ipc=host pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel bash
docker exec -it scholawrite bash
pip install accelerate python-dotenv huggingface-hub datasets transformers trl unsloth diff_match_patch
```

## Data Collection System

The data collection system uses Flask + MongoDB with a Chrome extension for Overleaf.
### MongoDB

- Install MongoDB Community Edition and MongoDB Compass.
- Install the MongoDB Database Tools.
- Run MongoDB on the default port (27017). The `flask_db` database and `activity` collection are created automatically.
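To confirm the database is reachable before starting the system, here is a quick check, assuming the `pymongo` package is installed locally (it is not part of the pip install above):

```python
# Connectivity check for the local MongoDB instance (sketch; assumes pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
client.admin.command("ping")  # raises if the server is unreachable
print(client.list_database_names())  # flask_db appears after the first write
```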
### Google Sheets API

- Create a Google Cloud project and an OAuth client for a Desktop app.
- Download the credentials file and rename it to `sheet_credential.json`.
- Update the volume paths in `scholawrite_system/docker-compose.yml` (lines 12-13):

  ```yaml
  volumes:
    - <path>/sheet_credential.json:/usr/local/src/scholawrite/flaskapp/sheet_credential.json
    - <path>/token.json:/usr/local/src/scholawrite/flaskapp/token.json
  ```
- Add Overleaf project IDs to consecutive rows in a Google Sheet column.
- In `scholawrite_system/flaskapp/App.py`, update (see the sketch after this list):
  - Line 22: `SAMPLE_SPREADSHEET_ID` with your Sheet ID
  - Line 23: `SAMPLE_RANGE_NAME` with your cell range
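For reference, here is a minimal sketch of how these two settings are typically used to read the project IDs, assuming `google-api-python-client` is installed and an authorized `token.json` already exists; the sheet ID and range below are placeholders:

```python
# Sketch: read Overleaf project IDs from the configured Google Sheet.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

SAMPLE_SPREADSHEET_ID = "<Your Sheet ID>"
SAMPLE_RANGE_NAME = "Sheet1!A1:A100"  # hypothetical cell range

creds = Credentials.from_authorized_user_file("token.json")
service = build("sheets", "v4", credentials=creds)
result = service.spreadsheets().values().get(
    spreadsheetId=SAMPLE_SPREADSHEET_ID, range=SAMPLE_RANGE_NAME
).execute()
project_ids = [row[0] for row in result.get("values", [])]
print(project_ids)
```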
### Ngrok

- Get 3 static domains and auth tokens from Ngrok.
- Create `ngrok_admin.yml`, `ngrok_annotation.yml`, and `ngrok_schola.yml` in `scholawrite_system/`:

  ```yaml
  version: 2
  authtoken: <Your AuthToken>
  ```

- Paste your domains into `docker-compose.yml` on lines 36, 63, and 90.
### Run the System

```bash
cd scholawrite_system
docker-compose up
```

### Chrome Extension

- Update the server URL in `scholawrite_system/extension/background.js` (line 3) and `popup.js` (line 5).
- In Chrome, go to `chrome://extensions` → enable Developer Mode → Load unpacked → select the `extension/` folder.
**Note:** Due to Overleaf UI updates, the Chrome extension can no longer record writer actions or perform AI paraphrasing.
## Fine-tuning

Inside the Docker container, navigate to `scholawrite_finetune/`.

Llama-8B (`llama8b_scholawrite_finetune/`):

```bash
# Iterative writing: set PURPOSE = "WRITING" in args.py
python3 train_writing.py

# Classification: set PURPOSE = "CLASS" in args.py
python3 train_classifier.py
```

BERT / RoBERTa (`bert_finetune/`):

```bash
python3 small_model_classifier.py
```

Fine-tuned models are saved to `results/` in the project root.
## Inference

Fine-tuned Llama-8B (`llama8b_scholawrite_finetune/`):

```bash
# Iterative writing: update the model path in iterative_writing.py (lines 65, 80)
python3 iterative_writing.py

# Classification: update the model path in classification.py (line 40)
python3 classification.py
```

GPT-4o (`gpt4o/`):

```bash
python3 iterative_writing.py
python3 classification.py
```

Llama-8B-Instruct baseline (`meta_inference/`):

```bash
python3 iterative_writing.py
python3 classification.py
```

Output structure: `<output_dir>/<seed>/generation/` and `<output_dir>/<seed>/intention/` (100 iterations each).
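Given that layout, here is a minimal sketch for walking one model's outputs; only the per-seed `generation/` and `intention/` structure is documented above, so the model subfolder name is a hypothetical placeholder:

```python
# Sketch: count generation/intention files per seed for one model's outputs.
from pathlib import Path

output_dir = Path("outputs/llama8b")  # hypothetical model subfolder
for seed_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    n_gen = len(list((seed_dir / "generation").glob("*")))
    n_int = len(list((seed_dir / "intention").glob("*")))
    print(f"{seed_dir.name}: {n_gen} generations, {n_int} intentions")  # expect 100 each
```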
## Human Evaluation

- Set up Ngrok with 1 static domain. Create `eval_tool/ngrok.yml` with your auth token.
- Update the domain in `eval_tool/run_eval_app.sh`.
- Run:

  ```bash
  cd eval_tool
  docker-compose up -d
  docker exec -it scholawrite_eval bash
  ./run_eval_app.sh
  ```
## Citation

```bibtex
@misc{le2025scholawritedatasetendtoendscholarly,
      title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
      author={Khanh Chi Le and Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
      year={2025},
      eprint={2502.02904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02904},
}
```

## Contributors

Linghe Wang, Ross Volkov, Minhwa Lee
## License

This project is licensed under the MIT License.