# ScholaWrite: A Dataset of End-to-End Scholarly Writing Process
Paper • Project Page • Dataset
Writing is a cognitively demanding task that involves continuous decision-making, heavy use of working memory, and frequent switching between activities. Scholarly writing is particularly complex, as it requires authors to coordinate many pieces of multiform knowledge while meeting high academic standards. To understand writers' cognitive processes, one must fully decode end-to-end writing data (from scratch to final manuscript) and understand the complex cognitive mechanisms underlying scientific writing. We introduce ScholaWrite, the first dataset of its kind: keystroke logs of an end-to-end scholarly writing process, with thorough annotations of the cognitive writing intention behind each keystroke. The dataset comprises LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations collected over four months of paper writing. It demonstrates promising usability and applications for the future development of AI writing assistants for research environments, which call for methods beyond LLM prompting that support the cognitive processes of scientists.
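Each keystroke-level record corresponds to a text change between consecutive document states. As a rough illustration only (not the dataset's exact schema), such a change can be expressed as a character-level diff with the `diff_match_patch` library installed in the environment setup below; the two snippets are invented examples, not dataset records:

```python
# Sketch: represent a text change between two editor snapshots as a diff.
from diff_match_patch import diff_match_patch

before = "Writing is a cognitive task."
after = "Scholarly writing is a cognitively demanding task."

dmp = diff_match_patch()
diffs = dmp.diff_main(before, after)
dmp.diff_cleanupSemantic(diffs)  # merge trivial edits into readable chunks
print(diffs)  # list of (op, text): -1 = deletion, 0 = equal, 1 = insertion
```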
## Repository Structure

| Directory | Description |
|---|---|
| `scholawrite_system/` | Data collection backend, admin page, annotation page, and Chrome extension |
| `scholawrite_finetune/` | Fine-tuning scripts for BERT, RoBERTa, and Llama-8B-Instruct |
| `gpt4o/` | GPT-4o inference for iterative writing and intention prediction |
| `meta_inference/` | Llama-8B-Instruct baseline inference |
| `eval_tool/` | Web interface for human evaluation of model outputs |
| `analysis/` | Evaluation metrics, cosine similarity, lexical diversity, and figure generation |
| `augmented/` | AI-generated content detection benchmarks (ScholaWrite-Augmented) |
| `seeds/` | Seed documents for iterative writing experiments |
| `outputs/` | Pre-computed inference outputs from all models |
## Requirements

- Docker & Docker Compose
- GPU with >= 16 GB VRAM (for fine-tuning/inference)
- Python 3.8+
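To verify the VRAM requirement before launching training, here is a quick check you can run inside the PyTorch container set up in the next section (a sketch, not part of the repo scripts):

```python
# Check that GPU 0 meets the >= 16 GB VRAM requirement.
import torch

assert torch.cuda.is_available(), "No CUDA GPU visible"
vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
print(f"GPU 0: {vram_gb:.1f} GB VRAM")
assert vram_gb >= 16, "Fine-tuning/inference expects >= 16 GB VRAM"
```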
## Environment Setup

Create a `.env` file in the project root:

```
HUGGINGFACE_TOKEN="<Your Hugging Face access token>"
OPEN_AI_API="<Your OpenAI API key>"
```
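The fine-tuning and inference scripts can read these keys via `python-dotenv` (installed in the Docker step below); a minimal sketch, assuming the script runs from the project root:

```python
# Load the API keys from .env with python-dotenv.
# Variable names match the .env file above.
import os
from dotenv import load_dotenv

load_dotenv()  # looks for .env in the current working directory
hf_token = os.getenv("HUGGINGFACE_TOKEN")
openai_key = os.getenv("OPEN_AI_API")
assert hf_token and openai_key, "set both keys in .env"
```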
Create a Docker container for fine-tuning and inference:

```bash
docker run --name scholawrite --gpus all -dt -v ./:/workspace --ipc=host pytorch/pytorch:2.4.1-cuda12.1-cudnn9-devel bash
docker exec -it scholawrite bash
pip install accelerate python-dotenv huggingface-hub datasets transformers trl unsloth diff_match_patch
```

## Data Collection System

The data collection system uses Flask + MongoDB with a Chrome extension for Overleaf.
### MongoDB

- Install MongoDB Community Edition and MongoDB Compass.
- Install the MongoDB Database Tools.
- Run MongoDB on the default port (27017). The `flask_db` database and `activity` collection are created automatically.
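To confirm the database is reachable before starting the system, here is a quick check, assuming the `pymongo` package is installed locally (it is not part of the pip install above):

```python
# Connectivity check for the local MongoDB instance (sketch; assumes pymongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
client.admin.command("ping")  # raises if the server is unreachable
print(client.list_database_names())  # flask_db appears after the first write
```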
### Google Sheets API

- Create a Google Cloud project and an OAuth client for a Desktop app.
- Download the credentials file and rename it to `sheet_credential.json`.
- Update the volume paths in `scholawrite_system/docker-compose.yml` (lines 12-13):

  ```yaml
  volumes:
    - <path>/sheet_credential.json:/usr/local/src/scholawrite/flaskapp/sheet_credential.json
    - <path>/token.json:/usr/local/src/scholawrite/flaskapp/token.json
  ```
- Add Overleaf project IDs to consecutive rows in a Google Sheet column.
- In `scholawrite_system/flaskapp/App.py`, update (see the sketch after this list):
  - Line 22: `SAMPLE_SPREADSHEET_ID` with your Sheet ID
  - Line 23: `SAMPLE_RANGE_NAME` with your cell range
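For reference, here is a minimal sketch of how these two settings are typically used to read the project IDs, assuming `google-api-python-client` is installed and an authorized `token.json` already exists; the sheet ID and range below are placeholders:

```python
# Sketch: read Overleaf project IDs from the configured Google Sheet.
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

SAMPLE_SPREADSHEET_ID = "<Your Sheet ID>"
SAMPLE_RANGE_NAME = "Sheet1!A1:A100"  # hypothetical cell range

creds = Credentials.from_authorized_user_file("token.json")
service = build("sheets", "v4", credentials=creds)
result = service.spreadsheets().values().get(
    spreadsheetId=SAMPLE_SPREADSHEET_ID, range=SAMPLE_RANGE_NAME
).execute()
project_ids = [row[0] for row in result.get("values", [])]
print(project_ids)
```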
### Ngrok

- Get 3 static domains and auth tokens from Ngrok.
- Create `ngrok_admin.yml`, `ngrok_annotation.yml`, and `ngrok_schola.yml` in `scholawrite_system/`:

  ```yaml
  version: 2
  authtoken: <Your AuthToken>
  ```

- Paste your domains into `docker-compose.yml` on lines 36, 63, and 90.
### Run the System

```bash
cd scholawrite_system
docker-compose up
```

### Chrome Extension

- Update the server URL in `scholawrite_system/extension/background.js` (line 3) and `popup.js` (line 5).
- In Chrome, go to `chrome://extensions` → enable Developer Mode → Load unpacked → select the `extension/` folder.
**Note:** Due to Overleaf UI updates, the Chrome extension can no longer record writer actions or perform AI paraphrasing.
## Fine-tuning

Inside the Docker container, navigate to `scholawrite_finetune/`.

Llama-8B (`llama8b_scholawrite_finetune/`):

```bash
# Iterative writing: set PURPOSE = "WRITING" in args.py
python3 train_writing.py

# Classification: set PURPOSE = "CLASS" in args.py
python3 train_classifier.py
```

BERT / RoBERTa (`bert_finetune/`):

```bash
python3 small_model_classifier.py
```

Fine-tuned models are saved to `results/` in the project root.
## Inference

Fine-tuned Llama-8B (`llama8b_scholawrite_finetune/`):

```bash
# Iterative writing: update the model path in iterative_writing.py (lines 65, 80)
python3 iterative_writing.py

# Classification: update the model path in classification.py (line 40)
python3 classification.py
```

GPT-4o (`gpt4o/`):

```bash
python3 iterative_writing.py
python3 classification.py
```

Llama-8B-Instruct baseline (`meta_inference/`):

```bash
python3 iterative_writing.py
python3 classification.py
```

Output structure: `<output_dir>/<seed>/generation/` and `<output_dir>/<seed>/intention/` (100 iterations each).
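Given that layout, here is a minimal sketch for walking one model's outputs; only the per-seed `generation/` and `intention/` structure is documented above, so the model subfolder name is a hypothetical placeholder:

```python
# Sketch: count generation/intention files per seed for one model's outputs.
from pathlib import Path

output_dir = Path("outputs/llama8b")  # hypothetical model subfolder
for seed_dir in sorted(p for p in output_dir.iterdir() if p.is_dir()):
    n_gen = len(list((seed_dir / "generation").glob("*")))
    n_int = len(list((seed_dir / "intention").glob("*")))
    print(f"{seed_dir.name}: {n_gen} generations, {n_int} intentions")  # expect 100 each
```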
## Human Evaluation

- Set up Ngrok with 1 static domain. Create `eval_tool/ngrok.yml` with your auth token.
- Update the domain in `eval_tool/run_eval_app.sh`.
- Run:

  ```bash
  cd eval_tool
  docker-compose up -d
  docker exec -it scholawrite_eval bash
  ./run_eval_app.sh
  ```
## Citation

```bibtex
@misc{le2025scholawritedatasetendtoendscholarly,
      title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
      author={Khanh Chi Le and Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
      year={2025},
      eprint={2502.02904},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.02904},
}
```

## Contributors

Linghe Wang, Ross Volkov, Minhwa Lee
## License

This project is licensed under the MIT License.