# SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

This repository contains the implementation of SNaRe (Scout-Narrator-Refiner) from our EMNLP 2025 paper, a novel inverse data generation approach for creating domain-aware annotated data for low-resource event detection.
SNaRe comprises three components:
- Scout: This component extracts domain-specific cues from unlabeled target-domain data.
- Narrator: This component samples target structures (i.e., event types and triggers) from the Scout's extractions and generates new sentences conditioned on each sampled structure.
- Refiner: This component finds missing events in the Narrator-generated sentences and refines the final annotations.
After data creation, we train downstream models on the generated data; the improvement in downstream model performance serves as the primary evaluation metric.
## Setup

- Python with CUDA GPUs; models are served via vLLM.
- Recommended: create the environment from `env.yml`:

  ```bash
  conda env create -f env.yml
  conda activate datagen
  ```
- Data directory (input): `sample_event_dir`. Ensure you set it up for your dataset before proceeding further. It contains:
  - `event_definitions.txt`: mapping of event type → definition (2-line pairs).
  - `task_definition.txt` and `task_definition_no_marker.txt`: task instructions used in prompts.
  - `examples_all.json`: source examples for few-shot prompting.
## Repository Structure

- Code:
  - `scripts`: main folder with pre-made end-to-end bash scripts
  - `llm_refinement`: Python files for LLM-based extraction used in the Scout
  - `llm_inference`: Python files for LLM-based inference used in the Refiner
  - `data-generation`: Python files for pre-/post-processing and the main code for the Narrator
  - `evaluation/downstream/TextEE`: main code for downstream evaluation; we largely utilize/modify code from TextEE, and you can use their setup to obtain data for this folder
  - `utils`: utility code files
- Outputs:
  - Trigger extraction output: `extract_output/...`
  - Generated data: `data_output/<dataset>/<run_name>/`
  - Downstream training/eval: `evaluation/downstream/TextEE/outputs/...`
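The "2-line pairs" format of `event_definitions.txt` can be read with a few lines of Python. This is a hypothetical sketch assuming each event-type line is immediately followed by its definition line; the event-type names and definitions below are illustrative, not taken from the repo.

```python
def load_event_definitions(lines):
    """Pair consecutive non-empty lines into an event-type -> definition dict."""
    it = iter(line.strip() for line in lines if line.strip())
    return dict(zip(it, it))  # zip over one iterator consumes lines pairwise

# Illustrative file contents (not actual repo data):
defs = load_event_definitions([
    "Conflict:Attack",
    "An attack is a violent physical act causing harm or damage.",
    "Life:Die",
    "A die event occurs whenever the life of a person ends.",
])
```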
## Quick Start

`scripts/run.sh` runs trigger extraction → data generation → verification → post-processing → downstream eval.

```bash
bash scripts/run.sh <dataset_name> <model_name> <n_gpu> [suffix]
# Example
bash scripts/run.sh ace llama3_8b 2 _testing
```

Supported `model_name` values (mapped internally): `llama3_8b`, `llama3_70b`, `qwen3_8b`, `qwen3_32b`, plus additional options in `data-generation/run_data_generation.sh`. You can modify the script to add models of your own.
Outputs land under `data_output/<dataset>/<model_name>_0shot_tg-extracted-multi-preranked[suffix]/` with:

- `gen.txt` (event spec + generations)
- `gen_text.txt` (even lines only, text for the verifier)
- `direct/gen.txt` (refiner raw output)
- `direct/gen_refined.txt` (refined output)
- `formatted_TextEE.json` and `direct/formatted_TextEE.json` (post-processed)
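The relation between `gen.txt` and `gen_text.txt` ("even lines only") can be sketched as below. This is a hypothetical illustration assuming `gen.txt` alternates an event-spec line with a generated-sentence line, so the text falls on the even lines (1-indexed); the exact line format is an assumption, not verified against the repo.

```python
def even_lines(lines):
    # Keep only the even lines (1-indexed), i.e. the generated sentences.
    return [line for i, line in enumerate(lines, start=1) if i % 2 == 0]

# Illustrative gen.txt contents (format assumed, not from the repo):
gen = [
    "<Conflict:Attack | attacked>", "The army attacked the city.",
    "<Life:Die | died>", "Two people died in the flood.",
]
texts = even_lines(gen)  # what would be written to gen_text.txt
```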
## Scripts

- `run.sh` (zero-shot, monolingual):
  - Args: `<dataset_name> <model_name> <n_gpu> [suffix]`
  - Steps:
    1. Trigger extraction via `llm_refinement` (skipped if outputs already exist in `extract_output/...`)
    2. Data generation via `data-generation/run_data_generation.sh`
    3. Post-processing to `formatted_TextEE.json`
    4. Verification/refinement with `llm_inference/llm_inference.py`
    5. Weak-supervision merge and final post-processing
    6. Downstream eval via `scripts/eval_dg.sh`
- `run_fs.sh` (few-shot + optionally adding gold examples):
  - Args: `<dataset_name> <model_name> <n_gpu> [few_shot=0] [few_shot_stage1=0] [suffix] [add_trigs=1]`
  - Adds n-shot examples into the training file and can additionally add triggers.
- `run_multilingual.sh` (zero-shot with language selection):
  - Args: `<dataset_name> <model_name> <n_gpu> [suffix] [lang=en|ar|zh]`
  - Chinese sets special post-processing flags.
- `run_external.sh` (uses an external file as the unlabeled data):
  - Args: `<dataset_name> <model_name> <n_gpu> <external_file> [suffix]`
- `eval_dg.sh` (downstream training/eval only):
  - Args: `<data_dir> <dataset_name> <model_name> [suffix] [few_shot=0] [add_trigs=0]`
  - Trains/evaluates TextEE for seeds 0/10/20 and runs evaluation.
## Notes

- Ensure the correct dataset folder exists (e.g., `event_data_ace/` for the `ace` dataset) with the required files, or modify the existing scripts to match your requirements.
- vLLM will shard the model across `--n_gpu` GPUs. Set `CUDA_VISIBLE_DEVICES` as needed.
- Use `suffix` to separate experiment runs under `data_output/<dataset>/`.
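Since `suffix` is folded into the run-directory name, it is easy to see how runs stay separated. The helper below is hypothetical (not part of the repo); it simply reproduces the zero-shot naming convention shown above, `data_output/<dataset>/<model_name>_0shot_tg-extracted-multi-preranked[suffix]/`.

```python
def run_dir(dataset: str, model_name: str, suffix: str = "") -> str:
    # Hypothetical helper mirroring the documented zero-shot output path;
    # an empty suffix yields the default run directory.
    return (f"data_output/{dataset}/"
            f"{model_name}_0shot_tg-extracted-multi-preranked{suffix}/")

path = run_dir("ace", "llama3_8b", "_testing")
```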
## Citation

If you use this code or find it helpful, please cite our paper:

```bibtex
@inproceedings{parekh2025snare,
  title={SNaRe: Domain-aware Data Generation for Low-Resource Event Detection},
  author={Tanmay Parekh and Yuxuan Dong and Lucas Bandarkar and Artin Kim and I-Hung Hsu and Kai-Wei Chang and Nanyun Peng},
  booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2025}
}
```

For questions or issues, please contact the lead author, Tanmay, at tparekh@g.ucla.edu.
We thank the creators of the datasets used in this work and the open-source community for the tools and frameworks that made this research possible.
