
SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

This repository contains the implementation of SNaRe (Scout-Narrator-Refiner) from our EMNLP 2025 [paper], a novel inverse data generation approach for creating domain-aware annotated data for low-resource event detection.

Overview

SNaRe comprises three components:

  1. Scout: extracts domain-specific cues from unlabeled target-domain data.
  2. Narrator: samples a target structure (i.e., event types and triggers) from the Scout's extractions and generates new sentences conforming to it.
  3. Refiner: detects events missed in the Narrator-generated sentences and refines the final annotations.

After data generation, we train downstream models on the generated data; improvement in downstream model performance serves as the primary evaluation metric.
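
At the code level, the three stages roughly map onto the folders described under Folder Structure below (a sketch only; the concrete orchestration lives in scripts/run.sh):

# Conceptual pipeline (illustrative mapping, not a runnable command)
# Scout    -> llm_refinement/   extract domain-specific trigger cues from unlabeled target-domain text
# Narrator -> data-generation/  sample event types/triggers and generate annotated sentences
# Refiner  -> llm_inference/    find missed events and refine the final annotations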

SNaRe Methodology

Getting Started

Environment

  • Python with CUDA GPUs; models are served via vLLM.
  • Recommended: create env from env.yml.
conda env create -f env.yml
conda activate datagen

Folder Structure

  • Data Directory (input): sample_event_dir - set this up for your dataset before proceeding further.
    • event_definitions.txt: mapping of event type → definition, stored as 2-line pairs (an illustrative excerpt follows this list).
    • task_definition.txt and task_definition_no_marker.txt: task instructions used in prompts.
    • examples_all.json: source examples used for few-shot prompting.
  • Code:
    • scripts: Main folder with pre-made end-to-end bash scripts
    • llm_refinement: Python files for LLM-based extraction used in the Scout
    • llm_inference: Python files for LLM-based inference used in the Refiner
    • data-generation: Python files for pre-processing/post-processing and main code for the Narrator
    • evaluation/downstream/TextEE: Main code for downstream evaluation. We largely reuse and modify code from TextEE; follow their setup to obtain the data for this folder.
    • utils: Utility code files
  • Outputs:
    • Trigger Extraction Output: extract_output/...
    • Generated data: data_output/<dataset>/<run_name>/
    • Downstream training/eval: evaluation/downstream/TextEE/outputs/...
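
As an illustration of the expected input format, here is a hypothetical event_definitions.txt excerpt for an ACE-style dataset (2-line pairs: event type, then its definition; actual type names and definitions depend on your dataset):

Conflict.Attack
A violent physical act causing harm or damage, such as a shooting, bombing, or assault.
Life.Die
An event in which the life of a person ends.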

Running end-to-end

Runs trigger extraction → data generation → verification → post-processing → downstream eval.

bash scripts/run.sh <dataset_name> <model_name> <n_gpu> [suffix]
# Example
bash scripts/run.sh ace llama3_8b 2 _testing

Supported model_name values (mapped internally): llama3_8b, llama3_70b, qwen3_8b, qwen3_32b, plus additional options in data-generation/run_data_generation.sh. You can modify the script to add models of your own (see the sketch below).
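
As a rough idea of what adding a model involves, the sketch below maps a short alias to a served checkpoint path; it is hypothetical, and the actual mapping inside data-generation/run_data_generation.sh may be structured differently:

# Hypothetical alias-to-checkpoint mapping; adapt to the script's actual structure
case "$model_name" in
    llama3_8b) model_path="meta-llama/Meta-Llama-3-8B-Instruct" ;;
    my_model)  model_path="my-org/my-model" ;;  # your own addition
esac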

Outputs land under data_output/<dataset>/<model_name>_0shot_tg-extracted-multi-preranked[suffix]/ with:

  • gen.txt (event spec + generations)
  • gen_text.txt (only the even-numbered lines of gen.txt, i.e., the text passed to the verifier; see the note after this list)
  • direct/gen.txt (refiner raw)
  • direct/gen_refined.txt (refined)
  • formatted_TextEE.json and direct/formatted_TextEE.json (post-processed)
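
The pipeline writes gen_text.txt itself; the one-liner below is only a sketch of the relationship between the two files:

# gen_text.txt keeps the even-numbered lines of gen.txt (the generated text)
awk 'NR % 2 == 0' gen.txt > gen_text.txt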

Scripts (scripts/)

  • run.sh (zero-shot, monolingual):

    • Args: <dataset_name> <model_name> <n_gpu> [suffix]
    • Steps:
      1. Trigger extraction via llm_refinement (if not already in extract_output/...)
      2. Data generation via data-generation/run_data_generation.sh
      3. Post-process to formatted_TextEE.json
      4. Verify/refine with llm_inference/llm_inference.py
      5. Weak-supervision merge and final post-process
      6. Downstream eval via scripts/eval_dg.sh
  • run_fs.sh (few-shot + optionally adding gold examples):

    • Args: <dataset_name> <model_name> <n_gpu> [few_shot=0] [few_shot_stage1=0] [suffix] [add_trigs=1]
    • Adds n-shot examples to the training file and can also add triggers (controlled by add_trigs).
  • run_multilingual.sh (zero-shot with language selection):

    • Args: <dataset_name> <model_name> <n_gpu> [suffix] [lang=en|ar|zh]
    • Selecting Chinese (zh) sets special post-processing flags.
  • run_external.sh (use an external file as the unlabeled data):

    • Args: <dataset_name> <model_name> <n_gpu> <external_file> [suffix]
  • eval_dg.sh (downstream training/eval only):

    • Args: <data_dir> <dataset_name> <model_name> [suffix] [few_shot=0] [add_trigs=0]
    • Trains and evaluates TextEE for seeds 0, 10, and 20.
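
Illustrative invocations following the argument orders above (dataset, model alias, paths, and suffixes are placeholders; adjust them to your setup):

bash scripts/run_fs.sh ace llama3_8b 2 5 5 _fs5 1              # 5-shot, with triggers added
bash scripts/run_multilingual.sh ace llama3_70b 4 _zh zh       # zero-shot, Chinese
bash scripts/run_external.sh ace qwen3_8b 2 /path/to/unlabeled.txt _ext
bash scripts/eval_dg.sh data_output/ace/<run_name> ace llama3_8b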

Repo Tips

  • Ensure the correct dataset folder (e.g., event_data_ace/ for the ace dataset) exists with the required files, or modify the existing scripts to match your setup.
  • vLLM shards the model across --n_gpu GPUs; set CUDA_VISIBLE_DEVICES as needed (see the example after this list).
  • Use suffix to separate experiment runs under data_output/<dataset>/.
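
For example, to pin an end-to-end run to two specific GPUs (device indices are illustrative):

CUDA_VISIBLE_DEVICES=0,1 bash scripts/run.sh ace llama3_8b 2 _testing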

Citation

If you use this code or find it helpful, please cite our paper:

@inproceedings{parekh2025snare,
    title={SNaRe: Domain-aware Data Generation for Low-Resource Event Detection},
    author={Tanmay Parekh and Yuxuan Dong and Lucas Bandarkar and Artin Kim and I-Hung Hsu and Kai-Wei Chang and Nanyun Peng},
    booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2025}
}

Contact

For questions or issues, please contact the lead author, Tanmay Parekh, at tparekh@g.ucla.edu.

Acknowledgments

We thank the creators of the datasets used in this work and the open-source community for the tools and frameworks that made this research possible.
