
SNaRe: Domain-aware Data Generation for Low-Resource Event Detection

This repository contains the implementation of SNaRe (Scout-Narrator-Refiner) from our EMNLP 2025 [paper], a novel inverse data generation approach for creating domain-aware annotated data for low-resource event detection.

Overview

SNaRe comprises three components:

  1. Scout: extracts domain-specific cues from unlabeled target-domain data.
  2. Narrator: samples a target structure (i.e., event types and triggers) from the Scout's extractions and generates new sentences conforming to it.
  3. Refiner: detects events missed in the Narrator-generated sentences and refines the final annotations.

After data generation, we train downstream models on the generated data; improvement in downstream model performance serves as the primary evaluation metric.
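
At the code level, the three stages roughly map onto the folders described under Folder Structure below (a sketch only; the concrete orchestration lives in scripts/run.sh):

# Conceptual pipeline (illustrative mapping, not a runnable command)
# Scout    -> llm_refinement/   extract domain-specific trigger cues from unlabeled target-domain text
# Narrator -> data-generation/  sample event types/triggers and generate annotated sentences
# Refiner  -> llm_inference/    find missed events and refine the final annotations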

SNaRe Methodology

Getting Started

Environment

  • Python with CUDA GPUs; models are served via vLLM.
  • Recommended: create env from env.yml.
conda env create -f env.yml
conda activate datagen

Folder Structure

  • Data Directory (input): sample_event_dir - set this up for your dataset before proceeding further.
    • event_definitions.txt: mapping of event type → definition, stored as 2-line pairs (an illustrative excerpt follows this list).
    • task_definition.txt and task_definition_no_marker.txt: task instructions used in prompts.
    • examples_all.json: source examples used for few-shot prompting.
  • Code:
    • scripts: Main folder with pre-made end-to-end bash scripts
    • llm_refinement: Python files for LLM-based extraction used in the Scout
    • llm_inference: Python files for LLM-based inference used in the Refiner
    • data-generation: Python files for pre-processing/post-processing and main code for the Narrator
    • evaluation/downstream/TextEE: Main code for downstream evaluation. We largely reuse and modify code from TextEE; follow their setup to obtain the data for this folder.
    • utils: Utility code files
  • Outputs:
    • Trigger Extraction Output: extract_output/...
    • Generated data: data_output/<dataset>/<run_name>/
    • Downstream training/eval: evaluation/downstream/TextEE/outputs/...
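
As an illustration of the expected input format, here is a hypothetical event_definitions.txt excerpt for an ACE-style dataset (2-line pairs: event type, then its definition; actual type names and definitions depend on your dataset):

Conflict.Attack
A violent physical act causing harm or damage, such as a shooting, bombing, or assault.
Life.Die
An event in which the life of a person ends.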

Running end-to-end

Runs trigger extraction → data generation → verification → post-processing → downstream eval.

bash scripts/run.sh <dataset_name> <model_name> <n_gpu> [suffix]
# Example
bash scripts/run.sh ace llama3_8b 2 _testing

Supported model_name values (mapped internally): llama3_8b, llama3_70b, qwen3_8b, qwen3_32b, plus additional options in data-generation/run_data_generation.sh. You can modify the script to add models of your own (see the sketch below).
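
As a rough idea of what adding a model involves, the sketch below maps a short alias to a served checkpoint path; it is hypothetical, and the actual mapping inside data-generation/run_data_generation.sh may be structured differently:

# Hypothetical alias-to-checkpoint mapping; adapt to the script's actual structure
case "$model_name" in
    llama3_8b) model_path="meta-llama/Meta-Llama-3-8B-Instruct" ;;
    my_model)  model_path="my-org/my-model" ;;  # your own addition
esac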

Outputs land under data_output/<dataset>/<model_name>_0shot_tg-extracted-multi-preranked[suffix]/ with:

  • gen.txt (event spec + generations)
  • gen_text.txt (only the even-numbered lines of gen.txt, i.e., the text passed to the verifier; see the note after this list)
  • direct/gen.txt (refiner raw)
  • direct/gen_refined.txt (refined)
  • formatted_TextEE.json and direct/formatted_TextEE.json (post-processed)
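
The pipeline writes gen_text.txt itself; the one-liner below is only a sketch of the relationship between the two files:

# gen_text.txt keeps the even-numbered lines of gen.txt (the generated text)
awk 'NR % 2 == 0' gen.txt > gen_text.txt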

Scripts (scripts/)

  • run.sh (zero-shot, monolingual):

    • Args: <dataset_name> <model_name> <n_gpu> [suffix]
    • Steps:
      1. Trigger extraction via llm_refinement (if not already in extract_output/...)
      2. Data generation via data-generation/run_data_generation.sh
      3. Post-process to formatted_TextEE.json
      4. Verify/refine with llm_inference/llm_inference.py
      5. Weak-supervision merge and final post-process
      6. Downstream eval via scripts/eval_dg.sh
  • run_fs.sh (few-shot + optionally adding gold examples):

    • Args: <dataset_name> <model_name> <n_gpu> [few_shot=0] [few_shot_stage1=0] [suffix] [add_trigs=1]
    • Adds n-shot examples to the training file and can also add triggers (controlled by add_trigs).
  • run_multilingual.sh (zero-shot with language selection):

    • Args: <dataset_name> <model_name> <n_gpu> [suffix] [lang=en|ar|zh]
    • Selecting Chinese (zh) sets special post-processing flags.
  • run_external.sh (use an external file as the unlabeled data):

    • Args: <dataset_name> <model_name> <n_gpu> <external_file> [suffix]
  • eval_dg.sh (downstream training/eval only):

    • Args: <data_dir> <dataset_name> <model_name> [suffix] [few_shot=0] [add_trigs=0]
    • Trains and evaluates TextEE for seeds 0, 10, and 20.
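
Illustrative invocations following the argument orders above (dataset, model alias, paths, and suffixes are placeholders; adjust them to your setup):

bash scripts/run_fs.sh ace llama3_8b 2 5 5 _fs5 1              # 5-shot, with triggers added
bash scripts/run_multilingual.sh ace llama3_70b 4 _zh zh       # zero-shot, Chinese
bash scripts/run_external.sh ace qwen3_8b 2 /path/to/unlabeled.txt _ext
bash scripts/eval_dg.sh data_output/ace/<run_name> ace llama3_8b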

Repo Tips

  • Ensure the correct dataset folder (e.g., event_data_ace/ for the ace dataset) exists with the required files, or modify the existing scripts to match your setup.
  • vLLM shards the model across --n_gpu GPUs; set CUDA_VISIBLE_DEVICES as needed (see the example after this list).
  • Use suffix to separate experiment runs under data_output/<dataset>/.
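
For example, to pin an end-to-end run to two specific GPUs (device indices are illustrative):

CUDA_VISIBLE_DEVICES=0,1 bash scripts/run.sh ace llama3_8b 2 _testing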

Citation

If you use this code or find it helpful, please cite our paper:

@inproceedings{parekh2025snare,
    title={SNaRe: Domain-aware Data Generation for Low-Resource Event Detection},
    author={Tanmay Parekh and Yuxuan Dong and Lucas Bandarkar and Artin Kim and I-Hung Hsu and Kai-Wei Chang and Nanyun Peng},
    booktitle={Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
    year={2025}
}

Contact

For questions or issues, please contact the lead author, Tanmay Parekh, at tparekh@g.ucla.edu.

Acknowledgments

We thank the creators of the datasets used in this work and the open-source community for the tools and frameworks that made this research possible.
