ENVISION: Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge

Localizing Active Objects from Egocentric Vision with Symbolic World Knowledge
Te-Lin Wu, Yu Zhou (equal contribution), Nanyun Peng
EMNLP 2023 (Oral)

Abstract

The ability to actively ground task instructions from an egocentric view is crucial for AI agents to accomplish tasks or assist humans. One important step towards this goal is to localize and track key active objects that undergo major state change as a consequence of human actions/interactions in the environment (e.g., localizing and tracking the "sponge" in video from the instruction "Dip the sponge into the bucket.") without being told exactly what/where to ground. While existing works approach this problem from a pure vision perspective, we investigate to what extent the language modality (i.e., task instructions) and its interaction with the visual modality can be beneficial. Specifically, we propose to improve phrase grounding models' ability to localize the active objects by: (1) learning the role of objects undergoing change and accurately extracting them from the instructions, (2) leveraging pre- and post-conditions of the objects during actions, and (3) recognizing the objects more robustly with descriptional knowledge. We leverage large language models (LLMs) to extract the aforementioned action-object knowledge, and design a per-object aggregation masking technique to effectively perform joint inference on object phrases with symbolic knowledge. We evaluate our framework on the Ego4D and Epic-Kitchens datasets. Extensive experiments demonstrate the effectiveness of our proposed framework, which leads to > 54% improvements in all standard metrics on the TREK-150-OPE-Det localization + tracking task, > 7% improvements in all standard metrics on the TREK-150-OPE tracking task, and > 3% improvements in average precision (AP) on the Ego4D SCOD task.
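To make the idea of joint inference over object phrases and symbolic knowledge more concrete, here is a minimal Python sketch. It is not the implementation in this repository. It assumes a phrase grounding model has already produced, for each candidate box, one alignment score per prompt token, and that each token is tagged with the object it describes (its name, a pre- or post-condition, or a description). The names `aggregate_per_object` and `best_box`, the token tags, and the choice of mean pooling are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def aggregate_per_object(
    token_scores: Sequence[Sequence[float]],
    token_owner: Sequence[str],
) -> List[Dict[str, float]]:
    """Pool token-level box scores into one score per object.

    token_scores[b][t] is the (hypothetical) alignment score between
    candidate box b and prompt token t. token_owner[t] names the object
    that token t describes, or "" for tokens tied to no object
    (separators, filler words), which are masked out.
    """
    results = []
    for box_scores in token_scores:
        if len(box_scores) != len(token_owner):
            raise ValueError("each box needs one score per prompt token")
        grouped = defaultdict(list)
        for score, owner in zip(box_scores, token_owner):
            if owner:  # mask: skip tokens not tied to any object
                grouped[owner].append(score)
        # Assumption: mean pooling; the paper may use a different reduction.
        results.append({obj: sum(v) / len(v) for obj, v in grouped.items()})
    return results


def best_box(per_box: List[Dict[str, float]], obj: str) -> int:
    """Return the index of the box with the highest pooled score for obj."""
    return max(
        range(len(per_box)),
        key=lambda b: per_box[b].get(obj, float("-inf")),
    )


if __name__ == "__main__":
    # Illustrative prompt: "sponge . dry sponge . wet sponge . bucket"
    owners = ["sponge", "", "sponge", "sponge", "", "sponge", "sponge", "", "bucket"]
    scores = [
        [0.9, 0.0, 0.7, 0.8, 0.0, 0.6, 0.9, 0.0, 0.1],  # box 0
        [0.2, 0.0, 0.1, 0.2, 0.0, 0.1, 0.2, 0.0, 0.8],  # box 1
    ]
    pooled = aggregate_per_object(scores, owners)
    print(best_box(pooled, "sponge"), best_box(pooled, "bucket"))
```

In this toy setup, every phrase that refers to the same object contributes to a single pooled score, so knowledge phrases such as "dry sponge" and "wet sponge" reinforce the plain object name instead of competing with it.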

Reproduce

Requirements

The models in this paper can be run on a single NVIDIA V100 GPU with CUDA version 12.0.

Please see requirements.txt for specific package requirements.
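Before installing the packages, it can help to confirm that a GPU is visible on the machine. The following is a small, optional Python check and is not part of this repository. It assumes the NVIDIA driver utility `nvidia-smi` is on the PATH; the helper name `list_gpus` is hypothetical.

```python
import shutil
import subprocess
from typing import List, Optional


def list_gpus() -> Optional[List[str]]:
    """Return one 'name, driver_version' string per visible GPU.

    Returns None if nvidia-smi is not installed or the query fails.
    """
    exe = shutil.which("nvidia-smi")
    if exe is None:
        return None
    try:
        out = subprocess.run(
            [exe, "--query-gpu=name,driver_version", "--format=csv,noheader"],
            capture_output=True,
            text=True,
            check=True,
            timeout=30,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return None
    # One non-empty line per GPU.
    return [line.strip() for line in out.splitlines() if line.strip()]


if __name__ == "__main__":
    gpus = list_gpus()
    if gpus:
        for gpu in gpus:
            print(gpu)
    else:
        print("No NVIDIA GPU detected (or nvidia-smi is unavailable).")
```

The check only reports what the driver sees. It does not verify that the installed deep learning packages were built for the same CUDA version.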

ENVISION

For data processing and LLM knowledge extraction, please refer to data_processing. For model training and inference using ENVISION, please refer to modeling. For visualizing data and ENVISION inference results, please refer to visualization. For model training and inference using the baselines, please refer to baselines.

BibTeX

If you find the code in this repo useful, please consider citing our paper:

@article{wu2023localizing,
  title={Localizing active objects from egocentric vision with symbolic world knowledge},
  author={Wu, Te-Lin and Zhou, Yu and Peng, Nanyun},
  journal={arXiv preprint arXiv:2310.15066},
  year={2023}
}

