Commit 8af4ad0: Improve artifact documentation

1 parent 7cacf3d

File tree

10 files changed: +127 additions, -43 deletions

README.md

Lines changed: 18 additions & 11 deletions
@@ -48,16 +48,17 @@ conda activate sherpa
```

### Install Sherpa
-This artifact uses the [Sherpa](https://github.com/Aggregate-Intellect/sherpa) to for the use cases. Specifically, it uses a slightly customized version of Sherpa v0.4.0, which is included in the `sherpa` folder in this repository. You can install Sherpa from the source code in this repository.
-To install Sherpa from the source code, first, install with [poetry](https://python-poetry.org/).
-```bash
-pip install poetry
-```
+This artifact uses [Sherpa](https://github.com/Aggregate-Intellect/sherpa) for the use cases. Specifically, it uses a slightly customized version of Sherpa v0.4.0, which is included in the `sherpa` folder of this repository. You can install Sherpa from this source code with `pip` in editable mode.
+
+> [!NOTE]
+>
+> The following step is optional, as it is already configured in the `requirements.txt` file of each use case folder. However, if you experience any issues with the installation from a `requirements.txt` file, remove its first line and run the following commands to install Sherpa directly.

-Then, you can run the following commands:
+To install Sherpa from the source code, run the following commands at the top level of the repository:

```bash
cd sherpa/src
-poetry install --with optional
+pip install -e .
+cd ../..
```
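With the editable install done, a quick way to confirm that Python can resolve the package is shown below. This is a minimal sketch; the module name `sherpa_ai` is an assumption based on upstream Sherpa's package layout.

```python
# Sanity check for the editable Sherpa install.
# Assumption: the customized Sherpa exposes the `sherpa_ai` module, as upstream does.
import importlib.util

spec = importlib.util.find_spec("sherpa_ai")
print("sherpa importable:", spec is not None)
```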

### Install Dependencies
@@ -70,17 +71,23 @@ This repository uses several APIs for accessing the Large Language Models. You n

## Use cases
> [!NOTE]
-> Excepting installing Sherpa, all the instructions for the use cases must be executed in the corresponding use case folder.
+>
+> All the instructions for the use cases must be executed in the corresponding use case folder.

The following folders contain the material for each use case used in the paper:
-* `human_eval` contains the material for the HumanEval benchmark for the code generation use case
-* `clevr-human` contains the material for the Clevr-Human dataset for the question answering use case
-* `state_based_modeling` contains the material for the class name generation use case
+* `human_eval` contains the material for the HumanEval benchmark for the **code generation** use case
+* `clevr-human` contains the material for the Clevr-Human dataset for the **question answering** use case
+* `state_based_modeling` contains the material for the **class name generation** use case

Please refer to the `README.md` in each folder for the details of the use case and how to run the experiments.

Each use case contains an `evaluation.ipynb` notebook with the steps to use the generated results to create the tables and figures in the paper.

+## A Note on Other LLMs
+The use cases in this repository are tested with the following LLMs: GPT-4o, GPT-4o-mini, Qwen/Qwen2.5-7B-Instruct-Turbo, and Meta-Llama-3.1-70B-Instruct-Turbo. However, you can use other LLMs through the wrappers from LangChain.
+
+Some newer LLMs may require upgrading the LangChain version. For example, the latest `gpt-4.1-nano` requires updating `langchain_openai` with `pip install -U langchain_openai`. While newer versions of the dependencies may work, this repository is only tested with the specific versions pinned in the requirements of each use case.

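Since the wrappers share LangChain's chat-model interface, swapping models is a one-line change. A minimal sketch (model names are illustrative; API keys are read from the `OPENAI_API_KEY` and `TOGETHER_API_KEY` environment variables):

```python
# Sketch: instantiating LLMs through LangChain wrappers.
# Any chat model supported by the provider should work here.
from langchain_openai import ChatOpenAI
from langchain_together import ChatTogether

openai_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
together_llm = ChatTogether(model="Qwen/Qwen2.5-7B-Instruct-Turbo")

print(openai_llm.invoke("Say hello in one word.").content)
```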
## Citation
If you found this repository useful, please consider citing the following paper:
```bibtex

clevr-human/README.md

Lines changed: 23 additions & 15 deletions
@@ -1,10 +1,14 @@
# Sherpa for Clevr Human (The Question Answering Use Case)

+> [!NOTE]
+>
+> The following commands assume you are in the `clevr-human` folder.
+
Designing a state machine for solving the question answering task in the Clevr-Human dataset.

## Organization
The use case is organized as follows:
-* `clevr_qa` contains the code implementation for the use case
+* `clevr_qa` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * The folder `react` contains the implementation of the ReACT approach
  * The folder `routing` contains the implementation of the routing state machine approach
  * The folder `state_machine` contains the implementation of the planning state machine approach

@@ -36,30 +40,27 @@ The use case is organized as follows:
# For conda
conda activate clevr
```
-
-2. Install `sherpa` following the top-level Read
-3. Install the requirements:
+2. Install the requirements:
```bash
pip install -r requirements.txt
```

## Create Dataset
-1. download the Download CLEVR v1.0 (no images) from [the Clevr website](https://cs.stanford.edu/people/jcjohns/clevr/)
-2. put the `CLEVR_val_scenes.json` file to the `data` folder
-3. Download human created questions from [here](https://cs.stanford.edu/people/jcjohns/iep/)
-4. Put the `CLEVR-Humans-val` file to the `data` folder
-5. Run `scripts/create_dataset.py` to create the dataset
-6. This script will push the dataset to HuggingFace. Update the `--dataset_name` argument when running the experiments to your dataset name
-7. The processed dataset is also available on [huggingface](https://huggingface.co/datasets/Dogdays/clevr_subset)
+1. Download the dataset using the `download_datasets.sh` script, or follow these manual steps:
+    1. Download CLEVR v1.0 (no images) from [the Clevr website](https://cs.stanford.edu/people/jcjohns/clevr/)
+    2. Put the `CLEVR_val_scenes.json` file in the `data` folder
+    3. Download the human-created questions from [here](https://cs.stanford.edu/people/jcjohns/iep/)
+    4. Put the `CLEVR-Humans-val.json` file in the `data` folder
+2. Run `python -m scripts.create_dataset` to create the dataset
+3. This script will push the dataset to HuggingFace. Update the `--hg_dataset_name` argument when running the experiments to your dataset name. To update the dataset, you may need to log in to HuggingFace from the command line using the `huggingface-cli login` command. Please refer to this [link](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) for more details.
+4. The processed dataset is also available on [huggingface](https://huggingface.co/datasets/Dogdays/clevr_subset)

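Before running the experiments, it is worth confirming that the processed dataset loads. A minimal sketch, assuming the default `Dogdays/clevr_subset` dataset and the pinned `datasets` package:

```python
# Sketch: load the processed CLEVR-Humans subset and inspect its structure.
# Printing the DatasetDict shows the actual splits, columns, and row counts.
from datasets import load_dataset

dataset = load_dataset("Dogdays/clevr_subset")
print(dataset)
```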

## Setup the Environment Variables
Create a `.env` file and copy the content of `.env_template` to it. Then, set the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables to your OpenAI and TogetherAI API keys, respectively.

## Run Question Answering
-Run the `python -m scripts.run_qa` command to run the question answering task. The command has several arguments to control the behavior of the script. Use the `--help` argument to see the available options:
+Run the `python -m scripts.run_qa` command to run the question answering task. The command has several arguments to control the behavior of the script. Use the `--help` argument to see the available arguments:

* **-h, --help**: show this help message and exit
* **--dataset_name**: Name of the processed dataset on HuggingFace. Default is `Dogdays/clevr_subset`. You normally don't need to change this unless you have created your own dataset.

@@ -104,4 +105,11 @@ The evaluation steps are included in the `evaluation.ipynb` notebook to create t
jupyter notebook evaluation.ipynb
```

-Execute all the cells in the notebook to generate the tables and figures.
+Execute all the cells in the notebook to generate the tables and figures.
+
+## Troubleshooting
+* If you encounter an `Unauthorized` error while accessing the uploaded dataset, make sure you have logged in to HuggingFace using the `huggingface-cli login` command.
+* If you encounter a `BadRequestError` about the data types of the dataset while running the question answering task, compare the dataset you are using with the pre-processed dataset at this [link](https://huggingface.co/datasets/Dogdays/clevr_subset) and make sure it is in the same format.
+* If you encounter a `ModuleNotFoundError`, make sure you have installed the requirements in the `requirements.txt` file and are currently in the `clevr-human` folder.
+* Also make sure that you have set the environment variables in the `.env` file correctly, especially the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables.

clevr-human/download_datasets.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Download the scenes for the CLEVR dataset.
+curl https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0_no_images.zip --output clevr.zip
+unzip clevr.zip
+mv CLEVR_v1.0/scenes/CLEVR_val_scenes.json data/CLEVR_val_scenes.json
+rm -rf CLEVR_v1.0
+rm clevr.zip
+
+# Download the questions for the CLEVR dataset.
+curl https://cs.stanford.edu/people/jcjohns/iep/CLEVR-Humans.zip --output clevr_humans.zip
+unzip clevr_humans.zip
+mv CLEVR-Humans/CLEVR-Humans-val.json data/CLEVR-Humans-val.json
+rm -rf CLEVR-Humans
+rm clevr_humans.zip

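After the script finishes, both files should be in the `data` folder. A quick check, run from the `clevr-human` folder:

```python
# Sketch: verify the two CLEVR files landed in the data folder
# after running download_datasets.sh.
from pathlib import Path

for name in ("CLEVR_val_scenes.json", "CLEVR-Humans-val.json"):
    status = "found" if (Path("data") / name).exists() else "missing"
    print(f"{name}: {status}")
```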

clevr-human/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -1,9 +1,11 @@
+../sherpa/src
pandas==2.2.3
datasets==3.2.0
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
langchain_openai==0.2.12
+python-dotenv==1.1.1

## Requirements for running the evaluation notebook
jupyter==1.1.1

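The newly pinned `python-dotenv` dependency is what lets the scripts pick up the `.env` file described in the READMEs. A minimal sketch of how it surfaces the keys:

```python
# Sketch: python-dotenv loads the .env file into the process environment.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OpenAI key set:", bool(os.getenv("OPENAI_API_KEY")))
print("Together key set:", bool(os.getenv("TOGETHER_API_KEY")))
```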

human_eval/Dockerfile

Lines changed: 1 addition & 5 deletions
@@ -5,11 +5,7 @@ WORKDIR /app


# Install poetry and sherpa
-RUN pip install --no-cache-dir poetry
-COPY sherpa /opt/sherpa
-RUN cd /opt/sherpa/src && \
-    poetry config virtualenvs.create false && \
-    poetry install
+COPY sherpa /sherpa

# Install requirements
COPY human_eval/requirements.txt /app/requirements.txt

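This simplification works because the first line added to `requirements.txt` is the relative path `../sherpa/src`; with `WORKDIR /app`, that path resolves to `/sherpa/src`, so copying the Sherpa source into the image is enough for the subsequent requirements install to build it.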

human_eval/README.md

Lines changed: 57 additions & 6 deletions
@@ -1,4 +1,8 @@
# Sherpa for HumanEval (The Code Generation Use Case)
+> [!NOTE]
+>
+> The following commands assume you are in the `human_eval` folder.
+
> [!Warning]
>
> While this use case can also be run using a virtual environment, it is highly recommended to run it in a Docker container or similar isolated environment because the experiments may **execute arbitrary code generated by LLMs**. Below we describe the steps using Jupyter lab in a Docker container. Use the virtual environment only if you are aware of the risks and have taken necessary precautions.

@@ -7,7 +11,7 @@
The use case is organized as follows:
* `run_programs.py`: The main script to run the code generation experiments
* `evaluation.ipynb`: The evaluation notebook to create tables and figures in the paper for the HumanEval use case
-`llm_coder` contains the code implementation for the use case
+* `llm_coder` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * `agent_coder_improved` folder contains the implementation of the agent coder approach
  * `test_based_sm_with_feedback` folder contains the implementation of the test-based state machine approach
  * `coders/direct_prompt_coder.py` contains the implementation of the direct prompt approach

@@ -17,13 +21,18 @@ The use case is organized as follows:
* `requirements.txt` contains the requirements to run the use case
* `Dockerfile` and `docker-compose.yml` contain the configuration to run the use case in a Docker container

-## Installation
+## Installation Preparation
1. First, download the code for the human_eval benchmark from: https://github.com/openai/human-eval
```bash
git clone https://github.com/openai/human-eval
```
-2. Install [Docker](https://docs.docker.com/get-started/overview/) if you haven't already. You can follow the instructions on the Docker website for your operating system.
-3. Build the Docker image using the provided Dockerfile and docker compose (Note that the build context is the root directory of this repository, as specified by `..`).
+   * **NOTE:** Make sure you place the `human-eval` project under the same folder as this README file, i.e., the `human_eval` folder of the repository.
+2. Then, copy the `human_eval` folder from this repository to the `human-eval` folder you just cloned.
+
+## Installation with Docker (Recommended)
+1. Install [Docker](https://docs.docker.com/get-started/overview/) if you haven't already. You can follow the instructions on the Docker website for your operating system.
+2. Build the Docker image using the provided Dockerfile and docker compose (note that the build context is the root directory of this repository, as specified by `..`).

```bash
docker compose build
```

@@ -41,12 +50,46 @@ human_eval-jupyterlab-1 | http://127.0.0.1:8888/lab?token=<token>

3. Open the Jupyter lab in your browser at `http://localhost:8888`, and provide the value of the `token` from the terminal output. Then you will be able to run the experiments in the subsequent steps.

+## Installation with Virtual Environment
+> [!Warning]
+> This use case may run arbitrary code generated by LLMs. It is highly recommended to run it in a Docker container or similar isolated environment. Use the virtual environment only if you are aware of the risks and have taken necessary precautions.
+
+1. Create a new virtual environment for this use case (not required, but recommended):
+
+```bash
+# For venv
+python -m venv humaneval
+
+# For conda
+conda create -n humaneval python=3.12
+```
+
+Activate the virtual environment:
+
+```bash
+# For venv
+source humaneval/bin/activate
+# For conda
+conda activate humaneval
+```
+
+2. Install the requirements:
+```bash
+pip install -r requirements.txt
+```
+
+3. Install the `human_eval` benchmark:
+```bash
+pip install -e human_eval
+```

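If the benchmark install succeeded, its problem set should load. A quick check, assuming the upstream `openai/human-eval` package layout:

```python
# Sketch: confirm the human-eval benchmark is importable.
# read_problems() ships with the upstream openai/human-eval package.
from human_eval.data import read_problems

problems = read_problems()
print(f"loaded {len(problems)} HumanEval problems")
```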
## Setup the Environment Variables
Create a `.env` file and copy the content of `.env_template` to it. Then, set the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables to your OpenAI and TogetherAI API keys, respectively.

## Run the Code Generation
-First, open a new terminal in the Jupyter lab interface. Then, you can run the `run_programs.py` script to generate code using LLMs. This script contains the following commands:
+First, open a new terminal in the Jupyter lab interface. Then, you can run the `run_programs.py` script to generate code using LLMs. This script accepts the following arguments:
* **-h, --help**: show this help message and exit
* **--llm_family**: Provider of the model to use. One of {openai, togetherai}
* **--llm_model**: Name of the LLM to use. The paper uses the following: gpt-4o (openai), gpt-4o-mini (openai), Qwen/Qwen2.5-7B-Instruct-Turbo (together), Meta-Llama-3.1-70B-Instruct-Turbo (together) and Qwen/Qwen2.5-Coder-32B-Instruct (together). You can also use other LLMs from the two providers.

@@ -81,4 +124,12 @@ The `results` folder contains the cached 3-run results of the approaches used in
The run will generate a `jsonl` file with the generated code and the number of LLM calls made to generate it. Each line in the file contains a JSON object.

### Evaluation
-The evaluation steps are included in the `evaluation.ipynb` notebook to create tables and figures in the paper for the HumanEval use case. You can open the notebook in Jupyter lab and execute all the cells to generate the tables and figures.
+#### With Docker
+The evaluation steps are included in the `evaluation.ipynb` notebook to create tables and figures in the paper for the HumanEval use case. You can open the notebook in Jupyter lab and execute all the cells to generate the tables and figures.
+
+#### With virtual environment
+```bash
+jupyter notebook evaluation.ipynb
+```
+
+Execute all the cells in the notebook to generate the tables and figures.
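Because each run emits one JSON object per line, the cached results can also be inspected outside the notebook. A sketch using the pinned `pandas`; the file name is illustrative:

```python
# Sketch: load a generated results file for ad-hoc inspection.
# Each line of the jsonl file is one JSON object per generated solution.
import pandas as pd

results = pd.read_json("results/example_run.jsonl", lines=True)
print(results.head())
print("rows:", len(results))
```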

human_eval/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -1,9 +1,11 @@
+../sherpa/src
pandas==2.2.3
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
langchain_openai==0.2.12
sympy==1.14.0
+python-dotenv==1.1.1

## Requirements for running the evaluation notebook
jupyter==1.1.1

sherpa/README.md

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ poetry install --with optional,test,lint

### From source (with pip editable mode):
```bash
-git clone
cd sherpa/src
pip install -e .
```

state_based_modeling/README.md

Lines changed: 8 additions & 4 deletions
@@ -1,10 +1,15 @@
# Sherpa for Modeling (The Class Name Generation Use Case)
+
+> [!NOTE]
+>
+> The following commands assume you are in the `state_based_modeling` folder.
+
Designing state machines for solving the class name generation task in the Modeling dataset.

## Organization
The use case is organized as follows:
* `evaluation` contains the code implementation for the evaluation of the use case
-* `modeling` contains the code implementation for the use case
+* `modeling` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * The file `model_class.py` contains the implementation of the Inspect state machine
  * The file `model_class_mig.py` contains the implementation of the MIG state machine
* `ground_truth` contains the ground truth model for the dataset

@@ -37,8 +42,7 @@ The use case is organized as follows:
conda activate modeling
```

-2. Install `sherpa` following the top-level Read
-3. Install the requirements:
+2. Install the requirements:
```bash
pip install -r requirements.txt
```

@@ -68,7 +72,7 @@ To repeat the experiments in the paper, run each LLM three times with the same c
The command will output a `txt` file for each class name problem, containing the class name generated by the LLM.

### Run the State Machine Approaches
-Run `scripts.sm_main.py` to generate class names using the state machine approaches. It contains the following commands:
+Run `scripts.sm_main.py` to generate class names using the state machine approaches. It accepts the following arguments:
* **-h, --help**: show this help message and exit
* **--model_type**: Provider of the model to use. One of {openai, together}
* **--llm**: Name of the LLM to use. The paper uses the following: gpt-4o (openai), gpt-4o-mini (openai), Qwen/Qwen2.5-7B-Instruct-Turbo (together) and Meta-Llama-3.1-70B-Instruct-Turbo (together).
state_based_modeling/requirements.txt

Lines changed: 3 additions & 1 deletion

@@ -1,5 +1,7 @@
+../sherpa/src
pandas==2.2.3
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
-langchain_openai==0.2.12
+langchain_openai==0.2.12
+python-dotenv==1.1.1
