Commit 8af4ad0: Improve artifact documentation

1 parent 7cacf3d

File tree

10 files changed: +127 additions, -43 deletions

README.md

Lines changed: 18 additions & 11 deletions
@@ -48,16 +48,17 @@ conda activate sherpa
```

### Install Sherpa
-This artifact uses the [Sherpa](https://github.com/Aggregate-Intellect/sherpa) to for the use cases. Specifically, it uses a slightly customized version of Sherpa v0.4.0, which is included in the `sherpa` folder in this repository. You can install Sherpa from the source code in this repository.
-To install Sherpa from the source code, first, install with [poetry](https://python-poetry.org/).
-```bash
-pip install poetry
-```
+This artifact uses [Sherpa](https://github.com/Aggregate-Intellect/sherpa) for the use cases. Specifically, it uses a slightly customized version of Sherpa v0.4.0, which is included in the `sherpa` folder of this repository. You can install Sherpa from this source code with `pip` in editable mode.
+
+> [!NOTE]
+>
+> The following step is optional, as it is already configured in the `requirements.txt` file of each use case folder. However, if you experience any issues with the installation from a `requirements.txt` file, remove its first line and run the following commands to install Sherpa directly.

-Then, you can run the following commands:
+To install Sherpa from the source code, run the following commands at the top level of the repository:

```bash
cd sherpa/src
-poetry install --with optional
+pip install -e .
+cd ../..
```
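With the editable install done, a quick way to confirm that Python can resolve the package is shown below. This is a minimal sketch; the module name `sherpa_ai` is an assumption based on upstream Sherpa's package layout.

```python
# Sanity check for the editable Sherpa install.
# Assumption: the customized Sherpa exposes the `sherpa_ai` module, as upstream does.
import importlib.util

spec = importlib.util.find_spec("sherpa_ai")
print("sherpa importable:", spec is not None)
```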

### Install Dependencies
@@ -70,17 +71,23 @@ This repository uses several APIs for accessing the Large Language Models. You n

## Use cases
> [!NOTE]
-> Excepting installing Sherpa, all the instructions for the use cases must be executed in the corresponding use case folder.
+>
+> All the instructions for the use cases must be executed in the corresponding use case folder.

The following folders contain the material for each use case used in the paper:
-* `human_eval` contains the material for the HumanEval benchmark for the code generation use case
-* `clevr-human` contains the material for the Clevr-Human dataset for the question answering use case
-* `state_based_modeling` contains the material for the class name generation use case
+* `human_eval` contains the material for the HumanEval benchmark for the **code generation** use case
+* `clevr-human` contains the material for the Clevr-Human dataset for the **question answering** use case
+* `state_based_modeling` contains the material for the **class name generation** use case

Please refer to the `README.md` in each folder for the details of the use case and how to run the experiments.

Each use case contains an `evaluation.ipynb` notebook with the steps to use the generated results to create the tables and figures in the paper.

+## A Note on Other LLMs
+The use cases in this repository are tested with the following LLMs: GPT-4o, GPT-4o-mini, Qwen/Qwen2.5-7B-Instruct-Turbo, and Meta-Llama-3.1-70B-Instruct-Turbo. However, you can use other LLMs through the wrappers from LangChain.
+
+Some newer LLMs may require upgrading the LangChain version. For example, the latest `gpt-4.1-nano` requires updating `langchain_openai` with `pip install -U langchain_openai`. While newer versions of the dependencies may work, this repository is only tested with the specific versions pinned in the requirements of each use case.

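Since the wrappers share LangChain's chat-model interface, swapping models is a one-line change. A minimal sketch (model names are illustrative; API keys are read from the `OPENAI_API_KEY` and `TOGETHER_API_KEY` environment variables):

```python
# Sketch: instantiating LLMs through LangChain wrappers.
# Any chat model supported by the provider should work here.
from langchain_openai import ChatOpenAI
from langchain_together import ChatTogether

openai_llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
together_llm = ChatTogether(model="Qwen/Qwen2.5-7B-Instruct-Turbo")

print(openai_llm.invoke("Say hello in one word.").content)
```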
## Citation
If you found this repository useful, please consider citing the following paper:
```bibtex

clevr-human/README.md

Lines changed: 23 additions & 15 deletions
@@ -1,10 +1,14 @@
# Sherpa for Clevr Human (The Question Answering Use Case)

+> [!NOTE]
+>
+> The following commands assume you are in the `clevr-human` folder.
+
Designing a state machine for solving the question answering task in the Clevr-Human dataset.

## Organization
The use case is organized as follows:
-* `clevr_qa` contains the code implementation for the use case
+* `clevr_qa` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * The folder `react` contains the implementation of the ReACT approach
  * The folder `routing` contains the implementation of the routing state machine approach
  * The folder `state_machine` contains the implementation of the planning state machine approach

@@ -36,30 +40,27 @@ The use case is organized as follows:
# For conda
conda activate clevr
```
-
-2. Install `sherpa` following the top-level Read
-3. Install the requirements:
+2. Install the requirements:
```bash
pip install -r requirements.txt
```

## Create Dataset
-1. download the Download CLEVR v1.0 (no images) from [the Clevr website](https://cs.stanford.edu/people/jcjohns/clevr/)
-2. put the `CLEVR_val_scenes.json` file to the `data` folder
-3. Download human created questions from [here](https://cs.stanford.edu/people/jcjohns/iep/)
-4. Put the `CLEVR-Humans-val` file to the `data` folder
-5. Run `scripts/create_dataset.py` to create the dataset
-6. This script will push the dataset to HuggingFace. Update the `--dataset_name` argument when running the experiments to your dataset name
-7. The processed dataset is also available on [huggingface](https://huggingface.co/datasets/Dogdays/clevr_subset)
+1. Download the dataset using the `download_datasets.sh` script, or follow these manual steps:
+    1. Download CLEVR v1.0 (no images) from [the Clevr website](https://cs.stanford.edu/people/jcjohns/clevr/)
+    2. Put the `CLEVR_val_scenes.json` file in the `data` folder
+    3. Download the human-created questions from [here](https://cs.stanford.edu/people/jcjohns/iep/)
+    4. Put the `CLEVR-Humans-val.json` file in the `data` folder
+2. Run `python -m scripts.create_dataset` to create the dataset
+3. This script will push the dataset to HuggingFace. Update the `--hg_dataset_name` argument when running the experiments to your dataset name. To update the dataset, you may need to log in to HuggingFace from the command line using the `huggingface-cli login` command. Please refer to this [link](https://huggingface.co/docs/huggingface_hub/en/quick-start#authentication) for more details.
+4. The processed dataset is also available on [huggingface](https://huggingface.co/datasets/Dogdays/clevr_subset)

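Before running the experiments, it is worth confirming that the processed dataset loads. A minimal sketch, assuming the default `Dogdays/clevr_subset` dataset and the pinned `datasets` package:

```python
# Sketch: load the processed CLEVR-Humans subset and inspect its structure.
# Printing the DatasetDict shows the actual splits, columns, and row counts.
from datasets import load_dataset

dataset = load_dataset("Dogdays/clevr_subset")
print(dataset)
```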

## Setup the Environment Variables
Create a `.env` file and copy the content of `.env_template` to it. Then, set the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables to your OpenAI and TogetherAI API keys, respectively.

## Run Question Answering
-Run the `python -m scripts.run_qa` command to run the question answering task. The command has several arguments to control the behavior of the script. Use the `--help` argument to see the available options:
+Run the `python -m scripts.run_qa` command to run the question answering task. The command has several arguments to control the behavior of the script. Use the `--help` argument to see the available arguments:

* **-h, --help**: show this help message and exit
* **--dataset_name**: Name of the processed dataset on HuggingFace. Default is `Dogdays/clevr_subset`. You normally don't need to change this unless you have created your own dataset.

@@ -104,4 +105,11 @@ The evaluation steps are included in the `evaluation.ipynb` notebook to create t
jupyter notebook evaluation.ipynb
```

-Execute all the cells in the notebook to generate the tables and figures.
+Execute all the cells in the notebook to generate the tables and figures.
+
+## Troubleshooting
+* If you encounter an `Unauthorized` error while accessing the uploaded dataset, make sure you have logged in to HuggingFace using the `huggingface-cli login` command.
+* If you encounter a `BadRequestError` about the data types of the dataset while running the question answering task, compare the dataset you are using with the pre-processed dataset at this [link](https://huggingface.co/datasets/Dogdays/clevr_subset) and make sure it is in the same format.
+* If you encounter a `ModuleNotFoundError`, make sure you have installed the requirements in the `requirements.txt` file and are currently in the `clevr-human` folder.
+* Also make sure that you have set the environment variables in the `.env` file correctly, especially the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables.

clevr-human/download_datasets.sh

Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
+# Download the scenes for the CLEVR dataset.
+curl https://dl.fbaipublicfiles.com/clevr/CLEVR_v1.0_no_images.zip --output clevr.zip
+unzip clevr.zip
+mv CLEVR_v1.0/scenes/CLEVR_val_scenes.json data/CLEVR_val_scenes.json
+rm -rf CLEVR_v1.0
+rm clevr.zip
+
+# Download the questions for the CLEVR dataset.
+curl https://cs.stanford.edu/people/jcjohns/iep/CLEVR-Humans.zip --output clevr_humans.zip
+unzip clevr_humans.zip
+mv CLEVR-Humans/CLEVR-Humans-val.json data/CLEVR-Humans-val.json
+rm -rf CLEVR-Humans
+rm clevr_humans.zip

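After the script finishes, both files should be in the `data` folder. A quick check, run from the `clevr-human` folder:

```python
# Sketch: verify the two CLEVR files landed in the data folder
# after running download_datasets.sh.
from pathlib import Path

for name in ("CLEVR_val_scenes.json", "CLEVR-Humans-val.json"):
    status = "found" if (Path("data") / name).exists() else "missing"
    print(f"{name}: {status}")
```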

clevr-human/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -1,9 +1,11 @@
+../sherpa/src
pandas==2.2.3
datasets==3.2.0
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
langchain_openai==0.2.12
+python-dotenv==1.1.1

## Requirements for running the evaluation notebook
jupyter==1.1.1

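The newly pinned `python-dotenv` dependency is what lets the scripts pick up the `.env` file described in the READMEs. A minimal sketch of how it surfaces the keys:

```python
# Sketch: python-dotenv loads the .env file into the process environment.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
print("OpenAI key set:", bool(os.getenv("OPENAI_API_KEY")))
print("Together key set:", bool(os.getenv("TOGETHER_API_KEY")))
```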

human_eval/Dockerfile

Lines changed: 1 addition & 5 deletions
@@ -5,11 +5,7 @@ WORKDIR /app


# Install poetry and sherpa
-RUN pip install --no-cache-dir poetry
-COPY sherpa /opt/sherpa
-RUN cd /opt/sherpa/src && \
-    poetry config virtualenvs.create false && \
-    poetry install
+COPY sherpa /sherpa

# Install requirements
COPY human_eval/requirements.txt /app/requirements.txt

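This simplification works because the first line added to `requirements.txt` is the relative path `../sherpa/src`; with `WORKDIR /app`, that path resolves to `/sherpa/src`, so copying the Sherpa source into the image is enough for the subsequent requirements install to build it.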

human_eval/README.md

Lines changed: 57 additions & 6 deletions
@@ -1,4 +1,8 @@
# Sherpa for HumanEval (The Code Generation Use Case)
+> [!NOTE]
+>
+> The following commands assume you are in the `human_eval` folder.
+
> [!Warning]
>
> While this use case can also be run using a virtual environment, it is highly recommended to run it in a Docker container or similar isolated environment because the experiments may **execute arbitrary code generated by LLMs**. Below we describe the steps using Jupyter lab in a Docker container. Use the virtual environment only if you are aware of the risks and have taken necessary precautions.

@@ -7,7 +11,7 @@
The use case is organized as follows:
* `run_programs.py`: The main script to run the code generation experiments
* `evaluation.ipynb`: The evaluation notebook to create tables and figures in the paper for the HumanEval use case
-`llm_coder` contains the code implementation for the use case
+* `llm_coder` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * `agent_coder_improved` folder contains the implementation of the agent coder approach
  * `test_based_sm_with_feedback` folder contains the implementation of the test-based state machine approach
  * `coders/direct_prompt_coder.py` contains the implementation of the direct prompt approach

@@ -17,13 +21,18 @@ The use case is organized as follows:
* `requirements.txt` contains the requirements to run the use case
* `Dockerfile` and `docker-compose.yml` contain the configuration to run the use case in a Docker container

-## Installation
+## Installation Preparation
1. First, download the code for the human_eval benchmark from: https://github.com/openai/human-eval
```bash
git clone https://github.com/openai/human-eval
```
-2. Install [Docker](https://docs.docker.com/get-started/overview/) if you haven't already. You can follow the instructions on the Docker website for your operating system.
-3. Build the Docker image using the provided Dockerfile and docker compose (Note that the build context is the root directory of this repository, as specified by `..`).
+   * **NOTE:** Make sure you place the `human-eval` project under the same folder as this README file, i.e., the `human_eval` folder of the repository.
+2. Then, copy the `human_eval` folder from this repository to the `human-eval` folder you just cloned.
+
+## Installation with Docker (Recommended)
+1. Install [Docker](https://docs.docker.com/get-started/overview/) if you haven't already. You can follow the instructions on the Docker website for your operating system.
+2. Build the Docker image using the provided Dockerfile and docker compose (note that the build context is the root directory of this repository, as specified by `..`).

```bash
docker compose build
```

@@ -41,12 +50,46 @@ human_eval-jupyterlab-1 | http://127.0.0.1:8888/lab?token=<token>

3. Open the Jupyter lab in your browser at `http://localhost:8888`, and provide the value of the `token` from the terminal output. Then you will be able to run the experiments in the subsequent steps.

+## Installation with Virtual Environment
+> [!Warning]
+> This use case may run arbitrary code generated by LLMs. It is highly recommended to run it in a Docker container or similar isolated environment. Use the virtual environment only if you are aware of the risks and have taken necessary precautions.
+
+1. Create a new virtual environment for this use case (not required, but recommended):
+
+```bash
+# For venv
+python -m venv humaneval
+
+# For conda
+conda create -n humaneval python=3.12
+```
+
+Activate the virtual environment:
+
+```bash
+# For venv
+source humaneval/bin/activate
+# For conda
+conda activate humaneval
+```
+
+2. Install the requirements:
+```bash
+pip install -r requirements.txt
+```
+
+3. Install the `human_eval` benchmark:
+```bash
+pip install -e human_eval
+```

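If the benchmark install succeeded, its problem set should load. A quick check, assuming the upstream `openai/human-eval` package layout:

```python
# Sketch: confirm the human-eval benchmark is importable.
# read_problems() ships with the upstream openai/human-eval package.
from human_eval.data import read_problems

problems = read_problems()
print(f"loaded {len(problems)} HumanEval problems")
```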
## Setup the Environment Variables
Create a `.env` file and copy the content of `.env_template` to it. Then, set the `OPENAI_API_KEY` and `TOGETHER_API_KEY` variables to your OpenAI and TogetherAI API keys, respectively.

## Run the Code Generation
-First, open a new terminal in the Jupyter lab interface. Then, you can run the `run_programs.py` script to generate code using LLMs. This script contains the following commands:
+First, open a new terminal in the Jupyter lab interface. Then, you can run the `run_programs.py` script to generate code using LLMs. This script accepts the following arguments:
* **-h, --help**: show this help message and exit
* **--llm_family**: Provider of the model to use. One of {openai, togetherai}
* **--llm_model**: Name of the LLM to use. The paper uses the following: gpt-4o (openai), gpt-4o-mini (openai), Qwen/Qwen2.5-7B-Instruct-Turbo (together), Meta-Llama-3.1-70B-Instruct-Turbo (together) and Qwen/Qwen2.5-Coder-32B-Instruct (together). You can also use other LLMs from the two providers.

@@ -81,4 +124,12 @@ The `results` folder contains the cached 3-run results of the approaches used in
The run will generate a `jsonl` file with the generated code and the number of LLM calls made to generate it. Each line in the file contains a JSON object.

### Evaluation
-The evaluation steps are included in the `evaluation.ipynb` notebook to create tables and figures in the paper for the HumanEval use case. You can open the notebook in Jupyter lab and execute all the cells to generate the tables and figures.
+#### With Docker
+The evaluation steps are included in the `evaluation.ipynb` notebook to create tables and figures in the paper for the HumanEval use case. You can open the notebook in Jupyter lab and execute all the cells to generate the tables and figures.
+
+#### With virtual environment
+```bash
+jupyter notebook evaluation.ipynb
+```
+
+Execute all the cells in the notebook to generate the tables and figures.
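Because each run emits one JSON object per line, the cached results can also be inspected outside the notebook. A sketch using the pinned `pandas`; the file name is illustrative:

```python
# Sketch: load a generated results file for ad-hoc inspection.
# Each line of the jsonl file is one JSON object per generated solution.
import pandas as pd

results = pd.read_json("results/example_run.jsonl", lines=True)
print(results.head())
print("rows:", len(results))
```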

human_eval/requirements.txt

Lines changed: 2 additions & 0 deletions
@@ -1,9 +1,11 @@
+../sherpa/src
pandas==2.2.3
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
langchain_openai==0.2.12
sympy==1.14.0
+python-dotenv==1.1.1

## Requirements for running the evaluation notebook
jupyter==1.1.1

sherpa/README.md

Lines changed: 0 additions & 1 deletion
@@ -20,7 +20,6 @@ poetry install --with optional,test,lint

### From source (with pip editable mode):
```bash
-git clone
cd sherpa/src
pip install -e .
```

state_based_modeling/README.md

Lines changed: 8 additions & 4 deletions
@@ -1,10 +1,15 @@
# Sherpa for Modeling (The Class Name Generation Use Case)
+
+> [!NOTE]
+>
+> The following commands assume you are in the `state_based_modeling` folder.
+
Designing state machines for solving the class name generation task in the Modeling dataset.

## Organization
The use case is organized as follows:
* `evaluation` contains the code implementation for the evaluation of the use case
-* `modeling` contains the code implementation for the use case
+* `modeling` contains the code implementation for the use case. Specifically, it includes the implementation of the following approaches in the paper:
  * The file `model_class.py` contains the implementation of the Inspect state machine
  * The file `model_class_mig.py` contains the implementation of the MIG state machine
* `ground_truth` contains the ground truth model for the dataset

@@ -37,8 +42,7 @@ The use case is organized as follows:
conda activate modeling
```

-2. Install `sherpa` following the top-level Read
-3. Install the requirements:
+2. Install the requirements:
```bash
pip install -r requirements.txt
```

@@ -68,7 +72,7 @@ To repeat the experiments in the paper, run each LLM three times with the same c
The command will output a `txt` file for each class name problem, containing the class name generated by the LLM.

### Run the State Machine Approaches
-Run `scripts.sm_main.py` to generate class names using the state machine approaches. It contains the following commands:
+Run `scripts.sm_main.py` to generate class names using the state machine approaches. It accepts the following arguments:
* **-h, --help**: show this help message and exit
* **--model_type**: Provider of the model to use. One of {openai, together}
* **--llm**: Name of the LLM to use. The paper uses the following: gpt-4o (openai), gpt-4o-mini (openai), Qwen/Qwen2.5-7B-Instruct-Turbo (together) and Meta-Llama-3.1-70B-Instruct-Turbo (together).
state_based_modeling/requirements.txt

Lines changed: 3 additions & 1 deletion

@@ -1,5 +1,7 @@
+../sherpa/src
pandas==2.2.3
langchain-together==0.2.0
langchain_core==0.3.56
pydantic==2.10.3
-langchain_openai==0.2.12
+langchain_openai==0.2.12
+python-dotenv==1.1.1
