7 changes: 7 additions & 0 deletions .gitignore
@@ -0,0 +1,7 @@
__pycache__/
.DS_Store
*.egg-info/
.vscode/
.env
.venv
*~
27 changes: 27 additions & 0 deletions LICENSE
@@ -173,3 +173,30 @@
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.

END OF TERMS AND CONDITIONS

APPENDIX: How to apply the Apache License to your work.

To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.

Copyright 2022 ReCode Benchmark

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
107 changes: 99 additions & 8 deletions README.md
@@ -1,17 +1,108 @@
# ReCode: Robustness Evaluation of Code Generation Models

This is the repository for ReCode, a comprehensive benchmark for evaluating the practical robustness of code generation models such as CodeGen, InCoder, GPT-J, and CodeT5. Specifically, the benchmark provides over 30 general perturbations on docstrings, function names, and code. The perturbations are carefully selected and implemented so that the perturbed datasets remain natural and semantically close to the original, non-perturbed datasets. All perturbations are generated automatically, making them easy to use and customize. With these perturbations, users can obtain a comprehensive analysis of a model's robustness.

Our benchmark is general with regard to datasets and models. Given the perturbed datasets, users can evaluate any public or customized code generation model with the default inference provided by our benchmark. Users can also plug in their own inference scripts by replacing `evaluate-public-models/run_eval_models.sh`.
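
A rough sketch of what such a custom inference script might look like is shown below. It assumes HumanEval-style JSONL problems (`task_id`, `prompt`) and human-eval-style completion output; the model name, file names, and generation settings are illustrative assumptions, and the actual interface expected by `run_eval_models.sh` is defined in the repo.
```
# Minimal sketch of a custom inference script (illustrative only, not the
# ReCode default): reads HumanEval-style problems ("task_id", "prompt") from a
# JSONL file and writes completions in the format human-eval expects.
import json
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Salesforce/codegen-350M-mono"  # any causal code LM on the Hugging Face hub
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL)

samples = []
with open("perturbed_problems.jsonl") as f:      # hypothetical perturbed dataset file
    for line in f:
        problem = json.loads(line)
        inputs = tokenizer(problem["prompt"], return_tensors="pt")
        outputs = model.generate(**inputs, max_new_tokens=256, do_sample=False)
        text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        samples.append({"task_id": problem["task_id"],
                        "completion": text[len(problem["prompt"]):]})

with open("samples.jsonl", "w") as f:            # completions to be scored downstream
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```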

After model evaluation on the perturbed datasets is done, we provide an overall robustness analysis so that users can easily compare different models and become aware of potential practical robustness problems.

Lastly, we release a standard version of the perturbed HumanEval and MBPP datasets in this benchmark for general robustness evaluation and for comparison across models proposed in future work.

## Installation
We use Python 3.8 and CUDA 11.6. Anaconda is recommended. Please run the following commands for installation.
```
conda deactivate; conda env remove --name ReCode
conda create --name ReCode python=3.8
conda activate ReCode
```

Install Hugging Face Transformers and PyTorch for model inference:
```
pip install transformers==4.21.1
pip install -U torch==1.11.0+cu113 -f https://download.pytorch.org/whl/torch_stable.html
```

Install HumanEval. You need to enable execution by uncommenting the execution line (the `exec` call that runs generated programs) in `execution.py`.
```
cd evaluate-public-models
# git clone https://github.com/openai/human-eval # we already provide the humaneval files in repo
pip install -e human-eval
cd ..
```
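
As an optional sanity check (assuming the bundled package matches the upstream OpenAI human-eval layout), you can confirm that the package imports and the bundled problems load:
```
# Optional smoke test for the human-eval installation: import the package and
# load the bundled HumanEval problems.
from human_eval.data import read_problems

problems = read_problems()
print(f"Loaded {len(problems)} HumanEval problems")
```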

Install nlaugmenter for perturbations:
```
cd nlaugmenter
pip install -r requirements.txt
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.0.0/en_core_web_sm-3.0.0.tar.gz
cd ..
```

## Running our benchmarks
We have four main types of perturbations, with multiple variants defined and implemented for each type; the detailed configuration can be found in `config.py`. To run the benchmark, please configure the data and model paths correctly in `config.py`, then run the following commands to create partial code and to perturb the datasets with each type of perturbation.
```
python run_robust.py create_partial natgen # preparing partial code for code perturbations
python run_robust.py perturb nlaugmenter # perturb with nlaugmenter
python run_robust.py perturb func_name # perturb with function rename
python run_robust.py perturb natgen # perturb with code structure transformation
python run_robust.py perturb code # perturb with code format transformation
```

One can specify the augmentation method for each type of perturbation with `--aug_method`; the corresponding indices are defined in `config.py`. `--datasets` allows specifying which datasets to perturb; a toy illustration of the CamelCase rename is shown after the command.
```
python run_robust.py perturb func_name --aug_method 0 --datasets humaneval mbpp # perturb with function rename CamelCase (index=0 defined in config.py) on humaneval and mbpp
```
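
For intuition only, the sketch below mimics what a CamelCase function rename does to a prompt; it is not the benchmark's actual implementation, which lives in the perturbation modules configured through `config.py`.
```
# Toy illustration of the CamelCase function-rename perturbation (for intuition
# only): rename a snake_case function in the prompt to camelCase.
def to_camel_case(name):
    first, *rest = name.split("_")
    return first + "".join(part.capitalize() for part in rest)

original_prompt = 'def add_two_numbers(a, b):\n    """Return the sum of a and b."""\n'
perturbed_prompt = original_prompt.replace("add_two_numbers",
                                           to_camel_case("add_two_numbers"))
print(perturbed_prompt)  # the function is now named addTwoNumbers
```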

To evaluate the models, one can run:
```
python run_robust.py nominal normal # nominal evaluation with non-perturbed datasets
python run_robust.py nominal natgen # nominal evaluation with non-perturbed partial code datasets
python run_robust.py exec nlaugmenter # nlaugmenter perturbed datasets evaluation
python run_robust.py exec func_name # function rename perturbed datasets evaluation
python run_robust.py exec natgen # code structure perturbed datasets evaluation
python run_robust.py exec format # code format transformation perturbed datasets evaluation
```

If one wants to perturb or evaluate with a specific augmentation method, one can run:
```
python run_robust.py perturb func_name --aug_method 0 # perturb dataset with function rename CamelCase (index=0 defined in config.py)
python run_robust.py exec func_name --aug_method 0 # evaluate model on dataset with function rename CamelCase (index=0 defined in config.py)
```

To target specific models, please use the arguments `--models` and `--datasets`. Note that the model and dataset paths and names must be configured correctly.
```
python run_robust.py perturb func_name --datasets humaneval mbpp --models codegen-350M-multi codegen-350M-mono # perturb dataset humaneval mbpp on codegen-350M-multi and codegen-350M-mono
python run_robust.py exec func_name --datasets humaneval mbpp --models codegen-350M-multi codegen-350M-mono # evaluate model on dataset humaneval mbpp on codegen-350M-multi and codegen-350M-mono
```

To analyze the evaluation results, one can run the following commands. The `report` option summarizes the results, while the `analysis` option prints the perturbed data and the model's completions.
```
python run_robust.py report func_name --models codegen-350M-multi --datasets mbpp # get results for dataset perturbed with function rename by codegen-350M-multi
python run_robust.py analysis func_name --models codegen-350M-multi --datasets mbpp # analyze completion samples for dataset perturbed with function rename by codegen-350M-multi
```

To debug and customize perturbations, one can use the low-level APIs directly. Turn on `--print_sample` to inspect the customized perturbation applied to each sample.
```
python perturb.py --method format --aug_method 0
```

## Security

See [CONTRIBUTING](CONTRIBUTING.md#security-issue-notifications) for more information.

## License

The ReCode benchmark is licensed under the Apache-2.0 License.

## Authors
This code generation robustness benchmark (ReCode) was developed by a team at Amazon AWS:

- Shiqi Wang, wshiqi@amazon.com, (main developer)
- Zheng Li, zl634@cornell.edu
- Haifeng Qian, qianhf@amazon.com
- Mingyue Shang, myshang@amazon.com
- Chenghao Yang, ychengha@amazon.com
- Zijian Wang, zijwan@amazon.com
- Varun Kumar, kuvrun@amazon.com
- Samson Tan, samson@amazon.com
- Baishakhi Ray, rabaisha@amazon.com
- Parminder Bhatia, parmib@amazon.com
- Murali Krishna Ramanathan, mkraman@amazon.com
- Bing Xiang, bxiang@amazon.com