This repository hosts the code for the paper "Preference Learning with Lie Detectors can Induce Honesty or Evasion".
The script run.sh shows an example setup and a basic experimental run. To change the run configuration, set flags such as DO_DPO to true or false. The codebase has been tested on the pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel Docker image.
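
As a rough sketch of how the tested environment might be reproduced, the snippet below wraps a Docker invocation in a helper function. Everything except the image name, run.sh, and DO_DPO is an assumption: the helper name launch is hypothetical, the repository is assumed to be the current directory, GPU access is assumed to go through the NVIDIA Container Toolkit, and run.sh is assumed to read DO_DPO from the environment. If run.sh assigns the flag itself, edit the value in the script instead.

```shell
# Image named above as the tested environment.
IMAGE="pytorch/pytorch:2.5.1-cuda12.1-cudnn9-devel"

# Hypothetical helper: run run.sh inside the tested image.
# Assumptions: the repository root is the current directory, and run.sh
# honors a DO_DPO value passed through the environment.
launch() {
  # --gpus all  : expose host GPUs (requires the NVIDIA Container Toolkit)
  # -v / -w     : mount the repository and use it as the working directory
  # -e DO_DPO   : forward DO_DPO from the host only if it is set there
  docker run --rm --gpus all \
    -v "$PWD":/workspace -w /workspace \
    -e DO_DPO \
    "$IMAGE" bash run.sh
}
```

With the function defined, a run with the flag turned off could be started as DO_DPO=false launch, again assuming run.sh picks the value up from the environment.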