Open-source codebase for Terminator Policy Gradient (TermPG), from "Reinforcement Learning with a Terminator" (NeurIPS 2022).
To use Terminator, make sure Python 3 is installed and pip is up to date. This project was tested with Python 3.8.
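To check which interpreter and pip versions you have:

```
python3 --version
python3 -m pip --version
```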
Clone the repository:

```
git clone https://github.com/guytenn/Terminator.git
```

It is recommended to install the requirements inside a virtual environment. To set one up, follow these steps:

```
cd Terminator/
python3 -m venv terminator_env
source terminator_env/bin/activate
# Upgrade Pip
pip install --upgrade pip
```

While in the Terminator directory, install the requirements with:
```
pip install .
```

You can find the latest version of Backseat Driver here (currently only supports Linux).
Download and unzip the files to `src/envs/backseat_driver/build/`.
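A minimal sketch of that step, assuming the downloaded archive is named `backseat_driver.zip` (the actual file name may differ):

```
# Hypothetical archive name; replace with the file you downloaded
mkdir -p src/envs/backseat_driver/build
unzip backseat_driver.zip -d src/envs/backseat_driver/build/
```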
Your file system should be organized as follows:

```
src/envs/backseat_driver/build/BackseatDriverTerm_BurstDebugInformation_DoNotShip
src/envs/backseat_driver/build/BackseatDriverTerm_Data
src/envs/backseat_driver/build/BackseatDriverTerm.x86_64
src/envs/backseat_driver/build/UnityPlayer.so
```

The following command runs TermPG with its default parameters:
```
python3 run.py --learn_costs --termination_gamma
```

Below is a list of the arguments you can change; an illustrative example command follows each table.
| TermPG Arguments | Description |
|---|---|
| `--learn_costs` | Learn costs according to TermPG |
| `--termination_gamma` | Use a dynamic discount factor according to TermPG |
| `--cost_coef 1` | Cost coefficient for termination in the environment |
| `--bonus_coef 1` | Bonus coefficient for termination cost confidence in TermPG |
| `--bonus_type 'maxmin'` | Type of bonus to use for costs (one of: 'none', 'std', 'maxmin') |
| `--reward_penalty_coef 0` | Penalty coefficient for costs (penalize reward by estimated costs) |
| `--termination_penalty 0` | A penalty for termination (reward-shaping variant) |
| `--reward_bonus_coef 0` | Bonus coefficient for optimism in costs |
| `--window 30` | Window size for termination |
| `--env_window -1` | The real window the environment will use for termination; -1 uses the default window |
| `--n_ensemble 3` | Number of networks in the cost model ensemble |
| `--term_train_steps 30` | Number of training steps for the terminator |
| `--term_batch_size 64` | Batch size for the terminator |
| `--term_replay_size 1000` | Replay size for the terminator |
| `--cost_in_state` | Adds the true cost to the state (TerMDP with known costs) |
| `--no_termination` | Disables termination in the environment |
| `--cost_history_in_state` | Adds the history of costs to the state in addition to the accumulated cost |
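For example, a hypothetical run that keeps the learned-cost pipeline but switches to a std-based bonus with a larger ensemble (the values here are illustrative, not tuned settings):

```
# Illustrative flag values only
python3 run.py --learn_costs --termination_gamma --bonus_type 'std' --bonus_coef 2 --n_ensemble 5
```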
| General Arguments | Description |
|---|---|
| `--train_timesteps 1000000` | Number of simulation timesteps to train a policy |
| `--train_batch_size 1024` | Number of timesteps collected for each SGD round; this defines the size of each SGD epoch |
| `--batch_size 32` | Total SGD batch size across all devices; this defines the minibatch size within each epoch |
| `--num_epochs 3` | Number of SGD iterations in each outer loop (i.e., number of epochs to execute per train batch) |
| `--graphics` | When enabled, renders the environment |
| `--wandb` | Log to wandb |
| `--project_name` | Project name for wandb logging |
| `--run_name` | Run name for wandb logging |
| `--num_processes 8` | Number of workers during training (-1 uses all CPUs) |
| `--num_gpus 1` | Number of GPUs to use for training |
| `--entropy_coeff 0` | Entropy loss coefficient |
| `--use_lstm` | Use a recurrent policy |
| `--clean_data` | Removes all model files in src/data |
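Likewise, a hypothetical training command that mixes TermPG and general arguments and logs to wandb (the project and run names below are placeholders, and the flag values are illustrative):

```
# Placeholder wandb names; choose your own
python3 run.py --learn_costs --termination_gamma \
    --train_timesteps 2000000 --num_processes -1 --num_gpus 1 \
    --wandb --project_name my_terminator_project --run_name termpg_backseat_driver
```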
To cite our paper, please use:

```
@article{tennenholtz2022reinforcement,
  title={Reinforcement Learning with a Terminator},
  author={Tennenholtz, Guy and Merlis, Nadav and Shani, Lior and Mannor, Shie and Shalit, Uri and Chechik, Gal and Hallak, Assaf and Dalal, Gal},
  journal={arXiv preprint arXiv:2205.15376},
  year={2022}
}
```