Variational Autoencoder with Arbitrary Conditioning (VAEAC) is a neural probabilistic model based on the variational autoencoder that can be conditioned on an arbitrary subset of observed features and then sample the remaining features.
For more details, see the paper:
Oleg Ivanov, Michael Figurnov, Dmitry Vetrov.
Variational Autoencoder with Arbitrary Conditioning, ICLR 2019,
https://openreview.net/forum?id=SyxtJh0qYm
This PyTorch code implements the model and reproduces the results from the paper.
Install prerequisites from requirements.txt.
This code was tested on Linux (but it should work on Windows as well)
with Python 3.6.4 and PyTorch 1.0.
To impute missing features with VAEAC, use impute.py.
impute.py works with both real-valued and categorical features.
It takes a tab-separated values (TSV) file as input.
NaNs in the input file indicate missing features.
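Concretely, such an input file can be produced with pandas as follows (a minimal sketch; the table here is made up, with three objects, two features, and one missing value per object):

```python
import io

import numpy as np
import pandas as pd

# Hypothetical 3-object, 2-feature input table; np.nan marks the
# features that impute.py should fill in.
df = pd.DataFrame([[0.5, np.nan],
                   [np.nan, 1.2],
                   [2.0, np.nan]])

buf = io.StringIO()
# Write tab-separated values; missing entries are serialized as "NaN",
# which is how the input file indicates missing features.
df.to_csv(buf, sep='\t', index=False, header=False, na_rep='NaN')
print(buf.getvalue())
```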
The output file is also a TSV file, in which each input object appears
num_imputations times, with its NaNs replaced by different imputations.
These copies of each object are consecutive in the output file.
For example, if num_imputations is 2,
the output file is structured as follows:
object1_imputation1
object1_imputation2
object2_imputation1
object2_imputation2
object3_imputation1
...
By default, num_imputations is 5.
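Given this layout, the output rows can be regrouped into per-object imputation sets; a minimal sketch (this helper is not part of the repository):

```python
import numpy as np

def group_imputations(rows, num_imputations=5):
    """Group consecutive rows of the output file into per-object
    imputation sets (illustrative helper, not part of the repo).

    Returns an array of shape (num_objects, num_imputations, num_features).
    """
    rows = np.asarray(rows, dtype=float)
    assert len(rows) % num_imputations == 0, \
        "row count must be a multiple of num_imputations"
    return rows.reshape(-1, num_imputations, rows.shape[1])

# Two objects with num_imputations=2 and three features each, as in
# the layout above: all imputations of one object are consecutive.
out = group_imputations([[0, 1, 2],
                         [0, 1, 3],
                         [5, 6, 7],
                         [5, 6, 8]], num_imputations=2)
print(out.shape)  # (2, 2, 3)
```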
One-hot max size is the number of different values of a categorical feature. The values are assumed to be integers from 0 to K - 1, where K is the one-hot max size. For a real-valued feature, the one-hot max size is assumed to be 0 or 1.
For example, for a dataset with a binary feature, three real-valued features,
and a categorical feature with 10 classes, the correct --one_hot_max_sizes
arguments are 2 1 1 1 10.
Validation ratio is the fraction of objects that will be used for validation and best-model selection.
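For instance, a validation ratio of 0.1 holds out 10% of the objects; a minimal sketch of such a split (the repository's actual split logic may differ):

```python
import random

def train_val_split(objects, validation_ratio=0.1, seed=0):
    """Hold out a validation_ratio fraction of objects for validation
    and best-model selection. Illustrative sketch only."""
    rng = random.Random(seed)
    indices = list(range(len(objects)))
    rng.shuffle(indices)  # shuffle so the held-out set is random
    n_val = int(round(validation_ratio * len(objects)))
    val = [objects[i] for i in indices[:n_val]]
    train = [objects[i] for i in indices[n_val:]]
    return train, val

train, val = train_val_split(list(range(100)), validation_ratio=0.1)
print(len(train), len(val))  # 90 10
```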
python auto_imputation_script.py --field {}  (makes the whole specified field missing data; valid field values: bg, 0, 1, ..., 17)
Example:
python auto_imputation_script.py --field bg 0 1 2
The results can be found in imputations_vis/
python auto_evaluate_script.py
python vis_train.py --input_file data/train_test_split/forModel_new_groundtruth.tsv --epochs 150 --validation_ratio 0.1 --one_hot_max_sizes 1 1 1 11 1 13 0 96 7 1 1 1 1 1 2 77 7 1 1 1 1 1 4 39 7 1 1 1 1 1 6 25 7 1 1 1 1 1 8 41 7 1 1 1 1 1 10 40 7 1 1 1 1 1 12 33 7 1 1 1 1 1 14 21 7 1 1 1 1 1 16 31 7 1 1 1 1 1 18 29 7 1 1 1 1 1 20 29 7 1 1 1 1 1 22 33 7 1 1 1 1 1 24 43 7 1 1 1 1 1 26 43 7 1 1 1 1 1 28 53 7 1 1 1 1 1 30 37 7 1 1 1 1 1 32 43 7 1 1 1 1 1 34 47 7 1 1 1 1 1
python vis_prepare_data.py --input_name forModel_new --seed 100 --prob 0.5
python vis_impute.py --input_file data/train_test_split/forModel_new_train.tsv --output_file data/imputations/for_evaluate.tsv --one_hot_max_sizes 1 1 1 11 1 13 0 96 7 1 1 1 1 1 2 77 7 1 1 1 1 1 4 39 7 1 1 1 1 1 6 25 7 1 1 1 1 1 8 41 7 1 1 1 1 1 10 40 7 1 1 1 1 1 12 33 7 1 1 1 1 1 14 21 7 1 1 1 1 1 16 31 7 1 1 1 1 1 18 29 7 1 1 1 1 1 20 29 7 1 1 1 1 1 22 33 7 1 1 1 1 1 24 43 7 1 1 1 1 1 26 43 7 1 1 1 1 1 28 53 7 1 1 1 1 1 30 37 7 1 1 1 1 1 32 43 7 1 1 1 1 1 34 47 7 1 1 1 1 1
python vis_evaluate_results.py --groundtruth data/train_test_split/forModel_new_groundtruth.tsv --input_file data/train_test_split/forModel_new_train.tsv --imputed_file data/imputations/for_evaluate.tsv --one_hot_max_sizes 1 1 1 11 1 13 0 96 7 1 1 1 1 1 2 77 7 1 1 1 1 1 4 39 7 1 1 1 1 1 6 25 7 1 1 1 1 1 8 41 7 1 1 1 1 1 10 40 7 1 1 1 1 1 12 33 7 1 1 1 1 1 14 21 7 1 1 1 1 1 16 31 7 1 1 1 1 1 18 29 7 1 1 1 1 1 20 29 7 1 1 1 1 1 22 33 7 1 1 1 1 1 24 43 7 1 1 1 1 1 26 43 7 1 1 1 1 1 28 53 7 1 1 1 1 1 30 37 7 1 1 1 1 1 32 43 7 1 1 1 1 1 34 47 7 1 1 1 1 1
If you find this code useful in your research, please consider citing the paper:
@inproceedings{ivanov2018variational,
  title={Variational Autoencoder with Arbitrary Conditioning},
  author={Oleg Ivanov and Michael Figurnov and Dmitry Vetrov},
  booktitle={International Conference on Learning Representations},
  year={2019},
  url={https://openreview.net/forum?id=SyxtJh0qYm},
}