This is a time series forecasting project based on the Wikipedia Web Traffic Time Series Forecasting dataset from Kaggle. Two RNN architectures are implemented:
- A "Vanilla" RNN regressor.
- A Seq2seq regressor.
Both are implemented in TensorFlow 2, with custom training functions optimized with Autograph.
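A custom training step compiled with AutoGraph typically follows the pattern below. This is a minimal sketch with a toy model, not the repository's actual code; the model, optimizer, and loss are stand-ins for illustration:

```python
import tensorflow as tf

# Toy recurrent regressor standing in for the real models (hypothetical).
model = tf.keras.Sequential([
    tf.keras.layers.SimpleRNN(8, input_shape=(10, 1)),
    tf.keras.layers.Dense(1),
])
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-3)
loss_fn = tf.keras.losses.MeanAbsoluteError()

@tf.function  # AutoGraph traces this Python function into a TF graph
def train_step(x, y):
    with tf.GradientTape() as tape:
        y_pred = model(x, training=True)
        loss = loss_fn(y, y_pred)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# One step on random data, just to show the call signature.
x = tf.random.normal((32, 10, 1))
y = tf.random.normal((32, 1))
loss = train_step(x, y)
```

Wrapping the step in `@tf.function` avoids Python overhead on each batch, which is the main reason to prefer a compiled custom loop over eager execution here.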
Main files:
- config.yaml: configuration file for hyperparameters.
- dataprep.py: data preprocessing pipeline.
- train.py: training pipeline.
- tools.py: processing functions used by the main pipelines.
- model.py: builds the models.
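Purely as an illustration, config.yaml could hold entries along these lines (every key name below is hypothetical; check the file itself for the actual schema):

```yaml
# Hypothetical hyperparameter layout -- see config.yaml for the real keys.
model: seq2seq          # or: vanilla
len_input: 365          # input window length (days)
len_pred: 60            # forecast horizon (days)
batch_size: 128
learning_rate: 0.001
epochs: 10
```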
I also added a visualize_performance.ipynb Jupyter Notebook to visually inspect models' performance on Test data.
Folders:
- /data_raw/: requires the unzipped train_2.csv file from Kaggle. An imputed.csv dataset is also available, containing imputed time series produced by my other repository on GAN-based imputation of missing data in time series.
- /data_processed/: divided into /Train/ and /Test/ directories.
- /saved_models/: contains all saved TensorFlow models, for both regressors.
- /utils/: pics and other secondary files.
After you clone the repository locally, download the raw dataset from Kaggle and place the unzipped train_2.csv file in the /data_raw/ folder.
Then, the time series forecast is executed in two steps. First, run the data preprocessing pipeline:
python -m dataprep
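Conceptually, preprocessing turns each page's traffic series into (input window, target window) pairs. The following is a minimal sketch of that idea with a hypothetical `window_series` helper; it is not the repository's actual dataprep code:

```python
import numpy as np

def window_series(series, len_input, len_pred):
    """Slice one series into (input, target) pairs via a sliding window (illustrative)."""
    X, Y = [], []
    last_start = len(series) - len_input - len_pred
    for t in range(last_start + 1):
        X.append(series[t : t + len_input])
        Y.append(series[t + len_input : t + len_input + len_pred])
    return np.array(X), np.array(Y)

# Example: a 30-step series, 7-step inputs, 3-step targets.
series = np.arange(30, dtype=float)
X, Y = window_series(series, len_input=7, len_pred=3)
# X has shape (21, 7), Y has shape (21, 3)
```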
This will generate Training+Validation and Test files, stored in the /data_processed/ subdirectories. Second, launch the training pipeline with:
python -m train
This will either create, train, and save a new model, or load and further train an existing one stored in the /saved_models/ folder.
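The load-or-create behaviour can be sketched as follows, assuming a tf.keras SavedModel layout; the path, builder, and toy model are hypothetical, not the actual train.py code:

```python
import os
import tensorflow as tf

def load_or_build(path):
    """Load a previously saved model if present, otherwise build a fresh one."""
    if os.path.isdir(path):
        # An existing SavedModel directory: resume training from it.
        return tf.keras.models.load_model(path)
    # No saved model yet: build a new (toy) regressor.
    model = tf.keras.Sequential([
        tf.keras.layers.SimpleRNN(8, input_shape=(10, 1)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer='adam', loss='mae')
    return model

model = load_or_build('saved_models/vanilla_rnn')  # hypothetical path
# ... train, then persist with model.save('saved_models/vanilla_rnn') ...
```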
Finally, Test set performance can be evaluated with the visualize_performance.ipynb notebook.
Requirements:
numpy==1.18.3
pandas==1.0.3
scikit-learn==0.22.2.post1
scipy==1.4.1
tensorflow==2.1.0
tqdm==4.45.0
I used a fairly powerful laptop, with 64GB of RAM and an NVIDIA RTX 2070 GPU. I highly recommend GPU training to avoid excessive computation times.
