This repo is based on the Berkeley Big Data Benchmark and provides scripts that make it easy to:
- deploy an up-to-date HDP cluster on EC2
- copy data for the Intel Hadoop Benchmark and TPC-H from S3
- convert the data sets to Parquet, ORC and RCFile
- run and time Intel Hadoop Benchmark queries and a subset of TPC-H queries
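
As a rough illustration of the "run and time" step, here is a minimal Python sketch of a wall-clock timing harness. It is not the repo's actual script: `time_command` and `summarize` are hypothetical names, and the command being timed is a stand-in that you would replace with the CLI invocation of the engine under test.

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, repeats=3):
    """Run argv `repeats` times and return the wall-clock seconds of each run."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command fails,
        # so a failed query is never recorded as a fast one.
        subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return durations


def summarize(durations):
    """Return (min, median, max) of the measured durations."""
    return min(durations), statistics.median(durations), max(durations)


if __name__ == "__main__":
    # Stand-in command (assumption); replace with the engine's own CLI call.
    placeholder = [sys.executable, "-c", "pass"]
    fastest, median, slowest = summarize(time_command(placeholder))
    print(f"min={fastest:.3f}s median={median:.3f}s max={slowest:.3f}s")
```

Reporting the median alongside the minimum and maximum helps separate steady-state query time from one-off effects such as cold caches on the first run.
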

The framework aims to support:
- Hive-on-Tez
- Shark
- Presto
- Impala

Support for these engines is still in development.