- Release version: 1.0
- Release date: 2014/7/14
- Contact: Rui Sun, Lan Yi, Grace Huang, Jiangang, Duan
- Homepage: https://github.com/intel-hadoop/Mallet
Contents:
- Overview
- Getting Started
- Project Directory Layout
- Limitations and Known Issues
Mallet is an open source decision support benchmark for HiveQL-compatible SQL engines, derived from the TPC-DS benchmark. It follows TPC-DS's modeling of several generally applicable aspects of a decision support system: the database schema, data population, queries, and data maintenance. Note that Mallet results are not comparable to any published TPC-DS benchmark results.
Mallet implements the Driver and follows the benchmark procedure defined in the TPC-DS 1.1.0 specification. It performs the Load Test, the Power Test, and Throughput Tests 1 and 2, and reports a subset of the metrics defined in the specification. When the benchmark completes, Mallet generates a text-based report that includes the primary performance metric (QphM@SF).
Mallet uses the TPC-DS toolkit for data preparation, i.e., generation of the table data and the maintenance data. To speed up data generation, Mallet distributes the work across the cluster by combining Hadoop streaming with the DSDGEN tool's support for generating data in chunks. The generated data is stored in HDFS. For HIVE and SHARK, the data is loaded in place as external tables.
Mallet's workload consists of 65 HiveQL queries converted from the corresponding TPC-DS SQL queries. The workload also implements, in HiveQL, all of the data maintenance functions defined in the TPC-DS 1.1.0 specification.
Mallet uses JDBC to submit the workload to the target database, so it works with HiveQL-compatible databases that support JDBC, such as HIVE and SHARK.
- Java 1.6
- The TPC-DS Toolkit
You need to download the TPC-DS software package from the TPC website. Follow the guide in the package to build the DSQGEN and DSDGEN tools from the included source code. Make sure the resulting binaries run on every node of your Hadoop cluster; differences in the operating environment, such as the GLIBC version, can prevent them from executing.
- Hadoop 1.x or 2.x
Make sure you set the `HADOOP_HOME` environment variable or put the `<HADOOP Home>/bin` directory in the PATH environment variable.
For Hadoop 1.x, only Hadoop 1.0.4 was tested.
For Hadoop 2.x, only CDH 5.0 beta YARN mode was tested.
If you want to run Mallet with HIVE as the target database, HIVE 0.13.1 or 0.12.0 is required. HIVE 0.13.1 is the default target.
- HIVE 0.13.1
You need to download the source package of HIVE 0.13.1 from http://hive.apache.org/downloads.html and apply the patch file (src/main/resources/hive/hive-0.13.1.patch), then build the HIVE binary.
- HIVE 0.12.0
You need to download the source package of HIVE 0.12.0 from http://hive.apache.org/downloads.html and apply the patch file (src/main/resources/hive/hive-0.12.0.patch), then build the HIVE binary.
Then modify pom.xml to change the version of the hive-jdbc dependency to 0.12.0, and build Mallet.
If you want to run Mallet with SHARK as the target database, SHARK 0.9.0 is required.
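Before building, it can help to confirm that the tools named above are discoverable from the shell. The following is a minimal sketch and not part of Mallet: the helper name `have_cmd` is hypothetical, and the check only looks for commands on the PATH. It does not verify versions or whether the HIVE patch was applied.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Mallet): succeeds if the
# named command can be found on the PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# java and hadoop are prerequisites; mvn is needed to build Mallet.
for tool in java hadoop mvn; do
    if have_cmd "$tool"; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# HADOOP_HOME is optional when <HADOOP Home>/bin is already on the PATH.
echo "HADOOP_HOME=${HADOOP_HOME:-<not set>}"
```

A "missing" line for hadoop is only a problem if `HADOOP_HOME` is also unset.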
- Check out the Mallet source code from the open source repository to a local directory.
- In your local Mallet directory, type `mvn clean package` to build Mallet.
Note: If you intend to run Mallet with SHARK, change the version of the `hive-jdbc` dependency in pom.xml from 0.12.0 to 0.11.0.
- After the build, the Mallet binary can be found at `target/mallet-1.0-bin/mallet-1.0/`.
- Copy the TPC-DS Tool Binaries
The DSQGEN tool appears to have problems with long paths. As a workaround, copy the three TPC-DS tool files you built earlier (`dsdgen`, `dsqgen`, `tpcds.idx`) to the `tools` sub-directory of `<Mallet binary directory>`.
- Configure Benchmark Parameters
You can modify benchmark parameters in `<Mallet binary directory>/conf/conf.properties`:

    hiveServerHost    <The host name of the target database's JDBC service>
    hiveServerPort    <The port number of the target database's JDBC service>
    numberOfStreams   <The number of query streams in the Throughput Tests>
    scaleFactor       <The scale factor. Valid options are 1, 100, 300, 1000, 3000, 10000, 30000, 100000>
    user              <The username used to connect to JDBC>
    malletDbDir       <The HDFS root directory for the Mallet data>

- Configure Data Preparation
You need to set some global environment variables in `<Mallet binary directory>/bin/config.sh`:

    PARALLEL                <The number of data chunks generated in parallel>
    HADOOP_EXECUTABLE       <The Hadoop executable location. Optional, set if it can't be automatically discovered>
    HADOOP_CONF_DIR         <The Hadoop configuration directory. Optional, set if it can't be automatically discovered>
    STREAMING               <The path to the Hadoop streaming jar. Optional, set if it can't be automatically discovered>
    COMPRESS_GLOBAL         <Switch compression of the generated data on or off: 0 disables, 1 enables. Optional>
    COMPRESS_CODEC_GLOBAL   <The default codec used for data compression. Optional>

Note:
- Mallet will guess the value of these variables if they are not explicitly set. In that case, Mallet guarantees neither that the guessed values are correct nor that the benchmark will run successfully.
- Do not change the default values of other global environment variables unless necessary.
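As an illustration of the benchmark parameters listed above, the sketch below writes a sample `conf.properties` into a scratch directory and reads one value back. The `key=value` layout is an assumption based on the `.properties` extension, every value is a placeholder for your own cluster, and `get_prop` is a hypothetical helper that is not part of Mallet. Compare with the file shipped in `<Mallet binary directory>/conf` before relying on it.

```shell
#!/bin/sh
# Scratch directory standing in for <Mallet binary directory>/conf,
# so the real configuration is never touched by this sketch.
conf_dir=$(mktemp -d)

# Assumption: plain key=value pairs. All values are placeholders.
cat > "$conf_dir/conf.properties" <<'EOF'
hiveServerHost=localhost
hiveServerPort=10000
numberOfStreams=4
scaleFactor=100
user=hadoop
malletDbDir=/user/hadoop/mallet
EOF

# Hypothetical helper: print the value stored for a key.
get_prop() {
    sed -n "s/^$1=//p" "$2"
}

get_prop scaleFactor "$conf_dir/conf.properties"
```

The scale factor must be one of the valid options listed above; the 100 used here is only an example.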
- `cd <Mallet binary directory>`
- Make sure Hadoop is running. Type `bin/prepare.sh` to generate the table and maintenance data.
You may try increasing the PARALLEL variable to reduce the duration of data generation; the recommended value is (Map task capacity of the cluster * 2) / (21 + numberOfStreams).
- Make sure the target database is running. Type `bin/run.sh` to start the benchmark.
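The recommended PARALLEL value from the data preparation step, (Map task capacity of the cluster * 2) / (21 + numberOfStreams), can be worked out with shell arithmetic. This is a minimal sketch: the function name `recommended_parallel` is hypothetical, the cluster figures are made up, and rounding down with a floor of 1 is an assumption rather than documented Mallet behavior.

```shell
#!/bin/sh
# Usage: recommended_parallel MAP_TASK_CAPACITY NUMBER_OF_STREAMS
# Hypothetical helper, not part of Mallet.
recommended_parallel() {
    # Integer division, so the result is rounded down.
    value=$(( ($1 * 2) / (21 + $2) ))
    # Assumption: never suggest fewer than one parallel chunk.
    if [ "$value" -lt 1 ]; then
        value=1
    fi
    echo "$value"
}

# Example: 100 map task slots and 4 query streams -> (100*2)/(21+4)
recommended_parallel 100 4    # prints 8
```

The result would then be assigned to PARALLEL in `bin/config.sh` before running `bin/prepare.sh`.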
bin/run.sh without any command line options performs a complete benchmark. In some cases, you can provide one of the following command line options to alter the behavior:
--quickrun Performs the benchmark with empty query and data maintenance operations. Used to facilitate development and verify installation and environment settings.
--powertest Performs only power test.
--query <query id> Performs only a single query.
- When the benchmark completes, a report with the primary performance metric (QphM@SF) is printed to stdout.
The following is an abbreviated sample report:
------------------- Mallet Benchmark Report --------------------
Number of Query Streams: 4
Number of queries in Query Stream: 65
Database Load Elapsed Time: 0h:0m:7s
Power Test Elapsed Time: 4h:24m:51s
Throughput Run 1 Elapsed Time: 6h:24m:24s
Query Run 1 Elapsed Time: 5h:14m:10s
Refresh Run 1 Elapsed Time: 1h:10m:13s
Throughput Run 2 Elapsed Time: 6h:17m:7s
Query Run 2 Elapsed Time: 5h:6m:58s
Refresh Run 2 Elapsed Time: 1h:10m:8s
----------------------------------------------------------------
Performance Metric = 25.700481928957096 QphM@1GB
----------------------------------------------------------------
---------- Query Run 1 Timing Intervals (in seconds) -----------
Query Minimum Median Maximum
2 225.712 235.046 247.064
...
98 157.288 203.679 217.967
---------- Query Run 2 Timing Intervals (in seconds) -----------
Query Minimum Median Maximum
2 201.364 215.648 227.006
...
98 158.62 210.739 220.181
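The elapsed times in the report use an `Hh:Mm:Ss` layout. As a reading aid, the sketch below converts a few of them to seconds so they can be compared. The helper name `to_seconds` is hypothetical and not part of Mallet; the input values are copied from the sample report above.

```shell
#!/bin/sh
# Hypothetical helper: convert an elapsed time such as "6h:24m:24s"
# into seconds. The body runs in a subshell so the IFS change and
# "set -f" do not leak into the caller.
to_seconds() (
    IFS=:
    set -f
    set -- $1    # split on ":" into hour, minute and second fields
    echo $(( ${1%h} * 3600 + ${2%m} * 60 + ${3%s} ))
)

throughput=$(to_seconds 6h:24m:24s)   # Throughput Run 1
query=$(to_seconds 5h:14m:10s)        # Query Run 1
refresh=$(to_seconds 1h:10m:13s)      # Refresh Run 1

echo "throughput run 1: $throughput s"
echo "query + refresh:  $(( query + refresh )) s"
```

For this sample the two lines print 23064 and 23063, so the Query Run and Refresh Run times add up to within one second of the Throughput Run time. That is an observation about this sample only, not a statement of how Mallet computes the figures.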
src/main
|- java -- The Mallet benchmark driver source code implemented in JAVA.
|- config -- The configurations for the benchmark, such as number of streams, Scale Factor, database JDBC server and port, …
|- scripts -- The Shell scripts for data preparation and benchmark running.
|- resources
   |- dm_function -- DM functions in HiveQL for Data Maintenance.
   |- hive -- HiveQL scripts to create TPC-DS tables and refresh tables.
   |- query_templates -- query templates in HiveQL.
- Mallet contains only 65 queries, a subset of the 99 queries defined in the TPC-DS 1.1.0 specification.
- Mallet relaxes the ACID requirements that the TPC-DS 1.1.0 specification places on the target database.
- Mallet does not report the price-related metrics defined in the TPC-DS 1.1.0 specification.