Ensure that Docker is installed and running on your system. If it's not installed, download Docker Desktop from https://www.docker.com/products/docker-desktop/.
These instructions have been verified with:
- macOS: ProductVersion: 11.2.3, BuildVersion: 20D91
- Docker: 20.10.22, build 3a2c30b
Additionally, you'll need:
- Python3
- Ports 4040 and 8080 available for DeSQL UI and Vanilla Spark UI, respectively.
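You can quickly verify these prerequisites before proceeding (a quick sanity check; `lsof` ships with macOS and most Linux distributions):

```bash
# Check tool versions against the tested configuration
docker --version     # tested with 20.10.22, build 3a2c30b
python3 --version

# Ports 4040 and 8080 must be free; no output means the port is available
lsof -i :4040
lsof -i :8080
```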
Please be aware that the results reproduced here are not indicative of the actual overhead comparison between DeSQL and Vanilla Spark reported in the research paper. The paper's findings were derived from a 12-node cluster (1 name node and 11 data nodes), where each node was equipped with at least an 8-core 3.10 GHz CPU, 48 GB of memory, and 4 TB of disk space. Collectively, the cluster utilized 104 cores, 53 TB of storage, and 576 GB of memory.
In contrast, the reproducibility exercises in this documentation run in local mode on a small dataset. They are intended to demonstrate DeSQL's functionality, not to benchmark performance. The fluctuations observed in the reproducibility graphs reflect the initialization and execution times inherent to Spark jobs, rather than processing time that scales with data size. Because the demonstration uses a sample dataset with significantly fewer rows, most of the measured time corresponds to Spark startup and operation rather than to the processing-time difference with and without DeSQL.
For a more accurate comparison that reflects the operational overhead of DeSQL versus Vanilla Spark, one would need to replicate the computational environment and dataset size as described in the original research.
Clone the artifact repository:

```bash
git clone https://github.com/sabaat/desql-artifacts.git
```

Remove any system-generated files to prevent interference with Docker builds:
```bash
find . -name '.DS_Store' -type f -delete
```

Navigate to the spark-sql-debug directory and build the Docker image:
```bash
cd spark-sql-debug
docker build -t my-custom-spark:latest -f dist/kubernetes/dockerfiles/spark/Dockerfile .
```

NOTE: If you encounter permission issues, use `sudo` with Docker commands.
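On Linux, as an alternative to prefixing every command with `sudo`, you can add your user to the `docker` group (Docker's standard post-install step; log out and back in for it to take effect):

```bash
# Allow the current user to run Docker without sudo (Linux only)
sudo usermod -aG docker "$USER"
```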
Use Docker Compose to start the containers:
```bash
docker compose up -d
```
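Before proceeding, you can confirm the container came up (the container name matches the one used in the commands below):

```bash
# You should see spark-local-container in the output
docker ps --filter "name=spark-local-container"
```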
TROUBLESHOOTING: If Docker Compose fails, restart Docker:

```bash
sudo systemctl restart docker
```

Submit a Spark SQL job to the DeSQL container. Replace `query5.sql` with another query as needed, e.g. `query6.sql` or `query9.sql`; the `/opt/spark/queries/` directory contains all ten DeSQL queries, from query 1 to query 10.
```bash
docker exec -it spark-local-container /opt/spark/bin/spark-submit \
  --class DeSqlPackage.SQLTest.SQLTest \
  --master "local[*]" \
  --conf "spark.some.config.option=some-value" \
  /opt/spark/app/desqlpackage_2.12-0.1.0-SNAPSHOT.jar /opt/spark/queries/query5.sql
```

Expected Observation: DeSQL will start, and you can observe its logs in the console. Once DeSQL starts, access the DeSQL UI at http://localhost:4040/debugger/. The UI displays the sub-queries of the original query, along with their data as processed within Spark computations. It also presents the query execution plan; nodes with available sub-queries appear in green and can be clicked to view that node's sub-query and data.
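For example, to run query9 instead, only the trailing argument changes:

```bash
# Same submit command, pointing at a different query file
docker exec -it spark-local-container /opt/spark/bin/spark-submit \
  --class DeSqlPackage.SQLTest.SQLTest \
  --master "local[*]" \
  --conf "spark.some.config.option=some-value" \
  /opt/spark/app/desqlpackage_2.12-0.1.0-SNAPSHOT.jar /opt/spark/queries/query9.sql
```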
Execute the script inside the DeSQL container to gather results:
```bash
docker exec -it spark-local-container /bin/bash -c "/opt/spark/desql_results.sh"
```

Copy the results from the Docker container to your local machine:
```bash
docker cp spark-local-container:/opt/spark/examples/graphsFolder/data.txt ./data1.txt
```
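Optionally, take a quick look at the collected results (the exact format is whatever desql_results.sh emits):

```bash
# Peek at the first few lines of the DeSQL results
head ./data1.txt
```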
Shut down the Docker containers:

```bash
docker compose down
```

Change to the Vanilla Spark directory and build the Docker image:
```bash
cd ..
cd spark-3.0.0-bin-hadoop2.7
docker build -t my-vanilla-spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
```

Start the Vanilla Spark containers:
```bash
docker compose up -d
```
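As before, you can confirm the container is running before executing the results script:

```bash
# You should see vanilla-spark-local-container in the output
docker ps --filter "name=vanilla-spark-local-container"
```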
Execute the script to gather results from Vanilla Spark:

```bash
docker exec -it vanilla-spark-local-container /bin/bash -c "/opt/spark/spark_results.sh"
```

Copy the Vanilla Spark results to your local machine:
```bash
docker cp vanilla-spark-local-container:/opt/spark/examples/graphsFolder/data.txt ./data2.txt
```
Navigate to the parent directory:

```bash
cd ..
```
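Before running the analysis, you can sanity-check that both result files exist and are non-empty:

```bash
# Both files should exist and contain at least one line each
wc -l ./spark-sql-debug/data1.txt ./spark-3.0.0-bin-hadoop2.7/data2.txt
```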
Install the Python packages needed for analysis:

```bash
python3 -m pip install matplotlib pandas
```

Execute the analysis script:
```bash
python3 script.py ./spark-sql-debug/data1.txt ./spark-3.0.0-bin-hadoop2.7/data2.txt
```