Ensure that Docker is installed and running on your system. If it's not installed, download Docker Desktop from https://www.docker.com/products/docker-desktop/.
These instructions have been verified with:
- macOS: ProductVersion: 11.2.3, BuildVersion: 20D91
- Docker: 20.10.22, build 3a2c30b
Additionally, you'll need:
- Python3
- Ports 4040 and 8080 available for DeSQL UI and Vanilla Spark UI, respectively.
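You can quickly verify these prerequisites before proceeding (a quick sanity check; `lsof` ships with macOS and most Linux distributions):

```bash
# Check tool versions against the tested configuration
docker --version     # tested with 20.10.22, build 3a2c30b
python3 --version

# Ports 4040 and 8080 must be free; no output means the port is available
lsof -i :4040
lsof -i :8080
```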
Please be aware that the results reproduced here are not indicative of the actual overhead comparison between DeSQL and Vanilla Spark reported in the research paper. The paper's findings were derived from a 12-node cluster (1 name node and 11 data nodes), where each node was equipped with at least an 8-core 3.10 GHz CPU, 48 GB of memory, and 4 TB of disk space. Collectively, the cluster utilized 104 cores, 53 TB of storage, and 576 GB of memory.
In contrast, the reproducibility exercises in this documentation run in local mode on a small dataset. They are intended to demonstrate DeSQL's functionality, not to benchmark performance. The fluctuations observed in the reproducibility graphs reflect the initialization and execution times inherent to Spark jobs, rather than processing time that scales with data size. Because the demonstration uses a sample dataset with significantly fewer rows, most of the measured time corresponds to Spark startup and operation rather than to the processing-time difference with and without DeSQL.
For a more accurate comparison that reflects the operational overhead of DeSQL versus Vanilla Spark, one would need to replicate the computational environment and dataset size as described in the original research.
Clone the artifact repository:

```bash
git clone https://github.com/sabaat/desql-artifacts.git
```

Remove any system-generated files to prevent interference with Docker builds:
```bash
find . -name '.DS_Store' -type f -delete
```

Navigate to the spark-sql-debug directory and build the Docker image:
```bash
cd spark-sql-debug
docker build -t my-custom-spark:latest -f dist/kubernetes/dockerfiles/spark/Dockerfile .
```

NOTE: If you encounter permission issues, use `sudo` with Docker commands.
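On Linux, as an alternative to prefixing every command with `sudo`, you can add your user to the `docker` group (Docker's standard post-install step; log out and back in for it to take effect):

```bash
# Allow the current user to run Docker without sudo (Linux only)
sudo usermod -aG docker "$USER"
```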
Use Docker Compose to start the containers:
```bash
docker compose up -d
```
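Before proceeding, you can confirm the container came up (the container name matches the one used in the commands below):

```bash
# You should see spark-local-container in the output
docker ps --filter "name=spark-local-container"
```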
TROUBLESHOOTING: If Docker Compose fails, restart Docker:

```bash
sudo systemctl restart docker
```

Submit a Spark SQL job to the DeSQL container. Replace `query5.sql` with another query as needed, e.g. `query6.sql` or `query9.sql`; the `/opt/spark/queries/` directory contains all ten DeSQL queries, from query 1 to query 10.
```bash
docker exec -it spark-local-container /opt/spark/bin/spark-submit \
  --class DeSqlPackage.SQLTest.SQLTest \
  --master "local[*]" \
  --conf "spark.some.config.option=some-value" \
  /opt/spark/app/desqlpackage_2.12-0.1.0-SNAPSHOT.jar /opt/spark/queries/query5.sql
```

Expected Observation: DeSQL will start, and you can observe its logs in the console. Once DeSQL starts, access the DeSQL UI at http://localhost:4040/debugger/. The UI displays the sub-queries of the original query, along with their data as processed within Spark computations. It also presents the query execution plan; nodes with available sub-queries appear in green and can be clicked to view that node's sub-query and data.
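For example, to run query9 instead, only the trailing argument changes:

```bash
# Same submit command, pointing at a different query file
docker exec -it spark-local-container /opt/spark/bin/spark-submit \
  --class DeSqlPackage.SQLTest.SQLTest \
  --master "local[*]" \
  --conf "spark.some.config.option=some-value" \
  /opt/spark/app/desqlpackage_2.12-0.1.0-SNAPSHOT.jar /opt/spark/queries/query9.sql
```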
Execute the script inside the DeSQL container to gather results:
```bash
docker exec -it spark-local-container /bin/bash -c "/opt/spark/desql_results.sh"
```

Copy the results from the Docker container to your local machine:
```bash
docker cp spark-local-container:/opt/spark/examples/graphsFolder/data.txt ./data1.txt
```
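Optionally, take a quick look at the collected results (the exact format is whatever desql_results.sh emits):

```bash
# Peek at the first few lines of the DeSQL results
head ./data1.txt
```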
Shut down the Docker containers:

```bash
docker compose down
```

Change to the Vanilla Spark directory and build the Docker image:
```bash
cd ..
cd spark-3.0.0-bin-hadoop2.7
docker build -t my-vanilla-spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
```

Start the Vanilla Spark containers:
```bash
docker compose up -d
```
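As before, you can confirm the container is running before executing the results script:

```bash
# You should see vanilla-spark-local-container in the output
docker ps --filter "name=vanilla-spark-local-container"
```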
Execute the script to gather results from Vanilla Spark:

```bash
docker exec -it vanilla-spark-local-container /bin/bash -c "/opt/spark/spark_results.sh"
```

Copy the Vanilla Spark results to your local machine:
```bash
docker cp vanilla-spark-local-container:/opt/spark/examples/graphsFolder/data.txt ./data2.txt
```
Navigate to the parent directory:

```bash
cd ..
```
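Before running the analysis, you can sanity-check that both result files exist and are non-empty:

```bash
# Both files should exist and contain at least one line each
wc -l ./spark-sql-debug/data1.txt ./spark-3.0.0-bin-hadoop2.7/data2.txt
```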
Install the Python packages needed for analysis:

```bash
python3 -m pip install matplotlib pandas
```

Execute the analysis script:
```bash
python3 script.py ./spark-sql-debug/data1.txt ./spark-3.0.0-bin-hadoop2.7/data2.txt
```