This is a fully self-contained Docker + Poetry project for setting up Databricks dev environments.
All the software required to build this app will be installed by running the steps listed below.
- Install WSL if not installed (wsl --install)
- List the available Linux distros (wsl --list --online)
- Install your preferred Linux distro (wsl --install -d Ubuntu-20.04)
- Install Chocolatey (Set-ExecutionPolicy Bypass -Scope Process -Force; [System.Net.ServicePointManager]::SecurityProtocol = [System.Net.ServicePointManager]::SecurityProtocol -bor 3072; iex ((New-Object System.Net.WebClient).DownloadString('https://chocolatey.org/install.ps1')))
- Install Docker using choco (choco install -y docker-desktop)
- Install Git, if not installed (choco install -y git)
Inside WSL (Ubuntu), set up Docker and Git:
sudo apt update
sudo apt upgrade
sudo apt install docker.io
sudo usermod -a -G docker $USER
sudo groupadd docker && sudo gpasswd -a ${USER} docker && sudo systemctl restart docker
newgrp docker
sudo systemctl start docker
sudo systemctl enable docker
sudo apt-get install git
- Clone the repo (git clone repo)
- Go into the project directory (cd DATABRICKS_CICD)
- Create your feature branch, using a relevant branch name (git checkout -b feature/fooBar)
- Create a databricks.env file (see sample_databricks_env.txt for the format) and add your Databricks token to it (DATABRICKS_TOKEN=Enter Your Token Here); a short sketch of reading this token from Python follows this list
- Build your local environment (cd ./build; docker-compose up --build -d)
- To load your Python scripts into the environment, add them to the src folder
- To run code interactively from a terminal, run (docker exec -it $(docker inspect --format="{{.Id}}" databricks_cicd) bash)
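As a quick illustration (a minimal sketch; only DATABRICKS_TOKEN comes from the step above, and the wiring of databricks.env into the container environment by docker-compose is assumed), a script in src can verify the token is available before doing any Databricks work:

import os

# DATABRICKS_TOKEN is expected to be loaded into the container environment
# from databricks.env (see sample_databricks_env.txt for the full format).
token = os.environ.get('DATABRICKS_TOKEN')
if not token:
    raise RuntimeError('DATABRICKS_TOKEN is not set; check your databricks.env file')

# Confirm presence without printing the secret itself.
print(f'Databricks token loaded ({len(token)} characters)')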
Databricks uses a series of comment markers (command tags) to differentiate a plain source file from a Databricks notebook. These are the markers that need to be added to your file for it to be converted to a notebook. More details on importing notebooks
Python:
- Create a notebook (add to the top of the file or cell)
# Databricks notebook source
- Create a code cell
# COMMAND ----------

R:
- Create a notebook (add to the top of the file or cell)
# Databricks notebook source
- Create a code cell
# COMMAND ----------

SQL:
- Create a notebook (add to the top of the file or cell)
-- Databricks notebook source
- Create a code cell
-- COMMAND ----------

Scala:
- Create a notebook
// Databricks notebook source
- Create a code cell
// COMMAND ----------
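For example, a minimal Python file laid out as a two-cell notebook (the contents are just an illustration) would look like this:

# Databricks notebook source
# First cell: build a small list of values.
greetings = ['hello', 'bonjour', 'hola']

# COMMAND ----------

# Second cell: print each value on its own line.
for greeting in greetings:
    print(greeting)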
Typically your main class or Python file will depend on other JARs and files. You can add such dependency JARs and files by calling sparkContext.addJar("path-to-the-jar") or sparkContext.addPyFile("path-to-the-file").
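From Python this might be sketched as follows (the paths are placeholders, not files shipped with this project; addJar is a JVM-side call, so only the Python-facing helpers are shown):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('dependency-demo').getOrCreate()
sc = spark.sparkContext

# Ship a local helper module and a data file so tasks running on the cluster can use them.
# Both paths are illustrative placeholders.
sc.addPyFile('/workspace/src/helpers.py')
sc.addFile('/workspace/data/lookup.csv')
# After addPyFile, the module can be imported inside UDFs and tasks, e.g. `import helpers`.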
You can use the dbutils.fs and dbutils.secrets utilities of the Databricks Utilities module. Supported commands are: dbutils.fs.cp, dbutils.fs.head, dbutils.fs.ls, dbutils.fs.mkdirs, dbutils.fs.mv, dbutils.fs.put, dbutils.fs.rm, dbutils.secrets.get, dbutils.secrets.getBytes, dbutils.secrets.list, dbutils.secrets.listScopes.
from pyspark.dbutils import DBUtils

# Instantiate Databricks Utilities against the existing SparkSession (`spark`).
dbutils = DBUtils(spark)

# Copy a local file up to DBFS, then copy a DBFS result back to the local filesystem.
dbutils.fs.cp('file:/home/user/data.csv', 'dbfs:/uploads')
dbutils.fs.cp('dbfs:/output/results.csv', 'file:/home/user/downloads/')
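The secrets commands work the same way; in this sketch the scope and key names are hypothetical and would need to exist in your workspace:

# The scope and key below are hypothetical examples.
jdbc_password = dbutils.secrets.get(scope='demo-scope', key='jdbc-password')

# List what is available rather than printing secret values.
print(dbutils.secrets.listScopes())
print(dbutils.secrets.list('demo-scope'))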
Please note that Databricks Connect does not work alongside existing Spark environments; to avoid any environment contamination, this project uses a Poetry venv to create an isolated work environment.
You must add this option to the cluster's advanced Spark config: spark.databricks.service.server.enabled true
Databricks Connect allows you to connect your favorite IDE (Eclipse, IntelliJ, PyCharm, RStudio, Visual Studio Code), notebook server (Jupyter Notebook, Zeppelin), and other custom applications to Databricks clusters. In the case of this project, it is meant to enable users to follow CI/CD processes for deploying code from their local dev machine, using Dev Containers (Docker), to elevated environments. More information about this tool can be found at the following Microsoft documentation link
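A quick way to sanity-check the connection from inside the container is a trivial query against the cluster (a minimal sketch, assuming Databricks Connect has already been configured for your workspace and cluster):

from pyspark.sql import SparkSession

# With Databricks Connect configured, this SparkSession is backed by the remote cluster.
spark = SparkSession.builder.getOrCreate()

# A tiny computation that round-trips through the cluster; expected output: 100
print(spark.range(100).count())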
A medallion architecture is a data design pattern used to logically organize data in a lakehouse, with the goal of incrementally and progressively improving the structure and quality of data as it flows through each layer of the architecture (from Bronze ⇒ Silver ⇒ Gold layer tables). Medallion architectures are sometimes also referred to as "multi-hop" architectures. Databricks documentation
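As an illustrative sketch only (the table names and transformations here are assumptions, not part of this project), the Bronze ⇒ Silver ⇒ Gold flow might look like:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Bronze: raw ingested records stored as-is (strings, possible nulls).
raw = spark.createDataFrame(
    [('BLI', '2021-04-03', '52'), ('BLI', None, '50')],
    ['AirportCode', 'Date', 'TempHighF'])
raw.write.mode('overwrite').saveAsTable('bronze_temps')

# Silver: cleaned and typed (drop bad rows, cast columns).
silver = (spark.table('bronze_temps')
    .dropna(subset=['Date'])
    .selectExpr('AirportCode', 'to_date(Date) AS Date', 'CAST(TempHighF AS INT) AS TempHighF'))
silver.write.mode('overwrite').saveAsTable('silver_temps')

# Gold: business-level aggregate ready for reporting.
gold = silver.groupBy('AirportCode').avg('TempHighF')
gold.write.mode('overwrite').saveAsTable('gold_avg_temps')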
Example script built with this project to test execution:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
from datetime import date
spark = SparkSession.builder.appName('temps-demo').getOrCreate()
# Create a Spark DataFrame consisting of high and low temperatures
# by airport code and date.
schema = StructType([
    StructField('AirportCode', StringType(), False),
    StructField('Date', DateType(), False),
    StructField('TempHighF', IntegerType(), False),
    StructField('TempLowF', IntegerType(), False)
])
data = [
    ['BLI', date(2021, 4, 3), 52, 43],
    ['BLI', date(2021, 4, 2), 50, 38],
    ['BLI', date(2021, 4, 1), 52, 41],
    ['PDX', date(2021, 4, 3), 64, 45],
    ['PDX', date(2021, 4, 2), 61, 41],
    ['PDX', date(2021, 4, 1), 66, 39],
    ['SEA', date(2021, 4, 3), 57, 43],
    ['SEA', date(2021, 4, 2), 54, 39],
    ['SEA', date(2021, 4, 1), 56, 41]
]
temps = spark.createDataFrame(data, schema)
# Create a table on the Databricks cluster and then fill
# the table with the DataFrame's contents.
# If the table already exists from a previous run,
# delete it first.
spark.sql('USE default')
spark.sql('DROP TABLE IF EXISTS demo_temps_table')
temps.write.saveAsTable('demo_temps_table')
# Query the table on the Databricks cluster, returning rows
# where the airport code is not BLI and the date is later
# than 2021-04-01. Group the results and order by high
# temperature in descending order.
df_temps = spark.sql("SELECT * FROM demo_temps_table " \
"WHERE AirportCode != 'BLI' AND Date > '2021-04-01' " \
"GROUP BY AirportCode, Date, TempHighF, TempLowF " \
"ORDER BY TempHighF DESC")
df_temps.show()
# Results:
#
# +-----------+----------+---------+--------+
# |AirportCode| Date|TempHighF|TempLowF|
# +-----------+----------+---------+--------+
# | PDX|2021-04-03| 64| 45|
# | PDX|2021-04-02| 61| 41|
# | SEA|2021-04-03| 57| 43|
# | SEA|2021-04-02| 54| 39|
# +-----------+----------+---------+--------+
# Clean up by deleting the table from the Databricks cluster.
spark.sql('DROP TABLE demo_temps_table')
- Fork it
- Create your feature branch (git checkout -b feature/fooBar)
- Commit your changes (git commit -am 'Add some fooBar')
- Push to the branch (git push origin feature/fooBar)
- Create a new Pull Request
More Details on Forking a repo
Clean up the Docker environment:
docker kill $(docker container ls -q)
docker system prune -a
