These are the scripts from the Apache Spark on Amazon EMR workshop led by JT Halbert, Chief Data Scientist at Tetra Concepts, and Jason Morris at Amazon.
The dataset used for the workshop is the Enron email dataset from https://www.cs.cmu.edu/~./enron/
Topics covered include:
- Installing Spark locally
- Deploying a Spark instance with Amazon's Elastic MapReduce
- Basic theory of Resilient Distributed Datasets (RDDs)
- Data exploration in the Spark shell
- Using Spark's core APIs in Scala
- Using Spark's PairRDD functions
- Deploying a job on a Spark cluster
- How to access logs and diagnose a running job
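The core-API and PairRDD topics above boil down to chaining transformations on an RDD and then keying it into `(K, V)` pairs. Here is a minimal standalone sketch in that spirit; the application name, input path, and word-count task are illustrative assumptions, not the workshop's actual exercises:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of Spark's core and PairRDD APIs.
// The input path is hypothetical -- point it at any text data you have.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("enron-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Core API: load text, then transform and filter it.
    val lines = sc.textFile("data/enron/*.txt")
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // PairRDD functions: reduceByKey is available once the RDD
    // holds (key, value) pairs.
    val counts = words.map(w => (w.toLowerCase, 1)).reduceByKey(_ + _)

    counts.sortBy(-_._2).take(10).foreach(println)
    sc.stop()
  }
}
```

The same program, packaged as a jar, is the kind of artifact you would hand to `spark-submit` when deploying a job on the cluster.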
The rough steps to run through the workshop are:
- Install `awscli` and `ssh` if you don't already have them
- Run `demo.sh`, which sets up the 3-node EC2 cluster
- Enter your AWS keys and connect to the machine
- Run `setup.sh` to get all the data from the S3 buckets; that script also sets up Spark and your bash startup environment
- Run `start-spark.sh`, which starts Spark on YARN
- You're now in the Scala REPL; play around with `followalong.scala`
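Once `start-spark.sh` drops you into the REPL, a first exploration in the spirit of `followalong.scala` might look like the following. The HDFS path and the `From:`-header trick are assumptions for illustration, not the workshop's actual script:

```scala
// In the spark-shell started by start-spark.sh, `sc` (the
// SparkContext) is already defined. The input path below is a guess
// at where setup.sh lands the Enron data -- adjust for your cluster.
val emails = sc.textFile("hdfs:///data/enron/*")

// Raw Enron messages carry "From:" headers; pull out the senders.
val senders = emails
  .filter(_.startsWith("From:"))
  .map(_.stripPrefix("From:").trim)

// PairRDD functions kick in once we key by sender.
val topSenders = senders
  .map(s => (s, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)

topSenders.take(10).foreach(println)
```

Nothing here is materialized until the `take(10)` action runs, which is a good moment to open the YARN UI and practice reading the logs of a running job.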
JT's GitHub: https://github.com/notjasonmorris