These are the scripts from the Apache Spark on Amazon EMR workshop led by JT Halbert, Chief Data Scientist at Tetra Concepts, and Jason Morris at Amazon.
The dataset used for the workshop is the Enron email dataset from https://www.cs.cmu.edu/~./enron/
Topics covered include:
- Installing Spark locally
- Deploying a Spark instance with Amazon's Elastic MapReduce
- Basic theory of Resilient Distributed Datasets (RDDs)
- Data exploration in the Spark shell
- Using Spark's core APIs in Scala
- Using Spark's PairRDD functions
- Deploying a job on a Spark cluster
- How to access logs and diagnose a running job
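The core-API and PairRDD topics above boil down to chaining transformations on an RDD and then keying it into `(K, V)` pairs. Here is a minimal standalone sketch in that spirit; the application name, input path, and word-count task are illustrative assumptions, not the workshop's actual exercises:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch of Spark's core and PairRDD APIs.
// The input path is hypothetical -- point it at any text data you have.
object WordCountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("enron-sketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // Core API: load text, then transform and filter it.
    val lines = sc.textFile("data/enron/*.txt")
    val words = lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

    // PairRDD functions: reduceByKey is available once the RDD
    // holds (key, value) pairs.
    val counts = words.map(w => (w.toLowerCase, 1)).reduceByKey(_ + _)

    counts.sortBy(-_._2).take(10).foreach(println)
    sc.stop()
  }
}
```

The same program, packaged as a jar, is the kind of artifact you would hand to `spark-submit` when deploying a job on the cluster.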
The rough steps to run through the workshop are:
- Install `awscli` and `ssh` if you don't already have them
- Run `demo.sh`, which sets up the 3-node EC2 cluster
- Enter your AWS keys and connect to the machine
- Run `setup.sh` to get all the data from the S3 buckets; that script also sets up Spark and your bash startup environment
- Run `start-spark.sh`, which starts Spark on YARN
- You're now in the Scala REPL; play around with `followalong.scala`
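Once `start-spark.sh` drops you into the REPL, a first exploration in the spirit of `followalong.scala` might look like the following. The HDFS path and the `From:`-header trick are assumptions for illustration, not the workshop's actual script:

```scala
// In the spark-shell started by start-spark.sh, `sc` (the
// SparkContext) is already defined. The input path below is a guess
// at where setup.sh lands the Enron data -- adjust for your cluster.
val emails = sc.textFile("hdfs:///data/enron/*")

// Raw Enron messages carry "From:" headers; pull out the senders.
val senders = emails
  .filter(_.startsWith("From:"))
  .map(_.stripPrefix("From:").trim)

// PairRDD functions kick in once we key by sender.
val topSenders = senders
  .map(s => (s, 1))
  .reduceByKey(_ + _)
  .sortBy(-_._2)

topSenders.take(10).foreach(println)
```

Nothing here is materialized until the `take(10)` action runs, which is a good moment to open the YARN UI and practice reading the logs of a running job.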
JT's GitHub: https://github.com/notjasonmorris