stdatalabs/SparkTwitterStreamAnalysis

SparkTwitterPopularHashTags

A Spark Streaming project that analyzes popular hashtags in live Twitter data streams. Data is ingested from several input sources (the Twitter source, Flume, and Kafka) and processed downstream with Spark Streaming.

Requirements

  • IDE
  • Apache Maven 3.x
  • JVM 6 or 7

General Info

The source folder is organized into two packages, Kafka and Streaming. Each class in the Streaming package explores a different approach to consuming data from the Twitter source. The classes and configuration files are listed below:

  • com/stdatalabs/Kafka
    • KafkaTwitterProducer.java -- A Kafka producer that publishes Twitter data to a Kafka broker
  • com/stdatalabs/Streaming
    • SparkPopularHashTags.scala -- Receives data directly from the Twitter source
    • FlumeSparkPopularHashTags.scala -- Receives data from the Flume Twitter producer
    • KafkaSparkPopularHashTags.scala -- Receives data from the Kafka producer
    • RecoverableKafkaPopularHashTags.scala -- Spark-Kafka receiver-based approach; ensures at-least-once semantics
    • KafkaDirectPopularHashTags.scala -- Spark-Kafka direct approach; ensures exactly-once semantics
  • TwitterAvroSource.conf -- Flume configuration for running the Twitter Avro source
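
Whichever source delivers the tweets, the downstream step is the same: extract the hashtags and rank them by frequency. The block below is a minimal, dependency-free sketch of that step for a single batch of tweet texts. It is not code from this repository: the class `HashTagCounter` and its methods are hypothetical names, and the whitespace-based tokenization is an assumption. The Streaming classes would apply similar logic to Spark DStreams over a time window rather than to an in-memory list.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not part of the repository): ranks hashtags in one batch of tweets.
final class HashTagCounter {

    // Split a tweet on whitespace and keep the tokens that start with '#'.
    static List<String> extractHashTags(String tweet) {
        List<String> tags = new ArrayList<String>();
        for (String token : tweet.split("\\s+")) {
            if (token.startsWith("#") && token.length() > 1) {
                tags.add(token);
            }
        }
        return tags;
    }

    // Count every hashtag in the batch and return the n most frequent ones.
    static List<Map.Entry<String, Integer>> topHashTags(List<String> tweets, int n) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (String tweet : tweets) {
            for (String tag : extractHashTags(tweet)) {
                Integer seen = counts.get(tag);
                counts.put(tag, seen == null ? 1 : seen + 1);
            }
        }
        List<Map.Entry<String, Integer>> sorted =
                new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(sorted, new Comparator<Map.Entry<String, Integer>>() {
            @Override
            public int compare(Map.Entry<String, Integer> a, Map.Entry<String, Integer> b) {
                // Highest count first; ties are broken alphabetically by tag.
                int byCount = b.getValue().compareTo(a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            }
        });
        return sorted.subList(0, Math.min(n, sorted.size()));
    }
}
```

Calling `HashTagCounter.topHashTags(tweets, 10)` on a list of tweet texts returns up to ten (hashtag, count) entries, most frequent first. The sketch avoids lambdas so that it compiles on the Java 7 runtime listed under Requirements.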

Description

More articles on the Hadoop technology stack are available at stdatalabs.

About

A Spark Streaming App to analyze the popular hashtags based on keywords
