Data engineering: Apache Kafka, Apache Spark, MongoDB, ELK Stack

nboro/data_engineering_case

A stream and batch processing implementation, created as an assignment for the Data Engineering course of the MSc Data Science and Entrepreneurship program.
Tools used: Apache Spark, Apache Kafka, MongoDB, and the ELK Stack.

Contents

FILES:

  • backend.py: simulates the backend that acts as the data source (see the producer sketch after this list)
  • config.yaml: holds all configuration read by the batch and real-time services
  • consumer.py: the anomaly-detection implementation
  • email_sender.py: SMTP function for sending the daily report
  • report.py: batch job that generates the daily report
  • util.py: contains the read_config utility function
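
As a rough illustration of the streaming side, here is a minimal sketch of what a simulator like backend.py could look like, assuming JSON events are published to Kafka with the kafka-python client; the broker address, topic name, and event fields below are assumptions, not taken from the repo:

```python
import json
import random
import time

from kafka import KafkaProducer

# Hypothetical sketch of a backend simulator: replay records as a Kafka stream.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Illustrative event shape; the real schema comes from the files in data/.
for i in range(100):
    event = {
        "event_id": i,
        "query": random.choice(["spark", "kafka", "mongodb"]),
        "timestamp": time.time(),
    }
    producer.send("search_events", event)  # topic name is an assumption
    time.sleep(0.1)  # throttle to mimic a live event stream

producer.flush()
```

In this setup, consumer.py would subscribe to the same stream to run its anomaly detection.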

FOLDERS:

  • data: contains two files used as input for backend.py
  • logstash: all bash and config files needed to run Logstash
  • mongo: data and a bash script to populate the designated Mongo collection for the daily report (see the sketch after this list)
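
The repository does this population step with a bash script; purely for illustration, an equivalent sketch in Python with pymongo might look as follows (the database, collection, and file names are placeholders, not repo values):

```python
import json

from pymongo import MongoClient

# Hypothetical sketch: seed the collection that the daily report reads from.
client = MongoClient("mongodb://localhost:27017")
collection = client["search_db"]["daily_report"]  # placeholder names

with open("mongo/data.json") as f:  # placeholder file name
    records = json.load(f)

collection.insert_many(records)
print(f"Inserted {len(records)} documents")
```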

Required library:

  1. spark-mongo connector (.jar)
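
As a sketch of how the connector could be wired into the batch job (report.py), assuming the 2.x/3.x MongoDB Spark connector API; the jar path, connection URI, and database names are placeholders:

```python
from pyspark.sql import SparkSession

# Hypothetical sketch: a SparkSession configured with the spark-mongo connector jar.
spark = (
    SparkSession.builder
    .appName("daily_report")
    .config("spark.jars", "/path/to/mongo-spark-connector.jar")  # placeholder path
    .config("spark.mongodb.input.uri",
            "mongodb://localhost:27017/search_db.daily_report")  # placeholder URI
    .getOrCreate()
)

# Read the collection referenced by the input URI into a DataFrame.
df = spark.read.format("mongo").load()
df.show()
```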

Required services (start in this order):

  1. Run mongod
  2. Run zookeeper
  3. Run kafka
  4. Run elasticsearch
  5. Run kibana
  6. Run logstash for search_events
  7. Run logstash for search_anomalies
