Distributed Data Processing Module - Dan Zaratsian, March 2020
- Introduction and Module Agenda
- Distributed Computing
- Walk-through of Tools and Services for Big Data
- Distributed Architectures and Use Cases
- Google Colab Notebook Environment
- Google BigQuery Sandbox
- Hadoop 101
- Intro to Apache Hive
- Apache Hive Syntax and Schema Design
- Intro to Apache HBase and Apache Phoenix (NoSQL)
- Apache HBase Schema Design & Best Practices
- Apache Phoenix Syntax
- Intro to Apache SparkSQL
- Apache SparkSQL
- BigQuery (Serverless SQL)
- Google Cloud Firestore (NoSQL)
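The HBase topics above center on row-key design. One widely used best practice is a salted composite key with a reversed timestamp, which spreads sequential writes across region servers and sorts each entity's newest events first. A minimal plain-Python sketch; the bucket count, separator, and field names are illustrative assumptions, not a prescribed schema:

```python
import hashlib

def make_row_key(user_id: str, event_ts: int, n_salt_buckets: int = 8) -> str:
    """Build a salted composite row key (illustrative sketch).

    - salt prefix: spreads sequential writes across regions (avoids hotspotting)
    - reversed, zero-padded timestamp: newest events sort first lexicographically
    """
    salt = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % n_salt_buckets
    reversed_ts = 2**63 - 1 - event_ts  # larger ts -> smaller number -> sorts earlier
    return f"{salt:02d}|{user_id}|{reversed_ts:019d}"

# Same user lands in the same salt bucket; the newer event sorts first.
newer = make_row_key("user42", 200)
older = make_row_key("user42", 100)
```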
Assignment - Due on Friday, March 27, 2020
- Apache Kafka
- Google PubSub
- Spark Streaming
- Apache Beam (Google Dataflow)
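The streaming tools above all share one core idea: aggregating an unbounded stream over time windows. A single-process Python sketch of a tumbling (fixed-width) window count, the same aggregation shape Spark Streaming and Beam apply at scale; the event tuples and 60-second width are illustrative assumptions:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_sec=60):
    """Count (timestamp, key) events per fixed, non-overlapping window.

    Each event is assigned to the window containing its timestamp;
    this mirrors windowed aggregation in Spark Streaming / Apache Beam.
    """
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // window_sec) * window_sec  # floor to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(0, "click"), (30, "click"), (61, "view"), (65, "click")]
result = tumbling_window_counts(events)
# window [0, 60): 2 clicks; window [60, 120): 1 view, 1 click
```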
- Apache Spark Overview
- Spark Machine Learning (MLlib)
- ML Pipelines
- Building and deploying Spark machine learning models
- Considerations for ML in distributed environments
- Spark Best Practices and Tuning
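Spark MLlib organizes model building around Estimators (fit to data) and Transformers (map data to data), chained into a Pipeline. A stdlib-only sketch of that contract to show the flow; the stage and column names are assumptions for illustration, not the `pyspark.ml` API itself:

```python
class Standardizer:
    """Estimator-style stage: fit() learns the mean, transform() applies it."""
    def fit(self, rows):
        vals = [r["x"] for r in rows]
        self.mean = sum(vals) / len(vals)
        return self

    def transform(self, rows):
        # Add a derived column, leaving the input rows' fields intact
        return [{**r, "x_centered": r["x"] - self.mean} for r in rows]

class Pipeline:
    """Chain of stages, each fit on the output of the previous one."""
    def __init__(self, stages):
        self.stages = stages

    def fit_transform(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return rows

data = [{"x": 1.0}, {"x": 3.0}]
out = Pipeline([Standardizer()]).fit_transform(data)
# mean is 2.0, so x_centered becomes -1.0 and 1.0
```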
Assignment (Coming)
Slides
- Intro to Google Cloud Platform
- Overview of Serverless
- Google Cloud Functions
- Cloud Run
- Industry Trends & Applications
- Walk-through of Tools and Services
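Cloud Functions and Cloud Run both deploy a stateless request handler: input arrives, a response and status code come back, and no state survives between invocations. A simplified plain-Python sketch of that shape; a real GCP function receives a Flask request object, so the dict payload and return tuple here are assumptions made to keep the example self-contained:

```python
import json

def handle_event(payload: dict) -> tuple:
    """Stateless handler sketch: validate input, return (body, status)."""
    name = payload.get("name")
    if not name:
        # Bad request: required field missing
        return json.dumps({"error": "missing 'name'"}), 400
    return json.dumps({"greeting": f"Hello, {name}!"}), 200

body, status = handle_event({"name": "Dan"})
```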
Slides
This session serves as overflow from earlier sessions: if extra time or a deeper dive into specific content is needed, it will be covered here.
- Machine Learning APIs
- GCP AutoML
- GCP AI Platform