Data Engineering Zoomcamp

Free Data Engineering course!
Taking the course

2023 Cohort

Self-paced mode

All the materials of the course are freely available, so you can take it at your own pace.

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form; just start watching the videos and join Slack
  • Check the FAQ if you run into problems
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To keep discussions in Slack organized, follow the recommendations for asking for help and read the DataTalks.Club community guidelines.

Syllabus

Note: NYC TLC changed the format of the data we use to Parquet, but you can still access the CSV files here.
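
Either format can be loaded with pandas. A minimal sketch follows; the file names are placeholders for wherever you downloaded the data, and reading Parquet assumes pyarrow or fastparquet is installed:

```python
import pandas as pd

# Parquet is the format now published by the NYC TLC;
# pandas reads it via pyarrow or fastparquet.
df = pd.read_parquet("yellow_tripdata_2021-01.parquet")

# The older CSV exports can be read the same way
# (compression is inferred from the .gz extension).
df_csv = pd.read_csv("yellow_tripdata_2021-01.csv.gz")

print(df.dtypes)
```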

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details
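
As a taste of the Docker and Postgres material in this week, here is a minimal sketch of loading a month of taxi data into a locally running Postgres container with pandas and SQLAlchemy. The container settings, credentials, file name and table name are illustrative assumptions, not the course's exact setup:

```python
# Assumes a local Postgres started with something like:
#   docker run -e POSTGRES_USER=root -e POSTGRES_PASSWORD=root \
#     -e POSTGRES_DB=ny_taxi -p 5432:5432 postgres:13
import pandas as pd
from sqlalchemy import create_engine  # needs psycopg2 installed for Postgres

engine = create_engine("postgresql://root:root@localhost:5432/ny_taxi")

# Load one month of taxi data (placeholder path) and write it in chunks
# so a laptop does not run out of memory.
for chunk in pd.read_csv("yellow_tripdata_2021-01.csv.gz", chunksize=100_000):
    chunk.to_sql("yellow_taxi_data", engine, if_exists="append", index=False)
```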

Week 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Introduction to Prefect
  • ETL with GCP & Prefect
  • Parametrizing workflows
  • Prefect Cloud and additional resources
  • Homework

More details
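
To give a flavour of the Prefect topics above, here is a minimal sketch of a parametrized ETL flow in the Prefect 2.x style. The URL, column name and output file are placeholders rather than the course's exact code:

```python
import pandas as pd
from prefect import flow, task

@task(retries=3)
def extract(url: str) -> pd.DataFrame:
    # Download one month of taxi data (placeholder URL).
    return pd.read_csv(url)

@task
def transform(df: pd.DataFrame) -> pd.DataFrame:
    # Drop trips with zero passengers as a simple cleaning step.
    return df[df["passenger_count"] > 0]

@task
def load(df: pd.DataFrame) -> None:
    # In the course the data goes to GCS/BigQuery; a local file
    # keeps this sketch self-contained.
    df.to_parquet("clean_trips.parquet")

@flow(name="etl-example")
def etl(url: str) -> None:
    load(transform(extract(url)))

if __name__ == "__main__":
    etl("https://example.com/yellow_tripdata_2021-01.csv.gz")
```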

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details
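
To illustrate partitioning and clustering, here is a hedged sketch using the BigQuery Python client. The dataset, table and column names are made up for the example, and the same DDL statement could be run directly in the BigQuery console:

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses your default GCP credentials and project

# Create a partitioned and clustered copy of a raw trips table.
# Dataset, table and column names are placeholders.
query = """
CREATE OR REPLACE TABLE trips_data.yellow_trips_partitioned
PARTITION BY DATE(tpep_pickup_datetime)
CLUSTER BY VendorID AS
SELECT * FROM trips_data.yellow_trips_raw
"""
client.query(query).result()  # wait for the job to finish
```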

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

More details
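
The dbt models in this week are written in SQL with Jinja, so there is no Python to show for them directly. As a small illustration, here is a hedged sketch of triggering a dbt run programmatically from Python; it assumes dbt-core 1.5+ and an already configured dbt project, and the "staging" selector is a placeholder. In the course you would typically run dbt run / dbt test from the command line or dbt Cloud instead:

```python
# A hedged sketch: invoking dbt from Python instead of the CLI.
# Requires dbt-core >= 1.5 and a dbt project configured in the
# current directory (profiles.yml pointing at BigQuery or Postgres).
from dbt.cli.main import dbtRunner, dbtRunnerResult

dbt = dbtRunner()

# Equivalent to `dbt run --select staging` on the command line;
# "staging" is a placeholder model/folder name.
res: dbtRunnerResult = dbt.invoke(["run", "--select", "staging"])

if res.success:
    print("dbt run completed successfully")
else:
    print("dbt run failed:", res.exception)
```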

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details
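
A minimal PySpark sketch of the DataFrame and Spark SQL topics above, computing a simple aggregation; the file path and column names follow the taxi dataset but are assumptions for the example:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .master("local[*]")           # run Spark locally, using all cores
    .appName("taxi-batch-example")
    .getOrCreate()
)

# Placeholder path: a month of taxi data in Parquet format.
df = spark.read.parquet("data/yellow_tripdata_2021-01.parquet")

# DataFrame API: revenue per pickup zone per day.
daily_revenue = (
    df.withColumn("pickup_date", F.to_date("tpep_pickup_datetime"))
      .groupBy("pickup_date", "PULocationID")
      .agg(F.sum("total_amount").alias("revenue"))
)
daily_revenue.show(5)

# The same aggregation via Spark SQL.
df.createOrReplaceTempView("trips")
spark.sql("""
    SELECT DATE(tpep_pickup_datetime) AS pickup_date,
           PULocationID,
           SUM(total_amount) AS revenue
    FROM trips
    GROUP BY 1, 2
""").show(5)
```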

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details
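
A minimal sketch of producing and consuming JSON messages with the kafka-python client. The broker address, topic name and payload are assumptions, and the Avro, Kafka Streams, Connect and KSQL topics above go beyond this basic example:

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Assumes a Kafka broker running locally, e.g. started via docker-compose.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
producer.send("rides", {"vendor_id": 1, "total_amount": 12.5})  # placeholder topic and payload
producer.flush()

consumer = KafkaConsumer(
    "rides",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    consumer_timeout_ms=5000,   # stop iterating if no message arrives within 5 s
)
for message in consumer:
    print(message.value)
```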

Week 7, 8 & 9: Project

Putting everything we've learned into practice

  • Weeks 7 and 8: working on your project
  • Week 9: reviewing your peers' projects

More details

Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get the most out of this course, you should be comfortable with coding and the command line, and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course, you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools.

Supporters and partners

Thanks to the sponsors for making it possible to create this course.

Do you want to support our course and our community? Please reach out to [email protected]
