Awesome Open Source

Search results for spark data engineering

70 search results found

Data Engineering Zoomcamp ⭐ 19,461

Free Data Engineering course!

Cookbook ⭐ 12,557

The Data Engineering Cookbook

Dagster ⭐ 9,467

An orchestration platform for the development, production, and observation of data assets.

Mage Ai ⭐ 6,324

🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.

Risingwave ⭐ 5,799

The distributed streaming database. Engineered to offer the simplest and most cost-efficient way for stream processing and management.

Awesome Opensource Data Engineering ⭐ 1,331

An Awesome List of Open-Source Data Engineering Projects

Pyspark Example Project ⭐ 1,034

Example project implementing best practices for PySpark ETL jobs and applications.

Around Dataengineering ⭐ 926

A Data Engineering & Machine Learning Knowledge Hub

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Blazing-fast query execution engine that speaks the Apache Spark language and has Arrow-DataFusion at its core.

Goodreads_etl_pipeline ⭐ 593

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Data Engineering Interview Questions ⭐ 554

More than 2,000 data engineer interview questions.

Data Engineering Projects ⭐ 322

Personal Data Engineering Projects

Every Single Day I Tldr ⭐ 311

A daily digest of the articles or videos I've found interesting, that I want to share with you.

Butterfree ⭐ 269

A tool for building feature stores.

A Clojure dataframe library that runs on Spark

A simple Spark-powered ETL framework that just works 🍺

Spark Alchemy ⭐ 169

Collection of open-source Spark tools & frameworks that have made the data engineering and data science teams at Swoop highly productive

Lakehouse Engine ⭐ 154

The Lakehouse Engine is a configuration driven Spark framework, written in Python, serving as a scalable and distributed engine for several lakehouse algorithms, data flows and utilities for Data Products.

Scalable Data Science Platform ⭐ 153

Content for architecting a data science platform for products using Luigi, Spark & Flask.

Big Data Mapreduce Course ⭐ 135

Big Data Modeling, MapReduce, Spark, PySpark @ Santa Clara University

Movalytics Data Warehouse ⭐ 116

Data pipeline performing ETL to AWS Redshift using Spark, orchestrated with Apache Airflow

De Zoomcamp Ui ⭐ 107

🎨 UI for the Free Data Engineering Zoomcamp 2023 Course provided by DataTalksClub

Streamify ⭐ 97

A data engineering project with Kafka, Spark Streaming, dbt, Docker, Airflow, Terraform, GCP and much more!

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

Gallia Core ⭐ 79

A schema-aware Scala library for data transformation

Data Engineering Nanodegree ⭐ 76

Projects done in the Data Engineering Nanodegree by Udacity.com

Waimak is an open-source framework that makes it easier to create complex data flows in Apache Spark.

Apachespark ⭐ 59

This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-world work, using PySpark and Spark SQL for development. The course ends with a few case studies.

Containerized distributed programming framework for Python

Soda Spark ⭐ 49

Soda Spark is a PySpark library that helps you with testing your data in Spark Dataframes

Learn Data Munging ⭐ 37

Notes on Data Engineering with Pandas, PySpark, Dask, Ray, Arrow DataFusion, Polars etc.

Sageworks ⭐ 36

SageWorks: An easy to use Python API for creating and deploying SageMaker Models

PyJaws: A Pythonic Way to Define Databricks Jobs and Workflows

Us Stock Prediction Using Ml And Spark ⭐ 35

Predict stock price based on financial news feeds

Distributedwekaspark ⭐ 32

Write data & AI pipelines in (SQL, Spark, Pandas) and deploy to the cloud, simplified

Spark Ai ⭐ 31

Toolbox for building Generative AI applications on top of Apache Spark.

Spark Studyclub ⭐ 31

Apache Spark study group organized by the Data Engineering Latam community

Debussy_concert ⭐ 29

Debussy is an opinionated Data Architecture and Engineering framework, enabling data analysts and engineers to build better platforms and pipelines.

Sparkdataset ⭐ 28

Instant search for and access to many datasets in Pyspark.

Data Engineering Nanodegree ⭐ 27

Solution to all projects of Udacity's Data Engineering Nanodegree: Data Modeling with Postgres & Cassandra, Data Warehouse with Redshift, Data Lake with Spark and Data Pipeline with Airflow.

Aws Glue Docker ⭐ 22

🐋 Docker image for AWS Glue Spark/Python

Jobanalytics_and_search ⭐ 22

JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.

De 100 Days ⭐ 22

data engineering 100 days 🤖 🧲 🦾 | #DE

Spark Movies Etl ⭐ 21

Spark data pipeline that ingests and transforms movie ratings data.

Spark Distcp ⭐ 18

A re-implementation of Hadoop DistCP in Apache Spark

Big Data Engineering ⭐ 15

Data Pipeline from the Global Historical Climatology Network DataSet

Pyspark On Aws Emr ⭐ 13

The goal of this project is to offer an AWS EMR template using Spot Fleet and On-Demand Instances that you can use quickly. Just focus on writing pyspark code.

Bootcamp_data Engineering ⭐ 13

Bootcamp to learn basics in Data Engineering

Akka Lift Ml ⭐ 12

Akka HTTP service for serving Spark machine learning models

Marshmallow Pyspark ⭐ 12

Marshmallow serializer integration with pyspark

Data Paths ⭐ 11

Airflowjob ⭐ 11

Airflow POC demo: 1) environment setup 2) Airflow DAG 3) Spark/ML pipeline | #DE

Huemul Bigdatagovernance ⭐ 10

Huemul BigDataGovernance is a framework that runs on Spark, Hive and HDFS. It enables a corporate single-source-of-data strategy based on Data Governance best practices. It lets you implement tables with Primary Key and Foreign Key checks when inserting and updating data through the library, plus validation of nulls, text lengths, numeric and date minimums/maximums, unique values and default values. It also lets you classify fields according to the applicability of der

Fake Data Pipeline ⭐ 10

Data Generators -> Kafka -> Spark Streaming -> PostgreSQL -> Grafana

Sparkitecture ⭐ 9

A collection of “cookbook-style” scripts for simplifying data engineering and machine learning in Apache Spark.

Pyspark Template ⭐ 8

A Python PySpark Project with Poetry

Data Engineering Onboarding Starter ⭐ 8

This repository contains a 10 step program to enter the world of Data Engineering

Repo of practical approaches to data science problems, including notebook demos and working scripts | #DS | #analysis

Itversity Boxes ⭐ 8

Repository for all ITVersity Vagrant Boxes.

This set of code and instructions is meant to instantiate a prebuilt environment from a set of Docker images (Airflow webserver, Airflow scheduler, PostgreSQL, PySpark) and a data pipeline that consumes data from a weather API, processes it with PySpark and stores it in PostgreSQL

Data Engineering Interviews ⭐ 7

Data engineering interviews Q&A for data community by data community

Dataengineering Youtube Project ⭐ 6

Data Engineering Youtube Project

Sparklyclean ⭐ 6

Optimal distributed data deduplication and supervised learning pipeline using Apache Spark

Data Engineer Portfolio ⭐ 6

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

Data.engineers.lunch ⭐ 6

Resources from weekly Zoom lunches revolving around Data Engineering. Hosted by Anant Corporation.

Awesome Data Pipeline ⭐ 6

Awesome list for data pipelines

Dataengineering ⭐ 6

The Data Engineering subteam of Cornell Data Science

Spark Databricks ⭐ 6

🔥 Master Apache Spark & Databricks! Dive into a world of big data with exclusive insights from Udemy courses, personal notes, and practical guides. Whether you're starting out or scaling new heights in data engineering, this is your ultimate resource hub! 🌟🚀

Spark Structured Streaming Kafka ⭐ 5

Spark Structured Streaming + Kafka + Delta pipeline.

Data Readings ⭐ 5

Reading List in Data Systems

Udacity Data Engineering Nanodegree ⭐ 5

This is a repository to hold the files and notebooks produced throughout my Udacity Data Engineering Nanodegree program.

Docker_spark_history_ui ⭐ 5

A dockerised version of the Spark History Server that lets us access metrics in the Spark UI from a log generated by AWS Glue

Copyright 2018-2024 Awesome Open Source. All rights reserved.