Awesome Open Source

Search results for spark etl

102 search results found

Doris ⭐ 11,243

Apache Doris is an easy-to-use, high performance and unified analytics database.

Dagster ⭐ 9,467

An orchestration platform for the development, production, and observation of data assets.

Mage Ai ⭐ 6,324

🧙 The modern replacement for Airflow. Build, run, and manage data pipelines for integrating and transforming data.

Aws Glue Samples ⭐ 1,334

AWS Glue code samples

Pyspark Example Project ⭐ 1,034

Example project implementing best practices for PySpark ETL jobs and applications.

Scalable identity resolution, entity resolution, data mastering and deduplication using ML

Goodreads_etl_pipeline ⭐ 593

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.

Aws Glue Libs ⭐ 568

AWS Glue Libraries are additions and enhancements to Spark for ETL operations.

Metorikku ⭐ 536

A simplified, lightweight ETL Framework based on Apache Spark

Spark Excel ⭐ 421

A Spark plugin for reading and writing Excel files

Zdh_web ⭐ 379

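A big data collection and extraction platform. zdh_web is the visual management platform for the zdh family of services, covering data collection, scheduling, permissions, and approvals.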

Big_data_architect_skills ⭐ 353

The skills a big data architect should master

Data Engineering Projects ⭐ 322

Personal Data Engineering Projects

Beginner_de_project ⭐ 276

Beginner data engineering project - batch edition

Butterfree ⭐ 269

A tool for building feature stores.

A simple Spark-powered ETL framework that just works 🍺

A COBOL parser and Mainframe/EBCDIC data source for Apache Spark

Easy_sql ⭐ 126

A library developed to ease the data ETL development process.

Big Data ETL and Utilities for Hadoop Map Reduce, Spark and Storm

Flowman is an ETL framework powered by Apache Spark. With its declarative approach, Flowman simplifies the development of complex data pipelines.

Gallia Core ⭐ 79

A schema-aware Scala library for data transformation

Data Engineering Nanodegree ⭐ 76

Projects done in the Data Engineering Nanodegree by Udacity.com

Luigi Warehouse ⭐ 73

A luigi powered analytics / warehouse stack

Udacity Data Engineer Nanodegree ⭐ 64

Classwork projects and homework completed for the Udacity Data Engineering Nanodegree

Spark Etl ⭐ 62

Apache Spark based ETL Engine

Apachespark ⭐ 59

This repository will help you learn Databricks concepts through examples. It covers the important topics we need in real-life work as data engineers. We use PySpark and Spark SQL for development. At the end of the course we also cover a few case studies.

Zdh_server ⭐ 56

zdh data collection platform, ETL processing service

One ETL tool to rule them all

Data Engineering ⭐ 55

How to build an awesome data engineering team

Datapipelines Essentials Python ⭐ 45

Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations

EtlFlow is an ecosystem of functional libraries in Scala based on ZIO for running complex Auditable workflows which can interact with Google Cloud Platform, AWS, Kubernetes, Databases, SFTP servers, On-Prem Systems and more.

Architect_big_data_solutions_with_spark ⭐ 42

code, labs and lectures for the course

Udacity Data Engineering ⭐ 42

Udacity Data Engineering Nano Degree (DEND)

Spark Ref Architecture ⭐ 38

Reference Architectures for Apache Spark

Etl Light ⭐ 38

A light Kafka to HDFS/S3 ETL library based on Apache Spark

Apache Spark ETL Utilities

Sharpetl ⭐ 36

Write ETL using your favorite SQL dialects

Amazon Eks Apache Spark Etl Sample ⭐ 35

Spark ETL example processing New York taxi rides public dataset on EKS

Intelligent Data Exploration Service: a one-stop Data + AI solution!

Write data & AI pipelines in (SQL, Spark, Pandas) and deploy to the cloud, simplified

Basin is a visual programming editor for building Spark and PySpark pipelines. Easily build, debug, and deploy complex ETL pipelines from your browser

Starlake ⭐ 29

Starlake is an On Premise and Cloud ELT/ETL Framework for Batch & Stream Processing

Data Engineer Nanodegree Projects Udacity ⭐ 27

Projects done in the Data Engineer Nanodegree Program by Udacity.com

Nebula Exchange ⭐ 26

NebulaGraph Exchange is an Apache Spark application to parse data from different sources to NebulaGraph in a distributed environment. It supports both batch and streaming data in various formats and sources including other Graph Databases, RDBMS, Data warehouses, NoSQL, Message Bus, File systems, etc.

Spark Gotchas ⭐ 25

A few things we ran into during our Spark-based ETL project

WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest huge amounts of heterogeneous data and analyze them through complex pipelines, this is the framework for you.

Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Sql Based Etl With Apache Spark On Amazon Eks ⭐ 23

A solution that provides declarative data processing capability, and workflow orchestration automation to help your business users (such as analysts and data scientists) access their data and create meaningful insights without the need for manual IT processes.

Whakapai ⭐ 22

Various Python Data Science Projects available in PyPi

Aws Glue Docker ⭐ 22

🐋 Docker image for AWS Glue Spark/Python

Forklift ⭐ 22

🚚 ETL for Spark and Airflow

De 100 Days ⭐ 22

data engineering 100 days 🤖 🧲 🦾 | #DE

Spark Movies Etl ⭐ 21

Spark data pipeline that ingests and transforms movie ratings data.

Zephyr is a big data, platform agnostic ETL API, with Hadoop MapReduce, Storm, and other big data bindings.

Resilient data pipeline framework running on Apache Spark

Cda Client ⭐ 19

Cloud Data Access client

Jun_bigdata ⭐ 18

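jun_bigdata is a big data platform service framework. It implements real-time Kafka data filtering, cleansing, transformation, and consumption, as well as Sp SQL reads and writes against non-relational databases such as Redis and MongoDB; it also integrates a rules engine, on top of which it can implement ...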

Sparklanes ⭐ 16

A lightweight data processing framework for Apache Spark

Telemetry Streaming ⭐ 15

Spark Streaming ETL jobs for Mozilla Telemetry

Spark Etl ⭐ 15

Set of ETL utils for Spark

Data Pipeline from the Global Historical Climatology Network DataSet

This project is a unified ETL platform that supports various data processing technologies, including Spark, Hive, Hadoop, Python, Linux shell scripts, etc.

Datalink ⭐ 13

A simple, easy-to-use ETL tool

Camus Compressor ⭐ 12

Camus Compressor merges files created by Camus and saves them in a compressed format.

Airflowjob ⭐ 11

Airflow POC demo : 1) env set up 2) airflow DAG 3) Spark/ML pipeline | #DE

Bigdata Etl Pipeline ⭐ 10

The Data Pipeline and Analytics Stack is a comprehensive solution designed for processing, storing, and visualizing data. Explore a complete data pipeline with all components seamlessly set up and ready to use

Spark Etl Atlas ⭐ 10

A small project to show how to add lineage to Atlas when using Spark as ETL tool

Dcc Release ⭐ 10

Second generation of the ICGC DCC release ETL built on Spark

DIEM Data Integration Engine Multipurpose

Yet Another SPark Framework

Restaurantinspectionssparkmlnet ⭐ 9

ETL & Data Enrichment with Spark.NET and ML.NET Automated (Auto) ML

Pyspark Template ⭐ 8

A Python PySpark Project with Poetry

Data Engineering Onboarding Starter ⭐ 8

This repository contains a 10 step program to enter the world of Data Engineering

Apache Spark Etl Pipeline Example ⭐ 8

Demonstration of using Apache Spark to build robust ETL pipelines while taking advantage of open source, general purpose cluster computing.

An elegant way to ETL'ing

Spark HbaseETL Tools. Support bulk

Dlt With Debug ⭐ 8

A lightweight helper utility which allows developers to do interactive pipeline development by having a unified source code for both DLT run and Non-DLT interactive notebook run.

Spark Etl Demo ⭐ 7

Demo of an ETL Spark Job

Meetup Spark Airflow Demo ⭐ 7

Spark & Airflow demo for meetup

Greenplum Streamsets ⭐ 7

Greenplum with Streamsets

This is an ETL application on AWS that uses general open sales and customer data, which you can find here: https://github.com/camposvinicius/data/blob/main/A (a zipped file containing several .csv files, to which we apply transformations).

Mongodb Elasticsearch Spark Etl ⭐ 7

Generic template to read MongoDB and migrate to ElasticSearch

Spark Etl Framework ⭐ 7

A generic ETL framework with Spark_SQL for transforming data by constructing pipelines with Yaml/Json/Xml

Spark Kafka Simple Consumer Receiver ⭐ 7

Pyspark Boilerplate Mehdio ⭐ 7

Pyspark boilerplate for running prod ready data pipeline

Etl Processes Using Sqoop Hadoop Hive Spark And Scala ⭐ 7

I implemented various ETL processes: loading data from MySQL to HDFS using Sqoop, transforming the data using Spark and Scala, performing analytics using Spark and Scala, and loading the data back to HDFS.

Openmrs Etl ⭐ 7

openmrs - mysql - debezium - kafka - spark - scala

Data Engineer Portfolio ⭐ 6

This is a repository to demonstrate my details, skills, projects and to keep track of my progression in Data Analytics and Data Science topics.

Data.engineers.lunch ⭐ 6

Resources from weekly Zoom lunches revolving around Data Engineering. Hosted by Anant Corporation.

Spark Databricks ⭐ 6

🔥 Master Apache Spark & Databricks! Dive into a world of big data with exclusive insights from Udemy courses, personal notes, and practical guides. Whether you're starting out or scaling new heights in data engineering, this is your ultimate resource hub! 🌟🚀

Setl Examples ⭐ 6

Learn SETL with examples, lessons and exercises

Yl Spark Sql ⭐ 6

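A Spark SQL dialect that adds semantics for batch processing, machine learning, model serving, and more; built on a unified SQL syntax, it provides an ETL, machine learning ...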

Spark Sql Etl Framework ⭐ 6

Multi-stage, config driven, SQL based ETL framework using PySpark

Kf Portal Etl ⭐ 5

🏭 Extract-Transform-Load Pipeline for producing data for the Kids First Data Resource Portal

Udacity Data Engineering Nanodegree ⭐ 5

This is a repository to hold the files and notebooks produced throughout my Udacity Data Engineering Nanodegree program.

Doris Sdk ⭐ 5

SDK for Apache Doris

Datafastlane ⭐ 5

Data in the Fast Lane is a powerful and extensible ETL that leverages Apache Spark.

Spark Structured Streaming Kafka ⭐ 5

Spark Structured Streaming + Kafka + Delta pipeline.

Related Searches

Scala Spark (3,279)

Python Spark (2,053)

Java Spark (1,587)

Apache Spark (1,207)

Spark Hadoop (1,188)

Jupyter Notebook Spark (1,151)

Spark Kafka (985)

Spark Streaming (817)

Spark Pyspark (812)

Python Etl (807)

1-100 of 102 search results

Copyright 2018-2024 Awesome Open Source. All rights reserved.