Awesome Open Source
Search results for data lake
100 search results found
Trino
⭐
9,118
Official repository of Trino, the distributed SQL query engine for big data, formerly known as PrestoSQL (https://trino.io)
Deeplake
⭐
7,689
Database for AI. Store Vectors, Images, Texts, Videos, etc. Use with LLMs/LangChain. Store, query, version, & visualize any AI data. Stream data in real-time to PyTorch/TensorFlow. https://activeloop.ai
Starrocks
⭐
7,191
StarRocks, a Linux Foundation project, is a next-generation sub-second MPP OLAP database for full analytics scenarios, including multi-dimensional analytics, real-time analytics, and ad-hoc queries. Winner of InfoWorld’s 2023 BOSSIE Award for best open source software.
Hudi
⭐
5,064
Upserts, Deletes And Incremental Processing on Big Data.
Lakefs
⭐
3,900
lakeFS - Data version control for your data lake | Git for data
Dinky
⭐
2,657
Dinky is a data development platform based on Apache Flink, enabling agile data development and deployment.
Lakesoul
⭐
2,248
LakeSoul is an end-to-end, realtime and cloud native Lakehouse framework with fast data ingestion, concurrent update and incremental data analytics on cloud storages for both BI and AI applications.
Kyuubi
⭐
1,849
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Bitsail
⭐
1,514
BitSail is a distributed high-performance data integration engine which supports batch, streaming and incremental scenarios. BitSail is widely used to synchronize hundreds of trillions of data every day.
Leofs
⭐
1,338
The LeoFS Storage System
Udacity Data Engineering Projects
⭐
1,335
A few projects related to Data Engineering, including Data Modeling, Infrastructure setup on cloud, Data Warehousing and Data Lake development.
Dlt
⭐
1,069
data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
Kylo
⭐
1,035
Kylo is a data lake management software platform and framework for enabling scalable enterprise-class data lakes on big data technologies such as Teradata, Apache Spark and/or Hadoop. Kylo is licensed under Apache 2.0. Contributed by Teradata Inc.
Zingg
⭐
828
Scalable identity resolution, entity resolution, data mastering and deduplication using ML
Amoro
⭐
617
Amoro is a Lakehouse management system built on open data lake formats.
Goodreads_etl_pipeline
⭐
593
An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Vulcan Sql
⭐
570
Open-source Analytical Data API Framework for data apps. It turns SQL queries into RESTful APIs in no time!
Hudi Resources
⭐
509
A collection of Apache Hudi resources
Automate Dv
⭐
456
A free to use dbt package for creating and loading Data Vault 2.0 compliant Data Warehouses (powered by dbt, an open source data engineering tool, registered trademark of dbt Labs)
Marmaray
⭐
444
Generic Data Ingestion & Dispersal Library for Hadoop
Aws Serverless Data Lake Framework
⭐
379
Enterprise-grade, production-hardened, serverless data lake on AWS
Data Engineering Projects
⭐
322
Personal Data Engineering Projects
Cuelake
⭐
266
Use SQL to build ELT pipelines on a data lakehouse.
Usql
⭐
233
U-SQL Examples and Issue Tracking
Amazon S3 Find And Forget
⭐
223
Amazon S3 Find and Forget is a solution to handle data erasure requests from data lakes stored on Amazon S3, for example, pursuant to the European General Data Protection Regulation (GDPR)
Hivemq Mqtt Tensorflow Kafka Realtime Iot Machine Learning Training Inference
⭐
159
Real Time Big Data / IoT Machine Learning (Model Training and Inference) with HiveMQ (MQTT), TensorFlow IO and Apache Kafka - no additional data store like S3, HDFS or Spark required
Btrblocks
⭐
156
BtrBlocks: Efficient Columnar Compression for Data Lakes (SIGMOD 2023 Paper)
Gravitino
⭐
153
World's most powerful data catalog service, providing a high-performance, geo-distributed and federated metadata lake.
Aws Orbit Workbench
⭐
127
A Data Platform built for AWS, powered by Kubernetes.
Streamis
⭐
96
Streaming application development and management system, based on Linkis and DSS, planning to provide the workflow-like graphical drag-and-drop development capability.
Smart Data Lake
⭐
87
Smart Automation Tool for building modern Data Lakes and Data Pipelines
Roota
⭐
86
RootA is a public-domain language of threat detection and response that combines native queries from a SIEM, EDR, XDR, or Data Lake with standardized metadata and threat intelligence to enable automated translation into other languages
Uncoder_io
⭐
81
An IDE and translation engine for detection engineers and threat hunters. Be faster, write smarter, keep 100% privacy.
Apachespark
⭐
59
This repository helps you learn Databricks concepts through examples. It covers the important topics a data engineer needs in real-life work, using PySpark and Spark SQL for development. The course ends with a few case studies.
Zeeqs
⭐
57
GraphQL API for Zeebe data
Lighthouse
⭐
54
Lighthouse is a library for data lakes built on top of Apache Spark. It provides high-level APIs in Scala to streamline data pipelines and apply best practices.
Doris Website
⭐
51
Apache Doris Website
Anyscale
⭐
49
anyscale roadmap
Aws Dbs Refarch Datalake
⭐
47
Reference Architectures for Datalakes on AWS
Dataligo
⭐
47
A library to accelerate ML and ETL pipeline by connecting all data sources
Datapipelines Essentials Python
⭐
45
Simplified ETL process in Hadoop using Apache Spark. Has complete ETL pipeline for datalake. SparkSession extensions, DataFrame validation, Column extensions, SQL functions, and DataFrame transformations
Accio
⭐
43
Accio - Query Your Data Warehouse Like Exploring One Big View.
Rtdl
⭐
39
rtdl makes it easy to build and maintain a real-time data lake
Cnfuzz
⭐
36
Breaking Cloud Native Web APIs in their natural habitat.
Querypal
⭐
36
Web UI for Amazon Athena
Terraform Azure Data
⭐
35
Terraform script to deploy almost all Azure Data Services
Pan Cortex Data Lake Python
⭐
32
Python idiomatic SDK for Cortex™ Data Lake.
Threat Detection And Visualization
⭐
30
Threat Detection and Visualization
Real Time Data Warehouse
⭐
29
Real-time Data Warehouse with Apache Flink & Apache Kafka & Apache Hudi
Awesome Data Engineering
⭐
29
📒(GitBook) A curated list of awesome Data Engineering resources
Terraform Module Azure Datalake
⭐
28
Terraform module for an Azure Data Lake
Enceladus
⭐
28
Dynamic Conformance Engine
Data Engineer Nanodegree Projects Udacity
⭐
27
Projects done in the Data Engineer Nanodegree Program by Udacity.com
Apiary
⭐
27
Apiary provides modules which can be combined to create a federated cloud data lake
Aws Auto Terminate Idle Emr
⭐
26
AWS Auto Terminate Idle AWS EMR Clusters Framework is an AWS-based solution that uses AWS CloudWatch and an AWS Lambda function, written in Python with Boto3, to terminate AWS EMR clusters that have been idle for a specified period of time.
Doris Thirdparty
⭐
26
Self-managed thirdparty dependencies for Apache Doris
Local Data Lakehouse
⭐
24
Sample Data Lakehouse deployed in Docker containers using Apache Iceberg, Minio, Trino and a Hive Metastore. Can be used for local testing.
Nodestream
⭐
23
A Fast, Declarative, and Extensible ETL Framework for Graph Databases.
Tickit Data Lake Demo
⭐
23
Resources for video demonstrations and blog posts related to DataOps on AWS
Jobanalytics_and_search
⭐
22
JobAnalytics system consumes data from multiple sources and provides valuable information to both job hunters and recruiters.
Docker_datalake
⭐
21
Datalake
R2 Bucket Uploader
⭐
20
Cloudflare R2 bucket File Uploader
Sparkprogramminginscala
⭐
18
Apache Spark Course Material
Apiary Data Lake
⭐
18
Terraform scripts for deploying Apiary Data Lake
Serverless Datalake Example
⭐
17
A serverless datalake project and framework based on AWS S3, Glue, Athena, MWAA and QuickSight. With a series of best practices, it guides you through building a serverless datalake.
Hiveberg
⭐
16
Demonstration of a Hive Input Format for Iceberg
Analyzing Reddit Sentiment With Aws
⭐
16
Learn how to use Kinesis Firehose, AWS Glue, S3, and Amazon Athena by streaming and analyzing reddit comments in realtime. 100-200 level tutorial.
Azure Certification Dp 200
⭐
16
Road to Azure Data Engineer Part-I: DP-200 - Implementing an Azure Data Solution
Data Mill
⭐
16
A K8s-based infrastructure for analytics
Azure Security Data Lake
⭐
16
A platform for extracting and shipping security value from your data lake to Sentinel.
Ghcn D
⭐
14
Data Pipeline from the Global Historical Climatology Network DataSet
Data Engineering Mta Turnstile
⭐
14
Data Engineering - Metropolitan Transportation Authority (MTA) Subway Data Analysis
Parquet Usql
⭐
13
A custom extractor designed to read parquet for Azure Data Lake Analytics
Coinmetrics Formula Builder Models
⭐
13
A collection of json files used to automatically create models at https://charts.coinmetrics.io/formulas/
Prestorials
⭐
13
Tutorials and examples of how to deploy Presto and connect it to different data sources
Datalake Graphql Wrapper
⭐
12
The DataLake GraphQL Wrapper provides a GraphQL API for presto/trino.
Herd Mdl
⭐
11
Herd-MDL, a turnkey managed data lake in the cloud. See https://finraos.github.io/herd-mdl/ for more information.
Lakefs Hooks
⭐
10
A simple lakeFS webhook for pre-commit and pre-merge validation of data objects
Nayco
⭐
10
Nayco (内湖) is an all-in-one micro DataLake for IoT
Columnar
⭐
9
An idiomatic kotlin dataframe toolkit for data engineering tasks of any size dataset
Vulcan Sql Examples
⭐
9
Curated VulcanSQL show cases
Kyuubi Docker
⭐
9
Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses.
Awesome Olap
⭐
8
A curated list of awesome Online Analytical Processing databases, frameworks, resources and other awesomeness.
Logstash Output Adls
⭐
7
Logstash output plugin for Azure Data Lake Store (ADLS)
Kassette Server
⭐
7
Secured pipelines for your reporting and auditing data
Aws Insurancelake Etl
⭐
7
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake Infrastructure project.
Awesome Data Pipeline
⭐
6
Awesome list for datapipeline
Formacao Engenheiro De Dados Cloud E Big Data Azure Databricks
⭐
6
Formação Engenheiro de Dados Cloud e Big Data (Azure & DataBricks)
2020 Healthcarelake
⭐
6
A reasonably secure data lake for healthcare analytics
Bigdata Platform
⭐
6
End-to-end big data project that aims to show how to implement different big data layers, from the infrastructure layer to the end-user one. [HADOOP][Spark][Kafka][Cassandra][Ansible][Jupyter]
Serverless Architecture
⭐
5
Companion to my LinkedIn Learning 'Serverless Architecture' course
Genomic Bigdata Spark
⭐
5
Genomic BigData Warehousing with Apache Spark and LakeHouse Architecture
Projecty
⭐
5
Project Y is a straightforward Landing Zones automated deployment tool dedicated to data processing.
Nodejs Data Lake Dashboard
⭐
5
Sample and tutorial that creates interactive dashboards using: Dynamic Dashboard Embedded, Cloud Object Storage, SQL Query, DB2 Warehouse and AppID.
Microsoft Big Data Scientist And Ai
⭐
5
Microsoft Big Data, Data Scientist, and AI
Spark Streaming In Python
⭐
5
Apache Spark 3 - Structured Streaming Course Material
Udacity Data Engineering Nanodegree
⭐
5
This is a repository to hold the files and notebooks produced throughout my Udacity Data Engineering Nanodegree program.
Vre
⭐
5
VRE infrastructure running at CERN
Aws Insurancelake Infrastructure
⭐
5
This solution helps you deploy ETL processes and data storage resources to create an Insurance Lake using Amazon S3 buckets for storage, AWS Glue for data transformation, and AWS CDK Pipelines. It is originally based on the AWS blog Deploy data lake ETL jobs using CDK Pipelines, and complements the InsuranceLake ETL with CDK Pipelines project.
Lakeapi
⭐
5
API for distributing Data Lake Data
Copyright 2018-2024 Awesome Open Source. All rights reserved.