Awesome Open Source

Search results for spark parquet

67 search results found

Iceberg ⭐ 5,179

Gaffer ⭐ 1,724

A large-scale entity and relation database supporting aggregation of properties

Petastorm ⭐ 1,693

The Petastorm library enables single-machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as TensorFlow, PyTorch, and PySpark and can be used from pure Python code.

ADAM is a genomics analysis platform with specialized file formats built using Apache Avro, Apache Spark, and Apache Parquet. Apache 2 licensed.

Devops Python Tools ⭐ 709

80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.

Iceberg ⭐ 409

Iceberg is a table format for large, slow-moving tabular data

Spindle ⭐ 333

Next-generation web analytics processing with Scala, Spark, and Parquet.

⛈️ RumbleDB 1.21.0 "Hawthorn blossom" 🌳 for Apache Spark | Run queries on your large-scale, messy JSON-like data (JSON, text, CSV, Parquet, ROOT, AVRO, SVM...) | No install required (just a jar to download) | Declarative Machine Learning and more

Spark Programming Guide Zh Cn ⭐ 188

Spark Programming Guide, Simplified Chinese edition

Parquet Index ⭐ 113

Spark SQL index for Parquet tables

Schema registry for CSV, TSV, JSON, AVRO and Parquet schema. Supports schema inference and GraphQL API.

Avro Parquet Spark Example ⭐ 61

An example of using Avro and Parquet in Spark SQL

A temporary home for LinkedIn's changes to Apache Iceberg (incubating)

Spark Compaction ⭐ 52

File compaction tool that runs on top of the Spark framework.

Spark Mail ⭐ 45

Tutorial on parsing Enron email into Avro and then exploring the email set using Spark.

Spark Parquet Thrift Example ⭐ 44

Example Spark project using Parquet as a columnar store with Thrift objects.

Etl Light ⭐ 38

A light Kafka to HDFS/S3 ETL library based on Apache Spark

Scalable knowledge graph construction from unstructured text

Simplesparkavroapp ⭐ 32

Simple Spark app that reads and writes Avro data

Topnotch ⭐ 29

A framework for systematically quality controlling big data.

Bucketing and partitioning system for Parquet

Enceladus ⭐ 28

Dynamic Conformance Engine

WASP is a framework for building complex real-time big data applications. It relies on a kind of Kappa/Lambda architecture, mainly leveraging Kafka and Spark. If you need to ingest a huge amount of heterogeneous data and analyze it through complex pipelines, this is the framework for you.

Apache Spark-based data flow (ETL) framework that supports multiple read and write destinations of different types, as well as multiple categories of transformation rules.

Sql Based Etl With Apache Spark On Amazon Eks ⭐ 23

A solution that provides declarative data processing capability, and workflow orchestration automation to help your business users (such as analysts and data scientists) access their data and create meaningful insights without the need for manual IT processes.

Forklift ⭐ 22

🚚 ETL for Spark and Airflow

Spark_log_data ⭐ 21

Flume-to-Spark-Streaming Log Parser

Cda Client ⭐ 19

Cloud Data Access client

Albis: High-Performance File Format for Big Data Systems

Spark Sql Gdelt ⭐ 16

Scripts and code to import the GDELT dataset into Spark SQL for analysis

Spark Lucenerdd Examples ⭐ 15

Examples of spark-lucenerdd

Parquet Generator ⭐ 15

Parquet file generator

Spark Bigquery ⭐ 15

Google BigQuery data source for Apache Spark

Experiments ⭐ 15

Code examples for my blog posts

Spark Vector ⭐ 15

Repository for the Spark-Vector connector

Spark To Tableau ⭐ 14

Spark to Tableau Extractor library

Pyspark S3 Parquet Example ⭐ 13

This repo demonstrates how to load a sample Parquet-formatted file from an AWS S3 bucket. A Python job is then submitted to an Apache Spark instance running on AWS EMR, which runs a SQLContext to create a temporary table using a DataFrame. SQL queries can then be run against the temporary table.

Infoflow ⭐ 12

An Apache Spark implementation of the InfoMap community detection algorithm

Scalpel Flattening ⭐ 11

This repository hosts code related to SNDS database flattening

Intelqatcodec ⭐ 11

Spark S3 ⭐ 11

Spark Plugin for Amazon S3

Huemul Bigdatagovernance ⭐ 10

Huemul BigDataGovernance is a framework that works on top of Spark, Hive, and HDFS. It enables the implementation of a corporate single-source-of-truth data strategy based on Data Governance best practices. It lets you implement tables with Primary Key and Foreign Key control when inserting and updating data through the library, along with validation of nulls, text lengths, numeric and date minimums/maximums, unique values, and default values. It also allows fields to be classified by applicability of der

Pyspark Dataframe Made Easy ⭐ 10

PySpark DataFrame made easy

Chicago Taxi Trips Analysis ⭐ 10

Analysis of City Of Chicago Taxi Trip Dataset Using AWS EMR, Spark, PySpark, Zeppelin and Airbnb's Superset

Imooc Sparksql ⭐ 10

Spark SQL log analysis and visualization for imooc (慕课网)

Telecom Streaming ⭐ 9

Telecom scenarios implemented with streaming techniques

Elastic Tools ⭐ 9

Apache Spark based command line tools for ElasticSearch

Redditr Insight Data Engineering Project ⭐ 8

RedditR for Content Engagement and Recommendation

Example Applications ⭐ 8

Example applications for use with PNDA

Sempala is a SPARQL-over-SQL approach to provide interactive-time SPARQL query processing on Hadoop. It stores RDF data in a columnar layout (Parquet) on HDFS and uses either Impala or Spark as the execution layer on top of it. SPARQL queries are translated into Impala/Spark SQL for execution.

Spark For Noobs By A Noob ⭐ 7

Jupyter notebooks for learning PySpark

Spark Streaming Twitter ⭐ 7

Building a pipeline to process real-time data using Spark and MongoDB.

Big data query console command and script for Scala

Avrotoparquet ⭐ 6

Command line converter for Apache Avro to Apache Parquet file formats

Strava Spark ⭐ 6

Analyzing my Strava history with Spark

Ob Spark Shell ⭐ 6

Scala spark-shell backend for Org-mode's Babel

Bigdata Platform ⭐ 6

End-to-end big data project that aims to show how to implement the different big data layers, from the infrastructure layer to the end-user layer. [HADOOP][Spark][Kafka][Cassandra][Ansible][Jupyter

Spark Sessions ⭐ 6

Examples of how to split sets of time-based events into sessions using Spark

Stackexchange Parquet ⭐ 6

Spark job for converting the StackExchange Network data into Parquet format.

Schema_evolution_exploration ⭐ 5

Explore schema evolution using Parquet and Spark or Presto

Avroparquet ⭐ 5

AVRO / Parquet Demo Code

Genomic Bigdata Spark ⭐ 5

Genomic BigData Warehousing with Apache Spark and LakeHouse Architecture

A simple in-memory, configuration driven, data processing pipeline for Apache Spark.

Arrow Data Source ⭐ 5

Spark DataSource plugin for reading files in various formats, such as Parquet, into Arrow-compatible columnar vectors.

Copyright 2018-2024 Awesome Open Source. All rights reserved.