Spark Bigquery

Google BigQuery support for Spark, SQL, and DataFrames

MAINTENANCE MODE

THIS PROJECT IS IN MAINTENANCE MODE DUE TO THE FACT THAT IT'S NOT WIDELY USED WITHIN SPOTIFY. WE'LL PROVIDE BEST EFFORT SUPPORT FOR ISSUES AND PULL REQUESTS BUT DO EXPECT DELAY IN RESPONSES.

spark-bigquery


Google BigQuery support for Spark, SQL, and DataFrames.

| spark-bigquery version | Spark version | Comment |
|---|---|---|
| 0.2.x | 2.x.y | Active development |
| 0.1.x | 1.x.y | Development halted |

To use the package in a Google Cloud Dataproc cluster:

Install org.apache.avro_avro-ipc-1.7.7.jar to ~/.ivy2/jars, then start the shell with the package:

spark-shell --packages com.spotify:spark-bigquery_2.10:0.2.2
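For a local sbt project, the same Maven coordinates can be declared as a library dependency. The fragment below is a minimal sketch: the spark-bigquery coordinates come from the command above, while the Scala patch version and the Spark version are assumptions chosen to match a Scala 2.10 / Spark 2.x setup and may need adjusting.

```scala
// build.sbt (sketch)

// Assumption: any Scala 2.10.x release that matches the _2.10 artifact suffix.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  // Coordinates taken from the spark-shell command: com.spotify:spark-bigquery_2.10:0.2.2
  "com.spotify" % "spark-bigquery_2.10" % "0.2.2",
  // Assumption: a Spark 2.x release that still publishes Scala 2.10 artifacts.
  "org.apache.spark" %% "spark-sql" % "2.2.0"
)
```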

To use it in a local SBT console:

import com.spotify.spark.bigquery._

// Set up GCP credentials
sqlContext.setGcpJsonKeyFile("<JSON_KEY_FILE>")

// Set up BigQuery project and bucket
sqlContext.setBigQueryProjectId("<BILLING_PROJECT>")
sqlContext.setBigQueryGcsBucket("<GCS_BUCKET>")

// Set up BigQuery dataset location, default is US
sqlContext.setBigQueryDatasetLocation("<DATASET_LOCATION>")
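The setup calls above assume that a `sqlContext` value is already in scope. In a plain SBT console one may need to be created first. A minimal sketch, assuming Spark 2.x is on the classpath; the master URL and application name are illustrative choices, not requirements of this package:

```scala
import org.apache.spark.sql.{SQLContext, SparkSession}

// Local session for experimentation; "local[*]" uses all available cores.
val spark: SparkSession = SparkSession.builder()
  .master("local[*]")
  .appName("spark-bigquery-console") // hypothetical name
  .getOrCreate()

// The methods shown in this README are called on a SQLContext,
// which a Spark 2.x session exposes directly.
val sqlContext: SQLContext = spark.sqlContext
```

With this in place, the `import com.spotify.spark.bigquery._` line and the setters above can be run as written.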

Usage:

// Load everything from a table
val table = sqlContext.bigQueryTable("bigquery-public-data:samples.shakespeare")

// Load results from a SQL query
// Only the legacy SQL dialect is supported for now
val df = sqlContext.bigQuerySelect(
  "SELECT word, word_count FROM [bigquery-public-data:samples.shakespeare]")

// Save data to a table
df.saveAsBigQueryTable("my-project:my_dataset.my_table")
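The query above follows the legacy SQL convention of wrapping a `project:dataset.table` reference in square brackets, whereas `bigQueryTable` and `saveAsBigQueryTable` take the bare reference. A small helper can keep the two forms consistent. This is a sketch only: `legacySelect` is a hypothetical name and is not part of spark-bigquery.

```scala
// Hypothetical helper (not part of spark-bigquery): builds a legacy SQL SELECT
// statement for a bare table reference of the form "project:dataset.table".
def legacySelect(tableRef: String, columns: Seq[String]): String = {
  require(columns.nonEmpty, "at least one column is required")
  require(!tableRef.contains("[") && !tableRef.contains("]"),
    "pass the bare reference; the brackets are added here")
  s"SELECT ${columns.mkString(", ")} FROM [$tableRef]"
}

val query = legacySelect(
  "bigquery-public-data:samples.shakespeare",
  Seq("word", "word_count"))
```

The resulting string can be passed to `sqlContext.bigQuerySelect(query)` in the same way as the literal query shown above.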

To write nested records to BigQuery, be sure to specify an Avro namespace. BigQuery cannot load nested records whose Avro namespace has a leading dot (for example, .nestedColumn).

// BigQuery is able to load fields with namespace 'myNamespace.nestedColumn'
df.saveAsBigQueryTable("my-project:my_dataset.my_table", tmpWriteOptions = Map("recordNamespace" -> "myNamespace"))
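The two names quoted above (`.nestedColumn` and `myNamespace.nestedColumn`) suggest that the namespace of a nested record is the configured record namespace joined to the column name with a dot. The snippet below is a simplified illustration of that reading, not the library's code; both function names are hypothetical.

```scala
// Illustration only: models how a nested record's namespace appears to be
// formed, based on the two example names in this README.
def nestedNamespace(recordNamespace: String, columnName: String): String =
  s"$recordNamespace.$columnName"

// A namespace that starts with a dot is the case BigQuery cannot load.
def hasLeadingDot(namespace: String): Boolean = namespace.startsWith(".")

val withoutNamespace = nestedNamespace("", "nestedColumn")
val withNamespace = nestedNamespace("myNamespace", "nestedColumn")
```

Under this reading, an empty record namespace produces the leading dot, and any non-empty `recordNamespace` value avoids it.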

See also "Loading Avro Data from Google Cloud Storage" in the BigQuery documentation for data type mappings and limitations. For example, loading arrays of arrays is not supported.

License

Copyright 2016 Spotify AB.

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0
