Sparkbq

Sparklyr extension package to connect to Google BigQuery

sparkbq: Google BigQuery Support for sparklyr

sparkbq is a sparklyr extension package providing an integration with Google BigQuery. It builds on top of spark-bigquery, which provides a Google BigQuery data source to Apache Spark.

Version Information

You can install the released version of sparkbq from CRAN via

install.packages("sparkbq")

or the latest development version through

devtools::install_github("miraisolutions/sparkbq", ref = "develop")

The following table provides an overview of the supported versions of Apache Spark, Scala, and Google Dataproc:

sparkbq  spark-bigquery  Apache Spark     Scala  Google Dataproc
0.1.x    0.1.0           2.2.x and 2.3.x  2.11   1.2.x and 1.3.x

sparkbq is based on the Spark package spark-bigquery which is available in a separate GitHub repository.
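
As a rough illustration of the version table above, the following sketch checks a Spark version against the supported 2.2.x and 2.3.x range before connecting. The helper is_supported_spark and the example version "2.3.0" are assumptions made for illustration; neither is part of sparkbq or sparklyr.

```r
# Hypothetical helper (not part of sparkbq or sparklyr): returns TRUE if the
# given Spark version falls in the 2.2.x / 2.3.x range listed for sparkbq 0.1.x.
is_supported_spark <- function(version) {
  v <- numeric_version(version)
  v >= "2.2.0" && v < "2.4.0"
}

spark_version <- "2.3.0"  # assumed example version
stopifnot(is_supported_spark(spark_version))

# With a supported version chosen, sparklyr could install and use it locally:
# sparklyr::spark_install(version = spark_version)
# sc <- sparklyr::spark_connect(master = "local[*]", version = spark_version)
```

The sparklyr calls are left commented out because they download and start Spark; uncomment them once a supported version has been selected.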

Example Usage

library(sparklyr)
library(sparkbq)
library(dplyr)

config <- spark_config()

sc <- spark_connect(master = "local[*]", config = config)

# Set Google BigQuery default settings
bigquery_defaults(
  billingProjectId = "<your_billing_project_id>",
  gcsBucket = "<your_gcs_bucket>",
  datasetLocation = "US",
  serviceAccountKeyFile = "<your_service_account_key_file>",
  type = "direct"
)

# Reading the public shakespeare data table
# https://cloud.google.com/bigquery/public-data/
# https://cloud.google.com/bigquery/sample-tables
hamlet <- 
  spark_read_bigquery(
    sc,
    name = "hamlet",
    projectId = "bigquery-public-data",
    datasetId = "samples",
    tableId = "shakespeare") %>%
  filter(corpus == "hamlet") # NOTE: predicate pushdown to BigQuery!
  
# Retrieve results into a local tibble
hamlet %>% collect()

# Write result into "mysamples" dataset in our BigQuery (billing) project
spark_write_bigquery(
  hamlet,
  datasetId = "mysamples",
  tableId = "hamlet",
  mode = "overwrite")

Authentication

When running outside of Google Cloud, you need to specify a service account JSON key file. Information on how to generate service account credentials can be found at https://cloud.google.com/storage/docs/authentication#service_accounts. The key file can be passed either as the serviceAccountKeyFile parameter of bigquery_defaults or directly to spark_read_bigquery and spark_write_bigquery.

Alternatively, set the environment variable GOOGLE_APPLICATION_CREDENTIALS to the path of the key file, e.g. export GOOGLE_APPLICATION_CREDENTIALS=/path/to/your/service_account_keyfile.json (see https://cloud.google.com/docs/authentication/getting-started for more information).

When running on Google Cloud, e.g. Google Cloud Dataproc, application default credentials (ADC) may be used, in which case no service account key file needs to be specified.
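
The options above can be combined in a small sketch that picks a key file in order of precedence: an explicit path first, then the GOOGLE_APPLICATION_CREDENTIALS environment variable, and otherwise nothing. The helper resolve_key_file is hypothetical and not part of sparkbq. It is also an assumption that setting the variable from R with Sys.setenv before spark_connect makes it visible to the Spark process, and that passing NULL as serviceAccountKeyFile leaves authentication to application default credentials.

```r
# Hypothetical helper (not part of sparkbq): choose the key file to use.
# An explicit path wins; otherwise fall back to GOOGLE_APPLICATION_CREDENTIALS.
# NULL means "no key file", which is assumed to leave authentication to
# application default credentials (e.g. on Google Cloud Dataproc).
resolve_key_file <- function(explicit = NULL) {
  if (!is.null(explicit) && nzchar(explicit)) {
    return(explicit)
  }
  from_env <- Sys.getenv("GOOGLE_APPLICATION_CREDENTIALS", unset = "")
  if (nzchar(from_env)) from_env else NULL
}

# Set the variable from within R instead of the shell (placeholder path).
Sys.setenv(GOOGLE_APPLICATION_CREDENTIALS = "/path/to/your/service_account_keyfile.json")
key_file <- resolve_key_file()

# The resolved path could then be handed to sparkbq, for example:
# bigquery_defaults(
#   billingProjectId = "<your_billing_project_id>",
#   gcsBucket = "<your_gcs_bucket>",
#   serviceAccountKeyFile = key_file
# )
```

The bigquery_defaults call stays commented out because it needs real project, bucket, and key file values.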
