Project Name | Description | Stars | Most Recent Commit | License | Language
---|---|---|---|---|---
Scio | A Scala API for Apache Beam and Google Cloud Dataflow. | 2,456 | 18 hours ago | apache-2.0 | Scala
Hadoop Connectors | Libraries and tools for interoperability between Hadoop-related open-source software and Google Cloud Platform. | 267 | 10 days ago | apache-2.0 | Java
Bigquery To Datastore | Export a whole BigQuery table to Google Datastore with Apache Beam/Google Dataflow. | 47 | 3 years ago | | Java
Kettle Beam | Kettle plugins for Apache Beam. | 30 | 3 years ago | apache-2.0 | Java
Almaren Framework | Provides a simplified, consistent, minimalistic layer over Apache Spark while still allowing you to take advantage of native Apache Spark features; it can be combined with standard Spark code. | 29 | 2 months ago | apache-2.0 | Scala
Data Pipeline | | 23 | 5 years ago | | Python
Hive Bigquery Storage Handler | Hive Storage Handler for interoperability between BigQuery and Apache Hive. | 16 | a year ago | apache-2.0 | Java
Kuromoji For Bigquery | Tokenize Japanese text on BigQuery with Kuromoji in Apache Beam/Google Dataflow at scale. | 14 | 5 days ago | | Java
Nifi Bigquery Bundle | BigQuery bundle for Apache NiFi. | 11 | 4 years ago | | Java
Data Rivers | Apache Airflow and Beam ETL scripts for the City of Pittsburgh's data analysis pipelines. | 8 | 5 days ago | | Python
This is a Hive StorageHandler plugin that enables Hive to interact with BigQuery. It allows you to keep your existing pipelines while moving your data to BigQuery. It uses the high-throughput BigQuery Storage API to read data and the BigQuery API to write data.
The following steps are performed on a Dataproc cluster in Google Cloud Platform. If you need to run them on your own cluster, you will need to set up the Google Cloud SDK and the Google Cloud Storage connector for Hadoop. First, clone and build the handler:
git clone https://github.com/GoogleCloudPlatform/hive-bigquery-storage-handler
cd hive-bigquery-storage-handler
mvn clean install
Enable the BigQuery Storage API. Follow these instructions and check the pricing details.
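If you prefer the command line, the API can also be enabled with gcloud (a sketch assuming the gcloud CLI is installed and authenticated against your project; the service name below is the standard BigQuery Storage API endpoint):

# Enable the BigQuery Storage API for the currently configured project
gcloud services enable bigquerystorage.googleapis.com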
Copy the compiled JAR to a Google Cloud Storage bucket that can be accessed by your Hive cluster.
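For example, using gsutil (assuming the build above produced the shaded JAR under target/; the bucket name and destination path are placeholders, and the exact JAR file name may differ by version):

gsutil cp target/hive-bigquery-storage-handler-1.0-shaded.jar gs://<your-bucket>/jars/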
Open the Hive CLI and load the JAR as shown below:
hive> add jar gs://<Jar location>/hive-bigquery-storage-handler-1.0-shaded.jar;
hive> list jars;
At this point you can operate Hive just as you normally would. If you already have a BigQuery table, here is how you can define a Hive table that refers to it:
CREATE TABLE bq_test (word_count bigint, word string)
STORED BY
'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
'bq.dataset'='<BigQuery dataset name>',
'bq.table'='<BigQuery table name>',
'mapred.bq.project.id'='<Your Project ID>',
'mapred.bq.temp.gcs.path'='gs://<Bucket name>/<Temporary path>',
'mapred.bq.gcs.bucket'='<Cloud Storage Bucket name>'
);
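Once defined, the table can be queried like any other Hive table; for instance (assuming the underlying BigQuery table holds word counts, as the column names above suggest):

SELECT word, word_count
FROM bq_test
WHERE word_count > 100
ORDER BY word_count DESC
LIMIT 10;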
You will need to provide the following table properties:
Property | Description |
---|---|
bq.dataset | BigQuery dataset ID (optional if the Hive database name matches the BigQuery dataset name) |
bq.table | BigQuery table name (optional if the Hive table name matches the BigQuery table name) |
mapred.bq.project.id | Your project ID |
mapred.bq.temp.gcs.path | Temporary file location in a GCS bucket |
mapred.bq.gcs.bucket | Temporary GCS bucket name |
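Because bq.dataset and bq.table are optional when the Hive names match the BigQuery names, a table created in a Hive database named after the BigQuery dataset can omit them; a sketch with placeholder names (wordcount_dataset and wordcount_output are hypothetical):

CREATE TABLE wordcount_dataset.wordcount_output (word_count bigint, word string)
STORED BY
'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
'mapred.bq.project.id'='<Your Project ID>',
'mapred.bq.temp.gcs.path'='gs://<Bucket name>/<Temporary path>',
'mapred.bq.gcs.bucket'='<Cloud Storage Bucket name>'
);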
The handler maps BigQuery types to Hive types as follows:

BigQuery | Hive | Description |
---|---|---|
INTEGER | BIGINT | Signed 8-byte integer |
FLOAT | DOUBLE | 8-byte double-precision floating point number |
DATE | DATE | Format is YYYY-[M]M-[D]D; the supported range is 0001-01-01 to 9999-12-31 |
TIMESTAMP | TIMESTAMP | An absolute point in time since the Unix epoch; Hive stores millisecond precision, while BigQuery stores microsecond precision |
BOOLEAN | BOOLEAN | Boolean values are represented by the keywords TRUE and FALSE |
STRING | STRING | Variable-length character data |
BYTES | BINARY | Variable-length binary data |
REPEATED | ARRAY | Represents repeated values |
RECORD | STRUCT | Represents nested structures |
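Repeated and nested BigQuery fields therefore appear as ordinary Hive arrays and structs. As an illustrative sketch, a query over the addresses column of the full example table defined further below could look like this (standard Hive LATERAL VIEW/explode syntax; the filter value is hypothetical):

SELECT userid, addr.city, addr.zip
FROM dbname.alltypeswithSchema
LATERAL VIEW explode(addresses) a AS addr
WHERE addr.state = 'CA';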
The Storage API allows column pruning and predicate filtering so that only the data you are interested in is read. Since BigQuery is backed by a columnar datastore, it can efficiently stream data without reading all columns.

The Storage API supports arbitrary pushdown of predicate filters. To enable predicate pushdown, ensure hive.optimize.ppd is set to true. Filters on primitive-type columns are pushed down to the storage layer, improving read performance. Predicate pushdown is not supported on complex types such as arrays and structs; for example, a filter like address.city = "Sunnyvale" will not be pushed down to BigQuery, as illustrated below.
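A sketch of both cases, using the tables defined in this document:

set hive.optimize.ppd=true;

-- Filter on a primitive column: eligible for pushdown to the BigQuery storage layer
SELECT word, word_count FROM bq_test WHERE word_count > 500;

-- Filter on a field inside a complex type: not filtered at the storage layer,
-- the predicate is evaluated by Hive after the rows are streamed
SELECT userid FROM dbname.alltypeswithSchema WHERE addresses[0].city = 'Sunnyvale';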
To display a TIMESTAMP column in a human-readable format, it can be converted with from_unixtime (which expects seconds, hence the division by 1000, since Hive stores the value with millisecond precision):

from_unixtime(cast(cast(<timestampcolumn> as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss')
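For instance, applied to the currenttimestamp column of the example table defined below (the alias is arbitrary):

SELECT from_unixtime(cast(cast(currenttimestamp as bigint)/1000 as bigint), 'yyyy-MM-dd hh:mm:ss') AS readable_ts
FROM dbname.alltypeswithSchema;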
Set hive.execution.engine=mr to use MapReduce as the execution engine.

The following example defines a table covering the supported types, including arrays and structs, and supplies a matching Avro schema through avro.schema.literal:

CREATE TABLE dbname.alltypeswithSchema (
currenttimestamp TIMESTAMP, currentdate DATE, userid BIGINT, sessionid STRING,
skills ARRAY<STRING>, eventduration DOUBLE, eventcount BIGINT, is_latest BOOLEAN, keyset BINARY,
addresses ARRAY<STRUCT<status: STRING, street: STRING, city: STRING, state: STRING, zip: BIGINT>>
)
STORED BY 'com.google.cloud.hadoop.io.bigquery.hive.HiveBigQueryStorageHandler'
TBLPROPERTIES (
'bq.dataset'='bqdataset',
'bq.table'='bqtable',
'mapred.bq.project.id'='bqproject',
'mapred.bq.temp.gcs.path'='gs://bucketname/prefix',
'mapred.bq.gcs.bucket'='bucketname',
'avro.schema.literal'='{"type":"record","name":"alltypesnonnull",
"fields":[{"name":"currenttimestamp","type":["null",{"type":"long","logicalType":"timestamp-micros"}], "default" : null}
,{"name":"currentdate","type":{"type":"int","logicalType":"date"}, "default" : -1},{"name":"userid","type":"long","doc":"User identifier.", "default" : -1}
,{"name":"sessionid","type":["null","string"], "default" : null},{"name":"skills","type":["null", {"type":"array","items":"string"}], "default" : null}
,{"name":"eventduration","type":["null","double"], "default" : null},{"name":"eventcount","type":["null","long"], "default" : null}
,{"name":"is_latest","type":["null","boolean"], "default" : null},{"name":"keyset","type":["null","bytes"], "default" : null}
,{"name":"addresses","type":["null", {"type":"array",
"items":{"type":"record","name":"__s_0",
"fields":[{"name":"status","type":"string"},{"name":"street","type":"string"},{"name":"city","type":"string"},{"name":"state","type":"string"},{"name":"zip","type":"long"}]
}}], "default" : null
}
]
}'
);
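Since the handler also writes through the BigQuery API, data can be loaded with ordinary Hive DML. A minimal sketch (the source table staging_events is hypothetical and must have columns matching the schema above):

INSERT INTO TABLE dbname.alltypeswithSchema
SELECT currenttimestamp, currentdate, userid, sessionid, skills,
eventduration, eventcount, is_latest, keyset, addresses
FROM staging_events;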