This library lets Apache Spark read an Amazon Athena table through the Athena JDBC driver.
I developed this library for the following reason:
Apache Spark uses PreparedStatement when it reads data through JDBC. The Athena JDBC driver provided by AWS implements only the Statement part of the JDBC spec and does not implement PreparedStatement, so Apache Spark cannot read Athena data through its built-in JDBC data source.
To work around this, I took the JDBC data source implementation in spark-sql as a reference and changed it to call Statement on the Athena JDBC driver, so that Apache Spark can read Athena data.
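To make the difference concrete, here is a minimal sketch of reading rows through a plain `java.sql.Statement` rather than a `PreparedStatement`. It uses only the standard JDBC API. The object and method names (`StatementSketch`, `queryWithStatement`) are hypothetical and this is not the library's actual code, only an illustration of the approach described above.

```scala
import java.sql.{Connection, ResultSet}

// Hypothetical helper for illustration only; not part of spark-athena.
object StatementSketch {

  // Runs `sql` through Connection.createStatement (a plain Statement),
  // never through Connection.prepareStatement, and returns every row
  // as a vector of column values.
  def queryWithStatement(conn: Connection, sql: String): Vector[Vector[AnyRef]] = {
    val stmt = conn.createStatement()
    try {
      val rs: ResultSet = stmt.executeQuery(sql)
      try {
        val columnCount = rs.getMetaData.getColumnCount
        val rows = Vector.newBuilder[Vector[AnyRef]]
        while (rs.next()) {
          // JDBC column indexes start at 1.
          rows += (1 to columnCount).map(i => rs.getObject(i)).toVector
        }
        rows.result()
      } finally rs.close()
    } finally stmt.close()
  }
}
```

A driver that supports only `Statement` can serve this code path, whereas a reader that calls `prepareStatement` would fail against it.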
You can register an Athena table and run SQL queries against it, or query it with the Apache Spark SQL DSL.
import io.github.tmheo.spark.athena._
// Read the result of a query from the current region, using the default S3 staging directory.
val users = spark.read.athena("(select * from users)")
// Read a table from the current region, using an explicit S3 staging directory.
val users2 = spark.read.athena("users", "s3://staging_dir")
// Read a table from another region, using an explicit S3 staging directory.
val users3 = spark.read.athena("users", "us-east-1", "s3://staging_dir")
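Once a table has been read, it behaves like any other Spark DataFrame. The sketch below registers it as a temporary view for SQL and then runs the same query with the DataFrame DSL. It assumes the single-argument `athena` reader also accepts a plain table name, and the column names `name` and `age` are hypothetical. Everything else is standard Spark API.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col
import io.github.tmheo.spark.athena._

object AthenaQuerySketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("athena-query-sketch").getOrCreate()

    // Read the Athena table with the reader method provided by the library.
    val users = spark.read.athena("users")

    // Register the DataFrame as a temporary view so Spark SQL can query it.
    users.createOrReplaceTempView("users")
    // `name` and `age` are hypothetical columns used only for illustration.
    val adultsSql = spark.sql("SELECT name, age FROM users WHERE age >= 18")

    // The same query expressed with the DataFrame DSL.
    val adultsDsl = users.filter(col("age") >= 18).select("name", "age")

    adultsSql.show()
    adultsDsl.show()
    spark.stop()
  }
}
```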
Option | Description |
---|---|
dbtable | Athena table or SQL query |
region | AWS region. Defaults to the current region |
s3_staging_dir | The Amazon S3 location to which your query output is written. Defaults to s3://aws-athena-query-results-${accountNumber}-${region}/ |
user | AWS access key ID. If you do not specify user and password, the library tries to use InstanceProfileCredentialsProvider |
password | AWS secret access key. If you do not specify user and password, the library tries to use InstanceProfileCredentialsProvider |
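The defaults in the table can be sketched as a small helper that builds an options map. This is a hypothetical illustration of the documented behavior, not the library's own code: `AthenaOptionsSketch`, `defaultStagingDir`, and `athenaOptions` are made-up names, and the caller is assumed to know the account number and region already.

```scala
// Hypothetical helper for illustration only; not part of spark-athena.
object AthenaOptionsSketch {

  // Default staging location, following the pattern documented in the table:
  // s3://aws-athena-query-results-${accountNumber}-${region}/
  def defaultStagingDir(accountNumber: String, region: String): String =
    s"s3://aws-athena-query-results-$accountNumber-$region/"

  // Builds the option map. `s3_staging_dir` falls back to the documented
  // default, and `user`/`password` are set only when both are supplied,
  // leaving credential lookup to the library otherwise.
  def athenaOptions(
      dbtable: String,
      region: String,
      accountNumber: String,
      stagingDir: Option[String] = None,
      credentials: Option[(String, String)] = None): Map[String, String] = {
    val base = Map(
      "dbtable" -> dbtable,
      "region" -> region,
      "s3_staging_dir" -> stagingDir.getOrElse(defaultStagingDir(accountNumber, region)))
    credentials.fold(base) { case (accessKeyId, secretAccessKey) =>
      base ++ Map("user" -> accessKeyId, "password" -> secretAccessKey)
    }
  }
}
```

A map like this could be handed to Spark's `DataFrameReader.options(...)`. The data source name to pass to `format(...)` is not stated in this document, so check the library itself before relying on that route.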