Project Name | Stars | Downloads | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
---|---|---|---|---|---|---|---|---|---|---
Bigdata Notes | 13,291 | | | | 4 months ago | | | 33 | | Java
A getting-started guide to big data :star:
Flink Learning | 13,198 | | | | 3 months ago | | | | apache-2.0 | Java
flink learning blog. http://www.54tianzhisheng.cn/ Covers Flink basics, concepts, internals, hands-on practice, performance tuning, and source-code analysis. Includes learning examples for Flink Connector, Metrics, Library, DataStream API, and Table API & SQL, plus large production use cases (PV/UV, log storage, real-time deduplication of tens of billions of records, monitoring and alerting). The author also maintains the column "Big Data Real-Time Computing Engine Flink in Practice and Performance Tuning".
Technology Talk | 13,004 | | | | 3 months ago | | | 10 | |
A summary of common technology frameworks and open-source middleware in the Java ecosystem, system architecture, databases, architecture case studies from large companies, common third-party libraries, project management, troubleshooting production issues, personal growth, and general reflections
God Of Bigdata | 7,992 | | | | 2 months ago | | | 2 | |
Focused on big data study and interviews; a path to big data mastery. Flink/Spark/Hadoop/Hbase/Hive...
Spring Boot Quick | 2,152 | | | | 2 months ago | | | 12 | | Java
:herb: Quick-start examples based on Spring Boot, integrating open-source frameworks the author has encountered, such as RabbitMQ (delayed queues), Kafka, JPA, Redis, OAuth2, Swagger, JSP, Docker, k3s, k3d, k8s, MyBatis encryption/decryption plugins, exception handling, log output, multi-module development, multi-environment packaging, caching, web crawlers, JWT, GraphQL, Dubbo, ZooKeeper, Async, and more :pushpin:
Bigdataguide | 1,994 | | | | 2 days ago | | | | | Java
Big data learning from scratch, including tutorial videos for each learning stage and interview materials
Szt Bigdata | 1,702 | | | | 6 months ago | | | 15 | other | Scala
A big data passenger-flow analysis system for the Shenzhen Metro 🚇🚄🌟
Gaffer | 1,701 | | 4 | 21 | a day ago | 94 | July 11, 2022 | 115 | apache-2.0 | Java
A large-scale entity and relation database supporting aggregation of properties
Bigdata Interview | 1,397 | | | | 2 years ago | | | | |
:dart: :star2: [Big data interview questions] A collection of big-data-related interview questions gathered from the web, together with my own answer summaries. Currently covers Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper.
Dockerfiles | 1,132 | | | | 16 days ago | | | 14 | mit | Shell
50+ DockerHub public images for Docker & Kubernetes - DevOps, CI/CD, GitHub Actions, CircleCI, Jenkins, TeamCity, Alpine, CentOS, Debian, Fedora, Ubuntu, Hadoop, Kafka, ZooKeeper, HBase, Cassandra, Solr, SolrCloud, Presto, Apache Drill, Nifi, Spark, Consul, Riak
Apache HBase is a distributed key-value store on top of HDFS. It is modeled after Google's Bigtable and provides APIs to query the data. The data is organized, partitioned, and distributed by its "row keys". Within each partition, the data is further physically partitioned by "column families", which group collections of "columns". The data model targets wide, sparse tables whose columns are dynamic and often sparsely populated.
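To illustrate this model, here is a minimal sketch using the plain HBase 1.x client API to write a few sparse columns under two column families. The table name, column families, and columns are hypothetical and not part of this project.

```scala
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Put}
import org.apache.hadoop.hbase.util.Bytes

object HBaseModelSketch {
  def main(args: Array[String]): Unit = {
    val conf = HBaseConfiguration.create()               // picks up hbase-site.xml from the classpath
    val connection = ConnectionFactory.createConnection(conf)
    // Hypothetical table "users" with two column families, "profile" and "activity"
    val table = connection.getTable(TableName.valueOf("users"))

    // Rows are addressed by row key; columns within a family are dynamic and may be sparse.
    val put = new Put(Bytes.toBytes("user#0001"))
    put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"))
    put.addColumn(Bytes.toBytes("activity"), Bytes.toBytes("last_login"), Bytes.toBytes("2015-06-01"))
    table.put(put)

    table.close()
    connection.close()
  }
}
```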
Although HBase is a very useful big data store, its access mechanisms are primitive: client-side APIs, Map/Reduce interfaces, and interactive shells. SQL access to HBase data is available either through Map/Reduce-based mechanisms such as Apache Hive and Impala, or through "native" SQL technologies like Apache Phoenix. The former are usually cheaper to implement and use, but their latency and efficiency often compare unfavorably with the latter, making them suitable mainly for offline analysis. The latter category, in contrast, generally performs better and qualifies more as an online engine; such technologies are typically built on purpose-built execution engines.
Currently, Spark supports queries against HBase data through HBase's Map/Reduce interface (i.e., TableInputFormat). Spark SQL supports Hive data, which in principle can access HBase data out of the box through the same Map/Reduce interface, and therefore falls into the first category of "SQL on HBase" technologies.
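For reference, a minimal sketch of this Map/Reduce-based access path, reading an HBase table into an RDD via TableInputFormat. The table name is hypothetical, and the snippet assumes a running SparkContext named sc (as in spark-shell):

```scala
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Result
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableInputFormat

// Configure which HBase table to read
val hbaseConf = HBaseConfiguration.create()
hbaseConf.set(TableInputFormat.INPUT_TABLE, "users")   // hypothetical table name

// Each record is a (row key, Result) pair produced by HBase's Map/Reduce input format
val hbaseRDD = sc.newAPIHadoopRDD(
  hbaseConf,
  classOf[TableInputFormat],
  classOf[ImmutableBytesWritable],
  classOf[Result])

println(s"Row count: ${hbaseRDD.count()}")
```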
We believe that, as a unified big data processing engine, Spark is in a good position to provide better HBase support.
Online documentation https://github.com/Huawei-Spark/Spark-SQL-on-HBase/blob/master/doc/SparkSQLOnHBase_v2.2.docx
Version 1.0.0 requires Spark 1.4.0.
Spark HBase is built using Apache Maven.
I. Clone and build Huawei-Spark/Spark-SQL-on-HBase
$ git clone https://github.com/Huawei-Spark/Spark-SQL-on-HBase spark-hbase
II. Go to the root of the source tree
$ cd spark-hbase
III. Build the project. Build without testing:
$ mvn -DskipTests clean install
Or, build with testing. This will run the test suites against an HBase minicluster.
$ mvn clean install
First, add the path of the spark-hbase jar to hbase-env.sh in the $HBASE_HOME/conf directory, as follows:
HBASE_CLASSPATH=$HBASE_CLASSPATH:/spark-hbase-root-dir/target/spark-sql-on-hbase-1.0.0.jar
Then register the coprocessor service 'CheckDirEndPoint' in hbase-site.xml in the same directory, as follows:
<property>
<name>hbase.coprocessor.region.classes</name>
<value>org.apache.spark.sql.hbase.CheckDirEndPointImpl</value>
</property>
(Warning: Do not register the other coprocessor service 'SparkSqlRegionObserver' here!)
The easiest way to start using Spark HBase is through the Scala shell:
./bin/hbase-sql
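Once the shell is up, standard SQL statements can be issued. The session below is purely illustrative: the table name is hypothetical, and the HBaseSQLContext is assumed to be exposed as hsqlContext (as the Python shell message below reports).

```scala
// Illustrative only: "users" is a hypothetical, already-mapped HBase-backed table
val rows = hsqlContext.sql("SELECT COUNT(*) FROM users").collect()
rows.foreach(println)
```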
First, add the spark-hbase jar to the SPARK_CLASSPATH in the $SPARK_HOME/conf directory, as follows:
SPARK_CLASSPATH=$SPARK_CLASSPATH:/spark-hbase-root-dir/target/spark-sql-on-hbase-1.0.0.jar
Then go to the spark-hbase installation directory and issue
./bin/pyspark-hbase
On success, a message like the following is printed:
You are using Spark SQL on HBase!!! HBaseSQLContext available as hsqlContext.
To run a Python script, the PYTHONPATH environment variable should be set to the "python" directory of the Spark-HBase installation. For example:
export PYTHONPATH=/root-of-Spark-HBase/python
Note that the shell commands are not included in the Zip file of the Spark release; for version 1.0.0 they are intended for developers' use only. Instead, users can use "$SPARK_HOME/bin/spark-shell --packages Huawei-Spark/Spark-SQL-on-HBase:1.0.0" for the SQL shell or "$SPARK_HOME/bin/pyspark --packages Huawei-Spark/Spark-SQL-on-HBase:1.0.0" for the Python shell.
Testing first requires building Spark HBase. Once Spark HBase is built, the tests can be run with Maven as shown below.
Run all test suites from Maven:
mvn -Phbase,hadoop-2.4 test
Run a single test suite from Maven, for example:
mvn -Phbase,hadoop-2.4 test -DwildcardSuites=org.apache.spark.sql.hbase.BasicQueriesSuite
We use IntelliJ IDEA for Spark HBase development. You can get the community edition for free and install the JetBrains Scala plugin from Preferences > Plugins.
To import the current Spark HBase project into IntelliJ, import it as a Maven project from the project root. When running ScalaTest suites from the IDE, add the following VM options to the run configuration:
-XX:MaxPermSize=512m -Xmx3072m
You can also make these settings the default under "Defaults -> ScalaTest".
Please refer to the Configuration guide in the online documentation for an overview of how to configure Spark.