Project Name | Stars | Repos Using This | Packages Using This | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language | Description
---|---|---|---|---|---|---|---|---|---|---
Bigdata Notes | 13,291 | | | 4 months ago | | | 33 | | Java | Big data beginner's guide :star:
God Of Bigdata | 7,992 | | | 2 months ago | | | 2 | | | Focused on big data learning and interviews; the road to big data mastery starts here. Flink/Spark/Hadoop/Hbase/Hive...
Tensorflowonspark | 3,851 | 5 | | 15 days ago | 32 | April 21, 2022 | 13 | apache-2.0 | Python | TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Ibis | 2,756 | 24 | 16 | 2 hours ago | 32 | April 28, 2022 | 80 | apache-2.0 | Python | The flexibility of Python with the scale and performance of modern SQL.
Bigdata Interview | 1,397 | | | 2 years ago | | | | | | :dart: :star2: [Big data interview questions] Big-data-related interview questions collected from the web, with the author's answer summaries; currently covers the Hadoop/Hive/Spark/Flink/Hbase/Kafka/Zookeeper frameworks.
Bigdata Growth | 898 | | | 4 hours ago | | | 1 | mit | Shell | A big data knowledge base covering data warehouse modeling, real-time computing, big data, data platforms, system design, Java, algorithms, and more.
Devops Python Tools | 658 | | | 3 days ago | | | 32 | mit | Python | 80+ DevOps & Data CLI Tools - AWS, GCP, GCF Python Cloud Functions, Log Anonymizer, Spark, Hadoop, HBase, Hive, Impala, Linux, Docker, Spark Data Converters & Validators (Avro/Parquet/JSON/CSV/INI/XML/YAML), Travis CI, AWS CloudFormation, Elasticsearch, Solr etc.
Sparta | 524 | | | 4 years ago | | | 9 | apache-2.0 | Scala | Real Time Analytics and Data Pipelines based on Spark Streaming
Docker Hadoop Spark Workbench | 503 | | | 3 years ago | | | 18 | | Makefile | [EXPERIMENTAL] This repo includes deployment instructions for running HDFS/Spark inside docker containers. Also includes spark-notebook and HDFS FileBrowser.
Spindle | 333 | | | 8 years ago | | | 2 | apache-2.0 | JavaScript | Next-generation web analytics processing with Scala, Spark, and Parquet.
List of collisions that have occurred in Montreal since 2012.
This dataset covers collisions involving at least one motor vehicle circulating on the road network that were the subject of a police report. It includes descriptive, contextual and location information about each event, including severity in terms of death, serious injury, minor injury and property damage only.
Collisions routières (Road Collisions) dataset
- Number of Instances: 171,271
- Number of Attributes: 68
- Publishers: Service de l'urbanisme et de la mobilité - Direction de la mobilité
- Frequency of update: Annual
- Language: French
- Geographic coverage: Territory of the city of Montreal
- Temporal coverage: 2012-01-01 / 2018-12-31
- Last edit: 2019-09-17 09:40
- Created on: 2018-11-11 21:39
- License: Creative Commons Attribution 4.0 International
The Dockerfile data_dockerfile is used to build an image that downloads the data from the source into a tmp_data directory; when a container is created from that image, the data is moved into the data directory inside a volume. This is done by running the following commands:
docker build -t database_image -f data_dockerfile .
docker run -it database_image
docker volume create project-scripts-volume
docker run --rm -v project-scripts-volume:/volume database_image
docker run --rm -v "$(pwd)"/data:/data \
-v project-scripts-volume:/volume busybox \
cp -r /data/ /volume
docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume
docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/data
The following command creates a Docker network named "spark-network":
docker network create spark-network
Docker Compose is used to create the Spark cluster from the spark-compose.yml file by running the command below:
env user_mongo=root pass_mongo=password docker-compose --file spark-compose.yml up --scale spark-worker=2
To open a shell inside one of the running containers, replace containerID with the actual container ID:
docker exec -it containerID sh
The Spark shell will be used to unzip the data inside the volume and to upload the data to HDFS and MongoDB. To start the Spark shell, run the command below:
docker run -it --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
./bin/pyspark \
--master spark://master:7077 \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0
Run the following in the Spark shell to extract the archive:
from zipfile import ZipFile

with ZipFile("/volume/data/accidents_2012_2018.zip", 'r') as zipObj:
    zipObj.extractall('/volume/data')
First read the unzipped data from the volume, then push it to HDFS as Parquet. To achieve that, run the following in the Spark shell:
acc_data = spark.read.csv("/volume/data")
acc_data.write.parquet("hdfs://hadoop/acc_data_parquet")
To check the data files on HDFS, open http://localhost:50070, navigate to "Utilities" in the menu bar and select "Browse the file system" to open the file browser.
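The write can also be verified from the same Spark shell by reading the Parquet files back; this is only a sanity check and assumes the session and paths used above:

```python
# Sanity check: read the Parquet data back from HDFS and inspect it
acc_check = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
print(acc_check.count())   # number of rows stored on HDFS
acc_check.printSchema()    # columns inferred from the CSV files
```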
First, run the following in the Spark shell to create a session that is able to push the data to MongoDB:
spark = SparkSession \
.builder \
.appName("mongodb") \
.master("spark://master:7077") \
.config("spark.mongodb.input.uri", "mongodb://root:[email protected]/test.coll?authSource=admin") \
.config("spark.mongodb.output.uri", "mongodb://root:[email protected]/test.coll?authSource=admin") \
.config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.0')\
.getOrCreate()
acc_mongo = spark.read.csv("/volume/data")
acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
Alternatively, the data already stored on HDFS as Parquet can be pushed instead:
acc_mongo = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
acc_mongo.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
Open http://localhost:8181 for Mongo Express, then click on the "test" database.
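The collection can also be checked from the Spark shell by reading it back through the connector; a minimal sketch, relying on the session configured above:

```python
# Sanity check: read the collection back through the MongoDB Spark connector
# (uses the spark.mongodb.input.uri configured on the session above)
mongo_check = spark.read.format("com.mongodb.spark.sql.DefaultSource").load()
print(mongo_check.count())   # should match the number of rows pushed above
```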
A directory named "script" is created in the volume and all the required scripts are copied into it by executing the following commands:
docker run --rm -v "$(pwd)"/scripts:/script \
-v project-scripts-volume:/volume busybox \
cp -r /script/ /volume
docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/script
Execute the hdfs_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
/volume/script/hdfs_store.py
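The repository's hdfs_store.py is not reproduced in this document; a minimal sketch consistent with the interactive steps above (the app name and write mode are assumptions) could look like this:

```python
# hdfs_store.py -- read the extracted CSV files from the volume and
# store them on HDFS as Parquet (sketch; app name and write mode are assumptions)
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("hdfs_store") \
    .master("spark://master:7077") \
    .getOrCreate()

acc_data = spark.read.csv("/volume/data")
acc_data.write.mode("overwrite").parquet("hdfs://hadoop/acc_data_parquet")

spark.stop()
```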
Execute the mongodb_store.py script as follows:
docker run -t --rm \
-v project-scripts-volume:/volume \
--network=spark-network \
mjhea0/spark:2.4.1 \
bin/spark-submit \
--master spark://master:7077 \
--class endpoint \
--packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \
/volume/script/mongodb_store.py
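As with hdfs_store.py, the script itself is not shown here; a sketch of mongodb_store.py consistent with the shell session above, assuming it reads the MongoDB credentials from the user_mongo/pass_mongo environment variables used for spark-compose.yml, might be:

```python
# mongodb_store.py -- read the Parquet data from HDFS and append it to MongoDB
# (sketch; database/collection names and env-var handling are assumptions)
import os
from pyspark.sql import SparkSession

user = os.environ.get("user_mongo", "root")
password = os.environ.get("pass_mongo", "password")
uri = "mongodb://{}:{}@mongo.spark-network/test.coll?authSource=admin".format(user, password)

spark = SparkSession.builder \
    .appName("mongodb_store") \
    .master("spark://master:7077") \
    .config("spark.mongodb.output.uri", uri) \
    .getOrCreate()

acc_data = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
acc_data.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()

spark.stop()
```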
Creating a volume to store the notebooks that will be created:
docker volume create notebooks
Keeping the cluster up, Jupyter is now deployed in the cluster using Docker Compose by running the jupyter-compose.yml file with the command below:
env TOKEN=project1261 docker-compose --file jupyter-compose.yml up
Open http://localhost:8889/?token=project1261
Jupyter has access to the volume where the data and scripts are stored.
Open a new notebook and prepare the environment.
Read the data from the volume as CSV.
Push the data to HDFS.
Push the data from HDFS into MongoDB.
Enabling the Cloud Dataproc and Google Compute Engine APIs.
Creating a bucket where the notebook and the data will be stored.
In the advanced options section, the Anaconda and Jupyter components should be selected to run Jupyter in the cluster.
Opening Jupyter on the GCP cluster
Running the Project App
Exploratory Data Analysis on the Collisions Routieres dataset
In our project, we chose to work with the dataset of traffic accidents that happened from 2012 to 2018 in the city of Montreal. We built our infrastructure using a Docker image that gathers the dataset from the Montreal Open Data website. We decompressed, read and wrote the dataset to HDFS (Parquet files) and MongoDB using a Spark shell running on a Spark cluster, making use of Docker volumes. We also wrote scripts to perform those operations automatically. In addition, we ran a Jupyter notebook using the same volume where the data was saved and implemented some exploratory data analysis. Sensitive data such as passwords and tokens was handled safely. After doing everything locally as a Docker stack, we connected our GitHub repository to the Google Cloud Platform and deployed our solution on the cloud!
After doing the tutorials proposed during the course classes, we could only imagine the complexity of Big Data infrastructure. After completing the group project, however, we began to truly understand that complexity, given how challenging it was to deploy even a simple application both locally and on the cloud. Working with the many components that characterize Big Data infrastructure solutions can easily become hard work, even with good tools to help us out.