Collection of Big Data Infrastructure using Docker, Spark Cluster, HDFS, MongoDB, GCP

Big Data Infrastructure


Collisions Routieres Dataset



List of collisions that have occurred in Montreal since 2012.

This set includes collisions that involved at least one motor vehicle travelling on the road network and that were the subject of a police report. It includes descriptive, contextual and event location information, including severity in terms of death, serious injury, minor injury and property damage only.

Dataset Source

Collisions Routieres Road Collisions data

Dataset Characteristics

Number of Instances: 171,271
Number of Attributes: 68
Publishers: Service de l'urbanisme et de la mobilité - Direction de la mobilité
Frequency of update: Annual
Language: French
Geographic coverage: Territory of the city of Montreal
Temporal coverage: 2012-01-01 / 2018-12-31
Last edit: 2019-09-17 09:40
Created on: 2018-11-11 21:39
License: Creative Commons Attribution 4.0 International

Fetching the data and downloading it into a volume using an image

The Dockerfile data_dockerfile builds an image that downloads the data from the source into the tmp_data directory. When a container is created from that image, the data is moved to the data directory inside a volume. Build and run the image with the following commands:

docker build -t database_image -f data_dockerfile .
docker run -it database_image


Creating Volume

Create Data volume

docker volume create project-scripts-volume

Copy Database to Volume

docker run --rm -v project-scripts-volume:/volume database_image

Copy Data Folder to Volume

docker run --rm -v "$(pwd)"/data:/data \
-v project-scripts-volume:/volume busybox \
cp -r /data/ /volume

Volume Contents

docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume
docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/data


Services on Spark Cluster

Create Spark Network

The following command creates a Docker network named "spark-network":

docker network create spark-network

Spark Cluster with HDFS and MongoDB

Use Docker Compose to create the Spark cluster from the spark-compose.yml file with the command below:

env user_mongo=root pass_mongo=password docker-compose --file spark-compose.yml up --scale spark-worker=2

Preparing Data and Work Environment

Checking that the volume is accessible by the cluster

  • List the containers of the cluster.
  • Open a shell in a worker container to check the volume, using the command:
docker exec -it containerID sh


Starting Spark Shell

The Spark shell is used to unzip the data inside the volume and to upload the data to HDFS and MongoDB. Start the Spark shell with the command below:

docker run -it --rm \
  -v project-scripts-volume:/volume \
  --network=spark-network \
  mjhea0/spark:2.4.1 \
  ./bin/pyspark \
  --master spark://master:7077 \
  --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0


Unzip the Data File:

Run the following in the Spark shell:

from zipfile import ZipFile

# Extract the downloaded archive into the data directory on the volume.
with ZipFile("/volume/data/", 'r') as zipObj:
    zipObj.extractall('/volume/data')
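
The unzip step can also be written so that it does not depend on a single archive name. The following is a minimal sketch, assuming the download step leaves one or more .zip files directly inside the data directory; the helper name `extract_archives` is hypothetical and not part of the project.

```python
from pathlib import Path
from zipfile import ZipFile, is_zipfile


def extract_archives(data_dir):
    """Extract every .zip archive found directly inside data_dir.

    Returns the sorted names of the archives that were extracted.
    """
    data_path = Path(data_dir)
    extracted = []
    for archive in sorted(data_path.glob("*.zip")):
        # Skip files that carry a .zip suffix but are not valid archives.
        if not is_zipfile(archive):
            continue
        with ZipFile(archive, "r") as zip_obj:
            # Extract next to the archive, as in the command above.
            zip_obj.extractall(data_path)
        extracted.append(archive.name)
    return extracted
```

In the Spark shell this could be called as `extract_archives("/volume/data")`.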


Store the data in HDFS and MongoDB

Store the data in HDFS as parquet

First read the unzipped data from the volume, then push it as Parquet into HDFS. To achieve that, run the following in the Spark shell:

acc_data = spark.read.csv("/volume/data")
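
A minimal sketch of the full read-then-write step, assuming the extracted CSV files have a header row and that HDFS is reachable under `hdfs://hadoop` (the address used later in this README). The function name `csv_to_parquet` and the `header`, `inferSchema` and overwrite settings are assumptions, not necessarily what the project uses.

```python
def csv_to_parquet(spark, csv_path, parquet_uri):
    """Read CSV files with an existing SparkSession and write them as Parquet."""
    # header=True treats the first row as column names; inferSchema=True makes
    # Spark scan the data to guess column types instead of reading strings.
    df = spark.read.csv(csv_path, header=True, inferSchema=True)
    # mode("overwrite") replaces any previous output at the target path,
    # so the step can be re-run.
    df.write.mode("overwrite").parquet(parquet_uri)
    return df
```

In the Spark shell, where `spark` already exists, this could be called as `acc_data = csv_to_parquet(spark, "/volume/data/*.csv", "hdfs://hadoop/acc_data_parquet")`. The `*.csv` glob is an assumption that keeps the zip archive in the same directory from being read as data.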


To check the data file on HDFS, open http://localhost:50070, navigate to "Utilities" in the main bar, and select "Browse the file system".


Store the data in MongoDB

First, run the following in the Spark shell to configure the session for pushing data to MongoDB:

from pyspark.sql import SparkSession

# Configure the MongoDB connector URIs and build (or reuse) the session.
spark = SparkSession \
        .builder \
        .appName("mongodb") \
        .master("spark://master:7077") \
        .config("spark.mongodb.input.uri", "mongodb://root:[email protected]/test.coll?authSource=admin") \
        .config("spark.mongodb.output.uri", "mongodb://root:[email protected]/test.coll?authSource=admin") \
        .config('spark.jars.packages', 'org.mongodb.spark:mongo-spark-connector_2.11:2.4.0') \
        .getOrCreate()
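
To keep credentials out of the code, the connection URI could be assembled from environment variables. The following is a minimal sketch, assuming the `user_mongo` and `pass_mongo` variables passed to docker-compose earlier are also set where the Spark shell runs; the helper name `mongo_uri` and the host value in the usage note are hypothetical.

```python
import os
from urllib.parse import quote_plus


def mongo_uri(host, database, collection, env=os.environ):
    """Build a MongoDB connection URI from credentials in environment variables."""
    # quote_plus escapes characters such as '@' or ':' that would otherwise
    # break the URI if they appear in the user name or password.
    user = quote_plus(env["user_mongo"])
    password = quote_plus(env["pass_mongo"])
    return (
        f"mongodb://{user}:{password}@{host}/"
        f"{database}.{collection}?authSource=admin"
    )
```

The result, for example `mongo_uri("mongo", "test", "coll")` with a hypothetical host named `mongo`, could then be passed to the `spark.mongodb.input.uri` and `spark.mongodb.output.uri` settings.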

Read the data from the volume and store it in MongoDB

acc_mongo = spark.read.csv("/volume/data")

Read the data from the HDFS and store it in MongoDB

acc_mongo = spark.read.parquet("hdfs://hadoop/acc_data_parquet")
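
A minimal sketch of the HDFS-to-MongoDB step, assuming the session was built with `spark.mongodb.output.uri` set and the MongoDB Spark connector 2.x on the classpath, as configured above. The function name `parquet_to_mongo` and the append mode are assumptions.

```python
def parquet_to_mongo(spark, parquet_uri):
    """Read a Parquet dataset and append it to the configured MongoDB collection."""
    df = spark.read.parquet(parquet_uri)
    # The target database and collection come from spark.mongodb.output.uri;
    # the format string is the data source class of the 2.x connector.
    df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
    return df
```

In the Spark shell this could be called as `acc_mongo = parquet_to_mongo(spark, "hdfs://hadoop/acc_data_parquet")`. Append mode adds documents on every run, so re-running the step duplicates the data unless the collection is cleared first.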

Open http://localhost:8181 for Mongo Express.


Then click on the "test" database.


Automated Storage

Copy the scripts into the volume

The following command creates a directory named "script" in the volume and copies all the required scripts into it:

docker run --rm -v "$(pwd)"/scripts:/script \
-v project-scripts-volume:/volume busybox \
cp -r /script/ /volume

Checking the scripts in the volume:

docker run -it --rm -v project-scripts-volume:/volume busybox ls -l /volume/script


Store the data in HDFS as parquet

Execute the script as follows:

docker run -t --rm \
  -v project-scripts-volume:/volume \
  --network=spark-network \
  mjhea0/spark:2.4.1 \
  bin/spark-submit \
    --master spark://master:7077 \
    --class endpoint \

Store the data in MongoDB

Execute the script as follows:

docker run -t --rm \
  -v project-scripts-volume:/volume \
  --network=spark-network \
  mjhea0/spark:2.4.1 \
  bin/spark-submit \
    --master spark://master:7077 \
    --class endpoint \
    --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.0 \

Using Jupyter Notebook

Create Volume

Create a volume to store the notebooks:

docker volume create notebooks

Jupyter in the Cluster

With the cluster still running, deploy Jupyter in the cluster with Docker Compose by running the jupyter-compose.yml file with the command below:

env TOKEN=project1261 docker-compose --file jupyter-compose.yml up

Open http://localhost:8889/?token=project1261

Jupyter has access to the volume where the data and scripts are stored.


Open a new notebook and prepare the environment.

Reading the Data from volume

Read the data from the volume as csv


Store the Data on HDFS

Push the data to HDFS


Store the Data in MongoDB

Push the data from hdfs into MongoDB


The Full Notebook

Google Cloud Platform

Environment Setup

  • First, let's create a project.


  • Enabling the Cloud Dataproc and Google Compute Engine APIs.

  • Creating a bucket where the notebooks and the data will be stored.



  • Creating a cluster and selecting the bucket created above, so that the cluster has access to the data and a place to store the notebooks.



In the advanced options section, the Anaconda and Jupyter components should be selected to run Jupyter in the cluster.


  • From the web interfaces of the created cluster, we can open the Jupyter notebook, which runs on the cluster and has access to the bucket where our data is stored.


Analytics on GCP

Opening Jupyter on the GCP cluster


Running the Project App
Exploratory Data Analysis on the Collisions Routieres dataset

GitHub Mirror and Deploy via GCP

Using exactly the same process as for our local Docker stack, we deploy our application on GCP. The only difference is that we skip the manual HDFS/Mongo Express builds and go straight to the automated builds.

  • Pulling the data from the GitHub-hosted Docker image through Google Cloud Shell


  • Running the docker image and making sure the data is in the correct directory


  • Copying the scripts


  • Deploying the spark compose file with created network


  • Creating jupyter volume


  • Bringing up the Jupyter compose file so that we can reach our notebook


  • Proof of the notebook in action on our GCP project cluster


The Results Obtained

In our project, we chose to work with the dataset on traffic collisions that occurred in the city of Montreal from 2012 to 2018. We built our infrastructure using a Docker container to create an image that gathers the dataset from the Montreal Open Data website. We decompressed the dataset, read it, and wrote it to HDFS (as Parquet files) and MongoDB using a Spark shell running on a Spark cluster and making use of Docker volumes. We also wrote scripts to automate those operations. In addition, we ran a Jupyter Notebook using the same volume where the data was saved and implemented some exploratory data analysis. Sensitive data such as passwords and tokens was handled safely. After doing everything locally as a Docker stack, we connected our GitHub repository to the Google Cloud Platform and deployed our solution in the cloud!


After completing the tutorials proposed during the course, we could imagine the complexity of Big Data infrastructure. However, only after doing the group project did we start to understand that complexity, given how challenging it was to deploy a simple application both locally and in the cloud. Working with the many components that characterize Big Data infrastructure solutions can easily become hard work, even with good tools to help us out.
