A framework for real-time IP flow data analysis built on Apache Spark Streaming, a modern distributed stream processing system.
⚠ Project Stream4Flow is no longer maintained: the frameworks it relies on are constantly evolving, and we do not have the capacity to keep the installation scripts up to date. If you're interested in other network data processing tools and our current research, check out the CSIRT-MU repositories.
The Stream4Flow framework is built on the IPFIXCol collector, the Kafka messaging system, Apache Spark, and the Elastic Stack. IPFIXCol can receive IP flows from most NetFlow/IPFIX network probes (e.g., Flowmon Probe, softflowd). It transforms incoming IP flow records into JSON and passes them to the Kafka messaging system. Kafka was selected for its scalability and partitioning capabilities, which provide sufficient data throughput. Apache Spark was selected as the data stream processing framework for its high IP flow data throughput, its choice of programming languages (Scala, Java, or Python), and its MapReduce programming model. The analysis results are stored in the Elastic Stack, which consists of Logstash, Elasticsearch, and Kibana and supports storing, querying, and visualizing the results. The framework also includes a web interface that simplifies administration and visualizes complex analysis results.
More on stream-based IP flow analysis is described in our paper titled Toward Stream-Based IP Flow Analysis.
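To make the data path a little more concrete, the following sketch mimics in plain Python the map and reduce steps a Spark application might apply to the JSON flow records: each record is parsed, mapped to a (protocol, counters) pair, and the pairs are summed per protocol. It does not use Spark or Kafka and is not taken from the Stream4Flow sources. The field names `ipfix.protocolIdentifier`, `ipfix.packetDeltaCount`, and `ipfix.octetDeltaCount`, as well as the sample records, are assumptions for illustration; check the actual IPFIXCol output for the real keys.

```python
import json
from collections import defaultdict

# Hypothetical JSON flow records as they might be read from a Kafka topic.
# The field names are assumptions, not confirmed IPFIXCol output.
SAMPLE_RECORDS = [
    '{"ipfix.protocolIdentifier": 6, "ipfix.octetDeltaCount": 1500, "ipfix.packetDeltaCount": 3}',
    '{"ipfix.protocolIdentifier": 17, "ipfix.octetDeltaCount": 200, "ipfix.packetDeltaCount": 1}',
    '{"ipfix.protocolIdentifier": 6, "ipfix.octetDeltaCount": 500, "ipfix.packetDeltaCount": 2}',
]

# IANA-assigned IP protocol numbers for the three most common protocols.
PROTOCOL_NAMES = {1: "icmp", 6: "tcp", 17: "udp"}


def map_record(raw):
    """Map step: one JSON record -> (protocol, (flows, packets, bytes))."""
    flow = json.loads(raw)
    protocol = PROTOCOL_NAMES.get(flow["ipfix.protocolIdentifier"], "other")
    return protocol, (1, flow["ipfix.packetDeltaCount"], flow["ipfix.octetDeltaCount"])


def reduce_by_key(pairs):
    """Reduce step: sum the counters element-wise for each protocol."""
    totals = defaultdict(lambda: (0, 0, 0))
    for key, value in pairs:
        totals[key] = tuple(a + b for a, b in zip(totals[key], value))
    return dict(totals)


def protocol_statistics(records):
    """Run the map step over all records, then reduce by protocol."""
    return reduce_by_key(map_record(record) for record in records)


if __name__ == "__main__":
    # Tuples are serialized as JSON arrays: [flows, packets, bytes].
    print(json.dumps(protocol_statistics(SAMPLE_RECORDS), sort_keys=True))
```

In a real deployment the same two steps would run on a distributed stream rather than on a Python list, and the result would be written back to Kafka instead of being printed.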
We have it all prepared for you. Everything is preconfigured. You only have to choose the deployment variant.
Note: The minimum hardware requirement is 12 GB of RAM.
vagrant up
(or start guests separately: vagrant up <guest-name>)
vagrant ssh <guest-name>
See provision/README.md for additional information about provisioning and Vagrant usage.
Note: machines in the cluster must run Debian OS with systemd
ansible-playbook -i <your inventory file> site.yml --user <username> --ask-pass
(consult the Ansible docs for further information)
Usage | Description | Usage information |
---|---|---|
Input data | Input point for network monitoring data in IPFIX/Netflow format | |
Stream4Flow Web Interface | Web interface for application for viewing data | |
Spark Web Interface | Apache Spark streaming interface for application control | |
Kibana Web Interface | Elastic Kibana web interface for Elasticsearch data | |
ssh <user>@<host>
cd /home/spark/applications/
./run-application.sh ./statistics/protocols_statistics/spark/protocols_statistics.py -iz producer:2181 -it ipfix.entry -oz producer:9092 -ot results.output
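The four short options in the command above can be read as two endpoint/topic pairs: one for input and one for output. The sketch below shows how such options could be declared with `argparse`. The long option names, help texts, and the `split_endpoint` helper are hypothetical and not taken from `protocols_statistics.py`; only the short flags and example values come from the command above.

```python
import argparse
import sys

# Example arguments copied from the command above; used when none are given.
EXAMPLE_ARGV = [
    "-iz", "producer:2181", "-it", "ipfix.entry",
    "-oz", "producer:9092", "-ot", "results.output",
]


def build_parser():
    """Declare the four options; long names and help texts are assumptions."""
    parser = argparse.ArgumentParser(description="Sketch of the example application's options")
    parser.add_argument("-iz", "--input_zookeeper", required=True,
                        help="host:port of the input endpoint, e.g. producer:2181")
    parser.add_argument("-it", "--input_topic", required=True,
                        help="topic to read flow records from, e.g. ipfix.entry")
    parser.add_argument("-oz", "--output_zookeeper", required=True,
                        help="host:port of the output endpoint, e.g. producer:9092")
    parser.add_argument("-ot", "--output_topic", required=True,
                        help="topic to write results to, e.g. results.output")
    return parser


def split_endpoint(endpoint):
    """Hypothetical helper: split 'host:port' into (host, port as int)."""
    host, _, port = endpoint.rpartition(":")
    return host, int(port)


if __name__ == "__main__":
    args = build_parser().parse_args(sys.argv[1:] or EXAMPLE_ARGV)
    print("input: ", split_endpoint(args.input_zookeeper), args.input_topic)
    print("output:", split_endpoint(args.output_zookeeper), args.output_topic)
```

Keeping the input and output sides symmetric makes it straightforward to chain applications, with the output topic of one serving as the input topic of the next.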
Stream4Flow is compatible with any Netflow v5/9 or IPFIX network probe. To collect your first data for Stream4Flow, you can use either a commercial solution such as Flowmon Probe or an open-source alternative such as softflowd.
Install softflowd
sudo apt-get install softflowd
Start data export
softflowd -i <your interface> -D -n 192.168.0.2:4739
softflowd -i <your interface> -D -n <IP address of producer>:4739
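If no data shows up after starting the export, it can help to check that the probe is sending datagrams at all. The sketch below is a minimal UDP listener that records the sender address and size of each received datagram. It is a debugging aid under stated assumptions, not part of Stream4Flow: the function names are hypothetical, the payload is not decoded, and the listener cannot share the port with a running collector, so it would have to run on a spare test host that the probe is temporarily pointed at.

```python
import socket


def open_listener(host="0.0.0.0", port=4739):
    """Bind a UDP socket; 4739 matches the port in the export commands above."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def count_datagrams(sock, expected, timeout=5.0):
    """Receive up to `expected` datagrams and return [(sender_ip, size), ...].

    Stops early if no datagram arrives within `timeout` seconds.
    """
    sock.settimeout(timeout)
    seen = []
    try:
        while len(seen) < expected:
            data, (sender_ip, _sender_port) = sock.recvfrom(65535)
            seen.append((sender_ip, len(data)))
    except socket.timeout:
        pass  # nothing more arrived in time; return what was collected
    return seen
```

For example, `count_datagrams(open_listener(), expected=5, timeout=30)` waits up to 30 seconds per datagram and returns an empty list if the probe sent nothing.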
Bibtex
@ARTICLE{jirsik-2017-toward,
author={Jirsik, Tomas and Cermak, Milan and Tovarnak, Daniel and Celeda, Pavel},
journal={IEEE Communications Magazine},
title={Toward Stream-Based IP Flow Analysis},
year={2017},
volume={55},
number={7},
pages={70-76},
doi={10.1109/MCOM.2017.1600972},
ISSN={0163-6804},
}
Plain text
T. Jirsik, M. Cermak, D. Tovarnak and P. Celeda, "Toward Stream-Based IP Flow Analysis," in IEEE Communications Magazine, vol. 55, no. 7, pp. 70-76, 2017.
doi: 10.1109/MCOM.2017.1600972
The SecurityCloud project is supported by the Technology Agency of the Czech Republic under project No. TA04010062, "Technology for processing and analysis of network data in big data concept".