Data Engineering Project is an implementation of a data pipeline that consumes the latest news from RSS feeds and makes it available to users via a handy API. The pipeline infrastructure is built with popular open-source projects.
Access the latest news and headlines in one place. 💪
An Airflow DAG is responsible for executing the Python scraping modules. It runs periodically every X minutes, producing micro-batches.
The first task updates the proxypool. Using proxies in combination with rotating user agents can get scrapers past most anti-scraping measures and prevent them from being detected as scrapers.
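As a minimal sketch of this idea (the proxy addresses and user-agent strings below are placeholders, not values from the project), rotating proxies and user agents per request might look like:

```python
import itertools
import random

# Placeholder values; the real pipeline refreshes the proxypool dynamically.
PROXIES = ["http://203.0.113.10:8080", "http://203.0.113.11:3128"]
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

proxy_cycle = itertools.cycle(PROXIES)

def request_settings():
    """Pick the next proxy and a random user agent for one scraping request."""
    return {
        "proxy": next(proxy_cycle),
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
    }

# Each call rotates the proxy and varies the user agent.
first, second = request_settings(), request_settings()
```

Cycling deterministically through proxies spreads requests evenly, while a random user agent per request makes the traffic look less uniform.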
The second task extracts news from the RSS feeds listed in the configuration file, validates their quality, and sends the data to Kafka topic A. The extraction process uses validated proxies from the proxypool.
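A rough sketch of the extract-and-validate step, using a tiny inline feed and a simple validation rule (the field names, the rule, and the commented-out Kafka call are illustrative assumptions, not the project's actual code):

```python
import xml.etree.ElementTree as ET

# A tiny inline feed; real feed URLs come from the configuration file.
SAMPLE_RSS = """<rss version="2.0"><channel>
  <item><title>Headline A</title><description>Text A</description>
        <link>https://example.com/a</link></item>
  <item><title></title><description>Missing title</description>
        <link>https://example.com/b</link></item>
</channel></rss>"""

def extract_items(xml_text):
    """Parse RSS items and keep only those passing basic quality checks."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        record = {f: (item.findtext(f) or "").strip()
                  for f in ("title", "description", "link")}
        if record["title"] and record["link"]:  # illustrative validation rule
            # In the pipeline, valid records would go to Kafka topic A, e.g.:
            # producer.send("topic_A", record)
            items.append(record)
    return items

valid = extract_items(SAMPLE_RSS)
```

Here the second item is dropped because it has no title, so only one record would reach the Kafka topic.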
Install the software required to run the project:
manage.sh - a wrapper for docker-compose that works as a management tool.
run_tests.sh - executes unit tests against the Airflow scraping modules and the Django REST Framework applications.
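The exact contents of manage.sh are not shown here; a minimal dry-run sketch of such a docker-compose wrapper (the subcommand names are assumptions) could look like:

```shell
#!/usr/bin/env bash
# Dry-run sketch: prints the docker-compose command it would dispatch.
manage() {
  case "$1" in
    up)    echo "docker-compose up -d" ;;
    down)  echo "docker-compose down" ;;
    logs)  echo "docker-compose logs -f ${2:-}" ;;
    *)     echo "usage: manage.sh {up|down|logs [service]}" >&2; return 1 ;;
  esac
}

manage up
```

A wrapper like this keeps the long docker-compose invocations behind short, memorable subcommands.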
Read the detailed documentation on how to interact with the data collected by the pipeline using the search endpoints.
search_fields - title and description; see all of the news containing the Lewandowski phrase in their titles
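As an illustration of the parameters above (the host and endpoint path are hypothetical), a search URL restricted to titles can be assembled like this:

```python
from urllib.parse import urlencode

def build_search_url(base, search, search_fields):
    """Assemble a search-endpoint URL with the documented query parameters."""
    query = urlencode({"search": search, "search_fields": search_fields})
    return f"{base}?{query}"

url = build_search_url(
    "http://localhost:8000/api/news/",  # hypothetical DRF endpoint
    search="Lewandowski",
    search_fields="title",
)
```

Dropping the search_fields parameter would search across all configured fields instead of titles only.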
Inspired by the following code, articles, and videos:
Contributions are what makes the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are greatly appreciated.
1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.