Awesome Open Source
Advertising
📦 10
All Projects
Application Programming Interfaces
📦 124
Applications
📦 192
Artificial Intelligence
📦 78
Blockchain
📦 73
Build Tools
📦 113
Cloud Computing
📦 80
Code Quality
📦 28
Collaboration
📦 32
Command Line Interface
📦 49
Community
📦 83
Companies
📦 60
Compilers
📦 63
Computer Science
📦 80
Configuration Management
📦 42
Content Management
📦 175
Control Flow
📦 213
Data Formats
📦 78
Data Processing
📦 276
Data Storage
📦 135
Economics
📦 64
Frameworks
📦 215
Games
📦 129
Graphics
📦 110
Hardware
📦 152
Integrated Development Environments
📦 49
Learning Resources
📦 166
Legal
📦 29
Libraries
📦 129
Lists Of Projects
📦 22
Machine Learning
📦 347
Mapping
📦 64
Marketing
📦 15
Mathematics
📦 55
Media
📦 239
Messaging
📦 98
Networking
📦 315
Operating Systems
📦 89
Operations
📦 121
Package Managers
📦 55
Programming Languages
📦 245
Runtime Environments
📦 100
Science
📦 42
Security
📦 396
Social Media
📦 27
Software Architecture
📦 72
Software Development
📦 72
Software Performance
📦 58
Software Quality
📦 133
Text Editors
📦 49
Text Processing
📦 136
User Interface
📦 330
User Interface Components
📦 514
Version Control
📦 30
Virtualization
📦 71
Web Browsers
📦 42
Web Servers
📦 26
Web User Interface
📦 210
The Top 32 Data Processing Open Source Projects
Awesome Web Scraping
⭐
3,920
List of libraries, tools and APIs for web scraping and data processing.
Dali
⭐
3,039
A library containing both highly optimized building blocks and an execution engine for data pre-processing in deep learning applications
Miller
⭐
2,627
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Texar
⭐
2,091
Toolkit for Machine Learning, Natural Language Processing, and Text Generation, in TensorFlow. This is part of the CASL project: http://casl-project.ai/
Bonobo
⭐
1,371
Extract Transform Load for Python 3.5+
Bash Oneliner
⭐
1,298
A collection of handy Bash One-Liners and terminal tricks for data processing and Linux system maintenance.
Broadway
⭐
1,217
Concurrent and multi-stage data ingestion and data processing with Elixir
DialoGPT
⭐
1,104
Large-scale pretraining for dialogue
DataflowJavaSDK
⭐
856
Google Cloud Dataflow provides a simple, powerful model for building both batch and streaming parallel data processing pipelines.
Data Science On Gcp
⭐
829
Source code accompanying book: Data Science on the Google Cloud Platform, Valliappa Lakshmanan, O'Reilly 2017
Texar Pytorch
⭐
628
Integrating the Best of TF into PyTorch, for Machine Learning, Natural Language Processing, and Text Generation. This is part of the CASL project: http://casl-project.ai/
Pandera
⭐
452
A light-weight, flexible, and expressive pandas data validation library
Awesome Kafka
⭐
383
A list about Apache Kafka
Xidel
⭐
325
Command line tool to download and extract data from HTML/XML pages or JSON-APIs, using CSS, XPath 3.0, XQuery 3.0, JSONiq or pattern matching. It can also create new or transformed XML/HTML/JSON documents.
Eternal
⭐
315
👾~ music, eternal ~ 👾
Nonechucks
⭐
298
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Rapidtables
⭐
290
Super fast list of dicts to pre-formatted tables conversion library for Python 2/3
Pxi
⭐
239
🧚 pxi (pixie) is a small, fast, and magical command-line data processor similar to jq, mlr, and awk.
Pysparkling
⭐
231
A pure Python implementation of Apache Spark's RDD and DStream interfaces.
Machine Learning Notebooks
⭐
217
Machine Learning notebooks for refreshing concepts.
Amadeus
⭐
216
Harmonious distributed data analysis in Rust.
Vaspy
⭐
182
Manipulating VASP files with Python.
Collapse
⭐
156
Advanced and Fast Data Transformation in R
Pulsar Flink
⭐
110
Elastic data processing with Apache Pulsar and Apache Flink
Data Processing Agreements
⭐
109
Collection of Data Processing Agreement (DPA) and GDPR compliance resources
Distributed Dataset
⭐
108
A distributed data processing framework in Haskell.
Forte
⭐
70
Forte is a flexible and powerful NLP builder FOR TExt. This is part of the CASL project: http://casl-project.ai/
CBRAIN
⭐
51
CBRAIN is a flexible Ruby on Rails framework for accessing and processing of large data on high-performance computing infrastructures.
Pulsar Spark
⭐
51
When Apache Pulsar meets Apache Spark
2019 Electronic Design Competition
⭐
51
[Electronics Contest] 2019 National Undergraduate Electronic Design Contest (Problem F): paper sheet count detection device (based on STM32F407 & FDC2214 & USART HMI)
MDSplus
⭐
43
The MDSplus data management system
Tdm
⭐
32
R package for normalizing RNA-seq data to make them comparable to microarray data.