Awesome Open Source
The Top 23 Data Cleaning Open Source Projects
Miller (⭐ 2,701)
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON.

Cleanlab (⭐ 1,788)
The standard package for machine learning with noisy labels and finding mislabeled data in Python.

Pandas Videos (⭐ 1,547)
Jupyter notebook and datasets from the pandas Q&A video series.

Dat8 (⭐ 1,490)
General Assembly's 2015 Data Science course in Washington, DC.

My Journey In The Data Science World (⭐ 1,175)
📢 Ready to learn or review your knowledge!

Optimus (⭐ 996)
🚚 Agile Data Preparation Workflows made easy with pandas, dask, cudf, dask_cudf and pyspark.

Janitor (⭐ 994)
Simple tools for data cleaning in R.

Data Forge Ts (⭐ 976)
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.

Pandera (⭐ 539)
A light-weight, flexible, and expressive pandas data validation library.

Nonechucks (⭐ 305)
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!

Validate (⭐ 275)
Professional data validation for the R environment.

Dirty_cat (⭐ 265)
Encoding methods for dirty categorical variables.

Voicebook (⭐ 254)
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).

Klib (⭐ 204)
Easy to use Python library of customized functions for cleaning and analyzing data.

Machine Learning Workflow With Python (⭐ 157)
A comprehensive walkthrough of ML techniques with Python: define the problem, specify inputs and outputs, data collection, exploratory data analysis, data preprocessing, model design, training, and evaluation.

Datamaid (⭐ 121)
An R package for data screening.

Refinr (⭐ 92)
Cluster and merge similar character values: an R implementation of OpenRefine clustering algorithms.

Data Analysis Using Python (⭐ 91)
Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle.

Bumblebee (⭐ 89)
🚕 A spreadsheet-like data preparation web app that works over Optimus (pandas, dask, cuDF, dask-cuDF and PySpark).

Clean (⭐ 49)
Fast and Easy Data Cleaning (in R).

Drugs Recommendation Using Reviews (⭐ 35)
Analyzes drug descriptions, conditions, and reviews, then recommends drugs for each patient health condition using deep learning models.

Boltzmannclean (⭐ 23)
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines.

Moodle Local_datacleaner (⭐ 12)
Reduce, filter, and anonymize Moodle data for non-production environments.
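
Several entries above deal with merging near-duplicate text values; the Refinr description in particular mentions OpenRefine's clustering algorithms. As a rough illustration of that idea, here is a minimal Python sketch of key-collision ("fingerprint") clustering. It is a simplified approximation written for this listing, not Refinr's or OpenRefine's actual code: the function names `fingerprint`, `cluster`, and `merge` are hypothetical, and picking the most frequent spelling as the canonical value is an assumption.

```python
import string
import unicodedata
from collections import Counter, defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-duplicate strings collide."""
    # Decompose accented characters and drop the non-ASCII parts ("é" -> "e").
    text = unicodedata.normalize("NFKD", value)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Lowercase and strip ASCII punctuation.
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, remove duplicate tokens, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group the original values by their fingerprint key."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


def merge(values):
    """Replace each value with the most frequent spelling in its cluster.

    Ties go to the spelling that appears first (an assumption of this sketch).
    """
    canonical = {}
    for key, members in cluster(values).items():
        counts = Counter(members)
        canonical[key] = max(members, key=lambda m: counts[m])
    return [canonical[fingerprint(value)] for value in values]


if __name__ == "__main__":
    names = ["Acme Inc.", "acme inc", "Acme Inc.", "Café Roma", "Cafe Roma"]
    print(merge(names))
```

Because the key ignores case, punctuation, accents, and word order, "Acme Inc.", "acme inc", and "Inc Acme" all land in the same cluster. Values that differ by a typo (for example "Acme Icn") do not collide under this scheme and would need a distance-based method instead.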