Awesome Open Source
Search results for data cleaning
180 search results found
Openrefine
⭐
10,106
OpenRefine is a free, open source power tool for working with messy data and improving it
Great_expectations
⭐
9,179
Always know what to expect from your data.
Miller
⭐
8,397
Miller is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON
Cleanlab
⭐
8,182
The standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.
Pandera
⭐
3,012
A light-weight, flexible, and expressive statistical data testing library
Pandas Videos
⭐
1,808
Jupyter notebook and datasets from the pandas Q&A video series
Dataprep
⭐
1,807
Open-source low-code data preparation library in Python. Collect, clean, and visualize your data in Python with a few lines of code.
Dat8
⭐
1,549
General Assembly's 2015 Data Science course in Washington, DC
Optimus
⭐
1,446
🚚 Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark
Janitor
⭐
1,315
simple tools for data cleaning in R
Data Forge Ts
⭐
1,236
The JavaScript data transformation and analysis toolkit inspired by Pandas and LINQ.
Skrub
⭐
1,011
Prepping tables for machine learning
Yobulkdev
⭐
786
🔥 🔥 🔥 Open Source & AI-driven Data Onboarding Platform: free flatfile.com alternative
Schema Inspector
⭐
500
Schema-Inspector is a simple JavaScript object sanitization and validation module.
Educhat
⭐
467
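An open-source educational chat model from ICALK, East China Normal University. An open-source Chinese-English educational dialogue large language model (general-purpose base model, GPU deployment, data cleaning). With thanks to: LLaMA, MOSS, BELLE, Ziya, vLLM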
Klib
⭐
446
Easy to use Python library of customized functions for cleaning and analyzing data.
Objectiv Analytics
⭐
408
Powerful product analytics for data teams, with full control over data & models.
Validate
⭐
390
Professional data validation for the R environment
Encord Active
⭐
385
The toolkit to test, validate, and evaluate your models and surface, curate, and prioritize the most valuable data for labeling.
Voicebook
⭐
325
🗣️ A book and repo to get you started programming voice computing applications in Python (10 chapters and 200+ scripts).
Nonechucks
⭐
315
Deal with bad samples in your dataset dynamically, use Transforms as Filters, and more!
Hypergbm
⭐
306
A full pipeline AutoML tool for tabular data
Feature Engineering Tutorials
⭐
217
Data Science Feature Engineering and Selection Tutorials
Pclean
⭐
167
A domain-specific probabilistic programming language for scalable Bayesian data cleaning
Datamaid
⭐
128
An R package for data screening
Allie
⭐
126
🤖 An automated machine learning framework for audio, text, image, video, or .CSV files (50+ featurizers and 15+ model trainers). Python 3.6 required.
Bumblebee
⭐
124
🚕 A spreadsheet-like data preparation web app that works over Optimus (Pandas, Dask, cuDF, Dask-cuDF, Spark and Vaex)
Csvlint
⭐
123
CSV Lint plug-in for Notepad++ for syntax highlighting, CSV validation, automatic column and datatype detection, fixed-width datasets, changing datetime format and decimal separator, sorting data, counting unique values, and converting to XML, JSON, SQL, etc. A plugin for data cleaning and working with messy data files.
Pythresh
⭐
113
Outlier Detection Thresholding
Mzutils
⭐
109
Refinr
⭐
100
Cluster and merge similar char values: an R implementation of OpenRefine clustering algorithms
Holoclean Legacy Deprecated
⭐
74
A Machine Learning System for Data Enrichment.
Akvo Lumen
⭐
62
Make sense of your data
Covid_19_jhu_data_web_scrap_and_cleaning
⭐
61
This repository contains data and code used to get and clean data from https://github.com/CSSEGISandData/COVID-19 and https://www.worldometers.info/coronavirus/
Opendataval
⭐
60
OpenDataVal: a Unified Benchmark for Data Valuation in Python (NeurIPS 2023)
Data Analysis Using Python
⭐
58
Exploratory data analysis 📊 using Python 🐍 of a used car 🚘 database taken from Kaggle
Desbordante
⭐
54
Desbordante is a high-performance data profiler that can discover many different patterns in data using various algorithms. It also lets you run data cleaning scenarios using these algorithms. Desbordante has a console version and an easy-to-use web application.
Pydvl
⭐
52
pyDVL is a library of stable implementations of algorithms for data valuation and influence function computation
Numer.ai
⭐
49
Pytrack
⭐
48
a Map-Matching-based Python Toolbox for Vehicle Trajectory Reconstruction
Lc Open Refine
⭐
47
Library Carpentry: OpenRefine
Taxa
⭐
44
taxonomic classes for R
Sliceguard
⭐
43
A library for detecting problematic data segments in structured and unstructured data with few lines of code.
Bunkatopics
⭐
43
🗺️ Data Cleaning and Textual Data Visualization 🗺️
Nepali Translator
⭐
42
Neural Machine Translation on the Nepali-English language pair
Pandas Gpt
⭐
42
Power up your data science workflow with ChatGPT.
Dtcleaner
⭐
37
DTCleaner: data cleaning using multi-target decision trees.
Amora Data Build Tool
⭐
37
Amora Data Build Tool enables analysts and engineers to transform data on the data warehouse (BigQuery) by writing Amora Models that describe the data schema using Python's "PEP484 - Type Hints" and select statements with SQLAlchemy. Amora is able to transform Python code into SQL data transformation jobs that run inside the warehouse.
Gratefuldata
⭐
33
Grateful Data isn't programming code, but an online tutorial about data acquisition, cleaning and enriching, using publicly accessible data on the band the Grateful Dead as examples. Read the Wiki to find out how to use the sample data.
Opuscleaner
⭐
32
OpusCleaner is a web interface that helps you select, clean and schedule your data for training machine translation models.
Pgdedupe
⭐
32
A simple command line interface to the datamade/dedupe library.
Redditcleaner
⭐
31
Cleans Reddit Text Data 📜 🧹
Cleanml
⭐
31
A Benchmark for Joint Data Cleaning and Machine Learning
Triebeard
⭐
29
Radix trees in Rcpp and R
Drugs Recommendation Using Reviews
⭐
27
Analyzes drug descriptions, conditions, and reviews, then recommends drugs for each health condition of a patient using deep learning models.
Cleaner.jl
⭐
27
A toolbox of simple solutions for common data cleaning problems.
Foil
⭐
27
Utilities for data cleaning and ETL processing
Data Purifier
⭐
26
A Python library for Automated Exploratory Data Analysis, Automated Data Cleaning, and Automated Data Preprocessing For Machine Learning and Natural Language Processing Applications in Python.
Covid 19 Data Cleanup
⭐
25
Scripts to clean up data from https://github.com/CSSEGISandData/COVID-19
Benchmark_bilevel
⭐
25
Benchmark for bi-level optimization solvers
Covid 19 India Data
⭐
25
Data and code for scraping and cleaning data on COVID-19 in India from https://www.mohfw.gov.in/ and https://www.covid19india.org/
Students Performance Analytics
⭐
24
Students Performance Evaluation using Feature Engineering, Feature Extraction, Manipulation of Data, Data Analysis, Data Visualization, and finally applying Classification Algorithms from Machine Learning to separate students with different grades
Openrefine Ecology Lesson
⭐
24
Data Cleaning with OpenRefine for Ecologists
Fifa 2019 Analysis
⭐
21
This is a project based on the FIFA World Cup 2019 and Analyzes the Performance and Efficiency of Teams, Players, Countries and other related things using Data Analysis and Data Visualizations
Datareused
⭐
21
Get Data Reused
Boltzmannclean
⭐
21
Fill missing values in Pandas DataFrames using Restricted Boltzmann Machines
Foofah
⭐
21
Foofah: programming-by-example data transformation program synthesizer
Multimodal Sentiment Analysis
⭐
20
Research on improving text sentiment analysis with machine learning by using facial features from video.
Openrefine Socialsci
⭐
20
OpenRefine for Social Science Data
Errorlocate
⭐
20
Find and replace erroneous fields in data using validation rules
Fungible
⭐
19
A library for fast reflective updates to immutable data trees
Cleantext
⭐
19
An open-source package for python to clean raw text data
Udacity Data Analyst Nanodegree
⭐
19
Natural Language Processing With Machine Learning
⭐
18
This repository builds a basic understanding of Natural Language Processing and Machine Learning tasks around it.
Learn2clean
⭐
18
Learn2Clean: Optimizing the Sequence of Tasks for Data Preparation and Cleaning
Moodle Local_datacleaner
⭐
18
Reduce, filter, and anonymize moodle data for non-prod environments
Image Quality Issues
⭐
17
FiftyOne Plugin for finding common image quality issues
Datacleanr
⭐
17
Interactive and Reproducible Data Cleaning
Stata Economics
⭐
16
Economics Lesson with Stata
R Learning Journey
⭐
16
Some of the projects I made when starting to learn R for Data Science at university
Cleanlab Studio
⭐
16
Client interface for all things Cleanlab Studio
Validatedb
⭐
16
Validate on a table in a DB, using dbplyr
Twitter_sentiment_analysis_part2
⭐
15
redefining data-cleaning, preparation for visualisation
Validatetools
⭐
15
Dockingml
⭐
14
A package for MD, Docking and Machine learning drug discovery pipeline
Exemplary Ml Pipeline
⭐
14
Exemplary, annotated machine learning pipeline for any tabular data problem.
World Food Production
⭐
14
Comparing Top food and feed Producers around the globe and also seeking some interesting answers, solutions, patterns, hints and warnings through the power of Data Analysis and Data Visualization using Machine Learning.
Dedupe
⭐
13
Java DSL for (online) deduplication
Textcleaner
⭐
13
text-data pre-processing utility
Flight_delay_prediction
⭐
12
A two-stage predictive machine learning engine that forecasts the on-time performance of flights for 15 different airports in the USA based on data collected in 2016 and 2017.
Titanic Survival In Depth Analysis
⭐
12
Used the Pandas, Matplotlib, and Seaborn libraries to analyze, visualize, and explore the data of people travelling on the Titanic, and used Scikit-learn modelling algorithms to predict their probability of survival.
Datapipe
⭐
12
dataPipe is a data processing and data analytics library for JavaScript. Inspired by LINQ (C#) and Pandas (Python)
Twitter Sentiment Analysis
⭐
12
It is a Natural Language Processing Problem where Sentiment Analysis is done by Classifying the Positive tweets from negative tweets by machine learning models for classification, text mining, text analysis, data analysis and data visualization
Marshmallow Pyspark
⭐
12
Marshmallow serializer integration with pyspark
Llmdatadistill
⭐
12
distill large scale web page text
Mercury Dataschema
⭐
11
Utility package that, given a Pandas DataFrame, uses the DataSchema class to auto-infer feature types and automatically calculate different statistics depending on the types.
Plane
⭐
11
A text processing tool including tag (HTML, URL, email) extraction and removal, punctuation normalization, simple segmentation, and so on.
Animefaceranker
⭐
11
Anime Face Quality Check
Awesome Ml Monitoring
⭐
11
A curated list of awesome open source tools and commercial products for monitoring data quality, monitoring model performance, and profiling data 🚀
Deductive
⭐
10
Methods for deductive data correction and imputation
1-100 of 180 search results