| Project Name | Stars | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language | Description |
|---|---|---|---|---|---|---|---|---|
| Ydata Profiling | 11,492 | 6 days ago | 40 | February 03, 2023 | 219 | MIT | Python | 1 line of code data quality profiling & exploratory data analysis for Pandas and Spark DataFrames. |
| Lux | 4,642 | 5 months ago | 18 | February 19, 2022 | 81 | Apache-2.0 | Python | Automatically visualize your pandas dataframe via a single print! 📊 💡 |
| Sweetviz | 2,687 | 5 days ago | 35 | November 29, 2023 | 33 | MIT | Python | Visualize and compare datasets, target values and associations, with one line of code. |
| Code | 802 | 4 months ago | | | 25 | | Jupyter Notebook | Compilation of R and Python programming codes on the Data Professor YouTube channel. |
| Skimpy | 331 | 11 days ago | 10 | September 11, 2023 | 9 | Other | Python | skimpy is a lightweight tool that provides summary statistics about variables in data frames within the console. |
| Feature Engineering Tutorials | 217 | a year ago | | | 5 | AGPL-3.0 | Jupyter Notebook | Data Science Feature Engineering and Selection Tutorials |
| Data Analysis Using Python | 215 | 5 months ago | | | | MIT | Jupyter Notebook | Data Analysis Using Python: A Beginner's Guide Featuring NYC Open Data |
| Ditching Excel For Python | 175 | 3 years ago | | | | | Jupyter Notebook | Functionalities in Excel translated to Python |
| Handyspark | 129 | 5 years ago | 7 | May 19, 2019 | 8 | MIT | Jupyter Notebook | HandySpark - bringing pandas-like capabilities to Spark dataframes |
| Pythonplot.com | 96 | 3 years ago | | | | Other | Jupyter Notebook | 📈 Interactive comparison of Python plotting libraries for exploratory data analysis. Examples of using Pandas plotting, plotnine, Seaborn, and Matplotlib. |
Do you like this project? Show us your love and give feedback!
ydata-profiling's primary goal is to provide a one-line Exploratory Data Analysis (EDA) experience in a consistent and fast solution. Like pandas' handy df.describe() function, ydata-profiling delivers an extended analysis of a DataFrame, while allowing the analysis to be exported in different formats such as HTML and JSON.
The package outputs a simple, digested analysis of a dataset, including time-series and text data.
Looking for a scalable solution that can fully integrate with your database systems?
Leverage the YData Fabric Data Catalog to connect to different databases and storage (Oracle, Snowflake, PostgreSQL, GCS, S3, etc.) and enjoy an interactive and guided profiling experience in Fabric. Check out the Community Version.
```shell
pip install ydata-profiling
```

```shell
conda install -c conda-forge ydata-profiling
```
Start by loading your pandas DataFrame as you normally would, e.g. by using:

```python
import numpy as np
import pandas as pd
from ydata_profiling import ProfileReport

df = pd.DataFrame(np.random.rand(100, 5), columns=["a", "b", "c", "d", "e"])
```
To generate the standard profiling report, simply run:

```python
profile = ProfileReport(df, title="Profiling Report")
```
The report contains three additional sections:

- Overview: mostly global details about the dataset
- Alerts: a comprehensive and automatic list of potential data quality issues
- Reproduction: technical details about the analysis
Spark support has been released, but we are always looking for an extra pair of hands 👐. Check out the current work in progress!
ydata-profiling can be used to deliver a variety of different use cases. The documentation includes guides, tips and tricks for tackling them:
| Use case | Description |
|---|---|
| Comparing datasets | Comparing multiple versions of the same dataset |
| Profiling a Time-Series dataset | Generating a report for a time-series dataset with a single line of code |
| Profiling large datasets | Tips on how to prepare data and configure ydata-profiling for working with large datasets |
| Handling sensitive data | Generating reports which are mindful about sensitive data in the input dataset |
| Dataset metadata and data dictionaries | Complementing the report with dataset details and column-specific data dictionaries |
| Customizing the report's appearance | Changing the appearance of the report's page and of the contained visualizations |
| Profiling Databases | For a seamless profiling experience in your organization's databases, check Fabric Data Catalog, which allows you to consume data from different types of storage, such as RDBMSs (Azure SQL, PostgreSQL, Oracle, etc.) and object storage (Google Cloud Storage, AWS S3, Snowflake, etc.), among others. |
There are two interfaces to consume the report inside a Jupyter notebook: through widgets and through an embedded HTML report.
The above is achieved by simply displaying the report as a set of widgets. In a Jupyter Notebook, run:

```python
profile.to_widgets()
```
The HTML report can be directly embedded in a cell in a similar fashion:

```python
profile.to_notebook_iframe()
```
To generate an HTML report file, save the ProfileReport to an object and use the to_file() function:

```python
profile.to_file("your_report.html")
```
Alternatively, the report's data can be obtained as a JSON file:
```python
# As a JSON string
json_data = profile.to_json()

# As a file
profile.to_file("your_report.json")
```
For standard formatted CSV files (which can be read directly by pandas without additional settings), the ydata_profiling executable can be used from the command line. The example below generates a report named Example Profiling Report, using a configuration file called default.yaml, in the file report.html, by processing a data.csv dataset:

```shell
ydata_profiling --title "Example Profiling Report" --config_file default.yaml data.csv report.html
```
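The referenced default.yaml follows the shape of the package's default configuration file — a minimal sketch, with key names assumed from ydata-profiling's bundled defaults (verify them against the config_default.yaml shipped with your installed version):

```yaml
# Hypothetical minimal config — key names assumed from ydata-profiling's
# default configuration; check config_default.yaml in your installed version.
title: "Example Profiling Report"
samples:
  head: 10        # rows shown in the "First rows" sample
  tail: 10        # rows shown in the "Last rows" sample
correlations:
  pearson:
    calculate: true
```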
Additional details on the CLI are available on the documentation.
The following example reports showcase the capabilities of the package across a wide range of datasets and data types:
Additional details, including information about widget support, are available on the documentation.
You can install using the pip package manager by running:

```shell
pip install -U ydata-profiling
```
The package declares "extras", sets of additional dependencies:

- [notebook]: support for rendering the report in Jupyter notebook widgets.
- [unicode]: support for more detailed Unicode analysis, at the expense of additional disk space.
- [pyspark]: support for pyspark, for big-dataset analysis.
Install these with, e.g.:

```shell
pip install -U ydata-profiling[notebook,unicode,pyspark]
```
You can install using the conda package manager by running:

```shell
conda install -c conda-forge ydata-profiling
```
Download the source code by cloning the repository, or click on Download ZIP to download the latest stable version. Install it by navigating to the proper directory and running:

```shell
pip install -e .
```
The profiling report is written in HTML and CSS, which means a modern browser is required.
You need Python 3 to run the package. Other dependencies can be found in the requirements files:
| File | Purpose |
|---|---|
| requirements-dev.txt | Requirements for development |
| requirements-test.txt | Requirements for testing |
| setup.py | Requirements for widgets etc. |
To maximize its usefulness in real-world contexts, ydata-profiling has a set of implicit and explicit integrations with a variety of other actors in the Data Science ecosystem:
| Integration type | Description |
|---|---|
| Other DataFrame libraries | How to compute the profiling of data stored in libraries other than pandas |
| Great Expectations | Generating Great Expectations expectations suites directly from a profiling report |
| Interactive applications | Embedding profiling reports in Streamlit, Dash or Panel applications |
| Pipelines | Integration with DAG workflow execution tools like Airflow or Kedro |
Need help? Want to share a perspective? Report a bug? Ideas for collaborations? Reach out via the following channels:
Get your questions answered with a product owner by booking a Pawsome chat! 🐼
❗ Before reporting an issue on GitHub, check out Common Issues.
Learn how to get involved in the Contribution Guide.
A low-threshold place to ask questions or start contributing is the Data Centric AI Community's Discord.
A big thank you to all our amazing contributors!
Contributors wall made with contrib.rocks.