Project Name | Description | Stars | Downloads | Most Recent Commit | Total Releases | Latest Release | Open Issues | License | Language
---|---|---|---|---|---|---|---|---|---
Synapseml | Simple and Distributed Machine Learning | 4,566 | 3 | 4 days ago | 9 | November 22, 2022 | 321 | mit | Scala
Machine Learning | :earth_americas: machine learning tutorials (mainly in Python3) | 2,570 | | 12 hours ago | | | 5 | mit | HTML
Spark Py Notebooks | Apache Spark & Python (pySpark) tutorials for Big Data Analysis and Machine Learning as IPython / Jupyter notebooks | 1,515 | | 6 months ago | | | 9 | other | Jupyter Notebook
Optimus | :truck: Agile Data Preparation Workflows made easy with Pandas, Dask, cuDF, Dask-cuDF, Vaex and PySpark | 1,406 | | 14 days ago | 32 | June 19, 2022 | 27 | apache-2.0 | Python
Pyspark Example Project | Example project implementing best practices for PySpark ETL jobs and applications. | 1,034 | | a year ago | | | 11 | | Python
Hopsworks | Hopsworks - Data-Intensive AI platform with a Feature Store | 977 | | 4 hours ago | 1 | September 11, 2019 | 9 | agpl-3.0 | Java
Kuwala | Kuwala is the no-code data platform for BI analysts and engineers enabling you to build powerful analytics workflows. We set out to bring state-of-the-art data engineering tools you love, such as Airbyte, dbt, or Great Expectations, together in one intuitive interface built with React Flow. In addition, we provide third-party data for data science models and products with a focus on geospatial data. Currently, the following data connectors are available worldwide: a) high-resolution demographics data, b) points of interest from OpenStreetMap, c) Google Popular Times | 610 | | a year ago | | | 22 | apache-2.0 | JavaScript
Pandapy | PandaPy has the speed of NumPy and the usability of Pandas, 10x to 50x faster (by @firmai) | 483 | | 2 years ago | 22 | January 25, 2020 | 2 | | Python
Datacompy | Pandas and Spark DataFrame comparison for humans and more! | 316 | 1 | 7 days ago | 10 | April 19, 2022 | 13 | apache-2.0 | Python
Sk Dist | Distributed scikit-learn meta-estimators in PySpark | 283 | 2 | 8 months ago | 12 | May 14, 2020 | 8 | apache-2.0 | Python
Optimus is an opinionated Python library for easily loading, processing, and plotting data and creating ML models that run on pandas, Dask, cuDF, Dask-cuDF, Vaex, or Spark.
Some amazing things Optimus can do for you:
To launch a live notebook server and try Optimus using Binder or Colab, click one of the following badges:
In your terminal, just type:

```bash
pip install pyoptimus
```
By default, Optimus installs pandas as its engine. To install other engines, use the following commands:
Engine | Command
---|---
Dask | `pip install pyoptimus[dask]`
cuDF | `pip install pyoptimus[cudf]`
Dask-cuDF | `pip install pyoptimus[dask-cudf]`
Vaex | `pip install pyoptimus[vaex]`
Spark | `pip install pyoptimus[spark]`
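If you want more than one engine in the same environment, pip extras can normally be combined in a single install (this assumes the package publishes the extras exactly as listed above):

```bash
# Install the Dask and Spark engines in one step; the quotes
# keep the shell from expanding the square brackets.
pip install "pyoptimus[dask,spark]"
```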
To install from the repo:

```bash
pip install git+https://github.com/hi-primus/optimus.git@develop-23.5
```
To install other engines:

```bash
pip install "git+https://github.com/hi-primus/optimus.git@develop-23.5#egg=pyoptimus[dask]"
```
You can go to 10 minutes to Optimus, where you'll find the basics for working in a notebook.

You can also go to the Examples section to find notebooks on data cleaning, data munging, profiling, data enrichment, and creating ML and DL models.

Here's a handy Cheat Sheet with the most common Optimus operations.
Start Optimus using `"pandas"`, `"dask"`, `"cudf"`, `"dask_cudf"`, `"vaex"` or `"spark"`:
```python
from optimus import Optimus

op = Optimus("pandas")
```
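Since the engine is selected by the string you pass to the constructor, switching backends does not change the rest of your code. A minimal sketch (each non-pandas engine needs the matching pip extra from the table above):

```python
# Same constructor, different backend; the Optimus API is
# designed to stay the same across engines. These require
# pyoptimus[dask] and pyoptimus[spark] respectively.
op_dask = Optimus("dask")
op_spark = Optimus("spark")
```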
Now Optimus can load data in CSV, JSON, Parquet, Avro, and Excel formats, from a local file or from a URL.
```python
# csv
df = op.load.csv("../examples/data/foo.csv")

# json
df = op.load.json("../examples/data/foo.json")

# using a url
df = op.load.json("https://raw.githubusercontent.com/hi-primus/optimus/develop-23.5/examples/data/foo.json")

# parquet
df = op.load.parquet("../examples/data/foo.parquet")

# ...or anything else
df = op.load.file("../examples/data/titanic3.xls")
```
Also, you can load data from Oracle, Redshift, MySQL and Postgres databases.
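There is no database example in this section, so here is a rough, hypothetical sketch of what a database load could look like. The names `op.connect` and `table_to_df` and all the connection parameters are illustrative assumptions, not the documented API; check the Optimus docs for the real signatures:

```python
# Hypothetical sketch -- method and parameter names are assumed
# for illustration, not taken from the Optimus documentation.
db = op.connect(
    driver="postgres",    # assumed: "oracle", "redshift", "mysql" or "postgres"
    host="localhost",
    database="sales",     # hypothetical database name
    user="optimus",
    password="secret",
)

# Assumed helper that loads a table into an Optimus dataframe.
df = db.table_to_df("customers")
```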
And you can save data just as easily:

```python
# csv
df.save.csv("data/foo.csv")

# json
df.save.json("data/foo.json")

# parquet
df.save.parquet("data/foo.parquet")
```
You can also save data to Oracle, Redshift, MySQL, and Postgres.
You can also create a dataframe from scratch:
```python
df = op.create.dataframe({
    'A': ['a', 'b', 'c', 'd'],
    'B': [1, 3, 5, 7],
    'C': [2, 4, 6, None],
    'D': ['1980/04/10', '1980/04/10', '1980/04/10', '1980/04/10']
})
```
Using `display` you have a beautiful way to show your data, with extra information like column number, column data type and marked white spaces.
```python
display(df)
```
Optimus was created to make data cleaning a breeze. The API was designed to be super easy for newcomers and very familiar for people who come from pandas.
Optimus expands the standard DataFrame functionality by adding `.rows` and `.cols` accessors.
For example, you can load data from a URL, transform it, and apply some predefined cleaning functions:
```python
new_df = df\
    .rows.sort("rank", "desc")\
    .cols.lower(["names", "function"])\
    .cols.date_format("date arrival", "yyyy/MM/dd", "dd-MM-YYYY")\
    .cols.years_between("date arrival", "dd-MM-YYYY", output_cols="from arrival")\
    .cols.normalize_chars("names")\
    .cols.remove_special_chars("names")\
    .rows.drop(df["rank"] > 8)\
    .cols.rename("*", str.lower)\
    .cols.trim("*")\
    .cols.unnest("japanese name", output_cols="other names")\
    .cols.unnest("last position seen", separator=",", output_cols="pos")\
    .cols.drop(["last position seen", "japanese name", "date arrival", "cybertronian", "nulltype"])
```
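The cleaned result is an ordinary Optimus dataframe, so you can inspect and persist it with the same API shown earlier (the output path is just an example):

```python
# Reuse display() and the save API from the sections above.
display(new_df)
new_df.save.csv("data/foo_clean.csv")  # example output path
```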
Feedback is what drives Optimus' future, so please take a couple of minutes to help shape the Optimus roadmap: http://bit.ly/optimus_survey

If you have a suggestion or feature request, open an issue at https://github.com/hi-primus/optimus/issues
If you run into issues, see our Troubleshooting Guide.
Contributions go far beyond pull requests and commits. We are very happy to receive any kind of contribution, including:
Become a backer or a sponsor and get your image on our README on GitHub with a link to your site.