Nancorrmp

Parallel correlation calculation of big numpy arrays or pandas dataframes with NaNs and infs.

Multiprocessing correlation calculation for Python

nancorrmp is a small module for calculating correlations of big numpy arrays or pandas dataframes with NaNs and infs, using multiple cores. The default numpy.corrcoef method does not calculate correlations for input that contains NaNs or infs, and the pandas method pandas.DataFrame.corr is single-threaded.

nancorrmp uses Pearson correlation code from scipy, which is based on numpy rather than on the Cython backend of pandas. Parallelism is implemented with the Python multiprocessing module. For arrays with NaNs and infs, nancorrmp follows the pandas approach: a pair of observations is skipped when either value is NaN, +inf, or -inf. nancorrmp can also return p values, similar to the scipy.stats.pearsonr function.
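
The pair-skipping rule described above can be illustrated with a minimal pure-Python sketch. The function name `pairwise_pearson` is hypothetical and this is not nancorrmp's actual code (which is numpy based); it only shows what "skip a pair when either value is not finite" means for a single pair of columns.

```python
import math

def pairwise_pearson(x, y):
    """Pearson r over the positions where both values are finite."""
    # Keep only pairs in which neither value is NaN, +inf, or -inf.
    pairs = [(a, b) for a, b in zip(x, y)
             if math.isfinite(a) and math.isfinite(b)]
    n = len(pairs)
    if n < 2:
        return math.nan  # not enough valid observations
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    var_x = sum((a - mean_x) ** 2 for a, _ in pairs)
    var_y = sum((b - mean_y) ** 2 for _, b in pairs)
    if var_x == 0 or var_y == 0:
        return math.nan  # correlation is undefined for a constant column
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, math.nan, 4.0, 5.0]
y = [2.0, 4.0, 6.0, math.inf, 10.0]
# Positions 2 and 3 are dropped; the remaining pairs lie on y = 2x,
# so r is 1.0 up to floating-point rounding.
r = pairwise_pearson(x, y)
```

Note that the set of valid observations can differ from one column pair to the next, which is why the mask is applied per pair rather than by dropping whole rows up front.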

Benchmarks show that with 4 cores, nancorrmp is faster than pandas even for a 1200x1200 matrix; with 2 cores it is faster for a 2400x2400 matrix. With a single process, pandas is still faster than nancorrmp for a 5000x5000 matrix, so it is recommended to use nancorrmp with at least 2 cores.

Installation

pip install nancorrmp

Usage

import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp
from pandas.testing import assert_frame_equal

np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))
corr = NaNCorrMp.calculate(random_dataframe)
corr_pandas = random_dataframe.corr()
assert_frame_equal(corr, corr_pandas)
corr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)

NaNCorrMp Methods

The nancorrmp module has one static class named NaNCorrMp, with 2 public methods and 1 type:

ArrayLike = Union[pd.DataFrame, np.ndarray]

Type used to unify pd.DataFrame and np.ndarray.

NaNCorrMp.calculate(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -> ArrayLike

Calculates the correlation matrix using Pearson correlation. n_jobs controls the number of cores to use; the default of -1 uses all available cores. chunks controls how many pairs of arrays are sent to each process; the default of 500 should be suitable for all purposes.

Returns output of the same type as the input: if X is a pd.DataFrame it returns a pd.DataFrame, and if X is an np.ndarray it returns an np.ndarray.

import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp

np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))
corr = NaNCorrMp.calculate(random_dataframe)
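
To give a feel for what the chunks parameter means, here is a small sketch of how column pairs could be grouped into batches of a fixed size. The helper `chunked_pairs` is hypothetical; whether nancorrmp also includes the diagonal pairs, or builds its batches this way internally, is not stated in this README.

```python
from itertools import combinations, islice

def chunked_pairs(n_columns, chunks):
    """Yield lists of at most `chunks` column-index pairs (i, j) with i < j."""
    pairs = combinations(range(n_columns), 2)
    while True:
        # Take the next `chunks` pairs from the iterator.
        batch = list(islice(pairs, chunks))
        if not batch:
            return
        yield batch

# 100 columns give 100 * 99 / 2 = 4950 distinct pairs,
# i.e. 9 full batches of 500 and one final batch of 450.
batches = list(chunked_pairs(n_columns=100, chunks=500))
```

Each batch could then be handed to a worker process as one task, so a larger chunks value means fewer, bigger tasks and less inter-process overhead per pair.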

NaNCorrMp.calculate_with_p_value(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -> Tuple[ArrayLike, ArrayLike]

Calculates the correlation matrix and the p value matrix using Pearson correlation. n_jobs controls the number of cores to use; the default of -1 uses all available cores. chunks controls how many pairs of arrays are sent to each process; the default of 500 should be suitable for all purposes. The correlation and p value are the same as the result of scipy.stats.pearsonr, but the method works with NaNs and infs and can use multiple cores.

Returns output of a type matching the input: if X is a pd.DataFrame it returns (pd.DataFrame, pd.DataFrame), and if X is an np.ndarray it returns (np.ndarray, np.ndarray).

import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp

np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))
corr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)
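
For reference, the two-sided p value of a Pearson correlation is commonly expressed through a Student's t statistic. The formula below is the standard textbook relation, not taken from nancorrmp's source; it assumes that $n$ is the number of valid (finite) observation pairs for the two columns being compared.

```latex
t = r \sqrt{\frac{n - 2}{1 - r^{2}}},
\qquad
p = 2 \, P\left(T_{n-2} \ge |t|\right)
```

Here $r$ is the sample correlation and $T_{n-2}$ follows a Student's t distribution with $n - 2$ degrees of freedom. Because pairs with NaNs or infs are skipped, $n$ can differ between column pairs.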

Benchmark

Results can be reproduced by using the test/test_benchmark_nancorrmp.py module.

import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp

np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(1200, 1200))

%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=4, chunks=1000)
# 9.92 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit random_dataframe.corr()
# 10.4 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

random_dataframe = pd.DataFrame(np.random.rand(2400, 2400))

%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=2, chunks=1000)
# 1min 26s ± 3.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit random_dataframe.corr()
# 1min 45s ± 3.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Test

The test module contains tests for both single-core and multi-core usage. The tests assert that the output of NaNCorrMp.calculate is the same as the output of pandas.DataFrame.corr for the same data. Tests require scipy and can be run with the following command:

python setup.py test

License

MIT License

Copyright (c) 2020 Michał Bukowski [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
