Parallel correlation calculation of big numpy arrays or pandas dataframes with NaNs and infs.


`nancorrmp` is a small module for calculating correlations of big numpy arrays or pandas dataframes with NaNs and infs, using multiple cores. The default `numpy.corrcoef` function does not calculate correlations for input that contains NaNs or infs, and the `pandas` method `pandas.DataFrame.corr` is single-threaded only.

`nancorrmp` uses the Pearson correlation calculation code from `scipy`, which is based on `numpy`, instead of the Cython backend used by `pandas`. Multiprocessing is implemented with the Python `multiprocessing` module. `nancorrmp` follows the `pandas` approach to correlating arrays with NaNs and infs: a pair of observations is skipped when either value is NaN, +inf, or -inf. `nancorrmp` can also return p-values alongside the correlations, similar to the `scipy.stats.pearsonr` function.
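The pairwise skipping rule can be illustrated with plain `numpy`. This is a minimal sketch, not the code `nancorrmp` ships: the helper name `pairwise_pearson` is hypothetical, and the `pandas` reference value is computed after mapping infs to NaN explicitly.

```python
import numpy as np
import pandas as pd


def pairwise_pearson(x: np.ndarray, y: np.ndarray) -> float:
    """Pearson r over the observations where both x and y are finite.

    Hypothetical helper for illustration; not part of nancorrmp.
    """
    keep = np.isfinite(x) & np.isfinite(y)  # False for NaN, +inf and -inf
    if keep.sum() < 2:
        return float("nan")  # too few complete pairs to correlate
    return float(np.corrcoef(x[keep], y[keep])[0, 1])


x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0])
y = np.array([2.0, 1.0, 3.0, np.inf, 7.0, 8.0])

# Observations 2 (NaN in x) and 3 (inf in y) are dropped as pairs.
r = pairwise_pearson(x, y)

# Reference value: pandas pairwise-complete correlation, infs mapped to NaN.
frame = pd.DataFrame({"x": x, "y": y}).replace([np.inf, -np.inf], np.nan)
r_pandas = frame.corr().loc["x", "y"]
```

Only the four observations where both values are finite contribute to `r`, whereas `np.corrcoef(x, y)` on the raw arrays yields NaN.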

Benchmarks show that with 4 cores, calculating correlations is faster with `nancorrmp` than with `pandas` even for a 1200x1200 matrix. With 2 cores, `nancorrmp` is faster for a 2400x2400 matrix. The single-process `pandas` implementation is still faster than single-process `nancorrmp` for a 5000x5000 matrix, so it is recommended to use `nancorrmp` with at least 2 cores.

```
pip install nancorrmp
```

```
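import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp
from pandas.testing import assert_frame_equal

np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))

# Correlation matrix computed on all available cores (n_jobs defaults to -1).
corr = NaNCorrMp.calculate(random_dataframe)

# The result should match the single-threaded pandas implementation.
corr_pandas = random_dataframe.corr()
assert_frame_equal(corr, corr_pandas)

# Correlation matrix together with the matrix of p-values.
corr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)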
```

The `nancorrmp` module has one static class named `NaNCorrMp` with 2 public methods and 1 type.

**ArrayLike = Union[pd.DataFrame, np.ndarray]**

Type used to unify `pd.DataFrame` and `np.ndarray`.

**NaNCorrMp.calculate(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -> ArrayLike**

Calculates the correlation matrix using Pearson correlation. `n_jobs` controls the number of cores to use; the default of -1 uses all available cores. `chunks` controls how many pairs of arrays are sent to each process; 500 should be suitable for all purposes.

Returns output of the same type as the input: if `X` is a `pd.DataFrame` it returns a `pd.DataFrame`, and if `X` is an `np.ndarray` it returns an `np.ndarray`.

```
import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp
np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))
corr = NaNCorrMp.calculate(random_dataframe)
```
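To make the roles of `n_jobs` and `chunks` more concrete, the sketch below batches column pairs and hands each batch to a worker process. It is an illustration under assumptions, not the library's implementation: the names `chunk_pairs`, `corr_for_pairs` and `parallel_corr` are hypothetical, the diagonal is simply set to 1, and the array is pickled with every batch.

```python
import itertools
import multiprocessing as mp
from functools import partial

import numpy as np


def chunk_pairs(n_columns, chunks):
    """Yield lists of at most `chunks` column-index pairs (i, j) with i < j."""
    pairs = itertools.combinations(range(n_columns), 2)
    while True:
        batch = list(itertools.islice(pairs, chunks))
        if not batch:
            return
        yield batch


def corr_for_pairs(data, batch):
    """Pearson r for each pair in `batch`, skipping non-finite observations."""
    out = []
    for i, j in batch:
        keep = np.isfinite(data[:, i]) & np.isfinite(data[:, j])
        if keep.sum() > 1:
            r = np.corrcoef(data[keep, i], data[keep, j])[0, 1]
        else:
            r = np.nan
        out.append((i, j, r))
    return out


def parallel_corr(data, n_jobs=-1, chunks=500):
    """Hypothetical helper: fill a symmetric correlation matrix batch by batch."""
    n_jobs = mp.cpu_count() if n_jobs == -1 else n_jobs
    result = np.eye(data.shape[1])  # diagonal assumed to be 1 in this sketch
    worker = partial(corr_for_pairs, data)
    batches = chunk_pairs(data.shape[1], chunks)

    def fill(batch_results):
        for batch_result in batch_results:
            for i, j, r in batch_result:
                result[i, j] = result[j, i] = r

    if n_jobs == 1:
        fill(map(worker, batches))  # serial path, no worker processes
    else:
        with mp.Pool(processes=n_jobs) as pool:
            fill(pool.imap_unordered(worker, batches))
    return result


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo = rng.random((200, 20))
    demo[5, 3] = np.nan  # this row is skipped only for pairs involving column 3
    print(parallel_corr(demo, n_jobs=2, chunks=50).shape)
```

Larger batches mean fewer round trips between processes but coarser load balancing. A production implementation would more likely share the input array between workers than pickle it per batch, as this sketch does for brevity.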

**NaNCorrMp.calculate_with_p_value(X: ArrayLike, n_jobs: int = -1, chunks: int = 500) -> Tuple[ArrayLike, ArrayLike]**

Calculates the correlation matrix and the p-value matrix using Pearson correlation. `n_jobs` controls the number of cores to use; the default of -1 uses all available cores. `chunks` controls how many pairs of arrays are sent to each process; 500 should be suitable for all purposes. The correlations and p-values are the same as those returned by `scipy.stats.pearsonr`, but the method also works with NaNs and infs and on multiple cores.

Returns output of a type matching the input: if `X` is a `pd.DataFrame` it returns `(pd.DataFrame, pd.DataFrame)`, and if `X` is an `np.ndarray` it returns `(np.ndarray, np.ndarray)`.

```
import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp
np.random.seed(0)
random_dataframe = pd.DataFrame(np.random.rand(100, 100))
corr, p_value = NaNCorrMp.calculate_with_p_value(random_dataframe)
```
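For a single pair of columns, the relationship to `scipy.stats.pearsonr` described above can be sketched as follows. The helper name `pearson_with_p_value` is hypothetical; it only shows the idea of masking non-finite pairs before calling `scipy`, and is not taken from the library's source.

```python
import numpy as np
from scipy import stats


def pearson_with_p_value(x, y):
    """Return (r, p) from scipy.stats.pearsonr on the finite pairs only.

    Hypothetical helper for illustration; not part of nancorrmp.
    """
    keep = np.isfinite(x) & np.isfinite(y)  # drop NaN, +inf and -inf pairs
    r, p = stats.pearsonr(x[keep], y[keep])
    return float(r), float(p)


x = np.array([1.0, 2.0, np.nan, 4.0, 5.0, 6.0, 7.0])
y = np.array([1.5, 1.0, 3.0, -np.inf, 4.5, 7.0, 6.0])

# Observations 2 (NaN in x) and 3 (-inf in y) are excluded from both r and p.
r, p = pearson_with_p_value(x, y)
```

Because the p-value depends on the number of observations, it is computed from the five remaining pairs here, not from the original seven.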

The benchmark results can be reproduced using the `test/test_benchmark_nancorrmp.py` module:

```
# Run in IPython or Jupyter: %timeit is an IPython magic command.
import pandas as pd
import numpy as np
from nancorrmp.nancorrmp import NaNCorrMp

np.random.seed(0)

# 1200x1200 matrix, 4 cores
random_dataframe = pd.DataFrame(np.random.rand(1200, 1200))
%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=4, chunks=1000)
# 9.92 s ± 205 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit random_dataframe.corr()
# 10.4 s ± 56.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# 2400x2400 matrix, 2 cores
random_dataframe = pd.DataFrame(np.random.rand(2400, 2400))
%timeit NaNCorrMp.calculate(random_dataframe, n_jobs=2, chunks=1000)
# 1min 26s ± 3.16 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit random_dataframe.corr()
# 1min 45s ± 3.58 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
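Outside IPython, a comparable measurement can be taken with the standard `timeit` module. The sketch below times only the `pandas` baseline on a deliberately small 200x200 matrix (an assumption chosen so it runs quickly); the same pattern could wrap a call to `NaNCorrMp.calculate`.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(0)
frame = pd.DataFrame(np.random.rand(200, 200))

# Three independent runs, one call each; the minimum is the least noisy figure.
timings = timeit.repeat(frame.corr, repeat=3, number=1)
best = min(timings)
print(f"pandas corr, 200x200: best of 3 = {best:.4f} s")
```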

The `test` module contains tests for both single-core and multi-core usage. The tests assert that the output of `NaNCorrMp.calculate` is the same as the output of `pandas.DataFrame.corr` for the same data. Tests require `scipy` and can be run with the following command:

```
python setup.py test
```

MIT License

Copyright (c) 2020 Micha Bukowski

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
