
DaisyRec


Overview

DaisyRec is a Python toolkit that deals with rating prediction and item ranking issues.

The name DAISY (roughly :) ) stands for Multi-Dimension fAIrly compArIson for recommender SYstem. The overall framework of DaisyRec is shown below:

Make sure you have a CUDA environment for acceleration, since the deep-learning models can make use of it.
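
If you are unsure whether your CUDA environment is being picked up, a quick check like the one below can help. This snippet assumes the deep-learning models are run with PyTorch; it is only a sanity check and not part of the DaisyRec code itself:

    # Minimal CUDA sanity check; assumes PyTorch is installed.
    import torch

    if torch.cuda.is_available():
        print('CUDA is available:', torch.cuda.get_device_name(0))
    else:
        print('No CUDA device found; models will fall back to CPU and run much slower.')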

We will keep updating this repo.

Datasets

You can download the experiment data and put it into the data folder. All datasets are available at the links below:

How to run

  1. Make sure to run python setup.py build_ext --inplace to compile the dependent extensions before running any other code. After that, you will find the generated *.so or *.pyd files under daisy/model/.

  2. In order to reproduce the results, you need to run python data_generator.py to create the experiment_data folder containing the public datasets listed in our paper. If you only want to study one particular dataset, modify the code in data_generator.py so that it yields just the train and test sets you need. By default, data_generator.py generates all kinds of datasets (raw data, 5-core data and 10-core data) with different data splitting methods, including tloo, loo, tfo and fo. The meaning of these split methods is explained in the Important Commands section of this README.

  3. There are separate scripts for validation and for test; they are stored in the nested_tune_kit and test_kit folders, respectively. Each script in these folders should be moved into the root path (the same directory as data_generator.py) in order to run successfully. Alternatively, if you have an IDE, you can simply set the working directory and run from any folder.
     1. The validation dataset is used for parameter tuning, so we provide a split_validation interface inside the code in the nested_tune_kit folder. More detailed information about the parameter settings of the validation split methods can be found in daisy/utils/loader.py (see the sketch after this list). After validation finishes, the results are stored in the automatically generated folder tune_log/.

     2. Based on the best parameters determined by validation, run the test code that you moved into the root path earlier; the results will be stored in the automatically generated folder res/.
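
For reference, the following is a rough sketch of how the split_validation interface mentioned above might be used. The layout of the generated .dat files and the exact signature of split_validation are assumptions here; the authoritative definitions are in data_generator.py and daisy/utils/loader.py:

    # Rough sketch only: check daisy/utils/loader.py for the real signature of
    # split_validation and data_generator.py for the real .dat file layout.
    import pandas as pd
    from daisy.utils.loader import split_validation

    # Assumed: the generated training file is a comma-separated table of interactions.
    train_set = pd.read_csv('./experiment_data/train_ml-1m_10core_tfo.dat')

    # Assumed signature: split the training set into per-fold train/validation pairs
    # using the time-aware split-by-ratio ('tfo') method.
    train_list, val_list, fold_num = split_validation(train_set, val_method='tfo', fold_num=1)

    print(len(train_list), 'validation fold(s) generated')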

Examples to run:

Take the following case as an example: suppose we want to reproduce the top-20 results for BPR-MF on the ML-1M 10-core dataset.

  1. Assume we have already run data_generator.py and obtained the training and test datasets via tfo (i.e., the time-aware split-by-ratio method). We should find the files train_ml-1m_10core_tfo.dat and test_ml-1m_10core_tfo.dat in ./experiment_data/.

  2. The whole procedure consists of validation and test. Therefore, we first need to run hp_tune_pair_mf.py to find the best parameter settings. You may also change the parameter search space inside hp_tune_pair_mf.py. Command to run:

python hp_tune_pair_mf.py --dataset=ml-1m --prepro=10core --val_method=tfo --test_method=tfo --topk=20 --loss_type=BPR --sample_method=uniform --gpu=0
  3. After finishing step 2, we will get the best parameter settings from tune_log/. Then we can run the test code with the following command:
python run_pair_mf.py --dataset=ml-1m --prepro=10core --test_method=tfo --topk=20 --loss_type=BPR --num_ng=2 --factors=34 --epochs=50 --lr=0.0005 --lamda=0.0016 --sample_method=uniform --gpu=0

More details about the arguments are available in the help message; try:

python run_pair_mf.py --help
  4. Once step 3 has terminated, we can obtain the top-20 results from the automatically generated result file ./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv.
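
If you prefer to inspect the result file programmatically instead of opening it by hand, something like the snippet below works. It assumes pandas is installed; the exact column layout depends on the metrics written by the test script, so nothing about the columns is hard-coded:

    import pandas as pd

    # Load the generated top-20 result file and show whatever columns it contains.
    res = pd.read_csv('./res/ml-1m/10core_tfo_pairmf_BPR_uniform.csv')
    print(res.columns.tolist())
    print(res.head())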

More Ranking Results

More ranking results for different methods on different datasets across various settings of top-N (N = 1, 5, 10, 20, 30) are available in ranking_results.md.

Important Commands

The descriptions of the common command-line parameters used by the example code are listed below:

dataset: the selected dataset.
    Choices: ml-100k, ml-1m, ml-10m, ml-20m, lastfm, bx, amazon-cloth, amazon-electronic, amazon-book, amazon-music, epinions, yelp, citeulike, netflix. All choices are dataset names.

prepro: the data pre-processing method.
    Choices: origin, Ncore. 'origin' means using the raw data; 'Ncore' means only preserving users and items that have more than N interactions, where N can be any integer value (e.g. 5core, 10core).

val_method / test_method: the train-validation and train-test splitting methods.
    Choices: ufo (split-by-ratio at the user level), fo (split-by-ratio), tfo (time-aware split-by-ratio), loo (leave-one-out), tloo (time-aware leave-one-out), cv (cross validation; only applies to val_method).

topk: the length of the recommendation list.

test_size: the ratio of the test set.

fold_num: the number of folds used for validation (only applies to 'cv' and 'fo').

cand_num: the number of candidate items used for ranking.

sample_method: the negative sampling method.
    Choices: uniform (uniform sampling), item-ascd (sampling popular items with low rank), item-desc (sampling popular items with high rank).

num_ng: the number of negative samples.
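
To make the 'Ncore' option more concrete, the sketch below shows what N-core filtering means on a generic interaction table. This is an illustration of the idea only, not the repository's actual pre-processing code, and the column names user and item are assumptions:

    import pandas as pd

    def n_core_filter(df: pd.DataFrame, n: int = 10) -> pd.DataFrame:
        """Illustration of N-core filtering: repeatedly drop rows whose user or
        item has no more than n interactions, until every remaining user and
        item has more than n. Column names 'user' and 'item' are assumed."""
        while True:
            user_counts = df['user'].map(df['user'].value_counts())
            item_counts = df['item'].map(df['item'].value_counts())
            keep = (user_counts > n) & (item_counts > n)
            if keep.all():
                return df
            df = df[keep]

    # Example usage (hypothetical ratings table):
    # ratings = pd.read_csv('ratings.csv')
    # ratings_10core = n_core_filter(ratings, n=10)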