
Surprise: A Python Recommendation Algorithm Library

Surprise is an easy-to-use Python scikit for recommender systems.

  1. Documentation: https://surprise.readthedocs.io/en/stable/
  2. Installation: pip install surprise
  3. Installation may fail with: error: Microsoft Visual C++ 14.0 is required. Get it with
    "Microsoft Visual C++ Build Tools"
  4. If installation fails, download the build tools from this page and set up the environment: https://blogs.msdn.microsoft.com/pythonengineering/2016/04/11/unable-to-find-vcvarsall-bat/

1 Getting Started

1.1 Basic usage

Automatic cross-validation

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import cross_validate

# Load the built-in MovieLens-100k dataset
data = Dataset.load_builtin('ml-100k')

# Instantiate the SVD algorithm
algo = SVD()

# Run 5-fold cross-validation and print the results
cross_validate(algo, data, measures=['RMSE', 'MAE'], cv=5, verbose=True)
Dataset ml-100k could not be found. Do you want to download it? [Y/n] y
Trying to download dataset from http://files.grouplens.org/datasets/movielens/ml-100k.zip...
Done! Dataset ml-100k has been saved to C:\Users\Administrator/.surprise_data/ml-100k
Evaluating RMSE, MAE of algorithm SVD on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9429  0.9262  0.9353  0.9313  0.9427  0.9357  0.0065  
MAE (testset)     0.7420  0.7306  0.7381  0.7343  0.7415  0.7373  0.0044  
Fit time          6.75    6.65    6.81    6.97    6.79    6.79    0.10    
Test time         0.29    0.28    0.31    0.24    0.28    0.28    0.03    





{'fit_time': (6.748954772949219,
  6.648886442184448,
  6.814781904220581,
  6.970685958862305,
  6.785797357559204),
 'test_mae': array([0.74200524, 0.73058076, 0.73807502, 0.73425662, 0.74150664]),
 'test_rmse': array([0.94290798, 0.92623843, 0.9352968 , 0.93130338, 0.94273246]),
 'test_time': (0.2868227958679199,
  0.2778284549713135,
  0.3148069381713867,
  0.23685264587402344,
  0.28182458877563477)}

Train-test split and the fit() method

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

# Load the data
data = Dataset.load_builtin('ml-100k')

# Hold out 25% of the data for testing
trainset, testset = train_test_split(data, test_size=.25)

# Instantiate the algorithm
algo = SVD()

algo.fit(trainset)
predictions = algo.test(testset)

# RMSE
accuracy.rmse(predictions)
RMSE: 0.9392

0.9391726088618421

Train on a whole trainset and the predict() method

Instead of running cross-validation, we can also simply fit the algorithm on the whole dataset. This is done with the build_full_trainset() method, which builds a trainset object:

from surprise import KNNBasic
from surprise import Dataset

# Load the data
data = Dataset.load_builtin('ml-100k')

# Retrieve the full trainset
trainset = data.build_full_trainset()

# Instantiate the collaborative-filtering algorithm and train it
algo = KNNBasic()
algo.fit(trainset)
Computing the msd similarity matrix...
Done computing similarity matrix.

<surprise.prediction_algorithms.knns.KNNBasic at 0x1f5faae4278>
uid = str(196)  # raw user id
iid = str(302)  # raw item id

# Predict the rating user 196 would give to item 302
pred = algo.predict(uid, iid, r_ui=4, verbose=True)
user: 196        item: 302        r_ui = 4.00   est = 4.06   {'actual_k': 40, 'was_impossible': False}

1.2 Use a custom dataset

Algorithm class: Description

random_pred.NormalPredictor: Predicts a random rating based on the distribution of the training set.

baseline_only.BaselineOnly: Predicts the baseline estimate for a given user and item.

knns.KNNBasic: The most basic collaborative filtering algorithm.

knns.KNNWithMeans: A collaborative filtering algorithm that takes into account the mean ratings of each user.

knns.KNNBaseline: A collaborative filtering algorithm that takes into account a baseline rating.

matrix_factorization.SVD: The SVD algorithm.

matrix_factorization.SVDpp: The SVD++ algorithm, i.e. LFM + SVD.

matrix_factorization.NMF: Collaborative filtering based on (non-negative) matrix factorization.

slope_one.SlopeOne: A simple yet accurate collaborative filtering algorithm.

co_clustering.CoClustering: A collaborative filtering algorithm based on co-clustering.

Similarity measure: Description

cosine: Compute the cosine similarity between all pairs of users (or items).

msd: Compute the Mean Squared Difference similarity between all pairs of users (or items).

pearson: Compute the Pearson correlation coefficient between all pairs of users (or items).

pearson_baseline: Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items), using baselines for centering instead of means.

Accuracy metric: Description

rmse: Compute RMSE (Root Mean Squared Error).

mae: Compute MAE (Mean Absolute Error).

fcp: Compute FCP (Fraction of Concordant Pairs).

A short example combining an algorithm, a similarity option and these metrics follows below.
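
As a quick illustration of how these tables fit together, here is a minimal sketch of my own (not from the original post): it cross-validates an item-based KNNWithMeans with a Pearson similarity on the built-in ml-100k data and reports all three metrics.

from surprise import KNNWithMeans
from surprise import Dataset
from surprise.model_selection import cross_validate

data = Dataset.load_builtin('ml-100k')

# Item-based collaborative filtering with a Pearson similarity
sim_options = {'name': 'pearson', 'user_based': False}
algo = KNNWithMeans(sim_options=sim_options)

# 3-fold cross-validation over the three accuracy metrics listed above
cross_validate(algo, data, measures=['RMSE', 'MAE', 'FCP'], cv=3, verbose=True)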

import os
from surprise import BaselineOnly  # baseline estimate for a given user and item
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate
from surprise.model_selection import train_test_split

# Path to the dataset file
file_path = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/u.data')

# Each line has the format 'user item rating timestamp', separated by '\t'
reader = Reader(line_format='user item rating timestamp', sep='\t')

data = Dataset.load_from_file(file_path, reader=reader)

# Evaluate with cross-validation
cross_validate(BaselineOnly(), data, verbose=True)

# Hold out 25% of the data for testing
trainset, testset = train_test_split(data, test_size=.25)

blo = BaselineOnly()
blo.fit(trainset)
blo.predict(196, 302, 4, verbose=True)
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Estimating biases using als...
Evaluating RMSE, MAE of algorithm BaselineOnly on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    0.9470  0.9402  0.9467  0.9442  0.9418  0.9440  0.0027  
MAE (testset)     0.7528  0.7427  0.7502  0.7480  0.7474  0.7482  0.0034  
Fit time          0.23    0.28    0.33    0.25    0.24    0.27    0.04    
Test time         0.28    0.32    0.25    0.19    0.24    0.26    0.04    
Estimating biases using als...

3.8799791205908227
import pandas as pd
from surprise import NormalPredictor
from surprise import Dataset
from surprise import Reader
from surprise.model_selection import cross_validate

# Create a toy ratings dataset
ratings_dict = {'itemID': [1, 1, 1, 2, 2],
                'userID': [9, 32, 2, 45, 'user_foo'],
                'rating': [3, 2, 4, 3, 1]}
df = pd.DataFrame(ratings_dict)

# The rating scale is 1 to 5
reader = Reader(rating_scale=(1, 5))

data = Dataset.load_from_df(df[['userID', 'itemID', 'rating']], reader)
print(type(data))

# Evaluate
cross_validate(NormalPredictor(), data, cv=3)
<class 'surprise.dataset.DatasetAutoFolds'>

{'fit_time': (0.0, 0.0, 0.0),
 'test_mae': array([1.82749483, 1.36961054, 1.08665964]),
 'test_rmse': array([2.42042007, 1.3756825 , 1.08665964]),
 'test_time': (0.0, 0.0009999275207519531, 0.0)}

1.3 Use cross-validation iterators

For cross-validation, the cross_validate() function does all the hard work for us. For finer control, however, we can also instantiate a cross-validation iterator and make predictions on each split using the iterator's split() method and the algorithm's test() method.

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import KFold

# Load the data
data = Dataset.load_builtin('ml-100k')

# Define a 3-fold cross-validation iterator
kf = KFold(n_splits=3)

algo = SVD()

for trainset, testset in kf.split(data):

    # Train and test the algorithm
    algo.fit(trainset)
    predictions = algo.test(testset)

    # Evaluate
    accuracy.rmse(predictions, verbose=True)  # verbose: if True, print the result
RMSE: 0.9460
RMSE: 0.9494
RMSE: 0.9457

The movielens-100K dataset already provides 5 predefined train/test file pairs (u1.base, u1.test ... u5.base, u5.test).
Surprise can handle this case by using a surprise.model_selection.split.PredefinedKFold object:

import os

from surprise import SVD
from surprise import Dataset
from surprise import Reader
from surprise import accuracy
from surprise.model_selection import PredefinedKFold

files_dir = os.path.expanduser('~/.surprise_data/ml-100k/ml-100k/')

reader = Reader('ml-100k')

# [(u1.base, u1.test), (u2.base, u2.test), ... (u5.base, u5.test)]
train_file = files_dir + 'u%d.base'
test_file = files_dir + 'u%d.test'
folds_files = [(train_file % i, test_file % i) for i in (1, 2, 3, 4, 5)]

data = Dataset.load_from_folds(folds_files, reader=reader)
pkf = PredefinedKFold()

algo = SVD()

for trainset, testset in pkf.split(data):

    algo.fit(trainset)
    predictions = algo.test(testset)

    accuracy.rmse(predictions, verbose=True)
# print(predictions)

1.4 Tune algorithm parameters with GridSearchCV

The cross_validate() function reports accuracy metrics of a cross-validation procedure for a given set of parameters.
If you want to know which combination of parameters yields the best results, the GridSearchCV class solves the problem.
Given a dict of parameters, this class exhaustively tries all combinations and reports the best parameters for any accuracy measure (averaged over the different splits).

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import GridSearchCV

data = Dataset.load_builtin('ml-100k')

param_grid = {'n_epochs': [5, 10], 'lr_all': [0.002, 0.005],
              'reg_all': [0.4, 0.6]}
gs = GridSearchCV(SVD, param_grid, measures=['rmse', 'mae'], cv=3)

gs.fit(data)

print(gs.best_score['rmse'])   # best RMSE score

print(gs.best_params['rmse'])  # combination of parameters that gave the best RMSE

import pandas as pd
results_df = pd.DataFrame.from_dict(gs.cv_results)
results_df
0.9642869135146698
{'reg_all': 0.4, 'n_epochs': 10, 'lr_all': 0.005}
mean_fit_time mean_test_mae mean_test_rmse mean_test_time param_lr_all param_n_epochs param_reg_all params rank_test_mae rank_test_rmse split0_test_mae split0_test_rmse split1_test_mae split1_test_rmse split2_test_mae split2_test_rmse std_fit_time std_test_mae std_test_rmse std_test_time
0 1.496731 0.806236 0.997472 0.511683 0.002 5 0.4 {'reg_all': 0.4, 'n_epochs': 5, 'lr_all': 0.002} 7 7 0.807283 0.997869 0.807397 0.999162 0.804027 0.995386 0.034429 0.001562 0.001567 0.062379
1 1.403456 0.782359 0.974123 0.497358 0.005 5 0.4 {'reg_all': 0.4, 'n_epochs': 5, 'lr_all': 0.005} 2 2 0.783014 0.974045 0.784169 0.976183 0.779894 0.972142 0.003297 0.001806 0.001650 0.062849
2 2.811914 0.786120 0.978227 0.492694 0.002 10 0.4 {'reg_all': 0.4, 'n_epochs': 10, 'lr_all': 0.002} 4 4 0.786966 0.978410 0.787427 0.979666 0.783967 0.976606 0.003398 0.001534 0.001256 0.064478
3 2.794590 0.773040 0.964287 0.537333 0.005 10 0.4 {'reg_all': 0.4, 'n_epochs': 10, 'lr_all': 0.005} 1 1 0.773070 0.963666 0.775167 0.966541 0.770884 0.962653 0.008335 0.001749 0.001647 0.012490
4 1.410455 0.814898 1.003614 0.484698 0.002 5 0.6 {'reg_all': 0.6, 'n_epochs': 5, 'lr_all': 0.002} 8 8 0.816255 1.004197 0.815952 1.005326 0.812487 1.001319 0.005308 0.001710 0.001687 0.060862
5 1.470082 0.793487 0.983101 0.542994 0.005 5 0.6 {'reg_all': 0.6, 'n_epochs': 5, 'lr_all': 0.005} 5 5 0.794289 0.983202 0.795240 0.985284 0.790931 0.980816 0.023524 0.001848 0.001825 0.058165
6 2.980475 0.796703 0.986454 0.527671 0.002 10 0.6 {'reg_all': 0.6, 'n_epochs': 10, 'lr_all': 0.002} 6 6 0.797903 0.986878 0.797934 0.988105 0.794272 0.984379 0.087440 0.001719 0.001550 0.018768
7 2.823572 0.784945 0.974213 0.494693 0.005 10 0.6 {'reg_all': 0.6, 'n_epochs': 10, 'lr_all': 0.005} 3 3 0.785202 0.973794 0.787113 0.976659 0.782519 0.972187 0.003396 0.001884 0.001850 0.057241
# Use the algorithm that yields the best RMSE
algo = gs.best_estimator['rmse']
algo.fit(data.build_full_trainset())
<surprise.prediction_algorithms.matrix_factorization.SVD at 0x1f58bf98940>
algo.predict(193, 302, 4, verbose=True)
user: 193        item: 302        r_ui = 4.00   est = 3.53   {'was_impossible': False}

Prediction(uid=193, iid=302, r_ui=4, est=3.52986, details={'was_impossible': False})

1.5 Command line usage

Surprise can also be used directly from the command line:

surprise -algo SVD -params "{'n_epochs': 5, 'verbose': True}" -load-builtin ml-100k -n-folds 3

surprise -h

2 Using prediction algorithms

Surprise provides a set of built-in algorithms. All algorithms derive from the AlgoBase base class, where key methods such as predict(), fit() and test() are implemented. The list and details of the available prediction algorithms can be found in the prediction_algorithms package documentation.

Each algorithm is part of the global Surprise namespace, so you only need to import its name from the Surprise package:

from surprise import KNNBasic
algo = KNNBasic()

Some of these algorithms can use baseline estimates, and some can use a similarity measure.

2.1 Baselines estimates configuration

Baselines b_u and b_i are estimated by minimizing the following regularized squared error:

\sum_{r_{ui} \in R_{train}} \left(r_{ui} - (\mu + b_u + b_i)\right)^2 + \lambda \left(b_u^2 + b_i^2\right)

Baselines can be estimated in two different ways:

Using Stochastic Gradient Descent (SGD).

Using Alternating Least Squares (ALS).

print('Using ALS')
bsl_options = {'method': 'als',
               'n_epochs': 5,
               'reg_u': 12,
               'reg_i': 5
               }
algo = BaselineOnly(bsl_options=bsl_options)
Using ALS
print('Using SGD')
bsl_options = {'method': 'sgd',
               'learning_rate': .00005,
               }
algo = BaselineOnly(bsl_options=bsl_options)
Using SGD
bsl_options = {'method': 'als',
               'n_epochs': 20,
               }
sim_options = {'name': 'pearson_baseline'}
algo = KNNBasic(bsl_options=bsl_options, sim_options=sim_options)

2.2 Similarity measure configuration

Many algorithms use a similarity measure to estimate ratings. They are configured in a way similar to baseline estimates: just pass a sim_options argument at the creation of the algorithm. This argument is a dict with the following (all optional) keys:

'name': the name of the similarity measure, as defined in the similarities module. Default is 'MSD'.

'user_based': whether similarities are computed between users or between items. This has a huge impact on the performance of a prediction algorithm. Default is True.

'min_support': the minimum number of common items (when 'user_based' is True) or the minimum number of common users (when 'user_based' is False) for the similarity not to be zero. Simply put, if |I_uv| < min_support then sim(u, v) = 0. The same goes for items.

'shrinkage': the shrinkage parameter to apply (only relevant for the pearson_baseline similarity). Default is 100.

sim_options = {'name': 'cosine',
               'user_based': False  # compute  similarities between items
               }
algo = KNNBasic(sim_options=sim_options)
sim_options = {'name': 'pearson_baseline',
               'shrinkage': 0  # no shrinkage
               }
algo = KNNBasic(sim_options=sim_options)

3 How to build your own prediction algorithm

How to build a custom prediction algorithm with Surprise.

Creating your own prediction algorithm is pretty simple: an algorithm is nothing more than a class derived from AlgoBase that has an estimate() method.
This is the method called by predict(). It takes an inner user id and an inner item id, and returns the estimated rating.

from surprise import AlgoBase
from surprise import Dataset
from surprise.model_selection import cross_validate
import numpy as np

class MyOwnAlgorithm(AlgoBase):

    def __init__(self):

        AlgoBase.__init__(self)

    def fit(self, trainset):

        AlgoBase.fit(self, trainset)

        self.the_mean = np.mean([r for (_, _, r) in
                                 self.trainset.all_ratings()])

        return self

    def estimate(self, u, i):

        sum_means = self.trainset.global_mean
        div = 1

        if self.trainset.knows_user(u):
            sum_means += np.mean([r for (_, r) in self.trainset.ur[u]])
            div += 1
        if self.trainset.knows_item(i):
            sum_means += np.mean([r for (_, r) in self.trainset.ir[i]])
            div += 1

        return sum_means / div

data = Dataset.load_builtin('ml-100k')
algo = MyOwnAlgorithm()

cross_validate(algo, data, verbose=True)
Evaluating RMSE, MAE of algorithm MyOwnAlgorithm on 5 split(s).

                  Fold 1  Fold 2  Fold 3  Fold 4  Fold 5  Mean    Std     
RMSE (testset)    1.0179  1.0165  1.0175  1.0216  1.0156  1.0178  0.0021  
MAE (testset)     0.8380  0.8356  0.8376  0.8414  0.8364  0.8378  0.0020  
Fit time          0.04    0.06    0.06    0.07    0.08    0.06    0.01    
Test time         2.94    2.86    2.95    3.05    3.05    2.97    0.07    

{'fit_time': (0.03598380088806152,
  0.06396150588989258,
  0.05696725845336914,
  0.06996297836303711,
  0.07695245742797852),
 'test_mae': array([0.83803386, 0.83556254, 0.83764556, 0.84141284, 0.83639388]),
 'test_rmse': array([1.01792507, 1.01651414, 1.0175074 , 1.02157154, 1.01555266]),
 'test_time': (2.9401426315307617,
  2.862196445465088,
  2.9531378746032715,
  3.045079231262207,
  3.051081657409668)}

When a prediction is impossible:

from surprise import PredictionImpossible
class MyOwnAlgorithm(AlgoBase):

    def __init__(self, sim_options={}, bsl_options={}):

        AlgoBase.__init__(self, sim_options=sim_options,
                          bsl_options=bsl_options)

    def fit(self, trainset):

        AlgoBase.fit(self, trainset)

        self.bu, self.bi = self.compute_baselines()
        self.sim = self.compute_similarities()

        return self

    def estimate(self, u, i):

        if not (self.trainset.knows_user(u) and self.trainset.knows_item(i)):
            raise PredictionImpossible('User and/or item is unknown.')

        # Compute similarities between u and v, where v describes all other
        # users that have rated item i
        neighbors = [(v, self.sim[u, v]) for (v, r) in self.trainset.ir[i]]
        # Sort these neighbors by similarity
        neighbors = sorted(neighbors, key=lambda x: x[1], reverse=True)

        print('The 3 nearest neighbors of user', str(u), 'are:')
        for v, sim_uv in neighbors[:3]:
            print('user {0:} with sim {1:1.2f}'.format(v, sim_uv))

4 prediction_algorithms package

https://surprise.readthedocs.io/en/stable/prediction_algorithms_package.html

5 The model_selection package

https://surprise.readthedocs.io/en/stable/model_selection.html

Cross-validation iterators (they must be instantiated before use); a short usage sketch follows this list.

KFold A basic cross-validation iterator.

RepeatedKFold Repeated KFold cross validator.

ShuffleSplit A basic cross-validation iterator with random trainsets and testsets.

LeaveOneOut Cross-validation iterator where each user has exactly one rating in the testset.

PredefinedKFold A cross-validation iterator to when a dataset has been loaded with the load_from_folds method.
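
As an example, here is a minimal sketch of my own (assuming the built-in ml-100k dataset) using the LeaveOneOut iterator, so that each user has exactly one rating in each testset:

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import LeaveOneOut

data = Dataset.load_builtin('ml-100k')

# Each user gets exactly one rating in the testset of every split
loo = LeaveOneOut(n_splits=3, random_state=0)

algo = SVD()
for trainset, testset in loo.split(data):
    algo.fit(trainset)
    predictions = algo.test(testset)
    accuracy.rmse(predictions, verbose=True)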

Cross validation

surprise.model_selection.validation.cross_validate(algo, data, measures=[u'rmse', u'mae'], cv=None, return_train_measures=False, n_jobs=1, pre_dispatch=u'2*n_jobs', verbose=False)

Parameter search

surprise.model_selection.search.RandomizedSearchCV(algo_class, param_distributions, n_iter=10, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2*n_jobs', random_state=None, joblib_verbose=0)

surprise.model_selection.search.GridSearchCV(algo_class, param_grid, measures=[u'rmse', u'mae'], cv=None, refit=False, return_train_measures=False, n_jobs=1, pre_dispatch=u'2*n_jobs', joblib_verbose=0)
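
RandomizedSearchCV is used the same way as GridSearchCV, except that parameter combinations are sampled rather than tried exhaustively. A minimal sketch of my own, reusing SVD parameters similar to those in section 1.4:

from surprise import SVD
from surprise import Dataset
from surprise.model_selection import RandomizedSearchCV

data = Dataset.load_builtin('ml-100k')

# Plain lists are sampled uniformly
param_distributions = {'n_epochs': [5, 10, 20],
                       'lr_all': [0.002, 0.005, 0.01],
                       'reg_all': [0.02, 0.1, 0.4]}

rs = RandomizedSearchCV(SVD, param_distributions, n_iter=5,
                        measures=['rmse', 'mae'], cv=3, random_state=0)
rs.fit(data)

print(rs.best_score['rmse'])
print(rs.best_params['rmse'])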

6 similarities module

https://surprise.readthedocs.io/en/stable/similarities.html#

cosine: Compute the cosine similarity between all pairs of users (or items).

msd: Compute the Mean Squared Difference similarity between all pairs of users (or items).

pearson: Compute the Pearson correlation coefficient between all pairs of users (or items).

pearson_baseline: Compute the (shrunk) Pearson correlation coefficient between all pairs of users (or items) using baselines for centering instead of means.
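
These measures are normally selected through the sim_options dict shown in section 2.2. After fitting, the k-NN algorithms keep the computed similarity matrix in their sim attribute, indexed by inner ids; the following is a small sketch of my own (assuming raw user ids '196' and '186', which exist in ml-100k):

from surprise import KNNBasic
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

# User-user Pearson similarity
algo = KNNBasic(sim_options={'name': 'pearson', 'user_based': True})
algo.fit(trainset)  # computes and stores the similarity matrix

# Look up the similarity between two users via their inner ids
u = trainset.to_inner_uid('196')
v = trainset.to_inner_uid('186')
print(algo.sim[u, v])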

7 accuracy module

https://surprise.readthedocs.io/en/stable/accuracy.html

rmse: Compute RMSE (Root Mean Squared Error).

mae: Compute MAE (Mean Absolute Error).

fcp: Compute FCP (Fraction of Concordant Pairs).
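
All three functions take a list of Prediction objects (as returned by test()) and return a float. A minimal sketch of my own:

from surprise import SVD
from surprise import Dataset
from surprise import accuracy
from surprise.model_selection import train_test_split

data = Dataset.load_builtin('ml-100k')
trainset, testset = train_test_split(data, test_size=.25)

algo = SVD()
algo.fit(trainset)
predictions = algo.test(testset)

# Each metric is printed and returned as a float
accuracy.rmse(predictions)
accuracy.mae(predictions)
accuracy.fcp(predictions)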

8 dataset module

https://surprise.readthedocs.io/en/stable/dataset.html

Dataset.load_builtin: Load a built-in dataset.

Dataset.load_from_file: Load a dataset from a (custom) file.

Dataset.load_from_folds: Load a dataset where folds (for cross-validation) are predefined by some files.

Dataset.folds: Generator function to iterate over the folds of the Dataset.

DatasetAutoFolds.split: Split the dataset into folds for future cross-validation.

9 Trainset class

https://surprise.readthedocs.io/en/stable/trainset.html

surprise.Trainset(ur, ir, n_users, n_items, n_ratings, rating_scale, offset, raw2inner_id_users, raw2inner_id_items)

global_mean

The mean of all ratings μ.

all_items()

Generator function to iterate over all items.
Yields: Inner id of items.

all_ratings()

Generator function to iterate over all ratings.
Yields: A tuple (uid, iid, rating) where ids are inner ids (see this note).

all_users()

Generator function to iterate over all users.
Yields: Inner id of users.

build_anti_testset(fill=None)

Return a list of ratings that can be used as a testset in the test() method.
The ratings are all the ratings that are not in the trainset, i.e. all the ratings rui where the user u is known, the item i is known, but the rating rui is not in the trainset. As rui is unknown, it is either replaced by the fill value or assumed to be equal to the mean of all ratings global_mean.
Parameters: fill (float) – The value to fill unknown ratings. If None the global mean of all ratings global_mean will be used.
Returns: A list of tuples (uid, iid, fill) where ids are raw ids.
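
A common use of build_anti_testset() is to predict a rating for every (user, item) pair that does not appear in the trainset, e.g. as the first step of generating recommendations. A minimal sketch of my own (note that on ml-100k this produces roughly 1.5 million pairs):

from surprise import SVD
from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

algo = SVD()
algo.fit(trainset)

# All (user, item) pairs without a rating in the trainset; with fill=None the
# unknown ratings are filled with the global mean
anti_testset = trainset.build_anti_testset(fill=None)
predictions = algo.test(anti_testset)
print(predictions[0])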

build_testset()

Return a list of ratings that can be used as a testset in the test() method.
The ratings are all the ratings that are in the trainset, i.e. all the ratings returned by the all_ratings() generator. This is useful in cases where you want to test your algorithm on the trainset.

global_mean

Return the mean of all ratings.
It’s only computed once.

knows_item(iid)

Indicate if the item is part of the trainset.
An item is part of the trainset if the item was rated at least once.
Parameters: iid (int) – The (inner) item id. See this note.
Returns: True if item is part of the trainset, else False.

knows_user(uid)

Indicate if the user is part of the trainset.
A user is part of the trainset if the user has at least one rating.
Parameters: uid (int) – The (inner) user id. See this note.
Returns: True if user is part of the trainset, else False.

to_inner_iid(riid)

Convert an item raw id to an inner id.
Parameters: riid (str) – The item raw id.
Returns: The item inner id.
Return type: int
Raises: ValueError – When item is not part of the trainset.

to_inner_uid(ruid)

Convert a user raw id to an inner id.
Parameters: ruid (str) – The user raw id.
Returns: The user inner id.
Return type: int
Raises: ValueError – When user is not part of the trainset.

to_raw_iid(iiid)

Convert an item inner id to a raw id.
Parameters: iiid (int) – The item inner id.
Returns: The item raw id.
Return type: str
Raises: ValueError – When iiid is not an inner id.

to_raw_uid(iuid)

Convert a user inner id to a raw id.
Parameters: iuid (int) – The user inner id.
Returns: The user raw id.
Return type: str
Raises: ValueError – When iuid is not an inner id

Users and items have a raw id and an inner id. Some methods will use or return a raw id (e.g. the predict() method), while others will use or return an inner id.

Raw ids are the ids as defined in a rating file or in a pandas dataframe. They can be strings or numbers. Note that if the ratings were read from a file, which is the standard scenario, they are represented as strings. This is important to keep in mind when using predict() or other methods that accept raw ids as parameters.

On trainset creation, each raw id is mapped to a unique integer called the inner id, which is much more suitable for Surprise to manipulate. Conversions between raw and inner ids can be done with the to_inner_uid(), to_inner_iid(), to_raw_uid() and to_raw_iid() methods of the trainset.
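
To make the raw/inner distinction concrete, here is a small sketch of my own using raw user id '196' and raw item id '302' from ml-100k:

from surprise import Dataset

data = Dataset.load_builtin('ml-100k')
trainset = data.build_full_trainset()

# Raw ids (strings, as read from the file) map to inner ids (contiguous integers)
inner_uid = trainset.to_inner_uid('196')
inner_iid = trainset.to_inner_iid('302')
print(inner_uid, inner_iid)

# And back again
print(trainset.to_raw_uid(inner_uid), trainset.to_raw_iid(inner_iid))

# Methods like knows_user()/knows_item() expect inner ids
print(trainset.knows_user(inner_uid), trainset.knows_item(inner_iid))
print(trainset.global_mean)  # mean of all ratings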

10 Reader

https://surprise.readthedocs.io/en/stable/reader.html

surprise.reader.Reader(name=None, line_format=u'user item rating', sep=None, rating_scale=(1, 5), skip_lines=0)

name (string, optional) – If specified, a Reader for one of the built-in datasets is returned and any other parameter is ignored. Accepted values are 'ml-100k', 'ml-1m', and 'jester'. Default is None.

line_format (string) – The fields names, in the order at which they are encountered on a line. Please note that line_format is always space-separated (use the sep parameter). Default is 'user item rating'.

sep (char) – the separator between fields. Example: ';'.

rating_scale (tuple, optional) – The rating scale used for every rating. Default is (1, 5).

skip_lines (int, optional) – Number of lines to skip at the beginning of the file. Default is 0.
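
As an illustration, a Reader for a hypothetical comma-separated file with a header row could be configured as follows (my own sketch; 'ratings.csv' is a made-up file name):

from surprise import Dataset
from surprise import Reader

# One 'user,item,rating' triple per line, preceded by a header row to skip
reader = Reader(line_format='user item rating', sep=',',
                rating_scale=(1, 5), skip_lines=1)

data = Dataset.load_from_file('ratings.csv', reader=reader)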

11 evaluate module

https://surprise.readthedocs.io/en/stable/evaluate.html

surprise.evaluate.GridSearch(algo_class, param_grid, measures=[u'rmse', u'mae'], n_jobs=1, pre_dispatch=u'2*n_jobs', seed=None, ver