Source code for ctpfrec

import pandas as pd, numpy as np
from scipy.sparse import issparse
import multiprocessing, os, warnings
from . import cy_double, cy_float, _check_openmp
import ctypes, types, inspect
from hpfrec import HPF, cython_loops_float, cython_loops_double
### TODO: get rid of this, use loc/iloc and make copies if needed
pd.options.mode.chained_assignment = None


class CTPF:
    """
    Collaborative Topic Poisson Factorization

    Model for recommending items based on probabilistic Poisson factorization on sparse count
    data (e.g. number of times a user viewed different items), along with count data on item
    attributes (e.g. bag-of-words representation of text descriptions of items), using
    mean-field variational inference with coordinate ascent. Can also accommodate user
    attributes in addition to item attributes (see the note below for more information).

    Can use different stopping criteria for the optimization procedure:

    1) Run for a fixed number of iterations (stop_crit='maxiter').
    2) Calculate the Poisson log-likelihood every N iterations (stop_crit='train-llk' and
       check_every) and stop once {1 - curr/prev} is below a certain threshold (stop_thr).
    3) Calculate the Poisson log-likelihood in a user-provided validation set
       (stop_crit='val-llk', val_set and check_every) and stop once {1 - curr/prev} is below
       a certain threshold. For this criterion, you might want to lower the default threshold
       (see Note).
    4) Check the difference in the Theta matrix after every N iterations
       (stop_crit='diff-norm', check_every) and stop once the *l2-norm* of this difference is
       below a certain threshold (stop_thr). Note that this is **not a percent** difference as
       it is for the log-likelihood criteria, so you should put a larger value than the
       default here. This is a much faster criterion to calculate and is recommended for
       larger datasets.

    If passing reindex=True, it will internally reindex all user and item IDs. Your data will
    not require reindexing if the IDs for users, items, and words (or other countable item
    attributes) in counts_df and words_df meet the following criteria:

    1) Are all integers.
    2) Start at zero.
    3) Don't have any enumeration gaps, i.e. if there is a user '4', user '3' must also be there.

    If you only want to obtain the fitted parameters and use your own API later for
    recommendations, you can pass produce_dicts=False and pass a folder where to save them in
    CSV format (they are also available as NumPy arrays in this object's Theta, Eta and
    Epsilon attributes). Otherwise, the model will create Python dictionaries with entries for
    each user, item, and word, which can take quite a bit of RAM. These can speed up
    predictions later through this package's API.

    Passing verbose=True will also print RMSE (root mean squared error) at each iteration.
    For slightly better speed, pass verbose=False once you know what a good threshold should
    be for your data.

    Note
    ----
    DataFrames and arrays passed to '.fit' might be modified inplace - if this is a problem
    you'll need to pass a copy to them, e.g. 'counts_df=counts_df.copy()'.

    Note
    ----
    If 'check_every' is not None and stop_crit is not 'diff-norm', it will, every N
    iterations, calculate the Poisson log-likelihood of the data. By default this is NOT the
    full likelihood (it excludes a constant that depends on the data but not on the
    parameters and which is quite slow to compute). It is calculated this way by default
    because the full version can otherwise result in overflow (the number becomes too big for
    the data type), but be aware that, without this constant, the log-likelihood can turn
    positive and mess with the likelihood-based stopping criteria.

    Note
    ----
    If you pass a validation set, it will calculate the Poisson log-likelihood *of the
    non-zero observations only*, rather than the complete Poisson log-likelihood that also
    includes the combinations of users and items not present in the data (assumed to be
    zero), thus it's more likely that you might see positive numbers here. Compared to ALS,
    iterations from this algorithm are a lot faster to compute, so don't be afraid of passing
    large values for maxiter.

    Note
    ----
    In some unlucky cases, the parameters will become NaN in the first iteration, in which
    case you will see weird values for the log-likelihood and RMSE. If this happens, try
    again with a different random seed.

    Note
    ----
    As this model fits the parameters to both user-item interactions and item attributes, you
    might see the Poisson log-likelihood decreasing during iterations. This doesn't
    necessarily mean that it failed, but in such cases you might want to try increasing K or
    decreasing the number of item attributes.

    Note
    ----
    Can also fit a model that includes user attributes, in the same format as the
    bag-of-words representation of items - in this case the new variables will be called
    Omega (user-factor matrix), Kappa (user_attribute-factor matrix), and X (multinomial for
    user attributes). The Y variables will be associated as follows:
    Ya (Omega, Theta) - Yb (Eta, Theta) - Yc (Omega, Epsilon) - Yd (Eta, Epsilon).
    The priors for these new parameters are taken to be the same ones as their item
    counterparts.

    Parameters
    ----------
    k : int
        Number of latent factors (topics) to use.
    a : float
        Shape parameter for the word-topic prior (Beta). If fitting the model with user
        attributes, will also be taken as prior for the user_attribute-factor matrix.
    b : float
        Rate parameter for the word-topic prior (Beta). If fitting the model with user
        attributes, will also be taken as prior for the user_attribute-factor matrix.
    c : float
        Shape parameter for the document-topic prior (Theta).
    d : float
        Rate parameter for the document-topic prior (Theta).
    e : float
        Shape parameter for the user-topic prior (Eta).
    f : float
        Rate parameter for the user-topic prior (Eta).
    g : float
        Shape parameter for the document-topic offset prior (Epsilon). If fitting the model
        with user attributes, will also be taken as prior for the user-factor offset matrix.
    h : float
        Rate parameter for the document-topic offset prior (Epsilon). If fitting the model
        with user attributes, will also be taken as prior for the user-factor offset matrix.
    stop_crit : str, one of 'maxiter', 'train-llk', 'val-llk', 'diff-norm'
        Stopping criterion for the optimization procedure.
    stop_thr : float
        Threshold for the proportional increase in log-likelihood, or for the l2-norm of the
        difference between matrices.
    reindex : bool
        Whether to reindex data internally. Will be forced to 'False' if passing sparse COO
        matrices to 'fit'.
    miniter : int or None
        Minimum number of iterations for which to run the optimization procedure. When using
        likelihood as a stopping criterion, note that as the model is fit to both user-item
        interactions and item attributes, the Poisson likelihood for the interactions alone
        can decrease during iterations while the complete model likelihood increases (and
        this Poisson likelihood might start increasing again later). Thus, a minimum number
        of iterations avoids stopping when the Poisson likelihood decreases.
    maxiter : int or None
        Maximum number of iterations for which to run the optimization procedure. This
        corresponds to epochs when fitting in batches of users. Recommended to use a lower
        number when passing a batch size.
    check_every : None or int
        Calculate the log-likelihood every N iterations.
    verbose : bool
        Whether to print convergence messages.
    use_float : bool
        Whether to use the C float type (typically ``np.float32``). Using float types (as
        compared to double) results in less memory usage and faster operations, but it has
        less numeric precision and the results will be slightly worse compared to using
        double. If passing ``False``, will use C double (typically ``np.float64``).
    random_seed : int or None
        Random seed to use when starting the parameters.
    ncores : int
        Number of cores to use to parallelize computations. If set to -1, will use the
        maximum available on the computer.
    initialize_hpf : bool
        Whether to initialize the Theta and Beta matrices using hierarchical Poisson
        factorization on the bag-of-words representation only (words_df passed to 'fit').
        This can provide better results than a random initialization, but it takes extra
        time to fit.
    standardize_items : bool
        Whether to standardize the item bag-of-words representations passed to '.fit'
        ('words_df') so that all the items have the same sum of words (set to the mean sum
        of word counts across items). Will also apply to user attributes if passing them
        to '.fit'.
    rescale_factors : bool
        Whether to rescale the resulting item-factor matrix (Theta) after fitting so that its
        rows sum to 1. This decreases the model's tendency to consider items with more words
        as more popular, but it can also result in much worse rankings. Will also be applied
        to user factors if fitting the model with user attributes. (Not recommended)
    missing_items : str, one of 'include' or 'exclude'
        If there are items in the 'words_df' object to be passed to '.fit' that are not
        present in 'counts_df', shall they be considered as having all their user-item
        interactions with a count of zero (when passing 'include'), or shall they be
        considered to be censored (e.g. missing because the model is fit to bag-of-words of
        articles that are not available to users). In the second case, these items will be
        included when initializing with 'initialize_hpf=True', but will be excluded
        afterwards; note that the model won't be able to make predictions for these items,
        but you can add them later using the '.add_items' method. Same for user attributes
        when fitting the model with user side information. Note that this **only applies to
        extra items/users with side info but no interactions**, while any user-item
        combination not present in the data is always treated as a zero count. Forced to
        'include' when passing 'initialize_hpf=False' or 'reindex=False'.
    step_size : None or function -> float in (0, 1)
        Function that takes the iteration/epoch number as input (starting at zero) and
        produces the step size for the updates to Beta and Theta. When initializing these
        through hierarchical Poisson factorization, it can be beneficial to have the first
        steps change them less or not at all, while the user and offset matrices start
        getting shaped towards these initialized topics, with later iterations being allowed
        to change them more (so it starts at zero and tends towards 1 as the iteration
        number increases). When using 'stop_crit=diff-norm', it will not stop while
        step_size(iteration) <= 1e-2. You can also pass a function that always returns zero
        if you do not wish to update the Theta and Beta parameters obtained from HPF, but in
        that case you'll also need to change the stopping criterion. Will also apply to the
        Kappa parameter in the model with user attributes. Forced to None when passing
        'initialize_hpf=False'.
    allow_inconsistent_math : bool
        Whether to allow inconsistent floating-point math (producing slightly different
        results on each run), which allows parallelization of the updates for all of the
        shape parameters.
    full_llk : bool
        Whether to calculate the full log-likelihood, including terms that don't depend on
        the model parameters (thus are constant for a given dataset).
    keep_data : bool
        Whether to keep information about which user was associated with each item in the
        training set, so as to exclude those items later when making Top-N recommendations.
    save_folder : str or None
        Folder where to save all model parameters as csv files.
    produce_dicts : bool
        Whether to produce Python dictionaries for users and items, which are used to speed
        up the prediction API of this package. You can still predict without them, but it
        might take some additional milliseconds (or more, depending on the number of users
        and items).
    sum_exp_trick : bool
        Whether to use the sum-exp trick when scaling the multinomial parameters - that is,
        calculating them as exp(val - maxval) / sum(exp(val - maxval)) in order to avoid
        numerical overflow when there are very large numbers. For this kind of model it is
        unlikely to be required, and it adds a small overhead, but if you notice NaNs in the
        results or in the likelihood, you might give this option a try. Forced to True when
        passing 'initialize_hpf=True' or when passing user side information.
    keep_all_objs : bool
        Whether to keep intermediate objects/variables that are not necessary for
        predictions - these are: Theta_shp, Theta_rte, Beta_shp, Beta_rte, Eta_shp, Eta_rte,
        Epsilon_shp, Epsilon_rte (plus the Omega and Kappa shape/rate parameters when
        fitting with user attributes). When passing True here, the model object will have
        these extra attributes too. Without these objects, it's not possible to call
        functions that alter the model parameters given new information after it's already
        fit.

    Attributes
    ----------
    Theta : array (nitems, k)
        Item-topic matrix.
    Beta : array (nwords, k)
        Word-topic matrix. Only kept when passing 'keep_all_objs=True'.
    Eta : array (nusers, k)
        User-topic matrix.
    Epsilon : array (nitems, k)
        Item-topic offset matrix.
    user_mapping_ : array (nusers,)
        ID of the user (as passed to .fit) corresponding to each row of Eta.
    item_mapping_ : array (nitems,)
        ID of the item (as passed to .fit) corresponding to each row of Theta and Epsilon.
    word_mapping_ : array (nwords,)
        ID of the word (as passed to .fit) corresponding to each row of Beta.
    user_dict_ : dict (nusers)
        Dictionary with the mapping between user IDs (as passed to .fit) and rows of Eta.
    item_dict_ : dict (nitems)
        Dictionary with the mapping between item IDs (as passed to .fit) and rows of Theta
        and Epsilon.
    word_dict_ : dict (nwords)
        Dictionary with the mapping between word IDs (as passed to .fit) and rows of Beta.
    is_fitted : bool
        Whether the model has been fit to some data.
    niter : int
        Number of iterations for which the fitting procedure was run.

    References
    ----------
    [1] Content-based recommendations with Poisson factorization (Gopalan, P.K., Charlin, L.
and Blei, D., 2014) """ def __init__(self, k=50, a=.3, b=.3, c=.3, d=.3, e=.3, f=.3, g=.3, h=.3, stop_crit='train-llk', stop_thr=1e-3, reindex=True, miniter=25, maxiter=70, check_every=10, verbose=True, use_float=True, random_seed=None, ncores=-1, initialize_hpf=True, standardize_items=False, rescale_factors=False, missing_items='include', step_size=lambda x: 1-1/np.sqrt(x+1), allow_inconsistent_math=False, full_llk=False, keep_data=True, save_folder=None, produce_dicts=True, sum_exp_trick=False, keep_all_objs=True): ## checking input assert isinstance(k, int) if isinstance(a, int): a = float(a) if isinstance(b, int): b = float(b) if isinstance(c, int): c = float(c) if isinstance(d, int): d = float(d) if isinstance(e, int): e = float(e) if isinstance(f, int): f = float(f) if isinstance(g, int): g = float(g) if isinstance(h, int): h = float(h) assert isinstance(a, float) assert isinstance(b, float) assert isinstance(c, float) assert isinstance(d, float) assert isinstance(e, float) assert isinstance(f, float) assert isinstance(g, float) assert isinstance(h, float) assert k>0 assert a>0 assert b>0 assert c>0 assert d>0 assert e>0 assert f>0 assert g>0 assert h>0 if ncores is None: ncores = 1 if ncores < 1: ncores = multiprocessing.cpu_count() assert ncores>0 assert isinstance(ncores, int) if (ncores > 1) and not (_check_openmp.get()): msg_omp = "Attempting to use more than 1 thread, but " msg_omp += "package was built without multi-threading " msg_omp += "support - see the project's GitHub page for " msg_omp += "more information." warnings.warn(msg_omp) if random_seed is not None: assert isinstance(random_seed, int) assert stop_crit in ['maxiter', 'train-llk', 'val-llk', 'diff-norm'] assert missing_items in ['include', 'exclude'] if maxiter is not None: assert maxiter>0 assert isinstance(maxiter, int) else: if stop_crit!='maxiter': raise ValueError("If 'stop_crit' is set to 'maxiter', must provide a maximum number of iterations.") maxiter = 10**10 if miniter is not None: assert miniter >= 0 assert isinstance(miniter, int) else: miniter = 0 if check_every is not None: assert isinstance(check_every, int) assert check_every>0 assert check_every<=maxiter else: if stop_crit != 'maxiter': raise ValueError("If 'stop_crit' is not 'maxiter', must input after how many iterations to calculate it.") check_every = 0 if isinstance(stop_thr, int): stop_thr = float(stop_thr) if stop_thr is not None: assert stop_thr>0 assert isinstance(stop_thr, float) if save_folder is not None: save_folder = os.path.expanduser(save_folder) assert os.path.exists(save_folder) verbose = bool(verbose) if (stop_crit == 'maxiter') and (not verbose): check_every = 0 if step_size is not None: if not isinstance(step_size, types.FunctionType): raise ValueError("'step_size' must be a function.") if len(inspect.getfullargspec(step_size).args) < 1: raise ValueError("'step_size' must be able to take the iteration number as input.") assert (step_size(0) >= 0) and (step_size(0) <= 1) assert (step_size(1) >= 0) and (step_size(1) <= 1) ## storing these parameters self.k = k self.a = a self.b = b self.c = c self.d = d self.e = e self.f = f self.g = g self.h = h self.ncores = ncores self.allow_inconsistent_math = bool(allow_inconsistent_math) self.random_seed = random_seed self.stop_crit = stop_crit self.reindex = bool(reindex) self.keep_data = bool(keep_data) self.maxiter = maxiter self.miniter = miniter self.check_every = check_every self.stop_thr = stop_thr self.save_folder = save_folder self.verbose = verbose self.use_float = 
bool(use_float) self.produce_dicts = bool(produce_dicts) self.full_llk = bool(full_llk) self.keep_all_objs = bool(keep_all_objs) self.sum_exp_trick = bool(sum_exp_trick) self.initialize_hpf = bool(initialize_hpf) self.standardize_items = bool(standardize_items) self.rescale_factors = bool(rescale_factors) self.step_size = step_size self.missing_items = missing_items if not self.initialize_hpf: self.missing_items = 'include' else: self.sum_exp_trick = True if not self.reindex: self.missing_items = 'include' if not self.reindex: self.produce_dicts = False if self.standardize_items and self.rescale_factors: msg = "You've passed both 'standardize_items=True' and 'rescale_factors=True'." msg += " This is a weird combination and you might experience very poor quality predictions" warnings.warn(msg) ## initializing other attributes self.Theta = None self.Beta = None self.Eta = None self.Epsilon = None self.user_mapping_ = None self.item_mapping_ = None self.word_mapping_ = None self.user_dict_ = None self.item_dict_ = None self.word_dict_ = None self.is_fitted = False self.niter = None def _process_data(self, counts_df, words_df, user_df): ## TODO: refactor this function, make it mode modular self._counts_df = self._check_df(counts_df, ttl='counts_df') self._words_df = self._check_df(words_df, ttl='words_df') if user_df is not None: self._user_df = self._check_df(user_df, ttl='user_df') self.sum_exp_trick = True if self.reindex: self._counts_df['UserId'], self.user_mapping_ = pd.factorize(self._counts_df.UserId) self._counts_df['ItemId'], self.item_mapping_ = pd.factorize(self._counts_df.ItemId) ### https://github.com/pandas-dev/pandas/issues/30618 if not isinstance(self.user_mapping_, np.ndarray): self.user_mapping_ = self.user_mapping_.to_numpy() if not isinstance(self.item_mapping_, np.ndarray): self.item_mapping_ = self.item_mapping_.to_numpy() self.nusers = self.user_mapping_.shape[0] self.nitems = self.item_mapping_.shape[0] self._need_filter_beta = False self._need_filter_kappa = False words_df_id_orig = self._words_df.ItemId.to_numpy().copy() cat_ids = pd.Categorical(self._words_df.ItemId, self.item_mapping_) ids_new = cat_ids.isnull() if ids_new.sum() > 0: if (self.missing_items == 'exclude') and (not self.initialize_hpf): msg = "'words_df' contains items that are not present in 'counts_df', which will be excluded." msg += " If you still wish to include them in the model, use 'missing_items='include''." msg += " Any words that were associated with only those items will also be excluded." msg += " For information about which words are used by the model, see the attribute 'word_mapping_'." 
warnings.warn(msg) self._words_df = self._words_df.loc[~ids_new] self._filter_from_words_df = False self._words_df['ItemId'] = pd.Categorical(self._words_df.ItemId, self.item_mapping_).codes else: self._take_ix_item = self.item_mapping_.shape[0] new_ids = np.unique(words_df_id_orig[ids_new]).reshape(-1) if np.unique(words_df_id_orig).reshape(-1).shape[0] == new_ids.shape[0]: raise ValueError("'words_df' contains no items in common with 'counts_df'.") self.item_mapping_ = np.r_[self.item_mapping_, new_ids.copy()] self._words_df['ItemId'] = pd.Categorical(words_df_id_orig, self.item_mapping_).codes self.nitems = self.item_mapping_.shape[0] self._filter_from_words_df = True if (self.missing_items == 'exclude') and self.initialize_hpf: self.item_mapping_ = self.item_mapping_[:self._take_ix_item] self.nitems = self.item_mapping_.shape[0] words_out = self._words_df.WordId[ids_new].unique() words_in = self._words_df.WordId[~ids_new].unique() words_exclude = ~np.in1d(words_out, words_in) if np.sum(words_exclude) > 0: msg = "Some words are associated only with items that are in 'words_df' but not in 'counts_df'." msg += " These will be used to initialize Beta but will be excluded from the final model." msg += " If you still wish to include them in the model, use 'missing_items='include''." msg += " For information about which words are used by the model, see the attribute 'word_mapping_'." warnings.warn(msg) self._need_filter_beta = True self._take_ix_words = words_in.shape[0] self.word_mapping_ = np.r_[words_in, words_out[words_exclude]] self._words_df['WordId'] = pd.Categorical(self._words_df.WordId.to_numpy(), self.word_mapping_).codes self.word_mapping_ = self.word_mapping_[:self._take_ix_words] else: self._words_df['ItemId'] = cat_ids.codes.copy() self._filter_from_words_df = False if not self._need_filter_beta: self._words_df['WordId'], self.word_mapping_ = pd.factorize(self._words_df.WordId) ### https://github.com/pandas-dev/pandas/issues/30618 if not isinstance(self.word_mapping_, np.ndarray): self.word_mapping_ = self.word_mapping_.to_numpy() self.nwords = self.word_mapping_.shape[0] if user_df is not None: user_df_id_orig = self._user_df.UserId.to_numpy().copy() cat_ids = pd.Categorical(self._user_df.UserId, self.user_mapping_) ids_new = cat_ids.isnull() if ids_new.sum() > 0: if (self.missing_items == 'exclude') and (not self.initialize_hpf): msg = "'user_df' contains users that are not present in 'counts_df', which will be excluded." msg += " If you still wish to include them in the model, use 'missing_items='include''." msg += " Any user attributes that were associated with only those users will also be excluded." msg += " For information about which attributes are used by the model, see the attribute 'user_attr_mapping_'." 
warnings.warn(msg) self._user_df = self._user_df.loc[~ids_new] self._filter_from_user_df = False self._user_df['UserId'] = pd.Categorical(self._user_df.UserId, self.user_mapping_).codes else: self._take_ix_user = self.user_mapping_.shape[0] new_ids = np.unique(user_df_id_orig[ids_new]).reshape(-1) if np.unique(user_df_id_orig).reshape(-1).shape[0] == new_ids.shape[0]: raise ValueError("'user_df' contains no users in common with 'counts_df'.") self.user_mapping_ = np.r_[self.user_mapping_, new_ids.copy()] self._user_df['UserId'] = pd.Categorical(user_df_id_orig, self.user_mapping_).codes self.nusers = self.user_mapping_.shape[0] self._filter_from_user_df = True if (self.missing_items == 'exclude') and self.initialize_hpf: self.user_mapping_ = self.user_mapping_[:self._take_ix_user] self.nusers = self.user_mapping_.shape[0] attr_out = self._user_df.AttributeId[ids_new].unique() attr_in = self._user_df.AttributeId[~ids_new].unique() attr_exclude = ~np.in1d(attr_out, attr_in) if np.sum(attr_exclude) > 0: msg = "Some user attributes are associated only with users that are in 'user_df' but not in 'counts_df'." msg += " These will be used to initialize Kappa but will be excluded from the final model." msg += " If you still wish to include them in the model, use 'missing_items='include''." msg += " For information about which user attributes are used by the model, see the attribute 'user_attr_mapping_'." warnings.warn(msg) self._need_filter_kappa = True self._take_ix_userattr = attr_in.shape[0] self.user_attr_mapping_ = np.r_[attr_in, attr_out[attr_exclude]] self._user_df['AttributeId'] = pd.Categorical(self._user_df.AttributeId, self.user_attr_mapping_).codes self.user_attr_mapping_ = self.user_attr_mapping_[:self._take_ix_userattr] else: self._user_df['UserId'] = cat_ids.codes.copy() self._filter_from_user_df = False if not self._need_filter_kappa: self._user_df['AttributeId'], self.user_attr_mapping_ = pd.factorize(self._user_df.AttributeId) ### https://github.com/pandas-dev/pandas/issues/30618 if not isinstance(self.user_attr_mapping_, np.ndarray): self.user_attr_mapping_ = self.user_attr_mapping_.to_numpy() self.nuserattr = self.user_attr_mapping_.shape[0] self.user_mapping_ = np.array(self.user_mapping_).reshape(-1) self.item_mapping_ = np.array(self.item_mapping_).reshape(-1) self.word_mapping_ = np.array(self.word_mapping_).reshape(-1) if user_df is not None: self.user_attr_mapping_ = np.array(self.user_attr_mapping_).reshape(-1) if (self.save_folder is not None) and self.reindex: if self.verbose: print("\nSaving ID mappings...\n") pd.Series(self.user_mapping_).to_csv(os.path.join(self.save_folder, 'users.csv'), index=False) pd.Series(self.item_mapping_).to_csv(os.path.join(self.save_folder, 'items.csv'), index=False) pd.Series(self.word_mapping_).to_csv(os.path.join(self.save_folder, 'words.csv'), index=False) if user_df is not None: pd.Series(self.user_attr_mapping_).to_csv(os.path.join(self.save_folder, 'user_attr.csv'), index=False) ## when not reindexing else: if user_df is None: self.nusers = self._counts_df.UserId.max() + 1 else: self.nusers = max(self._counts_df.UserId.max(), self._user_df.UserId.max()) + 1 self.nuserattr = self._user_df.AttributeId.max() + 1 self.nitems = max(self._counts_df.ItemId.max(), self._words_df.ItemId.max()) + 1 self.nwords = self._words_df.WordId.max() + 1 self._counts_df = self._cast_df(self._counts_df, ttl='counts_df') self._words_df = self._cast_df(self._words_df, ttl='words_df') if user_df is not None: self._user_df = self._cast_df(self._user_df, 
ttl='user_df') self._has_user_df = True else: self._has_user_df = False self._save_hyperparams() return None def _filter_words_df(self): if self.reindex: if self._filter_from_words_df: self._words_df = self._words_df.loc[self._words_df.ItemId < self.nitems] self._words_df.reset_index(drop=True, inplace=True) ## this is a double chek but should be done elsewhere if (self.item_mapping_.shape[0] > self._take_ix_item) or (self.Theta_shp.shape[0] > self._take_ix_item): self.item_mapping_ = self.item_mapping_[:self._take_ix_item] self.Theta = self.Theta[:self._take_ix_item, :] self.Theta_shp = self.Theta_shp[:self._take_ix_item, :] self.Theta_rte = self.Theta_rte[:self._take_ix_item, :] self.Epsilon = self.Epsilon[:self._take_ix_item, :] self.Epsilon_shp = self.Epsilon_shp[:self._take_ix_item, :] self.Epsilon_rte = self.Epsilon_rte[:self._take_ix_item, :] def _filter_user_df(self): if self.reindex and self._has_user_df: if self._filter_from_user_df: self._user_df = self._user_df.loc[self._user_df.UserId < self.nusers] self._user_df.reset_index(drop=True, inplace=True) ## this is a double chek but should be done elsewhere if (self.user_mapping_.shape[0] > self._take_ix_user) or (self.Eta_shp.shape[0] > self._take_ix_user): self.user_mapping_ = self.user_mapping_[:self._take_ix_user] self.Eta = self.Eta[:self._take_ix_user, :] self.Eta_shp = self.Eta_shp[:self._take_ix_user, :] self.Eta_rte = self.Eta_rte[:self._take_ix_user, :] self.Omega = self.Omega[:self._take_ix_user, :] self.Omega_shp = self.Omega_shp[:self._take_ix_user, :] self.Omega_rte = self.Omega_rte[:self._take_ix_user, :] def _save_hyperparams(self): if self.save_folder is not None: with open(os.path.join(self.save_folder, "hyperparameters.txt"), "w") as pf: pf.write("a: %.3f\n" % self.a) pf.write("b: %.3f\n" % self.b) pf.write("c: %.3f\n" % self.c) pf.write("d: %.3f\n" % self.d) pf.write("e: %.3f\n" % self.e) pf.write("f: %.3f\n" % self.f) pf.write("g: %.3f\n" % self.g) pf.write("h: %.3f\n" % self.h) pf.write("k: %d\n" % self.k) if self.random_seed is not None: pf.write("random seed: %d\n" % self.random_seed) else: pf.write("random seed: None\n") def _unexpected_err_msg(self): msg = "Something went wrong. Please open an issue in GitHub indicating the function you called and the constructor parameters." raise ValueError(msg) def _filter_zero_obs(self, df, ttl='words_df', subj='item or word'): if self.stop_crit in ['maxiter', 'diff-norm']: thr = 0 else: thr = 0.9 obs_zero = df.Count.to_numpy() <= thr if obs_zero.sum() > 0: msg = "'" + ttl + "' contains observations with a count value less than one, these will be ignored." msg += " Any " + subj + " associated exclusively with zero-value observations will be excluded." msg += " If using 'reindex=False', make sure that your data still meets the necessary criteria." msg += "If you wish to use values less than 1, set a different 'stop_crit'." 
warnings.warn(msg) df = df.loc[~obs_zero] return df def _standardize_counts(self, df, col1='ItemId', col2='WordId'): if self.standardize_items: sum_by_item = df.groupby(col1)['Count'].sum() if self.is_fitted: if (col1=='UserId') and (col2=='ItemId'): target_factor = self.rescale_const_counts_df_ elif (col1=='ItemId') and (col2=='WordId'): target_factor = self.rescale_const_words_df_ elif (col1=='UserId') and (col2=='AttributeId'): target_factor = self.rescale_const_user_df_ else: self._unexpected_err_msg() else: target_factor = sum_by_item.mean() if (col1=='UserId') and (col2=='ItemId'): self.rescale_const_counts_df_ = target_factor elif (col1=='ItemId') and (col2=='WordId'): self.rescale_const_words_df_ = target_factor elif (col1=='UserId') and (col2=='AttributeId'): self.rescale_const_user_df_ = target_factor else: self._unexpected_err_msg() df = pd.merge(df, sum_by_item.to_frame().reset_index(drop=False).rename(columns={'Count':'SumCounts'})) df['Count'] = target_factor * df.Count.to_numpy() / df.SumCounts.to_numpy() df = df[[col1, col2, 'Count']] return df # take_obs = df.Count.values >= 0.85 # nitems_before = df[col1].unique().shape[0] # nwords_before = df[col2].unique().shape[0] # df = df.loc[~take_obs] # nitems_after = df[col1].unique().shape[0] # nwords_after = df[col2].unique().shape[0] # if (nitems_before - nitems_after) > 0: # warnings.warn(str(nitems_before - nitems_after) + " items discarded after filtering counts on standardized attributes.") # if (nwords_before - nwords_after) > 0: # warnings.warn(str(nwords_before - nwords_after) + " words discarded after filtering counts on standardized attributes.") def _cols_from_ttl(self, ttl): if ttl == 'counts_df': subj = 'user or item' col1 = 'UserId' col2 = 'ItemId' elif ttl == 'words_df': subj = 'item or word' col1 = 'ItemId' col2 = 'WordId' elif ttl == 'user_df': subj = 'user or user-attribute' col1 = 'UserId' col2 = 'AttributeId' else: self._unexpected_err_msg() return col1, col2, subj def _cast_df(self, df, ttl): cy = cy_float if self.use_float else cy_double col1, col2, subj = self._cols_from_ttl(ttl) df[col1] = df[col1].to_numpy().astype(cy.obj_ind_type) df[col2] = df[col2].to_numpy().astype(cy.obj_ind_type) df['Count'] = df.Count.astype(ctypes.c_float if self.use_float else ctypes.c_double if self.use_float else ctypes.c_double) return df def _check_df(self, df, ttl): col1, col2, subj = self._cols_from_ttl(ttl) if isinstance(df, np.ndarray): assert len(df.shape) > 1 assert df.shape[1] >= 3 df = pd.DataFrame(df[:, :3]) df.columns = [col1, col2, "Count"] elif isinstance(df, pd.DataFrame): assert df.shape[0] > 0 assert col1 in df.columns.to_numpy() assert col2 in df.columns.to_numpy() assert 'Count' in df.columns.to_numpy() df = df[[col1, col2, 'Count']] elif issparse(df) and (df.format == "coo"): if ttl == "counts_df": df = pd.DataFrame({ 'UserId' : df.row, 'ItemId' : df.col, 'Count' : df.data }) elif ttl == "words_df": df = pd.DataFrame({ 'ItemId' : df.row, 'WordId' : df.col, 'Count' : df.data }) elif ttl == "user_df": df = pd.DataFrame({ 'UserId' : df.row, 'AttributeId' : df.col, 'Count' : df.data }) else: self._unexpected_err_msg() else: raise ValueError("'" + ttl + "' must be a pandas data frame, a numpy array, or scipy sparse coo_matrix") if self.reindex: df = self._filter_zero_obs(df, ttl=ttl, subj=subj) if self.standardize_items and (ttl != 'counts_df'): df = self._standardize_counts(df, col1=col1, col2=col2) return df def _process_extra_df(self, df, ttl, df2=None): assert self.is_fitted assert self.keep_all_objs df = 
self._check_df(df, ttl=ttl) nobs_before = df.shape[0] col1, col2, subj = self._cols_from_ttl(ttl) subj1, temp, subj2 = subj.split() del temp if self.reindex: if ttl == 'counts_df': curr_mapping1 = self.user_mapping_ curr_mapping2 = self.item_mapping_ elif ttl == 'words_df': curr_mapping1 = self.item_mapping_ curr_mapping2 = self.word_mapping_ elif ttl == 'user_df': curr_mapping1 = self.user_mapping_ curr_mapping2 = self.user_attr_mapping_ else: self._unexpected_err_msg() df[col2] = pd.Categorical(df[col2], curr_mapping2).codes new_ids2 = df[col2].to_numpy() == -1 if new_ids2.any(): df = df.loc[~new_ids2].reset_index(drop=True) if df.shape[0] > 0: msg = "'" + ttl + "' has " + subj2 + "s that were not present in the training data." msg += " These will be ignored." warnings.warn(msg) else: raise ValueError("'" + ttl + "' must contain " + subj2 + "s from the training set.") new_ids1 = df[col1].unique() repeated = np.in1d(new_ids1, curr_mapping1) if repeated.any(): repeated = new_ids1[repeated] df = df.loc[~np.in1d(df[col1].to_numpy(), repeated)].reset_index(drop=True) if df.shape[0] > 0: msg = "'" + ttl + "' contains " + subj1 + "s that were already present in the training set." msg += " These will be ignored." warnings.warn(msg) else: raise ValueError("'" + ttl + "' doesn't contain any new " + subj1 + "s.") ## this covers the case of passing both user_df and counts_df if df2 is not None: ttl2 = "counts_df" df2 = self._check_df(df2, ttl=ttl2) df2['ItemId'] = pd.Categorical(df2[col1], self.item_mapping_).codes invalid_items = df2.ItemId == -1 if invalid_items.any() > 0: df2 = df2.loc[~invalid_items].reset_index(drop=True) if df2.shape[0] > 0: msg = "'" + ttl2 + "' has " + "item" + "s that were not present in the training data." msg += " These will be ignored." warnings.warn(msg) else: raise ValueError("'" + ttl2 + "' must contain " + "items" + "s from the training set.") new_ids11 = df2[col1].unique() repeated = np.in1d(new_ids11, curr_mapping1) if repeated.any() > 0: repeated = new_ids1[repeated] df2 = df2.loc[~np.in1d(df2[col1].to_numpy(), repeated)].reset_index(drop=True) if df2.shape[0] > 0: msg = "'" + ttl2 + "' contains " + subj1 + "s that were already present in the training set." msg += " These will be ignored." 
warnings.warn(msg) else: raise ValueError("'" + ttl2 + "' doesn't contain any new " + subj1 + "s.") new_ids1 = np.unique(np.r_[new_ids1, new_ids11]) new_ids1 = np.setdiff1d(new_ids1, curr_mapping1) new_mapping = np.r_[curr_mapping1, new_ids1] df[col1] = pd.Categorical(df[col1], new_mapping).codes df2[col1] = pd.Categorical(df2[col1], new_mapping).codes df2 = self._cast_df(df2, ttl=ttl2) else: new_ids1 = np.setdiff1d(new_ids1, curr_mapping1) new_mapping = np.r_[curr_mapping1, new_ids1] df[col1] = pd.Categorical(df[col1], new_mapping).codes else: new_mapping = None df = self._cast_df(df, ttl=ttl) if self.standardize_items and (ttl != 'counts_df') and (nobs_before > df.shape[0]): df = self._standardize_counts(df, col1=col1, col2=col2) if df2 is None: return df, new_mapping else: return df, df2, new_mapping def _store_metadata(self): self.seen = self._counts_df[['UserId', 'ItemId']].copy() self.seen.sort_values(['UserId', 'ItemId'], inplace=True) self.seen.reset_index(drop = True, inplace = True) self._n_seen_by_user = self.seen.groupby('UserId')['ItemId'].agg(lambda x: len(x)).to_numpy().astype(int) self._st_ix_user = np.cumsum(self._n_seen_by_user) self._st_ix_user = np.r_[[0], self._st_ix_user[:self._st_ix_user.shape[0]-1]] self._st_ix_user = self._st_ix_user.astype(int) self.seen = self.seen.ItemId.to_numpy().astype(int) return None def _exclude_missing_from_index(self): ## this is a double check but should be done elsewhere if self.reindex: if self._need_filter_beta: if self.Beta_shp.shape[0] > self._take_ix_words: self.Beta_shp = self.Beta_shp[:self._take_ix_words, :] del self._take_ix_words del self._need_filter_beta if self._has_user_df: if self._need_filter_kappa: if self.Kappa_shp.shape[0] > self._take_ix_userattr: self.Kappa_shp = self.Kappa_shp[:self._take_ix_userattr, :] del self._take_ix_userattr del self._need_filter_kappa return None def _initalize_parameters(self): ## TODO: make this function more modular ## TODO: improve the way of adding random noise in the initialization ## Note: the initialization here corresponds to the one used in the original HPF code rng = np.random.Generator(np.random.MT19937(seed = self.random_seed)) if self.initialize_hpf: if self.verbose: print("Initializing Theta and Beta through HPF...") print("") h = HPF(k=self.k, verbose=self.verbose, reindex=True, produce_dicts=False, stop_crit='diff-norm', stop_thr=self.stop_thr, random_seed=self.random_seed, keep_all_objs=False, use_float=self.use_float, sum_exp_trick=self.sum_exp_trick, allow_inconsistent_math=self.allow_inconsistent_math,) h.fit(self._words_df.rename(columns={'ItemId':'UserId', 'WordId':'ItemId'}).copy()) if (h.nusers == self.nitems) and (h.nitems == self.nwords): ## if using missing_items='include', it should always enter this section order_theta = np.argsort(h.user_mapping_) order_beta = np.argsort(h.item_mapping_) self.Theta_shp = h.Theta[order_theta].copy() self.Beta_shp = h.Beta[order_beta].copy() del h self.Theta_rte = self.d + self.Beta_shp.sum(axis=0, keepdims=True) self.Beta_rte = self.b + self.Theta_shp.sum(axis=0, keepdims=True) else: if self.reindex: ## from self._process_data, all items that are in words_df but not in counts_df should have ## numeration greater than the last (maximum) ID in counts_df items_take = h.user_mapping_ < self.nitems words_take = h.item_mapping_ < self.nwords else: ids_counts_df = self._counts_df.ItemId.unique() items_take = np.in1d(h.user_mapping_, ids_counts_df) ## for which words to take, need to forcibly determine intersection items_words_df = 
self._words_df.ItemId.unique() items_intersect = np.in1d(items_words_df, ids_counts_df) words_include = self._words_df.WordId.loc[np.in1d(self._words_df.ItemId, items_words_df[items_intersect])].unique() words_take = pd.Categorical(words_include, h.item_mapping_).codes self.Theta_shp = self.c + rng.uniform(0, 0.01, size=(self.nitems, self.k)) if self.use_float: self.Theta_shp = self.Theta_shp.astype(ctypes.c_float) self.Theta_shp[h.user_mapping_[items_take],:] = h.Theta[items_take] self.Beta_shp = self.a + rng.uniform(0, 0.01, size=(self.nwords, self.k)) if self.use_float: self.Beta_shp = self.Beta_shp.astype(ctypes.c_float) self.Beta_shp[h.item_mapping_[words_take],:] = h.Beta[words_take] self.Theta_rte = self.d + self.Beta_shp.sum(axis=0, keepdims=True) self.Beta_rte = self.b + self.Theta_shp.sum(axis=0, keepdims=True) if np.isnan(self.Theta_shp).sum().sum() > 0: warnings.warn("NaNs produced in initialization of Theta, will use a random start.") self.Theta_shp = self.c + rng.uniform(0, 0.01, size=(self.nitems, self.k)) self.Theta_rte = self.d + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Theta_shp = self.Theta_shp.astype(ctypes.c_float) self.Theta_rte = self.Theta_rte.astype(ctypes.c_float) if np.isnan(self.Beta_shp).sum().sum() > 0: warnings.warn("NaNs produced in initialization of Beta, will use a random start.") self.Beta_shp = self.a + rng.uniform(0, 0.01, size=(self.nwords, self.k)) self.Beta_rte = self.b + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Beta_shp = self.Beta_shp.astype(ctypes.c_float) self.Beta_rte = self.Beta_rte.astype(ctypes.c_float) if self.verbose: print("**********************************") print("") if self._has_user_df: self.Omega_shp = self.e + rng.uniform(0, 0.01, size=(self.nusers, self.k)) self.Omega_rte = self.f + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Omega_shp = self.Omega_shp.astype(ctypes.c_float) self.Omega_rte = self.Omega_rte.astype(ctypes.c_float) if self.verbose: print("Initializing Kappa through HPF...") print("") h = HPF(k=self.k, verbose=self.verbose, reindex=True, produce_dicts=False, stop_crit='diff-norm', stop_thr=self.stop_thr, random_seed=self.random_seed, keep_all_objs=False, use_float=self.use_float, sum_exp_trick=self.sum_exp_trick, allow_inconsistent_math=self.allow_inconsistent_math) h.fit(self._user_df.rename(columns={'AttributeId':'ItemId'}).copy()) if h.nitems == self.nuserattr: ## if using missing_items='include', it should always enter this section order_kappa = np.argsort(h.item_mapping_) self.Kappa_shp = h.Beta[order_kappa].copy() self.Kappa_rte = self.b + h.Theta.sum(axis=0, keepdims=True) del h else: if self.reindex: attr_take = h.user_attr_mapping_ < self.nuserattr else: users_counts_df = self._counts_df.UserId.unique() users_user_df = self._user_df.UserId.unique() users_intersect = np.in1d(users_user_df, users_counts_df) attr_include = self._user_df.AttributeId.loc[np.in1d(self._user_df.UserId, users_user_df[users_intersect])].unique() attr_take = pd.Categorical(attr_include, h.item_mapping_).codes self.Kappa_shp = self.a + rng.uniform(0, 0.01, size=(self.nuserattr, self.k)) if self.use_float: self.Kappa_shp = self.Kappa_shp.astype(ctypes.c_float) self.Kappa_shp[h.item_mapping_[attr_take],:] = h.Beta[attr_take].copy() self.Kappa_rte = self.b + h.Theta.sum(axis=0, keepdims=True) del h if np.isnan(self.Kappa_shp).sum().sum() > 0: warnings.warn("NaNs produced in initialization of Kappa, will use a random start.") self.Kappa_shp = self.a + rng.uniform(0, 0.01, 
size=(self.nuserattr, self.k)) self.Kappa_rte = self.b + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Kappa_shp = self.Kappa_shp.astype(ctypes.c_float) self.Kappa_rte = self.Kappa_rte.astype(ctypes.c_float) if self.verbose: print("**********************************") print("") else: self.Kappa_shp = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) self.Kappa_rte = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) else: self.Beta_shp = self.a + rng.uniform(0, 0.01, size=(self.nwords, self.k)) self.Theta_shp = self.c + rng.uniform(0, 0.01, size=(self.nitems, self.k)) self.Beta_rte = self.b + rng.uniform(0, 0.01, size=(1, self.k)) self.Theta_rte = self.d + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Beta_shp = self.Beta_shp.astype(ctypes.c_float) self.Theta_shp = self.Theta_shp.astype(ctypes.c_float) self.Beta_rte = self.Beta_rte.astype(ctypes.c_float) self.Theta_rte = self.Theta_rte.astype(ctypes.c_float) if self._has_user_df: self.Kappa_shp = self.a + rng.uniform(0, 0.01, size=(self.nuserattr, self.k)) self.Kappa_rte = self.b + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Kappa_shp = self.Kappa_shp.astype(ctypes.c_float) self.Kappa_rte = self.Kappa_rte.astype(ctypes.c_float) else: self.Kappa_shp = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) self.Kappa_rte = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) self.Eta_shp = self.e + rng.uniform(0, 0.01, size=(self.nusers, self.k)) self.Epsilon_shp = self.g + rng.uniform(0, 0.01, size=(self.nitems, self.k)) self.Eta_rte = self.f + rng.uniform(0, 0.01, size=(1, self.k)) self.Epsilon_rte = self.h + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Eta_shp = self.Eta_shp.astype(ctypes.c_float) self.Epsilon_shp = self.Epsilon_shp.astype(ctypes.c_float) self.Eta_rte = self.Eta_rte.astype(ctypes.c_float) self.Epsilon_rte = self.Epsilon_rte.astype(ctypes.c_float) if self._has_user_df: self.Omega_shp = self.e + rng.uniform(0, 0.01, size=(self.nusers, self.k)) self.Omega_rte = self.f + rng.uniform(0, 0.01, size=(1, self.k)) if self.use_float: self.Omega_shp = self.Omega_shp.astype(ctypes.c_float) self.Omega_rte = self.Omega_rte.astype(ctypes.c_float) else: self.Omega_shp = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) self.Omega_rte = np.empty((0,0), dtype=ctypes.c_float if self.use_float else ctypes.c_double) self._divide_parameters(add_beta=False) def _divide_parameters(self, add_beta=False): self.Theta = self.Theta_shp / self.Theta_rte self.Eta = self.Eta_shp / self.Eta_rte self.Epsilon = self.Epsilon_shp / self.Epsilon_rte self.Omega = self.Omega_shp / self.Omega_rte if add_beta: self.Beta = self.Beta_shp / self.Beta_rte self.Kappa = self.Kappa_shp / self.Kappa_rte def _rescale_parameters(self): self.Theta_shp /= self.Theta_shp.sum(axis=1, keepdims=True) if self._has_user_df: self.Omega_shp /= self.Omega_shp.sum(axis=1, keepdims=True) def _clear_internal_objects(self): del self._counts_df, self._words_df, self._user_df del self.val_set if not self._has_user_df: del self.Kappa_shp, self.Kappa_rte del self.Omega_shp, self.Omega_rte if not self.keep_all_objs: del self.Theta_shp, self.Theta_rte del self.Beta_shp, self.Beta_rte del self.Eta_shp, self.Eta_rte del self.Epsilon_shp, self.Epsilon_rte if self._has_user_df: self._M1 = self.Eta + self.Omega else: self._M1 = self.Eta self._M2 = self.Theta + self.Epsilon if not self.keep_all_objs: del 
self.Theta, self.Epsilon, self.Omega if self.reindex: if self._filter_from_words_df: del self._take_ix_item del self._filter_from_words_df if self._has_user_df: if self._filter_from_user_df: del self._take_ix_user del self._filter_from_user_df
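A small illustration of the 'step_size' schedule discussed in the constructor docstring above (the default is ``lambda x: 1 - 1/np.sqrt(x+1)``). The warm-up variant below is a hypothetical sketch, not part of the module; the function name is made up::

    import numpy as np
    from ctpfrec import CTPF  # assumes the package exposes CTPF at the top level

    def warmup_step_size(iteration):
        """Keep the HPF-initialized Theta/Beta frozen for the first 3 epochs
        (iterations 0-2), then let them move progressively more (tends towards 1)."""
        if iteration < 3:
            return 0.0
        return 1.0 - 1.0 / np.sqrt(iteration - 1)

    # step_size(0) and step_size(1) must lie in [0, 1] (checked in __init__); note that
    # with 'diff-norm' the model will not stop while step_size(iteration) <= 1e-2, so a
    # likelihood-based stopping criterion is used here instead.
    model = CTPF(k=20, initialize_hpf=True, step_size=warmup_step_size,
                 stop_crit='train-llk', check_every=10)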
    def fit(self, counts_df, words_df, user_df=None, val_set=None):
        """
        Fit Collaborative Topic Poisson Factorization model to sparse count data

        Note
        ----
        DataFrames and arrays passed to '.fit' might be modified inplace - if this is a
        problem you'll need to pass a copy to them, e.g. 'counts_df=counts_df.copy()'.

        Note
        ----
        Forcibly terminating the procedure should still keep the last calculated shape and
        rate parameter values, but is not recommended. If you need to make predictions on a
        forcibly-terminated object, set the attribute 'is_fitted' to 'True'.

        Parameters
        ----------
        counts_df : DataFrame (n_samples, 3) or sparse COO (n_users, n_items)
            User-item interaction data with one row per non-zero observation, consisting of
            triplets ('UserId', 'ItemId', 'Count'). Must contain the columns 'UserId',
            'ItemId', and 'Count'. Combinations of users and items not present are implicitly
            assumed to be zero by the model. If passing a COO matrix, will set
            ``self.reindex=False``.
        words_df : DataFrame (n_samples, 3) or sparse COO (n_items, n_words)
            Bag-of-words representation of items with one row per present unique word,
            consisting of triplets ('ItemId', 'WordId', 'Count'). Must contain the columns
            'ItemId', 'WordId', and 'Count'. Combinations of items and words not present are
            implicitly assumed to be zero. Must be of the same type ('DataFrame' or
            'coo_matrix') as ``counts_df``.
        user_df : DataFrame (n_samples, 3) or sparse COO (n_users, n_attr)
            User attributes, in the same format as 'words_df'. Must contain the columns
            'UserId', 'AttributeId', and 'Count'. Must be of the same type ('DataFrame' or
            'coo_matrix') as ``counts_df``.
        val_set : DataFrame (n_samples, 3) or sparse COO (n_users, n_items)
            Validation set on which to monitor the log-likelihood. Same format as
            ``counts_df``.

        Returns
        -------
        self : obj
            This object.
        """
        ## a basic check if self.stop_crit == 'val-llk': if val_set is None: raise ValueError("If 'stop_crit' is set to 'val-llk', must provide a validation set.") ## another basic check if (issparse(counts_df) and (counts_df.format == "coo")) or (issparse(words_df) and (words_df.format == "coo")) or (issparse(user_df) and (user_df.format == "coo")): has_coo = True else: has_coo = False if isinstance(counts_df, pd.DataFrame) or isinstance(words_df, pd.DataFrame) or isinstance(user_df, pd.DataFrame): has_df = True else: has_df = False if has_df and has_coo: raise ValueError("Cannot mix 'coo_matrix' and 'DataFrame' as inputs.") if has_coo: self.reindex = False cy = cy_float if self.use_float else cy_double cython_loops = cython_loops_float if self.use_float else cython_loops_double ## running each sub-process if self.verbose: self._print_st_msg() self._process_data(counts_df, words_df, user_df) if self.verbose: self._print_data_info() if (val_set is not None) and (self.stop_crit!='diff-norm') and (self.stop_crit!='train-llk'): HPF._process_valset(self, val_set) else: self.val_set = pd.DataFrame({ 'UserId': np.empty(0, dtype=cy.obj_ind_type), 'ItemId': np.empty(0, dtype=cy.obj_ind_type), 'Count': np.empty(0, dtype=ctypes.c_float if self.use_float else ctypes.c_double)}) if not self._has_user_df: self._user_df = pd.DataFrame({'UserId':np.empty(0, dtype=cy.obj_ind_type), 'AttributeId':np.empty(0, dtype=cy.obj_ind_type), 'Count':np.empty(0, dtype=ctypes.c_float if self.use_float else ctypes.c_double)}) if self.verbose: print("Initializing parameters...") self._initalize_parameters() self._divide_parameters(add_beta=False) if self.missing_items == 'exclude': self._exclude_missing_from_index() self._filter_words_df() if
self._has_user_df: self._filter_user_df() else: self._user_df = pd.DataFrame({'UserId':np.empty(0, dtype=cy.obj_ind_type), 'AttributeId':np.empty(0, dtype=cy.obj_ind_type), 'Count':np.empty(0, dtype=ctypes.c_float if self.use_float else ctypes.c_double)}) ## fitting the model self.niter = cy.fit_ctpf( self.Theta_shp, self.Theta_rte, self.Beta_shp, self.Beta_rte, self.Eta_shp, self.Eta_rte, self.Epsilon_shp, self.Epsilon_rte, self.Omega_shp, self.Omega_rte, self.Kappa_shp, self.Kappa_rte, self.Theta, self.Eta, self.Epsilon, self.Omega, self._user_df, self._has_user_df, self._counts_df, self._words_df, cython_loops.cast_ind_type(self.k), self.step_size, cython_loops.cast_int(self.step_size is not None), cython_loops.cast_int(self.sum_exp_trick), cython_loops.cast_real_t(self.a), cython_loops.cast_real_t(self.b), cython_loops.cast_real_t(self.c), cython_loops.cast_real_t(self.d), cython_loops.cast_real_t(self.e), cython_loops.cast_real_t(self.f), cython_loops.cast_real_t(self.g), cython_loops.cast_real_t(self.h), cython_loops.cast_int(self.ncores), cython_loops.cast_int(self.maxiter), cython_loops.cast_int(self.miniter), cython_loops.cast_int(self.check_every), self.stop_crit, self.stop_thr, cython_loops.cast_int(self.verbose), self.save_folder if self.save_folder is not None else "", cython_loops.cast_int(self.allow_inconsistent_math), cython_loops.cast_int(self.val_set.shape[0] > 0), cython_loops.cast_int(self.full_llk), self.val_set ) ## post-processing and clean-up if self.rescale_factors: self._rescale_parameters() self._divide_parameters(self.keep_all_objs) self._store_metadata() self._clear_internal_objects() if self.verbose: print("Producing Python dictionaries...") if self.produce_dicts and self.reindex: self.user_dict_ = {self.user_mapping_[i]:i for i in range(self.user_mapping_.shape[0])} self.item_dict_ = {self.item_mapping_[i]:i for i in range(self.item_mapping_.shape[0])} self.word_dict_ = {self.word_mapping_[i]:i for i in range(self.word_mapping_.shape[0])} if self._has_user_df: self.user_attr_dict = {self.user_attr_mapping_[i]:i for i in range(self.user_attr_mapping_.shape[0])} self.is_fitted = True return self
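A minimal usage sketch for the class and its ``fit`` method (not part of the module). It assumes the package exposes ``CTPF`` at the top level; the triplet data below is synthetic and purely illustrative::

    import numpy as np, pandas as pd
    from ctpfrec import CTPF  # assumed import path

    rng = np.random.default_rng(123)
    nusers, nitems, nwords = 200, 80, 500

    # User-item interactions: one row per (UserId, ItemId) pair with a positive count
    counts_df = pd.DataFrame({
        'UserId': rng.integers(nusers, size=3000),
        'ItemId': rng.integers(nitems, size=3000),
        'Count' : rng.poisson(1.0, size=3000) + 1
    }).drop_duplicates(['UserId', 'ItemId'])

    # Item bag-of-words: one row per (ItemId, WordId) pair with a positive count
    words_df = pd.DataFrame({
        'ItemId': rng.integers(nitems, size=6000),
        'WordId': rng.integers(nwords, size=6000),
        'Count' : rng.poisson(2.0, size=6000) + 1
    }).drop_duplicates(['ItemId', 'WordId'])

    model = CTPF(k=15, maxiter=40, check_every=10, verbose=False, random_seed=1)
    # The docstring warns that inputs may be modified inplace, so copies are passed here
    model.fit(counts_df.copy(), words_df.copy())   # returns the fitted object itself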
def _topN(self, user_vec, n, exclude_seen, items_pool, user=None): if items_pool is None: allpreds = - (user_vec.dot(self._M2.T)) if exclude_seen: n_ext = int(np.min([n + self._n_seen_by_user[user], self._M2.shape[0]])) rec = np.argpartition(allpreds, n_ext-1)[:n_ext] seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]] rec = np.setdiff1d(rec, seen) rec = rec[np.argsort(allpreds[rec])[:n]] if self.reindex: return self.item_mapping_[rec] else: return rec else: n = np.min([n, self._M2.shape[0]]) rec = np.argpartition(allpreds, n-1)[:n] rec = rec[np.argsort(allpreds[rec])] if self.reindex: return self.item_mapping_[rec] else: return rec else: if isinstance(items_pool, list) or isinstance(items_pool, tuple): items_pool = np.array(items_pool) if isinstance(items_pool, pd.Series): items_pool = items_pool.to_numpy() if isinstance(items_pool, np.ndarray): if len(items_pool.shape) > 1: items_pool = items_pool.reshape(-1) if self.reindex: items_pool_reind = pd.Categorical(items_pool, self.item_mapping_).codes nan_ix = (items_pool_reind == -1) if nan_ix.sum() > 0: items_pool_reind = items_pool_reind[~nan_ix] msg = "There were " + ("%d" % int(nan_ix.sum())) + " entries from 'item_pool'" msg += "that were not in the training data and will be exluded." warnings.warn(msg) del nan_ix if items_pool_reind.shape[0] == 0: raise ValueError("No items to recommend.") elif items_pool_reind.shape[0] == 1: raise ValueError("Only 1 item to recommend.") else: pass else: raise ValueError("'items_pool' must be an array.") if self.reindex: allpreds = - user_vec.dot(self._M2[items_pool_reind].T) else: allpreds = - user_vec.dot(self._M2[items_pool].T) n = np.min([n, items_pool.shape[0]]) if exclude_seen: n_ext = int(np.min([n + self._n_seen_by_user[user], items_pool.shape[0]])) rec = np.argpartition(allpreds, n_ext-1)[:n_ext] seen = self.seen[self._st_ix_user[user] : self._st_ix_user[user] + self._n_seen_by_user[user]] if self.reindex: rec = np.setdiff1d(items_pool_reind[rec], seen) allpreds = - user_vec.dot(self._M2[rec].T) return self.item_mapping_[rec[np.argsort(allpreds)[:n]]] else: rec = np.setdiff1d(items_pool[rec], seen) allpreds = - user_vec.dot(self._M2[rec].T) return rec[np.argsort(allpreds)[:n]] else: rec = np.argpartition(allpreds, n-1)[:n] return items_pool[rec[np.argsort(allpreds[rec])]]
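For reference, a self-contained sketch of the ranking strategy that ``_topN`` above relies on: score all candidate items, take the top candidates with a partial sort (``np.argpartition``), drop already-seen items, and fully sort only that short list. The helper name is made up for illustration::

    import numpy as np

    def rank_top_n(user_vec, item_factors, n, seen=None):
        """Return the indices of the n highest-scoring items for one user."""
        scores = -user_vec.dot(item_factors.T)     # negate so ascending order = best first
        n = min(n, scores.shape[0])
        n_ext = n if seen is None else int(min(n + seen.shape[0], scores.shape[0]))
        cand = np.argpartition(scores, n_ext - 1)[:n_ext]  # unordered top candidates
        if seen is not None:
            cand = np.setdiff1d(cand, seen)                # drop items the user already saw
        return cand[np.argsort(scores[cand])[:n]]          # order only the short list

    # Example with random factors
    rng = np.random.default_rng(0)
    top5 = rank_top_n(rng.random(15), rng.random((100, 15)), n=5, seen=np.array([3, 7]))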
    def topN(self, user, n=10, exclude_seen=True, items_pool=None):
        """
        Recommend Top-N items for a user

        Outputs the Top-N items according to the scores predicted by the model. Can exclude
        the items that were associated with the user in the training set, and can also
        recommend from only a subset of user-provided items.

        Parameters
        ----------
        user : obj
            User for which to recommend.
        n : int
            Number of top items to recommend.
        exclude_seen : bool
            Whether to exclude items that were associated with the user in the training set.
        items_pool : None or array
            Items to consider for recommending to the user.

        Returns
        -------
        rec : array (n,)
            Top-N recommended items.
        """
        if isinstance(n, float): n = int(n) assert isinstance(n, int) if self.reindex: if self.produce_dicts: try: user = self.user_dict_[user] except Exception: raise ValueError("Can only predict for users who were in the training set.") else: user = pd.Categorical(np.array([user]), self.user_mapping_).codes[0] if user == -1: raise ValueError("Can only predict for users who were in the training set.") if exclude_seen and not self.keep_data: raise Exception("Can only exclude seen items when passing 'keep_data=True' to .fit") return self._topN(self._M1[user], n, exclude_seen, items_pool, user)
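Continuing the synthetic example from the ``fit`` sketch above, recommendations could be requested as follows (illustrative only)::

    some_user = counts_df.UserId.iloc[0]

    # Top-10 items for this user, excluding items seen in the training data
    recs = model.topN(user=some_user, n=10, exclude_seen=True)

    # Rank only within a restricted candidate pool
    pool = counts_df.ItemId.unique()[:30]
    recs_in_pool = model.topN(user=some_user, n=5, exclude_seen=False, items_pool=pool)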
    def topN_cold(self, user_df, n=10, items_pool=None, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3):
        """
        Recommend Top-N items for a user who was not in the training set.

        Note
        ----
        This function is only available when the model is fit with user attributes.

        Note
        ----
        The data passed to this function might be modified inplace. Be sure to pass a copy of
        the 'user_df' object if this is a problem.

        Parameters
        ----------
        user_df : DataFrame (n_samples, 2)
            Attributes of the user. Must have the columns 'AttributeId' and 'Count'.
        n : int
            Number of top items to recommend.
        items_pool : None or array
            Items to consider for recommending to the user.
        maxiter : int
            Maximum number of iterations for which to run the inference procedure for this user.
        ncores : int
            Number of threads/cores to use.
        random_seed : int
            Random seed used to initialize the parameters.
        stop_thr : float
            Threshold for the l2-norm of the difference in the user factors between
            iterations, below which the inference procedure will stop.

        Returns
        -------
        rec : array (n,)
            Top-N recommended items.
        """
        if not self._has_user_df: msg = "Can only make recommendations for users without any item interactions" msg += " when fitting a model with user attributes." raise ValueError(msg) assert isinstance(user_df, pd.DataFrame) user_df['UserId'] = self.nusers user_vec, temp = self._predict_user_factors( user_df=user_df, maxiter=maxiter, ncores=ncores, random_seed=random_seed, stop_thr=stop_thr, return_ix=False, return_temp=False ) del temp user_vec /= self.Omega_rte return self._topN(user_vec.reshape(-1), n, False, items_pool, None)
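``topN_cold`` requires a model fit with user attributes. A hedged sketch, reusing the synthetic ``counts_df``/``words_df`` from the ``fit`` example above and adding a made-up ``user_df``::

    # User side information: one row per (UserId, AttributeId) pair with a positive count
    user_df = pd.DataFrame({
        'UserId'     : rng.integers(nusers, size=2000),
        'AttributeId': rng.integers(50, size=2000),
        'Count'      : rng.poisson(1.0, size=2000) + 1
    }).drop_duplicates(['UserId', 'AttributeId'])

    model_ua = CTPF(k=15, maxiter=40, verbose=False, random_seed=1)
    model_ua.fit(counts_df.copy(), words_df.copy(), user_df=user_df)

    # Attributes of a brand-new user ('UserId' is filled in internally by topN_cold)
    new_user = pd.DataFrame({'AttributeId': [0, 3, 7], 'Count': [2.0, 1.0, 4.0]})
    cold_recs = model_ua.topN_cold(new_user, n=10)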
    def predict(self, user, item):
        """
        Predict counts for combinations of users and items

        Note
        ----
        You can either pass an individual user and item, or arrays representing tuples
        (UserId, ItemId) with the combinations of users and items for which to predict
        (one row per prediction).

        Parameters
        ----------
        user : array-like (npred,) or obj
            User(s) for which to predict each item.
        item : array-like (npred,) or obj
            Item(s) to predict for each user.

        Returns
        -------
        pred : float or array (npred,)
            Predicted value(s) for the requested user-item combinations. Will be NaN when the
            user or item was not in the training data.
        """
        assert self.is_fitted cy = cy_float if self.use_float else cy_double cython_loops = cython_loops_float if self.use_float else cython_loops_double if isinstance(user, list) or isinstance(user, tuple): user = np.array(user) if isinstance(item, list) or isinstance(item, tuple): item = np.array(item) if isinstance(user, pd.Series): user = user.to_numpy() if isinstance(item, pd.Series): item = item.to_numpy() if isinstance(user, np.ndarray): if len(user.shape) > 1: user = user.reshape(-1) assert user.shape[0] > 0 if self.reindex: if user.shape[0] > 1: user = pd.Categorical(user, self.user_mapping_).codes else: if self.user_dict_ is not None: try: user = self.user_dict_[user] except Exception: user = -1 else: user = pd.Categorical(user, self.user_mapping_).codes[0] else: if self.reindex: if self.user_dict_ is not None: try: user = self.user_dict_[user] except Exception: user = -1 else: user = pd.Categorical(np.array([user]), self.user_mapping_).codes[0] user = np.array([user]) if isinstance(item, np.ndarray): if len(item.shape) > 1: item = item.reshape(-1) assert item.shape[0] > 0 if self.reindex: if item.shape[0] > 1: item = pd.Categorical(item, self.item_mapping_).codes else: if self.item_dict_ is not None: try: item = self.item_dict_[item] except Exception: item = -1 else: item = pd.Categorical(item, self.item_mapping_).codes[0] else: if self.reindex: if self.item_dict_ is not None: try: item = self.item_dict_[item] except Exception: item = -1 else: item = pd.Categorical(np.array([item]), self.item_mapping_).codes[0] item = np.array([item]) assert user.shape[0] == item.shape[0] if user.shape[0] == 1: if (user[0] == -1) or (item[0] == -1): return np.nan else: return self._M1[user].dot(self._M2[item].T).reshape(-1)[0] else: nan_entries = (user == -1) | (item == -1) if nan_entries.sum() == 0: return cython_loops.predict_arr(self._M1, self._M2, user.astype(cy.obj_ind_type), item.astype(cy.obj_ind_type), self.ncores) else: non_na_user = user[~nan_entries] non_na_item = item[~nan_entries] out = np.empty(user.shape[0], dtype=self._M1.dtype) out[~nan_entries] = cython_loops.predict_arr(self._M1, self._M2, non_na_user.astype(cy.obj_ind_type), non_na_item.astype(cy.obj_ind_type), self.ncores) out[nan_entries] = np.nan return out
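A short sketch of ``predict`` with the fitted model from the earlier example; IDs not seen during training yield NaN::

    users = counts_df.UserId.to_numpy()[:5]
    items = counts_df.ItemId.to_numpy()[:5]

    preds   = model.predict(user=users, item=items)         # array (5,) of predicted counts
    single  = model.predict(user=users[0], item=items[0])   # scalar prediction
    unknown = model.predict(user=10**6, item=items[0])      # NaN: user not in the training set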
[docs]    def predict_item_factors(self, words_df, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3, return_all=False):
        """
        Obtain latent factors/topics for items given their bag-of-words representation alone

        Note
        ----
        For better results, refit the model again including these items.

        Note
        ----
        If passing more than one item, the resulting rows will be in the sorted order of the
        item IDs from words_df (e.g. if the items are 'b', 'a', 'c', the first row will contain
        the factors for item 'a', the second for 'b', and the third for 'c').

        Note
        ----
        This function is prone to producing all-NaN values.

        Parameters
        ----------
        words_df : DataFrame (n_samples, 3)
            Bag-of-words representation of the items to predict. Same format as the one
            passed to '.fit'.
        maxiter : int
            Maximum number of iterations for which to run the inference procedure.
        ncores : int
            Number of threads/cores to use. With data for only one item, it's unlikely that using
            multiple threads would give a significant speed-up, and it might even end up making
            the function slower due to the overhead. If passing -1, it will determine the maximum
            number of cores in the system and use that.
        random_seed : int
            Random seed used to initialize parameters.
        stop_thr : float
            If the l2-norm of the difference between values of Theta_{i} between iterations is
            less than this, it will stop. Smaller values of 'k' should require smaller thresholds.
        return_all : bool
            Whether to also return the intermediate calculations. When passing True here, the
            output will be a tuple containing (Theta, intermediate calculations).

        Returns
        -------
        factors : array (nitems, k)
            Obtained latent factors/topics for these items.
        """
        new_Theta_shp, temp = self._predict_item_factors(words_df=words_df, maxiter=maxiter, ncores=ncores,
                                                         random_seed=random_seed, stop_thr=stop_thr,
                                                         return_ix=False, return_temp=return_all)
        new_Theta_shp = new_Theta_shp / self.Theta_rte
        if return_all:
            return new_Theta_shp, temp
        else:
            return new_Theta_shp
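    # Illustrative sketch (placeholders, not from the package's documentation):
    # obtaining topics for new items from bag-of-words data only. The IDs and counts
    # are made up; with reindex=False, new item IDs should continue right after the
    # ones that were present in the training data.
    #
    #   new_words = pd.DataFrame({'ItemId': [1000, 1000, 1001],
    #                             'WordId': [5, 12, 3],
    #                             'Count':  [2., 1., 4.]})
    #   theta_new = model.predict_item_factors(new_words.copy())   # -> array of shape (2, k)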
    def _predict_item_factors(self, words_df, maxiter=10, ncores=1, random_seed=1,
                              stop_thr=1e-3, return_ix=True, return_temp=False):
        ncores, maxiter, stop_thr, random_seed = self._process_pars_factors(ncores, maxiter, stop_thr,
                                                                            random_seed, err_subj="item")
        words_df, new_item_mapping = self._process_extra_df(words_df, ttl='words_df')
        words_df['ItemId'] -= self.nitems
        new_max_id = words_df.ItemId.max() + 1
        if new_max_id <= 0:
            raise ValueError("Numeration of item IDs overlaps with IDs passed to '.fit'.")

        cy = cy_float if self.use_float else cy_double
        cython_loops = cython_loops_float if self.use_float else cython_loops_double
        new_Theta_shp, temp = cy.calc_item_factors(
            words_df, new_max_id, maxiter,
            cython_loops.cast_ind_type(self.k),
            stop_thr, random_seed, ncores,
            cython_loops.cast_real_t(self.a), cython_loops.cast_real_t(self.b),
            cython_loops.cast_real_t(self.c), cython_loops.cast_real_t(self.d),
            self.Theta_rte, self.Beta_shp, self.Beta_rte
        )
        if np.isnan(new_Theta_shp).sum().sum() > 0:
            raise ValueError("NaNs encountered in result. Failed to produce latent factors.")
        if self.rescale_factors:
            new_Theta_shp /= new_Theta_shp.sum(axis=1, keepdims=True)

        if return_ix:
            return new_Theta_shp, new_item_mapping, new_max_id
        if return_temp:
            return new_Theta_shp, temp
        else:
            return new_Theta_shp, None
[docs]    def predict_user_factors(self, user_df, maxiter=10, ncores=1, random_seed=1, stop_thr=1e-3, return_all=False):
        """
        Obtain latent factors/topics for users given their attributes alone

        Note
        ----
        For better results, refit the model again including these users.

        Note
        ----
        If passing more than one user, the resulting rows will be in the sorted order of the
        user IDs from user_df (e.g. if the users are 'b', 'a', 'c', the first row will contain
        the factors for user 'a', the second for 'b', and the third for 'c').

        Note
        ----
        This function is prone to producing all-NaN values.

        Parameters
        ----------
        user_df : DataFrame (n_samples, 3)
            Attributes of the users to predict. Same format as the one passed to '.fit'.
        maxiter : int
            Maximum number of iterations for which to run the inference procedure.
        ncores : int
            Number of threads/cores to use. With data for only one user, it's unlikely that using
            multiple threads would give a significant speed-up, and it might even end up making
            the function slower due to the overhead. If passing -1, it will determine the maximum
            number of cores in the system and use that.
        random_seed : int
            Random seed used to initialize parameters.
        stop_thr : float
            If the l2-norm of the difference between values of Omega_{u} between iterations is
            less than this, it will stop. Smaller values of 'k' should require smaller thresholds.
        return_all : bool
            Whether to also return the intermediate calculations. When passing True here, the
            output will be a tuple containing (Omega, intermediate calculations).

        Returns
        -------
        factors : array (nusers, k)
            Obtained latent factors/topics for these users.
        """
        if not self._has_user_df:
            raise ValueError("Can only generate user factors from attributes when the model is fit to user attributes.")
        new_Omega_shp, temp = self._predict_user_factors(user_df=user_df, maxiter=maxiter, ncores=ncores,
                                                         random_seed=random_seed, stop_thr=stop_thr,
                                                         return_ix=False, return_temp=return_all)
        new_Omega_shp = new_Omega_shp / self.Omega_rte
        if return_all:
            return new_Omega_shp, temp
        else:
            return new_Omega_shp
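    # Illustrative sketch (placeholders, not from the package's documentation):
    # obtaining factors for new users from their attributes, which is only meaningful
    # for a model fit with 'user_df'. The IDs and counts below are made up.
    #
    #   new_attr = pd.DataFrame({'UserId':      [900, 900, 901],
    #                            'AttributeId': [1, 4, 2],
    #                            'Count':       [3., 1., 2.]})
    #   omega_new = model.predict_user_factors(new_attr.copy())   # -> array of shape (2, k)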
    def _predict_user_factors(self, user_df, maxiter=10, ncores=1, random_seed=1,
                              stop_thr=1e-3, return_ix=True, return_temp=False):
        ncores, maxiter, stop_thr, random_seed = self._process_pars_factors(ncores, maxiter, stop_thr,
                                                                            random_seed, err_subj="user")
        user_df, new_user_mapping = self._process_extra_df(user_df, ttl='user_df')
        user_df['UserId'] -= self.nusers
        new_max_id = user_df.UserId.max() + 1
        if new_max_id <= 0:
            raise ValueError("Numeration of user IDs overlaps with IDs passed to '.fit'.")

        ## Will reuse the exact same function as for adding items, so the columns need to be renamed accordingly
        user_df.rename(columns={'UserId':'ItemId', 'AttributeId':'WordId'}, inplace=True)
        cy = cy_float if self.use_float else cy_double
        cython_loops = cython_loops_float if self.use_float else cython_loops_double
        new_Omega_shp, temp = cy.calc_item_factors(
            user_df, new_max_id, maxiter,
            cython_loops.cast_ind_type(self.k),
            stop_thr, random_seed, ncores,
            cython_loops.cast_real_t(self.a), cython_loops.cast_real_t(self.b),
            cython_loops.cast_real_t(self.c), cython_loops.cast_real_t(self.d),
            self.Omega_rte, self.Kappa_shp, self.Kappa_rte
        )
        if np.isnan(new_Omega_shp).sum().sum() > 0:
            raise ValueError("NaNs encountered in result. Failed to produce latent factors.")
        if self.rescale_factors:
            new_Omega_shp /= new_Omega_shp.sum(axis=1, keepdims=True)

        if return_ix:
            user_df.rename(columns={'ItemId':'UserId', 'WordId':'AttributeId'}, inplace=True)
            return new_Omega_shp, new_user_mapping, new_max_id
        if return_temp:
            return new_Omega_shp, temp
        else:
            return new_Omega_shp, None

    def _process_pars_factors(self, ncores, maxiter, stop_thr, random_seed, err_subj="user"):
        if self.rescale_factors:
            raise ValueError("Cannot produce new factors when using 'rescale_factors=True'.")
        assert self.is_fitted
        if not self.keep_all_objs:
            msg = "Can only add " + err_subj + "s to a fitted model when called with 'keep_all_objs=True'."
            raise ValueError(msg)

        cython_loops = cython_loops_float if self.use_float else cython_loops_double
        if ncores is None:
            ncores = 1
        if ncores < 1:
            ncores = multiprocessing.cpu_count()
        assert ncores > 0
        assert isinstance(ncores, int)
        ncores = cython_loops.cast_int(ncores)

        assert maxiter > 0
        if isinstance(maxiter, float):
            maxiter = int(maxiter)
        assert isinstance(maxiter, int)
        maxiter = cython_loops.cast_int(maxiter)

        assert stop_thr > 0
        assert isinstance(stop_thr, float)
        stop_thr = cython_loops.cast_real_t(stop_thr)

        if random_seed is not None:
            if isinstance(random_seed, float):
                random_seed = int(random_seed)
            assert random_seed > 0
            assert isinstance(random_seed, int)

        return ncores, maxiter, stop_thr, random_seed
[docs]    def add_items(self, words_df, maxiter=10, stop_thr=1e-3, ncores=1, random_seed=10):
        """
        Adds new items to an already fit model

        Adds new items without refitting the model from scratch. Note that this will not
        modify any of the user or word parameters.

        For better results, refit the model from scratch including the data from these new items.

        Note
        ----
        This function is prone to producing all-NaN values. Adding both users and items to an
        already-fit model might cause very bad quality results for both.

        Parameters
        ----------
        words_df : data frame or array (n_samples, 3)
            DataFrame with the bag-of-words representation of the new items only. Must contain
            columns 'ItemId', 'WordId', 'Count'. If passing a numpy array, columns will be assumed
            to be in that order. When using 'reindex=False', the numeration must start right after
            the last item ID that was present in the training data.
        maxiter : int
            Maximum number of iterations for which to run the procedure.
        stop_thr : float
            Will stop if the norm of the difference between the shape parameters after an iteration
            is below this threshold.
        ncores : int
            Number of threads/cores to use. When there is little data, it's unlikely that using
            multiple threads would give a significant speed-up, and it might even end up making
            the function slower due to the overhead.
        random_seed : int or None
            Random seed to be used for the initialization of the new shape parameters.

        Returns
        -------
        True : bool
            Will return True if the procedure terminates successfully.
        """
        new_Theta_shp, new_item_mapping, new_max_id = self._predict_item_factors(
            words_df=words_df, maxiter=maxiter, ncores=ncores,
            random_seed=random_seed, stop_thr=stop_thr,
            return_ix=True, return_temp=False)

        ## Adding the new parameters
        new_Theta = new_Theta_shp / self.Theta_rte
        self.Theta_shp = np.r_[self.Theta_shp, new_Theta_shp]
        self.Theta = np.r_[self.Theta, new_Theta]
        self.Epsilon = np.r_[self.Epsilon, np.zeros((int(new_max_id), int(self.k)),
                             dtype=ctypes.c_float if self.use_float else ctypes.c_double)]
        self.Epsilon_shp = np.r_[self.Epsilon_shp, np.zeros((int(new_max_id), int(self.k)),
                                 dtype=ctypes.c_float if self.use_float else ctypes.c_double)]
        self.Epsilon_rte = np.r_[self.Epsilon_rte, np.zeros((int(new_max_id), int(self.k)),
                                 dtype=ctypes.c_float if self.use_float else ctypes.c_double)]
        self._M2 = np.r_[self._M2, new_Theta]

        ## Adding the new IDs
        if self.reindex:
            self.item_mapping_ = new_item_mapping
            if self.produce_dicts:
                for i in range(int(new_item_mapping.shape[0]) - int(self.nitems)):
                    self.item_dict_[new_item_mapping[i + int(self.nitems)]] = i + int(self.nitems)
            self.nitems = self.item_mapping_.shape[0]
        else:
            self.nitems += new_max_id

        return True
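    # Illustrative sketch (placeholders, not from the package's documentation):
    # appending items to a fitted model from their bag-of-words data and then
    # predicting with them. 'model', 'n_old' (the previous number of items when
    # using reindex=False) and the word IDs are hypothetical.
    #
    #   new_items = pd.DataFrame({'ItemId': [n_old, n_old, n_old + 1],
    #                             'WordId': [10, 57, 3],
    #                             'Count':  [1., 2., 1.]})
    #   model.add_items(new_items.copy())
    #   model.predict(user=0, item=n_old)   # the new items are now usable in predictions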
[docs]    def add_users(self, counts_df=None, user_df=None, maxiter=10, stop_thr=1e-3, ncores=1, random_seed=10):
        """
        Adds new users to an already fit model

        Adds new users without refitting the model from scratch. Note that this will not
        modify any of the item or word parameters.

        In the regular model, you will need to provide 'counts_df' as input, and the parameters
        will be determined according to the user-item interactions. If the model was fit with
        user attributes, you will also need to provide 'user_df'. Not providing a 'counts_df'
        object will assume that all the interactions for this user are zero (only supported in
        the model with user attributes).

        For better results, refit the model from scratch including the data from these new users.

        Note
        ----
        This function is prone to producing all-NaN values. Adding both users and items to an
        already-fit model might cause very bad quality results for both.

        Parameters
        ----------
        counts_df : data frame or array (n_samples, 3)
            DataFrame with the user-item interactions for the new users only. Must contain
            columns 'UserId', 'ItemId', 'Count'. If passing a numpy array, columns will be
            assumed to be in that order.
        user_df : data frame or array (n_samples, 3)
            DataFrame with the user attributes for the new users only. Must contain columns
            'UserId', 'AttributeId', 'Count'. If passing a numpy array, columns will be assumed
            to be in that order. Only for models fit with user side information.
        maxiter : int
            Maximum number of iterations for which to run the procedure.
        stop_thr : float
            Will stop if the norm of the difference between the shape parameters after an
            iteration is below this threshold.
        ncores : int
            Number of threads/cores to use. When there is little data, it's unlikely that using
            multiple threads would give a significant speed-up, and it might even end up making
            the function slower due to the overhead.
        random_seed : int or None
            Random seed to be used for the initialization of the new shape parameters.

        Returns
        -------
        True : bool
            Will return True if the procedure terminates successfully.
""" ncores, maxiter, stop_thr, random_seed = self._process_pars_factors(ncores, maxiter, stop_thr, random_seed, err_subj="user") ## checking input combinations if (counts_df is None) and (user_df is None): raise ValueError("Must pass at least one of 'counts_df' or 'user_df'.") if user_df is not None: if not self._has_user_df: raise ValueError("Can only use 'user_df' when the model was fit to user side information.") if (counts_df is None) and (not self._has_user_df): raise ValueError("Must pass 'counts_df' to add a new user.") cy = cy_float if self.use_float else cy_double cython_loops = cython_loops_float if self.use_float else cython_loops_double if (counts_df is not None) and (user_df is not None) and self._has_user_df: ## factors based on both attributes and interactions user_df, counts_df, new_user_mapping = self._process_extra_df(user_df, ttl='user_df', df2=counts_df) counts_df['UserId'] -= self.nusers user_df['UserId'] -= self.nusers new_max_id = int(max(counts_df.UserId.max(), user_df.UserId.max()) + 1) if new_max_id <= 0: raise ValueError("Numeration of item IDs overlaps with IDs passed to '.fit'.") new_Omega_shp, new_Eta_shp = cy.calc_user_factors_full( counts_df, user_df, new_max_id, cython_loops.cast_int(maxiter), cython_loops.cast_ind_type(self.k), stop_thr, random_seed, ncores, cython_loops.cast_real_t(self.c), cython_loops.cast_real_t(self.e), self.Omega_rte, self.Eta_rte, self.Theta_shp, self.Theta_rte, self.Epsilon_shp, self.Epsilon_rte, self.Kappa_shp, self.Kappa_rte ) ## Adding the new parameters new_Omega = new_Omega_shp / self.Omega_rte new_Eta = new_Eta_shp / self.Eta_rte self.Omega_shp = np.r_[self.Omega_shp, new_Omega_shp] self.Omega = np.r_[self.Omega, new_Omega] self.Eta_shp = np.r_[self.Eta_shp, new_Eta_shp] self.Eta = np.r_[self.Eta, new_Eta] self._M1 = np.r_[self._M1, new_Omega + new_Eta] ## factors based on user-item interactions elif (user_df is None) and (counts_df is not None): counts_df, new_user_mapping = self._process_extra_df(counts_df, ttl='counts_df') counts_df['UserId'] -= self.nusers new_max_id = int(counts_df.UserId.max() + 1) if new_max_id <= 0: raise ValueError("Numeration of item IDs overlaps with IDs passed to '.fit'.") new_Eta_shp = cy.calc_user_factors( counts_df, new_max_id, maxiter, cython_loops.cast_ind_type(self.k), stop_thr, random_seed, ncores, cython_loops.cast_real_t(self.e), self.Eta_rte, self.Theta_shp, self.Theta_rte, self.Epsilon_shp, self.Epsilon_rte ) ## Adding the new parameters new_Eta = new_Eta_shp / self.Eta_rte self.Eta_shp = np.r_[self.Eta_shp, new_Eta_shp] self.Eta = np.r_[self.Eta, new_Eta] self._M1 = np.r_[self._M1, new_Eta] if self._has_user_df: self.Omega = np.r_[self.Omega, np.zeros((new_max_id, self.k), dtype=ctypes.c_float if self.use_float else ctypes.c_double)] self.Omega_shp = np.r_[self.Omega_shp, np.zeros((new_max_id, self.k), dtype=ctypes.c_float if self.use_float else ctypes.c_double)] ## factors based on user attributes else: new_Omega_shp, new_user_mapping, new_max_id = self._predict_user_factors( user_df=user_df, maxiter=maxiter, ncores=ncores, random_seed=random_seed, stop_thr=stop_thr, return_ix=True, return_temp=False ) ## Adding the new parameters new_Omega = new_Omega_shp / self.Omega_rte self.Omega_shp = np.r_[self.Omega_shp, new_Omega_shp] self.Omega = np.r_[self.Omega, new_Omega] self.Eta = np.r_[self.Eta, np.zeros((new_max_id, self.k), dtype=ctypes.c_float if self.use_float else ctypes.c_double)] self.Eta_shp = np.r_[self.Eta_shp, np.zeros((new_max_id, self.k), dtype=ctypes.c_float if 
            self._M1 = np.r_[self._M1, new_Omega]

        ## updating the list of seen items for these users
        if self.keep_data and (counts_df is not None):
            for u in range(int(new_max_id)):
                items_this_user = counts_df.ItemId.to_numpy()[counts_df.UserId == u]
                self._n_seen_by_user = np.r_[self._n_seen_by_user, items_this_user.shape[0]].astype(int)
                self._st_ix_user = np.r_[self._st_ix_user, self.seen.shape[0]].astype(int)
                self.seen = np.r_[self.seen, items_this_user].astype(int)

        ## Adding the new IDs
        if self.reindex:
            self.user_mapping_ = new_user_mapping
            if self.produce_dicts:
                for u in range(new_user_mapping.shape[0] - self.nusers):
                    self.user_dict_[new_user_mapping[u + self.nusers]] = u + self.nusers
            self.nusers = self.user_mapping_.shape[0]
        else:
            self.nusers += int(new_max_id)

        return True
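    # Illustrative sketch (placeholders, not from the package's documentation):
    # the three supported ways of calling 'add_users', depending on whether the
    # model was fit with user attributes. 'new_interactions' and 'new_attributes'
    # are hypothetical DataFrames in the same format as 'counts_df' and 'user_df'.
    #
    #   model.add_users(counts_df=new_interactions.copy())           # interactions only
    #   model.add_users(counts_df=new_interactions.copy(),
    #                   user_df=new_attributes.copy())               # both (model fit with user_df)
    #   model.add_users(user_df=new_attributes.copy())               # attributes only (model fit with user_df)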
[docs]    def eval_llk(self, counts_df, full_llk=False):
        """
        Evaluate Poisson log-likelihood (plus constant) for a given dataset

        Note
        ----
        This log-likelihood is calculated only for the combinations of users and items
        provided here, so it's not a complete likelihood, and it might sometimes turn out to
        be a positive number because of this.

        Will filter out the input data by taking only combinations of users
        and items that were present in the training set.

        Parameters
        ----------
        counts_df : pandas data frame (nobs, 3)
            Input data on which to calculate log-likelihood, consisting of IDs and counts.
            Must contain one row per non-zero observation, with columns 'UserId', 'ItemId', 'Count'.
            If a numpy array is provided, will assume the first 3 columns contain that info.
        full_llk : bool
            Whether to calculate terms of the likelihood that depend on the data but not on the
            parameters. Omitting them is faster, but it's more likely to result in positive values.

        Returns
        -------
        llk : dict
            Dictionary containing the calculated log-likelihood and the number of
            observations that were used to calculate it.
        """
        assert self.is_fitted
        HPF._process_valset(self, counts_df, valset=False)
        cython_loops = cython_loops_float if self.use_float else cython_loops_double

        out = {'llk': cython_loops.calc_llk(self.val_set.Count.to_numpy(),
                                            self.val_set.UserId.to_numpy(),
                                            self.val_set.ItemId.to_numpy(),
                                            self._M1,
                                            self._M2,
                                            cython_loops.cast_int(self.k),
                                            cython_loops.cast_int(self.ncores),
                                            cython_loops.cast_int(bool(full_llk))),
               'nobs': self.val_set.shape[0]}
        del self.val_set
        return out
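    # Illustrative sketch (placeholders, not from the package's documentation):
    # checking the (partial) Poisson log-likelihood on a held-out set. 'val_df' is
    # a hypothetical DataFrame with the same format as 'counts_df'.
    #
    #   res = model.eval_llk(val_df.copy(), full_llk=False)
    #   print(res['llk'], res['nobs'])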
    def _print_st_msg(self):
        print("*****************************************")
        print("Collaborative Topic Poisson Factorization")
        print("*****************************************")
        print("")

    def _print_data_info(self):
        print("Number of users: %d" % self.nusers)
        print("Number of items: %d" % self.nitems)
        print("Number of words: %d" % self.nwords)
        if self._has_user_df:
            print("Number of user attributes: %d" % self.nuserattr)
        print("Latent factors to use: %d" % self.k)
        print("")