Algorithms
==========

**pyrecsys** provides, *out of the box*, some basic algorithms based on matrix factorization.

SVD
---

**pyrecsys** makes use of `SVD`_ in order to decompose the input data (a matrix).
Once the matrix is *reduced* into a lower dimensional space, **pyrecsys** can provide
predictions, recommendations and similarity among the "elements" (being either users or
items -it's just a matter of how you load the matrix data-).

.. _`SVD`: http://en.wikipedia.org/wiki/Singular_value_decomposition

Loading data
~~~~~~~~~~~~

How to load a dataset (e.g. `Movielens`_ 10M)?

.. _`Movielens`: http://www.grouplens.org/node/73

.. code-block:: python

    from recsys.algorithm.factorize import SVD

    filename = './data/movielens/ratings.dat'
    svd = SVD()
    svd.load_data(filename=filename, sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int}) 
        # About format parameter:
        #   'row': 1 -> Rows in matrix come from second column in ratings.dat file
        #   'col': 0 -> Cols in matrix come from first column in ratings.dat file
        #   'value': 2 -> Values (Mij) in matrix come from third column in ratings.dat file
        #   'ids': int -> Ids (row and col ids) are integers (not strings)

Split a dataset (train and test):

.. code-block:: python

    from recsys.datamodel.data import Data
    from recsys.algorithm.factorize import SVD
    
    filename = './data/movielens/ratings.dat'
    data = Data()
    format = {'col':0, 'row':1, 'value':2, 'ids': int}
    data.load(filename, sep='::', format=format)
    train, test = data.split_train_test(percent=80) # 80% train, 20% test

    svd = SVD()
    svd.set_data(train)

Loading a file, executing external SVDLIBC program, and creating an SVD instance *svd*:

.. code-block:: python

    from recsys.utils.svdlibc import SVDLIBC
    svdlibc = SVDLIBC('./data/movielens/ratings.dat')
    svdlibc.to_sparse_matrix(sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int}) # Convert to sparse matrix format [http://tedlab.mit.edu/~dr/SVDLIBC/SVD_F_ST.html]
    svdlibc.compute(k=100)
    svd = svdlibc.export()

Computing
~~~~~~~~~

    >>> K=100
    >>> svd.compute(k=K, min_values=10, pre_normalize=None, mean_center=True, post_normalize=True, savefile=None)

Parameters:

    *min_values*: remove those rows or columns (from the input matrix) that has less than 'min_values' non-zeros

    *pre_normalize*: normalize input matrix. Possible values are *tfidf*, *rows*, *cols*, *all*.

        **tfidf**: By default, treats the matrix as terms-by-documents. It's important, then, how the data is loaded. Use the *format* param in *svd.load_data()* to determine the order of the fields of the input file.

        **rows**: Rescales the rows of the input matrix so that they all have unit Euclidean magnitude

        **cols**: Rescales the columns of the input matrix so that they all have unit Euclidean magnitude

        **all**: Rescales the rows and columns of the input matrix, by dividing both the rows and the columns by the square root of their Euclidean norm

    *mean_center*: centering the input matrix (aka mean substraction)

    *post_normalize*: Normalize every row of :math:`U \Sigma` to be a unit vector. Thus, row similarity (using cosine distance) returns :math:`[-1.0 .. 1.0]`

    *savefile*: Output file to store SVD transformation (:math:`U, \Sigma, V^T` vectors)

Predictions
~~~~~~~~~~~

To predict a *rating*, :math:`\hat{r}_{ui}`, SVD class reconstructs the original matrix, :math:`M^\prime = U \Sigma_k V^T`

Then, 

    >>> svd.predict(ITEMID, USERID, MIN_RATING=0.0, MAX_RATING=5.0)

equals to:

.. math:: 

    \hat{r}_{ui} = M^\prime_{ij}

Here are the RMSE and MAE values for the Movielens 10M dataset (Train: 8,000,043 ratings, and Test: 2,000,011), using 5-fold cross validation, and different K values or factors (10, 20, 50, and 100) for SVD:

+-----------+--------+--------+--------+---------+
|        K  | **10** | **20** | **50** | **100** |
+-----------+--------+--------+--------+---------+
| **RMSE**  | 0.87224| 0.86774| 0.86557| 0.86628 |
+-----------+--------+--------+--------+---------+
| **MAE**   | 0.67114| 0.66719| 0.66484| 0.66513 |
+-----------+--------+--------+--------+---------+


Recommendations
~~~~~~~~~~~~~~~

Recommendations (i.e. unknown values in :math:`M_{ij}`) are also derived from :math:`M^\prime = U \Sigma_k V^T`. In this case, 

    >>> svd.recommend(USERID, n=10, only_unknowns=True, is_row=False)

returns the higher values (e.g. items that the user has not rated) of :math:`M^\prime_{i \cdot}` :math:`\forall_j{M_{ij}=\emptyset}`, whilst

    >>> svd.recommend(USERID, n=10, only_unknowns=False, is_row=False)

returns the best items for the user

Neighbourhood SVD
-----------------

Classic Neighbourhood algorithm uses the ratings of the similar users (or
items) to predict the values of the input matrix *M*.

.. code-block:: python

    from recsys.algorithm.factorize import SVDNeighbourhood

    svd = SVDNeighbourhood()
    svd.load_data(filename=sys.argv[1], sep='::', format={'col':0, 'row':1, 'value':2, 'ids': int})
    K=100
    svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)

Predictions
~~~~~~~~~~~

The only difference with *plain* SVD is the way how it computes the predictions :math:`\hat{r}_{ui}`

    >>> svd.predict(ITEMID, USERID, weighted=True, MIN_VALUE=0.0, MAX_VALUE=5.0)

To compute the prediction, it uses this equation (u=USERID, i=ITEMID):

.. math:: 

    \hat{r}_{ui} = \frac{\sum_{j \in S^{k}(i;u)} s_{ij} r_{uj}}{\sum_{j \in S^{k}(i;u)} s_{ij}}

where

:math:`S^k(i; u)` denotes the set of :math:`k` items rated by :math:`u`, which are most similar to :math:`i`. 

* To compute the :math:`k` items most similar to :math:`i`, it uses the *svd.similar(i)* method. Then it gets those items that user :math:`u` has already rated

:math:`s_{ij}` is the similarity between :math:`i` and :math:`j`, computed using *svd.similarity(i, j)*

Comparison
----------

For those who love RMSE, MAE and the like, here are some numbers comparing both SVD approaches.
The evaluation uses the `Movielens`_ 1M ratings dataset, splitting the train/test dataset with ~80%-20%.

.. _`Movielens`: http://www.grouplens.org/node/73

.. note::

    Computing svd k=100, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True

.. warning::

    Because of *min_values=5*, some rows (movies) or columns (users) in the input matrix are removed. In fact, those movies that had less than 5 users who rated it, and those users that rated less than 5 movies are removed.

Results
~~~~~~~

Movielens 1M dataset (number of ratings in the Test dataset: 209,908):

+-----------+--------+----------------+
|           | **SVD**| **SVD Neigh.** | 
+-----------+--------+----------------+
| **RMSE**  | 0.91811| 0.875496       |
+-----------+--------+----------------+
| **MAE**   | 0.71703| 0.684173       |
+-----------+--------+----------------+