Examples
========

You can find all these examples in the *./draft* folder.

Movielens
---------

.. code-block:: python

    import sys

    #To show some messages:
    import recsys.algorithm
    recsys.algorithm.VERBOSE = True

    from recsys.algorithm.factorize import SVD
    from recsys.datamodel.data import Data
    from recsys.evaluation.prediction import RMSE, MAE

    #Dataset
    PERCENT_TRAIN = int(sys.argv[2])
    data = Data()
    data.load(sys.argv[1], sep='::', format={'col':0, 'row':1, 'value':2, 'ids':int})
        # About format parameter:
        #   'row': 1 -> Rows in matrix come from column 1 in ratings.dat file
        #   'col': 0 -> Cols in matrix come from column 0 in ratings.dat file
        #   'value': 2 -> Values (Mij) in matrix come from column 2 in ratings.dat file
        #   'ids': int -> Ids (row and col ids) are integers (not strings)

    #Train & Test data
    train, test = data.split_train_test(percent=PERCENT_TRAIN)

    #Create SVD
    K=100
    svd = SVD()
    svd.set_data(train)
    svd.compute(k=K, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True)

    #Evaluation using prediction-based metrics
    rmse = RMSE()
    mae = MAE()
    for rating, item_id, user_id in test.get():
        try:
            pred_rating = svd.predict(item_id, user_id)
            rmse.add(rating, pred_rating)
            mae.add(rating, pred_rating)
        except KeyError:
            continue

    print 'RMSE=%s' % rmse.compute()
    print 'MAE=%s' % mae.compute()

Save it as **movielens.py**, and run it!

.. code-block:: console

    $ python movielens.py tests/data/movielens/ratings.dat 80

    # Here's the output:
    Creating matrix
    Updating matrix: squish to at least 5 values
    Computing svd k=100, min_values=5, pre_normalize=None, mean_center=True, post_normalize=True
    RMSE=0.91919
    MAE=0.717771
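
To make the two metrics concrete, here is a minimal plain-Python sketch of what RMSE and MAE measure over (real, predicted) rating pairs. The helper names ``rmse_of`` and ``mae_of`` and the sample pairs are hypothetical; they are not part of the library, and the ``RMSE`` and ``MAE`` classes used above may differ in their details.

.. code-block:: python

    from math import sqrt

    def rmse_of(pairs):
        # Root mean squared error over (real, predicted) pairs
        squared_errors = [(real - pred) ** 2 for real, pred in pairs]
        return sqrt(sum(squared_errors) / float(len(squared_errors)))

    def mae_of(pairs):
        # Mean absolute error over (real, predicted) pairs
        absolute_errors = [abs(real - pred) for real, pred in pairs]
        return sum(absolute_errors) / float(len(absolute_errors))

    # Made-up (real, predicted) ratings, for illustration only
    pairs = [(4.0, 3.0), (2.0, 2.0), (5.0, 3.0)]
    rmse_value = rmse_of(pairs)
    mae_value = mae_of(pairs)

For these three pairs the absolute errors are 1, 0 and 2, so the MAE is 1.0, while the RMSE is the square root of 5/3 (about 1.29): the squared term weighs the single large error more heavily.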

Last.fm
-------

*Why is Ringo always forgotten?*

1. (Slow) Get the last.fm `360K`_ dataset, and save it to /tmp:

.. _`360K`: http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz 

.. code-block:: bash

    cd /tmp/
    wget http://mtg.upf.edu/static/datasets/last.fm/lastfm-dataset-360K.tar.gz
    tar xvzf lastfm-dataset-360K.tar.gz 

2. (Faster way) Download this `tar file`_, which already contains matrix.dat (~17M lines), and copy the 3 files to /tmp:

.. _`tar file`: http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm360K-svd-example.tar.gz


.. code-block:: bash

    cd /tmp/
    wget http://www.dtic.upf.edu/~ocelma/MusicRecommendationDataset/lastfm360K-svd-example.tar.gz
    tar xvzf lastfm360K-svd-example.tar.gz

And then just copy these 10 lines of code!
  
.. code-block:: python

    import sys
    import recsys.algorithm
    recsys.algorithm.VERBOSE = True
    from recsys.utils.svdlibc import SVDLIBC

    # 1. (Slow) Create the sparse matrix.dat file in SVDLIBC input format
    #    (http://tedlab.mit.edu/~dr/SVDLIBC/SVD_F_ST.html).
    #    This eats quite a lot of memory! (~9 GB)
    #svdlibc = SVDLIBC(datafile='/tmp/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv', 
    #                  matrix='/tmp/matrix.dat', prefix='/tmp/svd')
    #svdlibc.to_sparse_matrix(sep='\t', format={'col':0, 'row':1, 'value':3})

    # 2. (Faster way):
    # You have already downloaded and copied these 3 files to /tmp:
    #   /tmp/matrix.dat
    #   /tmp/svd.ids.rows
    #   /tmp/svd.ids.cols
    svdlibc = SVDLIBC()

    # Compute SVDLIBC
    k = 100
    svdlibc.compute(k, matrix='/tmp/matrix.dat', prefix='/tmp/svd') # Wait ~2 mins.
    svd = svdlibc.export() # This can consume ~2.8 GB of memory
    # print svd

    ID = 'b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d' # The Beatles MBID
    svd.similar(ID) # Get artists similar to The Beatles (...why is Ringo always forgotten!?)
    [('b10bbbfc-cf9e-42e0-be17-e2c3e1d2600d', 0.99999999999999978), # The Beatles
     ('4d5447d7-c61c-4120-ba1b-d7f471d385b9', 0.96963526974942182), # John Lennon
     ('31f49c01-b8e0-40ba-b1aa-3754f6fa78d5', 0.96566802153067377), # Paul McCartney & Wings
     ('5c014631-875c-4f3e-89e9-22cf9d4769a4', 0.9554322804979507),  # John Lennon & Yoko Ono
     ('ba550d0e-adac-4864-b88b-407cab5e76af', 0.95520067803777453), # Paul McCartney
     ('e975f847-7b7a-4313-8ebc-1cbfc978e817', 0.95385390155825112), # Paul & Linda McCartney
     ('42a8f507-8412-4611-854f-926571049fa0', 0.94022861823264092), # George Harrison
     ('5235052b-7fa0-498b-accf-26b9e7767da7', 0.93691208464079334), # Mohamed Moneir
     ('dafcd725-9cb6-4347-be21-fd9a950e8064', 0.9352608795525883),  # Klaatu
     ('cb56afea-5648-4173-b1b7-762288492997', 0.93383747203947887)] # Bobby Sherman

The artists similar to **The Beatles** are so-so... Still, you can easily improve these results as explained in this boring `book`_.

.. _`book`: http://ocelma.net/MusicRecommendationBook/index.html
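
The scores listed above peak at (almost exactly) 1.0 for the artist itself, which is how a cosine similarity behaves. As a rough illustration only, and assuming (this is not confirmed here) that the scores are cosine similarities between artist vectors in the reduced space, the computation can be sketched in plain Python. The ``cosine`` helper and the three toy vectors are hypothetical:

.. code-block:: python

    from math import sqrt

    def cosine(u, v):
        # Cosine of the angle between two equal-length vectors
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = sqrt(sum(a * a for a in u))
        norm_v = sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    # Made-up 3-dimensional "artist" vectors, for illustration only
    artist_a = [0.9, 0.1, 0.3]
    artist_b = [0.8, 0.2, 0.3]   # points in nearly the same direction as artist_a
    artist_c = [0.1, 0.9, 0.0]   # points in a very different direction

    self_score = cosine(artist_a, artist_a)
    close_score = cosine(artist_a, artist_b)
    far_score = cosine(artist_a, artist_c)

An artist compared with itself scores 1.0 up to floating-point rounding, and vectors pointing in similar directions score close to it, which is the ordering a ranked list of similar artists relies on.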

Implementing a new algorithm
-----------------------------

Now, here's an example of how to create a new algorithm by extending the *Algorithm* base class (from ``recsys.algorithm.baseclass``).

This dummy Baseline algorithm predicts the value :math:`\hat{r}_{ui}`, for user :math:`u` and any item :math:`i`, as the average rating of that user.
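
In symbols, writing :math:`I_u` for the set of items that user :math:`u` rated in the training data (a notation introduced here only for illustration), the prediction is the user's mean rating:

.. math::

    \hat{r}_{ui} = \frac{1}{|I_u|} \sum_{j \in I_u} r_{uj}

The right-hand side does not depend on :math:`i`, which is why the value can be cached per user.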

.. code-block:: python

    from numpy import mean
    from operator import itemgetter

    from recsys.algorithm.baseclass import Algorithm

    class Baseline(Algorithm):
        def __init__(self):
            #Call parent constructor
            super(Baseline, self).__init__()

            # 'Cache' for user avg. rating
            self._user_avg_rating = dict()

        def predict(self, i, j, MIN_VALUE=None, MAX_VALUE=None, user_is_row=True):
            index = i
            if not user_is_row:
                index = j
            if index not in self._user_avg_rating:
                if user_is_row:
                    vector = self.get_matrix().get_row(index).entries()
                else:
                    vector = self.get_matrix().get_col(index).entries()
                # Vector is a list of tuples: (rating, pos). E.g. (3.0, 20)
                self._user_avg_rating[index] = mean(map(itemgetter(0), vector))
            predicted_value = self._user_avg_rating[index]

            # Compare against None so that a bound of 0 is still applied
            if MIN_VALUE is not None:
                predicted_value = max(predicted_value, MIN_VALUE)
            if MAX_VALUE is not None:
                predicted_value = min(predicted_value, MAX_VALUE)
            return predicted_value

Save this example as **baseline.py**

Here's an example using this simple baseline Algorithm class:

.. code-block:: python

    import sys

    #To show some messages:
    import recsys.algorithm
    recsys.algorithm.VERBOSE = True

    from recsys.evaluation.prediction import RMSE, MAE
    from recsys.datamodel.data import Data

    from baseline import Baseline #Import the test class we've just created

    #Dataset
    PERCENT_TRAIN = int(sys.argv[2])
    data = Data()
    data.load(sys.argv[1], sep='::', format={'col':0, 'row':1, 'value':2})
    #Train & Test data
    train, test = data.split_train_test(percent=PERCENT_TRAIN)

    baseline = Baseline()
    baseline.set_data(train)
    baseline.compute() # In this case, it does nothing

    # Evaluate
    rmse = RMSE()
    mae = MAE()
    for rating, item_id, user_id in test.get():
        try:
            pred_rating = baseline.predict(item_id, user_id, user_is_row=False)
            rmse.add(rating, pred_rating)
            mae.add(rating, pred_rating)
        except KeyError:
            continue

    print 'RMSE=%s' % rmse.compute()
    print 'MAE=%s' % mae.compute()

Save this example as **test_baseline.py**

And run it:

.. code-block:: console

    $ python test_baseline.py tests/data/movielens/ratings.dat 80

    # Here's the output:
    Loading dataset tests/data/movielens/ratings.dat
    Creating matrix
    RMSE=1.033579
    MAE=0.827535
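
Stripped of the library, the logic inside ``Baseline.predict`` amounts to a cached per-user mean followed by optional clamping. Here is a minimal sketch with plain dictionaries; ``predict_user_mean``, ``user_ratings`` and the sample data are hypothetical and only mirror the idea, not the library's data structures:

.. code-block:: python

    def predict_user_mean(user_ratings, user_id, cache,
                          min_value=None, max_value=None):
        # Compute the user's mean rating once, then reuse it from the cache
        if user_id not in cache:
            # In this sketch, an unknown user raises KeyError here
            ratings = user_ratings[user_id]
            cache[user_id] = sum(ratings) / float(len(ratings))
        predicted = cache[user_id]

        # Optional clamping to the valid rating range
        if min_value is not None:
            predicted = max(predicted, min_value)
        if max_value is not None:
            predicted = min(predicted, max_value)
        return predicted

    # Made-up ratings per user, for illustration only
    user_ratings = {'u1': [4.0, 5.0, 3.0], 'u2': [1.0, 2.0]}
    cache = {}
    pred_u1 = predict_user_mean(user_ratings, 'u1', cache)
    pred_u2 = predict_user_mean(user_ratings, 'u2', cache, min_value=2.0)

Here ``pred_u1`` is the plain mean 4.0, while ``pred_u2`` is the mean 1.5 raised to the lower bound 2.0. The cache stores the unclamped mean, so later calls with different bounds still start from the same value.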