==========
Evaluation
==========

Given a dataset of items that a user has implicitly or explicitly interacted with (via ratings, purchases, downloads, previews, etc.), the idea is to split the dataset into two---usually disjoint---sets: the training set and the test set.

The evaluation package implements several metrics, grouped into **predictive accuracy** (Mean Absolute Error, Root Mean Squared Error), **decision-based** (Precision, Recall, F--measure), and **rank-based** metrics (Spearman's :math:`\rho`, Kendall--:math:`\tau`, and Mean Reciprocal Rank).

For a complete list of available metrics to evaluate recommender systems see [http://research.microsoft.com/pubs/115396/EvaluationMetrics.TR.pdf].

Given a test set :math:`\mathcal{T}` of user-item pairs :math:`(u, i)` with ratings :math:`r_{ui}`, the system generates predicted ratings :math:`\hat{r}_{ui}`.

Prediction-based metrics
========================

Prediction-based metrics compare the predicted values against the actual values; the result is averaged over all the deviations in the test set.

Mean Absolute Error
-------------------

[http://en.wikipedia.org/wiki/Mean_absolute_error]

Mean Absolute Error (MAE) measures the average deviation between the predicted and the real value:

.. math::

    MAE = \frac{1}{|\mathcal{T}|}\sum_{(u,i) \in \mathcal{T}} \left| \hat{r}_{ui} - r_{ui} \right|

where :math:`\hat{r}_{ui}` is the predicted value of user :math:`u` for item :math:`i`, and :math:`r_{ui}` is the true value (ground truth).

Examples
~~~~~~~~

.. code-block:: python

    from recsys.evaluation.prediction import MAE

    mae = MAE()
    mae.compute(4.0, 3.2) # returns 0.8

.. code-block:: python

    from recsys.evaluation.prediction import MAE

    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]
    mae = MAE(DATA_PRED)
    mae.compute() # returns 0.7

.. code-block:: python

    from recsys.evaluation.prediction import MAE

    GROUND_TRUTH = [3.0, 1.0, 5.0, 2.0, 3.0]
    TEST = [2.3, 0.9, 4.9, 0.9, 1.5]

    mae = MAE()
    mae.load_ground_truth(GROUND_TRUTH)
    mae.load_test(TEST)
    mae.compute() # returns 0.7

Root Mean Squared Error
-----------------------

[http://en.wikipedia.org/wiki/Root_mean_square_deviation]

Mean Squared Error (MSE) also compares the predicted value with the real value a user has assigned to an item. The difference between MAE and MSE is that MSE penalises large errors more heavily:

.. math::

    MSE = \frac{1}{|\mathcal{T}|}\sum_{(u,i) \in \mathcal{T}} (\hat{r}_{ui} - r_{ui})^2

Root Mean Squared Error (RMSE) equals the square root of the MSE value:

.. math::

    RMSE = \sqrt{MSE}

Examples
~~~~~~~~

.. code-block:: python

    from recsys.evaluation.prediction import RMSE

    rmse = RMSE()
    rmse.compute(4.0, 3.2) # returns 0.8

.. code-block:: python

    from recsys.evaluation.prediction import RMSE

    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]
    rmse = RMSE(DATA_PRED)
    rmse.compute() # returns 0.891067
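As a quick cross-check, the same MAE and RMSE values can be reproduced with a few lines of plain Python (a minimal sketch that does not use the ``recsys`` API):

.. code-block:: python

    import math

    # (ground truth, prediction) pairs, as in the examples above
    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]

    # MAE: mean of the absolute deviations
    mae = sum(abs(pred - real) for real, pred in DATA_PRED) / len(DATA_PRED)

    # RMSE: square root of the mean squared deviation
    rmse = math.sqrt(sum((pred - real) ** 2 for real, pred in DATA_PRED) / len(DATA_PRED))

    print(mae)  # ~0.7
    print(rmse) # ~0.891067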
Decision-based metrics
======================

Decision-based metrics evaluate the top-N recommendations for a user. Usually recommendations are a ranked list of items, ordered by decreasing relevance. Yet, decision-based metrics do not take into account the position (or rank) of an item in the result list.

There are four different cases to take into account:

* True positive (TP). The system recommends an item the user is interested in.
* False positive (FP). The system recommends an item the user is not interested in.
* True negative (TN). The system does not recommend an item the user is not interested in.
* False negative (FN). The system does not recommend an item the user is interested in.

+----------------------+--------------+------------------+
|                      | **Relevant** | **Not relevant** |
+----------------------+--------------+------------------+
| **Recommended**      | TP           | FP               |
+----------------------+--------------+------------------+
| **Not recommended**  | FN           | TN               |
+----------------------+--------------+------------------+

Precision (P) and recall (R) are obtained from the 2x2 contingency table (or confusion matrix) shown in the previous table.

Precision
---------

[http://en.wikipedia.org/wiki/Precision_and_recall]

Precision measures the fraction of relevant items among the recommended ones:

.. math::

    Precision=\frac{TP}{TP+FP}

Precision can also be evaluated at a given cut-off rank, considering only the top--n recommendations. This measure, called precision--at--n or P@n, is commonly used when evaluating the top--n results of a recommender system:

.. math::

    Precision=\frac{|hit\ set|}{N}

where :math:`|hit\ set|=|test \cap topN|`.

Recall
------

Recall measures the fraction of relevant items that appear among the recommendations, and is defined as:

.. math::

    Recall=\frac{TP}{TP+FN}

Again, when evaluating the top--n results of a recommender system, one can use:

.. math::

    Recall=\frac{|hit\ set|}{|test|}

F-measure
---------

[http://en.wikipedia.org/wiki/F1_score]

F--measure combines *P* and *R* into a single value, using the weighted harmonic mean. The general formula (for a non-negative real :math:`\beta`) is:

.. math::

    F_\beta = \frac{(1 + \beta^2) \cdot (\mathrm{precision} \cdot \mathrm{recall})}{(\beta^2 \cdot \mathrm{precision} + \mathrm{recall})}

Two common F--measures are :math:`F_{1}` and :math:`F_{2}`. In :math:`F_{1}`, recall and precision are evenly weighted, whilst :math:`F_{2}` weights recall twice as much as precision.

Example
-------

.. code-block:: python

    from recsys.evaluation.decision import PrecisionRecallF1

    TEST_DECISION = ['classical', 'invented', 'baroque', 'instrumental']
    GT_DECISION = ['classical', 'instrumental', 'piano', 'baroque']

    decision = PrecisionRecallF1()
    decision.load(GT_DECISION, TEST_DECISION)
    decision.compute() # returns (0.75, 0.75, 0.75)
    # P = 3/4 (the 'invented' result is not relevant)
    # R = 3/4 ('piano' is missing from the recommendations)
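The same numbers can be reproduced directly from the two lists with basic set operations (a minimal sketch that does not rely on the ``recsys`` API):

.. code-block:: python

    TEST_DECISION = ['classical', 'invented', 'baroque', 'instrumental'] # recommended items
    GT_DECISION = ['classical', 'instrumental', 'piano', 'baroque']      # relevant items (ground truth)

    # items that are both recommended and relevant (the hit set)
    hit_set = set(TEST_DECISION) & set(GT_DECISION)

    precision = len(hit_set) / float(len(TEST_DECISION)) # TP / (TP + FP) = 3/4
    recall = len(hit_set) / float(len(GT_DECISION))       # TP / (TP + FN) = 3/4
    f1 = 2 * precision * recall / (precision + recall)

    print(precision, recall, f1) # 0.75 0.75 0.75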
The main drawback of the decision--based metrics is that they do not take into account the ranking of the recommended items. Thus, an item at top--1 has the same relevance as an item at top--20. To avoid this limitation, we can use rank--based metrics.

Rank-based metrics
==================

Spearman's rho
--------------

[http://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient]

Spearman's :math:`\rho` computes the rank--based Pearson correlation of two ranked lists. It compares the predicted list with the user preferences (i.e. the ground truth data), and it is defined as:

.. math::

    \rho = \frac{1}{n_u} \frac{\sum_i ( r_{ui} - \bar{r} ) ( \hat{r}_{ui} - \hat{\bar{r}} )}{\sigma(r) \sigma(\hat{r})}

Examples
~~~~~~~~

Using explicit ranking information:

.. code-block:: python

    from recsys.evaluation.ranking import SpearmanRho

    TEST_RANKING = [('classical', 25.0), ('piano', 75.0), ('baroque', 50.0), ('instrumental', 25.0)]
    GT_RANKING = [('classical', 50.0), ('piano', 100.0), ('baroque', 25.0), ('instrumental', 25.0)]

    spearman = SpearmanRho()
    spearman.load(GT_RANKING, TEST_RANKING)
    spearman.compute() # returns 0.5

Rank-based correlation for ratings:

.. code-block:: python

    from recsys.evaluation.ranking import SpearmanRho

    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]
    spearman = SpearmanRho(DATA_PRED)
    spearman.compute() # returns 0.947368

Kendall--tau
------------

[http://en.wikipedia.org/wiki/Kendall_tau_rank_correlation_coefficient]

Kendall--:math:`\tau` also compares the recommended (:math:`topN`) list with the user's preferred list of items. The Kendall--:math:`\tau` rank correlation coefficient is defined as:

.. math::

    \tau = \frac{C^+ - C^-}{\frac{1}{2}N(N-1)}

where :math:`C^+` is the number of concordant pairs, and :math:`C^-` is the number of discordant pairs in the data set.

Examples
~~~~~~~~

Using explicit ranking information:

.. code-block:: python

    from recsys.evaluation.ranking import KendallTau

    TEST_RANKING = [('classical', 25.0), ('piano', 75.0), ('baroque', 50.0), ('instrumental', 25.0)]
    GT_RANKING = [('classical', 50.0), ('piano', 100.0), ('baroque', 25.0), ('instrumental', 25.0)]

    kendall = KendallTau()
    kendall.load(GT_RANKING, TEST_RANKING)
    kendall.compute() # returns 0.4

Rank-based correlation for ratings:

.. code-block:: python

    from recsys.evaluation.ranking import KendallTau

    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]
    kendall = KendallTau(DATA_PRED)
    kendall.compute() # returns 0.888889
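Both coefficients are standard rank statistics, so the rating-based values above can be cross-checked with ``scipy`` (a minimal sketch; ``scipy`` is not used by the ``recsys`` examples):

.. code-block:: python

    from scipy import stats

    DATA_PRED = [(3, 2.3), (1, 0.9), (5, 4.9), (2, 0.9), (3, 1.5)]
    real = [r for r, _ in DATA_PRED] # ground truth ratings
    pred = [p for _, p in DATA_PRED] # predicted ratings

    rho, _ = stats.spearmanr(real, pred)  # ties get average ranks
    tau, _ = stats.kendalltau(real, pred) # tau-b, corrected for ties

    print(rho) # ~0.947368
    print(tau) # ~0.888889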
Mean Reciprocal Rank
--------------------

[http://en.wikipedia.org/wiki/Mean_reciprocal_rank]

Mean Reciprocal Rank (MRR) is defined as:

.. math::

    \text{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\text{rank}_i}

Recommendations that occur earlier in the top--n list are weighted higher than those that occur later in the list.

Example
~~~~~~~

Computing the reciprocal rank (:math:`RR=\frac{1}{\text{rank}_i}`) for one query:

.. code-block:: python

    from recsys.evaluation.ranking import ReciprocalRank

    GT_DECISION = ['classical', 'instrumental', 'piano', 'baroque']
    QUERY = 'instrumental'

    rr = ReciprocalRank()
    rr.compute(GT_DECISION, QUERY) # returns 0.5 (1/2): found at position (rank) 2

Mean reciprocal rank for a list of queries :math:`Q`:

.. code-block:: python

    from random import shuffle
    from recsys.evaluation.ranking import MeanReciprocalRank

    TEST_DECISION = ['classical', 'invented', 'baroque', 'instrumental']
    GT_DECISION = ['classical', 'instrumental', 'piano', 'baroque']

    mrr = MeanReciprocalRank()
    for QUERY in TEST_DECISION:
        shuffle(GT_DECISION) # just to "generate" a different GT each time...
        mrr.load(GT_DECISION, QUERY)
    mrr.compute() # in my case, returned 0.45832

Mean Average Precision
----------------------

[http://en.wikipedia.org/wiki/Mean_average_precision]

Mean Average Precision (:math:`MAP`) is defined as:

.. math::

    \text{MAP} = \frac{\sum_{q=1}^{Q} AP(q)}{Q}

where :math:`Q` is the number of queries, and Average Precision (:math:`AP`) [http://en.wikipedia.org/wiki/Mean_average_precision#Average_precision] equals:

.. math::

    \text{AP} = \frac{\sum_{k=1}^n (P(k) \times rel(k))}{\mbox{number of relevant documents}}

where :math:`P(k)` is the precision at cut-off :math:`k`, and :math:`rel(k)` is an indicator function equal to 1 if the item at rank :math:`k` is a relevant document, and zero otherwise. As with MRR, recommendations that occur earlier in the top--n list are weighted higher than those that occur later in the list.

Example
~~~~~~~

Computing Average Precision (:math:`AP`) for one query, :math:`q`:

.. code-block:: python

    from recsys.evaluation.ranking import AveragePrecision

    ap = AveragePrecision()

    GT = [1, 2, 3, 4, 5]
    q = [1, 3, 5]
    ap.load(GT, q)
    ap.compute() # returns 1.0

    GT = [1, 2, 3, 4, 5]
    q = [99, 3, 5]
    ap.load(GT, q)
    ap.compute() # returns 0.5833335

Mean Average Precision for a list of retrieved results :math:`Q`:

.. code-block:: python

    from recsys.evaluation.ranking import MeanAveragePrecision

    GT = [1, 2, 3, 4, 5]
    Q = [[1, 3, 5], [99, 3, 5], [3, 99, 1]]

    Map = MeanAveragePrecision()
    for q in Q:
        Map.load(GT, q)
    Map.compute() # returns 0.805556
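For reference, the AP and MAP values above can also be reproduced with plain Python. Note that, to match the returned numbers, this sketch normalises AP by the number of relevant items actually retrieved (a normalisation assumed here from the outputs, not taken from the ``recsys`` API); the textbook formula divides by the total number of relevant documents instead:

.. code-block:: python

    def average_precision(gt, retrieved):
        # AP for one ranked result list; assumption: normalise by the number of hits
        relevant = set(gt)
        hits, precision_sum = 0, 0.0
        for k, item in enumerate(retrieved, start=1):
            if item in relevant:
                hits += 1
                precision_sum += hits / float(k) # P(k) at each relevant rank k
        return precision_sum / hits if hits else 0.0

    GT = [1, 2, 3, 4, 5]
    Q = [[1, 3, 5], [99, 3, 5], [3, 99, 1]]

    aps = [average_precision(GT, q) for q in Q]
    print(aps)                 # [1.0, ~0.583333, ~0.833333]
    print(sum(aps) / len(aps)) # MAP, ~0.805556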