Imagine we have a dataset of user search queries with corresponding ranked search results, and user clickthrough data (i.e., which of the returned URLs were clicked). Let's follow Dlib's machine learning flow chart to determine an algorithm to apply.

Full Dlib machine learning flow chart.

Let's start at the beginning and let the flow chart do our model selection for us. We are not predicting a true/false label, or a categorical label in general; we are predicting a continuous quantity (search result relevance) that we want to rank order. This leads us to svm_rank_trainer, as shown below.

Flow for learning to rank.

Dlib's flow chart lands us in the structured prediction group of machine learning algorithms, and we shall see that structure exists in the relative ordering of pairs.

Learning to rank is the application of machine learning to order data, e.g., to order the documents retrieved for a search query by relevance. As discussed in (Tie-Yan Liu, 2009), learning to rank algorithms fall into three categories: pointwise, pairwise, and listwise.

In this tutorial, we will cover how to use scikit-learn to implement the pairwise transform, and use RankSVM to make predictions on a learning to rank problem.

A search engine's task is to return relevant documents (URLs) to a user based on the user's query, and learning to rank refers to using statistical methods to infer the best ranking of URLs for a given query.

Standard research datasets for the task of learning to rank include MSLR-WEB and LETOR.

These datasets consist of a set of query ids, numerical features, and ranking scores. The numerical features include the number of occurrences of the query terms, i.e., the term frequency (TF), in the page title, URL, and body; the PageRank of the page; the number of child pages; and so on. A complete set of feature descriptions can be found in the LETOR paper.

We will present a toy example for pedagogical purposes, with the understanding that the same concepts, libraries, and algorithms can be reused on research and real-world datasets as well.

The scikit-learn version of the full learning to rank code is available here, and a Jupyter notebook can be created from rank_sklearn.py using py2nb.

Learning to Rank with scikit-learn

We begin our learning to rank tutorial with a toy example, by generating data from two different partitions, corresponding to different queries. The two partitions are offset from each other, and within each partition data points are normally distributed about their class centers, which are evenly spaced in one direction.

    # Create a dataset where target relevance scores consist of measurements
    # Y = {0, 1, 2}, and input data are two queries of 3*K/2 = 30 samples each,
    # with two features per sample.
    #
    # Queries are generated from two normal distributions X1 and X2 of
    # different means and covariances.
    import numpy as np

    # Data from each of the two partitions follow vectors parallel to unit
    # vector w, which is at angle theta to horizontal, with added noise.
    theta = np.deg2rad(60)
    w = np.array([np.sin(theta), np.cos(theta)])

    # The input data, in X, consist of two partitions (queries) of 3*K/2
    # points each. Each input datum has two features.
    #
    # Each partition has three clusters of K/2 data points, one for each Y
    # label, where each cluster is normally distributed with mean proportional
    # to the cluster number along vector w.
    K = 20
    X = np.random.randn(K, 2)
    y = [0] * K
    for i in range(1, 3):
        X = np.concatenate((X, np.random.randn(K, 2) + i*4*w))
        y = np.concatenate((y, [i] * K))

    # Slightly displace the data corresponding to our first partition
    # (partition 0), which is all the even indices of X.
    part0_offset = np.array([-3, -7])
    X[::2] += part0_offset
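To visualize the generated data, here is a minimal plotting sketch (assuming matplotlib is available; the markers, colour map, and arrow scaling are illustrative choices, not necessarily those used to produce the figure below):

    import matplotlib.pyplot as plt

    # Scatter the points coloured by relevance label y, with different markers
    # for the two partitions (even vs. odd indices of X), and draw w.
    labels = np.asarray(y)
    even = np.arange(len(X)) % 2 == 0
    for mask, marker, name in [(even, '^', 'partition 0 (offset)'),
                               (~even, 'o', 'partition 1')]:
        plt.scatter(X[mask, 0], X[mask, 1], c=labels[mask], cmap='Blues',
                    edgecolors='k', marker=marker, label=name)
    plt.arrow(0, 0, 8 * w[0], 8 * w[1], head_width=0.5, color='gray')
    plt.legend()
    plt.show()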

The generated data are below. The colours, white, light blue, and dark blue, represent the classes of the data, i.e., the ranking. Note that while within each partition data are linearly separable by rank, the combined data are not.

Toy data with w vector.

Let's try to naively fit a single vector to the data via ridge regression, in order to demonstrate the need for query structure in our predictive modeling of search rankings. We will see that ridge regression tries to fit both queries at the same time, and therefore produces a poor fit.

Exercise: Use scikit-learn to fit a ridge regression model to the data, and plot the result.

Write your solution in the skeleton function definition below.

def fit_rr(X_train, y_train, idx):
    """Fit dataset (X_train, y_train) using ridge regression, i.e., fit a
    linear model with L2 weight regularization.

    Args:
        X_train: [N, 2] array of input features.
        y_train: N length vector of labels in {0, 1, 2}, indicating each
            datapoint's ordinal relevance score.
        idx: N length array of boolean values, where True means that this
            example belongs to query (block) 0, and False means query 1.

    Return the fitted ridge regression model.
    """
    # YOUR CODE HERE
    pass
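One possible solution is sketched below, using scikit-learn's Ridge with its default regularization strength; note that idx is deliberately ignored, since we fit a single model to both queries at once.

from sklearn.linear_model import Ridge

def fit_rr(X_train, y_train, idx):
    # Fit one linear model with L2 regularization to all of the data,
    # ignoring the query structure encoded in idx.
    ridge = Ridge(alpha=1.0)
    ridge.fit(X_train, y_train)
    return ridge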

Let's use the code we just wrote in fit_rr to fit a ridge regression model, and plot the resulting fit along with our query ranking data.

Plot of fitted ridge regression model.

Let's use the Kendall's tau coefficient on the test set to evaluate the quality of the ridge regression fit with respect to the true orderings in queries 0 and 1.

Kendall's tau is a measure of rank correlation, i.e., a measure of similarity between two orderings of the same data, and takes all pairwise combinations of the data as input, returning a real valued output between -1 and 1.

Define concordant pairs as all of the pairs for which the orderings are in agreement, define discordant pairs as all pairs that the orderings disagree on, and assume there are n data points. Then Kendall's tau is:

tau = (# concordant pairs - # discordant pairs)/(n choose 2)
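For example, if the true labels of three items are (1, 2, 3) and a model orders them as (1, 3, 2), then two of the three pairs are concordant and one is discordant, giving tau = (2 - 1) / 3 ≈ 0.33.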

Exercise: Using the test set and the fitted ridge regression model, write a function to compute and return Kendall's tau for a single query.

def kendalls_tau(ridge_model, X_query, y_query):
    """Compute and return Kendall's tau for X_query and y_query.

    Args:
        ridge_model: The ridge regression model fit to the entire dataset.
        X_query: Data points for a single query.
        y_query: Labels (preference score) for each datum in X_query.
    """
    # YOUR CODE HERE
    pass
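A possible solution, sketched below, leans on scipy.stats.kendalltau rather than counting concordant and discordant pairs by hand.

from scipy import stats

def kendalls_tau(ridge_model, X_query, y_query):
    # Score the query's points with the fitted model, then measure the rank
    # correlation between the predicted scores and the true labels.
    y_pred = ridge_model.predict(X_query)
    tau, _p_value = stats.kendalltau(y_pred, y_query)
    return tau

Note that scipy's kendalltau computes the tie-adjusted tau-b variant, which coincides with the formula above when neither ranking contains ties.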

With our working function, let's compute Kendall's tau on the test set, and report the results below.


Kendall's tau coefficient for block 0: 0.7112197171355642
Kendall's tau coefficient for block 1: 0.84387274640268606

The pairwise transform

(Herbrich, 1999) observes that, since Kendall's tau counts inversions of pairs, it can be optimized through a new training set whose elements are pairs (x1, x2), with x1 and x2 drawn from the original dataset. The label of element (x1, x2) in the new training set is -1 if x2 is preferred to x1, +1 if x1 is preferred to x2, and zero if x1 and x2 have equal ordinal scores. (Herbrich, 1999) shows that minimizing the 0-1 classification loss on the new pairs dataset is equivalent, up to a constant factor, to maximizing Kendall's tau on the original dataset.

Exercise: What is a potential pitfall of the pairwise transform, as defined above?

We further transform the pairs (x1, x2) into (x1 - x2), such that the new dataset consists of points (x1 - x2, sign(y1 - y2)), where (x1, y1) and (x2, y2) are (feature, label) pairs from the original dataset. This transforms the original dataset into a binary classification problem with features of the same dimensionality as the original features.

Note that since rankings only make sense with respect to the same query, only pairs from the same query group are included in the new dataset (which also keeps the quadratic growth in the number of pairs in check).

Let's form all pairwise combinations (for each query separately), and plot the new dataset formed by the pairwise differences for each query, and their ordering.

# Form all combinations of training examples for which there is a preference
# one way or the other and both examples come from the same query. (Here
# blocks holds each example's query id and train indexes the training split;
# see the full code linked above.)
combinations = [(i, j)
                for i in range(X_train.shape[0])
                for j in range(X_train.shape[0])
                if ((y_train[i] != y_train[j]) and
                    (blocks[train][i] == blocks[train][j]))]

Xp = np.array([X_train[i] - X_train[j] for i, j in combinations])
diff = np.array([y_train[i] - y_train[j] for i, j in combinations])
yp = np.array([np.sign(d) for d in diff])

Let's plot the dataset of differences (x_i - x_j) with labels sign(y_i - y_j), and draw the hyperplane (line, in this 2D case) with the normal vector w, which is the unit vector we defined at the start. This line separates the +1 class (i is preferred to j) from the -1 class (j is preferred to i).

Our resulting new dataset is below.

Dataset of pairwise differences, with hyperplane.

The data are linearly separable since in our generated dataset there were no inversions, i.e., pairs of data points that project onto w in the opposite order of their respective ranks. In general the data will not always be linearly separable.

Let's train a RankSVM model on the dataset we have constructed from differences of pairs from the original dataset, i.e., Xp and yp.

RankSVM (Joachims, 2002) works by maximizing the number of satisfied inequalities w*x1 > w*x2, where x1 comes from a URL that is more relevant than x2's URL for a given query. Support vector machines (SVMs) approximate the solution to this maximization problem by introducing slack variables and solving the optimization problem:


    minimize: 0.5*||w||^2 + C * sum_{i,j,k} slack_{i,j,k}

    subject to: w*x_i >= w*x_j + 1 - slack_{i,j,k},  with slack_{i,j,k} >= 0,

    for all pairs (x_i, x_j) for which x_i's URL is preferred to x_j's URL
    for the query with id k.

RankSVM poses the optimization problem as equivalent to that of a binary classification SVM on pairwise difference vectors (x_i - x_j). Let's use RankSVM on our ranking problem now.

Exercise: Fit a RankSVM model (i.e., an SVM classifier on pairwise differences) to our paired dataset (Xp, yp).

def rank_svm(X_pairs, y_pairs):
    """Fit a RankSVM model on the dataset of pairwise difference vectors
    X_pairs with labels y_pairs indicating preference.

    Args:
        X_pairs: Pairwise differences computed from the original dataset.
        y_pairs: sign(y1 - y2) for pairs (x1, x2), i.e., -1 or +1 indicating
            preference of x1 to x2.

    Return the fitted RankSVM model.
    """
    # YOUR CODE HERE
    pass
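One possible solution is sketched below, using a linear SVC; the regularization constant C here is an arbitrary illustrative choice.

from sklearn import svm

def rank_svm(X_pairs, y_pairs):
    # A binary linear SVM on the pairwise difference vectors: its learned
    # weight vector coef_ induces a ranking of the original data points.
    model = svm.SVC(kernel='linear', C=0.1)
    model.fit(X_pairs, y_pairs)
    return model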

Using our RankSVM function, we produce a fit plotted below.

RankSVM fit.

Finally, we compute the Kendall's tau ranking score and compare RankSVM with the ridge regression fit.
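Since the fitted SVC ranks points by their projection onto its weight vector rather than by predict, one way to reuse the Kendall's tau evaluation is a small helper like the hypothetical sketch below (kendalls_tau_ranksvm is not part of the original code):

from scipy import stats

def kendalls_tau_ranksvm(svm_model, X_query, y_query):
    # Rank the query's points by their projection onto the learned weight
    # vector (equivalent to the SVM decision function up to a constant),
    # then compare that ordering to the true labels.
    scores = X_query.dot(svm_model.coef_.ravel())
    tau, _p_value = stats.kendalltau(scores, y_query)
    return tau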


Kendall's tau coefficient for block 0: 0.8362693377308282
Kendall's tau coefficient for block 1: 0.8438727464026861

Our RankSVM solution indeed gives a markedly higher Kendall's tau score than ridge regression on block 0, and matches it on block 1.


References

  1. Fabian Pedregosa's blog on the pairwise transform.

In this post, we will follow the paper Stochastic Gradient Descent as Approximate Bayesian Inference, which approximates stochastic gradient descent (SGD) with a continuous-time, Markovian, Gaussian process called the Ornstein-Uhlenbeck (OU) process.

The OU process is related to the Wiener process, which describes the Brownian motion of free particles, as shown below.

Brownian motion was first described by Einstein, who attributed the motion of pollen particles suspended in a liquid to the frequent impacts of the liquid's moving molecules. Einstein assumed that each particle's motion is independent of the motion of all other particles, and that a given particle's motion is Markovian, i.e., its displacements in successive time intervals are independent. From these assumptions he derived what is now known as the Fokker-Planck equation, which describes the number of particles at a given position and time via a probability distribution:

f(x, t) = n / sqrt(4*pi*D*t) * exp(-x^2 / (4*D*t))    (1)

Furthermore, the root mean square displacement of a Brownian particle goes like the square root of the elapsed time: lambda_x = sqrt(mean(x^2)) = sqrt(2*D*t).

An OU process is mean-reverting: from any initialization it relaxes toward its stationary distribution around a fixed long-run mean, as shown in the simulations below.
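As an illustration, here is a minimal Euler-Maruyama simulation sketch of an OU process dX = -theta*(X - mu)*dt + sigma*dW, with assumed, purely illustrative parameter values; trajectories started from different points all settle around mu.

import numpy as np

def simulate_ou(x0, mu=0.0, theta=1.0, sigma=0.3, dt=0.01, n_steps=2000, seed=0):
    # Euler-Maruyama discretization of dX = -theta*(X - mu)*dt + sigma*dW.
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    x[0] = x0
    for t in range(1, n_steps):
        drift = -theta * (x[t - 1] - mu) * dt
        diffusion = sigma * np.sqrt(dt) * rng.standard_normal()
        x[t] = x[t - 1] + drift + diffusion
    return x

# Paths from different initializations all relax toward mu = 0.
paths = [simulate_ou(x0) for x0 in (-5.0, 0.0, 5.0)]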

A simulator for Gaussian process regression is here, which we can use to observe the effect of switching from a squared exponential kernel to an exponential kernel (the stationary covariance function of the OU process).


When using TikZ with the CVPR paper LaTeX template cvpr.sty, as of writing the following error occurs:

LaTeX Error: Command \@EveryShipout@Hook already defined.

The solution to the error is tucked away in a StackOverflow comment here, and I reiterate it below:

  • Rename cvpr_eso.sty to everyshi.sty.
  • In eso-pic.sty, change the line \input{cvpr_eso.sty} to \usepackage{everyshi}.