Distances on graphs, and the exploration/exploitation tradeoff in reinforcement learning

Marco Saerens

Université catholique de Louvain

saerens@isys.ucl.ac.be

Abstract
This work presents some general procedures for computing dissimilarities/similarities between elements of a database or, more generally, nodes of a weighted, undirected, graph. It is based on a Markovchain model of random walk through the database. The model assigns transition probabilities to the links between elements, so that a random walker can jump from element to element. A quantity, called the average firstpassage time, computes the average number of steps needed by a random walker for reaching element k for the first time, when starting from element i. Another closely related quantity, called the average commute time, provides a distance measure between any pair of elements. Yet another quantity of interest, closely related to the average commute time, is the pseudoinverse of the Laplacian matrix of the graph, which represents a similarity measure between the nodes of the graph, and is a kernel in the "support vector machine" sense. Unlike the standard "Dijkstra" or "shortest path" distance, these quantities, representing dissimilarities (similarities) between any two elements, have the nice property of decreasing (increasing) when the number of paths connecting these two elements increases and when the "length" of any path decreases. We then define the principal component analysis (PCA) of a graph as the subspace projection that preserves as much variance as possible, in terms of the commute time distance. This PCA has some interesting links with spectral graph theory, in particular "spectral clustering". Finally, we generalize this work by defining another distance measure between the nodes of a graph, issued from the concept of reinforcement learning.First of all, we define the concept of degree of exploration from a state as the entropy related to the probability distribution of the set of admissible actions in this state. This entropy value allows to control the degree of exploration linked to this state, and should be provided by the user. Then, we restate the exploration/exploitation problem as a global optimization problem: define the best exploration strategy that minimizes the expected cumulated cost, for fixed degrees of exploration. This formulation leads to a set of nonlinear updating rules reminiscent from the ``value iteration'' algorithm. Interestingly enough, when the degree of exploration is zero for all states (no exploration), these equations reduce to Bellman's equations for finding the shortest path while, when it is maximum, a full ``blind'' exploration i s performed.
Keywords
Graphs, Distance measure, Reinforcement learning