I will present the first polynomial-time, $\sqrt{T}$-regret algorithm for bandit convex optimization.
Consider the problem of optimizing an unknown convex function using an approximate function value oracle. In the stochastic approximation literature one usually assumes that the noise is zero-mean, identically distributed, and independent across queries. The bandit framework goes well beyond this independence assumption: for each query there is a different function (potentially chosen adversarially given all the past choices), and the objective is to optimize the average function.
The problem is formulated as the following sequential game. Given a convex body $\mathcal{K} \subset \mathbb{R}^n$, at each time step $t = 1, \dots, T$, a player selects an action $x_t \in \mathcal{K}$, and simultaneously an adversary selects a convex loss function $\ell_t : \mathcal{K} \to [0,1]$. The player's feedback is her suffered loss, $\ell_t(x_t)$. The player has access to external randomness, and can select her action based on the history $(x_s, \ell_s(x_s))_{s<t}$. The player's performance at the end of the game is measured through the regret
$$R_T = \sum_{t=1}^{T} \ell_t(x_t) - \min_{x \in \mathcal{K}} \sum_{t=1}^{T} \ell_t(x),$$
which compares her cumulative loss to the smallest cumulative loss she could have obtained by playing a single fixed action, had she known the sequence of loss functions in advance.
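To make the protocol concrete, here is a minimal Python sketch of the game and of the regret computation. It is only an illustration under stated assumptions: the convex body is taken to be the interval [0, 1], the losses are quadratics fixed in advance (an oblivious adversary), the best fixed action is approximated on a finite grid, and the names `play_bandit_game` and `uniform_player` are hypothetical. The player shown is a naive baseline, not the algorithm of the talk.

```python
import random

def play_bandit_game(losses, choose_action, grid):
    """Run the sequential game and return the player's regret.

    losses: list of convex functions, one per round (the adversary's choices,
            fixed in advance here for simplicity).
    choose_action: maps the history [(action, observed loss), ...] to the next action.
    grid: finite set of points used to approximate the best fixed action.
    """
    history = []
    total = 0.0
    for loss in losses:
        x = choose_action(history)
        value = loss(x)              # bandit feedback: only the suffered loss is revealed
        history.append((x, value))
        total += value
    # Cumulative loss of the best fixed action, approximated over the grid.
    best_fixed = min(sum(loss(x) for loss in losses) for x in grid)
    return total - best_fixed

# Toy instance (assumption): actions in [0, 1], losses (x - c_t)^2 with values in [0, 1].
rng = random.Random(0)
T = 200
centers = [rng.random() for _ in range(T)]
losses = [lambda x, c=c: (x - c) ** 2 for c in centers]
grid = [i / 100 for i in range(101)]

# Naive baseline player: ignores all feedback and plays uniformly at random.
uniform_player = lambda history: rng.random()
regret = play_bandit_game(losses, uniform_player, grid)
print(f"regret of the uniform player after T={T} rounds: {regret:.2f}")
```

Because the minimum is taken over a grid rather than the whole interval, the value returned is a slight underestimate of the true regret.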
We will present the first algorithm that achieves the optimal dependence on $T$ and a polynomial dependence on the dimension. This new algorithm is based on three ideas: (i) kernel methods, (ii) a generalization of Bernoulli convolutions (a self-similar process that has been studied since the 1930s, most notably by Erdős), and (iii) a new annealing schedule for exponential weights (with an increasing learning rate).
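For readers unfamiliar with the second ingredient, the classical Bernoulli convolution with parameter $\lambda \in (0,1)$ is the law of the random series $\sum_{k \ge 0} \varepsilon_k \lambda^k$ with independent uniform signs $\varepsilon_k = \pm 1$. The sketch below samples a truncated version of this series. It illustrates only the classical object, not the generalization used in the talk; the function name `sample_bernoulli_convolution` and the truncation level `n_terms` are assumptions made for the example.

```python
import random

def sample_bernoulli_convolution(lam, n_terms=60, rng=random):
    """Draw one truncated sample of sum_{k>=0} eps_k * lam**k with eps_k = +/-1."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie in (0, 1)")
    total, scale = 0.0, 1.0
    for _ in range(n_terms):
        # Add or subtract the current power of lam with probability 1/2 each.
        total += scale if rng.random() < 0.5 else -scale
        scale *= lam
    return total

rng = random.Random(0)
lam = 0.5
samples = [sample_bernoulli_convolution(lam, rng=rng) for _ in range(10_000)]
bound = 1.0 / (1.0 - lam)  # the support is contained in [-1/(1-lam), 1/(1-lam)]
print(min(samples), max(samples), bound)
```

The self-similarity is visible in the recursion: if $X$ has this law, then so does $\varepsilon + \lambda X'$, where $X'$ is an independent copy of $X$ and $\varepsilon$ is an independent uniform sign. For $\lambda = 1/2$ the law is uniform on $[-2, 2]$.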
Joint work with Sebastien Bubeck and Yin-Tat Lee.