Friday, October 18, 2013

How an ML Algorithm is created

While I was an undergrad student, using several fancy techniques to classify my data, I always wondered how anyone could come up with such "fancy" algorithms as Boosting or SVM from nothing. Of course, I was extremely naive to think that someone would come up with a new algorithm from "nothing", and deep inside I knew that things do not work this way. Still, a method full of "fancy" math techniques was difficult to trace, at least for me some years ago.

Today I will try to present an example of how ML algorithms are created. Of course, the full story is more complex than a blog post allows, so I'm going to present here just the intuition of a "toy example": the creation of the algorithm called Prod [1]. It is an adaptation of the Exponentially Weighted Average algorithm (EWA), which I'm going to present first. While you are reading about EWA, I invite you to think about how you would improve it. To be honest? I wouldn't be able to do it while reading, but it may be a good exercise... I'm not trying to seem smart by asking you this...


Please recall the online learning framework that I described in this post, and the first algorithms I presented here. I'm too lazy to write everything again, and the post would be huge!


EWA

Among the first algorithms I presented, Follow the Leader was the only one able to make predictions in a continuous space; however, we can clearly see that following the leader may not always be a good approach.

EWA is an algorithm that blends the predictions of all experts, weighting them according to the "confidence" you have in them. Intuitively, if an expert makes too many mistakes, its weight becomes much smaller than the weight of a very good forecaster.

The algorithm works as follows: you receive the predictions of a set of forecasters (just the way I presented before) and then predict $p_t$. The way we mix the forecasters' predictions is:
$$
p_t = \frac{\sum_{i=1}^{N} w_{t-1,i}\, f_{t,i}}{\sum_{i=1}^{N} w_{t-1,i}}
$$
where $f_{t,i}$ is the prediction at time $t$ of the expert $i$, $N$ is the total number of experts, and $w_{t-1,i}$ is the weight given for the expert $i$ in the previous time step. We define $w_{0,i} = 1 \ \forall i \in D$, where $D$ is the set of forecasters.

After making the prediction, as always, we receive the loss function $\ell_t$, which we assume to be convex and to take values in a bounded interval, and incur the loss $\ell_t(p_t)$. Finally, we update all the weights as follows:
$$ w_{t,i} = w_{t-1, i} \cdot e^{-\eta \ell_t(f_{t,i})} $$
Notice that $e^{-\eta}$ plays the role of $\beta$ in the WMA. The update is exponential for reasons that do not need to be discussed here. In the end, we can see that the upper bound of the regret of EWA is [2]:
$$R_n \leq \frac{\ln N}{\eta} + \frac{\eta n}{8}$$
I'm not going to discuss this proof here; it is in [2]. This is all I wanted to present about EWA for the main goal of this post: to give an intuition of how we create new algorithms. Did you think of anything that could make the bound above smaller?
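To make the procedure concrete, here is a minimal sketch of EWA in Python. The function names, the squared loss and the value of $\eta$ are my own choices for illustration, not anything prescribed in [2]:

```python
import math

def ewa_predict(weights, predictions):
    """Blend the experts: weighted average of their predictions."""
    return sum(w * f for w, f in zip(weights, predictions)) / sum(weights)

def ewa_update(weights, losses, eta):
    """Multiply each weight by exp(-eta * loss) for that expert's loss."""
    return [w * math.exp(-eta * l) for w, l in zip(weights, losses)]

# Toy run with two experts and squared loss against the outcome y.
weights = [1.0, 1.0]              # w_{0,i} = 1 for all i
ps = []
for preds, y in [((0.2, 0.9), 0.5), ((0.4, 0.6), 0.5)]:
    p = ewa_predict(weights, preds)
    ps.append(p)
    losses = [(f - y) ** 2 for f in preds]
    weights = ewa_update(weights, losses, eta=0.5)
```

Note that the first prediction is the plain average of the experts (all weights are still $1$), and after the first round the expert with the smaller loss ends up with the larger weight.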

Prod

If you really thought about improving the algorithm above, you probably suspected that the update rule is where we have to focus. In fact there are other approaches, such as using different techniques in the proof of the bound, but you had no clue that would lead you there.

What really happens is that the proof uses Hoeffding's lemma; this is the origin of the number $8$ in the bound, for example. This lemma assumes the losses lie in a bounded (convex) interval. Are we able to obtain a tighter inequality? One approach is to avoid Hoeffding's lemma and use a different inequality: $e^{-\eta}$ is clearly a convex function, but we could use a concave function instead, avoiding the use of Hoeffding's lemma.

What function should we use? What about $1 - \eta$? How did I come up with that? Well, it is tangent to $e^{-\eta}$ at $\eta = 0$. Look at the graph below, which I got from WolframAlpha.
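A quick numerical check of this tangency (my own sanity check, not part of [1] or [2]):

```python
import math

# Same value at x = 0 (and the derivative of e^{-x} there is -1,
# matching the slope of 1 - x) ...
assert math.exp(0.0) == 1.0
# ... and, since e^{-x} is convex, the tangent line lies below it everywhere.
for x in [-0.5, -0.1, 0.1, 0.3, 0.9]:
    assert 1 - x <= math.exp(-x)
```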


Thus, we change EWA to have a different update rule,
$$ w_{t,i} = w_{t-1, i} (1 -\eta \ell_t(f_{t,i}))$$
and we are able to obtain the following bound [2]:
$$R_n \leq \frac{\ln N}{\eta} + \eta n\left(\frac{1}{n}\sum_{t=1}^{n}\ell_t^2(f_{t,i}) \right)$$
Despite looking scary at first, let's take a closer look at this bound. It scales with the average of the squared losses of the forecasters. This is good because the previous bound ignored the fact that good forecasters should result in a smaller regret. Besides, assuming that the losses lie between $0$ and $1$, this bound is even tighter, and better than a bound that scales linearly with time.
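The change relative to EWA is only the update rule. A minimal sketch in Python (my own illustration, assuming losses in $[0,1]$ and $\eta < 1$ so the weights stay positive):

```python
import math

def prod_update(weights, losses, eta):
    """Prod's update: multiply by (1 - eta * loss) instead of exp(-eta * loss)."""
    return [w * (1 - eta * l) for w, l in zip(weights, losses)]

# Since 1 - x <= e^{-x}, Prod shrinks a mistaken expert's weight at least
# as much as EWA does for the same loss.
w_prod = prod_update([1.0, 1.0], [0.09, 0.16], eta=0.5)
w_ewa = [math.exp(-0.5 * l) for l in [0.09, 0.16]]
```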



In summary, what I did here was to show a first algorithm, EWA. Then I changed its update rule to a "better" one, which gave us a tighter upper bound for the regret of the algorithm. In other words, I changed a part of an existing algorithm using some math knowledge about the proof of the first bound. I did not delve deeply into the proof because the post would become a book chapter, but I hope you trust me. Maybe one day I will present this proof here. The idea is: I know some math, I see that you used a bound that is not so tight but was necessary because of your assumption of convexity; I replace this assumption with a new update function and propose a new algorithm, avoiding the inequality that impaired the previous performance.

I hope I made it clearer to you how we use our math knowledge to propose new algorithms in ML. Obviously, this is only a simple example, with a superficial discussion of Prod. The original paper [1] is much denser and deeper (as expected). This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvari, Prof. Andras György and David Pál [2]. The notation used is also the same. Of course, any mistakes are completely my fault.

References:
[1] Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning Journal, 66(2-3):321-352.
[2] Gyorgy, A., Pal, D., Szepesvari, C. (Draft). Lecture Notes - Online learning.

-----------
I'm sorry for not posting anything last week. I really want to post regularly here; however, some things are out of our control. Broken laptop, assignments and so on...

Friday, October 4, 2013

First Algorithms in Online Learning

Considering the whole framework already discussed in a previous post, I am going to present some basic algorithms in the online learning field, namely, Follow the Leader (or Follow the Best Expert), Halving Algorithm and Weighted Majority Algorithm.


Follow the Leader Algorithm

A first situation we may think of is a forecaster that has to predict a sequence of numbers inside an interval. One may define the loss $\ell(p_t)$ of the prediction $p_t$ at time step $t$ as the squared distance between the prediction and the real outcome, i.e., $\ell(p_t) = ||p_t - y_t||^2$, where $y_t$ is the real outcome at time step $t$. Szepesvari et al. define this game as the Shooting Game, where the predictions lie in the unit ball of $\mathbb{R}^d$. I will use this game to present a bound for the first algorithm I am going to discuss, the Follow the Leader algorithm (FTL), also known as Follow the Best Expert.

FTL is an extremely simple algorithm: the prediction at time step $t$ is the one that minimizes the cumulative loss up to that moment. If we define the best forecaster in hindsight as the one that chooses the best fixed point up to time $t$ ($u_t$), we can say that FTL always predicts, at time step $t$, $u_{t-1}$. As an example, suppose the first three elements of the sequence to be predicted are $\{0.6, 0.4, 0.8\}$. FTL always predicts $0$ at the first time step, so the loss at the first time step is $||0 - 0.6||^2 = 0.36$. At the second time step, FTL predicts $\frac{0 + 0.6}{2} = 0.30$, and the loss at time step $2$ is $||0.3 - 0.4||^2 = 0.01$, and so on...
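The arithmetic of this example can be reproduced in a few lines of Python. This is a sketch that follows the example above, keeping the initial prediction $0$ in the averaged history (the function name is mine):

```python
def ftl_predictions(outcomes):
    """Follow the Leader on the real line, mirroring the example above:
    start from 0 and predict the average of the initial point plus all
    outcomes seen so far."""
    history = [0.0]
    preds = []
    for y in outcomes:
        preds.append(sum(history) / len(history))
        history.append(y)
    return preds

preds = ftl_predictions([0.6, 0.4, 0.8])
losses = [(p - y) ** 2 for p, y in zip(preds, [0.6, 0.4, 0.8])]
```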

Despite being extremely simple, both Cesa-Bianchi and Lugosi, and Szepesvari et al. show that FTL has a regret upper bound proportional to $\log n$. This is a very good performance, since $\lim_{n\rightarrow \infty} \frac{\log n}{n} = 0$: over time, the average regret vanishes.

Halving Algorithm

A different situation we can come up with is one in which you have a set of predictors (or forecasters) and you know, beforehand, that one of them is never wrong. The questions now are how to discover it and what the maximum number of mistakes is along the way. Clearly, there must be something better than FTL (besides, this problem may be a discrete classification). We can use, for example, a binary-search-like strategy; I will describe the binary case, but the reasoning is the same for more classes. You count the number of forecasters that predicted class $1$ and the number of forecasters that selected the other class, and predict, for the current time step, what the majority voted. Two outcomes are possible: (1) you predicted correctly, and then you eliminate all the predictors that gave you wrong information; or (2) you got the wrong answer for that time step, which is not so good, but at least you eliminated half or more of the predictors you had. This is the Halving algorithm.

Now, answering the second question: in the worst case, how many mistakes does the Halving algorithm make? Unlike other proofs of algorithms' bounds in this field, this one is quite intuitive. The worst case possible is to always make the wrong prediction until you discover the 'perfect forecaster', right? Otherwise, you would be eliminating forecasters without making a mistake, and certainly this is not the worst case. Additionally, the worst case is not when all predictors but one are wrong, but when half of the forecasters are wrong and the other half right. This way, using basic knowledge of binary search, half of the set is eliminated at each iteration and, in the end, the worst case is to make $\log_{2}{N}$ mistakes, where $N$ is the number of experts.
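A minimal sketch of the Halving algorithm for binary labels (the set-up with four experts, one of them perfect, is my own toy example):

```python
def halving_predict(active, expert_preds):
    """Majority vote among the still-active experts."""
    ones = sum(expert_preds[i] for i in active)
    return 1 if 2 * ones >= len(active) else 0

def halving_update(active, expert_preds, y):
    """Eliminate every active expert that was wrong this round."""
    return [i for i in active if expert_preds[i] == y]

# Four experts; expert 3 happens to be never wrong in this toy sequence.
rounds = [([0, 1, 0, 1], 1), ([1, 1, 0, 0], 0), ([0, 0, 1, 1], 1)]
active, mistakes = [0, 1, 2, 3], 0
for expert_preds, y in rounds:
    if halving_predict(active, expert_preds) != y:
        mistakes += 1
    active = halving_update(active, expert_preds, y)
```

In this run the algorithm is left with the perfect expert and makes at most $\log_2 4 = 2$ mistakes, as the bound promises.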


Weighted Majority Algorithm

However, a more common situation (?) is not to have a perfect predictor. Then how do we approach this task? The idea is very close to the Halving algorithm, but instead of eliminating the forecasters that made a wrong decision, you just reduce your confidence in them and predict based on a weighted majority. This algorithm is called the Weighted Majority Algorithm.

The idea is to attribute a weight $w_{t,i}$ to each predictor $i$ at time step $t$; initially, $w_{0,i} = 1$. You then compute, for each possible prediction, the sum of the weights of the forecasters that predicted it, and choose the prediction backed by the highest total weight. For example, three different forecasters $f_1$, $f_2$ and $f_3$ have weights (a sort of confidence you have in them) equal to $0.6$, $0.3$ and $0.4$, respectively. $f_1$ predicted $A$ while $f_2$ and $f_3$ predicted $B$. Since $0.7 > 0.6$ (sum of weights), you predict $B$.

Regarding the update rule, for each predictor we define $w_{t,i} = w_{t-1,i} \cdot \beta^{\mathbb{I}\{f_{t,i} \neq y_t\}}$. Here $\beta$ is a parameter, $y_t$ is the correct outcome revealed, and, most importantly, $\mathbb{I}$ is the indicator function, which returns $1$ when the condition inside it holds and $0$ otherwise. So, in this case, $\beta^{\mathbb{I}\{f_{t,i} \neq y_t\}}$ is $\beta^1 = \beta$ if the predictor was mistaken at that time step, and $\beta^0 = 1$ (no change!) if it was right. What about the number of mistakes?

Szepesvari, Gyorgy and Pal state that the Weighted Majority Algorithm makes at most
$$
\frac{\ln\left(\frac{1}{\beta}\right)L_n^* + \ln N}{\ln\left(\frac{2}{1+\beta}\right)}
$$
mistakes. I'm not going to present their proof here. Maybe later, in another post.
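A sketch of the Weighted Majority Algorithm in Python (the three-expert example is the one discussed above; the function names and the choice $\beta = 0.5$ are mine):

```python
def wma_predict(weights, expert_preds, labels):
    """Pick the label with the largest total weight behind it."""
    totals = {c: 0.0 for c in labels}
    for w, f in zip(weights, expert_preds):
        totals[f] += w
    return max(labels, key=lambda c: totals[c])

def wma_update(weights, expert_preds, y, beta):
    """Multiply the weight of every mistaken expert by beta."""
    return [w * (beta if f != y else 1.0) for w, f in zip(weights, expert_preds)]

# The example above: weights 0.6, 0.3, 0.4; f_1 says A, f_2 and f_3 say B.
weights = [0.6, 0.3, 0.4]
prediction = wma_predict(weights, ["A", "B", "B"], labels=("A", "B"))
# If the true label turns out to be A, f_2 and f_3 are penalized:
weights = wma_update(weights, ["A", "B", "B"], "A", beta=0.5)
```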

To conclude, notice that if $\beta = 0$, the Weighted Majority Algorithm becomes the Halving Algorithm. In the next post I intend to talk about Exponentially Weighted Average forecasters, used when we want to predict numerical values that are not necessarily discrete.

------------------
I'm sorry if this post was not so well written. I'm too busy with assignments and exams this week (and the next one).

This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvari, Prof. András György and David Pál. The notation used is also the same. Of course, any mistakes are completely my fault.