Friday, October 18, 2013

How an ML Algorithm is created

While I was an undergrad student, using several different fancy techniques to classify my data, I always wondered how anyone could come up with such "fancy" algorithms as Boosting or SVM from nothing. Of course I was extremely naive to think that someone would come up with a new algorithm from "nothing", and deep inside I knew that things do not work this way. However, the reasoning behind a method full of "fancy" math techniques was difficult to trace, at least for me some years ago.

Today I will try to present an example of how ML algorithms are created. Of course, the full story is more complex than a blog post can cover, so I'm going to present just the intuition of a "toy example": the creation of the algorithm called Prod [1]. It is an adaptation of the Exponential Weighting Algorithm (EWA), which I'm going to present first. While you are reading about EWA, I invite you to think about how you would improve it. To be honest? I wouldn't be able to do it while reading, but it may be a good exercise... I'm not trying to seem smart by asking you this...


Please remember the online learning framework that I described in a previous post, and the first algorithms I presented there. I'm too lazy to write everything again, and this post would be huge otherwise!


EWA

Among the first algorithms I presented, Follow the Leader was the only one able to make predictions in a continuous space; however, we can clearly see that following the leader may not always be a good approach.

EWA is an algorithm that blends the predictions of all experts, weighting them according to the "confidence" you have in them. Intuitively, if an expert makes too many mistakes, its weight becomes much smaller than the weight of a very good forecaster.

The algorithm works as follows: you receive the predictions of a set of forecasters (just the way I presented before) and then predict $p_t$. The way we mix the forecasters' predictions is:
$$
p_t = \frac{\sum_{i=1}^{N} w_{t-1,i} \, f_{t,i}}{\sum_{i=1}^{N} w_{t-1,i}}
$$
where $f_{t,i}$ is the prediction at time $t$ of expert $i$, $N$ is the total number of experts, and $w_{t-1,i}$ is the weight given to expert $i$ in the previous time step. We define $w_{0,i} = 1 \ \forall i \in D$, where $D$ is the set of forecasters.
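To make this concrete, here is a made-up example with $N = 2$ experts: if the current weights are $w_{t-1,1} = 1$ and $w_{t-1,2} = 0.5$, and the experts predict $f_{t,1} = 0.2$ and $f_{t,2} = 0.8$, then
$$ p_t = \frac{1 \cdot 0.2 + 0.5 \cdot 0.8}{1 + 0.5} = 0.4, $$
which leans towards the first expert because it has the larger weight.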

After making the prediction, as always, we receive the loss function $\ell_t$ and incur the loss $\ell_t(p_t)$, which we assume to lie in a convex set. Finally, we update all the weights as follows:
$$ w_{t,i} = w_{t-1, i} \cdot e^{-\eta \ell_t(f_{t,i})} $$
Notice that $e^{-\eta}$ plays the role of $\beta$ in the WMA. It is exponential for reasons that do not need to be discussed here. In the end, we can show that an upper bound on the regret of EWA is [2]:
$$R_n \leq \frac{\ln N}{\eta} + \frac{\eta n}{8}$$
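As a side remark, if we know the horizon $n$ in advance we can choose the $\eta$ that minimizes this bound: setting the derivative with respect to $\eta$ to zero gives $\eta = \sqrt{8 \ln N / n}$, which plugged back in yields
$$ R_n \leq \sqrt{\frac{n \ln N}{2}}. $$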
I'm not going to discuss this proof here; it is in [2]. This is all I wanted to present about EWA for the main goal of this post: to give an intuition of how we create new algorithms. Did you think of anything that could make the bound above smaller?
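Before moving on, here is a minimal sketch of EWA in Python, just to make the prediction and update steps concrete (this is my own illustration, not code from [1] or [2]; I assume scalar predictions and losses in $[0, 1]$):

import numpy as np

class EWA:
    def __init__(self, n_experts, eta):
        self.eta = eta
        self.w = np.ones(n_experts)          # w_{0,i} = 1 for every expert i

    def predict(self, f):
        # p_t: weighted average of the expert predictions f_{t,i}
        return np.dot(self.w, f) / np.sum(self.w)

    def update(self, expert_losses):
        # w_{t,i} = w_{t-1,i} * exp(-eta * l_t(f_{t,i}))
        self.w = self.w * np.exp(-self.eta * np.asarray(expert_losses))

# toy round with two experts
alg = EWA(n_experts=2, eta=0.5)
p_t = alg.predict(np.array([0.2, 0.8]))      # blended prediction
alg.update(np.array([0.1, 0.9]))             # losses revealed by the environment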

Prod

If you really thought about improving the algorithm above, you probably suspected that the update rule is where we have to focus. In fact there are other approaches, such as using different techniques in the proof of the bound, but you had no way to guess that from what I presented.

What really happens is that the proof uses Hoeffding's lemma; this is the origin of the number $8$ in the bound, for example. This lemma assumes that the set of losses is convex. Can we obtain a tighter inequality? One approach is to avoid Hoeffding's lemma and use a different inequality. $e^{-\eta}$ is clearly a convex function, but we could use a concave function instead, "avoiding" the use of Hoeffding's lemma.

Which function should we use? What about $1 - \eta$? How did I come up with that? Well, it is tangent to $e^{-\eta}$. Look at the graph below, which I got from WolframAlpha.
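The tangency is also easy to check analytically: $1 - \eta$ is the first-order Taylor expansion of $e^{-\eta}$ around $\eta = 0$, and since $e^{-\eta}$ is convex it stays above its tangent line,
$$ e^{-\eta} \geq 1 - \eta \quad \text{for all } \eta, \text{ with equality at } \eta = 0. $$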


Thus, we change EWA to have a different update rule,
$$ w_{t,i} = w_{t-1, i} (1 -\eta \ell_t(f_{t,i}))$$
and we are able to obtain the following bound [2]:
$$R_n \leq \frac{\ln N}{\eta} + \eta n\left(\frac{1}{n}\sum_{t=1}^{n}\ell_t^2(f_{t,i}) \right)$$
Despite looking scary at first, let's take a closer look at this bound. It scales with the average of the squared losses of the forecaster we compare against. This is good because the previous bound ignored the fact that a good forecaster should result in a smaller regret. Besides, assuming that the losses are between $0$ and $1$, this bound is even tighter, and better than a bound that scales linearly with time.
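To make the change concrete, here is how the new update would look in code, reusing the EWA sketch from above (again, my own illustration under the same assumptions):

# Prediction stays the same weighted average; only the update changes.
# We need eta * loss < 1 so the weights stay positive, which holds,
# e.g., for losses in [0, 1] and eta < 1.
class Prod(EWA):
    def update(self, expert_losses):
        # w_{t,i} = w_{t-1,i} * (1 - eta * l_t(f_{t,i}))
        self.w = self.w * (1.0 - self.eta * np.asarray(expert_losses))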



In summary, what I did here was to show a first algorithm, EWA. Then I changed its update rule to a "better" one, which gave us a tighter upper bound on the regret of the algorithm. In other words, I changed part of an existing algorithm using some math knowledge about the proof of the first bound. I did not delve deep into the proof because the post would become a book chapter, but I hope you trust me. Maybe one day I will present this proof here. But the idea is: I know some math, I see that you used a bound that is not so tight, but that was necessary because of your assumption of convexity. I replace this assumption with a new update function and then propose a new algorithm, avoiding the inequality that hurt the previous bound.

I hope I made it clearer how we use our math knowledge to propose new algorithms in ML. Obviously, this is only a simple example, with a superficial discussion of Prod. The original paper [1] is much denser and deeper (as expected). This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvari, Prof. Andras György and David Pál [2]. The notation used is also the same. Of course, any mistakes are completely my fault.

References:
[1] Cesa-Bianchi, N., Mansour, Y., and Stoltz, G. (2007). Improved second-order bounds for prediction with expert advice. Machine Learning, 66(2-3):321-352.
[2] Gyorgy, A., Pal, D., Szepesvari, C. (Draft). Lecture Notes - Online learning.

-----------
I'm sorry for not posting anything last week. I really want to be able to post regularly here; however, some things are out of our control. Broken laptop, assignments and so on...
