Friday, September 27, 2013

The Transformation during Grad School

I saw this today and decided to share it. It is copied from Professor H.T. Kung's homepage at Harvard University. It seems reasonable...

Growth of a star (the transformation process that some students go through to become a mature researcher)--which stage are you in?
  • Knowing everything stage
    • Student: "I have designed a supercomputer even before graduate school."
    • Faculty: speechless
  • Totally beaten up stage
    • Student: speechless
    • Faculty: smiling at the student's progress so communication is possible now.
  • Confidence buildup stage
    • Student: "I am not stupid after all." (student thinks)
    • Faculty: "Uh oh, she is ready to argue." (faculty think)
  • Calling the shot stage
    • Faculty: "I am going to design an n-processor supercomputer."
    • Student: "You are crazy, because ..."
I'm pretty sure of where I am now. I'm just hoping that I will reach the later stages.

Friday, September 20, 2013

Online Learning - Predicting Individual Sequences

This post presents the basic concepts of online learning, which is in fact a framework rather than a single algorithm. Recently, this approach has been receiving much attention at top-tier conferences such as ICML, NIPS, ALT and COLT. I'm not sure whether "online learning" is the name the ML community commonly uses for this topic, but I'm going to stay with it. A famous book on the subject is Prediction, Learning, and Games, by Nicolò Cesa-Bianchi and Gábor Lugosi.

In the traditional statistical learning setting, some stochastic properties of the data are assumed. In practice, however, one cannot guarantee that these properties hold, which makes it extremely hard to provide theoretical guarantees for the performance of these algorithms. Online learning, in contrast, is a framework that makes no assumptions about the environment (I'll discuss the concept of an environment below). Moreover, unlike traditional approaches, which generally use an absolute performance measure, online learning is much more modest: it only intends to do as well as the best reference predictor. These characteristics allow researchers to provide theoretical guarantees for an algorithm's performance. In fact, some standard approaches such as SVMs have already been reformulated in the online learning framework.

In this framework, the prediction problem is modeled as a game in which one player is the predictor (or forecaster) and the other is the environment (your adversary). One of the main questions is: how do we generate predictors that are good enough? Of course, we first have to define what "good enough" means... To answer this question, we have to define some basic concepts of the field. These are presented below, together with a description of a game round.

At each time step $t = 1, 2, ...$, the forecaster makes a prediction $p_t \in D$, where $D$ is the decision set (the set of available predictions). At the same time, the environment chooses a loss function $\ell_t \in \mathcal{L}$, where $\mathcal{L}$ is a set of loss functions $\ell : D \rightarrow \mathbb{R}$. Notice that neither the forecaster nor the environment knows its adversary's choice. The forecaster only knows the loss functions it incurred in the past, i.e., $\ell_{t-1}, \ell_{t-2}, ..., \ell_{1}$. Defining $\ell_t(p_t)$ as the loss incurred by prediction $p_t$, we can define the total loss of the forecaster up to time $n$, $\hat{L}_n$:
$$\hat{L}_n = \sum_{t=1}^{n}\ell_t(p_t)$$
Once we have defined the total loss of the forecaster, we should also define the total loss $L_n(u)$ of a fixed predictor $u$, as if it had been selected at every time step:
$$L_n(u) = \sum_{t=1}^{n}\ell_t(u)$$
With these concepts defined, we can present what we really want to minimize: the regret $R_n$. It is the difference between the total loss of the forecaster and that of the best predictor in $D$. Formally,
$$R_n = \hat{L}_n - \inf_{u \in D}L_n(u)$$
*Notice that $\inf$ is different from $\min$. For further details, please check the definition of infimum on Wikipedia [link].
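To make these definitions concrete, here is a minimal sketch in Python of one full game, computing $\hat{L}_n$, $\inf_{u \in D} L_n(u)$ and $R_n$. The decision set, the squared loss, the outcome sequence and the constant forecaster are all illustrative choices of mine, not part of any standard library:

```python
# Toy regret computation: a small finite decision set D, a sequence of
# loss functions induced by outcomes y_t, and a (hypothetical)
# forecaster that always predicts 0.5.

D = [0.0, 0.25, 0.5, 0.75, 1.0]   # decision set (finite, for simplicity)
outcomes = [0.0, 1.0, 1.0, 1.0]   # the environment's choices over n = 4 rounds

def loss(p, y):
    # loss_t(p) = (p - y_t)^2: the environment "chooses" ell_t by choosing y_t
    return (p - y) ** 2

predictions = [0.5] * len(outcomes)   # the forecaster's choices p_t

# \hat{L}_n: total loss of the forecaster
L_hat = sum(loss(p, y) for p, y in zip(predictions, outcomes))

# inf_{u in D} L_n(u): total loss of the best fixed decision in hindsight
L_best = min(sum(loss(u, y) for y in outcomes) for u in D)

regret = L_hat - L_best
print(L_hat, L_best, regret)   # 1.0 0.75 0.25
```

Here the best fixed decision in hindsight is $u = 0.75$, and the constant forecaster pays a regret of $0.25$ for not knowing the outcomes in advance.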

Basically, we want to keep the regret small, while the environment wants to make it large. To increase the regret, the environment wants to choose losses such that the best predictor in hindsight is hard for the forecaster to discover, while that predictor's total loss is small (keeping the subtracted term low) and the losses assigned to other predictions are high (keeping the first term, the forecaster's total loss, high). The forecaster, on the other hand, has no control over the second term; all it can do is select the best possible prediction at each time step, looking for patterns in the past losses. This, in a very simplistic statement, is why the framework works: each prediction $p_t$ is chosen to keep the regret as small as possible.

In general, what we want to ensure is that the forecaster (first term) is, over time, as good as the best available predictor. Thus, in the limit, as $n$ grows arbitrarily large, we want $\frac{R_n}{n} \rightarrow 0$. In other words, we want the regret to grow sublinearly.
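One classical algorithm with sublinear regret is the exponentially weighted average forecaster (also known as Hedge), whose regret against $K$ experts with losses in $[0,1]$ is $O(\sqrt{n \log K})$, so $R_n / n \rightarrow 0$. Below is a small simulation sketch; the random environment, the horizon and the learning rate are my illustrative choices (a real adversary would be harder on the forecaster, but the bound holds either way):

```python
import math
import random

# Sketch of the exponentially weighted average forecaster (Hedge):
# keep a weight per expert, predict with the weighted average, and
# shrink the weight of each expert exponentially in its loss.

random.seed(0)
K, n = 10, 5000
eta = math.sqrt(8 * math.log(K) / n)   # learning rate from the standard analysis

weights = [1.0] * K
cum_expert = [0.0] * K   # L_n(u) for each expert u
cum_forecaster = 0.0     # \hat{L}_n (expected loss of the weighted average)

for t in range(n):
    W = sum(weights)
    probs = [w / W for w in weights]
    losses = [random.random() for _ in range(K)]   # environment's losses in [0, 1]
    cum_forecaster += sum(p * l for p, l in zip(probs, losses))
    for i in range(K):
        cum_expert[i] += losses[i]
        weights[i] *= math.exp(-eta * losses[i])

regret = cum_forecaster - min(cum_expert)
print(regret / n)   # small relative to 1, and it shrinks as n grows
```

The theoretical guarantee is $R_n \leq \sqrt{(n/2) \log K}$ for this learning rate, which for $n = 5000$ and $K = 10$ means an average regret per round well below $0.1$.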

A final question to be answered is: why don't we just use the forecaster's loss to measure performance? Why do we use regret?

I have to confess that this was not clear to me at first, but the answer is that regret gives us something concrete to compare against. Remember that the environment plays against you and wants to increase the forecaster's loss, and it can easily be shown that it can make this loss arbitrarily large. For example, if losses take values $0$ and $1$, the environment can easily make the forecaster suffer a total loss of $n$ (the worst case). So how do we know if our algorithm is good? We have to compare it to something else; in this case, the best predictor. If our regret is $0$, we know we could not have done any better, even if the forecaster's loss is high: we are simply working on a problem that is too hard. It is all a matter of having a basis for comparison.
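The worst case above is easy to simulate. Against any deterministic forecaster over $D = \{0, 1\}$ with $0/1$ loss, the environment can simply set the outcome to the opposite of the prediction every round, forcing $\hat{L}_n = n$; yet the best of the two constant predictors disagrees with the outcomes at most $n/2$ times, so the regret is at most $n/2$, not $n$. A sketch, with an illustrative majority-vote forecaster of my own choosing:

```python
# A deterministic forecaster over D = {0, 1} with 0/1 loss, against an
# adversary that always picks the opposite outcome. The forecaster's
# total loss is forced to n, but the best constant predictor in
# hindsight suffers at most n/2.

n = 100

def forecaster(history):
    # Illustrative rule: predict the majority of past outcomes, default 0.
    return 1 if sum(history) * 2 > len(history) else 0

history = []
L_hat = 0
for t in range(n):
    p = forecaster(history)
    y = 1 - p                  # adversary flips the prediction
    L_hat += int(p != y)       # 0/1 loss: always 1 here
    history.append(y)

# Best constant predictor in hindsight
L_best = min(sum(int(u != y) for y in history) for u in (0, 1))

print(L_hat, L_best)   # 100 50: loss of n, but regret of only n/2
```

So a loss of $n$ alone tells us nothing; comparing against $L_{best}$ reveals that half of that loss was unavoidable for anyone.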

An analogy that may ease the understanding of this difference is grades in a class. If we state that each student scores a grade $p \in \mathbb{R} : 0 \leq p \leq 100$, we have only defined the range of possibilities (you can see them as losses); we have not said whether all these grades are achievable. Suppose your professor is working as your adversary (this is just an analogy!) and wants to make the course extremely challenging. At the end, you score $40$. It seems pretty bad; however, when you look at your classmates' grades, the highest one is $42$. Your loss is huge, but your regret is small... You shouldn't be so sad about your performance, and hopefully you will still receive an A+ ;)

------------------

This post was longer than I expected, so I decided to leave some examples of this framework for the next post. I hope you were able to grasp the main concepts of this framework for predicting individual sequences.

This discussion is completely based on the lecture notes written by Prof. Csaba Szepesvari, Prof. András György and David Pál. The notation used is also the same. Of course, any mistakes are completely my fault.

Thursday, September 19, 2013

First Post - What does this blog stand for?

Hello,

as you can see, this is the first post of this blog. I'll use it to introduce the blog, myself, and so on...

I'm a first-year PhD student at the Department of Computing Science at the University of Alberta. Previously, I earned my B.Sc. and M.Sc. degrees in Brazil, at the Universidade Federal de Minas Gerais (UFMG), also in Computer Science. Yes, I'm Brazilian and I do love soccer. I root for Cruzeiro, but this is not really relevant to this discussion.

This blog exists because I read a lot about how to succeed in grad school before starting my PhD, and a common piece of advice was to write, write and write... Since I'm not a native English speaker, I suppose I have to write even more, so I'll use this blog for that. I cannot guarantee that every text will be perfect, so if you notice a typo, awkward phrasing, or an error, please let me know.

What am I going to write about? For the most part, Artificial Intelligence, Machine Learning and research. I struggle to draw a line between AI and ML, and sometimes it seems redundant to mention both, but whatever... This is what this blog is about.

While reading texts about being a successful PhD student, I always thought that one of my first posts would be on this subject, summarizing everything I had read. However, a friend, also a former student at UFMG and now a PhD student at UCSB, Arlei Silva, has already done it very well [link]. So it would be useless to write about the same thing again.

Thus, my first posts will probably discuss the main concepts of the online learning field (not e-learning!). For readers not used to this concept, I will finish with a quote extracted from the website of the online learning course at the University of Alberta, taught by Professor Csaba Szepesvari:

"Online learning is a hot topic in machine learning: When processing huge amounts of data algorithms must operate continuously in their environments, making predictions, receiving rewards, suffering losses. The distinguishing feature of online problems is that the algorithm's performance is measured by comparing it with the best performing algorithm from a larger class of algorithms, allowing for a wide class of environment models, ranging from stationary stochastic models to ones where no statistical constraints are adopted."

I hope you enjoy the blog. And I hope some people actually read it, so it doesn't become some kind of public diary of mine. Disclaimer: I'm a PhD student, so I may not be able to update this blog as frequently as I would like.