Pages

Sunday, December 18, 2011

A First Look at Data Science

I have always been wondering how Google succeeds in achieving so many amazing things: guessing (what we think), presenting (what we want), memorizing (what we need), etc. I keep trying to learn whatever I think I need to understand the internal of Google (for example, you may find a book called Google's PageRank and Beyond: The Science of Search Engine Rankings). Machine Learning, wealthy of statistical methods for 'guessing', was my first guess of what led to Google's success. This guess is correct (the correctness will be discussed later), but not sufficient, though, since we also need actual technology, in addition to mere theory, to realize our thoughts. Such technology like database systems lies in traditional Computer Science.

Maybe even that is not sufficient. New technology evolves every day, and it's hard to encapsulate all of them in a single terminology. But thanks to my advisor, Prof. Daisy Zhe Wang, who brought me to the field of Data Science, which, as far as I'm concerned, is a more satisfactory answer. (Note: I'm not saying that what Google does is data science. Rather, I mean that data science has contributed a lot to Google). As a first look at data science, you may look at the following figures, which are obtained from the articles The Rise of Data Science and Rise of the Data Scientist.


Figure 1. Skill sets for data science


Figure 2. Overview of data science

Now you should know why I consider 'data science' as a satisfactory answer: there is enough theory (math, statistics, machine learning, etc), and it specifies what we need to turn these theories into real-life products (hacking skills, substantive expertise). I'm also very happy to see a field that can somewhat summarize what I wish and need to learn, as I discussed in the first paragraph.

Let's take a brief tour of data science following the path of Figure 2. Along the way I'll discuss what I think may be necessary to learn, or suggest reading materials in the specific field.

Computer Science. Computer science provides ways to store and retrieve data efficiently. Relevant fields may be database systems, data structures & algorithms and so on. Machine Learning may be put here as well, or be viewed as a joint area with statistics as shown in Figure 1. You can find some good computer science books in Steven Pigeon's blog (French), Qunfeng's homepage or by asking Google. Furthermore, CS emphasizes programming skills, which should be a major plus in researching in data science: machines are better at dealing with mechanical tasks, such as processing large-scale data sets, than humans. An example set of skills used by data scientists is shown in Figure 3, from EMC and Big Data: 12 Key Findings From the Data Science Study.

Figure 3. Example set of skills used by data scientists

Math, Statistics and Data Mining. These theories provide fundamental methods to analyze the data you have (how to find patterns, how to make decisions basing on data, etc). There are a lot of excellent resources for these topics, some of which may be found at Kurt's pageWeipeng Liu's Blog (Chinese) and many more on web. (Motivated by this, I'm planning to pursue a master's degree in statistics along with the Ph.D. in computer science).

Graphic Design. Researchers should be able to show their results. To succeed in this, a good way may be data visualization. Data visualization is eye-catching and even fun: try to visualize your Google+ social network! Moreover, it also does good to your research: you can catch some essential trends of your data at the first glance of their visualization, which is convenient for later analysis. When considering data visualization, you should be able to determine what and how to visualize. There are several good books on these topics, some of which can be found at Kurt's page, and one more book I'd like to recommend is Visualize This, focusing on how you can visualize your data (using R, Adobe Illustrator, python, etc).

Infovis and Human-Computer Interaction (HCI). Deals with interaction. (Sorry, this is a field I'm not familiar with so currently have nothing to say. I'll add contents here after enough research. )

So that's all for our very first tour of data science. Thank you for staying with me so far. For me, it is fun and pleasing to have such an opportunity to research in these relevant areas. I'm hoping to gain more insight into how Google is implemented in following years. If you're also interested in this and like to discuss, feel free to contact me. :-)

Thursday, December 15, 2011

The Origin of Red-Black Trees

In Introduction to Algorithms, a red-black tree is defined as a binary tree that satisfies the following red-black properties:

    1. Every node is either red or black.
    2. The root is black.
    3. Every leaf (NIL) is black.
    4. If a node is red, then both its children are black.
    5. For each node, all simple paths from the node to descendant leaves contain the same number of black node.
Starting from this definition, we can prove some very nice properties of red-black trees. The most important property is that a red-black tree with $n$ internal nodes has height at most 2log( $n+1$). Given this, we say that red-black trees are approximately balanced.

Another kind of balanced binary tree is the AVL tree. For an AVL tree with $n$ nodes, the height is strictly less than 1.44log( $n+2$ )-1. Though it seems better than the red-black trees, the delete operation for AVL tree may require $O(\log n)$ rotations, which is inconvenient for many applications, while the red-black trees require at most 3. So in practice, people prefer red-black trees more often.

The idea for AVL trees is straightforward. We first try to insert elements and check whether the resulting tree is height-balanced. If it's not, we re-adjust the structure of the tree to make it balanced. The technique we use is referred to as rotations. In the case of red-black trees, however, things do not seem so obvious: it is quite strange that people ever came up with the idea to use colors to control the height of a tree.

To gain a better understanding of red-black trees, especially how the idea of colors came about, we may want to have a look at the origin of red-black trees, which are essentially B-trees of order 4, or 2-3-4 tree, a 4-ary search tree where each node can have a degree of 2, 3, or 4. Given this definition, the red-black tree is in fact a binary tree representation of the B-tree of order 4: every B-tree node can be transformed, or expanded, to a set of black and red nodes as shown below:

Figure 1. Transforming a degree 4 node into red-black nodes

Figure 2. Transforming a degree 3 node into red-black nodes (two possible transformations)

Figure 3. Transforming a degree 2 node into red-black nodes

Note that in all these transformations, there should be no two successive red nodes because each node in the original B-tree is black in the corresponding red-black tree, and produces at most one level of red node(s). Moreover, since the height of a B-tree of $n$ nodes is $O(\log n)$, the resulting red-black tree has height $O(2\log n)=O(\log n)$.

Thus, we can use a binary search tree to represent a B-tree of order 4, which is actually how red-black trees are invented. Knowing this, we should no longer feel the definition of red-black trees so obscure.

Thursday, September 22, 2011

The Three Prisoners Problem

This is an interesting topic, but the answer is not quite straightforward. The problem is stated as following:

Three prisoners. Three prisoners A, B, and C, are on death row. The governor decides to pardon one of the three and chooses at random the prisoner to pardon. He informs the warden of his choice but requests that the name be kept secret for a few days.

The next day, A tries to get the warden to tell him who had been pardoned. The warden refuses. A then asks which of B or C will be executed. The warden thinks for a while, then tells A that B is to be executed.

Warden's reasoning: Each prisoner has a 1/3 chance of being pardoned. Clearly, either B or C must be executed, so I have given A no information about whether A will be pardoned.

A's reasoning: Given that B will be executed, then either A or C will be pardoned. My chance of being pardoned has risen to 1/2.

Actually the warden's reasoning is correct. If we define A, B, C to be the events that A, B, or C will be pardoned, and W to be the event that the warden says B will die, then we can calculate P(A|W)=1/3. (The details of this calculation can be found in Statistical Inference 2ed, Section 1.3). But why is that intuitively? The warden, at first glance, seems to have given A some information about the government's decision. Where's that info gone?

To answer this question, first notice the difference between the events that the warden says B will die, and B will die. In addition to telling A that B will be executed, what the warden says in fact contains extra information: If A is to be pardoned, the warden may either tell A that B or C is going to be executed; If A is to be executed, the warden has no choice but to tell A that B is to be executed (since C is definitely pardoned in this case). So why does the warden tell A that B will die? It's more probably because that A is to be executed. Therefore, though it seems that the probability that A is pardoned seems to have risen to 1/2, the extra information, coming from the warden's choice of telling A which one of B and C will die, decreases the probability back to 1/3.

Here is a variant of the Three Prisoners Problem.
There are 3 doors, behind one of which is there a car. A contestant is asked to pick a door randomly so that he has a chance to win the car. The contestant picks door 1. Then he's shown that there's nothing behind door 3. Now should the contestant switch to door 2 to maximize the chance of winning a car? (Answer: yes)