Sunday, December 18, 2011

A First Look at Data Science

I have always wondered how Google manages to achieve so many amazing things: guessing (what we think), presenting (what we want), memorizing (what we need), etc. I keep trying to learn whatever I think I need in order to understand the internals of Google (for example, you may find a book called Google's PageRank and Beyond: The Science of Search Engine Rankings). Machine Learning, with its wealth of statistical methods for 'guessing', was my first guess at what led to Google's success. This guess is correct (the correctness will be discussed later) but not sufficient, since we also need actual technology, in addition to mere theory, to realize our ideas. Such technology, for example database systems, belongs to traditional Computer Science.

Maybe even that is not sufficient. New technology evolves every day, and it's hard to capture all of it in a single term. Thanks to my advisor, Prof. Daisy Zhe Wang, I was introduced to the field of Data Science, which, as far as I'm concerned, is a more satisfactory answer. (Note: I'm not saying that what Google does is data science. Rather, I mean that data science has contributed a lot to Google.) As a first look at data science, consider the following figures, taken from the articles The Rise of Data Science and Rise of the Data Scientist.


Figure 1. Skill sets for data science


Figure 2. Overview of data science

Now you should see why I consider 'data science' a satisfactory answer: it contains enough theory (math, statistics, machine learning, etc.), and it specifies what we need to turn that theory into real-life products (hacking skills, substantive expertise). I'm also very happy to see a field that more or less summarizes what I wish and need to learn, as discussed in the first paragraph.

Let's take a brief tour of data science following the path of Figure 2. Along the way I'll discuss what I think may be necessary to learn, or suggest reading materials in the specific field.

Computer Science. Computer science provides ways to store and retrieve data efficiently. Relevant fields include database systems, data structures & algorithms, and so on. Machine Learning may be placed here as well, or viewed as a joint area with statistics, as shown in Figure 1. You can find some good computer science books on Steven Pigeon's blog (French), Qunfeng's homepage, or by asking Google. Furthermore, CS emphasizes programming skills, which are a major plus for research in data science: machines are better than humans at mechanical tasks such as processing large-scale data sets. An example set of skills used by data scientists is shown in Figure 3, from EMC and Big Data: 12 Key Findings From the Data Science Study.
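To make "store and retrieve data efficiently" a bit more concrete, here is a minimal Python sketch of my own (not taken from any of the sources above). It compares scanning a list with looking up the same key in a hash index, which is roughly the idea behind an index in a database system. The toy records and the function names linear_lookup and build_index are made up purely for illustration.

```python
def linear_lookup(records, key):
    """Scan every (key, value) pair until the key is found: O(n) per lookup."""
    for k, v in records:
        if k == key:
            return v
    return None


def build_index(records):
    """Build a hash index once so later lookups take O(1) time on average."""
    return {k: v for k, v in records}


# Hypothetical toy data: (user_id, name) pairs.
records = [(i, f"user{i}") for i in range(100_000)]
index = build_index(records)

# Both approaches return the same answer; the index avoids walking the list.
by_scan = linear_lookup(records, 99_999)
by_index = index.get(99_999)
```

The trade-off is the usual one: the index costs extra memory and must be kept up to date, but each lookup no longer depends on how many records are stored.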

Figure 3. Example set of skills used by data scientists

Math, Statistics and Data Mining. These theories provide fundamental methods for analyzing the data you have (how to find patterns, how to make decisions based on data, etc.). There are a lot of excellent resources on these topics, some of which may be found at Kurt's page, Weipeng Liu's Blog (Chinese), and many more on the web. (Motivated by this, I'm planning to pursue a master's degree in statistics along with the Ph.D. in computer science.)

Graphic Design. Researchers should be able to show their results, and data visualization is a good way to do so. Data visualization is eye-catching and even fun: try visualizing your Google+ social network! It also benefits your research: you can spot some essential trends in your data at first glance from a visualization, which is convenient for later analysis. When considering data visualization, you should be able to decide what to visualize and how. There are several good books on these topics, some of which can be found at Kurt's page. One more book I'd like to recommend is Visualize This, which focuses on how to visualize your data (using R, Adobe Illustrator, Python, etc.).

Infovis and Human-Computer Interaction (HCI). These fields deal with interaction. (Sorry, this is a field I'm not familiar with, so I currently have nothing to say. I'll add content here after enough research.)

So that's all for our very first tour of data science. Thank you for staying with me so far. For me, it is fun and pleasing to have the opportunity to do research in these related areas. I'm hoping to gain more insight into how Google is implemented in the following years. If you're also interested in this and would like to discuss it, feel free to contact me. :-)
