A simple guide to getting started with data science

There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey (a software engineer) on learning Statistics and Data Visualization.

I'm mid-way in my 5 year journey to become proficient in data science and my learning program has included self-learning (books, blogs, toy problems), projects at work, class-room training (Stanford), teaching/presentations, conferences (UseR, Strata). Here's what I've done so far and what worked and what didn't...

R Packages

文本挖掘

Rwordseg

R环境下的中文分词工具,使用rJava调用Java分词工具Ansj。Ansj 也是一个开源的 Java 中文分词工具,基于中科院的 ictclas 中文分词算法,采用隐马尔科夫模型(Hidden Markov Model, HMM)。作者孙健重写了一个Java版本,并且全部开源,使得 Ansi 可用于人名识别、地名识别、组织机构名识别、多级词性标注、关键词提取、指纹提取等领域,支持行业词典、 用户自定义词典。详细信息可以参考作者孙健的专访以及项目的Github地址

rmmseg4j(不推荐,使用Rordseg替代)