A Few Things About Data Analysis

Reposted from caoz's blog (caoz的和谐blog)

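Let's start with some Weibo jokes. Two jokes about data analysis circulate on Weibo, and I often use them as case studies. In the first, an investor is interested in the industry a certain company belongs to and wants a background check. Person A is the technical type: he spends a week analyzing all kinds of online data, hunting everywhere for industry material, staying up late every night, and finally produces a report. Person B is the networking type: he has drinks once with the company's executives, treats its key staff to a meal, and gets all the inside data. Whose method is right? In the second, an e-commerce company notices that the weekly revenue of a competitor's Taobao shop suddenly dropped by 30% and then recovered on its own the following week, with nothing else unusual in between. The boss asks an analyst to look into it. The long-suffering analyst toils for days, builds all sorts of mathematical models, and finally comes up with a barely plausible explanation. The boss reads it; it is not convincing, but there is no better explanation either. One day he meets the competitor's boss and brings it up in passing: "Why did your revenue suddenly drop for a while?" "Oh, don't ask. I went home for a visit, and the company slacked off while I was away." The boss suddenly understands.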

A Practical Intro to Data Science

There are plenty of articles and discussions on the web about what data science is, what qualities define a data scientist, how to nurture them, and how you should position yourself to be a competitive applicant. There are far fewer resources out there about the steps to take in order to obtain the skills necessary to practice this elusive discipline. Here we will provide a collection of freely accessible materials and content to jumpstart your understanding of the theory and tools of Data Science.

At Zipfian Academy, we believe that everyone learns at a different pace and in a different way. If you prefer a more structured and intentional learning environment, we run a 12-week immersive bootcamp that trains people to become data scientists through hands-on projects and real-world applications. We also host a free Skillshare course covering much of this material at a high level.

We would love to hear your opinions on what qualities make great data scientists, what a data science curriculum should cover, and what skills are most valuable for data scientists to know. Share your thoughts over at Hacker News!

While the information contained in these resources is a great guide and reference, the best way to become a data scientist is to make, create, and share!

A simple guide to getting started with data science

There are many articles on this subject from renowned data scientists (Dataspora, Gigaom, Quora, Hilary Mason). This post captures my journey, as a software engineer, in learning Statistics and Data Visualization.

I'm midway through my five-year journey to become proficient in data science. My learning program has included self-study (books, blogs, toy problems), projects at work, classroom training (Stanford), teaching and presentations, and conferences (UseR, Strata). Here's what I've done so far, what worked, and what didn't...

R Packages

Text Mining

Rwordseg

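A Chinese word segmentation tool for R that uses rJava to call the Java segmenter Ansj. Ansj is itself an open-source Java Chinese word segmentation tool, based on the ictclas Chinese word segmentation algorithm from the Chinese Academy of Sciences, which uses a Hidden Markov Model (HMM). Its author, Sun Jian (孙健), rewrote the algorithm in Java and open-sourced all of it, so Ansj can be used for person name recognition, place name recognition, organization name recognition, multi-level part-of-speech tagging, keyword extraction, fingerprint extraction, and similar tasks. It supports industry dictionaries and user-defined dictionaries. For details, see the interview with Sun Jian and the project's GitHub page.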
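To make the HMM idea concrete, here is a minimal Python sketch of character-tagging segmentation decoded with the Viterbi algorithm. It is not Ansj's or ictclas's actual implementation; it is one common textbook formulation in which each character is tagged B, M, E, or S (begin, middle, or end of a multi-character word, or a single-character word). All probabilities below, and the names `viterbi`, `emit`, and `segment`, are hypothetical hand-set values chosen for illustration, not a trained model.

```python
import math

STATES = "BMES"  # Begin / Middle / End of a word, or Single-character word
NEG_INF = float("-inf")

# Hypothetical toy parameters (log-probabilities), for illustration only.
START = {"B": math.log(0.6), "S": math.log(0.4)}
TRANS = {
    "B": {"M": math.log(0.3), "E": math.log(0.7)},
    "M": {"M": math.log(0.4), "E": math.log(0.6)},
    "E": {"B": math.log(0.6), "S": math.log(0.4)},
    "S": {"B": math.log(0.6), "S": math.log(0.4)},
}
# Toy, unnormalized emission scores; unseen characters get a small default.
EMIT = {
    "B": {"数": 0.4, "分": 0.4},
    "E": {"据": 0.4, "析": 0.4},
    "S": {"的": 0.6},
}


def emit(state, ch):
    """Log emission score of character ch in the given state."""
    return math.log(EMIT.get(state, {}).get(ch, 0.01))


def viterbi(chars):
    """Return the highest-scoring BMES tag sequence for a list of characters."""
    if not chars:
        return []
    score = [{s: START.get(s, NEG_INF) + emit(s, chars[0]) for s in STATES}]
    back = []  # back[i][s] = best previous state for state s at position i + 1
    for ch in chars[1:]:
        prev, cur, ptr = score[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + TRANS[p].get(s, NEG_INF))
            cur[s] = prev[best] + TRANS[best].get(s, NEG_INF) + emit(s, ch)
            ptr[s] = best
        score.append(cur)
        back.append(ptr)
    # A sentence has to finish a word, so the last tag must be E or S.
    tags = [max("ES", key=lambda s: score[-1][s])]
    for ptr in reversed(back):
        tags.append(ptr[tags[-1]])
    return tags[::-1]


def segment(text):
    """Split text into words by cutting after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, viterbi(list(text))):
        current += ch
        if tag in "ES":
            words.append(current)
            current = ""
    return words


print(segment("数据的分析"))  # ['数据', '的', '分析']
```

A real segmenter estimates these tables from a tagged corpus and combines them with dictionaries; the sketch only shows how the decoding step turns per-character scores into word boundaries.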

rmmseg4j (not recommended; use Rwordseg instead)

Data Resource

Free Data (Data Source - Package):

  • Google Finance historical data - quantmod
  • Google Finance balance sheets - quantmod
  • Yahoo Finance historical data - quantmod
  • Yahoo Finance historical data - tseries
  • Yahoo Finance current options chain - quantmod
  • Yahoo Finance historical analyst estimates - fImport
  • Yahoo Finance current key stats - fImport (seems to be broken)
  • OANDA historic exchange rates/metal prices - quantmod
  • FRED historic macroeconomic indicators - quantmod
  • World Bank historic macroeconomic indicators - WDI
  • Google Trends historic search volume data - RGoogleTrends
  • Google Docs - RGoogleDocs
  • Twitter - twitteR
  • Zillow - Zillow
  • New York Times - RNYTimes
  • US Census 2000 - UScensus2000
  • infochimps - infochimps
  • datamarket - rdatamarket (requires free account)
  • Factual.com - factualR
  • Geocode addresses - RDSTK
  • Map coordinates to political boundaries - RDSTK
  • Weather Underground - Roll your own
  • Google News - Roll your own
  • Earth Sciences netCDF Data - Roll your own
  • Climate Data - Roll your own
  • Public health data - Roll your own
  • FishBase - rfishbase

Paid Data (Data Source - Package):

  • Bloomberg - RBloomberg
  • LIM - LIM
  • Trades and Quotes from NYSE - RTAQ
  • Interactive Brokers - IBrokers

Useful Tools: RCurl, RJSON, RJSONIO, XML, scraper, digitizer

Reposted from: http://stats.stackexchange.com/questions/12670/data-apis-feeds-available-as-packages-in-r

Top 10 wrong ways to do big data

From http://www.cnblogs.com/wentingtu/archive/2012/03/16/2399921.html

It is easy to go wrong if we:

  1. Lack data
  2. Focus too much on training
  3. Rely on one technique
  4. Ask the wrong question
  5. Listen only to the data
  6. Use information from the future (Accept Leaks from the Future)
  7. Discard cases that should not be ignored (Discount Pesky Cases)
  8. Trust predictions too readily (Extrapolate)
  9. Try to answer every question (Answer Every Inquiry)
  10. Sample casually
  11. Trust the best model too much (Believe the Best Model)
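The "leaks from the future" item can be illustrated with a small sketch. The Python below is not from the original post; the record layout and the helper names `chronological_split` and `has_future_leak` are hypothetical. It contrasts a split by date, where the model is only ever trained on records older than the ones it is tested on, with a casual interleaved sample that mixes later records into the training set.

```python
from datetime import date, timedelta


def chronological_split(rows, cutoff):
    """Train on rows dated strictly before cutoff, test on the rest."""
    train = [r for r in rows if r["day"] < cutoff]
    test = [r for r in rows if r["day"] >= cutoff]
    return train, test


def has_future_leak(train, test):
    """True if any training row is dated on or after the earliest test row."""
    if not train or not test:
        return False
    return max(r["day"] for r in train) >= min(r["day"] for r in test)


# Hypothetical daily records: 100 consecutive days of some measured value.
start = date(2012, 1, 1)
rows = [{"day": start + timedelta(days=i), "value": i} for i in range(100)]

# Split by time: the last 20 days are held out for testing.
train, test = chronological_split(rows, cutoff=start + timedelta(days=80))

# Casual split: every fifth row goes to the test set, regardless of date.
leaky_test = rows[::5]
leaky_train = [r for i, r in enumerate(rows) if i % 5 != 0]

print(has_future_leak(train, test))              # False
print(has_future_leak(leaky_train, leaky_test))  # True
```

The check only catches leakage through row dates; features computed from later data (for example, a full-period average) can leak in the same way and need a separate review.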

Data mining, analytics, and knowledge discovery software

from http://www.kdnuggets.com/software/suites.html

  • 11Ants Model Builder, a desktop predictive analytics modelling tool, which includes regression, classification and propensity models. A related product is 11Ants Predictor, a high speed scoring engine for deploying predictive models against enterprise databases.

  • AdvancedMiner (formerly Gornik), a platform for data mining and analysis, featuring modeling interface (OOP script, latest GUI design, advanced visualization) and grid computing.

  • Affinium Model (from Unica), includes valuator, profiler, response modeler, and cross seller.

  • Alice d'Isoft, offers interactive decision trees for the business user.

  • Angoss Knowledge Studio, a comprehensive suite of data mining and predictive modeling tools; interoperability with SAS and other major statistical tools.

  • ASA's ModelMAX, predictive modeling and clustering analysis for business users that want fast and accurate results.

  • BayesiaLab, a complete and powerful data mining tool based on Bayesian networks, including data preparation, missing values imputation, data and variables clustering, unsupervised and supervised learning.

  • BioComp i-Suite, constraint-based optimization, cause and effect analysis, non-linear predictive modeling, data access and cleaning, and more.

  • BLIASoft Knowledge Discovery software, for building models from data based mainly on fuzzy logic.

  • Clementine from SPSS, leading visual rapid modeling environment for data mining. Now includes Clementine Server.

  • Data Applied, offers a comprehensive suite of web-based data mining techniques, an XML web API, and rich data visualizations.

  • Data Miner Software Kit, collection of data mining tools, offered in combination with a book: Predictive Data Mining: A Practical Guide, Weiss and Indurkhya.

  • DataDetective, the powerful yet easy to use data mining platform and the crime analysis software of choice for the Dutch police.

  • DataLab, a complete and powerful data mining tool with a unique data exploration process, with a focus on marketing and interoperability with SAS.

  • DBMiner 2.0 (Enterprise), powerful and affordable tool to mine large databases; uses Microsoft SQL Server 7.0 Plato

  • Delta Miner, integrates new search techniques and "business intelligence" methodologies into an OLAP front-end that embraces the concept of Active Information Management.

  • ESTARD Data Miner, simple to use, designed both for data mining experts and common users.

  • EWA Systems, complete Java-based data-mining suite, including a full range of high-performance rules-based, Bayesian, neural, and SVM techniques.

  • Exeura Rialto(TM) provides comprehensive support for the entire data mining and analytics lifecycle at an affordable price in a single, easy-to-use tool.

  • Fair Isaac Model Builder, software platform for developing and deploying analytic models, includes data analysis, decision tree and predictive model construction, decision optimization, business rules management, and open-platform deployment.

  • FastStats Suite (Apteco), marketing analysis products, including data mining, customer profiling and campaign management.

  • GainSmarts, uses predictive modeling technology that can analyze past purchase, demographic, and lifestyle data to predict the likelihood of response and develop an understanding of consumer characteristics.

  • Generation5 GenVoy, On-Demand Consumer Analytics.

  • GenIQ Model, uses machine learning for regression task; automatically performs variable selection, and new variable construction, and then specifies the model equation to "optimize the decile table".

  • GhostMiner, complete data mining suite, including k-nearest neighbors, neural nets, decision tree, neurofuzzy, SVM, PCA, clustering, and visualization.

  • GMDH Shell, an advanced but easy to use tool for predictive modeling and data mining.

  • Golden Helix Optimus RP, uses Formal Inference-based Recursive Modeling (recursive partitioning based on dynamic programming) to find complex relationships in data and to build highly accurate predictive and segmentation models.

  • IBM Intelligent Miner Data Mining Suite, now fully integrated into the IBM InfoSphere Warehouse software; includes Data and Text mining tools (based on UIMA).

  • JMP, offers significant visualization and data mining capabilities along with classical statistical analyses.

  • K.wiz, from thinkAnalytics, a massively scalable, embeddable, Java-based real-time data-mining platform. Designed for Customer and OEM solutions.

  • Kaidara Advisor, (formerly Acknosoft KATE), Case-Based Reasoning (CBR) and data mining engine.

  • Kensington Discovery Edition, high-performance discovery platform for life sciences, with multi-source data integration, analysis, visualisation, and workflow building.

  • Kepler, extensible, multi-paradigm, multi-purpose data mining system.

  • KnowledgeMiner, a self-organizing modeling tool that uses GMDH neural nets and artificial intelligence to easily extract knowledge from data. (MacOS)

  • KnowledgeMiner (yX) for Excel, a knowledge mining tool that works with data stored in Microsoft Excel for building predictive and descriptive models. (MacOS, Excel 2004 or later).

  • Kontagent kSuite DataMine, a SaaS User Analytics platform offering real-time behavioral insights for Social, Mobile and Web, offering SQL-like queries on top of Hadoop deployments.

  • KXEN (Knowledge eXtraction ENgines), providing Vapnik SVM (Support Vector Machines) tools, including data preparation, segmentation, time series, and SVM classifiers.

  • LIONsolver 2.0, Learning and Intelligent OptimizatioN: modeling and optimization with "on the job learning" for business and engineering by Reactive Search SrL.

  • LPA Data Mining tools support fuzzy, bayesian and expert discovery and modeling of rules.

  • Magnify PATTERN, software suite, contains PATTERN:Prepare for data preparation; PATTERN:Model for building predictive models; and PATTERN:Score for model deployment

  • Mathematica solution for Data Analysis and Mining, from Wolfram.

  • MCubiX from Diagnos, a complete and affordable data mining toolbox, including decision tree, neural networks, associations rules, visualization, and more.

  • MERKUR Miner Plus combines OLAP high speed and visualization with Data Mining to create forecasting and classification models.

  • Microsoft SQL Server 2005, empowers informed decisions with predictive analysis through intuitive data mining, seamlessly integrated within the Microsoft BI platform, and extensible into any application.

  • mlf (Machine Learning Framework), provides analysis, prediction, and visualization using fuzzy logic and ML methods; implemented in C++ and integrated into Mathematica.

  • Model 1, Response Modeler, Segmenter and Profiler, Customer Valuator, and Cross-Seller modules with a wizard GUI.

  • Molegro Data Modeller, a cross-platform application for Data Mining, Data Modelling, and Data Visualization.

  • Nuggets, builds models that uncover hidden facts and relationships, predict for new data, and find key variables (Windows).

  • Oracle Data Mining (ODM), enables customers to produce actionable predictive information and build integrated business intelligence applications.

  • Palisade DecisionTools Suite, Complete risk and decision analysis toolkit.

  • Partek, pattern recognition, interactive visualization, and statistical analysis & modeling system.

  • Pentaho open-source BI suite, including reporting, analysis, dashboards, data integration, and data mining based on Weka.

  • Polyanalyst, comprehensive suite for data mining, now also including text analysis, decision forest, and link analysis. Supports OLE DB for Data Mining, and DCOM technology.

  • Powerhouse Data Mining software for predictive and clustering modelling, based on Dorian Pyle's ideas on using Information Theory in data analysis. Most information is in Spanish.

  • Predictive Data Mining Suite from Predictive Dynamix integrates graphical and statistical data analysis with modeling algorithms including neural networks, clustering, fuzzy systems, and genetic algorithms.

  • Previa family of products for classification and forecasting.

  • RiverGlass software offers data mining, streaming data analysis, visualization, and more.

  • Quadstone DecisionHouse provides data extraction, management, pre-processing and visualisation, plus customer profiling, segmentation and geographical display.

  • RapAnalyst(tm), uses advanced artificial intelligence to create dynamic predictive models, to reveal relationships between new and historical data.

  • Rapid Insight Analytics streamlines the predictive modeling and data exploration process, enabling users of all abilities to quickly build, test, and implement statistical models at lightning speed.

  • Reel Two, real-time classification software for structured and unstructured data as well as entity extraction. From desktop to enterprise.

  • Salford Systems Data Mining Suite: CART Decision Trees, MARS predictive modeling, automated regression, TreeNet classification and regression, data access, preparation, cleaning and reporting modules, RandomForests predictive modeling, clustering and anomaly detection.

  • SAS Enterprise Miner, an integrated suite which provides a user-friendly GUI front-end to the SEMMA (Sample, Explore, Modify, Model, Assess) process.

  • SPAD, provides powerful exploratory analyses and data mining tools, including PCA, clustering, interactive decision trees, discriminant analyses, neural networks, text mining and more, all via user-friendly GUI.

  • SPSS featuring Clementine, SPSS and other data mining tools.

  • StarProbe, cross-platform, very fast on big data, star schema support, special tools & features for data with rich categorical dimensional information.

  • Statistica Data Miner, a comprehensive, integrated statistical data analysis, graphics, data base management, and application development system.

  • Synapse, a development environment for neural networks and other adaptive systems, supporting the entire development cycle from data import and preprocessing via model construction and training to evaluation and deployment; allows deployment as .NET components.

  • Teradata Warehouse Miner and Teradata Analytics, providing analytic services for in-place mining on a Teradata DBMS.

  • thinkCRA from thinkAnalytics, an integrated suite of Customer Relationship Analytics applications supporting real-time decisioning.

  • TIBCO Spotfire Miner, combining Spotfire visualization, Insightful Miner, and S+ with an intuitive drag-and-drop user interface.

  • TIMi Suite: The Intelligent Mining machine, a family of stand-alone, automated, user-friendly GUI tools for prediction, segmentation and data preparation, with high scalability, speed, ROI & prediction accuracy (a recurrent top winner at KDD cups).

  • Viscovery data mining suite, a unique, comprehensive data mining suite for business applications with workflow-guided project environment; includes modules for visual data mining, clustering, scoring, automation and real-time integration.

  • WITNESS Miner, a graphical data mining tool with decision trees, clustering, discretisation, feature subset selection, and more.

  • XLMiner, Data Mining Add-In For Excel.

  • Xpertrule Miner 4.0, (Attar Software) features data transformation, Decision Trees, Association Rules and Clustering on large scale data sets.

  • Zoom 'n View, the plug-in reporting solutions.

Free and Shareware

  • ADaM, Algorithm Development and Mining version 4.0 toolkit

  • AlphaMiner, open source data mining platform that offers various data mining model building and data cleansing functionality.

  • CRAN Task View: Machine Learning & Statistical Learning, machine learning and statistical packages in R.

  • Databionic ESOM Tools, a suite of programs for clustering, visualization, and classification with Emergent Self-Organizing Maps (ESOM).

  • ELKI: Environment for DeveLoping KDD-Applications Supported by Index-Structures, a framework in Java which includes clustering, outlier detection, and other algorithms; allows user to evaluate the combination of arbitrary algorithms, data types, and distance functions.

  • Gnome Data Mining Tools, including apriori, decision trees, and Bayes classifiers.

  • jHepWork, an interactive environment for scientific computation, data analysis and data visualization designed for scientists, engineers and students.

  • KEEL, includes knowledge extraction algorithms, preprocessing techniques, evolutionary rule learning, genetic fuzzy systems, and more.

  • KNIME, extensible open source data mining platform implementing the data pipelining paradigm (based on eclipse).

  • Machine Learning in Java (MLJ), an open-source suite of Java tools for research in machine learning.

  • MiningMart, a graphical tool for data preprocessing and mining on relational databases; supports development, documentation, re-use and exchange of complete KDD processes. Free for non-commercial purposes.

  • ML-Flex, an open-source software package designed to enable flexible and efficient processing of disparate data sets for machine-learning (classification).

  • MLC++, a machine learning library in C++. Also Kansas State U. port of MLC++: Binary (tar.gz), and Linux source.

  • Orange, open source data analytics and mining through visual programming or Python scripting. Components for visualization, rule learning, clustering, model evaluation, and more.

  • RapidMiner, a leading open-source system for knowledge discovery and data mining.

  • Rattle, a data mining suite based on open source statistical language R, includes graphics, clustering, modeling, and more.

  • StarProbe, Web-based multi-user server available for academic institutions.

  • TANAGRA, offers a GUI interface and methods for data access, statistics, feature selection, classification, clustering, visualization, association and more.

  • Weka, collection of machine learning algorithms for solving real-world data mining problems. It is written in Java and runs on almost any platform.