Data Science for Business

Data Science for Business Chapter 3 reading notes

feature selection

description: Introduction to Predictive Modeling: From Correlation to Supervised Segmentation

Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
This chapter mainly covers predictive modeling.


Models, Induction, and Prediction

DS_for_B_3-1
This part gives the definitions used in supervised modeling: attribute (feature) variables and the target variable; Figure 3-1 illustrates them.


Selecting Informative Attributes

Entropy: $H(X) = -\sum_{i=1}^{N} p_{i} \log(p_{i})$, which measures how disordered a set is (the $p_i$ are the proportions of each class in the set).
Information gain: measures how much an attribute improves (decreases) entropy over the whole segmentation it creates. In information theory, information gain is a synonym for Kullback–Leibler divergence; in the decision-tree setting it is the expected reduction in entropy from splitting on the attribute.
$IG(parent,children) = H(parent)- [p(c_{1})\times H(c_{1})+p(c_{2})\times H(c_{2})]$
(Choose the split with the higher information gain.)
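A minimal sketch of these two formulas in Python (NumPy assumed; the toy split below is made up):

```python
import numpy as np

def entropy(labels):
    """Entropy of a set of class labels: H = -sum(p_i * log2(p_i))."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent_labels, children_labels):
    """IG = H(parent) - weighted sum of the children's entropies."""
    n = len(parent_labels)
    weighted_children = sum(len(c) / n * entropy(c) for c in children_labels)
    return entropy(parent_labels) - weighted_children

# Toy example: 10 instances (6 positive, 4 negative) split by a binary attribute.
parent = np.array(list("++++++----"))
left, right = np.array(list("++++-")), np.array(list("++---"))
print(information_gain(parent, [left, right]))
```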


Example: Attribute Selection with Information Gain

datasource: https://archive.ics.uci.edu/ml/datasets/Mushroom
DS_for_B_3-2
DS_for_B_3-3
DS_for_B_3-4
Choose the attribute with the largest information gain as the splitting criterion, i.e., the one with the smallest weighted child entropy (the smallest shaded entropy area in the figure).


Probability Estimation

A frequency-based estimate of class-membership probability, with $n$ positive and $m$ negative instances in the segment:
$P=\frac{n}{n+m}$
With the Laplace correction, the equation for binary class probability estimation becomes:
$P=\frac{n+1}{n+m+2}$
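A quick check of the two estimates with made-up counts:

```python
def laplace_estimate(n, m):
    """Laplace-corrected estimate of the probability of the positive class,
    given n positive and m negative instances in a leaf."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw frequency estimate is 1.0,
# while the Laplace-corrected estimate is more conservative.
print(2 / (2 + 0))             # 1.0
print(laplace_estimate(2, 0))  # 0.75
```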


Example: Addressing the Churn Problem with Tree Induction

Note one point here: at every branch, the information gain must be recomputed on that branch's subset of the data in order to choose the next splitting attribute.
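A small sketch, assuming scikit-learn: DecisionTreeClassifier with criterion="entropy" chooses every split by information gain, recomputed on the subset of data reaching that branch; the synthetic data below is only a stand-in for the churn dataset.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the churn data: 1000 customers, 5 attributes.
X, y = make_classification(n_samples=1000, n_features=5, n_informative=3,
                           random_state=0)

# Each split maximizes information gain, recomputed on the data at that branch.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=[f"attr_{i}" for i in range(5)]))
```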


References

Fawcett T, Provost F. Data Science for Business[M]. O’Reilly Media, 2013.
information gain and decision tree: http://blog.sina.com.cn/s/blog_67bc5aa60100qays.html
Information gain in decision trees: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees


Data Science for Business Chapter 4 reading notes

description: Fitting a Model to Data

Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.
Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.


Linear Discriminant Functions

A general linear model: $f(x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+\cdots$

Linear regression, logistic regression, and support vector machines all fit this linear form; they differ in their objective (loss) functions, i.e., in how the best weights are chosen.


An Example of Mining a Linear Discriminant from Data

UCI Machine Learning Repository, Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris
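A minimal sketch, assuming scikit-learn; the two-class, two-feature setup roughly mirrors the book's figure, but the exact fitting procedure used in the book may differ.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Keep two species and two features (petal width, sepal width), so the
# discriminant is a line in two dimensions.
iris = load_iris()
mask = iris.target < 2                # setosa vs. versicolor
X = iris.data[mask][:, [3, 1]]        # petal width, sepal width
y = iris.target[mask]

# Mine a linear discriminant f(x) = w0 + w1*x1 + w2*x2 from the data.
clf = LogisticRegression().fit(X, y)
print("w0:", clf.intercept_, " w1, w2:", clf.coef_)
```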


Linear Discriminant Functions for Scoring and Ranking Instances

The value of $f(x)$ tells us the likelihood of the instance belonging to the class.


Logistic regression

Log-odds linear function:
$\log \left(\frac{p_{+}(x)}{1-p_{+}(x)}\right) =f(x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+\cdots$

The logistic function:
$p_{+}(x)=\frac{1}{1+e^{-f(x)}}$
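A small numeric check of the two formulas above (the weights $w_0$, $w_1$ are made up):

```python
import numpy as np

def f(x, w0=-3.0, w1=1.5):
    """An illustrative one-feature linear function f(x) = w0 + w1*x."""
    return w0 + w1 * x

def p_plus(x):
    """Logistic function: converts the log-odds f(x) into a probability."""
    return 1.0 / (1.0 + np.exp(-f(x)))

x = np.array([0.0, 2.0, 4.0])
print(p_plus(x))                            # probabilities in (0, 1)
print(np.log(p_plus(x) / (1 - p_plus(x))))  # recovers f(x), the log-odds
```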


Data Science for Business Chapter 5 reading notes

description: Overfitting and Its Avoidance

Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization


Overfitting Examined

DS_for_B_5-1
Fitting graph: performance on the training set vs. the test set (holdout); a more complex model is not necessarily better.

DS_for_B_5-2
Cross-validation is used as shown in the figure above: split the data into k folds, hold out each fold in turn as the test set while training on the rest, then compute the average and standard deviation of the performance estimates.
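A minimal sketch of cross-validation, assuming scikit-learn (the dataset and classifier are arbitrary stand-ins):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 10-fold cross-validation: each fold is held out once as the test set.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean accuracy:", np.mean(scores), " std:", np.std(scores))
```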

a classifier that always selects the majority class is called a base rate classifier.


Learning Curves

DS_for_B_5-3
Learning curve: a plot of the generalization performance against the amount of training data.
DS_for_B_5-4
Fitting graph: shows the generalization performance as well as the performance on the training data, plotted against model complexity.


Overfitting Avoidance and Complexity Control

Avoiding Overfitting with Tree Induction

  1. to stop growing the tree before it gets too complex
  2. to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
    For (1): require a minimum number of instances in each leaf.
    For (2): keep pruning back as long as doing so does not reduce accuracy on holdout data. (A sketch of both strategies follows this list.)
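A minimal sketch of both strategies, assuming scikit-learn; note that scikit-learn prunes via cost-complexity pruning (ccp_alpha) rather than the holdout-based reduced-error pruning described above, and the dataset is just a stand-in.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Strategy 1: stop growing early by requiring a minimum number of instances per leaf.
stopped = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X_tr, y_tr)

# Strategy 2: grow the tree, then prune it back (cost-complexity pruning here).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_tr)

for name, model in [("stopped early", stopped), ("pruned", pruned)]:
    print(name, "leaves:", model.get_n_leaves(), "test acc:", model.score(X_te, y_te))
```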

A General Method for Avoiding Overfitting

First choose the complexity parameter $C$ using nested holdout testing or cross-validation on the training data, then build the final model on all of the training data with that $C$.
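A minimal sketch, assuming scikit-learn's GridSearchCV; here the tree's max_depth stands in for the generic complexity parameter $C$.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Step 1: pick the complexity parameter by cross-validation on the training data.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [1, 2, 3, 5, 8, None]},
                      cv=5)
search.fit(X, y)
print("chosen complexity:", search.best_params_)

# Step 2: the final model is refit on all of the training data with that
# parameter; GridSearchCV keeps it as search.best_estimator_.
```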

Avoiding Overfitting for Parameter Optimization

Regularization: add a penalty on model complexity to the objective function being optimized.
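For example, with an L2 penalty the generic training objective becomes (a standard form, not a formula quoted from the book):
$\arg\min_{w} \sum_{i} L\left(y_{i}, f(x_{i}; w)\right) + \lambda \sum_{j} w_{j}^{2}$
where $L$ is the loss function and $\lambda$ controls the strength of the penalty.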


Data Science for Business Chapter 6 reading notes

description: Similarity, Neighbors, and Clusters

Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.
https://en.wikipedia.org/wiki/Cluster_analysis


Nearest-Neighbor Reasoning

Example: Whiskey Analytics

Represent each whiskey as a feature vector, then compute Euclidean distances between the vectors to find the most similar whiskeys.
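For instance, with made-up taste features (the numbers below are illustrative, not taken from the actual whiskey data):

```python
import numpy as np

# Hypothetical flavor features for two single malts (values are made up).
bunnahabhain = np.array([2, 3, 1, 0, 2])
glenfiddich  = np.array([1, 3, 2, 1, 0])

# Euclidean distance between the two feature vectors.
print(np.sqrt(np.sum((bunnahabhain - glenfiddich) ** 2)))
print(np.linalg.norm(bunnahabhain - glenfiddich))  # same result
```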

Nearest Neighbors for Predictive Modeling

DS_for_B_6-1
Classification: take a majority vote among the k nearest neighbors (majority vote: http://blog.csdn.net/feliciafay/article/details/18876123)
Probability estimation: use the fraction of the k neighbors belonging to each class.
Regression: e.g., to predict income, find the nearest neighbors and take the average or median of their incomes.

Similarity-moderated voting:
DS_for_B_6-2
The neighbors' contributions (weights) sum to 1.
For continuous attributes, discretize the values into ranges; for categorical attributes such as male/female, encode them as 0/1.
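A minimal sketch of similarity-moderated voting; the inverse-squared-distance weighting used here is one common choice, not necessarily the book's exact formula.

```python
import numpy as np

def weighted_vote(distances, labels):
    """Weight each neighbor by inverse squared distance and normalize the
    weights so the contributions sum to 1; return per-class scores."""
    w = 1.0 / (np.asarray(distances, dtype=float) ** 2)
    w /= w.sum()
    labels = np.asarray(labels)
    return {c: w[labels == c].sum() for c in np.unique(labels)}

# Three nearest neighbors with their distances and classes.
print(weighted_vote([1.0, 2.0, 4.0], ["+", "+", "-"]))
```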

Distance Functions:

http://blog.sina.com.cn/s/blog_618985870101jmnp.html


Clustering

Hierarchical Clustering

DS_for_B_6-3
Hierarchical clusterings generally are formed by starting with each node as its own cluster. Then clusters are merged iteratively until only a single cluster remains.
key point: distance function between clusters, sometimes called the linkage function
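A small sketch using SciPy's agglomerative clustering (the data points are made up; "ward" is just one possible linkage function):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

# Six instances described by two made-up features.
X = np.array([[1, 2], [1, 3], [2, 2], [8, 8], [8, 9], [9, 8]])

# Start with each instance as its own cluster and merge iteratively.
Z = linkage(X, method="ward")
print(Z)       # each row: the two clusters merged and the distance between them

dendrogram(Z)  # the dendrogram shows the full hierarchy
plt.show()
```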

application: Phylogenetic tree
https://en.wikipedia.org/wiki/Phylogenetic_tree
DS_for_B_6-4


Nearest Neighbors Revisited: Clustering Around Centroids

Most popular: k-means.
k-means only finds a local optimum, so it needs to be run many times with different random initializations; measure each clustering's distortion (sum of squared distances to the centroids) to pick the best one.
How to determine k: experiment with different values of k.
https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
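A quick sketch, assuming scikit-learn's KMeans on synthetic blobs; inertia_ is its name for the distortion (sum of squared distances to the centroids).

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

# k-means finds only a local optimum, so n_init restarts it several times and
# keeps the run with the lowest distortion; also try several values of k.
for k in [2, 3, 4, 5, 6]:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, km.inertia_)
```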

Example: Clustering Business News Stories

data: Thomson Reuters Text Research Collection (TRC2)

  1. Fetch the stories whose titles contain “apple”.
  2. Use TFIDF to further filter the stories.
  3. Vectorize the text and run k-means clustering (a rough sketch of steps 2 and 3 follows the list).
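A rough sketch of steps 2 and 3, assuming scikit-learn, on a tiny made-up corpus (the real example uses the TRC2 stories):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Stand-in documents; in the book these are news stories with "apple" in the title.
docs = ["apple releases new iphone", "apple stock rises on earnings",
        "apple pie recipe for fall", "orchard apple harvest season"]

# Vectorize with TFIDF, then cluster the document vectors with k-means.
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```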

Using Supervised Learning to Generate Cluster Descriptions

  1. Two ways to set it up: as a k-class problem, or as k one-vs-the-rest (one vs. k-1) problems.
  2. Learn a decision tree with the cluster labels as the target.
  3. Read each cluster's description off the tree;
    it describes only what differentiates this cluster from the others. (A sketch of the one-vs-the-rest variant follows.)
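A minimal sketch of the one-vs-the-rest variant, using Iris purely as a stand-in dataset and scikit-learn's tree export to read off the description:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(iris.data)

# One vs. the rest: a shallow tree separating cluster 0 from the other clusters;
# its rules describe what differentiates that cluster.
target = (labels == 0).astype(int)
tree = DecisionTreeClassifier(max_depth=2).fit(iris.data, target)
print(export_text(tree, feature_names=list(iris.feature_names)))
```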

Data Science for Business Chapter 7 reading notes

Key Idea

High accuracy sometimes means nothing: with an unbalanced (skewed) class distribution, a classifier that simply predicts the majority class can achieve high accuracy. Another pitfall is training on an artificially balanced population when the true population is unbalanced.

Evaluation Metrics

Confusion matrix
DS_for_B_7-1.png

True positive rate = TP/(TP + FN); True negative rate = TN/(TN + FP)
Precision = TP/(TP + FP); Recall = TP/(TP + FN); F-measure = $2\frac{precision \cdot recall}{precision+recall}$

Sensitivity = TP/(TP + FN) = True positive rate = 1 − False negative rate
Specificity = TN/(TN + FP) = True negative rate = 1 − False positive rate
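A quick numeric check with hypothetical counts:

```python
# Counts from a hypothetical confusion matrix.
TP, FP, FN, TN = 50, 10, 5, 35

tpr = TP / (TP + FN)        # true positive rate = recall = sensitivity
tnr = TN / (TN + FP)        # true negative rate = specificity
precision = TP / (TP + FP)
f_measure = 2 * precision * tpr / (precision + tpr)
print(tpr, tnr, precision, f_measure)
```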

Expected Value

DS_for_B_7-2.png
Expected value = p(Y|p)·p(p)·b(Y, p) + p(N|p)·p(p)·b(N, p) + p(N|n)·p(n)·b(N, n) + p(Y|n)·p(n)·b(Y, n)
= p(p)·[p(Y|p)·b(Y, p) + p(N|p)·b(N, p)] + p(n)·[p(N|n)·b(N, n) + p(Y|n)·b(Y, n)],
where b(predicted, actual) is the benefit of each outcome (a cost is a negative benefit).
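A small worked example; the probabilities and benefits below are hypothetical, loosely in the spirit of the book's targeted-marketing discussion.

```python
# Class priors p(p), p(n) and the classifier's conditional rates.
p_pos, p_neg = 0.05, 0.95
tpr, fnr = 0.60, 0.40        # p(Y|p), p(N|p)
tnr, fpr = 0.90, 0.10        # p(N|n), p(Y|n)

# Benefits b(predicted, actual); a cost is a negative benefit.
b = {("Y", "p"): 99, ("N", "p"): 0,
     ("N", "n"): 0, ("Y", "n"): -1}

ev = (p_pos * (tpr * b[("Y", "p")] + fnr * b[("N", "p")])
      + p_neg * (tnr * b[("N", "n")] + fpr * b[("Y", "n")]))
print(ev)   # expected value (profit) per consumer evaluated
```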

References

Fawcett T, Provost F. Data Science for Business[M]. O’Reilly Media, 2013.