Data Science for Business Chapter 3 reading notes
feature selection
description: Introduction to Predictive Modeling: From Correlation to Supervised Segmentation
Fundamental concepts: Identifying informative attributes; Segmenting data by progressive attribute selection.
Exemplary techniques: Finding correlations; Attribute/variable selection; Tree induction.
This chapter is mainly about predictive modeling.
Models, Induction, and Prediction
This section defines the terminology for supervised modeling: attribute (feature) variables and the target variable. Figure 3-1 in the book illustrates them.
Selecting Informative Attributes
entropy: $H(S) = -\sum_{i=1}^{N} p_{i} \log_2(p_{i})$, where $p_{i}$ is the proportion of instances of class $i$ in set $S$; it measures how disordered the set is.
Information gain: measures how much an attribute improves (decreases) entropy over the whole segmentation it creates. (In information theory, "information gain" is a synonym for the Kullback–Leibler divergence; in the decision-tree setting it is the expected reduction in entropy, i.e. the mutual information between the attribute and the target.)
$IG(\text{parent},\text{children}) = H(\text{parent})- [\,p(c_{1})\times H(c_{1})+p(c_{2})\times H(c_{2})+\cdots\,]$
(Choose the split with the higher information gain.)
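A minimal Python sketch of both formulas (the toy labels and the two-way split below are invented for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(S) = -sum_i p_i * log2(p_i) over the class proportions in `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(parent_labels, children_labels):
    """IG = H(parent) - sum_j p(c_j) * H(c_j), where p(c_j) is the
    proportion of parent instances that fall into child j."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        (len(child) / n) * entropy(child) for child in children_labels
    )
    return entropy(parent_labels) - weighted_child_entropy

# Toy example: splitting 10 instances on a hypothetical binary attribute.
parent = ["yes"] * 5 + ["no"] * 5
children = [["yes"] * 4 + ["no"], ["yes"] + ["no"] * 4]
print(information_gain(parent, children))  # about 0.278 bits
```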
Example: Attribute Selection with Information Gain
datasource: https://archive.ics.uci.edu/ml/datasets/Mushroom
Select the attribute with the largest information gain as the splitting criterion, i.e. the attribute whose split leaves the lowest weighted child entropy (the smallest shaded entropy area in the book's figure).
Probability Estimation
A frequency-based estimate of class membership probability, for a leaf containing $n$ instances of the class and $m$ instances of the other class:
$ P=\frac{n}{n+m}$
Using the Laplace correction, the equation for binary class probability estimation becomes:
$ P=\frac{n+1}{n+m+2}$
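A tiny numeric sketch of the difference the correction makes (the leaf counts are invented):

```python
def raw_estimate(n, m):
    """Frequency-based estimate: n positives, m negatives in the leaf."""
    return n / (n + m)

def laplace_estimate(n, m):
    """Laplace-corrected estimate for a binary class."""
    return (n + 1) / (n + m + 2)

# A leaf with 2 positives and 0 negatives: the raw estimate says p = 1.0,
# while the Laplace correction tempers it toward 0.5 for such a tiny leaf.
print(raw_estimate(2, 0), laplace_estimate(2, 0))    # 1.0  0.75
print(raw_estimate(20, 0), laplace_estimate(20, 0))  # 1.0  ~0.95
```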
Example: Addressing the Churn Problem with Tree Induction
Note one point here: at each branch of the tree, information gain must be recomputed on just the subset of instances reaching that branch, in order to choose the next splitting attribute.
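The book's churn dataset is not public, so here is a hedged scikit-learn sketch on synthetic data; `criterion="entropy"` makes tree induction perform exactly this greedy, per-branch information-gain selection:

```python
# Sketch only: synthetic stand-in data, since the book's churn dataset is not public.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)

# criterion="entropy" makes each split maximize information gain,
# recomputed on the subset of data reaching that branch.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3, random_state=0)
tree.fit(X, y)
print(export_text(tree, feature_names=[f"x{i}" for i in range(5)]))
```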
reference
Fawcett T, Provost F. Data Science for Business[M]. O’Reilly Media, 2013.
information gain and decision tree: http://blog.sina.com.cn/s/blog_67bc5aa60100qays.html
Information gain in decision trees: https://en.wikipedia.org/wiki/Information_gain_in_decision_trees
Data Science for Business Chapter 4 reading notes
description: Fitting a Model to Data
Fundamental concepts: Finding “optimal” model parameters based on data; Choosing the goal for data mining; Objective functions; Loss functions.
Exemplary techniques: Linear regression; Logistic regression; Support-vector machines.
Linear Discriminant Functions
A general linear model: $f(x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+\cdots$
Linear regression, logistic regression, and support vector machines
An Example of Mining a Linear Discriminant from Data
UCI Dataset Repository, Iris dataset: https://archive.ics.uci.edu/ml/datasets/Iris
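A minimal sketch of mining a linear discriminant from the Iris data with scikit-learn (which ships a copy of the UCI dataset); the two-attribute, setosa-vs-rest setup is my simplification, not exactly the book's example:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

iris = load_iris()
X = iris.data[:, :2]                # two attributes, so f(x) = w0 + w1*x1 + w2*x2
y = (iris.target == 0).astype(int)  # setosa vs. the rest

for model in (LogisticRegression(max_iter=1000), LinearSVC(max_iter=10000)):
    model.fit(X, y)
    w0 = model.intercept_[0]
    w1, w2 = model.coef_[0]
    # Each method fits the same general linear form, with different objectives.
    print(type(model).__name__, f"f(x) = {w0:.2f} + {w1:.2f}*x1 + {w2:.2f}*x2")
```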
Linear Discriminant Functions for Scoring and Ranking Instances
The value of $f(x)$ tells us the likelihood of the instance belonging to the class: the farther from the decision boundary, the more confident the ranking.
Linear regression
Log-odds linear function:
$\log \left(\frac{p_{+}(x)}{1-p_{+}(x)}\right) =f(x)=w_{0}+w_{1}x_{1}+w_{2}x_{2}+\cdots$
The logistic function:
$p_{+}(x)=\frac{1}{1+e^{-f(x)}}$
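Solving the log-odds equation for $p_{+}(x)$ gives the logistic function (one line of algebra the notes skip):

$\frac{p_{+}(x)}{1-p_{+}(x)} = e^{f(x)} \;\Rightarrow\; p_{+}(x) = \frac{e^{f(x)}}{1+e^{f(x)}} = \frac{1}{1+e^{-f(x)}}$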
Data Science for Business Chapter 5 reading notes
description: Overfitting and Its Avoidance
Fundamental concepts: Generalization; Fitting and overfitting; Complexity control.
Exemplary techniques: Cross-validation; Attribute selection; Tree pruning; Regularization
Overfitting Examined
A fitting graph plots performance on the training set and on a holdout (test) set; a more complex model is not necessarily better.
Cross-validation: repeatedly split the data into training and test folds (as illustrated in the book's figure), then compute the average and standard deviation of the performance across folds.
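A minimal scikit-learn sketch of 10-fold cross-validation (Iris used as a stand-in dataset):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)

# Report the average and standard deviation across the 10 folds.
print(f"accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```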
a classifier that always selects the majority class is called a base rate classifier.
Learning Curves
Learning curve: a plot of generalization performance against the amount of training data.
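scikit-learn can compute the points of such a learning curve directly; a sketch, again with Iris as a stand-in dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
sizes, train_scores, test_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)

# Generalization (test) performance typically rises with more training data.
for n, tr, te in zip(sizes, train_scores.mean(axis=1), test_scores.mean(axis=1)):
    print(f"n={n:3d}  train={tr:.2f}  test={te:.2f}")
```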
Fitting graph: shows generalization performance as well as performance on the training data, but plotted against model complexity.
Overfitting Avoidance and Complexity Control
Avoiding Overfitting with Tree Induction
- to stop growing the tree before it gets too complex
- to grow the tree until it is too large, then “prune” it back, reducing its size (and thereby its complexity).
For option 1: require a minimum number of instances in each leaf (stop splitting when a leaf would get too small).
For option 2: prune back until further pruning would reduce accuracy, e.g. estimated on a validation set. Both options are sketched below.
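Both strategies exist in scikit-learn; a sketch on synthetic data (the values `min_samples_leaf=20` and `ccp_alpha=0.01` are arbitrary, and cost-complexity pruning is scikit-learn's flavor of post-pruning rather than the book's accuracy-based rule):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Strategy 1: stop early by requiring a minimum number of instances per leaf.
pre_pruned = DecisionTreeClassifier(min_samples_leaf=20, random_state=0).fit(X, y)

# Strategy 2: grow fully, then prune back via cost-complexity pruning.
post_pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X, y)

full = DecisionTreeClassifier(random_state=0).fit(X, y)
print(full.get_n_leaves(), pre_pruned.get_n_leaves(), post_pruned.get_n_leaves())
```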
A General Method for Avoiding Overfitting
First find the best complexity parameter $C$ (e.g. with a nested holdout or cross-validation on the training data), then build the final model with that complexity.
Avoiding Overfitting for Parameter Optimization
Regularization: add a complexity penalty to the objective function being optimized.
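In symbols (a common general form, not copied verbatim from the book): choose the parameters $w$ to maximize fit minus a weighted complexity penalty,

$w^{*} = \arg\max_{w}\,\big[\,\text{fit}(\mathbf{x}, w) - \lambda \cdot \text{penalty}(w)\,\big]$

where an L2 penalty $\lVert w \rVert^{2}$ gives ridge-style regularization, an L1 penalty $\sum_{j}|w_{j}|$ gives lasso-style regularization, and $\lambda$ is the complexity parameter tuned as described above.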
Data Science for Business Chapter 6 reading notes
description: Similarity, Neighbors, and Clusters
Fundamental concepts: Calculating similarity of objects described by data; Using similarity for prediction; Clustering as similarity-based segmentation.
Exemplary techniques: Searching for similar entities; Nearest neighbor methods; Clustering methods; Distance metrics for calculating similarity.
https://en.wikipedia.org/wiki/Cluster_analysis
Nearest-Neighbor Reasoning
Example: Whiskey Analytics
represent as a feature vector, compute Euclidean distance
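A minimal sketch of that computation; the flavor vectors below are invented for illustration, only the whiskey names are real:

```python
import numpy as np

# Hypothetical flavor vectors (e.g. smoky, sweet, fruity, ...) for a few whiskeys.
whiskeys = {
    "Bunnahabhain":  np.array([2, 3, 1, 0]),
    "Glenglassaugh": np.array([2, 2, 1, 1]),
    "Laphroaig":     np.array([4, 1, 0, 0]),
}

query = whiskeys["Bunnahabhain"]
distances = {
    name: np.linalg.norm(query - vec)  # Euclidean distance
    for name, vec in whiskeys.items() if name != "Bunnahabhain"
}
# Nearest whiskeys to Bunnahabhain, closest first.
for name, d in sorted(distances.items(), key=lambda kv: kv[1]):
    print(f"{name}: {d:.2f}")
```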
Nearest Neighbors for Predictive Modeling
Classification
Majority Vote: http://blog.csdn.net/feliciafay/article/details/18876123
Probability Estimation
Regression
predict income, find nearest neighbors, compute average or median.
similarity moderated voting:
Each neighbor's contribution (weight) is scaled so that the contributions sum to 1.
Attribute handling for distance computation: bin continuous attributes into segments; encode binary categorical attributes such as male/female as 0/1.
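A sketch of similarity-moderated voting; the neighbors and distances are invented, and inverse squared distance is just one common similarity weight:

```python
from collections import defaultdict

# (class label, distance to the query) for the k nearest neighbors -- invented values.
neighbors = [("yes", 1.0), ("yes", 2.0), ("no", 0.5)]

# Similarity weight: inverse squared distance, then normalized so weights sum to 1.
raw = [(label, 1.0 / d**2) for label, d in neighbors]
total = sum(w for _, w in raw)

votes = defaultdict(float)
for label, w in raw:
    votes[label] += w / total

print(dict(votes))               # e.g. {'yes': 0.24, 'no': 0.76}
print(max(votes, key=votes.get)) # the similarity-weighted prediction
```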
Distance Functions:
http://blog.sina.com.cn/s/blog_618985870101jmnp.html
Clustering
Hierarchical Clustering
Hierarchical clusterings are generally formed by starting with each instance (data point) as its own cluster; clusters are then merged iteratively until only a single cluster remains.
key point: distance function between clusters, sometimes called the linkage function
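A minimal SciPy sketch of hierarchical (agglomerative) clustering with an explicit linkage choice; the data are random stand-ins:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))  # 10 points in 2-D as a stand-in dataset

# linkage() starts with each point as its own cluster and merges iteratively;
# method="complete" is one choice of linkage function between clusters.
Z = linkage(X, method="complete", metric="euclidean")

# Cut the resulting dendrogram to obtain, say, 3 clusters.
print(fcluster(Z, t=3, criterion="maxclust"))
```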
application: Phylogenetic tree
https://en.wikipedia.org/wiki/Phylogenetic_tree
Nearest Neighbors Revisited: Clustering Around Centroids
Most popular: k-means.
k-means converges to a local optimum, so run it several times with different random initializations and use each clustering's distortion (sum of squared distances to the cluster centroids) to pick the best run.
How to determine k: experiment with different values of k and compare the resulting clusterings.
https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set
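A scikit-learn sketch: rerun k-means from several random initializations, keep the lowest-distortion run, and compare the distortion (inertia) across candidate values of k (synthetic blobs; the "elbow" heuristic is one way to choose):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(2, 7):
    # n_init=10 reruns k-means from 10 random initializations and keeps
    # the run with the lowest distortion (inertia_).
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))
# Look for the "elbow": the k after which inertia stops dropping sharply.
```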
Example: Clustering Business News Stories
data: Thomson Reuters Text Research Collection (TRC2)
- Collect the stories whose titles contain "apple".
- Use TFIDF scores to further filter the stories.
- Vectorize the stories and run k-means clustering (see the sketch after this list).
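The TRC2 corpus is not freely redistributable, so here is a hedged sketch of the same pipeline on a few invented headlines:

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in documents; the real input would be the TRC2 stories
# whose titles contain "apple".
docs = [
    "apple releases new iphone",
    "apple quarterly earnings beat estimates",
    "apple orchard harvest season begins",
    "apple pie recipe for the holidays",
]

X = TfidfVectorizer(stop_words="english").fit_transform(docs)  # TFIDF vectors
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels)  # e.g. company stories vs. fruit stories
```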
Using Supervised Learning to Generate Cluster Descriptions
- Two setups: treat the k clusters as a k-class problem, or learn one cluster versus the other k−1 clusters.
- Use a decision tree as the classifier.
- Read the cluster's description off the decision tree: the tests along the paths leading to that cluster.
Caveat: this describes only what differentiates the cluster from the others.
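A sketch of the one-cluster-vs-the-rest setup on synthetic data: use the cluster assignment as the decision tree's target and read the description off the tree (the attribute names are invented):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.tree import DecisionTreeClassifier, export_text

X, _ = make_blobs(n_samples=300, centers=3, n_features=4, random_state=0)
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# One cluster vs. the other k-1: the tree's rules describe what
# differentiates cluster 0 from the rest.
target = (clusters == 0).astype(int)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, target)
print(export_text(tree, feature_names=[f"attr{i}" for i in range(4)]))
```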
Data Science for Business Chapter 7 reading notes
key idea
High accuracy sometimes means little: with an unbalanced (skewed) class distribution, a classifier that always predicts the majority class can score high accuracy while being useless. Also beware of training on an artificially balanced population when the true population is unbalanced.
Evaluation Metrics
Confusion matrix
True positive rate = TP/(TP + FN); True negative rate = TN/(TN + FP)
Precision = TP/(TP + FP); Recall = TP/(TP + FN); F-measure = $2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$
Sensitivity = TP/(TP + FN) = True positive rate = Recall
Specificity = TN/(TN + FP) = True negative rate = 1 − False positive rate
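A small sketch computing these metrics from raw confusion-matrix counts (the counts are invented):

```python
# Invented confusion-matrix counts: rows = actual class, columns = predicted class.
TP, FN = 70, 30    # actual positives
FP, TN = 10, 890   # actual negatives

sensitivity = recall = TP / (TP + FN)   # true positive rate
specificity = TN / (TN + FP)            # true negative rate
precision = TP / (TP + FP)
f_measure = 2 * precision * recall / (precision + recall)
accuracy = (TP + TN) / (TP + FN + FP + TN)

print(f"recall={recall:.2f} precision={precision:.2f} "
      f"F1={f_measure:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
# Base rate check: always predicting "negative" would already be 90% accurate here.
```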
Expected Value
Expected value $= p(Y\mid p)\, p(p)\, b(Y,p) + p(N\mid p)\, p(p)\, b(N,p) + p(N\mid n)\, p(n)\, b(N,n) + p(Y\mid n)\, p(n)\, b(Y,n)$
Factoring out the class priors, and writing the (negative) benefits of the errors as costs $c(\cdot,\cdot)$:
$= p(p)\,\big[\,p(Y\mid p)\, b(Y,p) + p(N\mid p)\, c(N,p)\,\big] + p(n)\,\big[\,p(N\mid n)\, b(N,n) + p(Y\mid n)\, c(Y,n)\,\big]$
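A sketch that plugs invented rates, priors, benefits, and costs into the factored formula:

```python
# Invented numbers for illustration only.
p_p, p_n = 0.1, 0.9            # class priors: p(p), p(n)
p_Y_given_p = 0.7              # true positive rate, p(Y|p)
p_N_given_p = 1 - p_Y_given_p  # false negative rate, p(N|p)
p_N_given_n = 0.95             # true negative rate, p(N|n)
p_Y_given_n = 1 - p_N_given_n  # false positive rate, p(Y|n)

b_Y_p, b_N_n = 99, 0           # benefits of the correct decisions
c_N_p, c_Y_n = 0, -1           # costs of the errors (as negative benefits)

expected_value = (p_p * (p_Y_given_p * b_Y_p + p_N_given_p * c_N_p)
                  + p_n * (p_N_given_n * b_N_n + p_Y_given_n * c_Y_n))
print(expected_value)          # 0.1*0.7*99 + 0.9*0.05*(-1) = 6.885
```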
reference
Fawcett T, Provost F. Data Science for Business[M]. O’Reilly Media, 2013.