Building Machine Learning Systems with Python: chapter 3 reading notes

Clustering – Finding Related Posts

Main workflow

Extract the salient features from each post and store them as a vector per post.
Compute clustering on the vectors.
Determine the cluster for the post in question.
From this cluster, fetch a handful of posts that are different from the post in
question. This will increase diversity.
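
The four steps above can be sketched in plain Python. This is only an illustration under several assumptions that are not in the notes: features are raw word counts, the clustering is a tiny hand-written k-means seeded with the first k posts, and the names vectorize, kmeans, and related_posts are hypothetical. It also returns the other members of the cluster as they come, without the extra diversity selection mentioned in the last step.

```python
from collections import Counter


def vectorize(posts):
    """Step 1: turn each post into a word-count vector over a shared vocabulary."""
    vocab = sorted({word for post in posts for word in post.lower().split()})
    vectors = []
    for post in posts:
        counts = Counter(post.lower().split())
        vectors.append([counts[word] for word in vocab])
    return vectors


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, iterations=10):
    """Step 2: minimal k-means. Assumption: the first k vectors seed the centroids."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assign every vector to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in vectors
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(vectors, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def related_posts(query_index, posts, k=2, limit=3):
    """Steps 3 and 4: find the query's cluster and return other posts from it."""
    labels = kmeans(vectorize(posts), k)
    cluster = labels[query_index]
    others = [i for i, label in enumerate(labels)
              if label == cluster and i != query_index]
    return others[:limit]


posts = [
    "hard disk format problems",
    "baking bread with yeast",
    "format a hard disk",
    "bread baking tips",
]
print(related_posts(0, posts))
```

The expensive clustering step runs once over all posts. Looking up related posts afterwards only needs the cluster label of the post in question.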

Preprocessing

code

The naive approach would be to take the post, calculate its similarity to all other
posts, and display the top N most similar posts as links on the page. This quickly
becomes very costly, because every lookup has to compare the post against all other posts.
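
A minimal sketch of the naive approach makes the cost visible. It assumes bag-of-words vectors and cosine similarity as the similarity measure (the notes do not name one), and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Count lowercase, whitespace-separated words."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two word-count mappings (0.0 if either is empty)."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(query_index, posts, top_n=3):
    """Compare the query post with every other post and return the top N indices."""
    vectors = [bag_of_words(post) for post in posts]
    query = vectors[query_index]
    scored = [
        (cosine_similarity(query, vector), index)
        for index, vector in enumerate(vectors)
        if index != query_index
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [index for _, index in scored[:top_n]]


posts = [
    "how to format a hard disk",
    "hard disk format problems",
    "baking bread at home",
    "python machine learning",
]
print(most_similar(0, posts, top_n=1))
```

Each call scans the whole collection, so answering the question for every post takes a number of comparisons that grows quadratically with the number of posts.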

Levenshtein distance (edit distance)
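
The Levenshtein distance between two strings is the minimum number of single-character insertions, deletions, and substitutions needed to turn one into the other. The following is a standard dynamic-programming sketch; the function name levenshtein is chosen here for illustration and is not taken from the notes.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    if len(a) < len(b):
        a, b = b, a  # keep the row buffer as short as possible
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


print(levenshtein("machine", "mchiene"))
```

The running time is proportional to the product of the two string lengths, which is one reason a character-level edit distance is a poor fit for comparing whole posts.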