NLP-based Course Clustering and Recommendation

Data Collection

Information on 7,004 UC Berkeley courses was collected; each record has the form shown in Table 1. A course is included only if it has a department, department code, course number, title, and description.

Data Preprocessing

Lemmatization is performed with the morphy() function in NLTK. The Porter stemmer is an alternative, but it strips away too many features. Table 2 shows the documents after preprocessing.
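
As a rough illustration of this step, the sketch below tokenizes a description and lemmatizes each token with WordNet's `morphy()`. The regex tokenizer and the names `wordnet_lemma` and `preprocess` are assumptions made for this example, not details from the paper; the NLTK call also needs the WordNet corpus to be downloaded first.

```python
import re


def wordnet_lemma(token):
    """Lemmatize one token with WordNet's morphy(); keep the token if no lemma is found."""
    # Imported lazily: needs `pip install nltk` and `nltk.download("wordnet")`.
    from nltk.corpus import wordnet
    return wordnet.morphy(token) or token


def preprocess(text, lemmatize=wordnet_lemma):
    """Lowercase, split into alphabetic tokens (an assumed tokenizer), then lemmatize."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [lemmatize(t) for t in tokens]


# Example call (requires the WordNet data to be installed):
# preprocess("Introduction to statistical learning methods")
```

Passing the lemmatizer as a parameter makes it easy to swap in a stemmer such as `nltk.stem.PorterStemmer().stem` and compare how many features each choice keeps.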

Features and Models

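The tf-idf algorithm is used to compute a score for each keyword; titles are scored as well. The higher the score, the more relevant the keyword is to the document.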
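
These notes do not state which tf-idf weighting is used. For orientation, one common definition (an assumption here, not necessarily the paper's variant) scores a term $t$ in a document $d$ from a collection $D$ as:

$$
\mathrm{tfidf}(t, d) = \mathrm{tf}(t, d) \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}
$$

where $\mathrm{tf}(t, d)$ is the relative frequency of $t$ in $d$. A term that occurs in every document gets a score of 0, while a term concentrated in a few documents scores high.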
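Next, the documents are vectorized. Take the n highest-scoring keywords from each document, and suppose these give N distinct keywords in total. Each document is then represented as a vector $[e_{1},e_{2},\ldots,e_{N}]$, where $e_{i}=1$ if the i-th keyword is among that document's top keywords and $e_{i}=0$ otherwise.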
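
A minimal sketch of this vectorization, assuming a standard tf-idf weighting (relative term frequency times log inverse document frequency) and alphabetical tie-breaking; the function names and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """docs: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({t: (c / len(doc)) * math.log(n_docs / df[t])
                       for t, c in tf.items()})
    return scores


def binary_vectors(docs, n):
    """Keep each document's n highest-scoring terms and encode them as 0/1 vectors."""
    top = []
    for s in tfidf_scores(docs):
        ranked = sorted(s, key=lambda t: (-s[t], t))  # ties broken alphabetically
        top.append(set(ranked[:n]))
    vocab = sorted(set().union(*top))  # the N distinct keywords
    vectors = [[1 if term in kept else 0 for term in vocab] for kept in top]
    return vocab, vectors


docs = [
    ["machine", "learning", "statistics", "course"],
    ["organic", "chemistry", "laboratory", "course"],
    ["statistics", "probability", "course"],
]
vocab, vectors = binary_vectors(docs, n=2)
print(vocab)    # the shared keyword list
print(vectors)  # one 0/1 row per document
```

Note that "course" appears in every toy document, so its score is 0 and it never enters the vocabulary.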
Finally, K-means is used to choose the initial values, and the EM algorithm produces the final clustering.
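
The notes do not say which mixture model EM fits. The sketch below assumes a mixture of Bernoulli distributions, a natural choice for 0/1 vectors, and initializes it from the hard assignments of a plain K-means run. All names, the smoothing constant `eps`, and the iteration counts are assumptions for illustration.

```python
import math
import random


def kmeans(X, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns a cluster label for every row of X."""
    rng = random.Random(seed)
    centers = [list(x) for x in rng.sample(X, k)]
    labels = [0] * len(X)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centers[j])))
            for x in X
        ]
        for j in range(k):
            members = [x for x, lab in zip(X, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def log_lik(x, mu_j):
    """Log-likelihood of binary vector x under independent Bernoulli parameters mu_j."""
    return sum(math.log(p if xi else 1.0 - p) for xi, p in zip(x, mu_j))


def em_bernoulli(X, k, iters=30, seed=0, eps=1e-3):
    """EM for a Bernoulli mixture (assumed model), initialized from k-means labels."""
    n, d = len(X), len(X[0])
    labels = kmeans(X, k, seed=seed)
    # Hard k-means assignments serve as the initial responsibilities.
    resp = [[1.0 if labels[i] == j else 0.0 for j in range(k)] for i in range(n)]
    for _ in range(iters):
        # M-step: mixing weights and feature probabilities, smoothed away from 0 and 1.
        weight = [sum(r[j] for r in resp) for j in range(k)]
        pi = [(w + eps) / (n + k * eps) for w in weight]
        mu = [[(sum(resp[i][j] * X[i][f] for i in range(n)) + eps)
               / (weight[j] + 2 * eps) for f in range(d)] for j in range(k)]
        # E-step: posterior cluster probabilities, computed in log space.
        for i, x in enumerate(X):
            logp = [math.log(pi[j]) + log_lik(x, mu[j]) for j in range(k)]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            z = sum(p)
            resp[i] = [v / z for v in p]
    return [max(range(k), key=lambda j: r[j]) for r in resp]


X = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
clusters = em_bernoulli(X, k=2)
print(clusters)  # cluster index per row; the numbering itself is arbitrary
```

Starting EM from K-means assignments rather than random parameters tends to give more stable results, since EM only converges to a local optimum.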

References

Suzuki, Kentaro, and Hyunwoo Park. “NLP-based Course Clustering and Recommendation.” (2009).
TF-IDF: http://www.ruanyifeng.com/blog/2013/03/tf-idf.html
Improved TF-IDF weighting formula: http://www.52nlp.cn/forgetnlp4