Related document discovery, without algebra
You have heard about latent semantic analysis (if you haven't, take a look at
this nice article on an SVD recommendation system
written in 50 lines of Ruby). And you told yourself "hey, this is cool", and
promptly filed it away in your head. Or maybe you tried to actually use it, but
were scared off by the algebra 101 part, or got lazy when you realized you
needed to compile LAPACK, GSL or some other numerical library*1.
But you can get pretty far even without dimensionality reduction.
If the feature space (e.g. the terms/concepts associated with your documents) is
small enough, and you make sure synonymy is not a problem, you can do without
algebra. One such case is that of your blog postings and their tags.
LSI is about reducing the dimensionality of a sparse term-document matrix,
mitigating synonymy (different terms referring to the same idea) and polysemy
(a word having multiple meanings). A program would do it using singular value
decomposition, but you're also performing dimensionality reduction each time
you tag your articles, mapping the text to a small number of keywords.
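
Concretely, the tagged representation can be as simple as a map from post
names to tag lists; the post names and tags in this example are made up:

    # Each post, instead of a vector over the full term vocabulary, is
    # reduced by hand to a few tags: manual dimensionality reduction.
    POSTS = {
      "rcov-0.6.0"          => %w[ruby rcov testing coverage],
      "ruby-quiz-sudoku"    => %w[ruby algorithms sudoku],
      "svd-recommendations" => %w[ruby algorithms lsi svd],
    }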
This means that you can use your tags to compute the cosine similarity between
your posts, and find related pages in a dozen lines of code. The code that
creates the "see also" box on eigenclass.org's pages looks essentially like
the sketch below.