Word Cloud

What we did
We used ElasticSearch to collect all words in the returned documents, because this is faster than doing it with a PHP function. We then used a JavaScript library to display a tag cloud that looks attractive and is clear to the user. We searched online to find out which filtering algorithm works better: TF-IDF or Mutual Information (MI). Because we couldn't find a conclusive answer, we decided to implement both. This took a lot of time, but in the end we had working TF-IDF and MI filtering. The user can choose which one to apply; because our tests showed that TF-IDF works best, it is the default.
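
To make the two weighting schemes concrete, here is a minimal sketch in Python of how they can be computed. It is not our actual code: the function names, the input format (token lists and pre-computed document frequencies) and the use of the standard textbook definitions of TF-IDF and MI are assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_scores(result_docs, doc_freq, n_docs):
    """Score each term in the result set by tf * idf.

    result_docs: list of token lists (the returned documents)
    doc_freq:    term -> number of documents in the whole collection
                 that contain the term (hypothetical, pre-computed)
    n_docs:      total number of documents in the collection
    """
    tf = Counter(tok for doc in result_docs for tok in doc)
    return {t: count * math.log(n_docs / doc_freq[t]) for t, count in tf.items()}


def mutual_information(n11, n10, n01, n00):
    """MI between 'term occurs' and 'document is in the result set'.

    n11: result documents containing the term
    n10: other documents containing the term
    n01: result documents without the term
    n00: other documents without the term
    """
    n = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    total = 0.0
    for joint, term_margin, class_margin in cells:
        if joint > 0:  # an empty cell contributes nothing to the sum
            total += (joint / n) * math.log2(n * joint / (term_margin * class_margin))
    return total
```

A term that occurs in every document of the collection gets a TF-IDF score of zero, and a term that is spread evenly over result and non-result documents gets an MI of zero, so both schemes push such "meaningless" words to the bottom of the ranking.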

We show 25 to 50 words for general queries. If a query gets very specific (i.e. when the user searches in multiple fields), this number drops, since often only a few documents are returned in those cases. It would make no sense to pad the cloud with irrelevant words just for the sake of showing more than 25.
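
One simple way to get this behaviour is to rank the terms by score, drop everything at or below a threshold, and cap the list at 50. The sketch below illustrates that idea only; the function name `select_tags`, the threshold of 0.0 and the exact cap are assumptions, not the values from our implementation.

```python
def select_tags(scores, max_tags=50, min_score=0.0):
    """Return the highest-scoring terms, at most max_tags of them.

    scores: term -> weight (e.g. from TF-IDF or MI)
    Terms with a score at or below min_score are dropped, so a small
    result set naturally yields a smaller cloud instead of being padded.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [term for term, score in ranked if score > min_score][:max_tags]
```

With this approach the size of the cloud follows from the data: a broad query yields many terms above the threshold and is cut off at the cap, while a narrow query simply yields fewer tags.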

What works well
First of all, the tag cloud itself looks nice, gives accurate results and is easy to understand. We're also happy that we decided to implement both algorithms, since this allowed us to compare them properly on the specific data set we're using. Letting the user switch to the other algorithm with one mouse click is also valuable, since it allows users to experiment with the data. Finally, we've added links to the tags, so a tag can easily be added to the query, which makes the tag cloud more useful.
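
The tag links can be thought of as ordinary search URLs in which the clicked tag is appended to the current query. The following sketch shows the idea; the parameter names `q` and `weighting`, the function name and the base URL passed in are hypothetical and do not necessarily match our application.

```python
from urllib.parse import urlencode


def tag_link(base_url, query, tag, weighting="tfidf"):
    """Build a link that re-runs the search with the tag added to the query.

    The 'q' and 'weighting' parameter names are assumptions for this sketch.
    """
    refined = f"{query} {tag}".strip()
    return f"{base_url}?{urlencode({'q': refined, 'weighting': weighting})}"
```

Carrying the chosen weighting along in the link means that a user who switched to MI stays on MI while refining the query.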

What has to be improved
Although our implementations of the algorithms are correct, a few "meaningless" words still occasionally end up in the word cloud. To prevent this, we would have to apply additional filtering algorithms. However, this would drastically increase loading times, which seemed undesirable to us.

Evaluation of quality
Our tag cloud not only looks nice and works well, but it also manages to filter out most of the "meaningless" words. A few of them still appear from time to time, but given the performance cost we believe it would not be wise to apply even more algorithms. We therefore think we've made a good trade-off between quality and speed.