Changing the Ranking

What we did
Initially, we changed the scoring mechanism to BM25 similarity, using the default settings for k1, b, and discount_overlaps, and repeated the judge relevance scoring process for the five queries. Because we did not want to experiment with just one alternative scoring mechanism, we also tried language model scoring with Jelinek-Mercer smoothing and repeated the relevance scoring process once more. Finally, we calculated Cohen's kappa and P@10 for each information need, averaged these values over the five information needs, and compared the results.
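
The parameter names k1, b, and discount_overlaps suggest a Lucene-based setup; assuming that, swapping the scoring mechanism between runs amounts to setting a different Similarity on the searcher. The sketch below is illustrative only: the index path and the Jelinek-Mercer lambda are placeholders, since the write-up does not state them.

```java
import java.nio.file.Paths;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.similarities.BM25Similarity;
import org.apache.lucene.search.similarities.LMJelinekMercerSimilarity;
import org.apache.lucene.store.FSDirectory;

public class SimilaritySwap {
    public static void main(String[] args) throws Exception {
        // "index" is a placeholder path to the existing index.
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);

            // Run 1: BM25 with Lucene's defaults (k1 = 1.2, b = 0.75, discountOverlaps = true).
            searcher.setSimilarity(new BM25Similarity());
            // ... run the five queries, collect the top 10, have both judges score them ...

            // Run 2: language model scoring with Jelinek-Mercer smoothing.
            // 0.7f is an illustrative lambda; the write-up does not state the value used.
            searcher.setSimilarity(new LMJelinekMercerSimilarity(0.7f));
            // ... repeat the queries and the relevance judging ...
        }
    }
}
```

For fully consistent length normalization, the same Similarity would normally also be set on the IndexWriterConfig at index time.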

What we discovered
The plots below show P@10 for both BM25 and language model scoring. They are based on the relevance scores assigned by the two judges (available on the Relevance Scores page).
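
Both evaluation measures are straightforward to compute from these judgments. Here is a minimal sketch, assuming each judge assigned a binary relevant/not-relevant label to every returned document (the class, method, and array names are ours, for illustration):

```java
/** Sketch of the two evaluation measures, for binary relevance labels. */
public class Measures {

    /** Cohen's kappa for two judges: (p_o - p_e) / (1 - p_e). */
    static double cohensKappa(boolean[] judgeA, boolean[] judgeB) {
        int n = judgeA.length;
        int agree = 0, aYes = 0, bYes = 0;
        for (int i = 0; i < n; i++) {
            if (judgeA[i] == judgeB[i]) agree++;
            if (judgeA[i]) aYes++;
            if (judgeB[i]) bYes++;
        }
        double observed = (double) agree / n;                          // p_o: raw agreement
        double expected = ((double) aYes * bYes                        // p_e: agreement
                + (double) (n - aYes) * (n - bYes)) / ((double) n * n); // expected by chance
        return (observed - expected) / (1.0 - expected);
    }

    /** Precision at cutoff k: the fraction of the top k results judged relevant. */
    static double precisionAtK(boolean[] relevantByRank, int k) {
        int hits = 0;
        for (int i = 0; i < k && i < relevantByRank.length; i++) {
            if (relevantByRank[i]) hits++;
        }
        return (double) hits / k;
    }
}
```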

The table below compares the default scoring method (TF-IDF), BM25, and the language model with Jelinek-Mercer smoothing.

Explaining the differences
The P@10 for TF-IDF and BM25 is roughly the same, and our analysis shows that most of the returned documents are identical. This is not surprising: BM25 is itself built on a form of TF-IDF weighting, so large differences are unlikely, especially on a small dataset like ours. The language modeling approach uses a fundamentally different scoring mechanism, which explains why its P@10 does differ. The difference is not large, however, and this matches Ponte & Croft's experiments as discussed by Manning et al.: at low recall levels, TF-IDF and language models perform roughly the same, and their performance only diverges at high recall levels. Since P@10 is measured at a low recall level, a relatively small difference is exactly what we would expect.
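
To make the first point concrete, here is one common textbook formulation of the two term weights (exact variants differ per system, so take this as illustrative). Writing tf_{t,d} for the term frequency, df_t for the document frequency, N for the collection size, L_d for the document length, and L_avg for the average document length, the idf factor that TF-IDF is built on reappears unchanged inside BM25:

```latex
% TF-IDF term weight (one common variant):
w^{\text{tfidf}}_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{N}{\mathrm{df}_t}

% BM25 term weight: the same idf factor, with term frequency
% saturated by k_1 and normalized for document length by b:
w^{\text{bm25}}_{t,d} = \log \frac{N}{\mathrm{df}_t} \cdot
    \frac{(k_1 + 1)\,\mathrm{tf}_{t,d}}
         {k_1 \bigl((1 - b) + b \cdot L_d / L_{\text{avg}}\bigr) + \mathrm{tf}_{t,d}}
```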

Our conclusions
The table above shows that the language model performs worst, which matches the judges' experience while assigning relevance scores. They pointed out that the first six or so results were often roughly the same as with the default scoring method (although in a different order), and that only the last four or so results differed, these being more often irrelevant than under default scoring. The judges also noticed that while default and LM scoring often returned roughly the same documents for easy queries, LM scoring returned noticeably more irrelevant documents for hard queries.

The table also shows that the default scoring method and BM25 achieve the same P@10. Even so, we have chosen the default scoring method for our search engine, for a simple reason: although the top-10 documents for default and BM25 were often roughly the same, their order differed, and under the default scoring method the relevant documents were usually ranked higher. A metric with a stricter cutoff, such as P@5, would therefore favor the default scoring method significantly, as the sketch below illustrates.
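
A small hypothetical example of this effect (the rankings below are invented to illustrate the arithmetic, not taken from our runs): two top-10 lists containing the same five relevant documents at opposite ends tie on P@10 but are maximally separated on P@5.

```java
/** Hypothetical rankings: same five relevant documents, opposite order. */
public class CutoffDemo {
    static double precisionAt(boolean[] relevantByRank, int k) {
        int hits = 0;
        for (int i = 0; i < k; i++) if (relevantByRank[i]) hits++;
        return (double) hits / k;
    }

    public static void main(String[] args) {
        boolean[] relevantFirst = { true, true, true, true, true,
                                    false, false, false, false, false };
        boolean[] relevantLast  = { false, false, false, false, false,
                                    true, true, true, true, true };
        // P@10 ties at 0.5 for both rankings...
        System.out.println(precisionAt(relevantFirst, 10) + " vs " + precisionAt(relevantLast, 10));
        // ...but P@5 separates them completely: 1.0 vs 0.0.
        System.out.println(precisionAt(relevantFirst, 5) + " vs " + precisionAt(relevantLast, 5));
    }
}
```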

After assigning 300 different relevance scores, we conclude that the default scoring method (with TF-IDF scoring) works best for our dataset.

Please consult the Reflection for our final thoughts.