Evaluation of Results

What we did
We created five different information needs and expressed each of them as a query. We also wrote judgment guidelines for each information need. Two judges then assigned relevance judgments to each of the top 10 documents returned for each query. After that, we calculated the average precision for each information need, and the average of these averages. Finally, we drew conclusions from these numbers.

Our scoring mechanism
We ranked documents with Elasticsearch's default scoring mechanism, which is based on the TF-IDF model. On top of that, we applied several text-analysis steps to make the returned results as relevant as possible: a Snowball stemmer to increase recall, lowercasing of all tokens, removal of Dutch stop words, and ASCII folding (which is indispensable for a search engine over Dutch files, since Dutch text often contains many non-ASCII characters such as accented vowels).
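
As an illustration, the sketch below shows how such an analysis chain could be configured with the elasticsearch-py client. The index name, field name and cluster address are hypothetical, and the exact request format depends on the client version; our actual configuration may differ in detail.

<pre>
from elasticsearch import Elasticsearch

# Hypothetical local cluster and index name; the analysis chain mirrors the
# steps described above: lowercasing, ASCII folding, Dutch stop word removal
# and Snowball stemming.
es = Elasticsearch("http://localhost:9200")

analysis = {
    "filter": {
        "dutch_stop": {"type": "stop", "stopwords": "_dutch_"},
        "dutch_snowball": {"type": "snowball", "language": "Dutch"},
    },
    "analyzer": {
        "dutch_text": {
            "type": "custom",
            "tokenizer": "standard",
            "filter": ["lowercase", "asciifolding", "dutch_stop", "dutch_snowball"],
        }
    },
}

es.indices.create(
    index="documents",
    body={
        "settings": {"analysis": analysis},
        "mappings": {
            "properties": {"content": {"type": "text", "analyzer": "dutch_text"}}
        },
    },
)
</pre>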

What we discovered
Please note: the individual information needs we created can be found on the Information Needs page. Consult the Relevance Scores page for an overview of the relevance judgments the judges assigned (and the matching statistics).

Average statistics

 * Average P@10 for judge 1: 0,6.
 * Average P@10 for judge 2: 0,54.
 * Average P@10 when counting a document as relevant only if both judges agree: 0,5.
 * Average P@10 when counting a document as relevant if at least one judge judged it relevant: 0,64 (see the sketch below).
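
P@10 is simply the fraction of the top 10 results that were judged relevant, averaged over the five queries. The sketch below shows how the figures above can be computed; the judgment lists are made up for illustration and are not our actual data.

<pre>
# P@10 = fraction of the top k (here 10) results judged relevant.
def precision_at_k(judgments, k=10):
    return sum(judgments[:k]) / k

# Made-up judgments for one query (1 = relevant, 0 = not relevant),
# one list per judge, in ranked order.
judge_1 = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]
judge_2 = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(precision_at_k(judge_1))  # 0.6
print(precision_at_k(judge_2))  # 0.5

# "Strict" P@10: a document only counts as relevant if both judges agree.
both = [a and b for a, b in zip(judge_1, judge_2)]
print(precision_at_k(both))     # 0.5

# "Lenient" P@10: a document counts if at least one judge marked it relevant.
either = [a or b for a, b in zip(judge_1, judge_2)]
print(precision_at_k(either))   # 0.6

# The averages reported above repeat this per query and take the mean
# over the five queries.
</pre>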

Solving differences in judgments
If the judges disagreed on a document, they discussed the difference and each gave arguments for their decision. This way they were sometimes able to reach an agreement. If they could not convince each other, they kept their differing relevance judgments. These debates sometimes proved very fruitful, because they sparked interesting discussions about the interpretation of relevance and the guidelines.

Our conclusions
The data above shows that query 2 and query 3 were clearly the hardest in terms of precision. For query 2, this difficulty is also reflected in the relevance judgments: judge 1 and judge 2 had trouble agreeing on the judgments for this query. On query 3, however, they agreed on every judgment! There is a good explanation for this: because query 3 was so specific, we took a lot of time to write clear and specific judgment guidelines. Thanks to these guidelines it was easy for the judges to assign relevance judgments to the documents, even though the query itself was very hard for the search engine and resulted in a low precision.

'''Our precision hovers around 0,6. We are very happy with this result, since most of the information needs we defined are very specific (and therefore hard) ones.'''

Detailed relevance scores
Please note that this page only shows our conclusions. Visit the Relevance Scores page for a detailed overview of all relevance scores and statistics. The Information Needs page lists the five information needs we created.