Frequency Table

What we did
Because we did an alternative assignment and our XML files don't contain subjects, we initially saw no way to create a frequency table. However, we didn't want to skip this aspect, so we kept searching and eventually found that it was possible to (roughly) parse the senders from each document. We assume the sender is on the last line of the document (which is usually the case), split this line on commas (for when there are multiple senders), and finally apply a series of algorithms that try to remove the sender's function (e.g. 'burgemeester') so that only the senders' actual names remain. To avoid confusing the reader and cluttering the design (which goes against what Hearst recommends), we display only the ten most frequent senders.
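The parsing steps described above can be illustrated with a minimal sketch. It is not our actual implementation: the function name `parse_senders` and the list `FUNCTION_TITLES` are hypothetical, and the list only contains a few example titles.

```python
import re

# Hypothetical examples of function titles to strip; the list used in the
# project is not reproduced here.
FUNCTION_TITLES = ("burgemeester", "secretaris", "wethouder")


def parse_senders(document_text):
    """Return the sender names found on the last non-empty line of a document."""
    lines = [line.strip() for line in document_text.splitlines() if line.strip()]
    if not lines:
        return []

    senders = []
    # Assumption: the last non-empty line holds the sender(s), separated by commas.
    for part in lines[-1].split(","):
        # Remove every known function title (whole words, case-insensitive).
        for title in FUNCTION_TITLES:
            part = re.sub(r"\b" + re.escape(title) + r"\b", "", part, flags=re.IGNORECASE)
        # Collapse leftover whitespace; skip parts that held only a title.
        name = " ".join(part.split())
        if name:
            senders.append(name)
    return senders
```

For a document ending in the line `J. Jansen burgemeester, P. de Vries secretaris`, this sketch would yield the two names without their functions.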

What works well
Thanks to the algorithms we use, the senders' names are extracted surprisingly well. Their function titles are also removed reliably, so you end up with just the names. Another thing that works well is the frequency count when a letter has more than one sender: both sender names are extracted, and the frequency of each is increased.
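The counting behaviour for letters with several senders can be sketched as follows. This is an illustration under assumptions, not our exact code: `top_senders` is a hypothetical helper, and its input is assumed to be one list of already-parsed sender names per document.

```python
from collections import Counter


def top_senders(senders_per_document, limit=10):
    """Count how often each sender occurs and return the most frequent ones."""
    counts = Counter()
    for senders in senders_per_document:
        # Every sender of a letter is counted, so a letter with two
        # senders increases two frequencies.
        counts.update(senders)
    # most_common returns (name, count) pairs, highest count first.
    return counts.most_common(limit)
```

The default `limit=10` mirrors the choice to show only the ten most frequent senders in the table.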

What has to be improved
The assumption that the sender's name is on the last line of the document isn't always valid. In a small number of documents, an attachment ("Bijlage") is included at the end, which makes the last line of the document a line of ordinary text. In those cases we weren't able to parse the sender. However, we were able to filter these documents out, so the table doesn't display attachment text as senders.
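One possible way to filter out such documents is a simple heuristic on the last line. The sketch below is only an assumption about how this could be done: the function `looks_like_sender_line`, the word limit, and the punctuation check are hypothetical and would need tuning on the real data.

```python
def looks_like_sender_line(document_text, max_words=12):
    """Guess whether the last non-empty line is a sender line rather than body text."""
    lines = [line.strip() for line in document_text.splitlines() if line.strip()]
    if not lines:
        return False
    last_line = lines[-1]
    # Assumption: a sender line is short and does not end like a sentence,
    # whereas the last line of an attachment is usually running text.
    is_short = len(last_line.split()) <= max_words
    ends_like_sentence = last_line.endswith((".", "!", "?", ":"))
    return is_short and not ends_like_sentence
```

Documents for which this check returns `False` would simply be left out of the frequency count.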

Evaluation of quality
The frequency table could use further improvement; if we had more time, we would have created algorithms to better locate the sender. It should be noted, however, that this information is not provided by the XML file and therefore has to be parsed dynamically, so creating this table was much harder than it would have been with a 'subject' tag in the XML file. Given these challenges, the frequency table works quite well and is accurate in most cases. We also provide additional faceted search in the form of a date range and a clickable bar chart.