Indexing the Collection

The Goggles indexer
We've created an indexer that parses letters written by the Municipality of Amsterdam. The indexer recognizes the structure of the document and automatically determines which sentences are part of the introduction, which are part of a question, and which are part of an answer.

For each document, the following details are parsed:
 * Title
 * Date
 * URL to PDF version of the letter
 * Sentences in the introduction
 * Sentences in questions
 * Sentences in answers
 * Entities (URLs to Wikipedia pages)
 * The sender of the letter

Originally, the indexer also parsed each word as a separate object, containing the raw term, lemma, class (noun, numeral, etc.) and type (intro, question, answer) of the word. We removed this feature because it drastically increased indexing time and added little value: all words were already stored in their sentence, and the lemma was not needed because we used stemming (as will be explained later).

Challenges when indexing the collection
Creating the indexer for the Schriftelijke Vragen collection was very challenging due to the irregularities in the XML files. We wanted to create an indexer that separately parses the introduction of the letter, the questions, and the answers. This required an algorithm that automatically identified the type of each sentence based on its position in the document. This task was complicated for several reasons.

Recognizing sentence types
The XML file indicated sentences, but not the type of these sentences. Therefore, the parser had to determine whether a sentence was part of the introduction, a question, or an answer. This was very difficult, but we managed to create an algorithm that does this reliably. The algorithm looks for certain words and word combinations that indicate the start of a question or an answer.
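In broad strokes, such a classifier can be sketched as a small state machine that walks through the sentences in order and switches state whenever an indicator word announces a question or an answer. The marker lists below are illustrative examples, not the actual set our indexer uses:

```python
# Sketch of the sentence-type classifier: every letter starts in the
# "intro" state, and an indicator word switches the state to "question"
# or "answer" for all following sentences until the next indicator.
# The marker tuples are illustrative, not the real lists.
QUESTION_MARKERS = ("vraag", "vragen wij")
ANSWER_MARKERS = ("antwoord", "beantwoording")

def classify_sentences(sentences):
    state = "intro"
    labeled = []
    for sentence in sentences:
        lowered = sentence.lower()
        # Check answer markers first: "Antwoord op vraag 1" contains
        # "vraag" too, but should flip the state to "answer".
        if any(marker in lowered for marker in ANSWER_MARKERS):
            state = "answer"
        elif any(marker in lowered for marker in QUESTION_MARKERS):
            state = "question"
        labeled.append((state, sentence))
    return labeled
```

Because the state persists until the next indicator, every sentence between two markers inherits the type of the preceding marker, which matches how the letters are laid out.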

Differing file types
Not all files in the collection were actually letters with questions and answers. In the collection, we also found simple letters (without questions and answers), attachments, etc. The screenshot to the right shows four examples. The indexer had to recognize these files and discard them, since our search engine is created for letters that contain both questions and answers.

Differing file structures
The files were structured in different ways. Some contained a question, followed by its answer, followed by another question, then another answer, and so on. Others started with a list of all questions and then gave all answers. There were also files with subquestions and subanswers. The indexer had to recognize the structure of the letter and parse the questions and answers accordingly.
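One cheap way to tell the two main structures apart is to look at the order of the sentence types: if a question indicator still appears after the first answer indicator, the letter interleaves questions and answers; otherwise all questions come first. A rough sketch of this heuristic (the label names are illustrative):

```python
def detect_structure(labels):
    """Guess the letter structure from the ordered sentence labels.

    Returns "interleaved" when questions and answers alternate, and
    "grouped" when all questions precede all answers. Illustrative
    heuristic, not the full structure detection.
    """
    qa = [label for label in labels if label in ("question", "answer")]
    if "answer" not in qa:
        return "grouped"
    first_answer = qa.index("answer")
    # A question after the first answer means the letter alternates.
    if "question" in qa[first_answer:]:
        return "interleaved"
    return "grouped"
```

Once the structure is known, the indexer can pair each answer with the right question, either by position (interleaved) or by matching question numbers (grouped).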

Inconsistencies in the layout
The only way to make the indexer recognize the structure of a document was by using indicator words. However, the letters used very inconsistent layouts. For example, sometimes an answer is announced by "Antwoord op vraag", sometimes an answer starts simply with "Antwoord", but "Beantwoording" or "Ter beantwoording van de vragen" is also used. There are literally dozens of variations in the collection, both for indicating questions and for indicating answers. The indexer had to be programmed to recognize all these variations, so it wouldn't incorrectly classify questions as answers (and vice versa).
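In practice this means matching a whole family of phrasings with one pattern. A hedged sketch, covering only a few of the dozens of real variations:

```python
import re

# A few of the many phrasings that announce an answer; the real indexer
# recognizes dozens of variations. Anchored at the start of the sentence
# so that a mid-sentence mention of "antwoord" doesn't trigger it.
ANSWER_PATTERN = re.compile(
    r"^\s*(antwoord(\s+op\s+(de\s+)?vra(ag|gen))?"
    r"|(ter\s+)?beantwoording(\s+van\s+de\s+vragen)?)\b",
    re.IGNORECASE,
)

def is_answer_indicator(sentence):
    return ANSWER_PATTERN.match(sentence) is not None
```

The trailing `\b` keeps the pattern from firing on longer words, so a sentence starting with, say, "antwoorden" in running prose is not misread as an answer header.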

Extracting the header
The XML files also included the page header ("koptekst") of the documents. This caused a lot of problems. Not only did this header have to be removed (since it contains no useful information for the user), it also introduced problems in the XML file. When a sentence ended on one page and continued on the next, the parser had to remove the header (it does this with a regular expression), recognize that the sentence continued, and combine the two parts into one sentence. If a sentence ended on one page and a new sentence started on the next, the XML file would consider these to be one sentence, so the parser had to remove the header and split the text into two separate sentences.
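A hedged sketch of that logic, with a hypothetical header pattern (the real "koptekst" looks different): if the text before the header does not end in sentence-final punctuation, the sentence continues on the next page and the halves are joined; otherwise they are kept as separate sentences.

```python
import re

# Hypothetical header pattern; the layout of the real "koptekst" differs.
HEADER_RE = re.compile(
    r"Gemeente Amsterdam\s+Schriftelijke Vragen\s+pagina \d+",
    re.IGNORECASE,
)

def strip_header(text):
    """Remove a page header that interrupts the text.

    If the part before the header lacks sentence-final punctuation, the
    sentence continues on the next page, so the halves are joined.
    Otherwise they were two distinct sentences and are kept separate.
    """
    match = HEADER_RE.search(text)
    if not match:
        return [text.strip()]
    before = text[:match.start()].rstrip()
    after = text[match.end():].lstrip()
    if before and before[-1] not in ".!?":
        return [before + " " + after]
    return [before, after]
```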

Errors in the XML files
Although the XML files helped a lot, they contained quite a few errors. For example, sometimes two sentences were parsed as one, and in quite a few cases the date was incorrectly parsed (or not parsed at all). In cases like these, the parser had to fix the errors (e.g. by splitting the sentence or correcting the date). Sometimes files were not parsed at all, meaning the XML files were virtually empty except for the first few headers. These files had to be discarded from indexing.
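One of those repairs, splitting two sentences that were glued together, can be sketched as follows. The heuristic is illustrative (the real repairs handled more cases, such as abbreviations):

```python
import re

# Split where sentence-final punctuation is followed by whitespace and a
# capital letter. Lookbehind/lookahead keep both characters in the output.
SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def repair_glued_sentences(text):
    return SPLIT_RE.split(text)
```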

Next step: the SERP
After (literally) days of manually inspecting all files and improving the indexing algorithm, we arrived at an algorithm that correctly parses all documents. Now we were ready to create the SERP.