Finnish Word Forms - Collector

So, this month's project is:

In February I want a solid, running tool chain that generates Finnish word forms as a dataset. It should also recognize the most urgently needed missing word forms, so that once the editor is in place, the editor UI can prompt the user to add them.

Presented as a flowchart, this looks at a high level like this:

Populate Database from Backup
Read a File
Sanitize Input
Run Voikko
Store to Database
Generate Metrics
Send Report
Backup Database
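As a sketch, the steps above could map onto a simple command line tool where each step is just a function run in order. Everything here is a placeholder of my own (function names, the example input), not a final design:

```python
# Hypothetical skeleton of the collector pipeline: each step is a plain
# function that reads and writes a shared context dict, and the tool
# simply runs the steps in order.

def run_pipeline(steps, context):
    """Run each pipeline step in order, passing a shared context dict."""
    for step in steps:
        step(context)
    return context

def read_file(context):
    # Placeholder input; the real step would read a text file or RSS feed.
    context["text"] = "Luin ruskeihin kirjoihin"

def sanitize_input(context):
    # Placeholder sanitization; see the Sanitize Input step for details.
    context["words"] = context["text"].lower().split()

# The real tool would list all eight steps here.
STEPS = [read_file, sanitize_input]

if __name__ == "__main__":
    result = run_pipeline(STEPS, {})
    print(result["words"])
```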

It looks a lot like a "data pipeline", but I am trying to keep things simple in the beginning. I'll write a command line tool that does all this. Everything will run inside a Docker container (the main reason being that getting Voikko to run on a Mac is hard, so I need Docker for local testing; I have not decided yet how this tool will run in the cloud). I will use MongoDB as the database for this phase, for the simple reason that I've been using it quite a bit lately. Additionally, even though the editor will eventually have a relational database, at this point I am not quite sure about the final schema.

Data flow in detail

Populate Database from Backup

As a first step the database is regenerated from the backup. The command line tool will have a configuration option to select the location of the database backup. The initial database contains words extracted from Finnish books in the Gutenberg project. I had created scripts for that earlier, and I am now reusing and adapting those for this purpose.
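Assuming the backup is a simple JSON-lines file (one word record per line, which is my assumption here, not a fixed format), restoring it could look roughly like this; with pymongo the resulting list could be passed straight to an insert call:

```python
import json

def load_backup(path):
    """Read a JSON-lines backup file into a list of word records.

    Assumes one JSON document per line. With pymongo the result could be
    handed to collection.insert_many(records) to repopulate the database.
    """
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records
```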

Read a File

The second step is to ingest a file of Finnish text. Eventually I plan to support reading RSS feeds as well. Before the editor exists I plan to have a way to add words, and I thought it might be easiest to provide an RSS feed of those myself.

Sanitize Input

This step is easy: split the text into words, remove punctuation, and make everything lower case.
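A minimal sketch of that sanitization in Python; `\w+` matches Unicode word characters by default in Python 3, so Finnish letters like ä and ö survive:

```python
import re

def sanitize(text):
    """Split text into words, strip punctuation, and lowercase everything.

    re.findall(r"\\w+", ...) matches Unicode word characters, so letters
    like ä and ö are kept intact.
    """
    return [w.lower() for w in re.findall(r"\w+", text)]
```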

Run Voikko

This step is also pretty straightforward. Running Voikko produces output like this (example input "ruskeihin", the plural illative case of "ruskea", "brown"):

{
"BOOKWORD": "ruskeihin",
"BASEFORM": "ruskea",
"CLASS": "nimisana_laatusana",
"COMPARISON": "positive",
"FSTOUTPUT": "[Lnl][Xp]ruskea[X]ruske[Sill][Nm]ihin",
"NUMBER": "plural",
"SIJAMUOTO": "sisatulento",
"STRUCTURE": "=ppppppppp",
"WORDBASES": "+ruskea(ruskea)"
}

The fields important for our use case are "SIJAMUOTO" (case) and "NUMBER". We store all the additional information as well for possible future use.
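Picking those fields out of one Voikko analysis could look like the sketch below. It operates on a plain dict shaped like the example above (the libvoikko Python bindings return dict-like analysis objects, though the exact wrapper is an assumption here, and the output field names are my own):

```python
def extract_form_info(analysis):
    """Pick the fields we care about out of one Voikko analysis dict.

    SIJAMUOTO (case) and NUMBER drive the word-form logic; everything
    else is kept under "extra" for possible future use.
    """
    return {
        "case": analysis.get("SIJAMUOTO"),
        "number": analysis.get("NUMBER"),
        "baseform": analysis.get("BASEFORM"),
        "extra": {k: v for k, v in analysis.items()
                  if k not in ("SIJAMUOTO", "NUMBER", "BASEFORM")},
    }
```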

In this phase we also add information from the Kotus word list (Kaino - Kotimaisten kielten keskuksen nykysuomen sanalista). This list contains words along with their declension type (there are about 50 different types for nouns) and their consonant gradation type (e.g. in the Finnish word "taive" the letter "v" turns into "p": the plural genitive form of "taive" is "taipeiden"). At the end of this phase, for all words that we can find in the Kotus word list, we store the following type of information:

{
"av": "_",
"tn": 15,
"word": "ruskea",
"BOOKWORD": "ruskeihin",
"BASEFORM": "ruskea",
"CLASS": "nimisana_laatusana",
"COMPARISON": "positive",
"FSTOUTPUT": "[Lnl][Xp]ruskea[X]ruske[Sill][Nm]ihin",
"NUMBER": "plural",
"SIJAMUOTO": "sisatulento",
"STRUCTURE": "=ppppppppp",
"WORDBASES": "+ruskea(ruskea)"
}

In addition to the information we got from Voikko, we add the consonant gradation type in the field "av" and the declension type in the field "tn". While for the moment we are only interested in noun forms (tn values from 1 to 52), we store everything we encounter just in case.
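The merge itself is a small step: look the base form up in an index built from the Kotus list and copy "tn" and "av" onto the Voikko record. The index shape here is my assumption:

```python
def merge_kotus_info(voikko_record, kotus_index):
    """Attach Kotus declension type (tn) and consonant gradation type (av)
    to a Voikko record, keyed by the base form.

    kotus_index maps a base form to its Kotus entry, e.g.
    {"ruskea": {"tn": 15, "av": "_"}}.
    """
    entry = kotus_index.get(voikko_record.get("BASEFORM"))
    if entry is None:
        return voikko_record  # not in the Kotus list; store as-is
    merged = dict(voikko_record)
    merged["word"] = voikko_record["BASEFORM"]
    merged["tn"] = entry["tn"]
    merged["av"] = entry["av"]
    return merged
```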

Store to Database

This step is pretty straightforward: we insert all the records that we generated in the previous step. We might keep track of the new words for the metrics step, but that information might also be available from the database.

Generate Metrics

I have not fully figured out this step yet, but I would at least want to get the number of new forms we have seen during this run.
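As a first cut, counting new forms could be a set difference between the forms seen in this run and the forms already in the database. Identifying a form by the (baseform, case, number) triple is my assumption, not a fixed schema:

```python
def count_new_forms(before_keys, records):
    """Count distinct forms in this run that were not in the database before.

    A form is identified by the (baseform, case, number) triple; this key
    is an assumption, not a final schema.
    """
    seen = {(r["BASEFORM"], r["SIJAMUOTO"], r["NUMBER"]) for r in records}
    return len(seen - set(before_keys))
```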

Generate and Send Report

The report is generated based on the contents of the database. At this step we want to know which missing forms are most urgently needed, so they can be added through the editor or the RSS feed.
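One simple way to find missing forms is to enumerate the expected (case, number) grid for a base form and subtract what is already stored. The case list below is deliberately partial (just four of Voikko's Finnish SIJAMUOTO names); the real report would cover all cases:

```python
# A deliberately partial list of Voikko case names (SIJAMUOTO values);
# the real report would enumerate every case.
CASES = ["nimento", "omanto", "osanto", "sisatulento"]
NUMBERS = ["singular", "plural"]

def missing_forms(baseform, records):
    """List the (case, number) pairs not yet seen for a given base form."""
    have = {(r["SIJAMUOTO"], r["NUMBER"])
            for r in records if r["BASEFORM"] == baseform}
    return [(c, n) for c in CASES for n in NUMBERS if (c, n) not in have]
```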

Backup Database

To make it possible to build up the database from scratch, we back up the database contents at this step.

Next Step

Now that we have a design in place and some data, the next step is to explore the data we have collected from the Gutenberg books. After that we will figure out the metrics and reporting steps, and then we are ready to write the final script. The final step is to get that script running in a GCP environment at regular intervals.