March Project End Report
This month I changed my initial plan and continued my February project, the Finnish Word Form Collector, adding several new components to the system. As a reminder, my goal is to generate a big enough dataset to test a library that generates Finnish noun forms (e.g. if you asked for the plural genitive of 'koira' it would return 'koirien'). The bulk of this data has so far come from the Finnish books of Project Gutenberg.
This system now has five services running in GCP:
- RSS Reader & Word Form Analyzer, timokoola/wordformcollector: This is the main workhorse of the system. It runs once a day, goes through the RSS feed, splits the feed contents into words (at the moment there is no web scraping, as I mainly use my own feed, which is plain text), runs Voikko on every word and saves all new words into a jsonl file. These jsonl files are later used to generate test sets for the library and the base database for an editor/test UI. This service runs in Google Cloud Run. I prefer Google Cloud Functions when I can get away with it, but for Voikko I need a custom image, and that is only available in Cloud Run.
- timokoola/wordformcounter: Utility and GCP Cloud Function that counts the number of new words in each analysis file. This is a relatively simple counter. It triggers whenever a new jsonl file is uploaded, adds all new words to the unique words list and records the number of new words added from that file.
- timokoola/wordformstatistics: GCP Cloud Function that counts word forms per declension and gradation. This is also triggered by the upload of a jsonl file. The output of this function is a CSV file that records the number of new word forms per declension and gradation. This is later used to determine which kinds of word forms are most urgently needed for the test set.
- timokoola/suomipromptgenerator: Generates random prompts for Finnish noun forms that are missing from the database. The function is triggered once a day. It takes the CSV file, picks missing word forms from the data (at the moment very naively) and generates ten word form prompts that can be used as a basis for adding new word forms.
- Finally, timokoola/wordformemailreport: GCP Cloud Function that runs once a day and emails me a report of new words seen, words in total and the latest prompts. The report looks like this:
Report of new word forms found. Report generated at 2023-03-31 04:07:06.
Total: 383627
Total last 7 days: 298
Total last 24 hours: 60
Prompts for 2023-03-31T00:07:07.942637
nugaa Astevaihtelu: _ -> _ (plural nominatiivi nimentö mikä? kuka? monikossa -t)
tappo Astevaihtelu: p -> pp, pp -> p (singular elatiivi sisäeronto mistä? kenestä? -sta, -stä)
susijahti Astevaihtelu: t -> d, d -> t (plural elatiivi sisäeronto mistä? kenestä? -sta, -stä)
markka Astevaihtelu: k -> kk, kk -> k (singular genetiivi omanto minkä? kenen? -n, -en, -in, -den, -ten, -tten)
kookosvoi Astevaihtelu: _ -> _ (singular ablatiivi ulkoeronto miltä? keneltä) -lta, -ltä)
kajuutta Astevaihtelu: t -> tt, tt -> t (singular genetiivi omanto minkä? kenen? -n, -en, -in, -den, -ten, -tten)
askellin Astevaihtelu: lt -> ll, ll -> lt (plural adessiivi ulko-olento millä? kenellä? -lla, -llä)
taival Astevaihtelu: p -> v, v -> p (singular inessiivi sisäolento missä? kenessä? -ssa, -ssä)
säppi Astevaihtelu: p -> pp, pp -> p (singular partitiivi osanto, eronto mitä? ketä? -a, -ä, -ta, -tä)
syli Astevaihtelu: _ -> _ (plural instruktiivi keinonto miten (keinoa, välinettä) -n)
Sent by the New Word Forms Report Generator
This final function collects the existing data and sends it to my email inbox every morning.
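To make the counting step of timokoola/wordformcounter more concrete, here is a minimal sketch of how new words in one jsonl file could be counted against the unique words list. It is not the actual function: the `word` field name, the lowercasing and the in-memory set are assumptions made for illustration, and the real service reads and writes files in object storage.

```python
import json


def count_new_words(jsonl_text, known_words):
    """Add unseen words from one JSONL file to known_words and return how many were new."""
    new_count = 0
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        word = record["word"].lower()  # "word" is an assumed field name
        if word not in known_words:
            known_words.add(word)
            new_count += 1
    return new_count


# Three records, one already known and one differing only by case.
sample = "\n".join([
    json.dumps({"word": "koirien"}),
    json.dumps({"word": "kissan"}),
    json.dumps({"word": "Koirien"}),
])
known = {"kissan"}
added = count_new_words(sample, known)
print(added, sorted(known))
```

Running the same file through the function a second time returns zero, which is the property the daily totals in the report rely on.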
Additionally, I created an extremely low-effort hack to generate an RSS feed, timokoola/quickrssfrommd: a repository that contains scripts to generate an RSS feed from source markdown files. I use this to add new word forms to the system until I have an editor UI in use.
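For readers who want to try something similar, the sketch below shows one way to turn markdown text into a small RSS 2.0 feed with only the Python standard library. It is an illustration, not the code in timokoola/quickrssfrommd: the function names, the "first line is the title" convention and the example URL are all assumptions.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime


def post_from_markdown(text, published):
    """Treat the first line as the title (assumed convention) and the rest as plain-text body."""
    lines = text.strip().splitlines()
    title = lines[0].lstrip("# ").strip()
    body = "\n".join(lines[1:]).strip()
    return {"title": title, "body": body, "published": published}


def build_rss(title, link, posts):
    """Build an RSS 2.0 document as a string from a list of post dicts."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = title
    ET.SubElement(channel, "link").text = link
    ET.SubElement(channel, "description").text = title
    for post in posts:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = post["title"]
        ET.SubElement(item, "description").text = post["body"]
        # RSS expects RFC 822 style dates, which format_datetime produces.
        ET.SubElement(item, "pubDate").text = format_datetime(post["published"])
    return ET.tostring(rss, encoding="unicode")


markdown = "# Uusia sanoja\n\nkoirien kissoille taloissa"
post = post_from_markdown(markdown, datetime(2023, 3, 31, tzinfo=timezone.utc))
feed = build_rss("Word forms", "https://example.com/feed", [post])
print(feed)
```

Keeping the body as plain text matches the way the collector consumes the feed: it only needs words to split, so no markdown rendering is required.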
Conclusion of the March project
I am decently happy with the results and the functionality of the system. It now runs mostly on its own and gives me roughly 50-60 new word forms a day with zero effort (and when I add word forms manually to the RSS feed I get far more). I also got pretty comfortable with Google Cloud Functions and Cloud Run. The monthly cost of the system is roughly 30 euro cents, and most of that is object storage cost. I now have all the pieces in place to start the editor project later on.
April Project
The April project theme is "templates" and "documentation". My first goal is to create a template for a command line tool. I often need to create a command line tool, and I would be much faster and have more of them if I had a pre-thought structure to follow. I have also decided to use Go for my command line tools in the future. While my brain thinks in Python, the "single deployable binary" is extremely convenient. I have a couple of ideas, of which probably two will get implemented in April.
I will also document and formalize a template for Google Cloud Function projects (at least for Python, with Go as a stretch goal). timokoola/wordformemailreport is a good basis for a more generic template.
And finally, I want to document the architecture of the word form collector in a single post I can use as a landing page in the future.